1 00:00:00,300 --> 00:00:02,600 All right, so I think I've explained enough. 2 00:00:02,600 --> 00:00:06,400 Now we're relieved that at least it's 100% clear for everyone. 3 00:00:06,600 --> 00:00:08,066 And so there you go, my friends. 4 00:00:08,066 --> 00:00:12,066 Let's implement one of the last tools of this data preprocessing toolkit, 5 00:00:12,233 --> 00:00:15,400 which is indeed the split of the data 6 00:00:15,400 --> 00:00:18,400 set into the training set and the test set. 7 00:00:18,466 --> 00:00:20,500 All right. So how are we going to do this? 8 00:00:20,500 --> 00:00:24,700 Well, we're going to do it with a function, a function by scikit-learn. 9 00:00:24,700 --> 00:00:28,400 You know, the most popular and useful data science library. 10 00:00:28,633 --> 00:00:33,000 Because once again this library contains a module that is called model_selection, 11 00:00:33,200 --> 00:00:37,100 which itself contains a function called train_test_split. 12 00:00:37,333 --> 00:00:39,633 And this function will do exactly what we want, 13 00:00:39,633 --> 00:00:43,633 which is to create four separate sets, actually not two, but four, 14 00:00:43,633 --> 00:00:45,300 because we will actually create 15 00:00:45,300 --> 00:00:48,600 a pair of matrix of features and dependent variable for the training set, 16 00:00:48,766 --> 00:00:52,533 and another pair of matrix of features and dependent variable for the test set. 17 00:00:52,900 --> 00:00:53,233 All right. 18 00:00:53,233 --> 00:00:56,400 So we're basically going to get four sets: X_train, 19 00:00:56,400 --> 00:00:57,166 which is the matrix 20 00:00:57,166 --> 00:00:58,033 of features of the training 21 00:00:58,033 --> 00:01:02,266 set, X_test, which is the matrix of features of the test set, y_train, 22 00:01:02,266 --> 00:01:04,366 which is the dependent variable of the training set, 23 00:01:04,366 --> 00:01:07,466 and y_test, which is the dependent variable of the test set. 24 00:01:07,700 --> 00:01:09,166 That's exactly what we want.
25 00:01:09,166 --> 00:01:10,933 And now why do we want this? 26 00:01:10,933 --> 00:01:12,000 Well, it's not us. 27 00:01:12,000 --> 00:01:13,500 It's actually the future 28 00:01:13,500 --> 00:01:16,900 machine learning models that we will build in the next part, 29 00:01:17,100 --> 00:01:22,400 which will all be expecting this format as inputs, 30 00:01:22,666 --> 00:01:25,633 you know, for the training, they will expect X_train and 31 00:01:25,633 --> 00:01:29,000 y_train as inputs in the method called the fit method. 32 00:01:29,233 --> 00:01:32,100 And for the predictions, also called inference, 33 00:01:32,100 --> 00:01:34,966 these models will predict on X_test. All right. 34 00:01:34,966 --> 00:01:36,500 So that's the reason. 35 00:01:36,500 --> 00:01:40,500 It is simply the format expected by the future machine learning models. 36 00:01:40,500 --> 00:01:42,833 And now let's get these four sets. 37 00:01:42,833 --> 00:01:46,800 So we're going to get them from scikit-learn, of course. 38 00:01:48,566 --> 00:01:49,600 There you go. 39 00:01:49,600 --> 00:01:53,866 From which we're going to get access to model_selection. 40 00:01:53,866 --> 00:01:55,933 I really like Google Colab. 41 00:01:55,933 --> 00:01:59,800 And then from which we're going to import that 42 00:02:00,300 --> 00:02:03,200 train_test_split function. 43 00:02:03,200 --> 00:02:04,033 Perfect. 44 00:02:04,033 --> 00:02:08,433 You see how we can be so efficient thanks to the assistance of Google Colab. 45 00:02:08,433 --> 00:02:11,133 I hope you really like it as well. 46 00:02:11,133 --> 00:02:14,133 All right, so now that we have this function, well, we're going to use it. 47 00:02:14,133 --> 00:02:18,500 And since we already know what this function will return, as I just explained, 48 00:02:18,600 --> 00:02:23,433 well, let's create these four variables returned by this train_test_split function.
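[Editor's sketch] The import described above, and the way the four returned sets feed a model's fit and predict methods, might look roughly like this. The toy arrays and the choice of LogisticRegression are assumptions made only for illustration; they are not the course's data set or the model used in the next part, and the split here uses the function's default sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # placeholder model (assumption)

# Hypothetical toy data: 10 observations, 2 numeric features, binary target.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# train_test_split returns the four sets in this order.
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Training uses the training pair; prediction (inference) uses X_test.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

The predictions in y_pred can then be compared against y_test, which the model never saw during training.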
49 00:02:23,700 --> 00:02:28,833 And as we said, they are first X_train, the matrix of features 50 00:02:28,833 --> 00:02:33,000 of the training set, therefore containing all the countries, 51 00:02:33,333 --> 00:02:36,966 one-hot encoded, ages and salaries of the training set. 52 00:02:37,200 --> 00:02:38,366 So X_train. 53 00:02:38,366 --> 00:02:43,200 Then X_test, the matrix of features of the test set. 54 00:02:43,566 --> 00:02:47,666 Then y_train, which is the dependent variable 55 00:02:47,666 --> 00:02:50,800 of the training set, meaning all the purchase decisions 56 00:02:50,866 --> 00:02:54,200 of the customers in the training set, y_train, 57 00:02:54,366 --> 00:02:57,900 and then y_test, which similarly contains 58 00:02:57,900 --> 00:03:01,400 all the purchase decisions of the customers in the test set. 59 00:03:01,566 --> 00:03:02,433 All right. 60 00:03:02,433 --> 00:03:06,966 So those are the four variables returned by this train_test_split function. 61 00:03:06,966 --> 00:03:08,500 And since it is the function 62 00:03:08,500 --> 00:03:11,766 that returns these variables, well, let's call that function right away. 63 00:03:12,066 --> 00:03:14,800 And let's add here an equals 64 00:03:14,800 --> 00:03:18,233 and train_test_split and then some parentheses. 65 00:03:18,433 --> 00:03:22,966 And now the question is what do we have to input inside this function? 66 00:03:23,633 --> 00:03:24,133 All right. 67 00:03:24,133 --> 00:03:28,266 So actually there are some parameters that we can guess, right? 68 00:03:28,633 --> 00:03:32,200 Because indeed this train_test_split is supposed to split something. 69 00:03:32,200 --> 00:03:34,833 So one of the inputs will be that something 70 00:03:34,833 --> 00:03:38,400 which we're about to split and which is of course our data set. 71 00:03:38,633 --> 00:03:42,233 However, of course, this function does not expect the data set as a whole. 72 00:03:42,400 --> 00:03:43,300 It expects.
73 00:03:43,300 --> 00:03:43,866 Well, the 74 00:03:43,866 --> 00:03:48,200 combination of the matrix of features X and the dependent variable vector y. 75 00:03:48,200 --> 00:03:51,100 And those are the first two inputs of this function. 76 00:03:51,100 --> 00:03:53,533 So let's input them here: X 77 00:03:53,533 --> 00:03:57,666 first, the matrix of features, and y, the dependent variable vector. 78 00:03:58,600 --> 00:04:01,000 Great, y. Yes. 79 00:04:01,000 --> 00:04:03,300 Then comma, and then the next arguments. 80 00:04:03,300 --> 00:04:07,533 So we still have to input two more arguments, which are going to be 81 00:04:07,866 --> 00:04:10,866 first the split size. 82 00:04:10,933 --> 00:04:15,533 You know, because we're not going to split this data set into a training set 83 00:04:15,533 --> 00:04:19,566 and a test set of the same size; actually we need a lot of observations 84 00:04:19,566 --> 00:04:22,000 in the training set and a few in the test set. 85 00:04:22,000 --> 00:04:23,500 But we need a lot of them in the training set. 86 00:04:23,500 --> 00:04:26,666 So that's to give the future machine learning model more chance 87 00:04:26,666 --> 00:04:30,000 to understand and learn the correlations in the data set. 88 00:04:30,300 --> 00:04:34,333 So let me just tell you the recommended size of the split. 89 00:04:34,533 --> 00:04:37,766 Well, I recommend having 80% of the observations 90 00:04:37,833 --> 00:04:40,833 in the training set and 20% in the test set. 91 00:04:41,333 --> 00:04:43,200 All right. This is a very good split. 92 00:04:43,200 --> 00:04:46,833 And therefore here we're going to input a new parameter, 93 00:04:46,833 --> 00:04:49,833 which is test_size. 94 00:04:49,833 --> 00:04:53,066 And we'll set that equal to 0.2. 95 00:04:53,066 --> 00:04:57,000 Right, 20% of the observations will go into the test set.
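[Editor's sketch] As a rough illustration of the test_size argument just described, the snippet below splits a hypothetical ten-row data set 80/20. The arrays are placeholders invented for this example, not the course's customer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 observations, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 sends 20% of the observations to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape)  # (8, 2)
print(X_test.shape)   # (2, 2)
```

Which two rows land in the test set changes from run to run here, because no seed is fixed; only the sizes are guaranteed.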
96 00:04:57,266 --> 00:05:01,033 And therefore here, since we have ten observations in this data set, 97 00:05:01,200 --> 00:05:05,000 that means that eight observations will go into the training set, meaning 98 00:05:05,033 --> 00:05:07,133 eight customers will go into the training set 99 00:05:07,133 --> 00:05:08,566 and two in the test set. 100 00:05:08,566 --> 00:05:10,733 And these are not necessarily the last two. 101 00:05:10,733 --> 00:05:12,633 You know, they will be taken randomly, 102 00:05:12,633 --> 00:05:15,900 but eight of them will go into the training set and two in the test set. 103 00:05:16,033 --> 00:05:16,900 All right. 104 00:05:16,900 --> 00:05:22,000 And now we'll add one final argument just for teaching purposes so that we can 105 00:05:22,000 --> 00:05:26,533 have the same results displayed in here, you know, in the notebook. 106 00:05:26,533 --> 00:05:28,800 Because then I'm going to run some prints 107 00:05:28,800 --> 00:05:32,533 to show you these four elements returned by this train_test_split function. 108 00:05:32,533 --> 00:05:34,300 You know, the training set and the test set. 109 00:05:34,300 --> 00:05:37,533 And since there are some random factors that are going to happen 110 00:05:37,533 --> 00:05:40,200 during the split, right, because the observations 111 00:05:40,200 --> 00:05:43,200 will be randomly split into the training set and the test set, 112 00:05:43,466 --> 00:05:46,200 well, to make sure we have the same random factors, we'll 113 00:05:46,200 --> 00:05:49,233 just add here random_state 114 00:05:50,633 --> 00:05:51,766 equals one. Right. 115 00:05:51,766 --> 00:05:56,233 We are just fixing the seed here so that we'll get the same split 116 00:05:56,233 --> 00:05:59,233 and therefore the same training set and same test set.
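[Editor's sketch] To illustrate why fixing the seed gives the same split every time, the snippet below calls train_test_split twice with random_state=1 and compares the results. The ten-row arrays are hypothetical placeholders, not the course's data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 observations, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed.
first = train_test_split(X, y, test_size=0.2, random_state=1)
second = train_test_split(X, y, test_size=0.2, random_state=1)

# Each call returns [X_train, X_test, y_train, y_test]; compare them pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```

Without random_state, two calls would generally shuffle the rows differently, so printed results in a notebook would not match from one run to the next.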