1 00:00:00,166 --> 00:00:00,933 All right. 2 00:00:00,933 --> 00:00:02,133 And that's it. Right. 3 00:00:02,133 --> 00:00:05,900 This is the code to split the data set into the training set and the test set. 4 00:00:06,133 --> 00:00:09,666 Let me zoom out a bit so that you can see it. 5 00:00:09,866 --> 00:00:11,700 All right. So that's the full code. 6 00:00:11,700 --> 00:00:15,733 This will indeed return these four new sets, 7 00:00:15,733 --> 00:00:18,800 composed of the training set in X_train and y_train 8 00:00:18,900 --> 00:00:21,433 and the test set in X_test and y_test. 9 00:00:21,433 --> 00:00:22,766 Let me show you this right away. 10 00:00:22,766 --> 00:00:26,700 So we're going to add four new code cells here. 11 00:00:27,300 --> 00:00:27,733 Right. 12 00:00:27,733 --> 00:00:30,966 And we're going to print each of these created sets. 13 00:00:31,266 --> 00:00:34,266 So first we're going to print X_train. 14 00:00:35,033 --> 00:00:36,466 Let me copy this. 15 00:00:36,466 --> 00:00:38,833 Then we're going to print 16 00:00:39,833 --> 00:00:41,300 X_test. 17 00:00:41,300 --> 00:00:45,600 Then we're going to print y_train. 18 00:00:45,733 --> 00:00:50,133 And finally we're going to print y_test. 19 00:00:50,766 --> 00:00:51,433 Perfect. 20 00:00:51,433 --> 00:00:53,833 All right. So now let's execute everything. 21 00:00:53,833 --> 00:00:58,100 Starting with this cell here, splitting the data set into the training set and test set. 22 00:00:58,133 --> 00:01:00,766 Done. Perfect. It ran successfully. 23 00:01:00,766 --> 00:01:03,900 Now let's run the cell to print X_train. 24 00:01:04,166 --> 00:01:08,533 And as you can see, we now indeed have eight observations in this training set. 25 00:01:08,533 --> 00:01:11,733 Right: one, two, three, four, five, six, seven, eight, 26 00:01:11,966 --> 00:01:16,400 which correspond to eight customers taken randomly from this data set.
27 00:01:16,800 --> 00:01:22,066 And we clearly recognize the features here, with the first three columns being the 28 00:01:22,066 --> 00:01:28,033 one-hot encoded variables that encode the country categorical variable. 29 00:01:28,066 --> 00:01:30,533 We also call these dummy variables. 30 00:01:30,533 --> 00:01:33,300 Then we clearly have here the age as the second 31 00:01:33,300 --> 00:01:36,700 variable, as a second feature, you know, and then the salary. 32 00:01:36,733 --> 00:01:41,400 So we clearly have a great matrix of features for the training set. 33 00:01:42,000 --> 00:01:42,866 All right. Perfect. 34 00:01:42,866 --> 00:01:44,666 Now let's print X_test. 35 00:01:44,666 --> 00:01:48,766 So we'll get here two observations containing the same features 36 00:01:48,766 --> 00:01:49,700 here as here, right? 37 00:01:49,700 --> 00:01:51,833 This is still the matrix of features. 38 00:01:51,833 --> 00:01:54,900 So we have the dummy variables here in the first three columns. 39 00:01:55,133 --> 00:01:59,166 Then the ages and the salaries of our two customers 40 00:01:59,166 --> 00:02:02,166 taken randomly from the data set into this test set. 41 00:02:02,633 --> 00:02:03,866 Then y_train. 42 00:02:03,866 --> 00:02:08,433 So here we'll get eight purchase decisions, right, with the zeros 43 00:02:08,433 --> 00:02:11,800 and ones here that were encoded before with LabelEncoder. 44 00:02:12,300 --> 00:02:14,666 And of course make sure to understand this. 45 00:02:14,666 --> 00:02:19,566 These eight purchase decisions correspond of course to the same 46 00:02:19,566 --> 00:02:24,300 eight customers of this matrix of features X_train of the training set, right? 47 00:02:24,333 --> 00:02:27,333 These features correspond to these purchase decisions. 48 00:02:27,366 --> 00:02:29,633 These are the same customers here.
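The correspondence described above, where each row of the training features lines up with the same customer's purchase decision, can be illustrated with a minimal standard-library sketch. This is not the code shown in the video: `split_train_test`, its `test_fraction` and `seed` parameters, and the ten-customer data are all hypothetical, chosen only so that the split yields eight training rows and two test rows.

```python
import random


def split_train_test(X, y, test_fraction=0.2, seed=1):
    """Shuffle row indices once, then slice X and y with the same indices.

    Hypothetical helper for illustration; a library splitter may differ in details.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Using the same index lists for X and y keeps features and labels aligned.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up data: ten customers as [dummy, dummy, dummy, age, salary].
X = [[1.0, 0.0, 0.0, 30.0 + i, 50000.0 + 1000.0 * i] for i in range(10)]
y = [i % 2 for i in range(10)]  # made-up purchase decisions (0 or 1)

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test), len(y_train), len(y_test))  # 8 2 8 2
```

Because both the features and the labels are sliced with the same shuffled indices, row k of `X_train` and element k of `y_train` always describe the same customer.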
49 00:02:29,633 --> 00:02:33,500 And finally y_test, which will output two results, 50 00:02:33,633 --> 00:02:37,266 meaning two purchase decisions, right, zero and one, corresponding 51 00:02:37,266 --> 00:02:42,033 of course to the same customers as in this matrix of features of the test set. 52 00:02:42,766 --> 00:02:45,300 All right. So there you go. Congratulations. 53 00:02:45,300 --> 00:02:48,466 Now you have a new tool in your data preprocessing 54 00:02:48,466 --> 00:02:51,633 toolkit: splitting the data set into the training set and test set. 55 00:02:51,966 --> 00:02:53,200 Not only do you have this tool, 56 00:02:53,200 --> 00:02:57,400 but you also have the final answer to the ultimate question: 57 00:02:57,600 --> 00:03:01,333 Do we have to apply feature scaling before or after the split? 58 00:03:01,500 --> 00:03:05,466 And it's clearly after the split, to avoid information leakage, 59 00:03:05,700 --> 00:03:08,566 because the test set is simply supposed to be something 60 00:03:08,566 --> 00:03:13,366 new, something on which we evaluate our model, on new 61 00:03:13,366 --> 00:03:15,700 observations. All right. Great. 62 00:03:15,700 --> 00:03:19,866 So I'm glad that you are really making progress here with new tools 63 00:03:19,866 --> 00:03:23,533 and new knowledge that actually reduce any kind of confusion. 64 00:03:23,833 --> 00:03:27,333 So now we're going to move on to our final tool, right, 65 00:03:27,333 --> 00:03:31,233 feature scaling, which, as you now know, must be applied after the split. 66 00:03:31,500 --> 00:03:32,766 And you will see what 67 00:03:32,766 --> 00:03:37,433 we'll get with some other prints after we deploy this tool on our data set. 68 00:03:37,500 --> 00:03:39,733 So I can't wait to show this to you. 69 00:03:39,733 --> 00:03:43,400 And I can't wait to give you this final tool in your toolkit, 70 00:03:43,666 --> 00:03:45,000 because then what does it mean?
71 00:03:45,000 --> 00:03:47,700 That means that we will be 100% ready 72 00:03:47,700 --> 00:03:51,266 to start building our future machine learning models.
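To make the "scale after the split" point concrete, here is a minimal standard-library sketch of standardization whose parameters are computed from the training set only and then reused on the test set. It is an illustration under assumptions, not the code from the course: `fit_standardizer`, `standardize`, and the salary values are hypothetical, and the population standard deviation is used here as one possible convention.

```python
from statistics import mean, pstdev


def fit_standardizer(train_column):
    """Compute mean and standard deviation from TRAINING values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)  # population standard deviation (an assumed convention)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant column
    return mu, sigma


def standardize(column, mu, sigma):
    """Apply already-fitted parameters to any column (train or test)."""
    return [(value - mu) / sigma for value in column]


# Made-up salaries: eight training customers, two test customers.
train_salaries = [48000.0, 54000.0, 61000.0, 52000.0, 79000.0, 83000.0, 67000.0, 58000.0]
test_salaries = [72000.0, 50000.0]

# Fit on the training set only, so no test-set information leaks into the scaler.
mu, sigma = fit_standardizer(train_salaries)

train_scaled = standardize(train_salaries, mu, sigma)
test_scaled = standardize(test_salaries, mu, sigma)  # reuses the training mu and sigma

print(train_scaled)
print(test_scaled)
```

Scaling before the split would compute the mean and standard deviation over all ten customers, so the two test observations would influence the values the model is trained on. Fitting on the training set and only transforming the test set keeps the test observations genuinely new.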