1 00:00:00,766 --> 00:00:03,700 Okay, so now we imported the caTools library. 2 00:00:03,700 --> 00:00:05,500 That is going to make a good split 3 00:00:05,500 --> 00:00:08,100 of the data set into the training set and the test set. 4 00:00:08,100 --> 00:00:10,933 So now let's make the split. 5 00:00:10,933 --> 00:00:14,666 So remember in Python, for those of you who followed the Python tutorial, 6 00:00:14,800 --> 00:00:19,433 we used random state equals zero so that we get the same results. 7 00:00:19,966 --> 00:00:21,700 Well, here it's going to be the same. 8 00:00:21,700 --> 00:00:24,666 We're going to set a seed to have the same results. 9 00:00:24,666 --> 00:00:26,900 Only we're going to do it now. We're not going to wait 10 00:00:26,900 --> 00:00:30,100 to be in the function in the library that makes the split. 11 00:00:30,366 --> 00:00:32,666 So to do it we have to set the same seed. 12 00:00:32,666 --> 00:00:35,900 And so we can type set dot seed. 13 00:00:36,200 --> 00:00:39,633 And in parentheses we can choose any number we want. 14 00:00:39,633 --> 00:00:41,200 It can be this number. 15 00:00:41,200 --> 00:00:42,900 You know, that's the seed. 16 00:00:42,900 --> 00:00:45,700 If we both choose this number, we'll have the same results. 17 00:00:45,700 --> 00:00:47,366 But let's choose a simpler number. 18 00:00:47,366 --> 00:00:49,833 Let's just take, for example, one, two, three. 19 00:00:49,833 --> 00:00:52,100 And now let's make the split. 20 00:00:52,100 --> 00:00:55,100 It's not as simple as in Python, where we make it in one line. 21 00:00:55,433 --> 00:00:59,633 Here we're going to have to prepare the method that we are going to call split. 22 00:01:00,100 --> 00:01:01,600 That is the method that is going to make 23 00:01:01,600 --> 00:01:05,500 the split of your data set into the training set and the test set. Equals, 24 00:01:06,900 --> 00:01:09,100 and then sample 25 00:01:09,100 --> 00:01:12,100 dot split. 26 00:01:12,333 --> 00:01:13,100 Okay.
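The seed step described above can be sketched in base R as follows. This is only an illustration of why a fixed seed gives repeatable results; the variable names `first_draw` and `second_draw` are hypothetical and do not come from the tutorial.

```r
# Setting the same seed before each random draw makes the draws repeat.
set.seed(123)
first_draw = sample(1:10)   # a random permutation of 1..10

set.seed(123)               # reset to the same seed
second_draw = sample(1:10)  # same permutation as before

# Both draws are identical because they started from the same seed.
print(identical(first_draw, second_draw))
```

Two people who run the same code with the same seed on the same R version should therefore see the same random split.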
27 00:01:13,100 --> 00:01:16,300 Now if you want, we can press F1 to see what we have to input. 28 00:01:16,433 --> 00:01:16,766 Okay. 29 00:01:16,766 --> 00:01:20,300 So it's simply "Split Data into Test and Train Set". 30 00:01:20,733 --> 00:01:21,633 Let's see the arguments. 31 00:01:21,633 --> 00:01:23,366 The first argument is Y. 32 00:01:23,366 --> 00:01:25,966 So that's not the same argument that we had to put for Python. 33 00:01:25,966 --> 00:01:26,400 In Python 34 00:01:26,400 --> 00:01:31,100 we had to put both the matrix of features X and the dependent variable vector y. 35 00:01:31,200 --> 00:01:35,300 Here we only have to put Y. Only we're going to take Y the following way. 36 00:01:35,300 --> 00:01:39,266 We're going to type data set dollar sign Purchased. 37 00:01:39,500 --> 00:01:42,366 Because your dependent variable is Purchased, okay. 38 00:01:42,366 --> 00:01:44,366 So that's okay for the first parameter. 39 00:01:44,366 --> 00:01:46,033 And then what is the second parameter? 40 00:01:46,033 --> 00:01:48,366 It's SplitRatio, okay. 41 00:01:48,366 --> 00:01:51,600 So SplitRatio, let's write SplitRatio here. 42 00:01:51,933 --> 00:01:54,600 And so SplitRatio is just the percentage 43 00:01:54,600 --> 00:01:57,900 of the observations that you want to put in your training set. 44 00:01:58,366 --> 00:02:00,000 So let's be careful. 45 00:02:00,000 --> 00:02:03,000 In Python we put the percentage for the test set. 46 00:02:03,200 --> 00:02:05,500 And here we have to put it for the training set. 47 00:02:05,500 --> 00:02:09,100 So remember, in Python we chose 20% for the test set. 48 00:02:09,233 --> 00:02:14,000 So here, logically, for the training set we will choose 0.8, okay. 49 00:02:14,000 --> 00:02:15,966 So what will split return? 50 00:02:15,966 --> 00:02:20,366 It will return true or false for each of your observations. 51 00:02:20,533 --> 00:02:23,533 That means that each observation will have either true or false.
52 00:02:23,700 --> 00:02:24,900 And this is going to be true 53 00:02:24,900 --> 00:02:27,900 if this observation was chosen to go to the training set, 54 00:02:28,066 --> 00:02:31,233 and false if the observation was chosen to go to the test set. 55 00:02:31,666 --> 00:02:33,466 So let's have a look. Let's select this, 56 00:02:34,833 --> 00:02:37,833 Command or Control plus Enter to execute. 57 00:02:38,100 --> 00:02:40,500 And here it is. We have the split here. 58 00:02:40,500 --> 00:02:43,500 So now let's go to the console and write split. 59 00:02:43,633 --> 00:02:45,833 All right. Enter. 60 00:02:45,833 --> 00:02:46,533 All right. 61 00:02:46,533 --> 00:02:48,866 You see that you have ten values. 62 00:02:48,866 --> 00:02:51,866 And so true means that the observation goes to the training set. 63 00:02:51,900 --> 00:02:55,500 And false means that the observation goes to the test set, okay. 64 00:02:56,200 --> 00:02:59,566 So now what we have to do is to create the training set 65 00:02:59,566 --> 00:03:01,000 and the test set separately. 66 00:03:01,000 --> 00:03:01,833 So we're going to do this. 67 00:03:01,833 --> 00:03:05,633 We'll type training set, which is going to be the name of our training set. 68 00:03:05,633 --> 00:03:07,133 Actually, let's make it simple. 69 00:03:07,133 --> 00:03:10,700 So training set equals subset, parenthesis. 70 00:03:10,966 --> 00:03:13,900 And here as the first argument we put data set, 71 00:03:13,900 --> 00:03:16,800 because the training set is a subset of the data set. 72 00:03:17,766 --> 00:03:19,033 And here we will specify 73 00:03:19,033 --> 00:03:22,233 that we want split equals equals true. 74 00:03:22,900 --> 00:03:24,500 And that's it, okay. 75 00:03:24,500 --> 00:03:25,600 So that's it for the training set. 76 00:03:25,600 --> 00:03:28,600 Now let's copy this line. 77 00:03:29,400 --> 00:03:30,600 Paste it here. 78 00:03:30,600 --> 00:03:34,866 And here we're going to change training set to test set.
79 00:03:35,400 --> 00:03:38,466 And of course we're going to change split equals equals true 80 00:03:39,166 --> 00:03:41,566 to false. 81 00:03:41,566 --> 00:03:45,333 Because the test set is made of the observations for which split equals false. 82 00:03:45,900 --> 00:03:46,866 And now we're ready. 83 00:03:46,866 --> 00:03:48,866 We are ready to make the split 84 00:03:48,866 --> 00:03:50,933 of the data set into the training set and the test set. 85 00:03:50,933 --> 00:03:54,400 So let's execute these lines: Command or Control 86 00:03:54,400 --> 00:03:56,400 plus Enter to execute. 87 00:03:56,400 --> 00:03:57,600 And here we go. 88 00:03:57,600 --> 00:03:59,633 The test set and the training set are created. 89 00:03:59,633 --> 00:04:01,000 Let's look at them. 90 00:04:01,000 --> 00:04:03,700 So let's click on the training set here. 91 00:04:03,700 --> 00:04:05,600 And the test set. 92 00:04:05,600 --> 00:04:08,600 Let's move that here and that here, okay. 93 00:04:08,833 --> 00:04:10,600 Let's look at the training set first, okay. 94 00:04:10,600 --> 00:04:13,600 We see that we have eight observations, okay, good. 95 00:04:13,633 --> 00:04:15,800 And now let's look at the test set. 96 00:04:15,800 --> 00:04:20,333 Now we can clearly see that we have two observations, with six and nine. Perfect. 97 00:04:22,100 --> 00:04:22,566 Okay. 98 00:04:22,566 --> 00:04:24,800 So that's it for this tutorial. 99 00:04:24,800 --> 00:04:29,400 You now know how to split your data set into a training set and a test set. 100 00:04:29,800 --> 00:04:32,233 This is a must-do in any machine learning model. 101 00:04:32,233 --> 00:04:33,433 You have to test 102 00:04:33,433 --> 00:04:37,133 the performance of your machine learning model on a separate test set. 103 00:04:37,600 --> 00:04:38,766 So congratulations! 104 00:04:38,766 --> 00:04:39,900 Now you are almost ready 105 00:04:39,900 --> 00:04:44,366 to begin the journey of making exciting machine learning models.
106 00:04:44,366 --> 00:04:47,366 We just have one thing left to do: feature scaling. 107 00:04:47,400 --> 00:04:50,400 You'll understand in the next tutorial why it's so important to do this. 108 00:04:50,400 --> 00:04:51,966 So I look forward to seeing you there. 109 00:04:51,966 --> 00:04:54,966 And until then, enjoy machine learning.
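For reference, the steps spoken in this tutorial can be put together roughly as follows. This is a minimal sketch, not the instructor's exact script: it assumes the caTools package is installed, and the ten-row `dataset` below is hypothetical stand-in data. Only the `Purchased` column name, the seed 123, and the 0.8 ratio come from the transcript; the names `dataset`, `training_set`, and `test_set` are assumed spellings of the spoken names.

```r
library(caTools)  # provides sample.split

# Hypothetical stand-in data; only the Purchased column name is from the tutorial.
dataset = data.frame(
  Age = c(44, 27, 30, 38, 40, 35, 52, 48, 50, 37),
  Purchased = factor(c("No", "Yes", "No", "No", "Yes",
                       "Yes", "No", "Yes", "No", "Yes"))
)

# Fix the seed first so the random split is repeatable.
set.seed(123)

# SplitRatio is the share of observations that goes to the TRAINING set.
# The result is a logical vector with one TRUE/FALSE per observation.
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
print(split)

# TRUE rows form the training set, FALSE rows form the test set.
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

print(nrow(training_set))
print(nrow(test_set))
```

Note the contrast with the Python tutorial mentioned above: there the ratio was given for the test set (20%), whereas here `SplitRatio` describes the training set (0.8).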