1 00:00:00,133 --> 00:00:02,300 Hello and welcome to this tutorial. 2 00:00:02,300 --> 00:00:05,300 Okay, so we are almost done with Part 1, Data Preprocessing. 3 00:00:05,300 --> 00:00:08,433 I can't wait to start making the machine learning models. 4 00:00:08,733 --> 00:00:12,500 We just need three more tutorials to make our data set perfectly prepared 5 00:00:12,500 --> 00:00:15,500 before starting the models, and then we'll be good to go. 6 00:00:15,600 --> 00:00:19,333 Okay, so today we are going to talk about the fact that we need to split 7 00:00:19,333 --> 00:00:22,366 the data set into a training set and a test set, 8 00:00:22,800 --> 00:00:25,533 and I'll explain why we need to do that right now. 9 00:00:25,533 --> 00:00:28,233 So let's go to Google Sheets. 10 00:00:28,233 --> 00:00:30,333 Well, I'm on Google Sheets, but you can be on Excel. 11 00:00:30,333 --> 00:00:33,000 You can be on whatever tool you want. 12 00:00:33,000 --> 00:00:35,700 But here is the data set open with Google Sheets. 13 00:00:35,700 --> 00:00:39,733 So we have our ten observations and this is the data set. 14 00:00:39,733 --> 00:00:41,700 This is the whole data set. 15 00:00:41,700 --> 00:00:44,700 And what we should do for any machine learning model 16 00:00:44,733 --> 00:00:48,500 is split this data set into two separate sets, 17 00:00:48,633 --> 00:00:51,633 which are going to be the training set and the test set. 18 00:00:51,900 --> 00:00:53,800 Now why do we need to do this? 19 00:00:53,800 --> 00:00:58,033 Well, when you take a step back and focus on the name "machine learning" itself, 20 00:00:58,433 --> 00:01:02,400 you understand that this is about a machine that is going to learn something. 21 00:01:02,800 --> 00:01:04,633 Well, here it's your algorithm. 22 00:01:04,633 --> 00:01:07,266 It's your model that is going to learn from your data 23 00:01:07,266 --> 00:01:10,300 to make predictions or achieve other machine learning goals. 
24 00:01:10,766 --> 00:01:13,933 And so your machine learning model is going to learn to do something 25 00:01:14,200 --> 00:01:17,133 on your data set by understanding some correlations 26 00:01:17,133 --> 00:01:18,900 that exist in your data set. 27 00:01:18,900 --> 00:01:22,200 And imagine your machine learning model is learning too much from the data 28 00:01:22,200 --> 00:01:26,433 set, like it's learning the correlations too closely. Then I'm not sure its 29 00:01:26,433 --> 00:01:30,866 performance would be great on a new set with slightly different correlations. 30 00:01:30,866 --> 00:01:34,400 Or, you know, it's like a student who is learning his lesson by heart, 31 00:01:34,400 --> 00:01:36,666 and then when he takes the exam, he might be in trouble 32 00:01:36,666 --> 00:01:39,433 because he learned his lesson too much by heart, 33 00:01:39,433 --> 00:01:42,066 and he does not manage to make the connection between what 34 00:01:42,066 --> 00:01:43,833 he learned and the exam. 35 00:01:43,833 --> 00:01:45,566 And it's the same for machine learning. 36 00:01:45,566 --> 00:01:48,633 We are going to build our machine learning model on a data set, 37 00:01:48,966 --> 00:01:52,500 but then we have to test it on a new set, which is going to be slightly different 38 00:01:52,500 --> 00:01:55,500 from the data set on which we built the machine learning model. 39 00:01:55,700 --> 00:01:59,966 So we have to make two different sets: a training set 40 00:01:59,966 --> 00:02:03,066 on which we build the machine learning model, and a test set on which 41 00:02:03,066 --> 00:02:06,500 we test the performance of this machine learning model. 
42 00:02:06,800 --> 00:02:08,133 And the performance on the test 43 00:02:08,133 --> 00:02:11,733 set shouldn't be that different from the performance on the training set, 44 00:02:12,100 --> 00:02:15,400 because this would mean that the machine learning model understood 45 00:02:15,400 --> 00:02:18,400 the correlations well and didn't learn them by heart, 46 00:02:18,400 --> 00:02:22,233 so that it can adapt to new sets and new situations. 47 00:02:22,800 --> 00:02:26,900 Okay, so that's the idea behind splitting the data set into a training set 48 00:02:26,900 --> 00:02:28,200 and a test set. 49 00:02:28,200 --> 00:02:29,800 And now let's do it in R. 50 00:02:29,800 --> 00:02:31,000 So here we are in R. 51 00:02:31,000 --> 00:02:34,766 The section is "Splitting the data set into the training set and test set". 52 00:02:35,333 --> 00:02:37,000 And let's start coding the thing. 53 00:02:37,000 --> 00:02:38,966 Now that you understand the difference 54 00:02:38,966 --> 00:02:42,100 between the training set and the test set well, we're going to do it a little faster. 55 00:02:42,700 --> 00:02:46,600 So, new thing here: we have to import a library. 56 00:02:46,900 --> 00:02:50,000 We're going to import the library that is going to make a good split 57 00:02:50,300 --> 00:02:53,233 of the data set into the training set and the test set. 58 00:02:53,233 --> 00:02:55,966 And this library is called caTools. 59 00:02:55,966 --> 00:02:57,600 So let's import it. 60 00:02:57,600 --> 00:03:00,933 Here I'm going to the Packages tab to see the list of the libraries. 61 00:03:01,200 --> 00:03:04,266 Here you can see that the caTools library is here. 62 00:03:04,333 --> 00:03:08,300 I have it in the list of packages because I installed it before, 63 00:03:08,666 --> 00:03:12,433 but it's probably not the case for you if you're studying R for the first time. 64 00:03:12,700 --> 00:03:16,733 So we're going to install it. To install a library in R, it's very simple. 
65 00:03:16,766 --> 00:03:20,633 You have to type install.packages 66 00:03:21,233 --> 00:03:23,633 and then, in parentheses, quotes. 67 00:03:23,633 --> 00:03:26,100 And then the name of the library in quotes. 68 00:03:26,100 --> 00:03:30,300 So here we type caTools and we're ready to go. 69 00:03:31,200 --> 00:03:34,833 So then you have to select this line and press Command or Control 70 00:03:34,833 --> 00:03:36,300 plus Enter to execute. 71 00:03:37,566 --> 00:03:38,366 Here it is. 72 00:03:38,366 --> 00:03:42,100 And right now it's installing the package caTools. 73 00:03:43,300 --> 00:03:44,866 Okay, perfect. 74 00:03:44,866 --> 00:03:49,933 So then you have to put it as a comment because you won't need to install it again. 75 00:03:50,633 --> 00:03:54,066 But then, as you can see, we just installed the caTools library. 76 00:03:54,433 --> 00:03:56,000 But it's not activated yet. 77 00:03:56,000 --> 00:03:58,566 You know, it's not selected and we have to select it. 78 00:03:58,566 --> 00:04:01,400 So to select a library you have two choices. 79 00:04:01,400 --> 00:04:03,700 Either you click on the box here. 80 00:04:03,700 --> 00:04:06,700 As you can see this generates two scripts here. 81 00:04:07,300 --> 00:04:09,100 Or, you know, if you have some scripts 82 00:04:09,100 --> 00:04:12,433 that you want to automate and execute once in a while, 83 00:04:12,766 --> 00:04:16,366 well, you can specify in your script that you want to include the library. 84 00:04:16,733 --> 00:04:19,733 And to do this you just need to type library, 85 00:04:19,800 --> 00:04:23,366 parentheses, and caTools, the name of your library. 86 00:04:23,366 --> 00:04:24,300 Not in quotes, 87 00:04:24,300 --> 00:04:28,566 actually, this time: library(caTools). Okay, perfect. 88 00:04:28,566 --> 00:04:29,733 And now you're ready to go. 89 00:04:29,733 --> 00:04:32,100 We can check that caTools is not selected right now. 
90 00:04:32,100 --> 00:04:37,066 And if I select this and press Command or Control plus Enter to execute, now it is. 91 00:04:37,400 --> 00:04:38,166 Now it's selected.
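The install-then-select steps described above can be sketched in a script roughly as follows. This is a minimal sketch, not necessarily what appears on screen in the video: the `requireNamespace()` guard is an added assumption that replaces commenting the install line out by hand, the CRAN mirror URL in `repos` is an assumption so the call also works outside RStudio, and `is_attached` is a hypothetical variable name used only to check the result.

```r
# One-time step: install caTools only if it is not already available.
# (Assumption: this guard stands in for commenting out install.packages()
# after the first run; the mirror URL is also an assumption.)
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools", repos = "https://cloud.r-project.org")
}

# Attach the package for the current session. This is the scripted
# equivalent of ticking the caTools box in the Packages tab.
library(caTools)

# Check that the package is now attached ("selected").
is_attached <- "package:caTools" %in% search()
print(is_attached)
```

Writing `library(caTools)` in the script, rather than ticking the box, means the package is attached again whenever the script is re-run in a fresh session.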