1 00:00:00,433 --> 00:00:01,300 Hello and welcome back. 2 00:00:01,300 --> 00:00:03,333 Today we're talking about the importance of splitting 3 00:00:03,333 --> 00:00:06,333 your data set into a training set and a test set. 4 00:00:06,333 --> 00:00:11,266 Let's imagine that you are tasked with predicting the sale prices of cars. 5 00:00:11,533 --> 00:00:14,233 That is your dependent variable. 6 00:00:14,233 --> 00:00:18,466 And your independent variables are the mileage of the car 7 00:00:18,466 --> 00:00:20,766 and its age. 8 00:00:20,766 --> 00:00:25,666 And in the data that was supplied to you, you have 20 cars in total. 9 00:00:25,700 --> 00:00:28,700 Of course, that's not a lot, but for illustrative purposes, 10 00:00:28,800 --> 00:00:31,333 for our tutorial, that will be sufficient. 11 00:00:31,333 --> 00:00:33,900 So what splitting your data implies 12 00:00:33,900 --> 00:00:36,900 is separating a part of your data out 13 00:00:37,200 --> 00:00:39,133 before you do anything. 14 00:00:39,133 --> 00:00:41,300 And usually that's about 20% of the data. 15 00:00:41,300 --> 00:00:46,600 So since we have 20 cars here, that's four cars that we separate out. 16 00:00:46,733 --> 00:00:50,166 So what that means is that the bulk of our data, 80%, will be our training set, 17 00:00:50,466 --> 00:00:53,666 and the separated 20% will be our test set. 18 00:00:54,000 --> 00:00:57,533 We'll use our training set to build the model. 19 00:00:57,733 --> 00:01:00,733 So in this case we're building a linear regression. 20 00:01:01,100 --> 00:01:05,433 And then we will take the cars from the test set. 21 00:01:05,733 --> 00:01:08,300 We will apply our model to them. 22 00:01:08,300 --> 00:01:12,300 They haven't been part of the model creation process. 23 00:01:12,300 --> 00:01:14,533 The model has no information about these cars. 24 00:01:14,533 --> 00:01:16,866 And now we're applying this model to them.
25 00:01:16,866 --> 00:01:21,233 And it's predicting certain values, certain prices. 26 00:01:21,433 --> 00:01:25,800 But the good news is that because this is something we separated in advance 27 00:01:25,800 --> 00:01:29,966 as part of the data that was given to us, we know the actual prices. 28 00:01:30,066 --> 00:01:33,933 So now we can take the predicted values, 29 00:01:34,166 --> 00:01:39,366 which were generated using a model that has never seen these cars before, 30 00:01:39,900 --> 00:01:43,833 and compare them to the actual values that we know, 31 00:01:43,833 --> 00:01:45,900 what these cars sold for. 32 00:01:45,900 --> 00:01:48,900 And so from that we can evaluate our model. 33 00:01:48,900 --> 00:01:50,100 Is it doing a good job? 34 00:01:50,100 --> 00:01:52,133 Is it doing a not-so-good job? 35 00:01:52,133 --> 00:01:53,633 And do we need to improve it? 36 00:01:53,633 --> 00:01:56,033 And that's how we split a training set and a test set. 37 00:01:56,033 --> 00:01:58,033 And that's why it's important to do that. 38 00:01:58,033 --> 00:01:59,766 I look forward to seeing you in the next tutorial. 39 00:01:59,766 --> 00:02:01,666 And until then, enjoy machine learning.
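
The workflow described in this tutorial (hold out 20% of the 20 cars, fit a linear regression on the remaining 16, then compare predictions with the known prices of the 4 held-out cars) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the course's own code: the car data is invented by a hypothetical `make_cars` helper, the helper names `train_test_split`, `solve`, `fit_linear_regression`, and `predict` are made up for this sketch, and mean absolute error is just one possible way to score the model.

```python
import random

def make_cars(n=20, seed=0):
    """Invent n cars as (mileage_thousand_km, age_years, price) rows.
    The numbers are synthetic and exist only for illustration."""
    rng = random.Random(seed)
    cars = []
    for _ in range(n):
        mileage = rng.uniform(5, 150)   # thousands of km (assumed unit)
        age = rng.uniform(1, 12)        # years
        price = 30_000 - 80 * mileage - 900 * age + rng.gauss(0, 500)
        cars.append((mileage, age, price))
    return cars

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows, then hold out test_fraction of them as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear_regression(rows):
    """Least-squares fit of price = b0 + b1*mileage + b2*age via the normal equations."""
    xs = [(1.0, mileage, age) for mileage, age, _ in rows]
    ys = [price for _, _, price in rows]
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(3)] for i in range(3)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(3)]
    return solve(xtx, xty)

def predict(coefs, mileage, age):
    """Apply the fitted coefficients to one car."""
    b0, b1, b2 = coefs
    return b0 + b1 * mileage + b2 * age

cars = make_cars()
train, test = train_test_split(cars)      # the split happens before any fitting
coefs = fit_linear_regression(train)      # the model only ever sees the training set

# Compare predictions for the held-out cars with their known prices.
abs_errors = [abs(predict(coefs, m, a) - p) for m, a, p in test]
mean_abs_error = sum(abs_errors) / len(abs_errors)

print(len(train), len(test))  # 16 4
print(f"mean absolute error on the test set: {mean_abs_error:.0f}")
```

The split is done first so that the four test cars play no part in fitting the coefficients. The error measured on them is then an estimate of how the model behaves on cars it has never seen.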