1 00:00:00,200 --> 00:00:04,830 In the last video we left off scoring our machine learning model that we'd fitted to the data. 2 00:00:05,010 --> 00:00:10,470 After we'd filled all the missing numerical values, filled all the missing categorical values and turned all our 3 00:00:10,470 --> 00:00:18,350 data into numbers, the model worked: our data had no missing values and it was all numeric. But we left off 4 00:00:18,350 --> 00:00:21,160 with the question: why isn't this metric reliable? 5 00:00:21,170 --> 00:00:27,980 Let's put that here, because it's worth highlighting. Always remember: evaluating a machine 6 00:00:27,980 --> 00:00:33,290 learning model is just as important as fitting one. 7 00:00:33,290 --> 00:00:37,280 And that's where we're up to now. Because we've just fit a machine learning model, we have to make sure 8 00:00:37,280 --> 00:00:41,840 that our results hold water, right? Because we don't want to be promising things that our model can't 9 00:00:41,840 --> 00:00:42,630 do. 10 00:00:42,650 --> 00:00:47,810 So, question: why doesn't the above metric hold water? 11 00:00:50,900 --> 00:00:57,380 "Hold water" is an expression; it means: why isn't the metric reliable? 12 00:00:57,380 --> 00:00:59,950 That's just a way of speaking. 13 00:01:00,380 --> 00:01:01,300 So why doesn't it 14 00:01:01,400 --> 00:01:03,800 hold water? Why isn't it reliable? 15 00:01:03,810 --> 00:01:06,380 Have you had a chance to think about it? 16 00:01:06,580 --> 00:01:11,470 I'll show you a diagram that may refresh your memory if we come back to our keynote. 17 00:01:11,500 --> 00:01:13,510 Remember this from right back at the beginning: 18 00:01:13,510 --> 00:01:17,300 the most important concept in machine learning, a.k.a. the three sets. 19 00:01:18,250 --> 00:01:22,860 So what we've done is we've trained a machine learning model on one dataset. 20 00:01:22,870 --> 00:01:24,150 But then what have we done? 21 00:01:24,200 --> 00:01:25,560 If we come back to the code...
22 00:01:25,960 --> 00:01:33,200 We fit the model on some data, but then we've evaluated it on the exact same 23 00:01:33,200 --> 00:01:36,090 data. 24 00:01:37,020 --> 00:01:42,390 Okay, so if we come back here, essentially what we've done, if we think of this as being 25 00:01:42,390 --> 00:01:46,680 a university course, is we've learned the course materials. 26 00:01:47,040 --> 00:01:53,600 But instead of testing our model on the equivalent of a final exam, a.k.a. a test 27 00:01:53,680 --> 00:02:01,110 dataset, we've just evaluated our machine learning model on the exact same materials it learned 28 00:02:01,110 --> 00:02:06,600 from. It's as if you were in a class, you got given a book to read, and then 29 00:02:06,600 --> 00:02:11,280 you got tested with questions taken directly from that very book. 30 00:02:11,760 --> 00:02:12,420 Okay. 31 00:02:12,970 --> 00:02:19,450 But we're after our model's ability to generalize, a.k.a. the ability of a machine learning model to 32 00:02:19,450 --> 00:02:21,790 perform well on data it hasn't seen before. 33 00:02:21,790 --> 00:02:23,400 That's what we're after. 34 00:02:23,440 --> 00:02:30,150 If we go back to Kaggle, there's a training dataset, which contains data through the end of 2011. 35 00:02:30,190 --> 00:02:32,740 There's a validation set and there's a test set. 36 00:02:32,740 --> 00:02:34,180 So that's what we're after. 37 00:02:34,240 --> 00:02:39,680 We need to evaluate our model not on the data we've trained it on, which is this; we need to evaluate 38 00:02:39,680 --> 00:02:46,360 it on the test data. But before we do that, if you remember, right back up the top, let's go right back 39 00:02:46,360 --> 00:02:51,170 up to where we imported our first data frame. I said we'd revisit this. 40 00:02:51,330 --> 00:02:52,720 And now the time has come.
41 00:02:53,590 --> 00:02:58,570 So if we go here, we imported TrainAndValid.csv. 42 00:02:59,140 --> 00:03:00,920 You might wonder, why did we do that? 43 00:03:00,940 --> 00:03:03,850 Well, it's all going to become clear in this video. 44 00:03:03,850 --> 00:03:05,810 So there's TrainAndValid.csv. 45 00:03:05,930 --> 00:03:10,960 There's also Valid.csv, and we've got Train.csv. Why didn't we just import them separately? 46 00:03:10,960 --> 00:03:16,960 Well, I wanted to demonstrate what it's like creating your own validation set, rather than someone else 47 00:03:16,960 --> 00:03:21,600 creating it for you, with a time series dataset, which is what we're working on. 48 00:03:21,600 --> 00:03:24,270 So this is a perfect playground for that. 49 00:03:24,340 --> 00:03:26,140 So if we read here: 50 00:03:26,140 --> 00:03:28,500 Train.csv is the training set, 51 00:03:28,840 --> 00:03:33,360 basically what we've been working on, and Valid.csv is the validation set. 52 00:03:33,430 --> 00:03:36,890 The key point here is that because it's a time series, 53 00:03:36,890 --> 00:03:43,390 the training dataset has data up to the end of 2011, whereas the validation set has data from January 54 00:03:43,390 --> 00:03:46,720 1, 2012 to April 30, 2012. 55 00:03:46,720 --> 00:03:53,410 So to better evaluate our model, we need to split our data up into training and validation 56 00:03:53,410 --> 00:03:54,330 sets. So let's do that. 57 00:03:55,060 --> 00:03:56,920 Let's go here: 58 00:03:56,920 --> 00:04:05,140 splitting data into train and validation sets. Because that's what we're after now. We've got a model; we can 59 00:04:05,140 --> 00:04:05,920 build models now. 60 00:04:05,920 --> 00:04:10,180 But rather than just build better models, we need to make sure that what we're evaluating 61 00:04:10,180 --> 00:04:11,860 with makes sense. 62 00:04:11,860 --> 00:04:14,780 So let's check our data again, as we always do. 63 00:04:14,890 --> 00:04:17,770 So maybe we can use the year.
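That import step can be sketched in code like this. This is a minimal sketch: the real notebook reads the Bluebook for Bulldozers TrainAndValid.csv (about 412,000 rows), so an inline stand-in CSV is used here so the snippet runs on its own; the saledate column name follows that dataset.

```python
import io
import pandas as pd

# Inline stand-in for TrainAndValid.csv (the real file is far larger)
raw = io.StringIO(
    "SalesID,SalePrice,saledate\n"
    "1,66000,11/16/2006 0:00\n"
    "2,57000,3/26/2004 0:00\n"
    "3,10000,2/26/2012 0:00\n"
)

# Parse the date column on import, then sort by sale date so that
# later time-based splits follow the order the sales actually happened
df = pd.read_csv(raw, parse_dates=["saledate"])
df.sort_values(by=["saledate"], inplace=True, ascending=True)

print(df["saledate"].is_monotonic_increasing)  # True
```

Sorting by date up front is what makes a simple year-based train/validation split meaningful later on.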
64 00:04:17,980 --> 00:04:24,170 What was it called? saleYear. df_tmp.saleYear. There we go. 65 00:04:24,170 --> 00:04:26,990 Okay, so maybe, because our data frame is in order... 66 00:04:26,990 --> 00:04:32,600 Yes, this is going to help us: when we imported our data frame, we ordered it by sale date. 67 00:04:32,630 --> 00:04:38,310 So now, reading this, to create our own validation set we want to split our data. 68 00:04:38,330 --> 00:04:41,240 So all of the rows up to 2011, 69 00:04:41,240 --> 00:04:48,890 i.e. saleYear up to 2011, can be in the training set, and then all of the rows in 2012 can be the validation 70 00:04:48,890 --> 00:04:49,370 set. 71 00:04:49,370 --> 00:04:53,530 We won't worry about the test set for now, because that's in a separate dataset. 72 00:04:53,540 --> 00:04:58,940 The reason we're only working on train and valid is because we imported 73 00:04:58,980 --> 00:05:02,000 TrainAndValid.csv at the start of the notebook, 74 00:05:02,000 --> 00:05:07,140 this file here. So we have to create our own validation dataset. 75 00:05:07,290 --> 00:05:07,950 So let's do that. 76 00:05:07,950 --> 00:05:16,370 So we've got df_tmp; let's check df_tmp.saleYear.value_counts(). 77 00:05:16,440 --> 00:05:17,160 There we go. 78 00:05:17,190 --> 00:05:23,250 2012: so there are 11,573 samples in 2012. 79 00:05:23,280 --> 00:05:24,010 All right. 80 00:05:24,120 --> 00:05:29,480 So to split our data into training and validation, it should be as easy as going: 81 00:05:29,580 --> 00:05:30,590 okay, 82 00:05:30,840 --> 00:05:37,830 let's introspect the saleYear column, and every row where it's equal to 2012 will be in the validation set, 83 00:05:37,920 --> 00:05:40,700 for example, 84 00:05:40,710 --> 00:05:43,640 and every row where saleYear 85 00:05:43,660 --> 00:05:47,650 is not equal to 2012 will be in the training set.
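That value_counts() check looks like this in code. A sketch on toy data standing in for df_tmp; in the real dataset the 2012 count comes out as 11,573.

```python
import pandas as pd

# Toy stand-in for df_tmp with its derived saleYear column
df_tmp = pd.DataFrame({"saleYear": [2009, 2010, 2011, 2011, 2012, 2012]})

# How many samples fall in each sale year?
print(df_tmp["saleYear"].value_counts())
```

value_counts() returns the counts sorted from most to least frequent, which makes it a quick way to see how many rows the 2012 slice (our future validation set) will contain.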
86 00:05:47,700 --> 00:05:49,200 So let's stop talking about it, Daniel. 87 00:05:49,200 --> 00:05:55,280 Let's see the code to split the data into training and validation sets. 88 00:05:55,300 --> 00:05:58,910 This is something that you might have to do with your own time series data, right? 89 00:05:58,920 --> 00:06:02,590 Because you're not always going to be given it in a Kaggle format, right? 90 00:06:02,610 --> 00:06:06,660 When you're working with a client or working on a project, you're not always going to automatically have 91 00:06:06,660 --> 00:06:08,430 your data in train, valid and test sets. 92 00:06:08,430 --> 00:06:11,540 These are things you're going to have to create for yourself: 93 00:06:11,550 --> 00:06:16,020 looking at the sale column, or the time column, or the date column, and figuring out 94 00:06:16,260 --> 00:06:18,590 how you can make your own training and validation set. 95 00:06:18,600 --> 00:06:26,130 So that is exactly why we imported them as one set to begin with: so we could practice making our own 96 00:06:26,340 --> 00:06:28,080 training and validation sets. 97 00:06:28,200 --> 00:06:35,540 And I feel like I'm saying the word "set" a lot, but that's important, because if we go back to our keynote: 98 00:06:35,660 --> 00:06:40,460 this is the most important concept in machine learning, because whatever we train our model on, we want 99 00:06:40,460 --> 00:06:43,460 to make sure we're evaluating it on something else. 100 00:06:43,470 --> 00:06:50,670 So, coming here: the validation set is every row in df_tmp where the saleYear column equals 2012. 101 00:06:50,690 --> 00:06:51,590 Yes, that's correct. 102 00:06:52,010 --> 00:07:01,190 And the training set is every row in df_tmp where the saleYear column is not equal to 103 00:07:01,190 --> 00:07:02,180 2012. 104 00:07:04,700 --> 00:07:05,230 Wonderful. 105 00:07:05,270 --> 00:07:09,950 And then we might go len(df_val) and len(df_train).
106 00:07:10,340 --> 00:07:16,730 So all this is going to tell us is the length of these two data frames. Beautiful. 107 00:07:16,730 --> 00:07:23,210 So now we have a validation set which contains 11,573 rows, and a 108 00:07:23,210 --> 00:07:30,140 training dataset which contains 401,125 rows, or samples, and 109 00:07:30,140 --> 00:07:33,180 they're split on date. Beautiful. 110 00:07:33,190 --> 00:07:34,350 We're ticking boxes here. 111 00:07:34,360 --> 00:07:35,600 We are ticking boxes here. 112 00:07:35,800 --> 00:07:41,470 So now what we might do is split the data into X and y. 113 00:07:41,470 --> 00:07:48,670 We've seen this before, and that way we have an X train set and a y train set, and an X valid 114 00:07:48,670 --> 00:07:52,100 set and a y valid set, which are our data and labels. 115 00:07:52,330 --> 00:07:57,940 So we'll have X_train, y_train equals df_train... 116 00:07:58,150 --> 00:08:03,370 we're working with the training set here... .drop: we want to drop the SalePrice column on axis 117 00:08:03,370 --> 00:08:05,690 1.
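The year-based split described above can be written as two boolean-mask selections. A sketch on toy data; the saleYear and SalePrice column names follow the course notebook, and the real lengths come out as 11,573 and 401,125.

```python
import pandas as pd

# Toy stand-in for df_tmp: a few pre-2012 rows plus some 2012 rows
df_tmp = pd.DataFrame({
    "saleYear":  [2010, 2011, 2011, 2012, 2012],
    "SalePrice": [10_000, 20_000, 15_000, 30_000, 25_000],
})

# Time-based split: every 2012 row becomes validation data,
# every earlier row stays in the training set
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

print(len(df_val), len(df_train))  # 2 3
```

Because the two masks are exact complements, every row of df_tmp lands in exactly one of the two sets, with no overlap and no leakage of future (2012) sales into training.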
118 00:08:05,690 --> 00:08:12,880 Okay, that just drops the column from df_train, and y_train is going to be equal to the SalePrice 119 00:08:12,880 --> 00:08:20,610 column. See how I'm doing it with a comma here? I'm being a bit tricky by doing it in one line. Beautiful. 120 00:08:20,700 --> 00:08:31,950 And then here: X_valid, y_valid... just the same again, but 121 00:08:31,950 --> 00:08:40,830 this time with the validation set: df_val.drop... and this is actually going to be SalePrice. There we go. And 122 00:08:40,830 --> 00:08:46,020 now we might inspect our data, just to make sure we haven't made a little error somewhere; 123 00:08:46,400 --> 00:08:49,410 they should all be comparable shapes to each other. 124 00:08:52,060 --> 00:08:57,550 y_valid.shape. So we're just taking these datasets that we're creating here and finding 125 00:08:57,550 --> 00:09:00,590 out their shapes. Beautiful. 126 00:09:00,710 --> 00:09:10,400 So our X train set is about 401,000 rows with 102 features, 102 columns; our y train is about 401,000 127 00:09:10,430 --> 00:09:17,120 rows; and then validation is about eleven and a half thousand rows, with the same number of columns for 128 00:09:17,120 --> 00:09:22,760 X and no column dimension for y, because y is just one column. Let's have a look. 129 00:09:22,970 --> 00:09:25,810 Now look, this is what we do: make sure all of our data... 130 00:09:25,820 --> 00:09:27,470 Okay, so these are all sale prices. 131 00:09:28,280 --> 00:09:29,910 Beautiful. 132 00:09:30,340 --> 00:09:32,320 Well, okay.
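The one-line comma trick for the X/y split can be sketched like this. Toy frames stand in for df_train and df_val; in the real data X_train ends up with 102 feature columns rather than the one shown here.

```python
import pandas as pd

# Toy stand-ins for the df_train and df_val frames made earlier
df_train = pd.DataFrame({
    "saleYear":  [2010, 2011, 2011],
    "SalePrice": [10_000, 20_000, 15_000],
})
df_val = pd.DataFrame({
    "saleYear":  [2012, 2012],
    "SalePrice": [30_000, 25_000],
})

# Features (X) are every column except the target; labels (y) are SalePrice.
# Tuple assignment does both halves in one line
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train["SalePrice"]
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val["SalePrice"]

# Sanity check: row counts should match within each X/y pair
print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)
# (3, 1) (3,) (2, 1) (2,)
```

Note that the y shapes have no second dimension: each y is a one-dimensional Series, which is why the transcript says there are "no columns for y".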
133 00:09:32,330 --> 00:09:39,170 So we've got our data into train and validation sets. We've taken care of two of the three most important 134 00:09:39,170 --> 00:09:43,940 things, or most important sets; we'll have a look at the test set later. But this is what we've created: 135 00:09:43,940 --> 00:09:51,520 a training set and a validation set. It's time to keep building some more models, so we might end this 136 00:09:51,520 --> 00:09:58,120 video here. We still have to figure out a way to evaluate our machine learning model, so we'll probably 137 00:09:58,120 --> 00:09:59,620 have a look at that in the next video.