Now we've got an ideal model that is performing pretty well on the validation dataset. Validation RMSLE is the metric we've been paying attention to, because that is the evaluation metric for this particular Kaggle competition. Of course, if you're working on a different problem or a different set of data, you may have a different evaluation metric that you're trying to improve. But now we're doing pretty well, so let's use our ideal model to make predictions on the test data. There's a little tidbit in this section that's really important to remember, and we're going to come across it in a second. First of all, we're going to have to import the test data. If we have a look in our data folder — this is just in the Jupyter home, because we've created a project folder for this particular project and downloaded our data into it — there's a dataset in here called Test.csv. This is what we're going to be using. And if we come back to Kaggle, it tells us Test.csv is the test set, which wasn't released until the last week of the competition. Since this competition has already passed, we've got it right now; it contains data from May 1, 2012 to November 2012. Okay, so that's about the six months or so after our validation set.
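For reference, the RMSLE metric the video keeps referring to can be computed by hand. This is a minimal sketch (the function name `rmsle` and the toy arrays are just for illustration, not code from the notebook):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error -- the competition's evaluation metric."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Identical predictions score 0; larger values mean worse predictions.
print(rmsle(np.array([1000.0, 2000.0]), np.array([1000.0, 2000.0])))  # 0.0
```

Using `log1p` (log of 1 + x) keeps the metric defined even when a value is 0.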
Beautiful. Our model is performing okay on the validation set, so what we might do is see how we can make predictions on this future six months of data. We'll go df_test = pd.read_csv(), which is going to do the exact same thing we did at the beginning with our train and valid data for Bluebook for Bulldozers. Wonderful. The file is called Test.csv, we'll pass in low_memory=False, and we're also going to pass parse_dates, because remember, we haven't done any manipulation to this dataset like we have to our other ones. We'll see where that comes in in a moment, and then we'll check df_test.head(). Go ahead... huh, what's happening? Oh, typos. Beautiful. So this looks familiar; it's something that we've seen before, except this time the test dataset is missing the SalePrice column. The reason being is that's the column we're trying to predict. So why don't we see if we can do that? We'll go test_preds = ideal_model.predict() — we'll use our ideal model this time, since ideal_model has been performing the best — and we should be able to just pass it the whole data frame, right? Because this will be our X values, and we don't need to drop any columns, because it doesn't have a SalePrice column.
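The import step described above might look like this sketch. The in-memory CSV is a stand-in for the real Test.csv (so the snippet is self-contained), and the column values are hypothetical:

```python
import io
import pandas as pd

# Stand-in for the real file; in the notebook you'd pass the path instead, e.g.
# pd.read_csv("data/Test.csv", low_memory=False, parse_dates=["saledate"]).
csv_data = io.StringIO(
    "SalesID,saledate,state\n"
    "1227829,2012-05-16,Alabama\n"
    "1227844,2012-06-02,Texas\n"
)
df_test = pd.read_csv(csv_data, low_memory=False, parse_dates=["saledate"])
print(df_test["saledate"].dtype)  # datetime64[ns]
```

Passing `parse_dates` is what turns the saledate strings into a proper datetime column, which matters later when we extract date features from it.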
So let's see what happens. Actually, I'll put a little comment here: make predictions on the test dataset. And just as a little challenge before we run this cell, I want you to think about what might happen. Our machine learning model has been trained on a particular dataset, and now we've imported the test dataset. What's going to happen if we try to make predictions on this dataset, based on — I'll give you a hint — the data that our model has been trained on? And if you're not sure, that's fine; we're going to see it in a sec. We should see it right now. Oh, there we go: ValueError. Okay, what's happening here? Could not convert string to float. Okay, so if you remember, right back up the top — let's scroll back up through this beast of a notebook that we've been working through — what did we do before we could train a machine learning model? We saw a similar error when we first started modeling, right up here. It would be great if I could find it... our model didn't work... here we go. Okay: building a model, df_tmp.drop, could not convert string to float. So it's the same error message as what's below.
Now, why do you think this has happened? Well, it's because when we imported our original dataset, we did some manipulations to it. What did we do? We made sure our data was numeric, and we made sure there were no missing values. What we've done now is imported df_test, which may have some missing values — which it does — and might not be all numeric. And far out, we've done a fair bit of data processing here. If we check the columns — these are all the features — but if we check out X_train.columns (that needs brackets, hey), we notice that the column lengths are different. There are 102 columns in our X_train, whereas in df_test, what have we got here? We saw it up here before when we did .head(): 52 columns. So the reason why our machine learning model can't predict on the test dataset at the moment is that it's not in the same format as the dataset the model was trained on. That's what we're going to have to fix. All right, let's get rid of these cells here. So how could we do this? We might go: preprocessing the data — in other words, getting the test dataset into the same format as our training dataset. Wonderful. So how might we do this? What we might do, I think, is create a function for doing so.
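The column-count comparison can be sketched like this; the miniature frames and column names here are hypothetical stand-ins for the real X_train and df_test:

```python
import pandas as pd

# Hypothetical miniature frames: X_train has the engineered columns the model
# was trained on; df_test is the raw import with fewer columns.
X_train = pd.DataFrame(columns=["SalesID", "YearMade", "saleYear", "state_is_missing"])
df_test = pd.DataFrame(columns=["SalesID", "YearMade"])

# Comparing the column counts reveals the mismatch the model complains about.
print(len(X_train.columns), len(df_test.columns))  # 4 2

# The columns the model expects but the raw test frame lacks:
print(set(X_train.columns) - set(df_test.columns))
```

A set difference like the last line is a quick way to see exactly which engineered columns are missing from the test frame.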
So if we were to import a new dataset, say a test dataset, and we've actually done some manipulations to our training dataset, we could take those steps we ran on the training dataset and do the exact same thing on the test dataset. That way they're both the same. That makes sense? Let's see if we can do this. So preprocess_data — we'll pass it a data frame, and we'll leave a little note here: performs transformations on df and returns transformed df. Simple, right? We want to create a function which takes a data frame of some sort, does a bunch of manipulations on that data frame, and then eventually returns that data frame. So now we're going to have to fill in all the transformations that we did originally to df_tmp in here, and we can do that by going up — we might have to do a few copies and pastes here. So remember, right back up the top. This is a little rule, a little tidbit: what you do to the training dataset, you're going to have to do to the test dataset. That's a machine learning tidbit there. Okay, so here's the first big set of manipulations that we did. We'll copy that — wonderful — and we can just amend this line.
So we've added some information here based on the datetime column, saledate, and then we removed it. Those are the first two steps that we have to do. We'll scroll right down to the end of our notebook — it's going to be a lot of scrolling in this video — because we did three major things, right? We added some information here based on the datetime column; or we extracted it, that's probably a better way to put it. We extracted information from our saledate column and added it to the data frame. So we'll go df.drop, and after we've done this we'll drop the saledate column, with axis=1 and inplace=True. Wonderful. And we have to remove _tmp from all of these, because we're only using df. A little trick here: by moving your cursor and holding down Command if you're on a Mac — or maybe it's Control if you're on Windows — we can just backspace here. Wonderful, right? Because we're not working with df_tmp, we're working with df, so any arbitrarily named data frame will work with our function. The next thing we did was fill the numeric rows with the median, and we also filled categorical missing data and turned categories into numbers. So that's what we have to scroll back up and find.
Okay, let's do that. First we'll get the missing numeric values. Luckily we've laid out our notebook: we've added these headings here so that we can see what we were doing — this is all part of helping yourself out, right? So, fill numeric rows with the median. We actually need all of this code here; we'll copy that. Beautiful. This is where comments and different headings in your notebook can really help your future self out, when you want to functionize things and clean things up. So there we go. Again we have to remove the df_tmp references — we don't want df_tmp, we only want df. Okay, beautiful. And we still need to fill categorical missing data and turn categories into numbers. Now let's just see if we can remember how to do this rather than having to scroll back up, because we've done enough scrolling, to be honest. So we can do this: if not — do you remember? — if it's not a numeric type, that's how we did it: if not pd.api.types.is_numeric_dtype(content) — because we're looping, we're in the same loop here over label and content. We want to go df[label + "_is_missing"] = pd.isnull(content), so it's going to check if it's null.
If it's null, we want to add a label saying that particular data point is missing, and then we add +1 to the category code, because pandas encodes missing categories as -1. So this is where we can go df[label] = pd.Categorical(content).codes + 1: turn the content of a particular column into a categorical — so turn the content of a particular column into a category — then access its codes, turning it into a number, and then +1. All right. Not a bad function. So this is what kind of happens in a notebook workflow, right? Up here, you could probably classify most of this as pretty messy; we've helped ourselves out with different headings and comments and whatnot. However, as you get further into the process, you'll start to build these helper functions — you're collating a whole bunch of different steps into a function, so that it really shrinks things in the notebook and you can just call a function like this, much like our show_scores function, and it does a process that's going to be the same every time, rather than having to scroll up through the notebook and find something in a different cell. So now we can hit Shift-Enter on that. Oh, what's happened? Expected an indented block. I didn't think we needed one... oh, that's where we need it: we needed indents here.
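Putting the steps above together, the preprocess_data function might look roughly like this. It's a sketch assuming a `saledate` datetime column, not necessarily the exact code from the notebook:

```python
import pandas as pd

def preprocess_data(df):
    """Performs transformations on df and returns transformed df."""
    df = df.copy()

    # 1. Extract date parts from saledate, then drop the original column.
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    df.drop("saledate", axis=1, inplace=True)

    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            # 2. Fill numeric missing values with the median, flagging
            #    which rows were missing.
            if pd.isnull(content).sum():
                df[label + "_is_missing"] = pd.isnull(content)
                df[label] = content.fillna(content.median())
        else:
            # 3. Flag missing categorical data and turn categories into
            #    numbers; +1 because pandas encodes missing values as -1.
            df[label + "_is_missing"] = pd.isnull(content)
            df[label] = pd.Categorical(content).codes + 1
    return df
```

Calling preprocess_data(df_test) then gives a frame whose columns can be lined up with X_train before predicting.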
There we go. Classic — this is what happens when we copy and paste code, right? The formatting isn't always perfect. Beautiful, so we have a function here that's going to preprocess some sort of data frame. It's going to add some information based on the saledate column, it's going to fill numeric rows with the median, and it's going to turn missing categorical data into a zero value and the rest of the data into its category code number. Beautiful. So it's going through a few steps, but that's the right way to functionize it: it's all going to happen together. And now, what might we do here? Actually, a good question is: where might this function break? I might leave you on that for this video, to save it from getting too long. I want you to have a look at this big function that we've created and think about: if we pass it our training data frame, how might it work? And then if we passed it our test data frame — df_test, which is up here; we passed it this data frame — how might it break? We'll answer that next video.