Now we've got an ideal model that is performing pretty well on the validation dataset. Validation RMSLE is the metric we've been paying attention to, because that is the evaluation metric for this particular Kaggle competition. Of course, if you're working on a different problem or a different set of data, you may have a different evaluation metric that you're trying to improve. But now we're doing pretty well, so let's use our ideal model to make predictions on the test data. There's a little tidbit in this section that's really important to remember, and we're going to come across it in a second. First of all, we're going to have to import the test data. If we have a look in our data folder — this is just in the Jupyter home, because we've created a project folder for this particular project and downloaded our data into it — there's a dataset in here called Test.csv. This is what we're going to be using. And if we come back to Kaggle, it tells us Test.csv is the test set, which wasn't released until the last week of the competition. Since this competition has already passed, we've got it right now; it contains data from May 1, 2012 to November 2012. Okay, so that's about the six months or so after our validation set.
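For reference, the RMSLE metric the video keeps referring to can be computed by hand. This is a minimal sketch (the function name `rmsle` and the toy arrays are just for illustration, not code from the notebook):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error -- the competition's evaluation metric."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Identical predictions score 0; larger values mean worse predictions.
print(rmsle(np.array([1000.0, 2000.0]), np.array([1000.0, 2000.0])))  # 0.0
```

Using `log1p` (log of 1 + x) keeps the metric defined even when a value is 0.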
Beautiful. Our model is performing okay on the validation set, so what we might do is see how we can make predictions on this future six months of data. We'll go df_test = pd.read_csv(), which is going to do the exact same thing we did at the beginning with our train and valid data for Bluebook for Bulldozers. Wonderful. The file is called Test.csv, we'll pass in low_memory=False, and we're also going to pass parse_dates, because remember, we haven't done any manipulation to this dataset like we have to our other ones. We'll see where that comes in in a moment, and then we'll check df_test.head(). Go ahead... huh, what's happening? Oh, typos. Beautiful. So this looks familiar; it's something that we've seen before, except this time the test dataset is missing the SalePrice column. The reason being is that's the column we're trying to predict. So why don't we see if we can do that? We'll go test_preds = ideal_model.predict() — we'll use our ideal model this time, since ideal_model has been performing the best — and we should be able to just pass it the whole data frame, right? Because this will be our X values, and we don't need to drop any columns, because it doesn't have a SalePrice column.
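The import step described above might look like this sketch. The in-memory CSV is a stand-in for the real Test.csv (so the snippet is self-contained), and the column values are hypothetical:

```python
import io
import pandas as pd

# Stand-in for the real file; in the notebook you'd pass the path instead, e.g.
# pd.read_csv("data/Test.csv", low_memory=False, parse_dates=["saledate"]).
csv_data = io.StringIO(
    "SalesID,saledate,state\n"
    "1227829,2012-05-16,Alabama\n"
    "1227844,2012-06-02,Texas\n"
)
df_test = pd.read_csv(csv_data, low_memory=False, parse_dates=["saledate"])
print(df_test["saledate"].dtype)  # datetime64[ns]
```

Passing `parse_dates` is what turns the saledate strings into a proper datetime column, which matters later when we extract date features from it.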
So let's see what happens. Actually, I'll put a little comment here: make predictions on the test dataset. And just as a little challenge before we run this cell, I want you to think about what might happen. Our machine learning model has been trained on a particular dataset, and now we've imported the test dataset. What's going to happen if we try to make predictions on this dataset, based on — I'll give you a hint — the data that our model has been trained on? And if you're not sure, that's fine; we're going to see it in a sec. We should see it right now. Oh, there we go: ValueError. Okay, what's happening here? Could not convert string to float. Okay, so if you remember, right back up the top — let's scroll back up through this beast of a notebook that we've been working through — what did we do before we could train a machine learning model? We saw a similar error when we first started modeling, right up here. It would be great if I could find it... our model didn't work... here we go. Okay: building a model, df_tmp.drop, could not convert string to float. So it's the same error message as what's below.
Now, why do you think this has happened? Well, it's because when we imported our original dataset, we did some manipulations to it. What did we do? We made sure our data was numeric, and we made sure there were no missing values. What we've done now is imported df_test, which may have some missing values — which it does — and might not be all numeric. And far out, we've done a fair bit of data processing here. If we check the columns — these are all the features — but if we check out X_train.columns (that needs brackets, hey), we notice that the column lengths are different. There are 102 columns in our X_train, whereas in df_test, what have we got here? We saw it up here before when we did .head(): 52 columns. So the reason why our machine learning model can't predict on the test dataset at the moment is that it's not in the same format as the dataset the model was trained on. That's what we're going to have to fix. All right, let's get rid of these cells here. So how could we do this? We might go: preprocessing the data — in other words, getting the test dataset into the same format as our training dataset. Wonderful. So how might we do this? What we might do, I think, is create a function for doing so.
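The column-count comparison can be sketched like this; the miniature frames and column names here are hypothetical stand-ins for the real X_train and df_test:

```python
import pandas as pd

# Hypothetical miniature frames: X_train has the engineered columns the model
# was trained on; df_test is the raw import with fewer columns.
X_train = pd.DataFrame(columns=["SalesID", "YearMade", "saleYear", "state_is_missing"])
df_test = pd.DataFrame(columns=["SalesID", "YearMade"])

# Comparing the column counts reveals the mismatch the model complains about.
print(len(X_train.columns), len(df_test.columns))  # 4 2

# The columns the model expects but the raw test frame lacks:
print(set(X_train.columns) - set(df_test.columns))
```

A set difference like the last line is a quick way to see exactly which engineered columns are missing from the test frame.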
So if we were to import a new dataset, say a test dataset, and we've actually done some manipulations to our training dataset, we could take those steps we ran on the training dataset and do the exact same thing on the test dataset. That way they're both the same. That makes sense? Let's see if we can do this. So preprocess_data — we'll pass it a data frame, and we'll leave a little note here: performs transformations on df and returns transformed df. Simple, right? We want to create a function which takes a data frame of some sort, does a bunch of manipulations on that data frame, and then eventually returns that data frame. So now we're going to have to fill in all the transformations that we did originally to df_tmp in here, and we can do that by going up — we might have to do a few copies and pastes here. So remember, right back up the top. This is a little rule, a little tidbit: what you do to the training dataset, you're going to have to do to the test dataset. That's a machine learning tidbit there. Okay, so here's the first big set of manipulations that we did. We'll copy that — wonderful — and we can just amend this line.
So we've added some information here based on the datetime column, saledate, and then we removed it. Those are the first two steps that we have to do. We'll scroll right down to the end of our notebook — it's going to be a lot of scrolling in this video — because we did three major things, right? We added some information here based on the datetime column; or we extracted it, that's probably a better way to put it. We extracted information from our saledate column and added it to the data frame. So we'll go df.drop, and after we've done this we'll drop the saledate column, with axis=1 and inplace=True. Wonderful. And we have to remove _tmp from all of these, because we're only using df. A little trick here: by moving your cursor and holding down Command if you're on a Mac — or maybe it's Control if you're on Windows — we can just backspace here. Wonderful, right? Because we're not working with df_tmp, we're working with df, so any arbitrarily named data frame will work with our function. The next thing we did was fill the numeric rows with the median, and we also filled categorical missing data and turned categories into numbers. So that's what we have to scroll back up and find.
Okay, let's do that. First we'll get the missing numeric values. Luckily we've laid out our notebook: we've added these headings here so that we can see what we were doing — this is all part of helping yourself out, right? So, fill numeric rows with the median. We actually need all of this code here; we'll copy that. Beautiful. This is where comments and different headings in your notebook can really help your future self out, when you want to functionize things and clean things up. So there we go. Again we have to remove the df_tmp references — we don't want df_tmp, we only want df. Okay, beautiful. And we still need to fill categorical missing data and turn categories into numbers. Now let's just see if we can remember how to do this rather than having to scroll back up, because we've done enough scrolling, to be honest. So we can do this: if not — do you remember? — if it's not a numeric type, that's how we did it: if not pd.api.types.is_numeric_dtype(content) — because we're looping, we're in the same loop here over label and content. We want to go df[label + "_is_missing"] = pd.isnull(content), so it's going to check if it's null.
If it's null, we want to add a label saying that particular data point is missing, and then we add +1 to the category code, because pandas encodes missing categories as -1. So this is where we can go df[label] = pd.Categorical(content).codes + 1: turn the content of a particular column into a categorical — so turn the content of a particular column into a category — then access its codes, turning it into a number, and then +1. All right. Not a bad function. So this is what kind of happens in a notebook workflow, right? Up here, you could probably classify most of this as pretty messy; we've helped ourselves out with different headings and comments and whatnot. However, as you get further into the process, you'll start to build these helper functions — you're collating a whole bunch of different steps into a function, so that it really shrinks things in the notebook and you can just call a function like this, much like our show_scores function, and it does a process that's going to be the same every time, rather than having to scroll up through the notebook and find something in a different cell. So now we can hit Shift-Enter on that. Oh, what's happened? Expected an indented block. I didn't think we needed one... oh, that's where we need it: we needed indents here.
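Putting the steps above together, the preprocess_data function might look roughly like this. It's a sketch assuming a `saledate` datetime column, not necessarily the exact code from the notebook:

```python
import pandas as pd

def preprocess_data(df):
    """Performs transformations on df and returns transformed df."""
    df = df.copy()

    # 1. Extract date parts from saledate, then drop the original column.
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    df.drop("saledate", axis=1, inplace=True)

    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            # 2. Fill numeric missing values with the median, flagging
            #    which rows were missing.
            if pd.isnull(content).sum():
                df[label + "_is_missing"] = pd.isnull(content)
                df[label] = content.fillna(content.median())
        else:
            # 3. Flag missing categorical data and turn categories into
            #    numbers; +1 because pandas encodes missing values as -1.
            df[label + "_is_missing"] = pd.isnull(content)
            df[label] = pd.Categorical(content).codes + 1
    return df
```

Calling preprocess_data(df_test) then gives a frame whose columns can be lined up with X_train before predicting.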
There we go. Classic — this is what happens when we copy and paste code, right? The formatting isn't always perfect. Beautiful, so we have a function here that's going to preprocess some sort of data frame. It's going to add some information based on the saledate column, it's going to fill numeric rows with the median, and it's going to turn missing categorical data into a zero value and the rest of the data into its category code number. Beautiful. So it's going through a few steps, but that's the right way to functionize it: it's all going to happen together. And now, what might we do here? Actually, a good question is: where might this function break? I might leave you on that for this video, to save it from getting too long. I want you to have a look at this big function that we've created and think about: if we pass it our training data frame, how might it work? And then if we passed it our test data frame — df_test, which is up here; we passed it this data frame — how might it break? We'll answer that next video.