We saw in the last section how to fill missing values in a dataset and then convert it to numbers. We filled the missing values with pandas, and then we converted them to numbers using scikit-learn. But it'd be nice to fill the missing values with scikit-learn and convert them to numbers with scikit-learn too, so we're kind of sticking to the one library. Let's see how we do that.

So first things first, we'll need to re-import and recreate our car sales missing data, because we've kind of filled the missing values in place up here using pandas. The one we have right now already has its missing values filled, so we'll read it back in with read_csv. I'm going to go car-sales... I've got a few... extended... where is it... car-sales-extended-missing-data.csv. Beautiful, that's what we want. And we'll just check: car_sales_missing.head(). Wonderful.

Now we'll just clarify whether we have some missing data in this one again. How do we do that? isna().sum(). Wonderful. So we have some missing values, which is not so wonderful if you're actually working on this dataset, because ideally your dataset doesn't have missing values, but that's all right, we're going to figure out how to fill them.

So how do we do that? First things first, we're going to work on a subset. Well, first we're going to get rid of rows which don't have Price values, and then we'll split it into X and y, because we don't want to deal with data that doesn't have labels: car_sales_missing.dropna(subset=["Price"], inplace=True). So what we're saying with this line here is basically: take the car_sales_missing DataFrame and drop the NaN values that are in the subset of the Price column. The 50 rows in the Price column that don't have values, remove them from the DataFrame. And then we're going to recalculate how many missing values we have. I might put a little comment here: drop the rows with no labels. Beautiful. We've lost some from the Make, Colour, Odometer and Doors columns because they may have been overlapping with the samples in the Price column that had missing values. So what's next? Well, we want to split into X and y.
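If you're following along, a minimal sketch of those first few cells might look like this (the file name is an assumption based on the dataset we've been using in this course):

```python
import pandas as pd

# Re-import the dataset that still has missing values
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

# Check how many missing values are in each column
car_sales_missing.isna().sum()

# Drop the rows with no labels (rows missing a Price value)
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()
```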
We've had plenty of practice doing this. This is what a lot of machine learning problems end up being, right? It's turning your data into features and labels, and then getting a machine learning model to hopefully learn some patterns in those features and predict the labels. That's what we're going to try and do here. axis=1, beautiful. And y, we've seen this before: car_sales_missing["Price"]. So again, we're just using these four columns here to predict the Price column. Well, that's our goal anyway.

So how would we fill the missing data and take care of these missing values with scikit-learn? Well, let's see. We've got two little handy-dandy imports. Well, one we've seen before, but one we haven't: import SimpleImputer. What it'll do is fill missing values with scikit-learn. Now, you might remember, right up there when we started this part, Part 1 of the scikit-learn workflow, which is getting the data ready, that filling missing values is also called imputation, which is what this does. So we import SimpleImputer from scikit-learn, and that's going to help us fill missing values. And then we've seen this one before: from sklearn.compose import ColumnTransformer, which allows us to define some kind of transformer and then apply it to whichever columns we want to use it on.

We want to do it the exact same way we did it with pandas, but this time just reproduce it with scikit-learn. So: fill categorical values with "missing" and numerical values with the mean. I keep writing these notes so we can remind ourselves and communicate through our code, make sure we're kind of talking it through and doing the right steps, because otherwise you can get a bit lost with everything going on.

So SimpleImputer, we're calling this class that we just imported here, and we're going to tell it strategy="constant". So the strategy we want it to fill with, it's basically going, hey, run over the categorical values (I've shortened categorical to cat here), and for every value keep your strategy constant, and the fill value is going to be "missing".
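Continuing on from the cell above, a sketch of the split and that first imputer (the variable names here are just my shorthand for what's described above):

```python
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Split the data into features (X) and labels (y)
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Fill categorical values with the string "missing"
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
```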
OK, that's starting to make sense. And then we're going to have a special one for our Doors feature, another SimpleImputer, because Doors is again that little funny one that is numerical but is also categorical. We're going to keep the strategy as constant, and we're going to say fill_value=4. Nice and simple. And then we'll have a numerical imputer, which we could have really called num_imputer, actually let's do that, because we're keeping these fairly short: num_imputer = SimpleImputer(strategy="mean"). We'll see what these mean in a second, we're just kind of, as I say, working through it. Let's go define the columns. So the cat columns, or cat_features, equals Make and Colour, then we've got a door feature, because again, remember, Doors is a special case, and then we've also got num_features, which is Odometer (KM). Wonderful.

And then what we're going to do is create an imputer, something that fills missing data, because that's what imputation is, right? If you hear that term imputation, it's kind of referring to finding a missing value and filling it with something, or calculating something to fill it with. So this is where we're going to leverage our ColumnTransformer. I love that word, column transformer, it always makes me think of Optimus Prime for some reason. What's your favourite Transformer? Mine's definitely Optimus Prime, or Bumblebee, I'm on the hype train there. Getting distracted, Daniel, we're talking about machine learning. Well, actually, transformers are machines that can learn, so we're not really getting distracted.

So what we're doing here is setting up our imputer and kind of passing it a few things to get it ready. So just bear with me a second, I'm kind of coding and trying to talk at the same time and getting distracted. So for the door imputer, we're going to go door_imputer, and then we're going to go door_feature. And then finally, one more: we've got num_imputer, then we're going to go num_imputer, and then num_features. Beautiful.

We've got a fair bit of code here, and we're going to do one more thing: transform the data. So we're going to go filled_X, because remember, X has some missing values at the moment. If I go up here, we want to go X.isna().sum(), which is just the same as up here, except we've got no Price category there. So we come down here: filled_X equals, we're going to take our little imputer, imputer.fit_transform(X), and then we're going to have a look at filled_X.
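Putting the rest of those pieces together, the imputer setup described here looks roughly like this (column names assumed from the dataset we've been working with):

```python
# Doors is numerical but behaves like a category, so fill it with a constant
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# Fill numerical values with the mean
num_imputer = SimpleImputer(strategy="mean")

# Define which columns each imputer applies to
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data) using ColumnTransformer
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data (fill the missing values in X)
filled_X = imputer.fit_transform(X)
filled_X
```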
Will this work? Fingers crossed, we've got a lot of code here. I've got one, two... about 24 lines of code plus comments... it worked. No errors. What has actually just happened? Well, let's step through this. What we might do is zoom out a little bit so we can see it all in one hit. Beautiful.

So what we've done is we've imported the SimpleImputer class from scikit-learn, as well as the ColumnTransformer class, and then we've defined some imputers. And remember, imputers just fill missing data, using the SimpleImputer class, which takes a strategy and a fill value. If the strategy is constant, we have to pass it a fill value, saying, hey, go to the categorical columns and constantly fill them: if you find a missing value, you fill it with the string "missing". Same thing for our door imputer, it's going to say keep the strategy constant, so for every missing cell, do the same thing and fill it with 4. Yeah, that makes sense. And for our numerical columns, in this case the Odometer column, fill them with the mean. So keeping that one nice and simple.

Then we've defined which columns are which. So our categorical columns are Make and Colour. Realistically, Doors is also categorical, but we've kind of given it its own because it's again halfway between a number and a category. So we've got the door feature, and then we've defined the numerical features, which is our Odometer. And then we've used the ColumnTransformer class, which is what we imported, when we created our imputer, passing it the imputations we wanted to do, all the transformations we wanted to do.

So these are the names. That's what ColumnTransformer takes: it takes a list of multiple different transformers. So see here, we've got a list, and within the list we have tuples of the name, the imputer we want to use and the features we want it applied to. This is just the name of the imputer, right? So if we had to access this imputation later, we can use this as its name. I've just kept it simple and called it the exact same thing as the imputer variable. And these are the features that we want this specific imputer to change. So this one, cat_imputer, is going to be used on the categorical features, door_imputer on the door feature, and then the numerical imputer on the numerical features. That was a mouthful.
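As a side note, if you did want to grab one of these imputers back by its name after fitting, ColumnTransformer keeps the fitted versions in its named_transformers_ attribute. A quick check, assuming the names we gave above:

```python
# Access a fitted imputer by the name we gave it in the ColumnTransformer,
# then inspect the value it learned (the mean it used to fill Odometer (KM))
imputer.named_transformers_["num_imputer"].statistics_
```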
And then once our transformer is defined, we can call fit_transform. We've called this filled_X because we're going to use our imputer and fit_transform on our X data to fill up the values of X. But don't take my word for it, let's see it in code. So car_sales_filled, let's create another DataFrame, because we want to use the same checking method that we've used before. We'll pass it filled_X, and then we'll pass it the column names. These are just the names of the respective columns, you can see: Make, Colour, Doors and then Odometer. So that's all we're going to do here: Make, Colour, Doors, Odometer (KM) as well. Beautiful, that should work.

Now we're going to have a look at the head of this DataFrame: car_sales_filled.head(). Wonderful, that looks familiar to what we've seen before. But the icing on the cake here, the real test, is seeing what isna().sum() outputs. We now have no missing values, thanks to this bunch of code that we've written here. So we've used SimpleImputer plus ColumnTransformer to fill the missing values, these missing values here, with some preset values that we defined here, and now car_sales_filled, which is what we've made out of our transformed data, has no missing values. Beautiful.

So now what should we be able to do? We have no missing values, so we should be able to convert these into numbers. So we come up here, we have some code that has done this before, and we're going to copy it and come back down. This is the only time I'll let you copy and paste code. I know I've said that before, but again, just to save us a little bit of time. What we might do is again split car_sales_filled into X and y, so we want X = car_sales_filled.drop... oh no, that already is, that already is our X value. Beautiful, so we can just change this to car_sales_filled. Wonderful. So this is telling us we've got a sparse matrix here now.

What can we do now? We've got our data as numbers, and filled, no missing values. Let's fit a model, let's fit our model, with np.random.seed just to make sure everything's working right. Always test the code, always test the code. That's what we're going to do. We're going to re-import what we need.
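In code, that whole chunk looks something like this. The one-hot encoding part mirrors the code we copied from earlier in the notebook, so treat it as a sketch; remainder="passthrough" is what keeps the Odometer column alongside the encoded categories:

```python
from sklearn.preprocessing import OneHotEncoder

# Put the filled (imputed) data back into a DataFrame so we can inspect it
car_sales_filled = pd.DataFrame(filled_X,
                                columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled.head()
car_sales_filled.isna().sum()  # should now be all zeros

# Turn the categories into numbers (same one-hot encoding approach as before)
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X  # a sparse matrix of numbers with no missing values
```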
Now we need to import our RandomForestRegressor, and we're going to have heaps of practice doing this. From sklearn.model_selection we want something to split our data into train and test sets, and then we'll actually split our data into train and test by going X_train, X_test, y_train, y_test = train_test_split. Wonderful. And we're going to pass it, this time, transformed_X, see this variable here, because what we've done is we've passed it car_sales_filled and basically just turned all the categorical features into numbers, we've one-hot encoded them. So we'll go here transformed_X, and also pass it y, because y is still saved in memory. Why doesn't it have to change? Because that's just the labels, and they're already numbers. And the test_size is 0.2. Wonderful.

And so we'll set up our model, which is a RandomForestRegressor, and then we're going to call model.fit. This is telling our model: hey, RandomForestRegressor, find the patterns between X_train and y_train. And then we'll call model.score on our test data. That's going to say: hey, I know you've found some patterns in this training dataset, but now evaluate those patterns on this test dataset. Moment of truth, Shift+Enter... beautiful.

We're getting a warning here because there are some changes happening in a future version of scikit-learn. You might not have this warning if you're using scikit-learn version 0.22. All it's saying is that n_estimators will be changed from the default of 10 to 100 in version 0.22. So we can basically get rid of this warning by going n_estimators=100. Wonderful.

So even though this model has more estimators than the previous one we used on our car sales data, it performs worse. This one gets a score of about 0.219, and if we come up here, where is it, we trained a model in the previous section, that one got about 0.304. The maximum score is 1, so they're both not doing as well as we might ideally like. But why do you think this one performed worse? Well, it's because it's only got nine hundred and fifty samples. So if we go len(car_sales_filled), as well as len(car_sales_missing)... no, we want the original car_sales. There we go. So our previous model was built on this one, and the model we've just built was built on this one. So one has a thousand samples and the other only has 950 samples, which is why it's done slightly worse.
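And the modelling step, roughly, looks like this (the random seed value is an assumption for reproducibility, and car_sales is the original DataFrame we loaded earlier in the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

np.random.seed(42)  # seed value assumed, just so results are reproducible

# Split the transformed (numeric, no-missing-values) data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)

# Fit a model and evaluate it on the test set
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Compare dataset sizes to see why the score dropped (950 vs 1000 samples)
len(car_sales_filled), len(car_sales)
```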
And so that's a big paradigm of machine learning, right? Most of the time, if you have more data, a machine learning model is usually able to find better patterns. And that's kind of what's happened here: because we've dropped the samples that don't have labels, our model hasn't been able to find as many patterns, even though we only removed 50 out of a thousand samples.

We've covered a lot here already. The most important key takeaways from this are: most datasets you come across won't be in a form that's ready to use immediately with machine learning models, and some take a bit more preparation than others. Most of the time, your data will have to be numerical and it can't have missing values. The process of filling missing values is called imputation, and the process of turning your non-numerical values into numerical values is referred to as feature engineering or feature encoding.

So with that being said, we've covered Part 1 of how to deal with your data. Let's get into Part 2 and figure out: where the hell did I get this idea of choosing a RandomForestRegressor for our problem? Why did I pick this machine learning model? Well, we'll have to wait and see.