1 00:00:00,560 --> 00:00:01,300 OK. 2 00:00:01,320 --> 00:00:02,400 Welcome back. 3 00:00:02,400 --> 00:00:07,010 Let's revisit our little list here of what we're covering to see where we're up to. 4 00:00:07,680 --> 00:00:12,390 OK, so we've done zero, an end-to-end scikit-learn workflow, and we've done one, 5 00:00:12,420 --> 00:00:14,110 getting your data ready. 6 00:00:14,130 --> 00:00:15,080 So now we're going to do 7 00:00:15,090 --> 00:00:15,820 number two: 8 00:00:15,930 --> 00:00:20,480 choose the right estimator/algorithm for our problem. 9 00:00:20,580 --> 00:00:23,050 Let's make a little heading for that too. 10 00:00:23,130 --> 00:00:30,540 Choosing the right estimator/algorithm for our problem. 11 00:00:30,540 --> 00:00:35,810 Now, a little thing to note here is that scikit-learn, let's just put it here: 12 00:00:35,820 --> 00:00:50,000 scikit-learn uses estimator as another term for machine learning model or algorithm. 13 00:00:50,170 --> 00:00:51,370 So that's an important thing to note. 14 00:00:51,370 --> 00:00:56,860 You'll come across it if you're using the scikit-learn documentation, or if you're just browsing general 15 00:00:56,860 --> 00:01:00,380 machine learning stuff online: you'll hear a few different names for it, 16 00:01:00,400 --> 00:01:04,370 so machine learning model, machine learning algorithm, or estimator. 17 00:01:04,390 --> 00:01:09,730 In the case of scikit-learn, they kind of have their own nomenclature just to make things a bit easier 18 00:01:09,790 --> 00:01:10,990 across the library. 19 00:01:10,990 --> 00:01:17,300 But if you hear those kinds of terms, just remember they're talking about a machine learning model. 20 00:01:17,390 --> 00:01:18,240 OK. 21 00:01:18,340 --> 00:01:19,930 How do we go about this?
22 00:01:19,930 --> 00:01:26,530 Well, some other things to note: first of all, before you choose an estimator/algorithm for 23 00:01:26,530 --> 00:01:31,250 your problem, you have to figure out what kind of problem you're working with. 24 00:01:31,330 --> 00:01:36,400 And up here we've seen a few different types of problems, but the main ones we're going to be looking 25 00:01:36,400 --> 00:01:46,260 at are classification, a.k.a. predicting whether a sample is one thing or another, 26 00:01:47,170 --> 00:01:56,340 and regression, predicting a number. So classification is like our heart disease problem: 27 00:01:56,430 --> 00:01:59,580 we're trying to predict whether someone has heart disease or not. 28 00:01:59,580 --> 00:02:04,260 And regression is where we're trying to predict a number, like with the Boston Housing dataset or with our 29 00:02:04,260 --> 00:02:05,680 car sales dataset: 30 00:02:05,730 --> 00:02:11,340 we're trying to predict a house price or a car price. That doesn't look very good. 31 00:02:11,340 --> 00:02:12,840 Let's go dot points. 32 00:02:12,840 --> 00:02:16,520 This thing's got to look good, right? 33 00:02:16,590 --> 00:02:21,860 So in the previous examples I've kind of just randomly imported this RandomForestRegressor. 34 00:02:21,900 --> 00:02:24,270 But where did I get that from? 35 00:02:24,330 --> 00:02:26,220 And I'll let you in on a little secret: 36 00:02:26,220 --> 00:02:27,860 it's not that trivial. 37 00:02:27,870 --> 00:02:33,570 There are a lot of different machine learning models, but scikit-learn has made something very helpful, 38 00:02:33,600 --> 00:02:35,800 and I'm very excited to show it to you. 39 00:02:35,820 --> 00:02:36,510 Let's have a look. 40 00:02:36,820 --> 00:02:44,750 So let's go google "sklearn ml map: choosing the right estimator". 41 00:02:44,840 --> 00:02:48,550 This is what we want. Now, when you first see this,
42 00:02:48,650 --> 00:02:54,470 it's gonna look like a whole bunch of different jargon, but as we start to dive into 43 00:02:54,470 --> 00:02:59,450 it, we'll start to realize, oh wow, this has some really useful things that we can use in our problems, 44 00:02:59,450 --> 00:03:02,990 our machine learning problems. And what is this? 45 00:03:02,990 --> 00:03:08,670 Well, to get to this website you just google "sklearn machine learning map" or "scikit-learn machine learning map". 46 00:03:08,840 --> 00:03:10,640 And this is "Choosing the right estimator". 47 00:03:10,780 --> 00:03:16,020 Remember, estimator in scikit-learn is the same as machine learning model or machine learning algorithm. 48 00:03:16,210 --> 00:03:22,030 And the documentation here says: often the hardest part of solving a machine learning problem can be finding 49 00:03:22,030 --> 00:03:23,910 the right estimator for the job. 50 00:03:23,910 --> 00:03:24,690 Yes. 51 00:03:24,700 --> 00:03:29,590 Different estimators are better suited for different types of data and different problems. 52 00:03:29,650 --> 00:03:35,290 That makes sense. If we have a classification problem, we might want to use one of these estimators. 53 00:03:35,440 --> 00:03:40,720 And if we have a regression problem, we might want to use one of these estimators. And then for something 54 00:03:40,720 --> 00:03:43,930 like a clustering problem, we have these estimators. 55 00:03:44,050 --> 00:03:47,310 And for something like a dimensionality reduction problem, 56 00:03:47,390 --> 00:03:48,750 you'd want to use one of these. 57 00:03:49,180 --> 00:03:54,130 But let's get hands-on, right? Because at the moment, this graph, or this flow chart, this map, whatever 58 00:03:54,130 --> 00:03:57,760 you want to call it, can seem a bit confusing to begin with. 59 00:03:57,760 --> 00:04:01,490 So what we're going to do is get hands-on with a problem. 60 00:04:01,510 --> 00:04:04,630 So if we start here: first we need some data.
61 00:04:04,660 --> 00:04:09,070 So this is what these little blue steps are. Do you have above 50 samples? 62 00:04:09,090 --> 00:04:09,530 No? 63 00:04:09,550 --> 00:04:10,460 Get more data. 64 00:04:10,540 --> 00:04:11,320 Simple. 65 00:04:11,320 --> 00:04:12,880 Do we have above 50 samples? 66 00:04:12,880 --> 00:04:13,590 Yes. 67 00:04:13,780 --> 00:04:15,130 Predicting a category? 68 00:04:15,220 --> 00:04:15,800 Yes. 69 00:04:15,820 --> 00:04:16,810 Do we have labeled data? 70 00:04:16,900 --> 00:04:17,890 Yes. 71 00:04:17,940 --> 00:04:19,600 Do we have under 100K samples? 72 00:04:19,600 --> 00:04:20,200 Yes. 73 00:04:20,200 --> 00:04:22,420 Use LinearSVC. 74 00:04:23,200 --> 00:04:24,280 Okay. 75 00:04:24,310 --> 00:04:24,670 All right. 76 00:04:24,670 --> 00:04:25,360 Enough talk. 77 00:04:25,450 --> 00:04:27,950 Let's get back to our notebook and start writing some code. 78 00:04:28,090 --> 00:04:34,440 What we're going to do to begin with is 2.1: we'll see how we did regression just recently, 79 00:04:34,450 --> 00:04:40,270 so we'll see how we would pick an estimator/algorithm for a regression problem. 80 00:04:40,330 --> 00:04:41,710 So let's do this. 81 00:04:41,790 --> 00:04:43,070 I'll make 2.1: 82 00:04:43,240 --> 00:04:53,620 picking a machine learning model for our regression problem. Beautiful. 83 00:04:53,710 --> 00:04:58,420 And so what we're going to do is use one of scikit-learn's built-in datasets, and that's 84 00:04:58,420 --> 00:05:00,620 the Boston Housing dataset. 85 00:05:00,670 --> 00:05:09,830 So let's go import Boston housing dataset, and we can do that with from sklearn.datasets 86 00:05:10,540 --> 00:05:14,440 import load_boston. And then we want to go 87 00:05:14,440 --> 00:05:22,960 boston, just to set it up, equals load_boston(), and then we want to see what it looks like: 88 00:05:23,050 --> 00:05:26,180 boston. OK.
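The loading step narrated above can be sketched in a few lines. One caveat: `load_boston` was removed from scikit-learn in version 1.2, so this sketch substitutes `load_diabetes`, another built-in regression dataset that returns the same dictionary-like structure; the substitution is an assumption for newer scikit-learn versions, not what the video types.

```python
# The video uses load_boston, which was removed in scikit-learn 1.2.
# load_diabetes follows the exact same pattern and is used here instead.
from sklearn.datasets import load_diabetes

dataset = load_diabetes()  # returns a dictionary-like Bunch object

# The same keys the video inspects on the Boston dataset
print(dataset.keys())           # includes 'data', 'target', 'feature_names'
print(dataset["data"].shape)    # (n_samples, n_features)
print(dataset["target"].shape)  # (n_samples,)
```

Any of scikit-learn's `load_*` datasets can be inspected this way before turning it into a DataFrame.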
89 00:05:26,350 --> 00:05:30,190 So it imports as a dictionary: we've got data as one of the keys, 90 00:05:30,190 --> 00:05:32,190 target is one of the keys, 91 00:05:32,380 --> 00:05:34,930 and then I think we have feature_names. 92 00:05:34,960 --> 00:05:35,890 What can we do with this? 93 00:05:36,070 --> 00:05:40,990 Well, let's first turn it into a pandas DataFrame so that we can see it a little bit better than being 94 00:05:40,990 --> 00:05:42,280 a dictionary. 95 00:05:42,280 --> 00:05:44,670 So we'll go boston_df. 96 00:05:44,890 --> 00:05:48,780 This is one of the first steps you'll usually do with any kind of problem, with any kind of data 97 00:05:48,780 --> 00:05:52,060 set: try to get it into a pandas DataFrame. 98 00:05:52,060 --> 00:05:52,270 Right? 99 00:05:52,270 --> 00:05:57,940 Because we've seen what pandas is capable of, and we know it looks good, and we know it's pretty malleable, 100 00:05:57,940 --> 00:06:01,510 and we can just do a whole bunch of different things once it's in a pandas DataFrame. 101 00:06:01,540 --> 00:06:07,260 So rather than having it in a dictionary, we'll get it into a DataFrame. And then we want to set up 102 00:06:07,420 --> 00:06:08,460 boston_df. 103 00:06:08,680 --> 00:06:13,930 And I'm kind of typing here without talking, but essentially what I'm doing is I'm taking the data key 104 00:06:14,170 --> 00:06:19,570 from the boston dictionary, setting the columns to be the feature_names from the dictionary, and 105 00:06:19,570 --> 00:06:22,350 then creating a target column here. 106 00:06:22,400 --> 00:06:30,220 This is what we're trying to predict, by setting it to pd.Series and taking the target key from 107 00:06:30,220 --> 00:06:32,290 the boston dictionary. 108 00:06:32,290 --> 00:06:37,970 So right now boston is a dictionary, and we're going to turn it into boston_df, which is a DataFrame.
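The dictionary-to-DataFrame step described above looks like this. Since `load_boston` was removed in scikit-learn 1.2, the sketch assumes `load_diabetes` as a stand-in with the same `data`/`target`/`feature_names` structure.

```python
# Sketch of the dictionary -> DataFrame step from the video.
# (load_boston was removed in scikit-learn 1.2, so load_diabetes,
# which has the same structure, is substituted here.)
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes()

# Feature columns come from the data key, named with feature_names
df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])

# Add what we're trying to predict as a 'target' column
df["target"] = pd.Series(dataset["target"])

print(df.head())
```

The same two lines work for any dataset dictionary with `data`, `feature_names`, and `target` keys.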
109 00:06:38,170 --> 00:06:47,130 So let's see what this bad boy looks like. KeyError: features names. It might just be feature_names. Columns... 110 00:06:47,340 --> 00:06:48,870 lots of typos. 111 00:06:48,960 --> 00:06:50,790 This is what happens when you type and talk. 112 00:06:50,880 --> 00:06:51,380 Okay. 113 00:06:51,540 --> 00:06:52,200 Excellent. 114 00:06:52,200 --> 00:06:53,370 So what can we see here? 115 00:06:54,540 --> 00:06:58,850 Well, we've got CRIM, ZN, INDUS, CHAS. 116 00:06:58,910 --> 00:07:00,440 Not sure what all of these are, 117 00:07:00,470 --> 00:07:02,010 but it's got a target column, 118 00:07:02,180 --> 00:07:06,200 so I'm assuming that's what we're trying to predict. And we can figure out what this actually is by just 119 00:07:06,200 --> 00:07:11,860 googling "sklearn Boston housing dataset". We're looking at sklearn.datasets. 120 00:07:11,870 --> 00:07:19,160 load_boston: so this is what we've just done, we've called this function from the sklearn.datasets 121 00:07:19,160 --> 00:07:20,150 module. 122 00:07:20,340 --> 00:07:21,780 We can read more in the user guide. 123 00:07:21,810 --> 00:07:24,180 I've seen this before, so I'm kind of familiar with it, 124 00:07:24,180 --> 00:07:27,100 but if you're first looking at it, you might have to dive in. 125 00:07:27,120 --> 00:07:30,900 So this is just going to give us a data dictionary of the columns that we're dealing with. 126 00:07:31,260 --> 00:07:37,270 So we see CRIM is per capita crime rate by town. Yep. 127 00:07:37,500 --> 00:07:38,180 OK. 128 00:07:38,200 --> 00:07:44,300 And then ZN, proportion of residential land zoned for lots over 25,000 square feet. 129 00:07:44,320 --> 00:07:44,670 Yep. 130 00:07:45,110 --> 00:07:45,610 OK. 131 00:07:45,850 --> 00:07:51,670 So, long story short, what this dataset is, is a whole bunch of different parameters about different towns 132 00:07:51,940 --> 00:07:52,870 in Boston.
133 00:07:52,870 --> 00:07:56,920 So each row of these (Boston is a city in America, by the way), 134 00:07:57,250 --> 00:08:02,250 so each row is a different town in Boston, and there are different features about each town. 135 00:08:02,260 --> 00:08:09,190 You can read all of these here, and what we're trying to do is use these features about the town to predict 136 00:08:09,220 --> 00:08:10,930 the median house price. 137 00:08:11,020 --> 00:08:15,400 And I believe this house price is in thousands, and this dataset is a little bit old too, that's why everything 138 00:08:15,400 --> 00:08:16,450 is so cheap. 139 00:08:16,540 --> 00:08:18,170 But the premise is here: 140 00:08:18,190 --> 00:08:20,440 this is a regression problem. 141 00:08:20,440 --> 00:08:27,400 So now we can figure out a few things about our DataFrame. We want to know how many samples we've 142 00:08:27,400 --> 00:08:27,610 got: 143 00:08:27,630 --> 00:08:32,420 len(boston_df): five hundred and six. 144 00:08:32,470 --> 00:08:33,100 Wonderful. 145 00:08:33,880 --> 00:08:39,670 So now that we know we have five hundred and six samples, and we have a regression problem, and we have 146 00:08:39,670 --> 00:08:48,550 labels, what can we do? We'll go back to our machine learning map, we come right back to the start, and we go, 147 00:08:48,640 --> 00:08:52,050 okay, we follow this little golden arrow. 148 00:08:52,090 --> 00:08:52,780 Beautiful. 149 00:08:52,780 --> 00:08:54,400 Do we have above 50 samples? 150 00:08:54,400 --> 00:08:55,340 Yes. 151 00:08:55,360 --> 00:09:00,820 See, if we didn't have above 50 samples, scikit-learn would be telling us to get more data, which makes sense. 152 00:09:00,820 --> 00:09:02,560 Are we predicting a category? 153 00:09:02,560 --> 00:09:03,470 No. 154 00:09:03,490 --> 00:09:05,440 Are we predicting a quantity? 155 00:09:05,440 --> 00:09:10,470 Yes, because we are working on a regression problem. 156 00:09:10,600 --> 00:09:14,080 Do we have under 100K samples?
157 00:09:14,260 --> 00:09:14,920 Yes. 158 00:09:14,980 --> 00:09:16,820 Few features should be important? 159 00:09:16,840 --> 00:09:18,820 We're actually not sure what this is. 160 00:09:18,970 --> 00:09:21,170 So, few features should be important: 161 00:09:21,220 --> 00:09:22,600 what have we got? 162 00:09:22,630 --> 00:09:29,000 We've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. 163 00:09:29,230 --> 00:09:34,000 We've got 13 features, but we don't actually know whether only a few of them will be important or not. 164 00:09:34,000 --> 00:09:36,870 So let's just say no for the time being. 165 00:09:36,910 --> 00:09:42,100 And so this is going to point us now to one of these green squares, and within each of these green squares 166 00:09:42,460 --> 00:09:46,060 is an estimator, or a machine learning model. 167 00:09:46,060 --> 00:09:53,010 So if we click on this one, ridge regression, this is gonna take us to the ridge regression and classification 168 00:09:53,010 --> 00:09:54,220 documentation. 169 00:09:54,360 --> 00:09:59,550 So it tells us, Ridge regression: Ridge regression addresses some of the problems of ordinary least squares, 170 00:09:59,570 --> 00:10:04,050 and we've got a whole bunch of math symbols, but really we just want to work out how we can use this 171 00:10:04,050 --> 00:10:05,840 machine learning model. 172 00:10:05,850 --> 00:10:09,640 Okay, so: from sklearn import linear_model, then linear_model.Ridge. Okay. 173 00:10:09,750 --> 00:10:11,620 So we can just import it like that. 174 00:10:11,640 --> 00:10:13,000 Well, let's see. 175 00:10:13,200 --> 00:10:16,240 Let's go back, let's get some code working. 176 00:10:17,140 --> 00:10:22,360 Let's try the Ridge regression model. 177 00:10:22,960 --> 00:10:29,840 So if we go back up here, we go from sklearn import linear_model, and it's linear_model.Ridge to use 178 00:10:29,840 --> 00:10:33,740 the ridge regression. But actually, we don't want to do it like that.
179 00:10:33,740 --> 00:10:43,550 We can save a line of code by going from sklearn.linear_model import Ridge, and 180 00:10:43,550 --> 00:10:46,010 then we're going to set up a random seed, 181 00:10:50,260 --> 00:10:59,250 np.random.seed, so we can make sure our results are reproducible, and then we want to create the data. 182 00:10:59,980 --> 00:11:03,040 So we're going to go X equals, we've seen this before, 183 00:11:03,070 --> 00:11:05,400 boston_df.drop: 184 00:11:05,410 --> 00:11:11,520 we want to drop the target column for X, because then it'll just be the features matrix, 185 00:11:11,550 --> 00:11:13,310 axis=1. 186 00:11:13,410 --> 00:11:18,590 I'm going to put a few rows down here so we've got some space. Then y goes: 187 00:11:18,660 --> 00:11:24,420 boston_df, and this is going to be the target column, because we want to use X to predict y, right? 188 00:11:24,630 --> 00:11:29,640 And then we want to go split into train and test sets. 189 00:11:29,690 --> 00:11:44,160 We're going to go X_train, X_test, y_train, y_test equals train_test_split(X, y), test_size, we use 20 percent 190 00:11:44,160 --> 00:11:49,020 again because that's a good general number, you'll see that come up over and over again in machine learning 191 00:11:49,020 --> 00:11:55,230 projects. And then the next thing to do, because we've imported Ridge, much like up here we instantiated 192 00:11:55,230 --> 00:11:57,350 the model with random forest, 193 00:11:58,170 --> 00:12:02,390 so model equals RandomForestRegressor(), we can do the same thing with Ridge. 194 00:12:02,520 --> 00:12:11,070 So we go instantiate Ridge model, and we'll call it the generic model again: model equals Ridge(). And then we're going 195 00:12:11,070 --> 00:12:23,100 to go model.fit(X_train, y_train), and then we want to go check the score of the Ridge model 196 00:12:23,280 --> 00:12:34,220 on the test data: model.score(X_test, y_test). Beautiful. Look at that, how quick was that?
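The steps typed out above come together as one short, runnable sketch. Since `load_boston` was removed in scikit-learn 1.2, this assumes `load_diabetes` as the regression dataset, so the score will differ from the one shown in the video.

```python
# The video's Ridge workflow in one sketch: seed, data, split, fit, score.
# (load_boston was removed in scikit-learn 1.2; load_diabetes substituted.)
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

np.random.seed(42)  # make our results reproducible

# Create the data: dictionary -> DataFrame with a 'target' column
dataset = load_diabetes()
df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
df["target"] = pd.Series(dataset["target"])

X = df.drop("target", axis=1)  # features matrix
y = df["target"]               # labels (what we're trying to predict)

# Split into train and test sets, holding out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the Ridge model and fit it (find patterns in the training data)
model = Ridge()
model.fit(X_train, y_train)

# Evaluate the model on the test data
print(model.score(X_test, y_test))
```

`score()` on a regressor returns the coefficient of determination, so the printed value is at most 1.0.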
197 00:12:34,410 --> 00:12:35,930 So what did we do here? 198 00:12:36,550 --> 00:12:38,050 Where did this come from? 199 00:12:38,050 --> 00:12:42,210 Remember, we went back to our machine learning map. 200 00:12:42,220 --> 00:12:43,190 We started here, 201 00:12:43,240 --> 00:12:46,200 we answered a few questions and followed the flow chart along here, 202 00:12:46,230 --> 00:12:47,610 and we clicked on ridge regression. 203 00:12:48,300 --> 00:12:51,600 And that took us to the documentation, and we went, here, 204 00:12:51,700 --> 00:12:52,000 okay, 205 00:12:52,000 --> 00:12:56,150 there's some example code, but we prefer to just start typing it out ourselves, which is what we did. 206 00:12:56,150 --> 00:13:01,690 I kind of skipped over looking at this because I wanted to just get it in here so you can get an example. 207 00:13:01,690 --> 00:13:08,930 And so now what we've done is split our data into X and y. We've got our Boston DataFrame here. 208 00:13:09,010 --> 00:13:13,450 We've taken these columns because we want to use these columns to predict the target. 209 00:13:14,120 --> 00:13:15,950 And so that's what we've done with X and y. 210 00:13:15,970 --> 00:13:20,980 Then we split into train and test sets, and we've instantiated a Ridge model, because that's the model that 211 00:13:20,980 --> 00:13:22,650 the map suggested we should use, 212 00:13:23,530 --> 00:13:30,690 this bad boy here. And then we fitted it to the data, a.k.a. asked our model to find the patterns between 213 00:13:30,720 --> 00:13:39,660 X_train and y_train, and then we evaluated our model on the test dataset. Wow, that was actually surprisingly 214 00:13:40,050 --> 00:13:40,830 easy. 215 00:13:40,890 --> 00:13:43,100 Now, this score: what is this? 216 00:13:43,110 --> 00:13:48,630 So, if we press Shift+Tab, this says it returns the coefficient of determination, R squared, of the prediction. 217 00:13:48,690 --> 00:13:51,670 Now, we'll look into more evaluation metrics as we go on.
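What `score()` computes for a regressor can be sketched by hand: the coefficient of determination is one minus the sum of squared residuals over the total sum of squares around the mean. A minimal check against scikit-learn's own implementation, using small made-up numbers:

```python
# A minimal sketch of what .score() computes for regressors:
# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual target values
y_pred = np.array([2.8, 5.2, 7.1, 8.9])   # a model's predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
r2 = 1 - ss_res / ss_tot

print(r2)                        # close to 1.0 because predictions are close
print(r2_score(y_true, y_pred))  # matches sklearn's calculation
```

Predicting every value exactly gives `ss_res = 0` and therefore a score of 1.0, which is why 1.0 is the highest possible score.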
218 00:13:51,810 --> 00:13:56,400 But just remember, the highest possible score you can get here is one point zero. 219 00:13:56,400 --> 00:14:02,710 So if our model was to predict each of these values exactly, then it would get a score of one point 220 00:14:02,720 --> 00:14:03,350 zero. 221 00:14:03,540 --> 00:14:06,630 So zero point six something is not too bad, right? 222 00:14:06,630 --> 00:14:10,300 The closer to 1, the better. What you might be asking is, how do we, 223 00:14:10,300 --> 00:14:11,900 how do we improve this score? 224 00:14:16,180 --> 00:14:17,530 Right. 225 00:14:17,590 --> 00:14:22,720 What if Ridge wasn't working? 226 00:14:22,720 --> 00:14:28,990 And luckily, if we go back to the machine learning map, we've got a little arrow here that says "not working", 227 00:14:29,700 --> 00:14:30,030 right? 228 00:14:30,040 --> 00:14:37,110 So if our ridge regression wasn't performing as well as we wanted, or wasn't working at all, well, this 229 00:14:37,120 --> 00:14:41,830 arrow's pointing us here, to go to one of these two green rectangles. 230 00:14:41,890 --> 00:14:43,970 So maybe we'll have a look at this in the next video. 231 00:14:44,110 --> 00:14:44,980 What's going on here? 232 00:14:44,980 --> 00:14:50,020 What if ridge regression wasn't working? At the moment it is, but we want to kind of improve this score, 233 00:14:50,770 --> 00:14:54,060 so we'll check out what we can do with that in the next video.