1
00:00:00,300 --> 00:00:01,433
Hello and welcome to

2
00:00:01,433 --> 00:00:04,433
this R tutorial
and in previous tutorials, we already

3
00:00:04,433 --> 00:00:09,266
implemented a multiple linear regression
model that we fitted to the training set.

4
00:00:09,300 --> 00:00:11,833
But when we take a step back,
do you think it's actually

5
00:00:11,833 --> 00:00:15,500
the optimal model that we can make
with the data set that we have here?

6
00:00:15,800 --> 00:00:18,200
You know,
because when we built this model,

7
00:00:18,200 --> 00:00:21,200
we actually used
all the independent variables.

8
00:00:21,300 --> 00:00:22,200
But what if

9
00:00:22,200 --> 00:00:24,366
among these independent variables
there are some

10
00:00:24,366 --> 00:00:28,633
that are highly statistically significant,
that is have great impact

11
00:00:28,633 --> 00:00:31,633
or great effect
on the dependent variable profit,

12
00:00:31,900 --> 00:00:35,300
and some that are not
statistically significant at all.

13
00:00:35,333 --> 00:00:40,133
That means that if we removed these
non-statistically significant variables

14
00:00:40,133 --> 00:00:43,733
from the model, well we would still get
some amazing predictions.

15
00:00:44,533 --> 00:00:48,300
So the goal in this tutorial
is to actually find a team,

16
00:00:48,466 --> 00:00:52,200
an optimal team of independent variables,
so that each independent

17
00:00:52,200 --> 00:00:56,700
variable of the team has great impact
on the dependent variable profit.

18
00:00:56,866 --> 00:00:59,433
That is, each independent
variable of the team

19
00:00:59,433 --> 00:01:02,966
is a powerful predictor
that is highly statistically significant

20
00:01:03,066 --> 00:01:06,500
and definitely has an effect
on the dependent variable profit.

21
00:01:06,800 --> 00:01:08,566
And this effect can be positive.

22
00:01:08,566 --> 00:01:11,833
That is, for an increase of one
unit of the independent variable,

23
00:01:11,833 --> 00:01:14,600
the profit will increase
or it can be negative.

24
00:01:14,600 --> 00:01:17,600
That is, for an increase of one
unit of the independent variable

25
00:01:17,600 --> 00:01:19,700
the profit will decrease.

26
00:01:19,700 --> 00:01:22,966
And so we're going to divide
this final step of building

27
00:01:22,966 --> 00:01:26,900
this optimal model using backward
elimination into two tutorials.

28
00:01:27,266 --> 00:01:30,600
In this first tutorial that we are having
right now, I'm going to walk you

29
00:01:30,600 --> 00:01:35,400
through the backward elimination algorithm
but without completing it up to the end.

30
00:01:35,866 --> 00:01:38,866
That means that at the end of this
tutorial, you'll get a homework

31
00:01:38,966 --> 00:01:44,200
which will consist of completing
what we started with backward elimination.

32
00:01:44,200 --> 00:01:48,000
So I'm sure you will have no problem,
because I'm going to walk you through

33
00:01:48,000 --> 00:01:51,900
the introduction of backward elimination
so that you can understand everything

34
00:01:51,900 --> 00:01:55,000
and have all the tools
to complete the homework.

35
00:01:55,333 --> 00:01:58,166
And then in the next tutorial,
I'll give you the solution of this

36
00:01:58,166 --> 00:02:01,433
homework and we will complete together
the backward elimination.

37
00:02:02,000 --> 00:02:03,833
So I hope you're excited.

38
00:02:03,833 --> 00:02:04,833
Let's start right now.

39
00:02:04,833 --> 00:02:07,266
The backward elimination.

40
00:02:07,266 --> 00:02:10,466
So for those of you
who followed the Python tutorial,

41
00:02:10,466 --> 00:02:13,466
you will notice that it's
actually a little simpler here,

42
00:02:13,500 --> 00:02:17,300
because in the Python tutorial
we had to use another library and another

43
00:02:17,300 --> 00:02:21,433
multiple linear regression
model to implement backward elimination.

44
00:02:21,866 --> 00:02:25,300
And this time we will simply
take the model that we created here,

45
00:02:25,466 --> 00:02:30,133
and we will use this amazing function
summary of R that returns

46
00:02:30,133 --> 00:02:35,900
a great deal of statistical information
that can help make our model more robust.

47
00:02:36,400 --> 00:02:39,966
And then same as in Python,
we are going to do some copy paste

48
00:02:40,100 --> 00:02:44,700
very simply until we get to the final team
of our independent variables.

49
00:02:45,166 --> 00:02:46,500
So let's do this.

50
00:02:46,500 --> 00:02:48,333
You're going to see it's really quick
and easy.

51
00:02:48,333 --> 00:02:50,166
We're going to do that very efficiently.

52
00:02:50,166 --> 00:02:54,066
And the first step of doing
that is to actually take our model

53
00:02:54,100 --> 00:02:55,633
because as I just mentioned,

54
00:02:55,633 --> 00:02:59,333
we are going to use this same model
to implement backward elimination.

55
00:03:00,000 --> 00:03:03,633
So here I just copied the model
and I'm going to paste it here.

56
00:03:04,166 --> 00:03:07,166
And now we're just going to change
two things.

57
00:03:07,500 --> 00:03:11,100
The first thing is
instead of having this dot here,

58
00:03:11,200 --> 00:03:14,200
you know this dot that represents
all the independent variables,

59
00:03:14,200 --> 00:03:17,766
we're going to write all the
independent variables separated by a plus

60
00:03:18,300 --> 00:03:21,966
because you know,
the principle of backward elimination

61
00:03:21,966 --> 00:03:25,200
is that we will remove
each independent variable

62
00:03:25,200 --> 00:03:28,200
that is not statistically
significant one by one.

63
00:03:28,500 --> 00:03:31,900
So we need here to write
each of the independent variables

64
00:03:31,900 --> 00:03:36,400
so that when we copy paste this model here
we will just need to remove

65
00:03:36,600 --> 00:03:41,400
the non statistically significant variable
from this equation here okay.

66
00:03:41,400 --> 00:03:42,466
So let's first do this

67
00:03:42,466 --> 00:03:46,700
I'm going to take our data set here
to look at the independent variables.

68
00:03:47,166 --> 00:03:49,533
So the first independent
variable is R&D Spend.

69
00:03:49,533 --> 00:03:51,566
So let's add it here.

70
00:03:51,566 --> 00:03:53,800
So R dot D.

71
00:03:53,800 --> 00:03:56,166
So you know as a reminder
there is a dot here

72
00:03:56,166 --> 00:04:00,666
because the original name
for this independent variable is R&D

73
00:04:00,666 --> 00:04:04,200
Spend, and R just replaced the ampersand and the space
by dots here.

74
00:04:04,633 --> 00:04:07,200
So it's good to know
that if you're working with R

75
00:04:07,200 --> 00:04:10,266
and if you have some data
sets with spaces in your column names,

76
00:04:10,900 --> 00:04:12,600
and then okay, so what was the name.

77
00:04:13,833 --> 00:04:15,200
Dot another dot.

78
00:04:15,200 --> 00:04:18,066
So that means another space: dot, and Spend.

79
00:04:18,066 --> 00:04:20,266
All right.
So that's the first independent variable.
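As a small sketch of that renaming, the lines below pass a hypothetical header row through make.names(), the base R function that read.csv() applies to column names by default. The header strings are assumptions about how the CSV file is written.

```r
# Hypothetical header row, written the way it might appear in the CSV file.
raw_names <- c("R&D Spend", "Administration", "Marketing Spend", "State", "Profit")

# make.names() turns each entry into a syntactically valid R name;
# characters that are not allowed, such as "&" and " ", become dots.
clean_names <- make.names(raw_names)
print(clean_names)
```

These cleaned names are the ones to type in the model formula.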

80
00:04:20,266 --> 00:04:22,000
And now let's add the second one.

81
00:04:22,000 --> 00:04:25,200
So we need to separate them by a plus
okay.

82
00:04:25,600 --> 00:04:27,300
And what is the second one.

83
00:04:27,300 --> 00:04:30,300
The second one is administration okay.

84
00:04:30,633 --> 00:04:32,066
So here there is no dot.

85
00:04:32,066 --> 00:04:35,200
Everything is fine
just as it's spelled right.

86
00:04:35,500 --> 00:04:38,500
Administration plus.

87
00:04:40,766 --> 00:04:43,766
Marketing spend.

88
00:04:44,733 --> 00:04:45,566
Plus.

89
00:04:45,566 --> 00:04:49,800
And we have one last
independent variable which is the state.

90
00:04:50,333 --> 00:04:54,366
So here we don't need to create the dummy
variables as we did in Python.

91
00:04:54,366 --> 00:04:57,833
Because remember we used here this amazing

92
00:04:58,000 --> 00:05:02,300
factor function that encoded this state
categorical variable

93
00:05:02,533 --> 00:05:05,333
into factor levels that are one, two, three.

94
00:05:05,333 --> 00:05:09,500
And there is no relational order
between those categories.

95
00:05:09,500 --> 00:05:10,866
So everything is fine.

96
00:05:10,866 --> 00:05:13,433
We don't need to create
any dummy variables.

97
00:05:13,433 --> 00:05:15,200
So that's the beauty of R.

98
00:05:15,200 --> 00:05:19,200
And so here same we don't need to sum
two separate dummy variables.

99
00:05:19,200 --> 00:05:23,100
We can take the original independent
variable state okay.
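A minimal sketch of why no manual dummy variables are needed, assuming a toy State column with three made-up state names: once the column is a factor, R builds the dummy columns itself when it sets up the regression.

```r
# Toy State column; the three state names are assumptions for illustration.
dataset <- data.frame(
  State = c("New York", "California", "Florida",
            "New York", "California", "Florida")
)

# Encode the categorical column as a factor with levels 1, 2, 3.
dataset$State <- factor(dataset$State,
                        levels = c("New York", "California", "Florida"),
                        labels = c(1, 2, 3))

# model.matrix() shows the design matrix that lm() would build:
# one intercept column plus one dummy column per non-reference level.
design <- model.matrix(~ State, data = dataset)
print(colnames(design))
```

The first level acts as the reference, which is why only State2 and State3 show up as separate coefficients in the summary.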

100
00:05:23,100 --> 00:05:25,833
So as I mentioned there are two things
that we would like to change here.

101
00:05:25,833 --> 00:05:29,900
The first thing was to replace the dot
by all these independent

102
00:05:29,900 --> 00:05:31,766
variables separated by a plus.

103
00:05:31,766 --> 00:05:32,100
And now

104
00:05:32,100 --> 00:05:35,633
the second thing that we would like to do,
though it's not compulsory, it's just because

105
00:05:35,866 --> 00:05:38,866
I would like to use the whole data
set to see the correlations

106
00:05:39,000 --> 00:05:42,000
is to replace here training set by

107
00:05:42,500 --> 00:05:45,033
simply our data set.

108
00:05:45,033 --> 00:05:46,666
So that's not compulsory.

109
00:05:46,666 --> 00:05:49,866
We can actually do backward elimination
using the training set.

110
00:05:50,433 --> 00:05:54,566
But we're just taking the whole data set
in order to have complete information

111
00:05:54,633 --> 00:05:58,533
about which independent variables
are statistically significant

112
00:05:58,700 --> 00:06:01,566
and which independent variables are not.

113
00:06:01,566 --> 00:06:04,300
Okay. And now actually we're almost ready.

114
00:06:04,300 --> 00:06:09,400
We just need to use the summary function,
which we actually already used before.

115
00:06:09,700 --> 00:06:12,600
And there is nothing more simple
than using the summary function.

116
00:06:12,600 --> 00:06:15,400
We just need to take the summary function
here.

117
00:06:15,400 --> 00:06:19,100
And then in parentheses
we input our regressor.

118
00:06:19,500 --> 00:06:22,133
Here it is. And now that's actually ready.

119
00:06:22,133 --> 00:06:26,000
We're actually ready to start the
first steps of our backward elimination.
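Put together, the two changes might look like the sketch below. The data frame is synthetic and only stands in for the real data set, so its numbers are assumptions; the column names and the regressor name follow the ones used in this tutorial.

```r
# Synthetic stand-in for the startups data (values are made up).
set.seed(1)
n <- 30
dataset <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(rep(c(1, 2, 3), length.out = n))
)
dataset$Profit <- 50000 + 0.8 * dataset$R.D.Spend + rnorm(n, sd = 9000)

# Full model: every independent variable written out and separated by a plus,
# fitted on the whole data set instead of the training set.
regressor <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
                data = dataset)

# Coefficient table with p values and significance stars.
print(summary(regressor))
```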

120
00:06:26,000 --> 00:06:30,000
Well, speaking of backward elimination,
let's have a look at the slide

121
00:06:30,000 --> 00:06:32,933
you saw with Kirill in the intuition
tutorial.

122
00:06:32,933 --> 00:06:33,966
And here is the slide.

123
00:06:33,966 --> 00:06:37,266
So let's have a quick reminder
about the five steps here.

124
00:06:37,500 --> 00:06:38,966
So the first step is to select

125
00:06:38,966 --> 00:06:42,466
a significance level
that is a threshold for our p value

126
00:06:42,700 --> 00:06:46,800
such that if the p value of an independent
variable is below

127
00:06:46,800 --> 00:06:50,533
the significance level, then this
independent variable stays in the model.

128
00:06:50,833 --> 00:06:53,700
And if the p value of the independent
variable is higher

129
00:06:53,700 --> 00:06:57,033
than the significance level,
then it will not stay in the model.

130
00:06:57,033 --> 00:06:58,833
We will remove it.

131
00:06:58,833 --> 00:07:02,166
So, the first step is just to select
a significance level.

132
00:07:02,166 --> 00:07:04,800
We don't have to do anything here
with the independent variables.

133
00:07:04,800 --> 00:07:05,933
We just need to choose one.

134
00:07:05,933 --> 00:07:09,833
And we're going to choose 5%, that is 0.05, okay.

135
00:07:09,833 --> 00:07:14,566
And now step two is to fit the
full model with all possible predictors.

136
00:07:14,900 --> 00:07:16,633
So that's actually what we've just done.

137
00:07:16,633 --> 00:07:19,566
You know
by taking all our independent variables

138
00:07:19,566 --> 00:07:24,166
in our regressor using the lm function
that actually fits the full model

139
00:07:24,166 --> 00:07:27,666
with all the possible predictors, that is
with all the independent variables.

140
00:07:28,266 --> 00:07:30,200
Okay. So done.

141
00:07:30,200 --> 00:07:31,533
And now what is step three.

142
00:07:31,533 --> 00:07:35,366
Step three is to look at the predictor
that has the highest p value.

143
00:07:35,966 --> 00:07:39,600
So we will find it
thanks to our summary function.

144
00:07:40,100 --> 00:07:43,100
And if the p value is higher
than the significance level

145
00:07:43,200 --> 00:07:46,233
that is if it's higher than 5%,
then we'll go to step four.

146
00:07:46,600 --> 00:07:49,600
And if that's not the case
our model is actually ready.

147
00:07:49,666 --> 00:07:51,866
But don't worry,
it will not be that quick.

148
00:07:52,900 --> 00:07:54,000
So actually.

149
00:07:54,000 --> 00:07:54,866
So let's suppose

150
00:07:54,866 --> 00:07:59,100
we found the highest p value higher
than the significance level of 5%.

151
00:07:59,433 --> 00:08:01,466
Then we need to move on to step four.

152
00:08:01,466 --> 00:08:03,500
And the step four is actually to remove

153
00:08:03,500 --> 00:08:06,533
this independent variable
that has the highest p value.

154
00:08:07,266 --> 00:08:10,433
And once we remove the predictor
we're ready to move on to step five

155
00:08:10,600 --> 00:08:13,800
which is to fit the model
without this variable.

156
00:08:14,233 --> 00:08:17,633
So that's why, you know,
we wrote all the independent variables

157
00:08:17,633 --> 00:08:19,866
one by one separated by a plus.

158
00:08:19,866 --> 00:08:23,966
Because, you know, once we reached step
five here, we will just copy paste

159
00:08:24,166 --> 00:08:29,366
the regressor and the summary function
and remove this independent variable

160
00:08:29,366 --> 00:08:32,333
that had the highest p value
from the regressor

161
00:08:32,333 --> 00:08:35,166
to build a new regressor
without this variable.

162
00:08:35,166 --> 00:08:38,166
And that will fit this model
without the variable.

163
00:08:38,233 --> 00:08:41,533
And once it's done
we go back to step three here

164
00:08:41,833 --> 00:08:45,266
to repeat this same pathway, that is,

165
00:08:45,266 --> 00:08:48,266
once again we're going to look
for the independent variables

166
00:08:48,266 --> 00:08:51,033
among the new team
of independent variables

167
00:08:51,033 --> 00:08:53,966
without the independent variable
that we just removed.

168
00:08:53,966 --> 00:08:57,066
So we're going to look for the independent
variable that has the highest p value.

169
00:08:57,366 --> 00:08:58,233
And same

170
00:08:58,233 --> 00:09:01,700
if the p value is higher than the
significance level we'll go to step four.

171
00:09:01,800 --> 00:09:04,300
And otherwise our model is ready.
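The loop over steps three, four and five can also be sketched as a small function. This is not what the tutorial does (it copies and pastes the regressor by hand); the function name backward_elimination, the synthetic data, and the restriction to numeric predictors are all assumptions made for illustration.

```r
# Synthetic stand-in data with numeric predictors only (values are made up).
set.seed(42)
n <- 50
dataset <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000)
)
dataset$Profit <- 50000 + 0.8 * dataset$R.D.Spend + rnorm(n, sd = 9000)

# Hypothetical helper: repeats steps 3 to 5 until every p value is <= sl.
# Works for numeric predictors, where coefficient names equal column names.
backward_elimination <- function(data, response, predictors, sl = 0.05) {
  while (length(predictors) > 0) {
    # Step 2 / step 5: fit the model with the current team of predictors.
    regressor <- lm(reformulate(predictors, response = response), data = data)
    # Step 3: p values of all coefficients except the intercept.
    p <- summary(regressor)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    if (max(p[, 1]) <= sl) {
      return(regressor)  # the model is ready
    }
    # Step 4: remove the predictor with the highest p value.
    worst <- rownames(p)[which.max(p[, 1])]
    predictors <- setdiff(predictors, worst)
  }
  # No predictor survived: fall back to an intercept-only model.
  lm(reformulate("1", response = response), data = data)
}

final_model <- backward_elimination(
  dataset, "Profit", c("R.D.Spend", "Administration", "Marketing.Spend")
)
print(names(coef(final_model)))
```

A factor such as State produces several coefficient rows for one variable, so it would need extra handling that this sketch leaves out.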

172
00:09:04,300 --> 00:09:05,200
So let's do this.

173
00:09:05,200 --> 00:09:10,500
We already completed step one
by choosing a significance level of 5%.

174
00:09:10,966 --> 00:09:12,266
And same for step two.

175
00:09:12,266 --> 00:09:14,833
We actually fitted the model
with all possible predictors.

176
00:09:14,833 --> 00:09:17,833
Well we need
of course to execute the code.

177
00:09:17,933 --> 00:09:21,966
And now we will move on to step three
which will consist

178
00:09:21,966 --> 00:09:24,966
of looking for the independent variable
that has the highest p value.

179
00:09:25,166 --> 00:09:26,200
So let's do this right now

180
00:09:27,333 --> 00:09:27,866
okay.

181
00:09:27,866 --> 00:09:32,466
So as
I just mentioned we need to execute this

182
00:09:32,466 --> 00:09:35,500
to build our regressor
with all the independent variables.

183
00:09:35,500 --> 00:09:39,366
Well actually we don't really need
to execute this because we actually

184
00:09:39,366 --> 00:09:43,233
executed this code section here
that builds exactly the same regressor.

185
00:09:43,233 --> 00:09:47,700
But just to complete all these steps in
this tutorial, let's execute that again.

186
00:09:47,700 --> 00:09:49,800
That will not cause any issue.

187
00:09:49,800 --> 00:09:53,566
So I'm going to press Command or Control
plus Enter to execute.

188
00:09:53,933 --> 00:09:54,900
And here we go.

189
00:09:54,900 --> 00:09:58,200
Same regressor created again
with a different syntax.

190
00:09:58,500 --> 00:10:02,233
And that's because we want to remove
the non-significant independent

191
00:10:02,233 --> 00:10:04,200
variable one by one.

192
00:10:04,200 --> 00:10:04,933
Great.

193
00:10:04,933 --> 00:10:07,933
So that actually completes step two.

194
00:10:07,933 --> 00:10:10,833
And now let's move on to step three
which was to look

195
00:10:10,833 --> 00:10:13,833
for the independent variable
that has the highest p value.

196
00:10:14,066 --> 00:10:17,566
And to do this we are going to select this

197
00:10:17,566 --> 00:10:21,433
summary function with the regressor
inside and press Command or Control plus

198
00:10:21,433 --> 00:10:22,433
Enter to execute.

199
00:10:23,866 --> 00:10:26,566
Let's move that up a little bit.

200
00:10:26,566 --> 00:10:29,300
So this information is very important

201
00:10:29,300 --> 00:10:32,366
information
when we want to build a robust model.

202
00:10:32,666 --> 00:10:36,333
And it's not only
the information about the p values here

203
00:10:36,333 --> 00:10:39,666
that will help select
the optimal team of independent variables.

204
00:10:39,933 --> 00:10:42,966
Because below
we also have this multiple r squared.

205
00:10:42,966 --> 00:10:45,100
And this adjusted r squared.

206
00:10:45,100 --> 00:10:48,900
That will help us build
an even more robust model than the one

207
00:10:48,900 --> 00:10:51,000
we are going to make
in the next two tutorials.

208
00:10:51,000 --> 00:10:54,700
Because at the end of this part there
is this section called Evaluating models

209
00:10:54,700 --> 00:10:58,100
performance
to actually improve the model performance.

210
00:10:58,600 --> 00:11:02,066
And in this part we will actually use
the multiple r squared and the adjusted

211
00:11:02,066 --> 00:11:06,333
R squared to finalize our journey
towards the most robust model.

212
00:11:06,533 --> 00:11:09,533
And you will perfectly understand
why at the end of this part.

213
00:11:09,933 --> 00:11:12,266
But for now, let's focus on the p values.

214
00:11:12,266 --> 00:11:14,766
So the p values are actually
in this column.

215
00:11:14,766 --> 00:11:18,666
And in R there's actually a shortcut
to look at the statistical significance.

216
00:11:18,966 --> 00:11:20,433
It's this last column here.

217
00:11:20,433 --> 00:11:22,300
Well this last column doesn't have a name.

218
00:11:22,300 --> 00:11:25,500
But you need to look at the stars here
because as a reminder,

219
00:11:25,600 --> 00:11:29,900
the more the p value is below
5% our significance level,

220
00:11:30,300 --> 00:11:33,533
then the more the independent variable
will be statistically significant

221
00:11:33,533 --> 00:11:37,000
for the dependent variable profit,
and the more the p value

222
00:11:37,000 --> 00:11:40,000
is higher than the significance level 5%,

223
00:11:40,033 --> 00:11:43,566
then the less statistically significant
the independent variable will be.

224
00:11:44,166 --> 00:11:48,233
So in short, the lower the p value,
the more your independent

225
00:11:48,233 --> 00:11:51,366
variable will have high impact
on your dependent variable,

226
00:11:51,533 --> 00:11:54,366
and the higher the p value,
the less effect,

227
00:11:54,366 --> 00:11:58,333
in fact, your independent variable is
going to have on the dependent variable.

228
00:11:59,000 --> 00:12:02,400
And there is this reminder here
that says that if the p value

229
00:12:02,400 --> 00:12:05,800
is between zero and 0.1 percent,

230
00:12:06,300 --> 00:12:10,200
then it's three stars, meaning that it's
highly statistically significant.

231
00:12:10,800 --> 00:12:14,500
Then if the p value is between
0.1% and 1%,

232
00:12:14,900 --> 00:12:18,500
then it's two stars, meaning
that it's very statistically significant

233
00:12:18,500 --> 00:12:20,900
but less significant
than when there are three stars.

234
00:12:20,900 --> 00:12:24,000
Then if the p value is between 1% and 5%,

235
00:12:24,300 --> 00:12:27,000
then it's
simply statistically significant.

236
00:12:27,000 --> 00:12:28,333
That is, your independent variable

237
00:12:28,333 --> 00:12:31,400
still has some good impact
on the dependent variable.

238
00:12:31,966 --> 00:12:37,166
And then if the p value is between
5% and 10%, then it's a dot, meaning

239
00:12:37,166 --> 00:12:41,833
that there is definitely a certain level
of statistical significance.

240
00:12:41,833 --> 00:12:45,700
That is, your independent variable has a
certain effect on your dependent variable,

241
00:12:46,200 --> 00:12:50,700
but definitely not as much as your other
independent variables

242
00:12:50,700 --> 00:12:53,700
that are in these categories here,
especially for this one.

243
00:12:54,166 --> 00:12:58,033
And finally,
if your p value is between 10% and one,

244
00:12:58,333 --> 00:13:01,766
well, there's absolutely no
statistical significance.
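The legend just described can be written down as a small lookup. The function name signif_code is hypothetical; the thresholds are the ones read out above.

```r
# Hypothetical helper that maps p values to the codes in the summary legend:
# [0, 0.001] "***", (0.001, 0.01] "**", (0.01, 0.05] "*", (0.05, 0.1] ".", else " ".
signif_code <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   labels = c("***", "**", "*", ".", " "),
                   include.lowest = TRUE))
}

# Example p values, one per band (made up for illustration).
p_values <- c(0.0002, 0.004, 0.03, 0.07, 0.5)
print(data.frame(p_value = p_values, code = signif_code(p_values)))
```

With a 5% significance level, only the variables marked with one, two or three stars stay in the model.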

245
00:13:02,400 --> 00:13:05,266
So that means that
with what we first observe here,

246
00:13:05,266 --> 00:13:10,333
well we can see that the R&D spend
is highly statistically significant.

247
00:13:10,766 --> 00:13:13,700
But the rest seems to be not significant.

248
00:13:13,700 --> 00:13:17,900
But let's wait for the backward
elimination to be completed to find out

249
00:13:17,900 --> 00:13:22,300
if our final team is actually
only composed of R&D Spend,

250
00:13:22,700 --> 00:13:26,000
because by removing some independent
variables here, that will remove

251
00:13:26,000 --> 00:13:30,000
some possible bias, so that once
some independent variables are removed,

252
00:13:30,200 --> 00:13:33,066
we can actually find
an independent variable that is more

253
00:13:33,066 --> 00:13:37,200
statistically significant than what
it appeared to be at the beginning.

254
00:13:37,200 --> 00:13:39,100
That is at the first step here.

255
00:13:39,100 --> 00:13:40,466
So let's find out about that.

256
00:13:40,466 --> 00:13:41,966
And actually you will find out

257
00:13:41,966 --> 00:13:45,700
about that yourself because this will be
the subject of the homework.

258
00:13:45,700 --> 00:13:48,866
But don't worry I'm going to walk you
through the first steps of

259
00:13:48,866 --> 00:13:51,866
backward elimination
and you will complete it yourself.

260
00:13:52,000 --> 00:13:54,900
And in the next tutorial,
of course, we'll have the solution

261
00:13:54,900 --> 00:13:56,366
and we'll complete it together.

262
00:13:56,366 --> 00:13:59,366
So I look forward to seeing
if we get the same results.

263
00:13:59,466 --> 00:14:02,733
Okay, so now let's carry on with backward
elimination.

264
00:14:02,866 --> 00:14:04,966
So remember we are at step three.

265
00:14:04,966 --> 00:14:06,633
And step three is actually to look

266
00:14:06,633 --> 00:14:10,000
for the independent variable
that has the highest p value.

267
00:14:10,600 --> 00:14:12,766
And we can find it very easily.

268
00:14:12,766 --> 00:14:18,100
It's actually this one
because indeed its p value is 0.999.

269
00:14:18,100 --> 00:14:19,900
That is 99%.

270
00:14:19,900 --> 00:14:22,400
So that's actually a very high p value.

271
00:14:22,400 --> 00:14:26,100
And we are way above the significance
level of 5%.

272
00:14:26,333 --> 00:14:29,633
So this dummy variable for State, State2,

273
00:14:29,633 --> 00:14:33,366
is definitely not
statistically significant at all.

274
00:14:33,600 --> 00:14:37,266
It has absolutely no effect
on the dependent variable profit.

275
00:14:37,766 --> 00:14:43,033
And by the way we also observe
that state three here has a 94% p value.

276
00:14:43,233 --> 00:14:46,233
And there is no way
that if we remove state two

277
00:14:46,333 --> 00:14:50,733
well this p value will decrease below
the 5% significance level.

278
00:14:51,066 --> 00:14:55,533
So we can actually remove this state three
independent variable as well,

279
00:14:55,933 --> 00:15:00,900
because clearly the state has no effect or
impact on the dependent variable profit.

280
00:15:01,233 --> 00:15:03,900
So we will actually make some kind
of a shortcut here.

281
00:15:03,900 --> 00:15:07,966
And instead of removing the independent
variable that has the highest p value

282
00:15:08,033 --> 00:15:13,000
that is state two, we will actually remove
both these dummy variables for state

283
00:15:13,466 --> 00:15:17,100
because definitely the state
is not statistically significant.

284
00:15:17,400 --> 00:15:18,766
So let's do this.

285
00:15:18,766 --> 00:15:21,766
Let's remove
the state variable from our equation.

286
00:15:21,766 --> 00:15:23,533
So I'm going to put that down

287
00:15:25,166 --> 00:15:25,500
okay.

288
00:15:25,500 --> 00:15:27,000
So as I told you it's very simple.

289
00:15:27,000 --> 00:15:31,733
We're just going to copy this
and paste it here.

290
00:15:32,300 --> 00:15:34,733
And then so what do we have to do. Now.

291
00:15:34,733 --> 00:15:39,200
We just need to remove the state
independent variable from our equation.

292
00:15:39,200 --> 00:15:41,366
Here.

293
00:15:41,366 --> 00:15:44,800
And by doing that we complete step four.

294
00:15:44,866 --> 00:15:49,400
Because if we go back to our slide, step
four is to actually remove the predictor.

295
00:15:49,800 --> 00:15:50,333
Great.

296
00:15:50,333 --> 00:15:53,800
And now we can move on to step five
which is to fit the multiple linear

297
00:15:53,800 --> 00:15:58,733
regression model without the independent
variable state that we just removed okay.

298
00:15:58,733 --> 00:16:00,033
So we removed state.

299
00:16:00,033 --> 00:16:02,533
So step four completed.

300
00:16:02,533 --> 00:16:05,900
And now step
five is to actually fit the model

301
00:16:05,900 --> 00:16:09,133
without this independent variable state
that we just removed.

302
00:16:09,233 --> 00:16:10,200
So let's do this.

303
00:16:10,200 --> 00:16:13,433
We just need to select this and press Command
or Control

304
00:16:13,433 --> 00:16:14,500
plus Enter to execute.

305
00:16:15,466 --> 00:16:16,500
And here it is.

306
00:16:16,500 --> 00:16:20,700
Our new regressor is ready
without the state independent variable.
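As an alternative to copying and pasting, the same refit can be sketched with update(), which removes a term from an existing formula. The data below is synthetic and the name regressor_no_state is an assumption; the tutorial itself simply retypes the formula without State.

```r
# Synthetic stand-in for the startups data (values are made up).
set.seed(1)
n <- 30
dataset <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(rep(c(1, 2, 3), length.out = n))
)
dataset$Profit <- 50000 + 0.8 * dataset$R.D.Spend + rnorm(n, sd = 9000)

# Full model from step 2.
regressor <- lm(Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
                data = dataset)

# Steps 4 and 5: drop State and refit. ". ~ . - State" means
# "same response, same predictors, minus State".
regressor_no_state <- update(regressor, . ~ . - State)
print(summary(regressor_no_state))
```

Typing the formula out again, as done in the tutorial, gives the same fitted model.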

307
00:16:20,900 --> 00:16:23,700
So now we have a team of three
independent variables.

308
00:16:23,700 --> 00:16:27,400
Wait and see which independent variable
is going to be kicked out of the team.

309
00:16:27,900 --> 00:16:32,200
And speaking of that, this is where I'm
going to leave you alone for the homework.

310
00:16:32,466 --> 00:16:35,566
But don't worry, you will have
the solution in the next tutorial.

311
00:16:35,733 --> 00:16:38,666
But really try to implement this yourself.

312
00:16:38,666 --> 00:16:41,566
Complete backward
elimination up to the end,

313
00:16:41,566 --> 00:16:44,133
and it will be fun
to see if we get the same results.

314
00:16:44,133 --> 00:16:45,633
And there is actually kind of a decision

315
00:16:45,633 --> 00:16:47,933
to make at the end of backward
elimination.

316
00:16:47,933 --> 00:16:51,633
So I'm curious to see how you make that
decision, make that call,

317
00:16:51,666 --> 00:16:54,966
because both solutions are actually great

318
00:16:55,333 --> 00:16:58,300
and we will talk about that
in the solution.

319
00:16:58,300 --> 00:17:01,466
So good luck for the homework.
You're going to see,

320
00:17:01,466 --> 00:17:02,100
it's going to be fine.

321
00:17:02,100 --> 00:17:06,866
So basically what you only have to do is
to follow this backward elimination slide.

322
00:17:07,200 --> 00:17:10,933
And so together in this tutorial
we went up to this step five here.

323
00:17:11,233 --> 00:17:11,700
And now.

324
00:17:11,700 --> 00:17:15,833
As you can see you have to go back
to step three and redo the steps

325
00:17:15,833 --> 00:17:18,833
three, four, five exactly like we just did.

326
00:17:18,833 --> 00:17:24,266
Until you find that the highest p value
is not higher than the significance level,

327
00:17:24,533 --> 00:17:27,533
and when it's the case,
your model will be ready.

328
00:17:28,233 --> 00:17:31,000
Okay, so I look forward to seeing you
in the next tutorial.

329
00:17:31,000 --> 00:17:34,833
I look forward to comparing with you
your results to mine,

330
00:17:35,266 --> 00:17:37,466
and I'm sure everything will be okay.

331
00:17:37,466 --> 00:17:39,566
Backward elimination is very practical

332
00:17:39,566 --> 00:17:43,000
and it will actually be fun
and easy to complete this.

333
00:17:43,633 --> 00:17:46,866
So thank you for watching this tutorial
and I look forward to seeing you

334
00:17:46,866 --> 00:17:49,000
in the next one for the solution.

335
00:17:49,000 --> 00:17:50,733
Until then, enjoy machine learning.