In this lesson we're going to talk about simplifying our model, and that's because, all else equal, simpler models are preferable to complex ones. Remember the Zen of Python? "Simple is better than complex" and "Complex is better than complicated." This doesn't just apply to programming; the same can be said about our regression model. By the way, you can always bring up this little gem of programming philosophy by typing "import this" in a Jupyter notebook.

So the question is: how can we simplify our model? One of the easiest ways is to remove some of the explanatory variables. But can we just drop some features? Is that a wise thing to do? And if so, which features should we drop? What about the features that were not highly correlated with property prices? In the correlation matrix, we saw that distance from employment centers (DIS) only had a 0.25 correlation with our target. DIS had a low correlation with price, but it also had a high correlation with our industry factor, at -0.71. At the time we were wondering how much value that distance-from-employment-centers feature really added. But now we know.
Scrolling down, we've got the p-values, which test for the significance of our different factors, and we can see that distance is actually very statistically significant. So we should probably keep DIS around. On the other hand, looking back up at our industry factor, this has a p-value of around 0.44, meaning it is not statistically significant. The threshold for p-values, if you recall, was 0.05. So now the question is: should we try removing the industry factor from the model?

The thing is, it is really, really tempting to remove insignificant predictors. But even dropping statistically insignificant features is not something people do lightly, because even a feature with a high p-value can add value to the model as a whole by providing some kind of information that the other features do not provide. Deciding what to keep and what to throw away from a machine learning model is a bit of an art, and it gets into the whole topic of feature selection, which is a very big topic indeed and one that we will continue to tackle throughout this course. The goal of this lesson is to introduce you to feature selection in the context of a regression model, and we will be looking at a metric that you can use to help you make your decisions. That metric is called the Bayesian Information Criterion, or BIC.
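For reference, the BIC of a fitted model is BIC = k·ln(n) − 2·ln(L̂), where k is the number of estimated parameters, n is the number of observations, and L̂ is the maximised likelihood: the second term rewards fit, the first penalises complexity. Here is a minimal sketch computing it by hand for an ordinary least-squares fit under Gaussian errors. The toy data and every variable name below are illustrative, not from the lesson's notebook:

```python
import numpy as np

def bic(n, k, log_likelihood):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L_hat)."""
    return k * np.log(n) - 2 * log_likelihood

# Toy data: y depends linearly on x plus Gaussian noise
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Fit y = a*x + b by least squares
X = np.column_stack([x, np.ones(n)])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ theta
sigma2 = residuals @ residuals / n  # MLE of the error variance

# Gaussian log-likelihood evaluated at the MLE
llf = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

k = 3  # two coefficients plus the error variance
print(bic(n, k, llf))
```

Because the penalty term grows with k, adding a parameter only lowers the BIC if it improves the log-likelihood by more than ln(n)/2.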
The Bayesian information criterion is a way you can measure complexity. It's basically a number that allows you to compare two different models. So what you end up doing is you run a regression with model number one, and then you run a regression with model number two. Model number one might have a BIC value of 148 and model number two might have a BIC value of 154. The actual number doesn't by itself mean very much; what matters is which one is lower, because all else equal a lower BIC number is better. So this measure can help you pick between two or more models.

So what models will we compare? Well, for starters, let's compare the model that includes the industry feature and the model that excludes it. Let's commemorate this with another Markdown cell in the Jupyter notebook. I'm going to change my cell here to Markdown and put in a section heading: "Model Simplification & the Bayesian Information Criterion". To calculate the Bayesian information criterion, we're once again going to use the statsmodels module's regression capabilities instead of scikit-learn, so let's copy this cell and paste it below. I'm going to delete these comments and add a new comment up here that reads "Original model with log prices and all features", and the dataframe that I've got here I'm going to store inside a variable.
So I'm going to say "original_coef = pd.DataFrame()" and so on. Now what I want to do is add some additional print statements, and in these print statements we'll output both the Bayesian information criterion value and the r-squared for the regression. Previously we've used scikit-learn's score method to print out the r-squared, but things work a little differently with the statsmodels module, and I think this is a good time to practice making sense of the official documentation.

So, as a challenge, can you look up the statsmodels docs for the regression results and figure out how to print out both the Bayesian information criterion value for this regression as well as the r-squared? I'll give you a few seconds to pause the video and give this a shot. You ready? Here's the solution. Being able to read and interpret the official documentation for a lot of these Python modules is one of the key skills in becoming a better programmer. If I click on my results object and press Shift+Tab on my keyboard to bring up the quick documentation, I don't actually get anything useful. I just see that to get further information I need to look at the regression results documentation. The same thing is true if I click here and find out that I'm just dealing with a dataframe, or when I click here.
So in all these cases, the quick documentation isn't actually helping me all that much; I'm just not having any luck getting the relevant information. So what I'm going to have to do is Google the documentation myself. The best keywords to enter into that white text box are "statsmodels regression results", and that should pretty much take you to one of the statsmodels pages. Out of these three results that I've got here, the one I'm looking for is the documentation page for the RegressionResults object. So it's the third one down here. This is the RegressionResults object from the linear model, and this is the documentation page that you want to be looking at. The other reason why the RegressionResults page that I've clicked on is the relevant one is that the object we've actually got, the RegressionResultsWrapper, inherits most of its methods and attributes from RegressionResults, meaning the capabilities of a RegressionResultsWrapper object and a RegressionResults object are pretty much the same. They have a lot in common, since a lot of the methods and attributes are inherited from this particular object, so they're closely linked. That is why I'm looking on this page.
Now, this is an incredibly long page if you look at it. It's very, very comprehensive, but the interesting thing is that we can find both the params attribute and the pvalues attribute on this page. So if I search for params on this page, I can find params listed as one of the attributes, and I can also find pvalues listed as one of the attributes. And, as you probably already spotted, at the top of the page is the bic attribute, the Bayesian information criterion. So bic is the name of the attribute, and the way we access it is simply by writing "results.bic". If I hit Shift+Enter, I can see what the value actually is: -139.85, so about -140.

And what about the r-squared? Going back to the documentation and scrolling down, r-squared unsurprisingly has the attribute name rsquared, all lowercase and in one word. So "results.rsquared" will bring up the r-squared for this regression, which is 0.793. Let's have both of these lines of code in a print statement. So I'm going to have "print('BIC is', results.bic)" and then "print('r-squared is', results.rsquared)". There we go. Okay, so now we have both the r-squared and our BIC printed out.
The comforting thing to see is that statsmodels and scikit-learn give exactly the same r-squared, so we're doing things right. Now, in this case the Bayesian information criterion is actually a negative number, and that's absolutely fine. What matters is how this number stacks up against our next model. So I'm going to copy this cell, paste it, and then modify my comment. I'm going to write "Reduced model #1 excluding INDUS", and then what I'll say is "X_incl_const = X_incl_const.drop(['INDUS'], axis=1)". In this line of code I'm redefining the dataframe of features by overwriting what's stored inside this variable: I'm dropping the INDUS column from the dataframe and storing the result as the new feature dataframe. So on this line, when it comes to training our model, we are excluding the INDUS feature. Next, I'm going to change the name of this variable to "coef_minus_indus", so that we're not overwriting the coefficient dataframe from the cell above, and I'm going to delete this comment here. Now let me hit Shift+Enter and refresh this cell. This result is already quite interesting.
What we can see is that our Bayesian information criterion has gotten more negative: we've now got the value -145.2, an even lower number than before. So we have an improvement in terms of reducing complexity, but at the same time it's also really nice to see that the r-squared, at 0.79, pretty much stays where it is. Even though we have removed one feature from our dataset, it hasn't really impacted our fit in a material way. This is actually very encouraging.

Let's go back up to our p-values and experiment with removing something else; let's experiment with removing AGE. Coming back down, I'm going to copy this cell, paste it, change my comment to "Reduced model #2 excluding INDUS and AGE", and then in this line add 'AGE', in single quotes, between the square brackets. I'm also going to rename our dataframe of coefficients to, say, "reduced_coef", and now I'm going to hit Shift+Enter. What we actually see is a further improvement based on the Bayesian information criterion. We get an even lower BIC number at -149.5, but we see no material change in the r-squared. So this makes me think that removing both INDUS and AGE is actually a beneficial thing.
We can probably safely drop these two features, simplifying our model without incurring too much of a cost in terms of lost information and a worse fit. Now, even though I just gave you two examples where removing a feature improved the Bayesian information criterion and left the r-squared pretty much unchanged, this isn't always the case. If I change AGE to one of the other features, say maybe zone, ZN, and press Shift+Enter, what we see is not really all that clear cut. In this case, we have a higher BIC number and a lower r-squared than before, so this is probably not the direction we want to go in. Same thing if I change this to our TAX feature: again, we're making our model worse. And the same thing if I change this to the LSTAT feature. Removing LSTAT actually makes our model much, much worse. You can see how much the Bayesian information criterion jumped and how much lower our r-squared is in this case. So LSTAT is actually very important to keep in the model. I'm going to change this back to AGE and press Shift+Enter so that we're back where we started. Okay, so where are we now? We've made two small tweaks to our model.
We've removed two of the features which were not statistically significant, and we've looked at the Bayesian information criterion and the r-squared to provide additional justification for leaving them out and simplifying our model. By doing so we've managed to improve our BIC number from around -140 to -150. So we get about a nine- to ten-point improvement in the BIC number, but we don't incur a material penalty on the r-squared; we're still at 0.79. Cool.

So that about wraps up our introduction to thinking about feature selection, and one thing we can do now is link this lesson to our previous one. We can link this to the discussion on multicollinearity and looking for stability in the theta estimates for our features, because we've made quite a few tweaks to our model, and we said that one of the symptoms of multicollinearity is unstable coefficients. Having run three different versions of our regression and having stored our coefficients in some variables, we can now look at them side by side and double-check whether there are any strange developments. Now, I'd be surprised if we saw any, because we have no indication of multicollinearity so far. But take a look at this. I'm going to create a variable called "frames" and make it equal to a list of our dataframes.
So we had the original_coef dataframe, we had the coef_minus_indus dataframe, and then finally we had the reduced_coef dataframe of coefficients. To put them all side by side, I'm going to use pandas' concat function: "pd.concat()", and then in the parentheses I provide my list of dataframes and an axis. I want this to be concatenated along the columns, so side by side as opposed to top to bottom; that means axis is going to be equal to 1 instead of 0. Let's see what we get. Fantastic. Now, what you can also see in this table is how Python treats missing values in a dataframe. You see these NaNs? NaN stands for "Not a Number", meaning there is a missing value there. So this column here is our reduced model without the AGE and without the industry feature, and we've got NaNs in place of those rows.
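The side-by-side comparison works like this. A minimal sketch with made-up coefficient tables (pandas Series here for brevity; the index labels and numbers are invented, but the pd.concat call mirrors the one in the lesson):

```python
import numpy as np
import pandas as pd

# Made-up coefficient tables from three versions of the model
original_coef = pd.Series({"const": 4.06, "CRIM": -0.01,
                           "INDUS": 0.002, "AGE": 0.0003}, name="original")
coef_minus_indus = pd.Series({"const": 4.06, "CRIM": -0.01,
                              "AGE": 0.0003}, name="minus INDUS")
reduced_coef = pd.Series({"const": 4.04, "CRIM": -0.01},
                         name="minus INDUS & AGE")

frames = [original_coef, coef_minus_indus, reduced_coef]

# axis=1 lines the frames up column by column; rows missing from a
# model show up as NaN instead of raising an error
side_by_side = pd.concat(frames, axis=1)
print(side_by_side)
```

The outer join on the row index is what produces the NaNs: any coefficient that a reduced model never estimated simply has no value in that model's column.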
Looking at this table is actually very, very encouraging, because what I'm seeing is that, despite tweaking the model, all the coefficients across all three versions, like Charles River and Crime, are remarkably consistent. The numbers change somewhat, but they don't switch signs and they don't change drastically, so these are all very stable coefficient estimates. The same actually holds true if you were to remove TAX, which we suspected was a potential problem source. Even removing TAX and rerunning this, you'd see that the theta estimates are nice and stable between the three models.

And that brings us to the end of this lesson. In the next one, we're going to take our evaluation of our regression even further. We're going to look at how far off our model's predictions were from the true values; we're going to be looking at and analyzing our regression residuals. Now, as an aside, while putting this lesson together for you, I found myself googling the word BIC and trying to read up on the information criterion, and for a split second I was quite confused why all I got on the front page was information about pens and the website of the Bournemouth International Centre. The other time that happened to me was when I read up on nested tables and was confronted with this.
So yeah, if you have any stories like this to share, please put them in the comments section for this video. I'd love to hear them. See you in the next lesson.