0
1
00:00:00,390 --> 00:00:06,370
Remember how we noted a couple of things down on our to-do list to check for later? We said that we'd check
1

2
00:00:06,370 --> 00:00:08,550
for a potential problem in our regression.
2

3
00:00:08,800 --> 00:00:13,240
And the problem that we were concerned about was called multicollinearity.
3

4
00:00:13,240 --> 00:00:16,850
The reason was that we had some high correlations between our features,
4

5
00:00:16,990 --> 00:00:22,430
so multicollinearity was something that we wanted to test formally.
5

6
00:00:22,510 --> 00:00:28,990
Now, put simply, multicollinearity is when two or more predictor variables in a regression are highly
6

7
00:00:28,990 --> 00:00:30,970
related to one another.
7

8
00:00:30,970 --> 00:00:38,320
In that case they do not provide unique or independent information to the model and the consequences
8

9
00:00:38,440 --> 00:00:41,400
of multicollinearity are as follows.
9

10
00:00:41,440 --> 00:00:48,880
There is a loss of reliability in the estimates of the effects for the individual features,
10

11
00:00:48,880 --> 00:00:51,300
in particular the features that are affected.
11

12
00:00:51,610 --> 00:00:57,730
In other words, there is high variability in the coefficient estimates for small changes in the model;
12

13
00:00:58,270 --> 00:01:02,940
for example, adding or removing a feature can have dramatic effects.
13

14
00:01:02,950 --> 00:01:08,680
The estimates of our theta values in our model become unstable and can even switch signs from positive
14

15
00:01:08,680 --> 00:01:11,020
to negative and vice versa.
15

16
00:01:11,020 --> 00:01:17,890
The third effect of the problem is that the findings are strange or misleading or don't make sense.
16

17
00:01:17,890 --> 00:01:21,960
These are the main symptoms of the multicollinearity problem.
17

18
00:01:22,120 --> 00:01:28,990
Now, given these symptoms, we can say that on the third one our regression passed,
18

19
00:01:29,470 --> 00:01:35,200
because we did a sense check on the signs of our coefficients and we found that the coefficients definitely
19

20
00:01:35,200 --> 00:01:36,400
had the right signs.
20

21
00:01:36,400 --> 00:01:38,820
They made sense from a logical perspective,
21

22
00:01:39,000 --> 00:01:44,920
like more rooms increased the property prices and more pollution decreased the property prices for example.
22

23
00:01:45,190 --> 00:01:49,840
But in this lesson I'm going to show you how you can do a formal check, how you can look at a metric to
23

24
00:01:49,840 --> 00:01:54,040
tell you whether you have this problem or not.
24

25
00:01:54,040 --> 00:02:01,750
The formal check that we will do is by looking at a statistic called the variance inflation factor or
25

26
00:02:01,810 --> 00:02:09,250
VIF, and the second thing that we will do is we will monitor if removing a feature causes a dramatic
26

27
00:02:09,250 --> 00:02:14,830
change in our theta parameters, but that will wait until the next lesson.
27

28
00:02:14,830 --> 00:02:20,260
Now let's go back to Jupyter notebook and add a new section heading. This section heading is gonna be
28

29
00:02:20,260 --> 00:02:32,140
called "Testing for Multicollinearity", so "Testing for Multicollinearity", and to test for multicollinearity,
29

30
00:02:32,140 --> 00:02:36,920
we said we're gonna be using a statistic called the variance inflation factor.
30

31
00:02:37,390 --> 00:02:44,710
So let me quickly explain how the variance inflation factor does its job, because this statistic is a
31

32
00:02:44,770 --> 00:02:51,940
measure of collinearity among the features within a multiple regression and what it will do is it
32

33
00:02:51,940 --> 00:03:01,000
will spit out a number that quantifies the severity of multicollinearity, so in effect we get a number
33

34
00:03:01,060 --> 00:03:05,830
similar to how we get a number with the p-values and we can look at this number and compare it to a
34

35
00:03:05,830 --> 00:03:09,760
threshold, is it above the threshold or is it below the threshold.
35

36
00:03:09,760 --> 00:03:11,370
That's how it's gonna work.
36

37
00:03:11,620 --> 00:03:18,430
But before we write the Python code to actually calculate the variance inflation factors, let's add some
37

38
00:03:18,430 --> 00:03:25,720
formulas in LaTeX notation to the section heading to kind of better understand what's going on.
38

39
00:03:25,720 --> 00:03:31,400
I'm going to show you how the variance inflation factor is calculated for one of the features
39

40
00:03:31,510 --> 00:03:37,700
as an example. The first thing that will happen is that a regression will be run on that feature.
40

41
00:03:37,750 --> 00:03:40,170
Let's say we're looking at our TAX feature.
41

42
00:03:40,270 --> 00:03:48,790
So what will happen is that we will look to explain the TAX values as a linear combination of all the
42

43
00:03:48,790 --> 00:03:51,230
other features in the data set.
43

44
00:03:51,430 --> 00:03:57,610
So you're gonna get a regression that reads something like "TAX = " and then some intercept, say "\alpha 
44

45
00:03:58,000 --> 00:04:03,100
_0 ", plus some parameter times the number of rooms,
45

46
00:04:03,130 --> 00:04:14,830
so " \alpha _1 RM", plus some other parameter "\alpha _2" times the next feature,
46

47
00:04:15,050 --> 00:04:23,110
say nitric oxides (NOX), plus and so on till you get to LSTAT, which will be "\alpha _
47

48
00:04:23,200 --> 00:04:25,860
{12}
48

49
00:04:26,610 --> 00:04:34,090
LSTAT". Since we have 13 features in total and we're explaining TAX in terms of the others,
49

50
00:04:34,270 --> 00:04:39,900
we're left with 12 different parameters in front of the other features.
50

51
00:04:39,940 --> 00:04:45,970
I've also used alphas here in my LaTeX notation so we don't get confused with the theta parameters that
51

52
00:04:45,970 --> 00:04:46,850
we used earlier.
52

53
00:04:47,560 --> 00:04:52,620
Now this is only step one of calculating the variance inflation factor.
53

54
00:04:52,690 --> 00:05:01,300
Step 2 looks like this. In step 2 we get the variance inflation factor for our TAX feature.
54

55
00:05:01,480 --> 00:05:06,150
So "VIF _{TAX} =
55

56
00:05:06,150 --> 00:05:17,670
\frac{1}{(1-R _
56

57
00:05:18,150 --> 00:05:28,470
{TAX} ^ 2)}". Now let me quickly
57

58
00:05:28,470 --> 00:05:36,270
delete this backslash here, we don't need that, and hit Shift+Enter to make this a little more clear. This
58

59
00:05:36,270 --> 00:05:45,720
is step 2. In step 1 a regression of TAX on all the other features is run, and in step 2 the
59

60
00:05:45,840 --> 00:05:54,180
r-squared of that regression is used to calculate the variance inflation factor, so 1 divided by 
60

61
00:05:54,180 --> 00:06:02,160
(1 - r^2) of this regression up top is the variance inflation factor for the TAX feature, so
61

62
00:06:02,160 --> 00:06:08,430
that r-squared here is the r-squared from trying to explain the tax feature in terms of all the other
62

63
00:06:08,430 --> 00:06:15,660
features in the dataset. So the point I'm trying to make is that we can calculate a variance inflation
63

64
00:06:15,660 --> 00:06:22,530
factor for all the different features, right, just by swapping out this TAX for RM we get the variance
64

65
00:06:22,530 --> 00:06:29,790
inflation factor for RM, so in our Python code what we're gonna do is we're going to calculate 13 different
65

66
00:06:30,060 --> 00:06:38,550
variance inflation factors. Now the statsmodels module actually makes this very, very easy. We're not
66

67
00:06:38,550 --> 00:06:43,950
gonna have to run this two-step calculation manually, we simply call a function, but I wanted to
67

68
00:06:43,950 --> 00:06:49,100
show you what's going on behind the scenes in the LaTeX notation nonetheless.
68
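To make the two steps concrete, here is a minimal sketch of the calculation done by hand with NumPy. The helper name vif_by_hand and the random data are assumptions made up for illustration; they are not part of the notebook, and statsmodels may differ in its details.

```python
import numpy as np

def vif_by_hand(X, idx):
    """VIF for column `idx` of X (X is assumed to already contain a constant column)."""
    target = X[:, idx]
    others = np.delete(X, idx, axis=1)
    # Step 1: regress the chosen feature on all the other columns (least squares)
    alphas, *_ = np.linalg.lstsq(others, target, rcond=None)
    residuals = target - others @ alphas
    r_squared = 1 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
    # Step 2: plug the r-squared of that regression into 1 / (1 - r^2)
    return 1 / (1 - r_squared)

# Made-up data: b is almost a copy of a, c is unrelated to both
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)
c = rng.normal(size=n)
X = np.column_stack([np.ones(n), a, b, c])

vif_a = vif_by_hand(X, 1)  # large, because b explains a almost perfectly
vif_c = vif_by_hand(X, 3)  # close to 1, because nothing else explains c
print(vif_a, vif_c)
```

Swapping the index is all it takes to get the factor for a different feature, which is the same idea as swapping TAX for RM in the formula.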

69
00:06:49,410 --> 00:06:55,980
So let's scroll to the top of our notebook and actually import this functionality. I'm going to come here
69

70
00:06:56,790 --> 00:07:13,290
and I'm going to write "from statsmodels.stats.outliers_influence import variance_
70

71
00:07:13,830 --> 00:07:18,640
inflation_factor".
71

72
00:07:18,910 --> 00:07:27,690
So, I'm importing this specific function from "statsmodels.stats.outliers_influence".
72

73
00:07:27,690 --> 00:07:30,120
Now let me hit Shift+Enter on the cell.
73

74
00:07:30,120 --> 00:07:36,270
Interestingly, the deprecation warning disappeared when I did that. And I'm going to scroll back down and
74

75
00:07:36,270 --> 00:07:39,290
now I'm actually going to call this function.
75

76
00:07:39,450 --> 00:07:41,610
Check it out.
76

77
00:07:41,610 --> 00:07:47,390
"variance_inflation_factor()"
77

78
00:07:47,450 --> 00:07:54,090
and before I put any arguments between these two parentheses I'm going to hit Shift+Tab on my keyboard to
78

79
00:07:54,090 --> 00:07:57,840
bring up the quick documentation on this function.
79

80
00:07:57,840 --> 00:08:07,650
What we see here is that we need two inputs, exog and exog_idx, hit this little plus sign
80

81
00:08:07,650 --> 00:08:08,250
here,
81

82
00:08:08,370 --> 00:08:17,070
scroll down a little bit and we can see that the first argument is an n-dimensional array
82

83
00:08:17,760 --> 00:08:21,170
and the second argument is simply an integer.
83

84
00:08:21,330 --> 00:08:27,690
It's a whole number, and that number is basically the column of this n-dimensional array that we want
84

85
00:08:27,690 --> 00:08:29,580
to look at.
85

86
00:08:29,670 --> 00:08:30,000
All right.
86

87
00:08:30,300 --> 00:08:38,400
So we've got a data frame, X_incl_const, and say we want to calculate the variance
87

88
00:08:38,400 --> 00:08:42,980
inflation factor for the first column in this data frame.
88

89
00:08:43,080 --> 00:08:44,780
How would we do it?
89

90
00:08:44,840 --> 00:08:46,880
Well I'll come down here
90

91
00:08:47,210 --> 00:08:56,810
and as an argument pass in the first one which was "exog = X_incl
91

92
00:08:56,820 --> 00:09:07,690
_const", comma, and then "exog_idx = 1".
92

93
00:09:07,850 --> 00:09:10,130
Why did I pick index 1 and not 0?
93

94
00:09:10,730 --> 00:09:15,470
Well at index 0 we've got our intercept, our constant.
94

95
00:09:15,470 --> 00:09:19,580
I'm interested in our first feature which is Crime.
95

96
00:09:20,120 --> 00:09:27,230
Okay, so I'm going to hit Shift+Enter on this and you're going to see that it doesn't work.
96

97
00:09:27,230 --> 00:09:28,210
Take a look.
97

98
00:09:28,280 --> 00:09:37,110
We get a type error. Scrolling down we see that the error description is "Unhashable type: 'slice'".
98

99
00:09:37,190 --> 00:09:45,160
Why did we get this? The error message is actually fairly cryptic - unhashable type, but we do discover
99

100
00:09:45,160 --> 00:09:50,150
that it actually has something to do with data types again. Now data types are actually usually quite
100

101
00:09:50,150 --> 00:09:51,420
hidden in Python,
101

102
00:09:51,590 --> 00:09:57,280
but occasionally, if you ignore them too much they come and bite you when you're not careful.
102

103
00:09:57,290 --> 00:10:04,490
Remember how when we hit Shift+Tab here and we looked at what kind of argument this exog argument
103

104
00:10:04,490 --> 00:10:05,420
should be?
104

105
00:10:05,480 --> 00:10:09,090
We found out that it was an n-dimensional array.
105

106
00:10:09,140 --> 00:10:14,810
Well let's look at the type of X_incl_const.
106

107
00:10:14,840 --> 00:10:24,110
So writing "type(X_incl_const)" and hitting Shift+Enter will tell
107

108
00:10:24,110 --> 00:10:28,450
us that this is not an n-dimensional array.
108

109
00:10:28,460 --> 00:10:34,100
This is in fact a dataframe, a pandas DataFrame.
109

110
00:10:34,100 --> 00:10:42,830
So how do we get an n-dimensional array, an ndarray, from a dataframe? Pandas fortunately has a solution
110

111
00:10:42,830 --> 00:10:43,880
for us.
111

112
00:10:43,880 --> 00:10:51,840
All we have to do is call the values attribute, so all we have to do is put a dot after it and use the
112

113
00:10:51,840 --> 00:10:59,900
values attribute. This retrieves an n-dimensional array from our dataframe. And just to prove that it
113

114
00:10:59,900 --> 00:11:00,680
works,
114

115
00:11:00,690 --> 00:11:08,810
I'm going to hit Shift+Enter and print out the value - 1.7. 1.7 is the variance inflation
115

116
00:11:08,810 --> 00:11:12,580
factor for our crime feature.
116
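As a small sketch of the fix, assuming a made-up dataframe called demo in place of X_incl_const (the column names and random values are hypothetical), the call with the values attribute looks roughly like this:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical stand-in for X_incl_const: a constant plus two unrelated features
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "const": 1.0,
    "feat_a": rng.normal(size=100),
    "feat_b": rng.normal(size=100),
})

print(type(demo))         # a pandas DataFrame
exog_array = demo.values  # the underlying n-dimensional array
print(type(exog_array))   # a numpy ndarray

# Index 0 is the constant, so index 1 is the first real feature
vif_feat_a = variance_inflation_factor(exog=exog_array, exog_idx=1)
print(vif_feat_a)
```

Because the two made-up features are unrelated, the result should land close to 1.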

117
00:11:12,710 --> 00:11:20,780
Nice, but how do we calculate the variance inflation factors for all the other features? To do that,
117

118
00:11:20,780 --> 00:11:26,340
we're gonna use a loop. We will need to loop across all the columns in the dataframe.
118

119
00:11:26,880 --> 00:11:31,080
But suppose you don't know how many columns there are in your dataframe.
119

120
00:11:31,080 --> 00:11:40,350
So as a challenge can you output the number of columns in X_incl_const? Bonus
120

121
00:11:40,350 --> 00:11:42,540
points if you can do it in two ways.
121

122
00:11:42,630 --> 00:11:46,690
I'll give you a few seconds to pause the video. Ready?
122

123
00:11:46,710 --> 00:11:48,190
Here's the solution.
123

124
00:11:48,240 --> 00:11:54,240
The first way you can find out how many columns there are in the dataframe is using the length function
124

125
00:11:54,820 --> 00:11:59,760
and you would use the length function on the index of columns.
125

126
00:11:59,760 --> 00:12:08,490
So as an argument you would pass in the dataframe and then put a dot after it and write columns, and
126

127
00:12:08,490 --> 00:12:12,490
this will give you the number 14.
127

128
00:12:12,540 --> 00:12:16,870
The second way you can do this is by using the shape attribute.
128

129
00:12:17,000 --> 00:12:26,640
So writing the name of the dataframe and using ".shape" will give you the number of rows and the number
129

130
00:12:26,640 --> 00:12:27,360
of columns.
130

131
00:12:27,360 --> 00:12:34,030
So there's 404 rows and 14 columns. To access the number of columns directly use the
131

132
00:12:34,030 --> 00:12:42,190
square brackets and then a one to return the second element from the shape attribute's tuple.
132

133
00:12:42,190 --> 00:12:43,300
There we go.
133

134
00:12:43,300 --> 00:12:50,560
So both of these lines of Python code work as a way of retrieving the number of columns in a data frame.
134
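Both approaches can be sketched on a tiny dataframe. The name demo and its three columns of values are invented for illustration, so the count here is 3 rather than 14.

```python
import pandas as pd

# Hypothetical three-column dataframe (values are illustrative only)
demo = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],
    "RM": [6.5, 5.9, 7.1, 6.2],
    "NOX": [0.54, 0.47, 0.61, 0.52],
})

# Way 1: length of the column index
n_cols_len = len(demo.columns)

# Way 2: second element of the (rows, columns) shape tuple
n_rows, n_cols = demo.shape
n_cols_shape = demo.shape[1]

print(n_cols_len, n_cols_shape, n_rows)
```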

135
00:12:50,630 --> 00:12:50,970
Now,
135

136
00:12:51,040 --> 00:12:52,560
we said we have to write a loop,
136

137
00:12:52,570 --> 00:12:53,390
right?
137

138
00:12:53,470 --> 00:12:59,860
So, as a second challenge, can you write a for loop that prints out all the variance inflation factors
138

139
00:13:00,280 --> 00:13:02,280
for all the features?
139

140
00:13:02,380 --> 00:13:05,820
I'll give you a few seconds to pause the video and try this out.
140

141
00:13:07,620 --> 00:13:08,780
Ready?
141

142
00:13:08,790 --> 00:13:10,090
Here's the solution.
142

143
00:13:10,230 --> 00:13:18,930
You write "for i in range()", and the range would be the number of columns in the dataframe, so I could have
143

144
00:13:19,560 --> 00:13:29,940
"X_incl_const.shape[1]" and then colon, and then in the body of
144

145
00:13:29,940 --> 00:13:40,140
my for loop, I have a print statement with a function call to variance_inflation_factor, so "variance_inflation
145

146
00:13:40,230 --> 00:13:43,620
_factor()",
146

147
00:13:43,620 --> 00:13:53,460
and then my first argument is going to be, as before, "X_incl_const.values,
147

148
00:13:53,940 --> 00:13:57,140
exog_
148

149
00:13:57,180 --> 00:13:59,840
idx = ",
149

150
00:14:00,000 --> 00:14:07,200
And then what? Now let's pass in the loop variable, and that's i.
150

151
00:14:07,200 --> 00:14:08,370
There we go.
151

152
00:14:08,370 --> 00:14:08,870
That's it.
152

153
00:14:08,880 --> 00:14:10,230
That's the solution.
153

154
00:14:10,290 --> 00:14:19,440
So when the loop finishes we can print "All done!" and let's run it and see what we get. Tada!
154

155
00:14:19,540 --> 00:14:27,090
These are all the variance inflation factors calculated for every single feature in our dataframe.
155

156
00:14:27,100 --> 00:14:33,970
Now printing all this stuff out is very well and good, but what if we wanted to store it in a list?
156

157
00:14:33,970 --> 00:14:37,500
What if we wanted to store all these things in a variable?
157

158
00:14:37,630 --> 00:14:39,780
So I'm going to copy the code I have above,
158

159
00:14:39,970 --> 00:14:43,460
come down here, add a few more cells, paste it in, delete
159

160
00:14:43,480 --> 00:14:44,800
my challenge comment,
160

161
00:14:44,800 --> 00:14:54,160
delete this print statement and then I'm going to create an empty list here, "vif = []",
161

162
00:14:54,500 --> 00:14:56,200
gonna put a comment here,
162

163
00:14:56,290 --> 00:15:03,380
this is an empty list, and then inside my loop I'm going to use the append method,
163

164
00:15:03,670 --> 00:15:12,610
so "vif.append()" and what will it append? It will append the variance inflation factor that is calculated
164

165
00:15:13,000 --> 00:15:21,170
as part of the loop. And when we're all done, we can print our variance inflation factor list.
165

166
00:15:21,300 --> 00:15:23,400
Let's see what we get.
166

167
00:15:24,250 --> 00:15:25,290
So that worked really well.
167

168
00:15:25,780 --> 00:15:33,330
We get all the same results printed out as before except now we're storing them in a variable.
168

169
00:15:33,430 --> 00:15:40,630
But I tell you what, I'm going to show you a different kind of Python syntax to accomplish the same thing.
169

170
00:15:40,630 --> 00:15:43,750
So I'm going to copy this, paste it in here.
170

171
00:15:43,790 --> 00:15:52,420
What I'm going to do now is I'm going to run this loop inside these square brackets, and the way I would do
171

172
00:15:52,420 --> 00:15:57,430
this is I would move this part here which is the body of the loop,
172

173
00:15:57,430 --> 00:16:07,470
if you think about it, and then afterwards I will add the code for the loop, namely this part. So I can
173

174
00:16:07,470 --> 00:16:14,070
copy that, paste it in here and note the square bracket at the end,
174

175
00:16:14,070 --> 00:16:21,080
then I'm going to delete this part here and I will split this one line over two lines, just hitting Enter
175

176
00:16:21,200 --> 00:16:22,320
here.
176

177
00:16:22,440 --> 00:16:24,700
That way you can see a lot better what's going on.
177

178
00:16:25,370 --> 00:16:26,110
Mm hmm.
178

179
00:16:26,310 --> 00:16:29,430
Let me run this just to prove that it works.
179

180
00:16:29,430 --> 00:16:30,870
Here you go.
180

181
00:16:30,870 --> 00:16:31,980
So this is interesting, right?
181

182
00:16:32,010 --> 00:16:38,250
We're used to seeing loops like this, but you could also use this syntax here to run the loop inside
182

183
00:16:38,250 --> 00:16:42,750
the square brackets to populate your list.
183
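Here is a side-by-side sketch of the two styles, the loop with append and the loop inside the square brackets (a list comprehension). The dataframe demo and its columns are made up for illustration and stand in for X_incl_const.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical stand-in for X_incl_const
rng = np.random.default_rng(1)
n = 150
x1 = rng.normal(size=n)
demo = pd.DataFrame({
    "const": np.ones(n),
    "x1": x1,
    "x2": 0.5 * x1 + rng.normal(size=n),  # partly related to x1
    "x3": rng.normal(size=n),
})

# Style 1: empty list, regular for loop, append inside the loop body
vif_loop = []  # empty list
for i in range(demo.shape[1]):
    vif_loop.append(variance_inflation_factor(exog=demo.values, exog_idx=i))

# Style 2: the body of the loop comes first, the "for" part comes after it
vif_comp = [variance_inflation_factor(exog=demo.values, exog_idx=i)
            for i in range(demo.shape[1])]

print(vif_loop)
print(vif_comp)
```

Both lists hold one value per column, including the constant at index 0, which is usually not the entry you interpret.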

184
00:16:42,750 --> 00:16:51,110
Now, I don't really like this formatting here, so I'm going to write some code to add a dataframe at
184

185
00:16:51,110 --> 00:16:54,570
the end using the dictionary notation again.
185

186
00:16:55,010 --> 00:17:04,520
And this dictionary is gonna have "'coef_name':" and then the names are going to be
186

187
00:17:05,090 --> 00:17:13,100
from my training dataset "X_incl_const.columns".
187

188
00:17:13,130 --> 00:17:20,230
So this is gonna be my list of names, and I'm going to put comma, hit Enter to go down to a new line. For my
188

189
00:17:20,230 --> 00:17:28,270
second column in my dataframe, I'll have "vif" as the column name and then I'll put the variance inflation
189

190
00:17:28,270 --> 00:17:33,850
factor list that we've calculated using our loop right afterwards.
190

191
00:17:33,850 --> 00:17:37,940
Now let's see what we get. We get something like this.
191

192
00:17:38,170 --> 00:17:45,070
We get the feature names here and we get the variance inflation factor for each feature next to it.
192

193
00:17:45,080 --> 00:17:49,710
Now there's a lot of numbers after the decimal point with the variance inflation factor.
193

194
00:17:49,840 --> 00:17:58,970
So let's round it and then we use a rounding function from numpy, "np.round()", I'll have
194

195
00:17:58,970 --> 00:18:04,410
the variance inflation factor inside for the first argument and for the second argument I'm going to put
195

196
00:18:04,400 --> 00:18:07,840
2 as how many numbers I want after the decimal point.
196

197
00:18:07,850 --> 00:18:16,370
So I want two numbers after the decimal point. I'll put a closing parenthesis here and hit Shift+Enter to refresh
197

198
00:18:16,550 --> 00:18:17,490
my cell.
198

199
00:18:17,780 --> 00:18:25,440
And this is what we get. Now the variance inflation factors are a lot easier to read.
199
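A rough sketch of that summary dataframe follows. The feature names and the numbers in vif_values are placeholders invented for this example; they are not the values from the notebook.

```python
import numpy as np
import pandas as pd

# Placeholder names and VIF values, for illustration only
coef_names = ["feature_a", "feature_b", "feature_c"]
vif_values = [123.456789, 1.714525, 1.912532]

# Dictionary notation: key = column name, value = column contents
summary = pd.DataFrame({
    "coef_name": coef_names,
    "vif": np.round(vif_values, 2),  # keep two numbers after the decimal point
})
print(summary)
```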

200
00:18:25,490 --> 00:18:29,280
So how do we interpret this output?
200

201
00:18:29,870 --> 00:18:38,660
Well, similar to how we did things with the p-value, we compare these numbers to a threshold and for the
201

202
00:18:38,660 --> 00:18:47,030
variance inflation factors, a common rule of thumb is a threshold of around 10, meaning
202

203
00:18:47,210 --> 00:18:55,520
any feature that has a VIF over 10 would be considered problematic and would need closer inspection.
203

204
00:18:55,670 --> 00:19:03,500
But looking at our list here, we can see that all the numbers are below 10 and this suggests that we
204

205
00:19:03,500 --> 00:19:07,520
don't have to worry about multicollinearity.
205

206
00:19:07,660 --> 00:19:12,450
Now some academics are a little bit more conservative regarding their threshold and they think that
206

207
00:19:12,450 --> 00:19:15,270
a cutoff of 5 is better.
207
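The comparison against both cutoffs can be sketched like this. The table vif_table and its numbers are invented so that each threshold catches something; in the notebook everything stayed below 10.

```python
import pandas as pd

# Invented VIF values, chosen only to show how the two cutoffs differ
vif_table = pd.DataFrame({
    "coef_name": ["feature_a", "feature_b", "feature_c"],
    "vif": [1.71, 7.45, 12.30],
})

# Features above the usual threshold of 10
over_10 = vif_table[vif_table["vif"] > 10]

# Features above the more conservative cutoff of 5
over_5 = vif_table[vif_table["vif"] > 5]

print(over_10)
print(over_5)
```

With these made-up numbers only feature_c crosses 10, while the stricter cutoff of 5 also flags feature_b for a closer look.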

208
00:19:15,600 --> 00:19:23,580
But given the fact that our results for the coefficients make a lot of sense and we are below the threshold
208

209
00:19:23,580 --> 00:19:29,100
of 10 for the variance inflation factor, I'm really not too worried about having a multicollinearity
209

210
00:19:29,100 --> 00:19:31,520
problem. Now,
210

211
00:19:31,590 --> 00:19:39,030
as I've discussed earlier this housing dataset actually came from a research paper where two academics
211

212
00:19:39,150 --> 00:19:44,940
were looking for the demand for clean air in Boston.
212

213
00:19:44,940 --> 00:19:51,360
They were trying to put a value to how much people are willing to pay to live somewhere with good air
213

214
00:19:51,360 --> 00:19:52,680
quality.
214

215
00:19:52,680 --> 00:19:57,630
And I was curious if these two academics tested for multicollinearity as well
215

216
00:19:57,630 --> 00:20:04,170
and if they mentioned it in their paper, and what I found was that they indeed discussed this problem
216

217
00:20:04,290 --> 00:20:07,350
in the article's footnotes.
217

218
00:20:07,350 --> 00:20:12,660
The problem that the researchers faced was a little different from us, because the researchers were actually
218

219
00:20:12,660 --> 00:20:19,770
trying to estimate the value of having clean air and what they had was more than one measure of pollution.
219

220
00:20:19,770 --> 00:20:28,560
For us in our data set we have one variable, NOX, which measures nitric oxide. But originally this
220

221
00:20:28,560 --> 00:20:34,160
wasn't the only pollution factor that was tested in this research paper.
221

222
00:20:34,270 --> 00:20:40,620
They actually had two different pollution measures and what they found was they actually had a
222

223
00:20:40,620 --> 00:20:44,730
multicollinearity problem when they included both of them.
223

224
00:20:44,730 --> 00:20:51,600
In other words, one of the pollution features was redundant, and they ended up removing it from their
224

225
00:20:51,600 --> 00:20:53,180
regression model.
225

226
00:20:53,280 --> 00:20:57,310
So I think this is actually quite an interesting solution to the problem
226

227
00:20:57,390 --> 00:21:03,960
if you're encountering it in your own research - removing an unnecessary feature is a perfectly valid
227

228
00:21:03,960 --> 00:21:10,800
way to modify the model and it's something we're going to look at in the next lesson, because in the
228

229
00:21:10,800 --> 00:21:17,640
next lesson we're going to be making small tweaks to our regression model and try and simplify it and
229

230
00:21:17,640 --> 00:21:25,380
at the same time we get a chance to check for that final symptom of multicollinearity, namely seeing
230

231
00:21:25,380 --> 00:21:31,590
if our coefficient estimates change dramatically when we tweak the model.
231

232
00:21:32,010 --> 00:21:38,640
Now it's getting late over here and I really need to get another coffee, but I've had so much coffee
232

233
00:21:38,640 --> 00:21:44,180
already so I think we'll have to maybe dilute my coffee with decaf.
233

234
00:21:44,220 --> 00:21:45,290
That should work, right?
234

235
00:21:46,360 --> 00:21:48,300
Anyhow, I'll see you in the next lesson.