0 1 00:00:00,390 --> 00:00:06,370 Remember how we noted a couple of things down on our to do list to check for later? We said that we'd check 1 2 00:00:06,370 --> 00:00:08,550 for a potential problem in our regression. 2 3 00:00:08,800 --> 00:00:13,240 And the problem that we were concerned about was called multicollinearity. 3 4 00:00:13,240 --> 00:00:16,850 The reason was that we had some high correlations between our features, 4 5 00:00:16,990 --> 00:00:22,430 so multicollinearity was something that we wanted to test formally. 5 6 00:00:22,510 --> 00:00:28,990 Now put simply multicollinearity is when two or more predictor variables in a regression are highly 6 7 00:00:28,990 --> 00:00:30,970 related to one another. 7 8 00:00:30,970 --> 00:00:38,320 In that case they do not provide unique or independent information to the model and the consequences 8 9 00:00:38,440 --> 00:00:41,400 of multicollinearity are as follows. 9 10 00:00:41,440 --> 00:00:48,880 There is a loss of reliability in the estimates of the effects for the individual features, 10 11 00:00:48,880 --> 00:00:51,300 in particular the features that are affected. 11 12 00:00:51,610 --> 00:00:57,730 In other words, there is a high variability in the coefficient estimates for small changes in the model, 12 13 00:00:58,270 --> 00:01:02,940 for example, adding or removing a feature can have dramatic effects. 13 14 00:01:02,950 --> 00:01:08,680 The estimates of our theta values in our model become unstable and can even switch signs from positive 14 15 00:01:08,680 --> 00:01:11,020 to negative and vice versa. 15 16 00:01:11,020 --> 00:01:17,890 The third effect of the problem is that the findings are strange or misleading or don't make sense. 16 17 00:01:17,890 --> 00:01:21,960 These are the main symptoms of the multicollinearity problem. 17 18 00:01:22,120 --> 00:01:28,990 Now, given these symptoms we can definitely say that on the third one our regression definitely passed, 18 19 00:01:29,470 --> 00:01:35,200 because we did a sense check on the signs of our coefficients and we found that the coefficients definitely 19 20 00:01:35,200 --> 00:01:36,400 had the right signs. 20 21 00:01:36,400 --> 00:01:38,820 They made sense from a logical perspective, 21 22 00:01:39,000 --> 00:01:44,920 like more rooms increased the property prices and more pollution decreased the property prices for example. 22 23 00:01:45,190 --> 00:01:49,840 But in this lesson I'm going to show you how you can do a formal check, how you can look at a metric to 23 24 00:01:49,840 --> 00:01:54,040 tell you whether you have this problem or not. 24 25 00:01:54,040 --> 00:02:01,750 The formal check that we will do is by looking at a statistic called the variance inflation factor or 25 26 00:02:01,810 --> 00:02:09,250 VIF and the second thing that we will do is we will monitor if moving a feature causes a dramatic 26 27 00:02:09,250 --> 00:02:14,830 change in our theta parameters, but that will wait until the next lesson. 27 28 00:02:14,830 --> 00:02:20,260 Now let's go back to Jupyter notebook and add a new section heading. This section heading is gonna be 28 29 00:02:20,260 --> 00:02:32,140 called "Testing for Multicollinearity", so "Testing for Multicollinearity", and to test for multicollinearity, 29 30 00:02:32,140 --> 00:02:36,920 we said we're gonna be using a statistic called the variance inflation factor. 30 31 00:02:37,390 --> 00:02:44,710 So let me quickly explain how the variance inflation factor does its job, because this statistic is a 31 32 00:02:44,770 --> 00:02:51,940 measure of collinearity among the features within a multiple regression and what it will do is it 32 33 00:02:51,940 --> 00:03:01,000 will spit out a number that quantifies the severity of multicollinearity, so in effect we get a number 33 34 00:03:01,060 --> 00:03:05,830 similar to how we get a number with the p-values and we can look at this number and compare it to a 34 35 00:03:05,830 --> 00:03:09,760 threshold, is it above the threshold or is it below the threshold. 35 36 00:03:09,760 --> 00:03:11,370 That's how it's gonna work. 36 37 00:03:11,620 --> 00:03:18,430 But before we write the Python code to actually calculate the variance inflation factors, let's add some 37 38 00:03:18,430 --> 00:03:25,720 formulas in LaTeX notation to the section heading to kind of better understand what's going on. 38 39 00:03:25,720 --> 00:03:31,400 I'm going to show you how the variance inflation factor is calculated for one of the features 39 40 00:03:31,510 --> 00:03:37,700 as an example. The first thing that will happen is that a regression will be run on that feature. 40 41 00:03:37,750 --> 00:03:40,170 Let's say we're looking at our TAX features. 41 42 00:03:40,270 --> 00:03:48,790 So what will happen is that we will look to explain the TAX values as a linear combination of all the 42 43 00:03:48,790 --> 00:03:51,230 other features in the data set. 43 44 00:03:51,430 --> 00:03:57,610 So you're gonna get a regression that reads something like "TAX = " and then some intercept, say "\alpha 44 45 00:03:58,000 --> 00:04:03,100 _0 ", plus some parameter times the number of rooms, 45 46 00:04:03,130 --> 00:04:14,830 so " \alpha _1 RM", plus some other parameter "\alpha _2" times the next feature, 46 47 00:04:15,050 --> 00:04:23,110 say nitrous oxide (NOX), plus and so on till you get to LSTAT, which will be "\alpha _ 47 48 00:04:23,200 --> 00:04:25,860 {12} 48 49 00:04:26,610 --> 00:04:34,090 LSTAT". Since we have 13 features total and we're looking to explain TAX on property prices, 49 50 00:04:34,270 --> 00:04:39,900 we're left with 12 different kind of parameters in front of the other features. 50 51 00:04:39,940 --> 00:04:45,970 I've also used alphas here in my LaTeX notation so we don't get confused with the theta parameters that 51 52 00:04:45,970 --> 00:04:46,850 we used earlier. 52 53 00:04:47,560 --> 00:04:52,620 Now this is only step one of calculating the variance inflation factor. 53 54 00:04:52,690 --> 00:05:01,300 Step 2 looks like this. In step 2 we get the variance inflation factor for our TAX feature. 54 55 00:05:01,480 --> 00:05:06,150 So "VIF _{TAX} = 55 56 00:05:06,150 --> 00:05:17,670 \frac {1}/{(1-R _ 56 57 00:05:18,150 --> 00:05:28,470 {TAX} ^ 2)}". Now let me quickly 57 58 00:05:28,470 --> 00:05:36,270 delete this backslash here, we don't need that, and hit Shift+Enter to make this a little more clear. This 58 59 00:05:36,270 --> 00:05:45,720 is step 2. In Step 1 a regression is being run of all the other features against TAX and in step 2 the 59 60 00:05:45,840 --> 00:05:54,180 r-squared of that regression is used to calculate the variance inflation factor, so 1 divided by 60 61 00:05:54,180 --> 00:06:02,160 (1 - r^2) of this regression up top is the variance inflation factor for the TAX feature, so 61 62 00:06:02,160 --> 00:06:08,430 that r-squared here is the r-squared from trying to explain the tax feature in terms of all the other 62 63 00:06:08,430 --> 00:06:15,660 features in the dataset. So the point I'm trying to make is that we can calculate a variance inflation 63 64 00:06:15,660 --> 00:06:22,530 factor for all the different features, right, just by swapping out this TAX for RM we get the variance 64 65 00:06:22,530 --> 00:06:29,790 inflation factor for RM, so in our Python code what we're gonna do is we're going to calculate 13 different 65 66 00:06:30,060 --> 00:06:38,550 variants inflation factors. Now the statsmodels module actually makes this very, very easy. We're not 66 67 00:06:38,550 --> 00:06:43,950 gonna have to run this like two step calculation manually, we simply call a function, but I wanted to 67 68 00:06:43,950 --> 00:06:49,100 show you what's going on behind the scenes in the LaTeX notation nonetheless. 68 69 00:06:49,410 --> 00:06:55,980 So let's scroll to the top of our notebook and actually import this functionality. I'm going to come here 69 70 00:06:56,790 --> 00:07:13,290 and I'm going to write "from statsmodels.stats.outliers_influence import variance_ 70 71 00:07:13,830 --> 00:07:18,640 inflation_factor". 71 72 00:07:18,910 --> 00:07:27,690 So, I'm importing this specific function from "statsmodels.stats.outliers_influence" 72 73 00:07:27,690 --> 00:07:30,120 Now let me hit Shift+Enter on the cell. 73 74 00:07:30,120 --> 00:07:36,270 Interestingly, the deprecation warning disappeared when I did that. And I'm going to scroll back down and 74 75 00:07:36,270 --> 00:07:39,290 now I'm actually going to call this function. 75 76 00:07:39,450 --> 00:07:41,610 Check it out. 76 77 00:07:41,610 --> 00:07:47,390 "variance_inflation_factor()" 77 78 00:07:47,450 --> 00:07:54,090 and before I put any arguments between these two parentheses I'm going to hit Shift+Tab on my keyboard to 78 79 00:07:54,090 --> 00:07:57,840 bring up the quick documentation on this function. 79 80 00:07:57,840 --> 00:08:07,650 What we see here is that we need two inputs, exog and exog_idx, hit this little plus sign 80 81 00:08:07,650 --> 00:08:08,250 here, 81 82 00:08:08,370 --> 00:08:17,070 scroll down a little bit and we can see that the first argument is an n-dimensional array 82 83 00:08:17,760 --> 00:08:21,170 and the second argument is simply an integer. 83 84 00:08:21,330 --> 00:08:27,690 It's a whole number, and that number is basically the column of this n-dimensional array that we want 84 85 00:08:27,690 --> 00:08:29,580 to look at. 85 86 00:08:29,670 --> 00:08:30,000 All right. 86 87 00:08:30,300 --> 00:08:38,400 So we've got a data frame, X_including_const, and say we want to calculate the variance 87 88 00:08:38,400 --> 00:08:42,980 inflation factor for the first column in this data frame. 88 89 00:08:43,080 --> 00:08:44,780 How would we do it? 89 90 00:08:44,840 --> 00:08:46,880 Well I'll come down here 90 91 00:08:47,210 --> 00:08:56,810 and as an argument pass in the first one which was "exog = X_incl 91 92 00:08:56,820 --> 00:09:07,690 _const", comma, and then "exog_ids = 1". 92 93 00:09:07,850 --> 00:09:10,130 Why did I pick index 1 and not 0? 93 94 00:09:10,730 --> 00:09:15,470 Well at index 0 we've got our intercept, our constant. 94 95 00:09:15,470 --> 00:09:19,580 I'm interested in our first feature which is Crime. 95 96 00:09:20,120 --> 00:09:27,230 Okay so I'm going hit Shift+Enter on this and you're going to see that it doesn't work. 96 97 00:09:27,230 --> 00:09:28,210 Take a look. 97 98 00:09:28,280 --> 00:09:37,110 We get a type error. Scrolling down we see that the error description is "Unhashable type: 'slice'". 98 99 00:09:37,190 --> 00:09:45,160 Why did we get this? The error message is actually fairly cryptic - unhashable type, but we do discover 99 100 00:09:45,160 --> 00:09:50,150 that it actually has to do something with data types again. Now data types are actually usually quite 100 101 00:09:50,150 --> 00:09:51,420 hidden in Python, 101 102 00:09:51,590 --> 00:09:57,280 but occasionally, if you ignore them too much they come and bite you when you're not careful. 102 103 00:09:57,290 --> 00:10:04,490 Remember how when we hit Shift+Tab here and we looked at what kind of argument this exog argument 103 104 00:10:04,490 --> 00:10:05,420 should be? 104 105 00:10:05,480 --> 00:10:09,090 We found out that it was an n-dimensional array. 105 106 00:10:09,140 --> 00:10:14,810 Well let's look at the type of X_incl_const. 106 107 00:10:14,840 --> 00:10:24,110 So writing "type(X_incl_const)" and hitting Shift+Enter will tell 107 108 00:10:24,110 --> 00:10:28,450 us that this is not an n dimensional array. 108 109 00:10:28,460 --> 00:10:34,100 This is in fact a dataframe, a pandas DataFrame. 109 110 00:10:34,100 --> 00:10:42,830 So how do we get a n-dimensional array, an ndarray, from a dataframe? Pandas fortunately has a solution 110 111 00:10:42,830 --> 00:10:43,880 for us. 111 112 00:10:43,880 --> 00:10:51,840 All we have to do is call the values attribute, so all we have to do is put a dot after it and use the 112 113 00:10:51,840 --> 00:10:59,900 values attribute. This retrieves an n-dimensional array from our dataframe. And just to prove that it 113 114 00:10:59,900 --> 00:11:00,680 works, 114 115 00:11:00,690 --> 00:11:08,810 I'm going to hit Shift+Enter and print out the value - 1.7. 1.7 is the variance inflation 115 116 00:11:08,810 --> 00:11:12,580 factor for our crime feature. 116 117 00:11:12,710 --> 00:11:20,780 Nice, but how do we calculate the variance inflation factors for all the other features? To do that, 117 118 00:11:20,780 --> 00:11:26,340 we're gonna use a loop. We will need to loop across all the columns in the dataframe. 118 119 00:11:26,880 --> 00:11:31,080 But suppose you don't know how many columns there are in your dataframe. 119 120 00:11:31,080 --> 00:11:40,350 So as a challenge can you output the number of columns in X_incl_const? Bonus 120 121 00:11:40,350 --> 00:11:42,540 points if you can do it in two ways. 121 122 00:11:42,630 --> 00:11:46,690 I'll give you a few seconds to pause the video. Ready? 122 123 00:11:46,710 --> 00:11:48,190 Here's the solution. 123 124 00:11:48,240 --> 00:11:54,240 The first way you can find out how many columns there are in the dataframe are using the length function 124 125 00:11:54,820 --> 00:11:59,760 and you would use the length function on the index of columns. 125 126 00:11:59,760 --> 00:12:08,490 So as an argument you would pass in the dataframe and then put a dot after it and write columns, and 126 127 00:12:08,490 --> 00:12:12,490 this will give you the number 14. 127 128 00:12:12,540 --> 00:12:16,870 The second way you can do this is by using the shape attribute. 128 129 00:12:17,000 --> 00:12:26,640 So writing the name of the dataframe and using ".shape" will give you the number of rows and the number 129 130 00:12:26,640 --> 00:12:27,360 of columns. 130 131 00:12:27,360 --> 00:12:34,030 So there's 404 rows and 14 columns. To access the number of columns directly use the 131 132 00:12:34,030 --> 00:12:42,190 square brackets and then a one to return the second element from the shape attributes tuple. 132 133 00:12:42,190 --> 00:12:43,300 There we go. 133 134 00:12:43,300 --> 00:12:50,560 So both of these lines of Python code work as a way of retrieving the number of columns in a data frame. 134 135 00:12:50,630 --> 00:12:50,970 Now, 135 136 00:12:51,040 --> 00:12:52,560 we said we have to write a loop, 136 137 00:12:52,570 --> 00:12:53,390 right? 137 138 00:12:53,470 --> 00:12:59,860 So, as a second challenge, can you write a for loop that prints out all the variance inflation factors 138 139 00:13:00,280 --> 00:13:02,280 on all the features? 139 140 00:13:02,380 --> 00:13:05,820 I'll give you a few seconds to pause the video and try this out. 140 141 00:13:07,620 --> 00:13:08,780 Ready? 141 142 00:13:08,790 --> 00:13:10,090 Here's the solution. 142 143 00:13:10,230 --> 00:13:18,930 You write "for i in range()", and the range would be the number of columns in the dataframe, so I could have 143 144 00:13:19,560 --> 00:13:29,940 "X_incl_const.shape[1]" and then colon, and then in the body of 144 145 00:13:29,940 --> 00:13:40,140 my for loop, I have a print statement with a function call to variance_inflation_factor, so "variance_inflation 145 146 00:13:40,230 --> 00:13:43,620 _factor()", 146 147 00:13:43,620 --> 00:13:53,460 and then my first argument is going to be, as before, "X_incl_const.values, 147 148 00:13:53,940 --> 00:13:57,140 exog_ 148 149 00:13:57,180 --> 00:13:59,840 idx = ", 149 150 00:14:00,000 --> 00:14:07,200 And then what? Now let's apply the iterator of the loop and that's i. 150 151 00:14:07,200 --> 00:14:08,370 There we go. 151 152 00:14:08,370 --> 00:14:08,870 That's it. 152 153 00:14:08,880 --> 00:14:10,230 That's the solution. 153 154 00:14:10,290 --> 00:14:19,440 So when the loop finishes we can print "All done!" and let's run it see what we get. Tada! 154 155 00:14:19,540 --> 00:14:27,090 These are all the variance inflation factors calculated for every single feature in our dataframe. 155 156 00:14:27,100 --> 00:14:33,970 Now printing all this stuff out is very well and good, but what if we wanted to store it in a list? 156 157 00:14:33,970 --> 00:14:37,500 What if we wanted to store all these things in a variable? 157 158 00:14:37,630 --> 00:14:39,780 So I'm going to copy the code I have above, 158 159 00:14:39,970 --> 00:14:43,460 come down here, add a few more cells, paste it in, delete 159 160 00:14:43,480 --> 00:14:44,800 my challenge comment, 160 161 00:14:44,800 --> 00:14:54,160 delete this print statement and then I'm going to create an empty list here, "vif = []", 161 162 00:14:54,500 --> 00:14:56,200 gonna put a comment here, 162 163 00:14:56,290 --> 00:15:03,380 this is an empty list, and then inside my list I'm going to use the append method, 163 164 00:15:03,670 --> 00:15:12,610 so "vif.append()" and what will it append? It will append the variance inflation factor that is calculated 164 165 00:15:13,000 --> 00:15:21,170 as part of the loop. And when we're all done, we can print our variance inflation factor list. 165 166 00:15:21,300 --> 00:15:23,400 Let's see what we get. 166 167 00:15:24,250 --> 00:15:25,290 So that worked really well. 167 168 00:15:25,780 --> 00:15:33,330 We get all the same results printed out as before except now we're storing them in a variable. 168 169 00:15:33,430 --> 00:15:40,630 But I tell you what, I'm going to show you a different kind of Python syntax to accomplish the same thing. 169 170 00:15:40,630 --> 00:15:43,750 So I'm going to copy, this paste it in here. 170 171 00:15:43,790 --> 00:15:52,420 What I'm going to do now is I'm going to run this loop inside these square brackets, and the way I would do 171 172 00:15:52,420 --> 00:15:57,430 this is I would move this part here which is the body of the loop, 172 173 00:15:57,430 --> 00:16:07,470 if you think about it, and then afterwards I will add the code for the loop, namely this part. So I can 173 174 00:16:07,470 --> 00:16:14,070 copy that, paste it in here and note the square bracket at the end, 174 175 00:16:14,070 --> 00:16:21,080 then I'm going to delete this part here and I will split this one line over two lines, just hitting Enter 175 176 00:16:21,200 --> 00:16:22,320 here. 176 177 00:16:22,440 --> 00:16:24,700 That way you can see a lot better what's going on. 177 178 00:16:25,370 --> 00:16:26,110 Mm hmm. 178 179 00:16:26,310 --> 00:16:29,430 Let me run this just to prove that it works. 179 180 00:16:29,430 --> 00:16:30,870 Here you go. 180 181 00:16:30,870 --> 00:16:31,980 So this is interesting, right? 181 182 00:16:32,010 --> 00:16:38,250 We're used to seeing loops like this, but you could also use this syntax here to run the loop inside 182 183 00:16:38,250 --> 00:16:42,750 the square brackets to populate your list. 183 184 00:16:42,750 --> 00:16:51,110 Now, I don't really like this formatting here, so I'm going to write some code to add a dataframe at 184 185 00:16:51,110 --> 00:16:54,570 the end using the dictionary notation again. 185 186 00:16:55,010 --> 00:17:04,520 And this dictionary is gonna have "'coef_name':" and then the names are going to be 186 187 00:17:05,090 --> 00:17:13,100 from my training dataset "X_incl_const.columns". 187 188 00:17:13,130 --> 00:17:20,230 So this is gonna be my list of names, and I'm going to put comma, hit Enter to go down to a new line. For my 188 189 00:17:20,230 --> 00:17:28,270 second column in my dataframe, I'll have "vif" as the column name and then I'll put the variants inflation 189 190 00:17:28,270 --> 00:17:33,850 factor list that we've calculated using our loop right afterwards. 190 191 00:17:33,850 --> 00:17:37,940 Now let's see what we get. We get something like this. 191 192 00:17:38,170 --> 00:17:45,070 We get the feature names here and we get the variance inflation factor for each feature next to it. 192 193 00:17:45,080 --> 00:17:49,710 Now there's a lot of numbers after the decimal point with the variance inflation factor. 193 194 00:17:49,840 --> 00:17:58,970 So let's round it and then we use a rounding function from numpy, "np.round()", I'll have 194 195 00:17:58,970 --> 00:18:04,410 the variance inflation factor inside for the first argument and for the second argument I'm going to put 195 196 00:18:04,400 --> 00:18:07,840 2 as how many numbers I want after the decimal point. 196 197 00:18:07,850 --> 00:18:16,370 So I want two numbers after the decimal point. I'll put a closing parentheses here and hit Shift+Enter to refresh 197 198 00:18:16,550 --> 00:18:17,490 my cell. 198 199 00:18:17,780 --> 00:18:25,440 And this is what we get. Now the variance inflation factors are a lot easier to read. 199 200 00:18:25,490 --> 00:18:29,280 So how do we interpret this output? 200 201 00:18:29,870 --> 00:18:38,660 Well, similar to how we did things with the p-value, we compare these numbers to a threshold and for the 201 202 00:18:38,660 --> 00:18:47,030 variance inflation factors, the scientific consensus seems to be that the threshold is around 10, meaning 202 203 00:18:47,210 --> 00:18:55,520 any feature that has a VIF over 10 would be considered problematic and would need closer inspection. 203 204 00:18:55,670 --> 00:19:03,500 But looking at our list here, we can see that all the numbers are below 10 and this suggests that we 204 205 00:19:03,500 --> 00:19:07,520 don't have to worry about multicollinearity. 205 206 00:19:07,660 --> 00:19:12,450 Now some academics are a little bit more conservative regarding their threshold and they think that 206 207 00:19:12,450 --> 00:19:15,270 a cutoff of 5 is better. 207 208 00:19:15,600 --> 00:19:23,580 But given the fact that our results for the coefficients make a lot of sense and we are below the threshold 208 209 00:19:23,580 --> 00:19:29,100 of 10 for the variance inflation factor, I'm really not too worried about having a multicollinearity 209 210 00:19:29,100 --> 00:19:31,520 problem. Now, 210 211 00:19:31,590 --> 00:19:39,030 as I've discussed earlier this housing dataset actually came from a research paper where two academics 211 212 00:19:39,150 --> 00:19:44,940 were looking for the demand for clean air in Boston. 212 213 00:19:44,940 --> 00:19:51,360 They were trying to put a value to how much people are willing to pay to live somewhere with good air 213 214 00:19:51,360 --> 00:19:52,680 quality. 214 215 00:19:52,680 --> 00:19:57,630 And I was curious if these two academics tested for multicollinearity as well 215 216 00:19:57,630 --> 00:20:04,170 and if they mentioned it in their paper, and what I found was that they indeed discussed this problem 216 217 00:20:04,290 --> 00:20:07,350 in the article's footnotes. 217 218 00:20:07,350 --> 00:20:12,660 The problem that the researchers faced was a little different from us, because the researchers were actually 218 219 00:20:12,660 --> 00:20:19,770 trying to estimate the value of having clean air and what they had was more than one measure of pollution. 219 220 00:20:19,770 --> 00:20:28,560 For us in our data set we have one variable, NOX, which measures nitric oxide. But originally this 220 221 00:20:28,560 --> 00:20:34,160 wasn't the only pollution factor that was tested in this research paper. 221 222 00:20:34,270 --> 00:20:40,620 They actually had two different pollution measures and what they found was they actually had a 222 223 00:20:40,620 --> 00:20:44,730 multicollinearity problem when they included both of them. 223 224 00:20:44,730 --> 00:20:51,600 In other words, one of the pollution features was redundant, and they ended up removing it from their 224 225 00:20:51,600 --> 00:20:53,180 regression model. 225 226 00:20:53,280 --> 00:20:57,310 So I think this is actually quite an interesting solution to the problem 226 227 00:20:57,390 --> 00:21:03,960 if you're encountering it in your own research - removing an unnecessary feature is a perfectly valid 227 228 00:21:03,960 --> 00:21:10,800 way to modify the model and it's something we're going to look at in the next lesson, because in the 228 229 00:21:10,800 --> 00:21:17,640 next lesson we're going to be making small tweaks to our regression model and try and simplify it and 229 230 00:21:17,640 --> 00:21:25,380 at the same time we get a chance to check for that final symptom of multicollinearity, namely seeing 230 231 00:21:25,380 --> 00:21:31,590 if our coefficient estimates change dramatically when we tweak the model. 231 232 00:21:32,010 --> 00:21:38,640 Now it's getting late over here aand I really need to get another coffee, but I've had so much coffee 232 233 00:21:38,640 --> 00:21:44,180 already so I think we'll have to maybe dilute my coffee with decaf. 233 234 00:21:44,220 --> 00:21:45,290 That should work, right? 234 235 00:21:46,360 --> 00:21:48,300 Anyhow, I'll see in the next lesson.