In this lesson we're going to talk about simplifying our model, and that's because, all else equal, simpler models are preferable to complex ones. Remember the Zen of Python? "Simple is better than complex" and "Complex is better than complicated." This doesn't just apply to programming; the same can be said about our regression model. By the way, you can always bring up this little gem of programming philosophy by typing "import this" in a Jupyter notebook.

So the question is: how can we simplify our model? One of the easiest ways is to remove some of the explanatory variables. But can we just drop some features? Is that a wise thing to do? And if so, which features should we drop? What about the features that were not highly correlated with property prices? In the correlation matrix, we saw that distance from employment centers (DIS) only had a 0.25 correlation with our target. DIS had a low correlation with price, but it also had a high correlation with our industry factor, at -0.71. At the time we were wondering how much value that distance-from-employment-centers feature really added. But now we know.
Scrolling down, we've got the p-values, which test for the significance of our different factors, and we can see that distance is actually very statistically significant. So we should probably keep DIS around. On the other hand, looking back up at our industry factor, this has a p-value of around 0.44, meaning it is not statistically significant. The threshold for p-values, if you recall, was 0.05. So now the question is: should we try removing the industry factor from the model?

The thing is, it is really, really tempting to remove insignificant predictors. But even dropping statistically insignificant features is not something people do lightly, because even a feature with a high p-value can add value to the model as a whole by providing some kind of information that the other features do not provide. Deciding what to keep and what to throw away from a machine learning model is a bit of an art, and it gets into the whole topic of feature selection, which is a very big topic indeed and one that we will continue to tackle throughout this course. The goal of this lesson is to introduce you to feature selection in the context of a regression model, and we will be looking at a metric that you can use to help you make your decisions. That metric is called the Bayesian Information Criterion, or BIC.
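For reference, the BIC of a fitted model is BIC = k·ln(n) − 2·ln(L̂), where k is the number of estimated parameters, n is the number of observations, and L̂ is the maximised likelihood: the second term rewards fit, the first penalises complexity. Here is a minimal sketch computing it by hand for an ordinary least-squares fit under Gaussian errors. The toy data and every variable name below are illustrative, not from the lesson's notebook:

```python
import numpy as np

def bic(n, k, log_likelihood):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L_hat)."""
    return k * np.log(n) - 2 * log_likelihood

# Toy data: y depends linearly on x plus Gaussian noise
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Fit y = a*x + b by least squares
X = np.column_stack([x, np.ones(n)])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ theta
sigma2 = residuals @ residuals / n  # MLE of the error variance

# Gaussian log-likelihood evaluated at the MLE
llf = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

k = 3  # two coefficients plus the error variance
print(bic(n, k, llf))
```

Because the penalty term grows with k, adding a parameter only lowers the BIC if it improves the log-likelihood by more than ln(n)/2.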
The Bayesian information criterion is a way you can measure complexity. It's basically a number that allows you to compare two different models. So what you end up doing is you run a regression with model number one, and then you run a regression with model number two. Model number one might have a BIC value of 148 and model number two might have a BIC value of 154. The actual number doesn't by itself mean very much; what matters is which one is lower, because all else equal a lower BIC number is better. So this measure can help you pick between two or more models.

So what models will we compare? Well, for starters, let's compare the model that includes the industry feature and the model that excludes it. Let's commemorate this with another Markdown cell in the Jupyter notebook. I'm going to change my cell here to Markdown and put in a section heading: "Model Simplification & the Bayesian Information Criterion". To calculate the Bayesian information criterion, we're once again going to use the statsmodels module's regression capabilities instead of scikit-learn, so let's copy this cell and paste it below. I'm going to delete these comments and add a new comment up here that reads "Original model with log prices and all features", and the dataframe that I've got here I'm going to store inside a variable.
So I'm going to say "original_coef = pd.DataFrame()" and so on. Now what I want to do is add some additional print statements, and in these print statements we'll output both the Bayesian information criterion value and the r-squared for the regression. Previously we've used scikit-learn's score method to print out the r-squared, but things work a little differently with the statsmodels module, and I think this is a good time to practice making sense of the official documentation.

So, as a challenge, can you look up the statsmodels docs for the regression results and figure out how to print out both the Bayesian information criterion value for this regression as well as the r-squared? I'll give you a few seconds to pause the video and give this a shot. You ready? Here's the solution. Being able to read and interpret the official documentation for a lot of these Python modules is one of the key skills in becoming a better programmer. If I click on my results object and press Shift+Tab on my keyboard to bring up the quick documentation, I don't actually get anything useful. I just see that to get further information I need to look at the regression results documentation. The same thing is true if I click here and find out that I'm just dealing with a dataframe, or when I click here.
So in all these cases, the quick documentation isn't actually helping me all that much; I'm just not having any luck getting the relevant information. So what I'm going to have to do is Google the documentation myself. The best keywords to enter into that white text box are "statsmodels regression results", and that should pretty much take you to one of the statsmodels pages. Out of these three results that I've got here, the one I'm looking for is the documentation page for the RegressionResults object. So it's the third one down here. This is the RegressionResults object from the linear model, and this is the documentation page that you want to be looking at. The other reason why the RegressionResults page that I've clicked on is the relevant one is that the object we've actually got, the RegressionResultsWrapper, inherits most of its methods and attributes from RegressionResults, meaning the capabilities of a RegressionResultsWrapper object and a RegressionResults object are pretty much the same. They have a lot in common, since a lot of the methods and attributes are inherited from this particular object, so they're closely linked. That is why I'm looking on this page.
Now, this is an incredibly long page if you look at it. It's very, very comprehensive, but the interesting thing is that we can find both the params attribute and the pvalues attribute on this page. So if I search for params on this page, I can find params listed as one of the attributes, and I can also find pvalues listed as one of the attributes. And, as you probably already spotted, at the top of the page is the bic attribute, the Bayesian information criterion. So bic is the name of the attribute, and the way we access it is simply by writing "results.bic". If I hit Shift+Enter, I can see what the value actually is: -139.85, so about -140.

And what about the r-squared? Going back to the documentation and scrolling down, r-squared unsurprisingly has the attribute name rsquared, all lowercase and in one word. So "results.rsquared" will bring up the r-squared for this regression, which is 0.793. Let's have both of these lines of code in a print statement. So I'm going to have "print('BIC is', results.bic)" and then "print('r-squared is', results.rsquared)". There we go. Okay, so now we have both the r-squared and our BIC printed out.
The comforting thing to see is that statsmodels and scikit-learn give exactly the same r-squared, so we're doing things right. Now, in this case the Bayesian information criterion is actually a negative number, and that's absolutely fine. What matters is how this number stacks up against our next model. So I'm going to copy this cell, paste it, and then modify my comment. I'm going to write "Reduced model #1 excluding INDUS", and then what I'll say is "X_incl_const = X_incl_const.drop(['INDUS'], axis=1)". In this line of code I'm redefining the dataframe of features by overwriting what's stored inside this variable: I'm dropping the INDUS column from the dataframe and storing the result as the new feature dataframe. So on this line, when it comes to training our model, we are excluding the INDUS feature. Next, I'm going to change the name of this variable to "coef_minus_indus", so that we're not overwriting the coefficient dataframe from the cell above, and I'm going to delete this comment here. Now let me hit Shift+Enter and refresh this cell. This result is already quite interesting.
What we can see is that our Bayesian information criterion has gotten more negative: we've now got the value -145.2, an even lower number than before. So we have an improvement in terms of reducing complexity, but at the same time it's also really nice to see that the r-squared, at 0.79, pretty much stays where it is. Even though we have removed one feature from our dataset, it hasn't really impacted our fit in a material way. This is actually very encouraging.

Let's go back up to our p-values and experiment with removing something else; let's experiment with removing AGE. Coming back down, I'm going to copy this cell, paste it, change my comment to "Reduced model #2 excluding INDUS and AGE", and then in this line add 'AGE', in single quotes, between the square brackets. I'm also going to rename our dataframe of coefficients to, say, "reduced_coef", and now I'm going to hit Shift+Enter. What we actually see is a further improvement based on the Bayesian information criterion. We get an even lower BIC number at -149.5, but we see no material change in the r-squared. So this makes me think that removing both INDUS and AGE is actually a beneficial thing.
We can probably safely drop these two features, simplifying our model without incurring too much of a cost in terms of lost information and a worse fit. Now, even though I just gave you two examples where removing a feature improved the Bayesian information criterion and left the r-squared pretty much unchanged, this isn't always the case. If I change AGE to one of the other features, say maybe zone, ZN, and press Shift+Enter, what we see is not really all that clear cut. In this case, we have a higher BIC number and a lower r-squared than before, so this is probably not the direction we want to go in. Same thing if I change this to our TAX feature: again, we're making our model worse. And the same thing if I change this to the LSTAT feature. Removing LSTAT actually makes our model much, much worse. You can see how much the Bayesian information criterion jumped and how much lower our r-squared is in this case. So LSTAT is actually very important to keep in the model. I'm going to change this back to AGE and press Shift+Enter so that we're back where we started. Okay, so where are we now? We've made two small tweaks to our model.
We've removed two of the features which were not statistically significant, and we've looked at the Bayesian information criterion and the r-squared to provide additional justification for leaving them out and simplifying our model. By doing so we've managed to improve our BIC number from around -140 to -150. So we get about a nine- to ten-point improvement in the BIC number, but we don't incur a material penalty on the r-squared; we're still at 0.79. Cool.

So that about wraps up our introduction to thinking about feature selection, and one thing we can do now is link this lesson to our previous one. We can link this to the discussion on multicollinearity and looking for stability in the theta estimates for our features, because we've made quite a few tweaks to our model, and we said that one of the symptoms of multicollinearity is unstable coefficients. Having run three different versions of our regression and having stored our coefficients in some variables, we can now look at them side by side and double-check whether there are any strange developments. Now, I'd be surprised if we saw any, because we have no indication of multicollinearity so far. But take a look at this. I'm going to create a variable called "frames" and make it equal to a list of our dataframes.
So we had the original_coef dataframe, we had the coef_minus_indus dataframe, and then finally we had the reduced_coef dataframe of coefficients. To put them all side by side, I'm going to use pandas' concat function: "pd.concat()", and then in the parentheses I provide my list of dataframes and an axis. I want this to be concatenated along the columns, so side by side as opposed to top to bottom; that means axis is going to be equal to 1 instead of 0. Let's see what we get. Fantastic. Now, what you can also see in this table is how Python treats missing values in a dataframe. You see these NaNs? NaN stands for "Not a Number", meaning there is a missing value there. So this column here is our reduced model without the AGE and without the industry feature, and we've got NaNs in place of those rows.
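The side-by-side comparison works like this. A minimal sketch with made-up coefficient tables (pandas Series here for brevity; the index labels and numbers are invented, but the pd.concat call mirrors the one in the lesson):

```python
import numpy as np
import pandas as pd

# Made-up coefficient tables from three versions of the model
original_coef = pd.Series({"const": 4.06, "CRIM": -0.01,
                           "INDUS": 0.002, "AGE": 0.0003}, name="original")
coef_minus_indus = pd.Series({"const": 4.06, "CRIM": -0.01,
                              "AGE": 0.0003}, name="minus INDUS")
reduced_coef = pd.Series({"const": 4.04, "CRIM": -0.01},
                         name="minus INDUS & AGE")

frames = [original_coef, coef_minus_indus, reduced_coef]

# axis=1 lines the frames up column by column; rows missing from a
# model show up as NaN instead of raising an error
side_by_side = pd.concat(frames, axis=1)
print(side_by_side)
```

The outer join on the row index is what produces the NaNs: any coefficient that a reduced model never estimated simply has no value in that model's column.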
Looking at this table is actually very, very encouraging, because what I'm seeing is that, despite tweaking the model, all the coefficients across all three versions, like Charles River and Crime, are remarkably consistent. The numbers change somewhat, but they don't switch signs and they don't change drastically, so these are all very stable coefficient estimates. The same actually holds true if you were to remove TAX, which we suspected was a potential problem source. Even removing TAX and rerunning this, you'd see that the theta estimates are nice and stable between the three models.

And that brings us to the end of this lesson. In the next one, we're going to take our evaluation of our regression even further. We're going to look at how far off our model's predictions were from the true values; we're going to be looking at and analyzing our regression residuals. Now, as an aside, while putting this lesson together for you, I found myself googling the word BIC and trying to read up on the information criterion, and for a split second I was quite confused why all I got on the front page was information about pens and the website of the Bournemouth International Centre. The other time that happened to me was when I read up on nested tables and was confronted with this.
So yeah, if you have any stories like this to share, please put them in the comments section for this video. I'd love to hear them. See you in the next lesson.