In this lesson, what we're going to do is look at our regression coefficients, our thetas, in more detail.

So far we've discussed how to interpret the value of our coefficients in both the original linear model and the log-linear model after our data transformations. The thing is, if we're looking to predict house prices, we're not just going to be interested in the sign and the size of our coefficients; we're also going to be interested in their significance. Just because there's a number next to a particular feature does not mean that this feature has much explanatory power. Just because there's a number here doesn't mean that this feature is actually significant.

Remember how we talked about how a doctor checks your vital stats? A regression coefficient's vital stat, the one that tells you about its significance, is called the p-value. Most academic research papers that you come across analyze the significance of their findings using this metric of p-values, and here's how it's typically used: if the p-value is less than a certain threshold, that threshold being 0.05, then the result is deemed statistically significant, and when the p-value is greater than 0.05, the result is considered not statistically significant. This threshold of 0.05 is roughly where the consensus amongst academics sits regarding significance, and for better or worse, there's a little bit of a cult around this particular value.

When it comes to calculating the p-values, scikit-learn's linear regression model isn't actually much help. So what we're going to do is look beyond the scikit-learn module to calculate the statistics for our regression. We're going to go beyond the simple linear regression provided by our machine learning module. In these lessons we're going to be looking at some detailed statistics of our model, and this is why we're going to be importing a different Python module to take us further here. This Python module is called statsmodels.

Now let's add a section heading. I'm going to change this cell to markdown and put a section heading here that reads "p values and Evaluating Coefficients". What we're going to be doing next is using the statsmodels module to run our linear regression, and we will run it so that we get the same results as we would with scikit-learn.
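To make that threshold rule concrete, here's a minimal sketch of it in Python. The helper function and the example values are just illustrative; they're not part of the lesson's notebook.

```python
ALPHA = 0.05  # the conventional significance threshold discussed above

def is_significant(p_value: float, alpha: float = ALPHA) -> bool:
    """Rule of thumb: a result is statistically significant if p < alpha."""
    return p_value < alpha

print(is_significant(0.001))  # True  -> statistically significant
print(is_significant(0.12))   # False -> not statistically significant
```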
However, we will be able to use this Python statsmodels module to pull up detailed statistics that we can't easily get with scikit-learn.

The first thing that we need to do, of course, is import statsmodels. So we're going to go to the top of our notebook, and in our first cell we're going to say "import statsmodels.api as sm" and then hit Shift+Enter on the cell. What I see when I do this is that there is a deprecation warning here. This deprecation warning refers to my statsmodels module, and what it's saying is that, at the time of recording, this module is still using a component that is outdated: it's using the pandas datetools component, which is deprecated. Now, I'm really not too concerned about this, for two reasons. One is that we're not going to be using any functionality to do with dates from the statsmodels API, and two is that the statsmodels API will be updated by the people who created it, and they will make sure that it doesn't break and that it's maintained in good working order. So by the time you're running this, you may or may not see this deprecation warning.

Now let's run our regression with this new module. The thing to note is that in order to make our regression tie out with scikit-learn, we're going to have to add an intercept, because as you can see, there is an intercept here from our regression with scikit-learn. So what I'm going to do is take our features from the training dataset and add an intercept. I'll write "sm.add_constant(X_train)" and store this modified dataframe in a new variable. So I'm going to say "X_incl_const = sm.add_constant(X_train)".

Now what we can do is call the statsmodels OLS function, which will give us back a model object that we can then use to fit our regression. So I'm going to say "model = sm.OLS(y_train, X_incl_const)". What we're doing here is calling the OLS function; OLS stands for Ordinary Least Squares, and just like scikit-learn, this gives us a linear regression model, which we're storing here. As arguments, we've provided our target values and our features, and these features include the constant that we've added.

Now we can use the statsmodels API to fit our regression. Fitting our regression will give us some results.
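Put together, the cell dictated above looks something like this. It's a sketch that assumes X_train and y_train already exist from the earlier train/test split in this notebook.

```python
import statsmodels.api as sm

# Assumes X_train (features) and y_train (target) come from the
# earlier train/test split in this notebook.
X_incl_const = sm.add_constant(X_train)  # add an intercept column so results tie out with scikit-learn

model = sm.OLS(y_train, X_incl_const)    # OLS = Ordinary Least Squares
results = model.fit()                    # fit the regression and get back a results object
```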
So I'll create a new variable called results and set that equal to "model.fit()". In other words, calling the fit method on the model will return our regression results.

So the question is: how do we take a look at these results? For starters, let's see if we can print out the coefficients like we have here. To get these coefficients, we'll use the results object's params attribute. So "results.params" and Shift+Enter will show us the coefficients. Here they are, and these values tie out with what we saw in scikit-learn.

I'm going to comment this out, and now we can take a look at the p-values that I keep harping on about. To show these, we also use our results object, put a dot after it and access the pvalues attribute. So "results.pvalues" and Shift+Enter will show us the p-values for all our coefficients. I don't think this is formatted particularly nicely, so what I'm going to do is comment this out again and combine both of these series into a dataframe. With "pd.DataFrame({'coef': results.params, 'p-value': results.pvalues})" we can look at our coefficients and their p-values formatted nicely side by side. Check it out.

Now this is starting to look pretty good, but you know what? I find these p-values really, really hard to read in scientific notation. So what I'm going to do is round them. I'm going to come up here where we're creating our dataframe and wrap the whole thing in the round function: "round(" goes in front of "pd.DataFrame", and before the final closing parenthesis I'll add a comma and the number of decimals that we should round to. I'm going to go with 3 and refresh our output.

Now let's talk about how to interpret these results. Remember the rule of thumb that any p-value over 0.05 is not significant? In our case, two of our features fail this test, namely the INDUS feature and our AGE feature. These two features do not appear to add much additional information. All the others are indeed statistically significant. Let's make a note of this for later, because maybe, just maybe, we could remove the INDUS and the AGE features from our model.

In the next lesson, we're going to discuss a potential problem in our regression. Remember how we had high correlations between our features? In the next lesson, we're going to check formally whether our regression suffers from the problem of multicollinearity.
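Before we move on, here's a sketch of that final table cell as dictated, with the rounding applied. It assumes the fitted `results` object from the cell above.

```python
import pandas as pd

# Coefficients and p-values side by side, rounded to 3 decimal places.
# Assumes `results` is the fitted statsmodels OLS results object from above.
round(pd.DataFrame({'coef': results.params,
                    'p-value': results.pvalues}), 3)
```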