In this lesson, what we're going to do is look at our regression coefficients, our thetas, in more detail.

So far we've discussed how to interpret the value of our coefficients in both the original linear model and the log-linear model after our data transformations. The thing is, if we're looking to predict house prices, we're not just going to be interested in the sign and the size of our coefficients; we're also going to be interested in their significance. Just because there's a number next to a particular feature does not mean that this feature has much explanatory power. Just because there's a number here doesn't mean that this feature is actually significant.

Remember how we talked about how a doctor checks your vital stats? A regression coefficient's vital stat, the one that tells you about its significance, is called the p-value. Most academic research papers that you come across analyze the significance of their findings using this metric of p-values, and here's how it's typically used: if the p-value is less than a certain threshold, that threshold being 0.05, then the result is deemed statistically significant, and when the p-value is greater than 0.05, the result is considered not statistically significant. This threshold of 0.05 is roughly where the consensus amongst academics sits regarding significance, and for better or worse, there's a little bit of a cult around this particular value.

When it comes to calculating the p-values, scikit-learn's linear regression model isn't actually much help. So what we're going to do is look beyond the scikit-learn module to calculate the statistics for our regression. We're going to go beyond the simple linear regression provided by our machine learning module. In these lessons we're going to be looking at some detailed statistics of our model, and this is why we're going to be importing a different Python module to take us further here. This Python module is called statsmodels.

Now let's add a section heading. I'm going to change this cell to markdown and put a section heading here that reads "p values and Evaluating Coefficients". What we're going to be doing next is using the statsmodels module to run our linear regression, and we will run it so that we get the same results as we would with scikit-learn.
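To make that threshold rule concrete, here's a minimal sketch of it in Python. The helper function and the example values are just illustrative; they're not part of the lesson's notebook.

```python
ALPHA = 0.05  # the conventional significance threshold discussed above

def is_significant(p_value: float, alpha: float = ALPHA) -> bool:
    """Rule of thumb: a result is statistically significant if p < alpha."""
    return p_value < alpha

print(is_significant(0.001))  # True  -> statistically significant
print(is_significant(0.12))   # False -> not statistically significant
```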
However, we will be able to use this Python statsmodels module to pull up detailed statistics that we can't easily get with scikit-learn.

The first thing that we need to do, of course, is import statsmodels. So we're going to go to the top of our notebook, and in our first cell we're going to say "import statsmodels.api as sm" and then hit Shift+Enter on the cell. What I see when I do this is that there is a deprecation warning here. This deprecation warning refers to my statsmodels module, and what it's saying is that, at the time of recording, this module is still using a component that is outdated: it's using the pandas datetools component, which is deprecated. Now, I'm really not too concerned about this, for two reasons. One is that we're not going to be using any functionality to do with dates from the statsmodels API, and two is that the statsmodels API will be updated by the people who created it, and they will make sure that it doesn't break and that it's maintained in good working order. So by the time you're running this, you may or may not see this deprecation warning.

Now let's run our regression with this new module. The thing to note is that in order to make our regression tie out with scikit-learn, we're going to have to add an intercept, because as you can see, there is an intercept here from our regression with scikit-learn. So what I'm going to do is take our features from the training dataset and add an intercept. I'll write "sm.add_constant(X_train)" and store this modified dataframe in a new variable. So I'm going to say "X_incl_const = sm.add_constant(X_train)".

Now what we can do is call the statsmodels OLS function, which will give us back a model object that we can then use to fit our regression. So I'm going to say "model = sm.OLS(y_train, X_incl_const)". What we're doing here is calling the OLS function; OLS stands for Ordinary Least Squares, and just like scikit-learn, this gives us a linear regression model, which we're storing here. As arguments, we've provided our target values and our features, and these features include the constant that we've added.

Now we can use the statsmodels API to fit our regression. Fitting our regression will give us some results.
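Put together, the cell dictated above looks something like this. It's a sketch that assumes X_train and y_train already exist from the earlier train/test split in this notebook.

```python
import statsmodels.api as sm

# Assumes X_train (features) and y_train (target) come from the
# earlier train/test split in this notebook.
X_incl_const = sm.add_constant(X_train)  # add an intercept column so results tie out with scikit-learn

model = sm.OLS(y_train, X_incl_const)    # OLS = Ordinary Least Squares
results = model.fit()                    # fit the regression and get back a results object
```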
So I'll create a new variable called results and set that equal to "model.fit()". In other words, calling the fit method on the model will return our regression results.

So the question is: how do we take a look at these results? For starters, let's see if we can print out the coefficients like we have here. To get these coefficients, we'll use the results object's params attribute. So "results.params" and Shift+Enter will show us the coefficients. Here they are, and these values tie out with what we saw in scikit-learn.

I'm going to comment this out, and now we can take a look at the p-values that I keep harping on about. To show these, we also use our results object, put a dot after it and access the pvalues attribute. So "results.pvalues" and Shift+Enter will show us the p-values for all our coefficients. I don't think this is formatted particularly nicely, so what I'm going to do is comment this out again and combine both of these series into a dataframe. With "pd.DataFrame({'coef': results.params, 'p-value': results.pvalues})" we can look at our coefficients and their p-values formatted nicely side by side. Check it out.

Now this is starting to look pretty good, but you know what? I find these p-values really, really hard to read in scientific notation. So what I'm going to do is round them. I'm going to come up here where we're creating our dataframe and wrap the whole thing in the round function: "round(" goes in front of "pd.DataFrame", and before the final closing parenthesis I'll add a comma and the number of decimals that we should round to. I'm going to go with 3 and refresh our output.

Now let's talk about how to interpret these results. Remember the rule of thumb that any p-value over 0.05 is not significant? In our case, two of our features fail this test, namely the INDUS feature and our AGE feature. These two features do not appear to add much additional information. All the others are indeed statistically significant. Let's make a note of this for later, because maybe, just maybe, we could remove the INDUS and the AGE features from our model.

In the next lesson, we're going to discuss a potential problem in our regression. Remember how we had high correlations between our features? In the next lesson, we're going to check formally whether our regression suffers from the problem of multicollinearity.
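Before we move on, here's a sketch of that final table cell as dictated, with the rounding applied. It assumes the fitted `results` object from the cell above.

```python
import pandas as pd

# Coefficients and p-values side by side, rounded to 3 decimal places.
# Assumes `results` is the fitted statsmodels OLS results object from above.
round(pd.DataFrame({'coef': results.params,
                    'p-value': results.pvalues}), 3)
```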