0
1
00:00:00,240 --> 00:00:08,640
Now that we've shuffled and split our data into training and test datasets, we can finally run our regression
1

2
00:00:08,670 --> 00:00:12,200
using scikit learn. To do that,
2

3
00:00:12,300 --> 00:00:21,510
I'm going to scroll back up and import this capability, so I'm going to say "from sklearn.linear_
3

4
00:00:21,630 --> 00:00:32,970
model import LinearRegression", I'm going to hit Shift+Enter on this cell,
4

5
00:00:33,600 --> 00:00:41,850
scroll back down where I left off, and then I'm going to create a regression object called "regr", set that
5

6
00:00:41,850 --> 00:00:52,530
equal to "LinearRegression()" and then I'm going to use this regression object and call
6

7
00:00:52,560 --> 00:01:01,470
the "fit" method on it, so "regr.fit()" and now what do I add between these parentheses?
7

8
00:01:02,160 --> 00:01:06,420
What do I give has an argument to our fit method?
8

9
00:01:09,580 --> 00:01:14,330
The data that we should supply is our training datasets, right?
9

10
00:01:14,350 --> 00:01:23,710
So, X_train being the training features and y_train being the training target
10

11
00:01:23,710 --> 00:01:33,460
values. And hitting Shift+Enter on this will train our model. Great! But we don't see any output on the
11

12
00:01:33,460 --> 00:01:39,610
thetas that we've just estimated and also we don't even know how well this model fits our data. Now,
12

13
00:01:39,850 --> 00:01:47,140
before printing out these values, before printing out the values of our theta parameters, let's make some
13

14
00:01:47,140 --> 00:01:54,670
predictions on what we expect to see. Making these predictions is a very important step because it provides
14

15
00:01:54,700 --> 00:02:02,050
a sense check on what the computer is going to spit out to us. We should never blindly trust the numbers
15

16
00:02:02,110 --> 00:02:06,020
that we get back. That's how silly, silly mistakes are made.
16

17
00:02:06,070 --> 00:02:13,300
Besides, we've done all this data exploration work up until now and we can use the knowledge that we
17

18
00:02:13,300 --> 00:02:15,620
gained to make some predictions.
18

19
00:02:15,820 --> 00:02:23,150
For example, we've used seaborn's pairplot to run individual regressions against our target prices.
19

20
00:02:23,260 --> 00:02:30,340
Also, we've created this wonderful correlation matrix to show the correlations of the individual features
20

21
00:02:30,850 --> 00:02:37,510
with our target. We know which correlations were positive and which correlations were negative.
21

22
00:02:37,540 --> 00:02:46,810
So given that this is how our model looks like, let's make some predictions on the signs of the coefficients.
22

23
00:02:47,390 --> 00:02:52,850
Let's predict what kind of signs we're going to see on these theta parameters.
23

24
00:02:52,960 --> 00:03:01,270
So, get out a piece of paper and write down if you expect the theta parameter for RM to be positive
24

25
00:03:01,630 --> 00:03:13,030
or negative, and then do the same thing for NOX, PTRATIO, CRIM, DIS and LSTAT. Did you pause
25

26
00:03:13,030 --> 00:03:15,160
the video and write this down?
26

27
00:03:15,160 --> 00:03:17,830
There is actually one more prediction that we said would make.
27

28
00:03:17,830 --> 00:03:24,580
Remember how a few lessons ago, I asked you if you thought that being next to the Charles River was a good
28

29
00:03:24,580 --> 00:03:27,900
thing for property prices or a bad thing?
29

30
00:03:27,910 --> 00:03:36,740
Well now it's time to also add your prediction to the sign of the CHAS dummy variable as well.
30

31
00:03:36,790 --> 00:03:37,200
Okay.
31

32
00:03:37,240 --> 00:03:40,270
So here's the solution to this challenge.
32

33
00:03:40,450 --> 00:03:51,040
I'm going to print out the value of our intercept and it's gonna be "regr.intercept_"
33

34
00:03:51,670 --> 00:04:00,130
and then I'm going to print out the values of all the coefficients. And to format these values nicely I'm going to
34

35
00:04:00,130 --> 00:04:01,360
actually put them in a data frame.
35

36
00:04:01,360 --> 00:04:11,320
So I'm going to see "pd.DataFrame()" and then our first argument
36

37
00:04:11,410 --> 00:04:13,950
is our data that we're gonna supply
37

38
00:04:14,080 --> 00:04:20,800
and these are the coefficients which we can find under 
"regr.coef_".
38

39
00:04:20,800 --> 00:04:25,600
The second argument that we're going to supply is called index,
39

40
00:04:25,600 --> 00:04:28,060
so "index = ".
40

41
00:04:28,060 --> 00:04:28,580
and then what
41

42
00:04:28,570 --> 00:04:37,270
we're going to supply as an index are the column names, so "X_train.columns".
42

43
00:04:37,390 --> 00:04:37,720
Yeah.
43

44
00:04:38,080 --> 00:04:40,720
So these are the column names.
44

45
00:04:40,720 --> 00:04:46,190
That way we can have the names of the coefficients next to the values.
45

46
00:04:46,210 --> 00:04:48,220
Let me show you what this looks like.
46

47
00:04:48,370 --> 00:04:51,110
Voila! We get something like this.
47

48
00:04:51,130 --> 00:04:58,220
We can our intercept printed here and we get the name and the value of the different coefficients.
48

49
00:04:58,240 --> 00:05:04,530
Now I'm going to come back up here in to the code where I'm creating our data frame and I'm going to
49

50
00:05:04,540 --> 00:05:07,480
specify third argument called columns.
50

51
00:05:07,500 --> 00:05:08,160
Yeah.
51

52
00:05:08,290 --> 00:05:17,680
And I set that equal to "['COEF']". When I hit Shift+Enter now, you'll
52

53
00:05:17,680 --> 00:05:19,320
see this little 0 here,
53

54
00:05:19,330 --> 00:05:24,700
this the lack of a column name changed to the value "coef".
54

55
00:05:24,850 --> 00:05:36,520
So "columns = " and then a list of 1 value, namely the string "coef" will add a column name to our data
55

56
00:05:36,520 --> 00:05:38,610
frame here.
56

57
00:05:38,710 --> 00:05:39,180
All right.
57

58
00:05:39,210 --> 00:05:42,230
So we've got our results.
58

59
00:05:42,250 --> 00:05:45,850
How do they stack up to your predictions?
59

60
00:05:45,860 --> 00:05:53,050
Remember, we're just looking at the sign of the coefficients here for now. Are the actual coefficients
60

61
00:05:53,260 --> 00:06:01,470
positive or negative according to our model? Did you predict the right sign for the coefficients?
61

62
00:06:01,470 --> 00:06:10,710
Because what we see here is that more crime is bad for house prices and also more pollution is bad
62

63
00:06:10,800 --> 00:06:12,140
for house prices.
63

64
00:06:12,420 --> 00:06:16,560
More rooms on the other hand is good for house prices.
64

65
00:06:16,560 --> 00:06:25,740
It's a positive coefficient. And more students in class is also a negative factor for the house prices
65

66
00:06:26,220 --> 00:06:32,690
as is higher values for LSTAT. Now looking at this,
66

67
00:06:32,830 --> 00:06:40,390
I think this is some really, really good news because the signs of our coefficients passed our first sense
67

68
00:06:40,390 --> 00:06:43,210
check. All these signs that we've just talked about
68

69
00:06:43,300 --> 00:06:45,550
makes sense logically.
69

70
00:06:45,550 --> 00:06:52,120
So the things that we would have expected to be bad for property prices from a logic point of view indeed
70

71
00:06:52,180 --> 00:06:56,030
have a negative coefficient associated with them.
71

72
00:06:56,140 --> 00:06:57,040
Oh and,
72

73
00:06:57,040 --> 00:07:01,210
do you remember how I asked you to make a prediction if living next to the Charles River was a good
73

74
00:07:01,210 --> 00:07:02,360
thing or not?
74

75
00:07:02,380 --> 00:07:04,340
Well, now we've got our answer.
75

76
00:07:04,540 --> 00:07:11,410
Looking at the sign on the CHAS variable, we can see that living next to the Charles River is indeed
76

77
00:07:11,410 --> 00:07:18,700
desirable and the properties that are on the river are more expensive than those that are are away from
77

78
00:07:18,700 --> 00:07:23,570
it, all else equal. In fact we can put a number to it,
78

79
00:07:23,760 --> 00:07:32,010
Bostonians are willing to pay a premium for a property that is next to the river and that premium is
79

80
00:07:32,040 --> 00:07:35,440
equal to about 2000 dollars.
80

81
00:07:35,460 --> 00:07:44,010
Remember, CHAS only has the values 0 or 1. When CHAS is equal to 1 then it's next to the river and
81

82
00:07:44,190 --> 00:07:49,230
1 times approximately 2 is equal to 2, right?
82

83
00:07:49,660 --> 00:07:54,210
And since our target values, our property prices, are in thousands,
83

84
00:07:54,210 --> 00:07:58,350
this translates into 2000 dollars.
84

85
00:07:58,590 --> 00:08:05,820
So what we've just done in Jupyter notebook is calculate all these theta parameters.
85

86
00:08:05,820 --> 00:08:09,720
Our model now looks like this. Using our training data,
86

87
00:08:09,750 --> 00:08:17,370
we've estimated all the values of these theta parameters and the beautiful thing about this linear model
87

88
00:08:17,790 --> 00:08:24,900
is that all our coefficients have a very clear meaning - an increase in the number of rooms by one will
88

89
00:08:24,960 --> 00:08:31,520
increase the price of the property by 3100 hundred dollars.
89

90
00:08:31,560 --> 00:08:38,910
Our model is easy to understand and to interpret, it's the very opposite of a black box that just spits
90

91
00:08:38,910 --> 00:08:39,530
out an answer
91

92
00:08:39,690 --> 00:08:41,570
and we don't know where it came from.