0
1
00:00:00,870 --> 00:00:06,870
So now that we understand the intuition behind our regression algorithm, we can run and evaluate our
1

2
00:00:06,870 --> 00:00:08,250
regression.
2

3
00:00:08,250 --> 00:00:12,480
We will write the Python code to actually crunch the numbers.
3

4
00:00:12,480 --> 00:00:18,860
So I hope that at this point you still have your Jupyter notebook open and your session is still connected.
4

5
00:00:19,080 --> 00:00:23,940
The big advantage of using the online version of Jupyter notebook is that you can get started
5

6
00:00:23,940 --> 00:00:30,930
right away and you don't have to install anything, but if you're inactive for a while and you haven't
6

7
00:00:30,930 --> 00:00:36,150
been using it then it's possible that you can get disconnected and lose your work.
7

8
00:00:36,180 --> 00:00:38,690
So in that case you might see something like this.
8

9
00:00:39,510 --> 00:00:45,450
And it's important to remember that you can always save your work by saying "Download as" and then "Notebook"
9

10
00:00:46,140 --> 00:00:51,990
and you can always restore your work by going back to try Jupyter with Python and then simply uploading
10

11
00:00:52,290 --> 00:00:57,420
the Jupyter notebook that you downloaded previously and your data file.
11

12
00:00:57,420 --> 00:01:03,620
So if you upload those, then you can continue where you left off. In the next module,
12

13
00:01:03,660 --> 00:01:07,580
I will walk you through how to install Jupyter locally on your machine.
13

14
00:01:07,650 --> 00:01:13,760
You will only encounter this situation if you are trying it out using Binder through the web portal.
14

15
00:01:13,860 --> 00:01:20,460
Now without further ado, let's give our notebook the capability to run a regression. This capability, just
15

16
00:01:20,460 --> 00:01:26,780
like the others, is going to come from a module. In this case this module is going to be called scikit-learn.
16

17
00:01:26,820 --> 00:01:33,870
Scikit-learn is one of the most popular machine learning modules in Python and we can get hold
17

18
00:01:33,870 --> 00:01:41,850
of it in our Jupyter notebook simply by typing "import sklearn", but we're only looking for something
18

19
00:01:41,850 --> 00:01:48,720
very specific out of scikit-learn, so instead of importing all of scikit-learn, what we're gonna do instead
19

20
00:01:48,720 --> 00:01:56,910
is we're going to say from sklearn.linear_model and I hit Tab on my keyboard here to bring
20

21
00:01:56,910 --> 00:02:01,290
up this option, we're gonna import LinearRegression.
21

22
00:02:01,350 --> 00:02:07,560
Once again, I typed the first few characters there and then I hit Enter to insert the rest of the code
22

23
00:02:08,040 --> 00:02:15,360
and this avoids any sort of typos like, you know, having a lowercase r here for example. After we've
23

24
00:02:15,420 --> 00:02:21,770
added this line of code, let's hit Shift+Enter on our keyboard or click "Run" to run the cell.
24

25
00:02:22,230 --> 00:02:29,490
Alternatively, if you've opened this notebook from fresh, you might have to go to "Cell" and then "Run All".
25

26
00:02:30,030 --> 00:02:37,170
Let me add a few more cells here at the bottom, and then we can add the code to run our linear regression.
26

27
00:02:38,100 --> 00:02:44,640
The task of running a linear regression for calculating the slope of our line and the intercept is
27

28
00:02:44,910 --> 00:02:50,760
once more going to be done by an object; so we're gonna create this object and we're gonna give it a name.
28

29
00:02:50,820 --> 00:02:58,200
So I'm going to call it regression, and I'm going to set it equal to LinearRegression with some
29

30
00:02:58,200 --> 00:03:00,330
parentheses at the end.
30

31
00:03:00,330 --> 00:03:08,340
This bit of code here will create our object and we're gonna be storing it inside here so we can always
31

32
00:03:08,340 --> 00:03:12,580
refer to our linear regression by the name "regression".
32

33
00:03:12,600 --> 00:03:18,150
So now that we've created our object we can actually tell it to do something; we can tell it to run our
33

34
00:03:18,150 --> 00:03:24,750
regression. And the way we're gonna do that is simply by using the regression and then putting a dot
34

35
00:03:24,750 --> 00:03:31,770
after it and then writing "fit". Inside the parentheses we have to tell it two things: namely the features -
35

36
00:03:32,130 --> 00:03:39,840
our capital X, and our labels - our lowercase y. When we hit Shift+Enter on the cell it will crunch the numbers
36

37
00:03:40,290 --> 00:03:43,060
and, just like that, we've actually run our regression.
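The steps so far can be sketched as a single cell. This is a minimal sketch, assuming a tiny made-up dataset in place of the lesson's movie data; the numbers in X and y are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: invented stand-in data, not the lesson's movie dataset.
# X holds one feature column (think: production budget), y holds the labels
# (think: revenue). The values follow y = 2 * x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

regression = LinearRegression()  # create the object and give it a name
regression.fit(X, y)             # crunch the numbers: estimate slope and intercept
```

The estimated slope and intercept are stored on the regression object itself, which is why nothing needs to be assigned from the call to fit.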
37

38
00:03:43,110 --> 00:03:46,380
Now, I know we don't see any output here but trust me,
38

39
00:03:46,400 --> 00:03:52,380
the numbers have been crunched and to prove this we can pull up the slope coefficient and the intercept
39

40
00:03:52,590 --> 00:03:58,660
that were calculated by our regression. We can get hold of the slope through our regression object.
40

41
00:03:58,800 --> 00:04:07,110
So regression.coef and here I'm hitting Tab on my keyboard to bring up that menu there and then hit
41

42
00:04:07,230 --> 00:04:09,590
Enter to insert the code.
42

43
00:04:09,630 --> 00:04:14,790
Now you can, of course, type this out and you'll get the same result but just remember that there's an
43

44
00:04:14,850 --> 00:04:16,700
underscore here at the end.
44

45
00:04:16,770 --> 00:04:18,480
That's part of the name.
45

46
00:04:18,510 --> 00:04:23,950
So let me run the cell and let's take a look at what the slope coefficient is.
46

47
00:04:23,970 --> 00:04:27,630
So here we see that it's 3.11.
47

48
00:04:27,660 --> 00:04:33,840
Now one of the great things about Jupyter notebook is that we can add some markdown cells so I can add
48

49
00:04:33,840 --> 00:04:41,860
a cell here and I can change it to "Markdown" to add a little bit of explanation to what my code is doing.
49

50
00:04:41,970 --> 00:04:47,820
That way if I come back to this notebook in the future and I'm wondering what regression.coef_ does,
50

51
00:04:47,820 --> 00:04:53,550
I can look at the markdown cell above and I can leave myself a little note; for example, I can say slope
51

52
00:04:53,550 --> 00:04:59,050
coefficient and Shift+Enter and then I'll get this text inserted here.
52

53
00:04:59,070 --> 00:05:04,570
Now, if I wanted to leave myself little notes inside the actual cells where I've got my Python code,
53

54
00:05:04,800 --> 00:05:11,160
then I have to use something called a comment and I can add a comment here with a hashtag or a pound
54

55
00:05:11,160 --> 00:05:19,530
symbol and I can add this here and write, for example, theta_1 and you can see that the text
55

56
00:05:19,530 --> 00:05:25,530
here is green and that means that this text is considered to be a comment and will be ignored.
56

57
00:05:25,530 --> 00:05:32,130
It's not going to be treated as code and I can actually execute the cell and it'll run just fine, but
57

58
00:05:32,130 --> 00:05:37,170
if I were to delete this little symbol here - this pound symbol, and try to run it then I would get an error
58

59
00:05:37,200 --> 00:05:39,600
and it would tell me "invalid syntax".
59

60
00:05:39,780 --> 00:05:42,690
So that little pound symbol is very, very important.
60

61
00:05:42,690 --> 00:05:46,490
That's what's going to differentiate code from comments.
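To make the difference between code and comments concrete, here is a small sketch. The variable name theta_1 and the value 3.11 come from the lesson (rounded); the compile check is only a hypothetical way to show what happens to the same note once the pound symbol is gone.

```python
theta_1 = 3.11  # slope coefficient (rounded); Python ignores everything after the pound symbol

# Hypothetical check: the same note without the pound symbol is not valid Python.
note_without_pound = "theta_1 slope coefficient"
try:
    compile(note_without_pound, "<cell>", "exec")
    is_valid_code = True
except SyntaxError:
    is_valid_code = False

print(is_valid_code)
```

The first line runs fine because the note sits behind a pound symbol, while the bare note is rejected with a syntax error before anything executes.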
61

62
00:05:47,010 --> 00:05:49,440
OK, so we've got our slope coefficient.
62

63
00:05:49,440 --> 00:05:52,460
What about our intercept? Our intercept
63

64
00:05:52,470 --> 00:05:55,810
we can pull up similarly through this regression object.
64

65
00:05:55,860 --> 00:06:02,300
So "regression.", and then put "intercept" with the underscore at the end.
65

66
00:06:02,550 --> 00:06:08,970
Once again I hit Tab on my keyboard there - I'm not such a fast typer, and it inserted the code for
66

67
00:06:08,970 --> 00:06:09,910
me.
67

68
00:06:10,020 --> 00:06:12,650
So let's take a look at what the value of this intercept is.
68

69
00:06:12,840 --> 00:06:17,280
Here we can see that it's about negative 7.2 million.
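Reading the two parameters back can be sketched like this. The data below is an invented stand-in (not the movie dataset) built so that the slope is 2 and the intercept is 1, which makes it easy to check that coef_ and intercept_ return what we expect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: invented stand-in data that follows y = 2 * x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

regression = LinearRegression()
regression.fit(X, y)

slope = regression.coef_[0]        # theta_1: one coefficient per feature column
intercept = regression.intercept_  # theta_0: where the line crosses the y axis

print(f"slope: {slope:.2f}")
print(f"intercept: {intercept:.2f}")
```

Note the trailing underscore on both attribute names. If y is passed as a two-dimensional table with a single column instead of a flat array, scikit-learn returns coef_ as a nested array and intercept_ as a one-element array, so the values may appear wrapped in extra brackets.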
69

70
00:06:17,850 --> 00:06:21,660
So, now we know the regression slope and the intercept.
70

71
00:06:21,660 --> 00:06:26,840
And while it's not bad to have these two numbers to hand, we can actually do better than this.
71

72
00:06:26,880 --> 00:06:31,360
Wouldn't it be nice if we could just plot a line on our chart?
72

73
00:06:31,410 --> 00:06:35,070
Because we've painstakingly visualized our data.
73

74
00:06:35,070 --> 00:06:38,640
Why not just plot the regression line on this chart as well?
74

75
00:06:38,640 --> 00:06:39,990
So let's do just that.
75

76
00:06:39,990 --> 00:06:46,350
What I'm going to do is I'm gonna take this entire cell and I'm going to copy it and then I'm going
76

77
00:06:46,350 --> 00:06:49,810
to come down here and I'm going to paste it.
77

78
00:06:49,980 --> 00:06:53,440
And the reason I'm doing this is simply to have a reference.
78

79
00:06:53,460 --> 00:06:58,500
So here I'm going to have the chart without the line and here I'm going to modify my code here so I
79

80
00:06:58,500 --> 00:07:02,460
get the chart with the line. And you can move these cells around as well.
80

81
00:07:02,490 --> 00:07:09,820
So if I use this little up arrow here, then I can move it above this cell here. Wonderful!
81

82
00:07:09,820 --> 00:07:12,340
So, how do we plot a line on here?
82

83
00:07:12,790 --> 00:07:16,210
Well, we can use matplotlib once again.
83

84
00:07:16,210 --> 00:07:16,840
So, matplotlib
84

85
00:07:16,840 --> 00:07:20,450
has a functionality called plot.
85

86
00:07:20,590 --> 00:07:29,770
So plt.plot will allow us to plot a line on this chart, but we have to supply some information; we
86

87
00:07:29,770 --> 00:07:30,330
have to tell
87

88
00:07:30,410 --> 00:07:33,160
matplotlib what exactly to plot.
88

89
00:07:33,220 --> 00:07:34,990
And for that it will actually need two things.
89

90
00:07:35,000 --> 00:07:41,710
It will need the Xs and the y's so it'll need some information for where it should plot things on this axis,
90

91
00:07:41,710 --> 00:07:44,780
and where it should plot things on this axis. Now,
91

92
00:07:44,830 --> 00:07:47,140
production budgets are our feature.
92

93
00:07:47,170 --> 00:07:49,960
So for this axis we can use our Xs, right?
93

94
00:07:50,440 --> 00:07:56,920
So I'm going to put an X here, and then a comma and I have to supply the y value. And what should this
94

95
00:07:56,920 --> 00:07:57,510
be?
95

96
00:07:57,550 --> 00:08:02,820
We don't want to use the actual value that we have for the gross revenue.
96

97
00:08:02,860 --> 00:08:08,770
Instead, what we would like to do is we'd want to use the predicted value from our regression and we
97

98
00:08:08,770 --> 00:08:14,710
can get hold of those values by calculating a predicted value for each of the X values.
98

99
00:08:14,710 --> 00:08:19,640
To do that we'll need the regression object from scikit-learn that we used earlier.
99

100
00:08:19,660 --> 00:08:22,140
All we need to do is type regression.predict(X),
100

101
00:08:22,190 --> 00:08:28,960
then it will calculate a prediction for each of the budget values in our
101

102
00:08:28,960 --> 00:08:30,900
data and plot it on the graph.
102

103
00:08:30,920 --> 00:08:36,370
So let me hit Shift+Enter on the cell and let's take a look at what we've got scrolling down.
103

104
00:08:36,370 --> 00:08:39,520
I can see we've got a regression line right here.
104

105
00:08:39,580 --> 00:08:40,910
Fantastic!
105

106
00:08:40,910 --> 00:08:44,460
But, it'd be a bit nicer if it stood out a little more, right?
106

107
00:08:44,470 --> 00:08:45,910
Let's give it a color.
107

108
00:08:45,930 --> 00:08:47,100
Let's give it a width.
108

109
00:08:47,110 --> 00:08:53,540
Let's make it a bit thicker. And we can do that by going into this line of code here.
109

110
00:08:53,590 --> 00:09:00,100
And just before this ending parenthesis we can add a comma and then we're gonna add a color here.
110

111
00:09:00,100 --> 00:09:07,070
So I'm gonna say color is equal to red in single quotes and that will make our line red,
111

112
00:09:07,150 --> 00:09:13,050
as you can guess. And to make it thicker, we can specify the line width.
112

113
00:09:13,390 --> 00:09:22,030
So linewidth, all in lowercase in one word is equal to, let's try 4; if I hit Shift+Enter to refresh
113

114
00:09:22,040 --> 00:09:28,680
my cell, then I can see that my regression line now is a lot thicker and it's changed in color.
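Put together, the plotting cell might look roughly like the sketch below. The data is an invented stand-in for the movie dataset, and the figure size, transparency and axis labels are assumptions chosen for illustration; only plt.plot with X, regression.predict(X), color and linewidth follows the lesson directly.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: invented stand-in data, not the lesson's movie dataset.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.5, 5.5, 8.5, 9.5])

regression = LinearRegression()
regression.fit(X, y)

plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.3)  # the actual data points
# The line uses the same X values, but predicted y values instead of actual ones.
(line,) = plt.plot(X, regression.predict(X), color='red', linewidth=4)
plt.xlabel('Feature (stand-in for production budget)')
plt.ylabel('Label (stand-in for revenue)')
```

In a notebook the chart renders when the cell runs. Keeping the scatter call and the plot call in the same cell is what draws the line on top of the data points.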
114

115
00:09:28,690 --> 00:09:36,850
What we see now is, we see the relationship between our production budgets and our movie revenue as predicted
115

116
00:09:36,850 --> 00:09:39,080
by our linear regression model.
116

117
00:09:39,220 --> 00:09:46,810
And that means we can move on to the final part of our data science workflow, namely evaluating and analyzing
117

118
00:09:46,840 --> 00:09:48,740
our algorithm's performance.
118

119
00:09:48,790 --> 00:09:52,090
How did we do? Did we do a good job or a bad job?
119

120
00:09:52,090 --> 00:09:56,000
What can a movie's budget tell us about its revenue?
120

121
00:09:56,080 --> 00:10:00,250
This is the point where we have to think very, very hard about our model.
121

122
00:10:00,250 --> 00:10:06,010
The question that we should ask ourselves at this point is: are these parameters actually plausible?
122

123
00:10:06,010 --> 00:10:08,080
Let's take a look at that slope coefficient.
123

124
00:10:08,080 --> 00:10:10,220
It's 3.11.
124

125
00:10:10,270 --> 00:10:16,180
It means that there is a positive relationship between budget and revenue and not only that, it means
125

126
00:10:16,180 --> 00:10:21,790
that for each dollar that we spend on producing the movie, we should get around 3.1 dollars
126

127
00:10:21,790 --> 00:10:23,410
in revenue in return.
127

128
00:10:23,410 --> 00:10:27,180
And I actually think this seems to make sense.
128

129
00:10:27,340 --> 00:10:29,170
Bigger budget films tend to do better.
129

130
00:10:29,200 --> 00:10:34,740
So that's good news for us, right? Because we've put our life savings on the line for this zombie movie,
130

131
00:10:34,740 --> 00:10:35,070
right?
131

132
00:10:35,710 --> 00:10:38,850
Now, what about the other parameter? The intercept.
132

133
00:10:38,980 --> 00:10:42,230
This one is at -7.2 million.
133

134
00:10:42,250 --> 00:10:44,050
How do we interpret that?
134

135
00:10:44,320 --> 00:10:50,500
What this intercept is literally telling us is that a movie with a budget of zero would actually lose
135

136
00:10:50,740 --> 00:10:52,930
over 7 million dollars.
136

137
00:10:52,930 --> 00:10:55,140
So that's a bit problematic.
137

138
00:10:55,150 --> 00:10:56,120
Right?
138

139
00:10:56,140 --> 00:11:03,790
That seems quite unrealistic because if you and I went out to make a movie with a thousand dollars, it's
139

140
00:11:03,790 --> 00:11:09,340
pretty unlikely that seven million would just disappear from our bank accounts.
140

141
00:11:09,340 --> 00:11:12,090
So, this is a lot less realistic.
141

142
00:11:12,340 --> 00:11:16,300
And this raises the question - what should we conclude about our model?
142

143
00:11:16,300 --> 00:11:19,540
Well it means that we should actually take it with a grain of salt.
143

144
00:11:19,570 --> 00:11:26,210
We just have to accept that our model is a dramatic simplification of the real world.
144

145
00:11:26,410 --> 00:11:32,140
And as such we should be a little bit careful on how much we believe the predictions of our model, especially
145

146
00:11:32,140 --> 00:11:33,820
at the extreme ends.
146

147
00:11:33,910 --> 00:11:40,150
Just look at the distance here, look at the gap between this data point and our line - the predictions
147

148
00:11:40,240 --> 00:11:45,640
of our model seem to fit the data a lot worse at the extreme.
148

149
00:11:45,640 --> 00:11:51,700
So how would we use this model to make a prediction, anyhow? Say we need to predict the revenue for a film with
149

150
00:11:51,700 --> 00:11:53,860
a 50 million dollar budget.
150

151
00:11:54,040 --> 00:11:55,350
We know what our parameters are.
151

152
00:11:55,690 --> 00:11:58,740
We know what theta zero and theta one are equal to.
152

153
00:11:58,990 --> 00:12:04,090
And if we want to know what the revenue would be for a film with a budget of 50 million, all we would
153

154
00:12:04,090 --> 00:12:13,160
have to do is to substitute the values of our parameters that we estimated into this equation.
154

155
00:12:13,420 --> 00:12:19,300
So our X is going to be 50 million and if we do the math then we get our prediction, at least according
155

156
00:12:19,300 --> 00:12:20,070
to our model.
156

157
00:12:20,110 --> 00:12:20,700
Right?
157

158
00:12:20,920 --> 00:12:23,350
On a chart it would look something like this.
158

159
00:12:23,400 --> 00:12:28,560
We can draw a line up from 50 million and then predict how much,
159

160
00:12:28,650 --> 00:12:31,180
what it is that we're actually going to make off the movie.
160

161
00:12:31,180 --> 00:12:37,930
And it's gonna be a little bit less than three times the amount that we invested so about 148 million
161

162
00:12:37,930 --> 00:12:40,550
dollars - which is not bad, right?
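The substitution described above can be worked through in a couple of lines. This sketch uses the rounded parameter values quoted in the lesson (3.11 and about negative 7.2 million), so the result is only approximate; the variable names are chosen for illustration.

```python
# Assumption: rounded parameter values as quoted in the lesson.
theta_0 = -7_200_000  # intercept (about negative 7.2 million)
theta_1 = 3.11        # slope coefficient

budget = 50_000_000
predicted_revenue = theta_0 + theta_1 * budget  # the equation of the regression line

print(f"{predicted_revenue:,.0f}")
```

With these rounded inputs the prediction comes out at roughly 148.3 million dollars, a little under three times the budget, which matches the figure read off the chart.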
162

163
00:12:40,630 --> 00:12:42,360
But, how do we know if it's accurate?
163

164
00:12:42,400 --> 00:12:45,060
How can we measure how good our model is?
164

165
00:12:45,070 --> 00:12:51,340
So even though it is very very simplistic we can still ask the question of how much of the real world
165

166
00:12:51,340 --> 00:12:53,560
data it actually explains.
166

167
00:12:53,560 --> 00:12:55,650
And for that we need some kind of measure.
167

168
00:12:55,720 --> 00:13:01,650
We need some kind of statistic and the measure that we're going to look at is called R squared.
168

169
00:13:01,840 --> 00:13:09,460
Also called the goodness of fit. To look at our R squared we'll simply take our regression and we write
169

170
00:13:09,460 --> 00:13:12,160
regression.score.
170

171
00:13:12,580 --> 00:13:20,350
And within the parentheses we supply our capital X and our lowercase y - our feature and our labels and
171

172
00:13:20,350 --> 00:13:27,460
we hit Shift+Enter and here what we see is that the R squared is approximately 0.55.
172

173
00:13:27,850 --> 00:13:34,100
This number here is the proportion of the variation in film revenue that is explained by the film's budget.
173

174
00:13:34,270 --> 00:13:42,010
And I got to say that 55 percent is actually pretty good because think about it this way: our very simplistic
174

175
00:13:42,010 --> 00:13:48,580
model with a single feature, namely the production budget, can explain around 55 percent of the variation
175

176
00:13:48,580 --> 00:13:50,920
that we see in worldwide movie earnings.
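To see what the score actually measures, R squared can also be computed by hand and compared with regression.score. This is a minimal sketch on invented stand-in data (not the movie dataset), using the standard definition of one minus the residual sum of squares over the total sum of squares; the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: invented stand-in data, not the lesson's movie dataset.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.5, 5.5, 8.5, 9.5])

regression = LinearRegression()
regression.fit(X, y)

predictions = regression.predict(X)
residual_sum_of_squares = np.sum((y - predictions) ** 2)  # variation the line misses
total_sum_of_squares = np.sum((y - y.mean()) ** 2)        # total variation in y
r_squared_by_hand = 1 - residual_sum_of_squares / total_sum_of_squares

r_squared = regression.score(X, y)  # same quantity, computed by scikit-learn
print(r_squared_by_hand, r_squared)
```

The two numbers should agree up to floating-point rounding. A value close to 1 means the line accounts for most of the variation in the labels, while a value close to 0 means it accounts for very little.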
176

177
00:13:51,340 --> 00:13:57,220
I'd say that's pretty good news for a first try but, of course, we should be a little bit cautious
177

178
00:13:57,220 --> 00:14:01,490
reading into this model too much, because we've still got a lot to learn.
178

179
00:14:01,570 --> 00:14:08,230
For example how would our model do if we added more features, like how long it took to make or if it's
179

180
00:14:08,230 --> 00:14:09,370
a sequel?
180

181
00:14:09,370 --> 00:14:10,960
Would we get more realism?
181

182
00:14:10,960 --> 00:14:14,560
Would it make our model perform better and make our predictions more accurate?
182

183
00:14:14,800 --> 00:14:21,730
And perhaps we should evaluate our model, not just on the data that we used for training it, but on new
183

184
00:14:21,730 --> 00:14:29,110
data, data that it hasn't seen yet and also, what if the relationship that we have here is actually non-linear?
184

185
00:14:29,140 --> 00:14:32,770
What if we somehow need to transform the data to get a better fit?
185

186
00:14:33,280 --> 00:14:39,820
So in a way our analysis has left us with a lot more questions that we should investigate and we will
186

187
00:14:39,820 --> 00:14:43,000
do just that in the upcoming modules.
187

188
00:14:43,000 --> 00:14:48,670
Well, you got started on the first project and you went through the whole data science workflow; you've
188

189
00:14:48,670 --> 00:14:53,740
gathered the data, you've cleaned the data, you've visualized it and then you ran a machine learning algorithm
189

190
00:14:53,950 --> 00:14:58,690
and then you've evaluated the results and you've even made a prediction, but this is only the start.
190

191
00:14:59,230 --> 00:15:05,230
We've got a whole lot more ground to cover and we'll dive deep into a lot of these concepts and the
191

192
00:15:05,230 --> 00:15:11,140
techniques that we've introduced here. In the next module we're going to install Jupyter and we're also
192

193
00:15:11,140 --> 00:15:13,500
going to learn a little bit more about regression.
193

194
00:15:13,990 --> 00:15:17,550
But the real focus will be on learning more Python programming.
194

195
00:15:17,740 --> 00:15:23,290
From there, we're going to learn about gradient descent and how optimization works for many machine learning
195

196
00:15:23,290 --> 00:15:25,570
algorithms that we're going to encounter.
196

197
00:15:25,570 --> 00:15:31,120
And after that we're going to use multivariable regression and predict some real estate prices in Boston.
197

198
00:15:31,660 --> 00:15:38,710
From there we're going to build an actual spam filter from scratch using a Naive Bayes classifier and
198

199
00:15:38,710 --> 00:15:40,280
then we're gonna take it up a notch,
199

200
00:15:40,290 --> 00:15:44,690
and we're gonna dive into deep learning with neural networks and TensorFlow.
200

201
00:15:44,800 --> 00:15:47,920
So for all of that and more, I'll see you in the next lessons.