So we're slowly coming up to the best part, namely the part where we're about to run our gradient descent algorithm on our mean squared error cost function. But before we dive into the Python code and calculate in which direction our algorithm should move, we need to work out the slope of our cost function first, namely our gradient. And this is where I've got some great news for you, because we just have to apply some of the same calculus tricks that we've covered so far, and the partial derivatives that are coming up are really not that hard to figure out. So let's dive in.

Now, you'll recall that our mean squared error function looks like this: "(1/n) * sum((y - y_hat)^2)". So, actual values minus predicted values, squared; you sum them all up and you take the average. But the thing is, since we're running a very simple linear regression with one variable only, namely x, our y hat actually takes the form "theta_0 + theta_1*x". This is the linear regression model that we're using currently. It's got one variable and two parameters: theta_0 and theta_1.

So what does this mean for our mean squared error? Well, if we take our equation and simply substitute our linear regression model into it for y hat, we get something like this. And by removing those inner parentheses, we can simplify it to the following form: our mean squared error for this particular linear regression model actually looks like "(1/n) * sum((y - theta_0 - theta_1*x)^2)".

But you know what, we can take this even further. Check it out: what we're going to do now is write out all the terms in this equation. So this is the opposite of simplifying it, but it's going to make calculating our partial derivatives a lot easier. If we multiply out all the terms in this equation, we get quite a few terms, starting with y squared. Of course, there are a few terms in this long list that we can combine, so it gets a bit shorter once we simplify a little. Now, I know that doesn't look very pretty, but the great thing about having the equation written out like this is that we can calculate the partial derivatives very, very easily.

So what I want to do is start this lesson out with a challenge.
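For reference, here's a sketch of that fully written-out mean squared error, with the little (i) superscripts on x and y left off, just as on the slide:

```latex
MSE = \frac{1}{n} \sum \big( y - \theta_0 - \theta_1 x \big)^2
    = \frac{1}{n} \sum \big( y^2 - 2\theta_0 y - 2\theta_1 x y
                           + \theta_0^2 + 2\theta_0 \theta_1 x + \theta_1^2 x^2 \big)
```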
I would like you to get out pencil and paper, apply the power rule to the equation above, our mean squared error equation, and take the partial derivative with respect to theta 0, so our intercept. I'll give you a few seconds to pause the video before I show you the solution.

Ready? Here we go. Looking at this equation, the first thing you'll notice is that there are quite a few terms that don't depend on theta 0, namely y^2, 2*theta_1*x*y and theta_1^2*x^2. For a partial derivative, these terms are treated as constants and they drop out of the equation. This is what we've talked about before. So what are we left with? Well, we're left with the following sum: "-2y + 2*theta_0 + 2*theta_1*x". And this comes simply from applying the power rule that we covered in a previous lesson. Looking at this, we can simplify it a little bit to make it look a bit prettier. The first thing I'm going to do is factor out the 2 that all three terms in the sum have in common. In fact, I'm actually going to factor out a -2, which leaves "(-2)*(y - theta_0 - theta_1*x)", and then I can simply move this constant outside of the sum. And that's really it. That's the partial derivative with respect to theta 0.

One thing you might have noticed is that I've left out the little i's in the superscript in this derivation; I'm going to put those back in now. I left them out earlier because otherwise the notation would have gotten too busy on the slide.

So now that we've worked out the partial derivative with respect to our first parameter, we can work out the partial derivative with respect to our second parameter, namely theta 1. Once again, I'm going to pose this as a challenge, because once you've worked out this one, working out the other is very, very similar. You go through exactly the same steps, but you'll get a slightly different result. I'll give you a few seconds to pause the video and scribble this down with pencil and paper.

Ready? Here's the solution. The equation that you get at the end, when you go through all the steps and simplify, will look something like this.
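Sketched out side by side, with the (i) superscripts restored and the sum taken over all n data points, the two results look like this:

```latex
\frac{\partial MSE}{\partial \theta_0} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big)

\frac{\partial MSE}{\partial \theta_1} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big) \, x^{(i)}
```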
It'll be very, very similar to the partial derivative with respect to theta 0, except that you're multiplying the entire thing by the x values at the end of the sum.

With these two equations in front of us, we can now add them to the Jupyter notebook. Once again, the first thing I'm going to do is add a section heading with some markdown and our LaTeX equations. This section heading I'm going to call "Partial Derivatives of the MSE w.r.t.", then I'm going to add a dollar sign, a backslash, write "theta_0", close it with another dollar sign, and add "$\theta_1$" as well. Using the dollar signs, I'm including some LaTeX notation inline in our section heading, and it's going to look like this when I press Shift+Enter.

But let's add our partial derivatives in LaTeX notation as well. I'm going to add two hashtags, that is, two pound symbols, then two dollar signs, and write our fraction. It's going to be "\frac{}{}", and within the first pair of curly braces I'm going to write "\partial MSE", and in the second pair of curly braces I'm going to write "\partial \theta_0". That whole thing is going to be equal to something, but before I add that bit, let's take a quick look at what this looks like. I'm going to add my two dollar signs at the end, press Shift+Enter, and there I can see my fraction with the partial derivative symbols in front.

Okay, so just so we have our equation in the Jupyter notebook as well, let's write it out here together. It's going to be a minus sign, then "\frac{2}{n}", a space, then "\sum_{i=1}^{n}", a caret and curly braces for the upper limit, and then "\big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big)". Let's see what this looks like. I might say that's pretty spot on. I'm just going to click inside here and add the second one as well. This is the easy part. We copy what we've just written, paste it below, change the theta_0 to theta_1, and then at the end we add "\big( x^{(i)} \big)". That's it. Now we've got our partial derivative equations displayed beautifully in the Jupyter notebook.
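If you want something to check your cell against, the finished markdown cell might look roughly like this. The exact heading level and wording are my assumption here; the two equations themselves are the ones we just dictated:

```markdown
# Partial Derivatives of the MSE w.r.t. $\theta_0$ and $\theta_1$

## $$\frac{\partial MSE}{\partial \theta_0} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big)$$

## $$\frac{\partial MSE}{\partial \theta_1} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big) x^{(i)}$$
```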
So one thing I'll note here is that these partial derivatives are going to depend on what kind of equation we're using for y hat. At this point, we're using linear regression with one variable, so we substituted that in and then derived our partial derivatives from it. If we had a different model, say linear regression with two variables, or something that estimates our y hat a little differently, then we'd simply substitute that equation into our mean squared error and do the same derivation, if we're so inclined. This means the mean squared error cost function lends itself very well to all sorts of regression problems, and it adapts very, very well to all kinds of models as well.

So, having written out the partial derivatives in this form, we can create a function that calculates the slopes of the parameters in Python code. I'm going to add a little section heading here, call it "MSE & Gradient Descent", and then add that Python function below it. What I'm going to do is create a function called grad: I'm going to write "def grad()" and give it three inputs, the x values, the y values and an array of thetas, then add a colon, and inside the body of this function I want to work out these two partial derivatives. So what are my inputs here? My inputs are the x values, so the data; the y values, which are also data; and then an array of theta parameters. These are the bits that we're actually optimizing in our gradient descent algorithm. This array is going to have theta 0 at index 0 and theta 1 at index 1. So this is going to be my function. The number of samples, n, I can work out by saying "n = y.size". This function is going to receive a whole list of y values, and by calling y.size I can work out how many samples were given to it.

Now, as a challenge, can you create two variables, theta0_slope and theta1_slope? What I want you to do is translate these LaTeX equations into Python code. I'll give you a few seconds to pause the video and work this out.

Ready? Here's the solution. "theta0_slope" is going to be equal to "(-2/n)*sum(y - thetas[0] - thetas[1]*x)".
We're expecting that this function will receive an array of theta parameters, with theta 0 at index 0, which is what we're using in the first term, and theta 1 at index 1, which is what we're using in the second. Working out theta1_slope is going to be trivial, because I can just copy this line, change the name, add another set of parentheses around the contents of the sum and multiply the whole thing by x again. That way I capture the extra x term in the equation. So that's really it.

The only thing left to do is output these values, and we're going to output them as an array. I'll show you three ways we can do this; it's a little bit of a review. First, we can write "return np.array([theta0_slope])", and because theta0_slope is an array as well, we just have to pull out its first value, then write a comma, then theta1_slope, and grab the first value of that too. That's one way to do it: we've calculated these two things separately, so we combine them into an array like so. But I'm going to comment this out and show you a second way. We can also return "np.append()", where the first argument is our theta0_slope and the second argument is our theta1_slope. That's the second way: we append one array to the other and return that, combining the two pieces of data that were calculated separately. The last way I want to show you is with the concatenate function, which also comes from numpy. It'll be "np.concatenate()" with another set of parentheses inside, where we supply theta0_slope, comma, theta1_slope, and then how we're going to concatenate them, namely along the rows, so "axis=0". So these are three ways you can write the Python code to achieve the very same output.

Now it's time to run our gradient descent and actually call this function. I hope I didn't make any typos, so let's do that now. This is where the rubber meets the road, as they say.
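Before we run it, here's a sketch of what the finished grad() function might look like, pulled together from the steps above. It assumes, as in the earlier lessons, that x and y arrive as NumPy column vectors of shape (n, 1), which is why each slope comes out as a one-element array that we index into; the commented-out lines show the np.append and np.concatenate alternatives.

```python
import numpy as np

def grad(x, y, thetas):
    """Slopes of the MSE with respect to theta_0 and theta_1.

    x, y   -- the data (assumed here to be (n, 1) NumPy arrays)
    thetas -- array holding [theta_0, theta_1]
    """
    n = y.size  # number of samples

    # Partial derivative of the MSE with respect to theta_0 (the intercept)
    theta0_slope = (-2 / n) * sum(y - thetas[0] - thetas[1] * x)

    # Partial derivative with respect to theta_1: same sum, multiplied by x
    theta1_slope = (-2 / n) * sum((y - thetas[0] - thetas[1] * x) * x)

    # Three equivalent ways to return both slopes as one array:
    # return np.append(arr=theta0_slope, values=theta1_slope)
    # return np.concatenate((theta0_slope, theta1_slope), axis=0)
    return np.array([theta0_slope[0], theta1_slope[0]])
```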
I'm going to set my multiplier to 0.01 and set my initial guesses, so my thetas, equal to an np.array where the initial guesses are 2.9, comma, 2.9, all in square brackets. Then our gradient descent is going to look like this: it's going to be "for i in range", we're going to run this a thousand times, then a colon, and in the body of our for loop we're going to have some very terse Python code that calculates our gradient and updates our thetas array all in one go. It's going to be "thetas = thetas - multiplier*grad()". This is where we're calling our function, so now we have to supply our data, right? "x_5, y_5", the data that we generated earlier, and then the last input is going to be our thetas array, just like that.

After our loop has run, we're going to print out the results. This is where we can check whether the whole thing actually works, and where I'll find out if I made any typos along the way. So I'm going to print "Min occurs at Theta 0: ", comma, and then "thetas[0]", because that's where our intercept is going to live: at index 0, the first value in our thetas array. Let's print out the minimum at theta 1 as well; that one is going to be a print statement with "Min occurs at Theta 1:", comma, "thetas[1]". And finally, we're going to print out our mean squared error, using that mse function we created earlier. So I'm going to say "print('MSE is:', mse())", and inside the function call we have to supply two things, remember? y_5, so the actual y values, and then y_hat. What's y_hat? Well, after our loop runs, it's going to be "thetas[0] + thetas[1]*x_5", so our x data.

All right. We've just written a whole bunch of code without testing it for a little while, so let's see if it works. I'm going to hit Shift+Enter now, and I'm pleasantly surprised. After a thousand iterations, we get a theta 0 value of about 0.85, a theta 1 value of about 0.122 and a mean squared error of approximately 0.95. This very much ties in with all the calculations we've done previously, so we've definitely done this correctly.
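For reference, here's roughly what that cell looks like as code. It leans on the grad() sketch above and on x_5, y_5 and the mse() function from the earlier lessons, so treat those names as assumptions carried over rather than something defined here.

```python
import numpy as np

multiplier = 0.01                # the learning rate
thetas = np.array([2.9, 2.9])    # initial guesses for theta_0 and theta_1

# Run gradient descent: calculate the gradient and update both thetas
# in one go, a thousand times over.
for i in range(1000):
    thetas = thetas - multiplier * grad(x_5, y_5, thetas)

print('Min occurs at Theta 0:', thetas[0])
print('Min occurs at Theta 1:', thetas[1])

# y_hat for the final thetas, fed into the mse() function from earlier
print('MSE is:', mse(y_5, thetas[0] + thetas[1] * x_5))
```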
We've worked out the partial derivatives of our cost function and then run our gradient descent algorithm. That algorithm started pretty far off, at 2.9 for both our theta 0 and theta 1 values, and then, having run a thousand times in that for loop, it converged on the values that minimize the mean squared error, that minimize our cost function. So this is brilliant. Now all that's left to do is to plot it.