Welcome back. In this lesson we're going to talk about one of the key inputs to the gradient descent function, and this is the learning rate. So let's add a big section heading in a markdown cell. I'm going to click on this cell here, change it from Code to Markdown, add one pound symbol and then write "The Learning Rate".

Scrolling back up to where we've defined our gradient descent function, let's take another look at our inputs. Previously we modified our Python code so that the inputs would include not only the derivative function and the initial guess, but also a multiplier, a precision and the maximum number of iterations. So far we've changed up the initial guesses and analyzed the impact that this had on the algorithm in different situations, on different cost functions. But we've not really messed with the multiplier, and this is what we're going to do now.

So what does the learning rate actually do in our algorithm? If we look at our update step, namely this line right here, we can see that our learning rate, which I've called multiplier, is multiplied by the gradient. The gradient is the value of the slope of the cost function and the multiplier is just a constant, but together they determine how big of a step we take. If you remember, when the slope was steep, the gradient was a large number and we took a big step. Similarly, if the multiplier is large, we also take a big step.

At the moment we've got the multiplier set to a default value of 0.02. Now, this might be a good time to pause the video and think about what would happen if we picked a different value. What would happen if this value was really, really small? And what would happen if the multiplier value was really, really large?

Now I'm going to illustrate the effect of the multiplier using our second example. This was the g(x) function. I'm going to take this cell right here - the cell that generated our graphs and plotted our scatter plot for gradient descent on the g(x) function. I'm going to copy this cell with the shortcut, scroll all the way down to our learning rate section and paste the cell below. I probably don't need the cell up here right at the moment, so I'm just going to take this one, move it up and modify it a little bit.
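To keep that update step in view while we play with the learning rate, here is a minimal sketch of the kind of gradient descent function being discussed. Apart from list_x and the 0.02 multiplier default, the names and values here (the stand-in derivative dg, the precision and max_iter defaults, the unpacked variable names) are assumptions for illustration and may not match the notebook exactly.

```python
def dg(x):
    # Hypothetical derivative of a cost function g(x); a stand-in for the
    # notebook's actual dg(x) so this sketch runs on its own.
    return 4 * x**3 - 8 * x

def gradient_descent(derivative_func, initial_guess, multiplier=0.02,
                     precision=0.001, max_iter=300):
    new_x = initial_guess
    x_list = [new_x]
    slope_list = [derivative_func(new_x)]

    for n in range(max_iter):
        previous_x = new_x
        gradient = derivative_func(previous_x)

        # The update step: the learning rate (multiplier) times the gradient
        # decides how big a step we take.
        new_x = previous_x - multiplier * gradient

        x_list.append(new_x)
        slope_list.append(derivative_func(new_x))

        # Stop early once the step size drops below the requested precision.
        if abs(new_x - previous_x) < precision:
            break

    return new_x, x_list, slope_list

# Example call with the default learning rate of 0.02:
local_min, list_x, deriv_list = gradient_descent(dg, initial_guess=1.9)
print('Number of steps:', len(list_x))
```

The only line that matters for this lesson is `new_x = previous_x - multiplier * gradient`: the multiplier scales the gradient, so the two of them together set the step size.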
I'm going to add some print statements below our charts at the bottom. The first print statement I'm going to add is the number of steps, and if you recall, this was the length of this list - the length of list_x. So we'll write "len(list_x)". The other thing that I'm going to modify is the initial guess that we've got, so I'm going to change it to 1.9, and I'm also going to add a multiplier here. I'll add "multiplier=", but I'm not going to change the default value just yet - let's just have it at 0.02.

And since I haven't run my notebook in a little while, I'm actually going to go to "Cell" and "Run All" instead of running just my latest cell. I'm going to have to scroll all the way down to the bottom, and there's our output. So it works as expected: if I start at 1.9, which is right here, then the first step that we take is fairly large, right here, and then we slowly move down to our minimum, and at the moment it takes only 14 steps to get to the bottom. Okay, so this works as expected.

Now let's have a little bit more fun with this. I'm going to change our function call to have a maximum of five iterations, so I only want our loop to run five times, and then I'm also going to change our multiplier from 0.02 to 0.25. Then I'm going to rerun the cell and we can have a look at our chart. So what are we seeing here now? Our multiplier is increasing our step size, and we see our algorithm bouncing around on this function. This is very, very different behavior from what we've seen before.

But let's take this to an extreme. I'm going to scroll back up to our function call, change the maximum number of times that our loop will run from 5 to 500 and hit Shift+Enter. Scrolling down, we see the whole chart is turning red. Yeah, we're bouncing around all over the place, but our algorithm is never converging. In fact, our loop at this point has run 500 times. In contrast, earlier we were at, what, 14 steps. So this is very interesting, right? What can we learn from this example? In almost every situation that we've illustrated previously, we converged within something like 20 to 60 steps, remember?
This time our loop ran 500 times and the algorithm still didn't find the minimum. So what we're seeing here is that if we're not careful, we can get into a situation where our algorithm isn't converging and it just might continue going and going and going. And this kind of brings us back to our discussion about Python for loops and Python while loops. Let me scroll back up to where we wrote that code. Here we go.

Now, you might remember that with a while loop you, as the developer, have to take extra care with your terminating condition. What's the terminating condition? It's whatever follows the while keyword, right here. If the logic in this terminating condition isn't formulated well, then it's very easy to get into a scenario where you never exit the loop, and your Python program accidentally ends up in an infinite loop. In other words, with a while loop you as the programmer need to think of the corner cases and include the logic to stop your loop from running longer than you intended.

I know that in this example, if you're looking at this code, this seems really, really trivial, but infinite loops happen maybe more often than you'd expect. Let me show you an example. We could have written our gradient descent function with a while loop instead of a for loop. This is our gradient descent function as it currently stands - with a for loop and a cutoff point determined by the maximum number of iterations. And this is how one could imagine running our gradient descent with a while loop.

At first glance this while loop would seem to make sense. The condition for running the loop is: run the loop as long as the step size is greater than the precision. The step size is always the difference between the new x and the previous x. So at some point that calculation would get more and more precise, at which point we would exit the loop. Fair enough, but this is exactly the kind of code that risks running into the infinite loop problem, because in the case where we're not converging and the step sizes between the values get larger, this loop just continues running.

So I'm not going to keep this cell around - I'm going to go "Edit" > "Delete Cell". And this is why the for loop provides quite a big contrast, right? It's got this safety net, by forcing you to explicitly state the number of times it will run ahead of time.
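For reference, a while-loop version along the lines described above might look something like this - a minimal sketch assuming the same kind of inputs as the earlier gradient descent sketch, not the notebook's exact code:

```python
def gradient_descent_while(derivative_func, initial_guess,
                           multiplier=0.02, precision=0.001):
    # While-loop variant: keep going as long as the step size is bigger
    # than the precision. If the algorithm diverges, the step size never
    # shrinks below the precision and this loop never exits.
    new_x = initial_guess
    step_size = precision + 1  # make sure the loop runs at least once

    while step_size > precision:  # the terminating condition
        previous_x = new_x
        new_x = previous_x - multiplier * derivative_func(previous_x)
        step_size = abs(new_x - previous_x)

    return new_x
```

With a sensible learning rate this behaves like the for-loop version, but with something like multiplier=0.25 on g(x) the step size never drops below the precision, so the call would simply never return - which is exactly the safety net the for loop gives you for free.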
Now let's crank up that learning rate even more. I'll scroll back down here, increase our learning rate from 0.25 to 0.3, leave everything else the same and rerun this thing. See what happens. This is our old friend, the overflow error: the result is too large. We've shot off to infinity and beyond.

What this little exercise is showing us is that there's another quirk we have to be aware of with our optimization algorithms: we, as the machine learning experts, have to choose an appropriate learning rate. So that begs the question - how do we know what the right learning rate actually is? Because that's what we've seen: if we pick a very large learning rate, then our algorithm doesn't converge, and if we pick a very, very small learning rate, then our algorithm might actually take forever. What do I mean by forever? Let me show you.

I'm going to scroll back up to where we've called our function and change this again to something that doesn't crash our Python program, so 0.02, and then I'm going to copy this bit of code right here. That's the code where we're calling our function and the code where we're plotting our first chart. I don't like seeing that error, so I'm going to rerun the cell, and then I'm going to paste our code down here. Now I'm going to change this comment here - I'm going to call it "Run gradient descent 3 times".

But before we run it three times, let's run it one time. So here's what I'm going to do. Instead of having the max iterations set to 500 here, I'm going to set this equal to a variable. I'm going to say "n=100", and then instead of having our max iterations set to 500, I'm going to set it equal to n. So I'm going to specify the maximum number of iterations up here. Then I'm going to change the multiplier from 0.02 to 0.0005, and then I'm going to add a precision - I'm going to make the calculation quite precise, so precision should be equal to 0.0001 - comma, and then have the maximum number of iterations back here. And for the initial guess we'll start off at 3. So this is going to be our first call to the gradient descent function.
But instead of having this sequence unpacking code here, we're going to store all of this information in a single tuple. This is going to be called low_gamma - gamma is often used for the learning rate, so we'll call our tuple low_gamma.

Now let's change up the code for our plot. I'm going to change the comment here to "Plotting reduction in cost for each iteration", because that's now going to be our goal. In terms of the figure size, we're going to go with a single plot, so I'm going to size this plot differently from before - I'm going to make it 20 by 10, so it's going to be quite large. Then I'm going to delete this subplot code here; we don't need that. For the axes, on the y axis we're going to go from 0 to 50 - this is going to be our cost; cost is going to be on the y axis, we'll see this in a bit. And our x axis is going to go from 0 to the number of iterations, so from 0 to n, and n is going to be a hundred - we're going to have the iterations on our x axis. The title of the chart is going to be "Effect of the learning rate", for the x label we'll have the number of iterations, and for the y label we're going to have the cost.

Now we need the data to populate our chart, so we need the values for our chart, and we need two things. The first thing is going to be what we have on our y axis. That's going to require us to convert the lists to numpy arrays, the reason being that we can feed an array into our g(x) function, but we cannot feed a list into our g(x) function. We've done this previously. I'm going to call our first array low_values and set that equal to np.array(low_gamma[1]) - it was the second item in our tuple. Why did I just put a one there? Because the second item is at index 1, because we start counting from 0: the first item is at index 0, the second item is at index 1. Okay, so that's our y axis data.

Time to get our x axis data. Now we just need our x axis to go from 0 to, like, 100, right?
So what we're going to do is create a list from 0 to n+1. Why n+1? Because we've got that extra initial guess. So even though our loop is going to run 100 times, our extra guess means we should have 100 plus 1 values. And I'm going to store this information in a variable called iteration_list.

So how did we create a list in the past? Let me scroll down here a little bit. In the past we had our square brackets and we said 0, 1, 2, 3 - right, we just populated the list with values. But this isn't what we're going to do, because we're not going to type out 100 different values in this list. Instead of creating the list manually, we're going to make use of the range function that we saw in the for loop. Our range has a starting value and an ending value, so it's going to create all the values from 0 up to n for us. But we can't do it exactly like this. The reason is that if I press Shift+Tab on this function, I can see that this thing here will actually spit out a range object. So the range function will give us a range object, but what we need is a list. How do we convert a range object to a list? Well, we can call the list function and nest the call to the range function inside our call to the list function. So now we'll have a list starting at 0 and going up to n, and that's going to be stored inside our variable called iteration_list.

And this is what we're going to take and put right here on our plot. So I'm going to put that here, and then for the y axis on this plot we're going to use, well, low_values - our array of values for our cost function. We also don't have to stick to the blue color; there are a lot of colors available, including light green for example, and we can make the line thicker - change the line width from 3 to 5 - and we're going to get rid of the alpha as well. Now I'm going to comment out the plt.scatter code and just show our plot for a change. So I'm going to say plt.show, parentheses at the end, and hit Shift+Enter to see what we get. Okay, so we get a nice line plot right here, just like this.
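Here is a rough sketch of the whole cell as described so far: the low-learning-rate run, the iteration_list and the line plot. It reuses the assumed gradient_descent and dg from the earlier sketch, and the cost function g(x) below is a hypothetical stand-in chosen only so the example runs; in the notebook you would of course use its own g(x) and dg(x).

```python
import numpy as np
import matplotlib.pyplot as plt

def g(x):
    # Hypothetical cost function standing in for the notebook's g(x).
    return x**4 - 4 * x**2 + 5

n = 100  # maximum number of iterations

# Run gradient descent with the low learning rate and keep the whole
# returned tuple together instead of unpacking it.
low_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                             multiplier=0.0005, precision=0.0001, max_iter=n)

# Y axis data: the list of x values sits at index 1 of the tuple; convert
# it to an array so it can be fed into g(x).
low_values = np.array(low_gamma[1])

# X axis data: 0..n, i.e. the n loop runs plus the extra initial guess.
# (At this low learning rate the loop uses all n iterations, so the lengths match.)
iteration_list = list(range(0, n + 1))

# Plotting reduction in cost for each iteration
plt.figure(figsize=[20, 10])
plt.xlim(0, n)
plt.ylim(0, 50)
plt.title('Effect of the learning rate')
plt.xlabel('Nr of iterations')
plt.ylabel('Cost')

plt.plot(iteration_list, g(low_values), color='lightgreen', linewidth=5)
plt.show()
```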
So this is cool, but you know what? We're going to take our scatter functionality and use it as well. That way we get a little bubble each time our loop was run, so we can see the step size a bit more clearly. Our x axis is going to be the iteration_list again, and for our y axis it's going to be our low_values array. I'm going to change it to the same color as our line plot, so it's going to be light green, and for the dot size we'll change it from 100 to 80 and get rid of the alpha as well. So let me rerun this and see what it looks like. Aha! This is pretty good. We have a larger step size in the beginning, getting smaller and smaller until the steps are very, very close together at the end. So what we're looking at here is the decrease in the cost with each iteration of our loop.

I think this is really, really neat, but it'll be even neater if we can plot two different learning rates - or three different learning rates - on the same chart next to each other and see how they compare. So I'm going to scroll back up here and correct the typo in my title so that it reads "Effect of learning rate" with a space. I'm going to add a little comment here as well; I'm going to say "Plotting low learning rate". Then I'm going to scroll back up to the beginning and put my cursor right here.

Now, as a challenge, can you plot two more learning rates on our chart? So pause the video, add a tuple called mid_gamma and keep all the inputs to the call to the gradient descent function the same, but change the multiplier to double the learning rate, to 0.001. Then create another tuple called high_gamma and change the learning rate there to 0.002. You're going to have to extract the values from the tuples that you get back and throw those onto our chart. I'll give you a few seconds to pause the video and give this a try.

And here's the solution. The quickest way to do this is to copy this code here, paste it two times, change the name of this tuple to mid_gamma, change the name of this tuple to high_gamma, and then this multiplier is going to be 0.001.
This multiplier is going to be 0.002. Then we're going to come down here, take this code right here, copy it, paste it twice, and I'm going to change my comments to "Plotting mid learning rate" and "Plotting high learning rate". Then, instead of feeding the low values in here, I'm going to make a call to np.array and put in our tuple: mid_gamma[1]. I'm going to take this, copy it, put it here as well: mid_gamma[1]. And this is going to be np.array(high_gamma[1]), and the same with this one - it's going to be np.array(high_gamma[1]).

But one thing that I'm going to change as well is the color. I don't want all of these graphs to have the very same color, otherwise I can't tell them apart on the chart. One nice color to contrast with the light green is steel blue - this is going to be for the slightly higher learning rate - and then for the highest learning rate on this chart I'm going to pick my favorite color: hot pink. Why not? Now again, I'm not making these color names up; I'm taking them from the official matplotlib documentation. These are the names that I can put into that argument, so there's a predefined set of names that you can use, and the spelling has to match exactly, otherwise they won't be recognized.

Okay, so I've got my charts. Now, the proof is in the pudding, as they say. Hit Shift+Enter and let's see what we get. I think this chart is beautiful, because it illustrates so nicely how our learning rate affects our algorithm. We've run our gradient descent three separate times with different learning rates, and now we can see how the cost decreases with each iteration of the loop. What we're seeing here is that the highest learning rate, the one in hot pink, converges the fastest. In other words, the higher the multiplier, the faster the convergence. And in contrast, we've got the really low multiplier converging very, very slowly towards the minimum.
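For reference, here is a hedged sketch of what the finished comparison cell might look like, again reusing the assumed gradient_descent, dg and g from the earlier sketches; the notebook's actual cell may differ in the details.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 100

# Run gradient descent 3 times with different learning rates (gamma)
low_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                             multiplier=0.0005, precision=0.0001, max_iter=n)
mid_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                             multiplier=0.001, precision=0.0001, max_iter=n)
high_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                              multiplier=0.002, precision=0.0001, max_iter=n)

# Plotting reduction in cost for each iteration
iteration_list = list(range(0, n + 1))
plt.figure(figsize=[20, 10])
plt.xlim(0, n)
plt.ylim(0, 50)
plt.title('Effect of learning rate')
plt.xlabel('Nr of iterations')
plt.ylabel('Cost')

# Plotting low learning rate
plt.plot(iteration_list, g(np.array(low_gamma[1])), color='lightgreen', linewidth=5)
plt.scatter(iteration_list, g(np.array(low_gamma[1])), color='lightgreen', s=80)

# Plotting mid learning rate
plt.plot(iteration_list, g(np.array(mid_gamma[1])), color='steelblue', linewidth=5)
plt.scatter(iteration_list, g(np.array(mid_gamma[1])), color='steelblue', s=80)

# Plotting high learning rate
plt.plot(iteration_list, g(np.array(high_gamma[1])), color='hotpink', linewidth=5)
plt.scatter(iteration_list, g(np.array(high_gamma[1])), color='hotpink', s=80)

plt.show()
```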
Now, if we picked an even lower multiplier, it would reach the minimum even more slowly. But as we've shown before, a higher multiplier is only better up to a point, right? Remember how in our earlier example our entire graph was red? Clearly, with a really high multiplier you either get an overflow error or you don't converge on the minimum. So higher multipliers don't always work out.

In fact, let's plot that crazy behavior that we had earlier on this chart as well. I'm going to go up here, add a comment and call it "Experiment", and then I'm going to take my Python code, copy it, paste it here and maybe call this one insane_gamma. Now, my initial guess was 1.9 in our earlier example, and the multiplier was 0.25, if I remember correctly. All I need to do now is come down here, take this bit of code, paste it and modify it so it reads "Plotting insane learning rate", and this is going to be our insane_gamma. In terms of color, let's go for good old red to, uh, show that this is not a good thing. Now I can hit Shift+Enter and see how that behaves.

So what have we got? Whew, that looks really bad. In our fourth example we started out much closer to the minimum: we started with an initial guess of 1.9, compared to the initial guess of 3 that we used for the other runs. So even though our initial guess was actually much better and our cost was much lower, that doesn't help us when our learning rate is all screwed up - we can see here that our cost doesn't actually come down. Even though the algorithm has run 100 times, the cost just bounces around, decreasing, increasing, decreasing again, and in the end, after a hundred iterations, our cost is actually much higher than the cost for all the other examples. All the other learning rates had a lower cost after 100 iterations than this really, really high multiplier of 0.25. So I think that illustrates the problem really nicely, when the algorithm is bouncing around with no clear direction.
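The extra run and plot described above might look roughly like this, under the same assumptions as the previous sketches. In the notebook these lines would sit inside the same cell as the other three runs, before plt.show(), so everything lands on one chart.

```python
# Experiment: a deliberately unstable learning rate
insane_gamma = gradient_descent(derivative_func=dg, initial_guess=1.9,
                                multiplier=0.25, precision=0.0001, max_iter=n)

# Plotting insane learning rate
plt.plot(iteration_list, g(np.array(insane_gamma[1])), color='red', linewidth=5)
plt.scatter(iteration_list, g(np.array(insane_gamma[1])), color='red', s=80)
```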
And I think a takeaway from this exercise is that picking the right learning rate is both a bit of an art and a science, so it shouldn't really come as a surprise that there isn't one perfect solution for picking a good learning rate either. Even if you go to the machine learning literature, you'll find that there are different approaches for picking a good learning rate.

Now, I've talked a little bit about two of the things that this optimization algorithm is sensitive to - these two knobs, right, that we can turn: the initial guess and the learning rate. And you might come away thinking that, oh, you know, gradient descent is a bad algorithm because it can have certain problems. It has pros and cons, but one of the really, really big advantages of gradient descent is that it is an incredibly simple algorithm, and it's also quite fast - it's quite a fast thing to run to train your machine learning model - and that's an advantage that not every competing algorithm can claim for itself. So, given its relative simplicity and speed, you can actually try out different learning rates for your cost function and see what works best.

But of course there are more elegant approaches to picking a learning rate as well. For example, we could adjust our learning rate while the algorithm runs, meaning the learning rate doesn't have to be fixed - it could change after each step in our loop. So why would we do that, you might ask? Why would we have a learning rate that isn't fixed at a particular value? Why would you want to update the learning rate every time the loop runs? Well, the idea behind that is that the further you are from the optimal value, the faster you should move towards that minimum, and thus the ideal value of your learning rate should be larger. On the other hand, once you start getting closer and closer to that minimum, the learning rate should come down as well, because you don't want to overshoot. And this is why some machine learning practitioners create a predefined schedule for the learning rate ahead of time. In the schedule, the learning rate starts off large and then gradually gets smaller. But there's a whole host of other techniques as well that people are using. For example, one quite simple technique is called the bold driver, and it works like this.
If your error rate - yeah, your cost - was reduced since the last iteration, then you can try increasing the learning rate by 5 percent. And if your error rate in fact increased, meaning that you skipped over the optimal point, then you should reset the values of your parameters to the values of the previous iteration and decrease the learning rate by 50 percent. So you can see with this kind of rule how the learning rate would change: increase it by 5 percent if you had a reduction in your error rate, and go back one step and decrease the learning rate by 50 percent if you had an increase in your cost.

Okay, wow, so this was quite a theoretical lesson and we've covered quite a bit. We've covered how our algorithm is sensitive both to the initial guess and to the learning rate, and now we can start to tackle some more complicated cost functions. In particular, so far we've only been working with estimating a single value, right? When you look at g(x), there's only one thing in there - there it is, x, right? But this is only one dimension. Let's turn our attention to how we can tackle two dimensions in our gradient descent algorithm, and then you'll see how you can tackle more than two very easily. I'll see you in the next lesson. Have a good one.
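One last aside on those learning-rate techniques: here is a small sketch of the bold driver rule and a simple decaying schedule. Everything in it (the names, the way the 5% / 50% adjustment sits inside a plain update loop) is just an illustration of the rules described in this lesson, not code from the notebook.

```python
def gradient_descent_bold_driver(derivative_func, cost_func, initial_guess,
                                 learning_rate=0.002, max_iter=100):
    # Bold driver: if the cost went down, grow the learning rate by 5%;
    # if it went up, undo the step and cut the learning rate by 50%.
    x = initial_guess
    cost = cost_func(x)

    for _ in range(max_iter):
        previous_x, previous_cost = x, cost
        x = previous_x - learning_rate * derivative_func(previous_x)
        cost = cost_func(x)

        if cost < previous_cost:
            learning_rate *= 1.05                # cost reduced: speed up a little
        else:
            x, cost = previous_x, previous_cost  # overshot: go back one step
            learning_rate *= 0.5                 # and slow down by half

    return x

# A predefined schedule is even simpler, e.g. start large and decay each step:
# learning_rate = 0.1 * 0.95 ** iteration
```

With the hypothetical g and dg from the earlier sketches, you would call this as gradient_descent_bold_driver(dg, g, initial_guess=3).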