In this lesson, we're going to play around with our gradient descent algorithm a little bit and see how it is affected by our initial guess, by our starting value. We're also going to build on our Python programming skills by covering some of the more advanced features of functions in Python.

The first thing we're going to do is generate a new cost function and move into a second example. So I'm going to change the cell at the bottom here to Markdown again, so that we have a nice, clean section heading, and we're going to call it "Example 2 - Multiple Minima vs. Initial Guess & Advanced Functions".

The cost function that we're going to be working with looks like this: two dollar signs, g(x) = x^4 - 4x^2 + 5, and then two dollar signs at the end. So again, we're using LaTeX markdown to write our function in mathematical notation. As we've talked about before, LaTeX uses tags to mark text for special formatting, and there are two tags here: an opening tag of two dollar signs and a closing tag of two dollar signs.

Now let's get stuck into the Python code. The first thing we're going to do is make some data. We'll create a variable called x_2, since this is our second example, and again we'll use numpy's linspace to generate our values. I'm going to have the values start at -2, go to 2, and be spaced out over about 1000 points.

Now, as a challenge: can you write the g(x) function and its derivative, the dg(x) function, in Python? Remember, you're going to be writing two functions with the def keyword and applying the power rule that we covered in the previous lesson. I'll give you a few seconds to pause the video and write these two functions.

Ready? Here's the solution. It's the def keyword, g(x), colon, return x**4 - 4*x**2 + 5. That's our first function. The derivative of this function, applying the power rule, is def dg(x): and then return 4*x**3 - 8*x, so the four gets multiplied by the two and becomes eight, and the constant drops out. And that's it.
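Collected in one place, here is the code dictated above for the data and the two functions (the import is shown just to keep the snippet self-contained; the notebook imports numpy earlier):

```python
import numpy as np

# Make some data for Example 2
x_2 = np.linspace(start=-2, stop=2, num=1000)

# Cost function: g(x) = x^4 - 4x^2 + 5
def g(x):
    return x**4 - 4*x**2 + 5

# Derivative of g(x), via the power rule
def dg(x):
    return 4*x**3 - 8*x
```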
I'm going to hit Shift+Enter, and if you get the error that I've gotten, you've got to scroll all the way up to where you're importing numpy and hit Shift+Enter on that cell again. np was not recognized by Python because I haven't come back to this notebook in a while. Now I can click back into the cell, hit Shift+Enter, and it runs just fine.

Now it's time to plot the cost function for this example. I'm going to scroll back up for a second, grab the plotting cell from the first example, and copy it, because we're going to be reusing some of this code. So I copy the cell, come back down here, and go to "Edit" > "Paste Cell Above". But I have to make a couple of changes. First, I'm going to change the x axis to go from -2 to 2 and the y axis to go from 0.5 to 5.5, and I'm going to change the label on the y axis to g(x). Of course, on my plot I'm going to use x_2 and g(x_2). Similarly, for my derivative the y label is going to be dg(x), the x axis will go from -2 to 2 as well, and the y axis will go from -6 to 8. And when it comes to plotting, I'm going to plot x_2 and dg(x_2). I hit Shift+Enter and see what I get. Voila! These two plots help us visualize our second example's cost function. (A sketch of the edited plotting cell follows.)
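Here is a sketch of what that plotting cell plausibly looks like after the edits. The axis limits, labels, and data are as dictated; the figure size, font sizes, and side-by-side subplot layout are assumptions carried over from the first example's cell:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=[15, 5])

# Chart 1: the cost function g(x)
plt.subplot(1, 2, 1)
plt.xlim(-2, 2)
plt.ylim(0.5, 5.5)
plt.xlabel('x', fontsize=16)
plt.ylabel('g(x)', fontsize=16)
plt.plot(x_2, g(x_2))

# Chart 2: the derivative dg(x)
plt.subplot(1, 2, 2)
plt.xlim(-2, 2)
plt.ylim(-6, 8)
plt.xlabel('x', fontsize=16)
plt.ylabel('dg(x)', fontsize=16)
plt.plot(x_2, dg(x_2))

plt.show()
```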
So what can we observe here? What can we see? Well, there are a couple of things of note. If we look at the chart on the left, we can see that there are two minima, two places where the cost is very low: one here and the other one here. Also, looking at the right-hand chart, we see that the derivative intersects the x axis at three points: one here, one at x equals zero, and one here. So there are three points where the slope is equal to zero, and those three points correspond to this minimum, this minimum, but also this maximum here.

Now, it's going to be very interesting to see how this affects our gradient descent algorithm. But before we start playing around with the starting values, we're going to make some modifications to our Python code, because this is a really, really good time to talk about some of the advanced features of functions in Python programming.

We've already written our code for the gradient descent algorithm, so what we're going to do is scroll up, take that cell, copy it, and then insert the copied cell above. Now we're in a good position to start modifying this little bit of code. I really can't wait to show you some of the more advanced Python programming features when it comes to functions, because Python functions are actually incredibly powerful and versatile things. In this lesson we're going to cover how to pass a function as an argument, how to make an argument optional by specifying a default value, and how to have a function return multiple values. I'm going to show you all of these things by turning our gradient descent algorithm into our very own function.

In fact, let's add some markdown to show this in our notebook. I'm going to move this cell at the bottom up with the up arrow, convert it to markdown, and use two hashtags to put down "Gradient Descent as a Python Function". There we go. Okay, let's get started.

The first thing we do is write our function header, as always. It's going to be def, then we give our function a name, let's call it gradient_descent, followed by two parentheses and a colon. Our gradient descent function is going to take four arguments: the derivative function itself, a value for an initial guess, the learning rate, and the precision. Let's put these in as parameters between the two parentheses. The first, we said, is the derivative function; I'm going to call it derivative_func. Then the initial guess, comma, then a multiplier (our learning rate), and then the precision.

Now, if you're looking at this, you might be wondering about my intentions. What do I mean by putting this derivative function in as a parameter? See, the thing about Python is that a function is actually a full-blown object. Functions are stored in a piece of the computer's memory all on their own, just like other objects are. And this means that you can stick a function in a variable and pass functions around our program.
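As a minimal illustration of that point (this snippet is mine, not from the lesson's notebook): referring to a function without parentheses gives you the function object itself, which can be assigned to a variable and passed around like any other value.

```python
# A function is an object: without parentheses we handle the object itself
f = dg           # store the dg function in a variable
print(f(1.0))    # call it through the new name: prints -4.0, since dg(1) = 4 - 8
```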
So in our gradient descent function, this derivative_func parameter will be our placeholder for the actual derivative function, and you'll see this when we call the function.

Now let's fill in our function body. In order to make all of these lines part of the function body, we have to indent them, because indentation is how Python knows that these lines belong to our function. To indent a whole group of lines, you can select them and use a keyboard shortcut: on Windows you press Control and the square bracket key to indent the whole group, and on Mac you press Command and the square bracket key instead. Let me show you. I select all of these lines up to the break statement, press Control and the square bracket key, and they all move over by one level. Now they're all part of our function body. That's pretty neat, right?

Next, I'm going to modify these lines. Our new_x will be equal to the initial guess that we're making here. The initial guess is our placeholder: when our function gets called, we'll supply an initial guess, and we set new_x equal to it. I don't need this line anymore, and I don't need these two lines either, because these variables will get their values when the function is called. I'm going to keep these lines around, but I have to make a modification: our derivative function isn't called df anymore, it gets the name of our placeholder, so it's going to be called derivative_func. And the same is the case down here, the other place where we refer to our derivative function; this is also going to be called derivative_func. I'm going to delete this comment here, we don't need it anymore. Then, when it comes to graphing, there's another reference to our previous example, which I'm going to replace with derivative_func as well. I'm also going to take away this print statement, and delete these other print statements here too.
We don't need these anymore either. But there is one additional thing that I do want to add. I'm just going to delete a couple of these lines to tidy things up a little, so we can actually tell what's going on, and now I can add that last thing to our gradient descent function: my return statement. This is the keyword return followed by whatever the function spits out.

Now, what do we want this function to return? What are the important things that we want out of our gradient descent function? We want three separate values: the new_x value, the list of x values, and the list of slopes that we calculated. The list of x values and the list of slope values are what we'll be using for graphing, and our minimum is the new_x value that we spit out. One of the easiest ways of having a function return more than one value is simply to separate the return values with commas. So "return new_x, x_list, slope_list" will return three values. Let's press Shift+Enter now and see if we get any errors. Okay, so far so good.
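Putting all of those edits together, here is a sketch of what the finished function plausibly looks like. The parameter names, body, and return statement are as dictated; the loop bound of 500 iterations is an assumption carried over from the first example, since the lesson doesn't restate it here:

```python
def gradient_descent(derivative_func, initial_guess, multiplier, precision):
    new_x = initial_guess
    x_list = [new_x]
    slope_list = [derivative_func(new_x)]

    for n in range(500):  # assumption: same iteration cap as in Example 1
        previous_x = new_x
        gradient = derivative_func(previous_x)
        new_x = previous_x - multiplier * gradient  # step against the gradient

        step_size = abs(new_x - previous_x)
        x_list.append(new_x)
        slope_list.append(derivative_func(new_x))

        if step_size < precision:  # stop once the steps become tiny
            break

    return new_x, x_list, slope_list  # multiple values, separated by commas
```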
Now it's time to call this function. This is where the rubber meets the road, as they say. Since we have multiple return values, we can store them in three separate variables, which I'm going to call local_min, list_x, and deriv_list. My function returns three values, and they get stored, in this order, in those three variables. Okay, time to call the function: gradient_descent, open parentheses, and now we supply those four arguments. What are they going to be? Let's supply these arguments by their position. The first thing we pass into our function is going to be, well, another function: our derivative function, which is kind of crazy, right? We're passing a function to another function like it's no big deal. But I already mentioned that functions are just objects in Python, like pretty much anything else. Our derivative function is an object, and it's got the name dg. That's it. If we want to get technical, what's in fact happening here is that we're giving our gradient descent function a pointer to the dg function that we defined in the cell above. If you've got a programming background, you might be interested to know that we're not copying the dg object, we're simply pointing to it.

Now let's supply the other three arguments: let's have our starting position be 0.5, our learning rate be 0.02, and our precision be 0.001. And let's add some print statements for good measure: print('Local min occurs at: ', local_min). Let's print out that first value, and let's also print out the number of steps. The number of steps is going to be the length of our list, so I can use len, the length function, on list_x. This includes the initial guess plus the number of times the loop ran; that's the number of values stored in this list right here.

Now, as a challenge: can you figure out what the problem is if I try to run this as it is, and what I would have to fix for our function to run properly? Because there's something I've missed in our gradient descent function that I haven't taken into account yet.

Here's the solution: this variable here, step_multiplier, exists locally within our function, so it only exists within the function itself, but the problem is that it hasn't been defined anywhere. This means Python does not know what this name refers to. In short, we've got to be consistent with our naming: multiplier is the name of the parameter, which is what we actually want to use; we cannot use the name step_multiplier that we had defined earlier. And that's the fix. Let me press Shift+Enter to rerun the cell. Now I can press Shift+Enter again to run the cell below, and here's our answer: our local minimum occurs at 1.4, and the number of steps it took to get there was 23. So our function is working.

Now, looking at how we're calling this function, gradient_descent(dg, 0.5, 0.02, 0.001) is not very readable. This is something I really, really dislike when writing code, because these numbers just appear like magic numbers.
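For reference, here is the working call from this part of the lesson, with the arguments supplied by position:

```python
local_min, list_x, deriv_list = gradient_descent(dg, 0.5, 0.02, 0.001)
print('Local min occurs at: ', local_min)
print('Number of steps: ', len(list_x))  # initial guess + loop iterations
```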
Looking at this call, we don't know what the numbers mean; we'd have to go and look at the function definition. It's much nicer to add keywords to these arguments, and we can do that by modifying our function call. So I'm going to take this code, copy it, paste it down here for reference, and then fill in the names of the arguments. This first one was our derivative_func, and it's going to be equal to dg; our initial_guess is going to be equal to 0.5 for now. I can actually hit enter here and go to a new line; this doesn't affect the function call at all, but it makes our code a lot more readable. Then this was our multiplier, and this was our precision. Now let's see what happens when our initial guess has a starting value of -0.5, and run it.

We can already see a difference: the first time, the local minimum occurs at 1.4, and the second time around, the local minimum occurs at -1.4. But before we investigate this, let's talk a little bit more about arguments. We already know that arguments are how a function gets its inputs, how objects are sent to a function. And I promised to show you how we can give our arguments a default value and thereby make some of them optional. To do that, we have to modify the header of our gradient descent function, because that is where we can specify default values. Let's specify a default value for the multiplier by setting it equal to 0.02 in the header, and a default value for the precision by setting it equal to 0.001. Now our gradient descent function has two required arguments, the derivative function and the initial guess, and two optional arguments; they're optional because they have default values.

Let's call this function again, this time specifying only the required arguments. I copy this code, paste it here, and delete the last part of my function call, so I'm only specifying the derivative function and the initial guess. I'm also going to change this guess to -0.1. Let's see where we end up.
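Here is a consolidated sketch of the modified header and the two styles of call described above (the function body is unchanged from the earlier sketch and elided here):

```python
# Header with default values: multiplier and precision become optional
def gradient_descent(derivative_func, initial_guess,
                     multiplier=0.02, precision=0.001):
    ...  # body unchanged

# Keyword arguments make the call self-documenting
local_min, list_x, deriv_list = gradient_descent(derivative_func=dg,
                                                 initial_guess=-0.5,
                                                 multiplier=0.02,
                                                 precision=0.001)

# With the defaults in place, only the required arguments are needed
local_min, list_x, deriv_list = gradient_descent(derivative_func=dg,
                                                 initial_guess=-0.1)
```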
If you're getting the error "gradient_descent() missing 2 required positional arguments" despite having added this code, it's because you haven't rerun the cell that defines the function. Remember to press Shift+Enter on that cell to rerun the code and make sure the Jupyter notebook is updated, then come down here and run this one. You should see that we end up at the same minimum as before, but this time it takes us 34 steps instead of 23.

Now, one thing you might try is rerunning this earlier cell here. The question is: will it still work? And the answer is yes, it will. Even though we don't have to specify values for the multiplier and the precision, we still can. So I can add an extra zero to the precision, and make our step size even smaller by changing the multiplier from 0.02 to 0.01, overriding the default values that this function usually has. Having made the step size smaller and our cutoff point even more precise, the number of steps increases to 56.
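As a sketch, that override experiment looks like this:

```python
# Defaults can still be overridden explicitly in the call
local_min, list_x, deriv_list = gradient_descent(derivative_func=dg,
                                                 initial_guess=-0.1,
                                                 multiplier=0.01,
                                                 precision=0.0001)
print('Number of steps: ', len(list_x))  # smaller steps, tighter cutoff: 56 steps
```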
So this is really, really cool, right? We've got really powerful capabilities with functions: a very descriptive way to call them, multiple outputs that we can store in comma-separated variables, and optional arguments, where we give default values to some of the parameters when we create the function.

But all this stuff with arguments and optional arguments is kind of hidden from view. I mean, how would you know what the arguments are for a function you've never seen before? Wouldn't it be nice to pull up this information quickly and easily inside the Jupyter notebook, without having to guess? Well, I've got you covered. Let me show you a neat little trick in Jupyter notebook. If I put my cursor over my gradient_descent function and hit Shift and Tab on my keyboard, so holding down the Shift key and pressing Tab, then Jupyter notebook pulls up a little bit of documentation on this function. I can see that it takes four arguments, I can see what the arguments are called (derivative_func, initial_guess, multiplier, precision), and I can even see the default values for the other two arguments. Isn't that really, really cool? Jupyter notebook is actually smart enough to give us information on our function right there and then.

And it doesn't just work with our own functions. Scroll up, pick out the matplotlib scatter function in this notebook, and give it a try: Shift and then Tab on your keyboard. Go on, I'll wait for you right here.

Did you try it in a couple of places? You might have noticed something: sometimes it's really, really informative, and other times it's not. Let me show you what I mean. I've got my Python code here from our previous example. If I go to scatter and press Shift+Tab, I get a wonderful description with very, very detailed information on the signature, and I can click the little plus sign to take a look at all the documentation for the scatter function. Similarly, when I go up to plt.figure and press Shift+Tab, I get a wonderful signature, all the things in the header, and when I click the little plus sign I get really descriptive information: "facecolor: the background color. If not provided, defaults to rcParams["figure.facecolor"]." Fair enough, right? Cool.

Now let's go to the plot function. If I press Shift+Tab on this, all I get is "plt.plot" with *args and **kwargs, so this isn't really informative, and I have to go digging through the documentation here to really figure out what it is. It's not all that readable; you can be scrolling around in there for a long time trying to make sense of it all. At this point, you're probably much better off going to the website with the official documentation, where you can read up on this in a much nicer format and search the page. In other words, if you want to know how something like plot or subplot works, you're probably still better off pulling up the documentation for these things in your browser.

But speaking of plots, it's time to chart our gradient descent. Let me copy the cell that generates these charts: "Edit" > "Copy Cells", scroll all the way down, and then "Paste Cells Above".
This is our chance to play around with the starting values in our algorithm. So I'm going to add a comment here, "Calling gradient descent function", and edit this other comment to say "Plotting function and derivative and scatter plot side by side". Then I'm going to take the function call from above, copy it, paste it in here, and change our initial guess from -0.1 to 0.1.

Now let's add a couple of lines of code to put a scatter plot on here as well. This is going to be plt.scatter, with our list of x values and then our cost function, g(list_x). But we can't leave it like that, because we're doing some calculations inside g, remember? And the power function doesn't play nice with the list type. So I'm going to convert the list to an np.array and put that inside our g function; the array is the input, and I'm nesting the two function calls here instead of splitting them up. Then we add a color; I'm going to say the color is red, the size of the dots will be 100 as before, and we'll give it some transparency as well, alpha=0.6, and then a closing parenthesis for the scatter plot. Transparency looks pretty good on the line as well, so I'm going to add alpha=0.8 on our cost function chart too.

Let's do something similar for our derivative chart below. It's also going to be a scatter plot, plt.scatter, with the list of x values again on the x axis, and then what was previously our slope list, which we've stored in a variable called deriv_list. We'll go again with the red color, size 100, and alpha 0.5, and this sky-blue line also gets some transparency, with alpha equal to 0.6.

Now let's run the cell and see what happens. Voila! This is what we get: our initial starting value up here at 0.1, and we descend to the minimum here. And on our derivative it looks like this: we go down, down, down, down until our slope is equal to zero.
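Here is a sketch of what the charting cell plausibly looks like after all of these edits. The scatter calls, colors, sizes, and alpha values are as dictated; the figure size, font sizes, line widths, and the sky-blue line color carried over from the copied cell are assumptions:

```python
# Calling gradient descent function
local_min, list_x, deriv_list = gradient_descent(derivative_func=dg,
                                                 initial_guess=0.1)

# Plotting function and derivative and scatter plot side by side
plt.figure(figsize=[15, 5])

# Chart 1: cost function with the steps taken
plt.subplot(1, 2, 1)
plt.xlim(-2, 2)
plt.ylim(0.5, 5.5)
plt.xlabel('x', fontsize=16)
plt.ylabel('g(x)', fontsize=16)
plt.plot(x_2, g(x_2), color='skyblue', linewidth=3, alpha=0.8)
plt.scatter(list_x, g(np.array(list_x)), color='red', s=100, alpha=0.6)

# Chart 2: derivative with the slope at each step
plt.subplot(1, 2, 2)
plt.xlim(-2, 2)
plt.ylim(-6, 8)
plt.xlabel('x', fontsize=16)
plt.ylabel('dg(x)', fontsize=16)
plt.plot(x_2, dg(x_2), color='skyblue', linewidth=5, alpha=0.6)
plt.scatter(list_x, deriv_list, color='red', s=100, alpha=0.5)

plt.show()
```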
Okay, so let's try out a couple of different starting values for our gradient descent function in this example. Here we've started at 0.1 and we've converged to this right-hand minimum. Let's see what happens when we start out with the value 2. I'm going to set my initial guess up here to 2 and hit Shift+Enter. In this case, our gradient descent algorithm goes down here and we end up at the very same minimum. And if we start out somewhere else, say at -1.8, then we end up at this left-hand minimum instead. We'll end up in the left-hand minimum as well if we start out at -0.1.

So what we can learn from this example is that our algorithm isn't perfect; it has some weaknesses. Conceptually it's a little bit disturbing, right, that we end up at completely different minima when we start out at 0.1 versus -0.1. If we're unlucky in our choice of initial starting position, we can end up in very, very different places. We can see in this example that the path of the descent can be very much influenced by that initial guess in certain situations.

Now, would you like to venture a guess at what happens when we have an initial starting value of 0? Have a think about what would happen with our gradient descent if we feed the value zero in as our initial guess, before you run the algorithm. Let's try it out. Instead of 0.1, I'm going to start at 0 and press Shift+Enter. What ends up happening is that we don't descend to either of the two minima; instead, we end up sitting right here on the maximum. And that's because the slope at this very point is also equal to 0, which we can see on the right-hand chart, on the slope of the cost function. Remember, our gradient descent algorithm stops running once the slope is equal to 0. So this problem is also related to the sensitivity to the starting position: the sensitivity of the path of the gradient descent algorithm to that initial guess. (The experiments are summarized in the sketch below.)
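A quick summary of these experiments as calls (a sketch; the results in the comments are the ones observed in the lesson):

```python
# Starting points on either side of x = 0 converge to different minima
gradient_descent(derivative_func=dg, initial_guess=0.1)   # local min near  1.4
gradient_descent(derivative_func=dg, initial_guess=2)     # local min near  1.4
gradient_descent(derivative_func=dg, initial_guess=-1.8)  # local min near -1.4
gradient_descent(derivative_func=dg, initial_guess=-0.1)  # local min near -1.4

# At exactly 0 the slope is 0, so the algorithm stops immediately on the maximum
gradient_descent(derivative_func=dg, initial_guess=0)     # stuck at 0
```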
Now, in this case, both of the two minima have the same cost; the cost is equal at both of them. But we can also imagine a very different situation. If our cost function looked something like this, then we would have two minima, one of which has a much lower cost than the other. One of these is a global minimum, and the other one is a local minimum. The local minimum has a higher cost than the global minimum, but our gradient descent would not discover the global minimum if the initial starting point was on the right-hand side of that local maximum, that small hump right there in the middle of the chart.

Okay, so we've outlined the problem. What's the solution? How do we get around this weakness? Well, the easiest thing to do would be to simply try out multiple different starting values and see if they end up in the same place. Think of this as injecting a little bit of randomness into our gradient descent: we could choose a whole host of random starting values and then run our gradient descent over and over again to see where it ends up (there's a small sketch of this idea at the end of the lesson). This might be an approach you could take if you don't actually know what the cost function looks like, if you don't know where the minimum is.

Another thing we could do is try a completely different algorithm to find the minimum, because, let's face it, this particular version of gradient descent isn't our only option. Similar to how we can try a bunch of different random starting points, other algorithms have randomness baked into them. One version of gradient descent with more randomness baked in is called Stochastic Gradient Descent, and this is in contrast to what we're currently doing, which is called Batch Gradient Descent. The thing to note about any of these approaches is that none of them is perfect. You'll find that no matter which approach you choose, it has certain strengths and certain weaknesses, and it's important to understand what the pros and cons of each approach are.

So on that note, we're not done yet examining this particular algorithm. Our Batch Gradient Descent algorithm might actually face another problem, and that's what we're going to be looking at next. I'll see you in the next lesson. Take care.
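Here is that minimal sketch of the random-restart idea mentioned above. It's an illustration, not code from the lesson's notebook; the restart range of -2 to 2 and the count of 10 restarts are arbitrary assumptions:

```python
# Random restarts: run gradient descent from several random starting points
# and keep whichever result has the lowest cost.
best_x = None
for start in np.random.uniform(low=-2, high=2, size=10):
    candidate, _, _ = gradient_descent(derivative_func=dg, initial_guess=start)
    if best_x is None or g(candidate) < g(best_x):
        best_x = candidate  # keep the candidate with the lowest cost so far

print('Best minimum found at: ', best_x)
```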