1
00:00:01,000 --> 00:00:03,266
Hello and welcome back to the course
on Deep Learning.

2
00:00:03,266 --> 00:00:06,300
Today we're talking about
stochastic gradient descent.

3
00:00:07,066 --> 00:00:10,100
Previously
we learned about gradient descent.

4
00:00:10,100 --> 00:00:15,033
And we found out that
it is a very efficient method to solve

5
00:00:15,033 --> 00:00:19,200
our optimization problem where we are
trying to minimize the cost function.

6
00:00:19,500 --> 00:00:21,233
It basically

7
00:00:21,233 --> 00:00:26,600
takes us from ten to the power of 57 years
to solving a problem

8
00:00:26,600 --> 00:00:30,633
within minutes or hours,
or within a day or so.

9
00:00:30,933 --> 00:00:36,200
And it really helps speed things up,
because we can see which way is downhill,

10
00:00:36,200 --> 00:00:41,100
and we can just go in that direction and
take steps and get to the minimum faster.

11
00:00:41,466 --> 00:00:45,000
But the thing with
gradient descent

12
00:00:45,300 --> 00:00:50,900
is that this method requires
the cost function to be convex.

13
00:00:50,966 --> 00:00:54,900
And as you can see here, we've
specifically chosen a convex cost

14
00:00:54,900 --> 00:00:55,200
function.

15
00:00:55,200 --> 00:00:59,400
Basically, convex means
that the function looks similar

16
00:00:59,400 --> 00:01:03,166
to what we are seeing now:
it's just convex

17
00:01:03,166 --> 00:01:06,166
in one direction, and that it,

18
00:01:06,366 --> 00:01:09,233
in essence has one global minimum.

19
00:01:09,233 --> 00:01:11,233
And that's
the one that we're going to find.

20
00:01:11,233 --> 00:01:13,966
But what if our function is not convex?

21
00:01:13,966 --> 00:01:15,966
What if our cost function is not convex?

22
00:01:15,966 --> 00:01:17,866
What if it looks something like this?

23
00:01:17,866 --> 00:01:19,700
Well, first of all, how could that happen?

24
00:01:19,700 --> 00:01:21,666
Well, that could happen,

25
00:01:21,666 --> 00:01:25,766
first of all, if we choose
a cost function which is not

26
00:01:26,033 --> 00:01:29,200
the squared difference between y hat
and y,

27
00:01:29,633 --> 00:01:33,766
or if we do choose a cost function
which is like that,

28
00:01:33,766 --> 00:01:37,500
but then in a multidimensional space,
it can actually turn into something

29
00:01:37,600 --> 00:01:39,633
that is not convex.

30
00:01:39,633 --> 00:01:41,966
And so what would happen in this case
if we just tried

31
00:01:41,966 --> 00:01:44,966
to apply our normal gradient
descent method?

32
00:01:45,000 --> 00:01:46,333
Something like this could happen.

33
00:01:46,333 --> 00:01:51,133
We could find a local minimum of the cost
function rather than the global one.

34
00:01:51,133 --> 00:01:53,133
So this one was the best one.

35
00:01:53,133 --> 00:01:54,633
And we found the wrong one.

36
00:01:54,633 --> 00:01:57,633
And therefore
we don't have the correct weights.

37
00:01:57,633 --> 00:02:00,166
We don't have an optimized neural network.

38
00:02:00,166 --> 00:02:02,433
We have a subpar neural network.

39
00:02:02,433 --> 00:02:04,500
And so what do we do in this case?

40
00:02:04,500 --> 00:02:09,933
Well, the answer here
is stochastic gradient descent.

41
00:02:09,933 --> 00:02:13,366
And it turns out that stochastic gradient
descent doesn't require

42
00:02:13,433 --> 00:02:15,233
the cost function to be convex.

43
00:02:15,233 --> 00:02:19,033
So let's have a look at the
differences between the normal gradient

44
00:02:19,033 --> 00:02:21,700
descent that we talked about
and the stochastic gradient descent.

45
00:02:21,700 --> 00:02:25,133
So normal gradient
descent is when we take all of our rows

46
00:02:25,400 --> 00:02:27,566
and we plug them into our neural network.

47
00:02:27,566 --> 00:02:31,900
And once again here we've got the neural
network copied over several times.

48
00:02:31,900 --> 00:02:35,700
But the rows are being plugged
into that same neural network every time.

49
00:02:35,866 --> 00:02:37,100
So there's only one neural network.

50
00:02:37,100 --> 00:02:39,200
This is just for visualization purposes.

51
00:02:39,200 --> 00:02:42,066
And then once we've plugged them in,
we calculate our cost function

52
00:02:42,066 --> 00:02:43,300
based on the formula on the right.

53
00:02:43,300 --> 00:02:45,366
And looking at the chart
at the bottom.

54
00:02:45,366 --> 00:02:47,400
And then we adjust the weights.

55
00:02:47,400 --> 00:02:49,633
This is called
the gradient descent method.

56
00:02:49,633 --> 00:02:54,366
Or, the proper term is the batch
gradient descent method.

57
00:02:54,366 --> 00:02:59,866
So we take the whole batch from
our sample, we apply it, and then we run

58
00:02:59,866 --> 00:03:03,400
that. The stochastic gradient descent
method is a bit different.

59
00:03:03,666 --> 00:03:05,866
Here, we take the rows one by one.

60
00:03:05,866 --> 00:03:07,000
So we take this row.

61
00:03:07,000 --> 00:03:11,200
We run our neural network
and then we adjust the weights.

62
00:03:11,866 --> 00:03:13,466
Then we move on to the second row.

63
00:03:13,466 --> 00:03:16,400
We take a second row.
We run our neural network.

64
00:03:16,400 --> 00:03:17,766
We look at the cost function.

65
00:03:17,766 --> 00:03:20,033
And then we adjust the weights again.

66
00:03:20,033 --> 00:03:21,100
And then we take another row.

67
00:03:21,100 --> 00:03:22,600
Take row three.

68
00:03:22,600 --> 00:03:23,700
We run our neural network.

69
00:03:23,700 --> 00:03:25,366
We look at the cost function,
we adjust the weights.

70
00:03:25,366 --> 00:03:27,666
So basically, we're looking at

71
00:03:28,800 --> 00:03:31,166
we're adjusting
the weights after every single row

72
00:03:31,166 --> 00:03:34,166
rather than doing everything together
and then adjusting weights.

73
00:03:34,400 --> 00:03:36,066
Two different approaches.

74
00:03:36,066 --> 00:03:39,600
And now we're going to just compare
the two side by side.

75
00:03:39,600 --> 00:03:40,500
So here they are.

76
00:03:40,500 --> 00:03:42,766
This is how to visually remember them.

77
00:03:42,766 --> 00:03:46,066
So you've got the batch gradient descent,
where you're adjusting

78
00:03:46,400 --> 00:03:48,966
the weights after you've run them.

79
00:03:48,966 --> 00:03:52,100
After you've run
all of the rows in your neural network.

80
00:03:52,866 --> 00:03:54,900
And then, basically
you adjust the weights

81
00:03:54,900 --> 00:03:57,433
and you run the whole thing
again. Iteration, iteration, iteration.

82
00:03:57,433 --> 00:04:00,866
In the stochastic gradient descent method,
you run one row at a time.

83
00:04:01,433 --> 00:04:04,933
And you adjust the weights,
you adjust the weights, you adjust the weights.

84
00:04:04,933 --> 00:04:07,666
And then you do everything again
and again.

85
00:04:07,666 --> 00:04:10,666
And that is called the stochastic gradient
descent method.

86
00:04:11,100 --> 00:04:14,700
The two main differences
are that the stochastic gradient

87
00:04:14,700 --> 00:04:19,166
descent method helps
you avoid the problem

88
00:04:19,166 --> 00:04:24,000
where you find those local extrema,
or local minima,

89
00:04:24,000 --> 00:04:28,200
rather than the overall
global minimum.

90
00:04:28,900 --> 00:04:32,700
And the reason for that, in simple terms,
is that the SGD

91
00:04:32,900 --> 00:04:36,833
or the stochastic gradient descent
method, has much higher fluctuations

92
00:04:36,833 --> 00:04:38,100
because it can afford them.

93
00:04:38,100 --> 00:04:42,233
It's doing one iteration
or one row at a time, and therefore

94
00:04:42,233 --> 00:04:43,366
the fluctuations are much higher.

95
00:04:43,366 --> 00:04:46,366
And it's much more likely
to find the global

96
00:04:46,800 --> 00:04:49,333
minimum
rather than just a local minimum.

97
00:04:49,333 --> 00:04:52,566
And the other thing about the stochastic
gradient descent,

98
00:04:52,566 --> 00:04:56,400
compared to the batch
gradient descent, is that it's faster.

99
00:04:56,433 --> 00:04:58,566
Like the first impression
that you might have is

100
00:04:58,566 --> 00:05:00,700
that because it's doing
every single row one at a time,

101
00:05:00,700 --> 00:05:04,533
it is slower. But in fact,
it is faster, because

102
00:05:04,833 --> 00:05:09,000
it doesn't have to load up all the
data into memory

103
00:05:09,000 --> 00:05:12,433
and wait until all of those rows
are run all together.

104
00:05:12,566 --> 00:05:15,000
You can just run them one by one,
so it's a much lighter algorithm.

105
00:05:15,000 --> 00:05:16,733
It is much faster in that sense.

106
00:05:16,733 --> 00:05:21,133
So it has way more...
That is, in those senses,

107
00:05:21,133 --> 00:05:25,033
it has more advantages over the batch
gradient descent method.

108
00:05:25,266 --> 00:05:29,633
The main advantage, or the main
pro, of the batch

109
00:05:29,633 --> 00:05:32,633
gradient descent method
is that it is a deterministic algorithm

110
00:05:32,733 --> 00:05:36,866
whereas stochastic gradient descent
is a stochastic algorithm,

111
00:05:36,866 --> 00:05:40,500
meaning it's random. With the batch
gradient descent method,

112
00:05:40,666 --> 00:05:43,666
as long as you have the same
starting weights

113
00:05:44,233 --> 00:05:45,433
for your neural network,

114
00:05:45,433 --> 00:05:49,000
every time you run the batch
gradient descent method, you will get

115
00:05:49,000 --> 00:05:54,200
the same iterations, the same results
for the way

116
00:05:54,200 --> 00:05:57,600
your weights are being updated.
Whereas for the stochastic gradient

117
00:05:57,600 --> 00:06:01,066
descent method, you won't get that
because it is a stochastic method.

118
00:06:01,066 --> 00:06:03,800
You are picking your rows,
possibly at random.

119
00:06:03,800 --> 00:06:08,666
And you are updating your neural network
in a stochastic manner and therefore,

120
00:06:08,900 --> 00:06:12,433
every single time
you run the stochastic gradient

121
00:06:12,433 --> 00:06:13,233
descent method, even

122
00:06:13,233 --> 00:06:16,500
if you have the same weights at the start,
you're going to have a different

123
00:06:17,000 --> 00:06:20,233
process and
different iterations to get there.

124
00:06:20,633 --> 00:06:23,100
So that's, in a nutshell,

125
00:06:23,100 --> 00:06:27,766
what stochastic gradient descent is.
Also, there's a method in between

126
00:06:27,766 --> 00:06:30,400
the two, called the mini-batch
gradient descent method,

127
00:06:30,400 --> 00:06:34,100
where you combine the two.
Basically,

128
00:06:34,100 --> 00:06:37,500
rather than running the whole batch
or running one row at a time,

129
00:06:37,500 --> 00:06:40,500
you run batches of rows:
maybe five, ten, 100.

130
00:06:40,666 --> 00:06:44,933
However many rows you decide to set, you
run that number of rows at a time.

131
00:06:45,066 --> 00:06:47,766
Then you update your weights,
synaptic weights, and so on.

132
00:06:47,766 --> 00:06:50,433
And that's called the mini-batch
gradient descent method.

133
00:06:50,433 --> 00:06:52,933
If you'd like to learn more
about gradient descent,

134
00:06:52,933 --> 00:06:56,500
there's a great article
which you can have a look at.

135
00:06:56,533 --> 00:07:00,300
It's called
A Neural Network in 13 Lines of Python,

136
00:07:00,300 --> 00:07:03,300
Part 2: Gradient Descent, by Andrew Trask.

137
00:07:03,700 --> 00:07:05,700
The link is below.

138
00:07:05,700 --> 00:07:07,200
It's on GitHub.

139
00:07:07,200 --> 00:07:08,600
A 2015 article.

140
00:07:08,600 --> 00:07:12,300
Very well written, in very simple
terms.

141
00:07:12,800 --> 00:07:17,666
It's got some interesting philosophical,
or just interesting,

142
00:07:17,666 --> 00:07:21,400
thoughts on
how to apply gradient descent.

143
00:07:21,400 --> 00:07:26,133
You know, the advantages
and disadvantages, and

144
00:07:26,166 --> 00:07:28,066
how to do things in certain situations.

145
00:07:28,066 --> 00:07:30,600
So he's got some very cool
tips, tricks and hacks.

146
00:07:30,600 --> 00:07:31,966
Very easy to read.

147
00:07:31,966 --> 00:07:33,633
So definitely check that out.

148
00:07:33,633 --> 00:07:36,866
And another one, a bit heavier read.

149
00:07:36,933 --> 00:07:40,766
For those of you who are into mathematics,
who want to get to the bottom

150
00:07:40,766 --> 00:07:46,066
of the mathematics: why gradient descent
specifically, what are the formulas

151
00:07:46,066 --> 00:07:49,066
that are driving the gradients,
how they are calculated, and so on,

152
00:07:49,166 --> 00:07:51,533
check out the article
or actually the book.

153
00:07:51,533 --> 00:07:52,233
It's a free online

154
00:07:52,233 --> 00:07:55,566
book called Neural Networks
and Deep Learning by Michael Nielsen.

155
00:07:56,233 --> 00:07:57,033
A 2015 book.

156
00:07:57,033 --> 00:07:59,533
Basically, it's all online.

157
00:07:59,533 --> 00:08:02,933
You can go ahead
and check it out there.

158
00:08:02,933 --> 00:08:05,766
Again, very soft
introduction to the mathematics.

159
00:08:05,766 --> 00:08:07,200
But then, further on,

160
00:08:07,200 --> 00:08:09,966
the mathematics are pretty heavy

161
00:08:09,966 --> 00:08:13,033
as you go along
as you read through the article.

162
00:08:13,500 --> 00:08:17,233
But at the same time,
it gets you into that mood.

163
00:08:17,266 --> 00:08:20,100
I think he has, like, a warm-up chapter.

164
00:08:20,100 --> 00:08:22,566
You first warm up with the math
and then you jump into it.

165
00:08:22,566 --> 00:08:23,866
So if you're interested in math,

166
00:08:23,866 --> 00:08:26,400
then this is the article to go to.

167
00:08:26,400 --> 00:08:29,033
And there we go. So that's, in a nutshell,

168
00:08:29,033 --> 00:08:32,733
the difference between gradient descent
and stochastic gradient descent,

169
00:08:32,733 --> 00:08:36,266
and how the two work.

170
00:08:36,266 --> 00:08:39,733
And on that note, we're going to wrap up
today's tutorial.

171
00:08:39,733 --> 00:08:41,900
I look forward
to seeing you in the next one.

172
00:08:41,900 --> 00:08:44,033
And until then, enjoy deep learning.