1
00:00:00,600 --> 00:00:03,466
Hello and welcome back to the course
on Deep Learning.

2
00:00:03,466 --> 00:00:06,466
In today's tutorial
we're talking about gradient descent.

3
00:00:06,733 --> 00:00:09,333
What we learned previously was that

4
00:00:09,333 --> 00:00:14,233
in order for a neural network to learn,
what needs to happen is backpropagation.

5
00:00:14,233 --> 00:00:18,166
And that is when the error,
the difference or

6
00:00:18,366 --> 00:00:23,500
the sum of squared differences
between y hat and y is back

7
00:00:23,500 --> 00:00:28,033
propagated through the neural network,
and the weights are adjusted accordingly.

8
00:00:28,366 --> 00:00:32,600
So we saw that and today
we're going to learn exactly how these.

9
00:00:32,600 --> 00:00:35,600
Weights are adjusted.
So let's have a look.

10
00:00:35,966 --> 00:00:39,300
this is our, very simple version.

11
00:00:39,300 --> 00:00:42,566
Of a neural network, a, perceptron
or a single.

12
00:00:42,566 --> 00:00:44,700
Layer feedforward neural network.

13
00:00:44,700 --> 00:00:47,033
And what we can see here is.

14
00:00:47,033 --> 00:00:51,600
this whole process in action
where, we've got, some input value,

15
00:00:51,933 --> 00:00:56,900
then we've got a weight,
then a, activation function is applied.

16
00:00:56,933 --> 00:00:59,766
We have, we get y hat and then
we compare it to the actual value.

17
00:00:59,766 --> 00:01:01,700
We calculate the cost function.

18
00:01:01,700 --> 00:01:03,666
So how can we.

19
00:01:03,666 --> 00:01:05,233
Minimize the cost function.

20
00:01:05,233 --> 00:01:07,266
What's, can we do about it.

21
00:01:07,266 --> 00:01:12,033
Well one approach to do it is a brute
force approach where we just take.

22
00:01:12,666 --> 00:01:15,600
all. Lots of different possible weights.

23
00:01:15,600 --> 00:01:18,133
And look at them
and see which one works best.

24
00:01:18,133 --> 00:01:19,366
And what we. Would do is, for instance.

25
00:01:19,366 --> 00:01:22,966
Would, try out for a, let's say,
for example, a thousand weights

26
00:01:23,233 --> 00:01:27,633
and we try them out, we'd get something
like this, for the cost function.

27
00:01:27,633 --> 00:01:28,966
And this is a chart,

28
00:01:28,966 --> 00:01:32,700
of, on the y axis, we have cost function
on the vertical axis.

29
00:01:32,700 --> 00:01:34,766
On the horizontal axis you have y hat.

30
00:01:34,766 --> 00:01:36,000
And because you can.

31
00:01:36,000 --> 00:01:40,200
See the formula is y hat minus y a square,
this is what the cost function

32
00:01:40,200 --> 00:01:41,400
would look something like that.

33
00:01:42,533 --> 00:01:44,800
And basically.

34
00:01:44,800 --> 00:01:47,800
You'd find the best one is over here.

35
00:01:47,800 --> 00:01:50,800
so very simple, very intuitive approach.

36
00:01:50,900 --> 00:01:53,100
why. Not do. This brute force method?

37
00:01:53,100 --> 00:01:56,400
Why not just try out
a thousand different,

38
00:01:56,400 --> 00:02:00,000
cost for a thousand different,
parameters.

39
00:02:00,000 --> 00:02:01,200
Or inputs for.

40
00:02:01,200 --> 00:02:02,933
Weights and see which one works best.

41
00:02:02,933 --> 00:02:05,333
You'll find the best one that way. Well,
if you have.

42
00:02:05,333 --> 00:02:07,600
Just one way to optimize, this might work.

43
00:02:07,600 --> 00:02:09,300
But as you increase the.

44
00:02:09,300 --> 00:02:11,600
Number of weights,
increase the. Number of.

45
00:02:11,600 --> 00:02:16,133
Synapses in your network, you have to face
the curse of dimensionality.

46
00:02:16,500 --> 00:02:19,333
And so what is.
The curse of dimensionality?

47
00:02:19,333 --> 00:02:24,300
the best way to describe this or explain
it is to just look at a practical example.

48
00:02:24,466 --> 00:02:27,933
So remember this example
we had when we were talking about how.

49
00:02:27,933 --> 00:02:28,733
Neural networks.

50
00:02:28,733 --> 00:02:32,066
Actually work, where we were, building or.

51
00:02:32,333 --> 00:02:33,433
Running a neural network.

52
00:02:33,433 --> 00:02:36,666
For, a for profit evaluation.

53
00:02:36,966 --> 00:02:40,333
So this is what it looked like
when it was trained up already.

54
00:02:40,600 --> 00:02:44,100
Well, when it's not trained before,
it's trained before we know which

55
00:02:44,233 --> 00:02:45,366
one of the weights.

56
00:02:45,366 --> 00:02:47,833
The actual neural network
looks like this, right.

57
00:02:47,833 --> 00:02:50,300
Because we have. All these different.

58
00:02:51,333 --> 00:02:54,800
Possible, synopses,
and we still have to train up the weights.

59
00:02:55,133 --> 00:02:57,133
And here we have a total of 25 weights.

60
00:02:57,133 --> 00:03:00,133
So four times five at the start,
plus five more.

61
00:03:00,166 --> 00:03:00,766
from the hidden layer.

62
00:03:00,766 --> 00:03:03,533
To the output
layer, 20. Five weights total.

63
00:03:03,533 --> 00:03:08,033
And let's see,
how we could possibly brute force

64
00:03:08,033 --> 00:03:12,533
25 weights says it's a very simple, 
neural network right here.

65
00:03:12,533 --> 00:03:14,533
Very simple. Just one hidden layer.

66
00:03:14,533 --> 00:03:19,166
And how could we, brute
force our way through a.

67
00:03:19,533 --> 00:03:21,200
Neural network of this size?

68
00:03:21,200 --> 00:03:22,333
Well, let's do some.

69
00:03:22,333 --> 00:03:24,233
Simple mathematical calculations.

70
00:03:24,233 --> 00:03:25,800
We have 25 weights.

71
00:03:25,800 --> 00:03:28,300
So that means
if we have a thousand combinations

72
00:03:28,300 --> 00:03:31,200
that we're going to test out for every
weight, the total number of combinations.

73
00:03:31,200 --> 00:03:34,433
Is a thousand to the power of 25,
or 1000 or 10 to the power

74
00:03:34,433 --> 00:03:37,433
of 25 different combinations.

75
00:03:37,633 --> 00:03:40,533
now let's see how Sunway head to

76
00:03:40,533 --> 00:03:43,633
tail lights
the world's fastest supercomputer.

77
00:03:43,633 --> 00:03:46,633
As of, June 2016.

78
00:03:46,866 --> 00:03:47,933
what?

79
00:03:47,933 --> 00:03:50,533
How would it approach this problem?
Right. So.

80
00:03:50,533 --> 00:03:53,133
Sunway Tahu light, it looks like

81
00:03:53,133 --> 00:03:56,666
this is a whole huge, building.

82
00:03:56,666 --> 00:03:58,900
Pretty much for this one supercomputer.

83
00:03:58,900 --> 00:04:01,633
And, it got the Guinness World.

84
00:04:01,633 --> 00:04:04,633
Record
for being the fastest supercomputer.

85
00:04:05,066 --> 00:04:08,066
Right now, it is the fastest supercomputer
in the world.

86
00:04:08,266 --> 00:04:09,733
And Sunway.

87
00:04:09,733 --> 00:04:13,500
Tiger Light can operate at a speed of 93.

88
00:04:13,500 --> 00:04:14,800
Petaflops.

89
00:04:14,800 --> 00:04:19,666
flop stands
for, floating operation per second.

90
00:04:19,833 --> 00:04:22,833
So it can do 93 to the power.

91
00:04:22,833 --> 00:04:27,866
Well, times ten to the power of 15
floating operations per second.

92
00:04:27,866 --> 00:04:29,766
That's how quick it is.

93
00:04:29,766 --> 00:04:33,766
in comparison, average computers
right now,

94
00:04:34,033 --> 00:04:37,833
they do like just over several gigaflops
and so on.

95
00:04:38,066 --> 00:04:40,866
So like kind of those ranges.

96
00:04:40,866 --> 00:04:43,933
Weigh less. Than Thai Sunway tail light.

97
00:04:44,233 --> 00:04:47,633
So Sunway tail
light is in the forefront of technology.

98
00:04:48,233 --> 00:04:53,500
And let's say hypothetically
that it can do one.

99
00:04:54,966 --> 00:04:56,700
one test, one combination

100
00:04:56,700 --> 00:05:01,200
for our neural network in one, flop basic
and one floating operation.

101
00:05:01,533 --> 00:05:03,133
That is not possible.

102
00:05:03,133 --> 00:05:05,266
That is not practical
because you need multiple.

103
00:05:05,266 --> 00:05:07,566
Floating operations to test out.

104
00:05:07,566 --> 00:05:09,400
A single weight in your neural network.

105
00:05:09,400 --> 00:05:11,166
But even let's let's give it a head start.

106
00:05:11,166 --> 00:05:15,333
Let's say that it can do in a ideal world,
it can do.

107
00:05:15,333 --> 00:05:17,333
That in one, floating operation.

108
00:05:17,333 --> 00:05:19,966
It can do
one, test per one floating operation.

109
00:05:19,966 --> 00:05:25,200
That means it will still require
ten to the power of 25 divided by 93.

110
00:05:25,200 --> 00:05:26,466
Times ten to about.

111
00:05:26,466 --> 00:05:30,566
15 seconds to come, to run

112
00:05:30,966 --> 00:05:33,966
all of those tests
to brute force through that network.

113
00:05:33,966 --> 00:05:37,433
So that means one or approximate
ten to the power of 58.

114
00:05:37,433 --> 00:05:39,500
Seconds,
and that is the same as ten to the.

115
00:05:39,500 --> 00:05:42,000
Power of 50 years.

116
00:05:42,000 --> 00:05:45,100
That is a huge number that is longer,

117
00:05:46,200 --> 00:05:48,100
than the universe has existed.

118
00:05:48,100 --> 00:05:52,800
And that is definitely not going to simply
this number is so huge.

119
00:05:52,800 --> 00:05:58,666
It's just definitely not going to work
for us, at all in our, optimization.

120
00:05:58,966 --> 00:06:00,000
So there we go.

121
00:06:00,000 --> 00:06:01,133
This is a no no.

122
00:06:01,133 --> 00:06:05,366
Even on the world's
fastest supercomputer, Sunway Tail Light.

123
00:06:05,366 --> 00:06:07,533
So we have to come up
with a different approach.

124
00:06:07,533 --> 00:06:10,200
How are we going
to. Find the optimal weight.

125
00:06:10,200 --> 00:06:11,133
By the way.

126
00:06:11,133 --> 00:06:13,500
This our neural network was very simple.

127
00:06:13,500 --> 00:06:17,100
What about if the neural networks
looks like something like this, right?

128
00:06:17,100 --> 00:06:19,000
Or even greater than that then?

129
00:06:19,000 --> 00:06:21,600
yeah, it's
just not going to happen at all.

130
00:06:21,600 --> 00:06:22,600
Ever.

131
00:06:22,600 --> 00:06:26,100
So the method we're going to be looking at
is called gradient descent.

132
00:06:26,100 --> 00:06:28,466
And, you may. Have heard of it already.

133
00:06:28,466 --> 00:06:29,133
if not, we will.

134
00:06:29,133 --> 00:06:30,233
Find out what it is right.

135
00:06:30,233 --> 00:06:32,433
Now. So, there's our.

136
00:06:34,200 --> 00:06:36,533
cost function, and now we're

137
00:06:36,533 --> 00:06:40,833
going to see how we can faster
come up with a faster way to.

138
00:06:41,000 --> 00:06:43,100
Find the best option.

139
00:06:43,100 --> 00:06:45,833
So let's say we start somewhere.
You got to start somewhere.

140
00:06:45,833 --> 00:06:47,233
So we start over there.

141
00:06:47,233 --> 00:06:51,000
And from that point, in the top left.

142
00:06:51,400 --> 00:06:53,333
What are we going to do?
Is we're going to look.

143
00:06:53,333 --> 00:06:58,500
At the angle,
of our cost function at that point.

144
00:06:58,500 --> 00:06:59,733
So we're just going to basically that's

145
00:06:59,733 --> 00:07:02,000
why it's called gradient
because you have to differentiate.

146
00:07:02,000 --> 00:07:04,666
We're not going to look
at the mathematical equations. I, we will.

147
00:07:04,666 --> 00:07:08,100
Provide some tips on additional
reading at the end of the.

148
00:07:08,100 --> 00:07:09,633
Next lecture.

149
00:07:09,633 --> 00:07:12,900
but. Basically
you just need to differentiate.

150
00:07:12,900 --> 00:07:16,133
Find out what the slope is
in that specific point,

151
00:07:16,766 --> 00:07:19,200
and find out
if the slope is positive or negative.

152
00:07:19,200 --> 00:07:22,466
If the, if the slope is negative,
like in this case, means that

153
00:07:22,800 --> 00:07:23,866
you're going downhill.

154
00:07:23,866 --> 00:07:26,966
So, to the right is downhill, to
the left is uphill.

155
00:07:27,200 --> 00:07:29,600
And from there
it means you need to go, right.

156
00:07:29,600 --> 00:07:31,533
Basically you need to go downhill.

157
00:07:31,533 --> 00:07:32,966
And that's what. We're going to do.

158
00:07:32,966 --> 00:07:35,733
Vroom takes a step. Right to the ball.

159
00:07:35,733 --> 00:07:36,933
Rolls down.

160
00:07:36,933 --> 00:07:39,466
again same thing. You calculate the slope.

161
00:07:39,466 --> 00:07:42,233
This time the slope is positive,
meaning right is uphill.

162
00:07:42,233 --> 00:07:44,933
Left is downhill, and you need to go left.

163
00:07:44,933 --> 00:07:46,633
And you roll the ball down.

164
00:07:46,633 --> 00:07:51,533
And again
you calculate the slope and your old ball.

165
00:07:51,533 --> 00:07:51,833
Right.

166
00:07:53,033 --> 00:07:53,433
There you go.

167
00:07:53,433 --> 00:07:56,066
So that's how you find in in simple terms.

168
00:07:56,066 --> 00:07:59,100
That's how you find the best.

169
00:07:59,733 --> 00:08:04,233
weights, the best, situation
that minimizes your cost function.

170
00:08:04,466 --> 00:08:06,066
Of course, it's not going to be.

171
00:08:06,066 --> 00:08:08,333
Like a ball rolling is going to be
a very zig zag type.

172
00:08:08,333 --> 00:08:09,766
Of approach, but it's easier to.

173
00:08:09,766 --> 00:08:13,600
Remember and kind of,
it's is more fun to.

174
00:08:13,633 --> 00:08:16,100
Look at it as a ball rolling.
But in reality, yes, you.

175
00:08:16,100 --> 00:08:18,466
Just, it is going to be like a step
by step approach.

176
00:08:18,466 --> 00:08:21,266
So it's going to be a zigzag
type of method.

177
00:08:21,266 --> 00:08:22,200
Yeah.

178
00:08:22,200 --> 00:08:24,866
And also there's, there's
lots of other elements to it.

179
00:08:24,866 --> 00:08:26,100
So there's.

180
00:08:26,100 --> 00:08:31,533
things like for instance,
why like why does it go down.

181
00:08:31,533 --> 00:08:35,000
Why does it not like
go away over the line.

182
00:08:35,000 --> 00:08:39,333
So it could have jumped out of this, 
going upwards instead of downwards

183
00:08:39,333 --> 00:08:40,366
and things like that.

184
00:08:40,366 --> 00:08:41,866
So there are parameters.
That you can. Tweak.

185
00:08:41,866 --> 00:08:45,466
And again we will mention
where you can find out more on that.

186
00:08:45,466 --> 00:08:47,566
And plus we'll have this
in the practical application.

187
00:08:47,566 --> 00:08:51,600
But in the simplest intuitive approach
this is what is happening.

188
00:08:51,600 --> 00:08:54,600
We are getting to the bottom. by.

189
00:08:54,600 --> 00:08:58,000
Just understanding which way we need to go
instead of brute forcing through.

190
00:08:58,433 --> 00:09:00,966
Thousands
and thousands and millions and billions.

191
00:09:00,966 --> 00:09:04,133
And quadrillions of combinations,
we can just simply.

192
00:09:04,133 --> 00:09:05,133
Every time have a look.

193
00:09:05,133 --> 00:09:09,000
At where is where,
which way is it sloping. So.

194
00:09:09,000 --> 00:09:10,100
Right, like you.

195
00:09:10,100 --> 00:09:12,800
Imagine you're standing on a. Hill,
which way does it feel.

196
00:09:12,800 --> 00:09:15,200
That it's going downwards?
And whichever way it is going downwards.

197
00:09:15,200 --> 00:09:15,800
You just keep.

198
00:09:15,800 --> 00:09:19,333
Walking that way you like take 50 steps
that way and then you, you assess again.

199
00:09:19,333 --> 00:09:20,966
Okay. Which way is it going? Downwards.

200
00:09:20,966 --> 00:09:21,366
This way.

201
00:09:21,366 --> 00:09:21,666
Okay.

202
00:09:21,666 --> 00:09:24,500
Now take 50 steps
or let's take 40 steps that way.

203
00:09:24,500 --> 00:09:27,733
So it gets less and less and less
as you get closer.

204
00:09:28,400 --> 00:09:29,100
So here's.

205
00:09:29,100 --> 00:09:30,400
An example. Of gradient.

206
00:09:30,400 --> 00:09:32,633
Descent
applied in a two dimensional space.

207
00:09:32,633 --> 00:09:35,633
So that was a one dimensional example.

208
00:09:35,833 --> 00:09:40,166
here we have a two dimensional space
for the gradient descent.

209
00:09:40,166 --> 00:09:42,933
As you can see it's getting.
Closer to the minimum.

210
00:09:42,933 --> 00:09:44,833
And it's also called gradient descent
because you're.

211
00:09:44,833 --> 00:09:49,033
Descending
into the minimum of the cost function.

212
00:09:49,533 --> 00:09:53,266
And finally here's the gradient descent
applied in three dimensions.

213
00:09:53,266 --> 00:09:56,266
This is what it looks like
if you projected onto two dimensions,

214
00:09:56,400 --> 00:09:59,400
you can see it
zigzagging its way into the minimum.

215
00:09:59,533 --> 00:10:01,766
So there we go. That was gradient descent.

216
00:10:01,766 --> 00:10:02,933
In next tutorial we'll talk about

217
00:10:02,933 --> 00:10:06,533
stochastic gradient descent
is we'll be continuation of this tutorial.

218
00:10:06,866 --> 00:10:08,600
And. I look forward to seeing you there.

219
00:10:08,600 --> 00:10:10,533
Until next time, enjoy deep learning.