1
00:00:00,266 --> 00:00:02,966
Hello and welcome
back to the course on Deep Learning.

2
00:00:02,966 --> 00:00:05,566
This is an additional tutorial

3
00:00:05,566 --> 00:00:08,533
to talk about the softmax
and cross entropy functions.

4
00:00:08,533 --> 00:00:09,766
It is not 100

5
00:00:09,766 --> 00:00:12,300
percent necessary in order for you to

6
00:00:12,300 --> 00:00:13,300
go through

7
00:00:13,300 --> 00:00:17,200
all of the parts
that we've been through in the main

8
00:00:17,200 --> 00:00:21,033
part of this section, where we're talking
about convolutional neural networks.

9
00:00:21,166 --> 00:00:21,900
But at the same time,

10
00:00:21,900 --> 00:00:26,466
I thought it would be a good addition
to your bag of knowledge and skill sets.

11
00:00:26,466 --> 00:00:28,800
So let's go ahead and

12
00:00:28,800 --> 00:00:30,666
dig into these functions.

13
00:00:30,666 --> 00:00:32,400
So, to start off with,

14
00:00:32,400 --> 00:00:33,166
what we have here

15
00:00:33,166 --> 00:00:37,800
is the convolutional neural network
that we built in the main part of the

16
00:00:37,800 --> 00:00:38,666
section.

17
00:00:38,666 --> 00:00:42,533
And then at the end
it pops out some probabilities

18
00:00:42,766 --> 00:00:47,700
of 0.95 for a dog and 0.05 for a cat,

19
00:00:47,900 --> 00:00:50,900
given that photo on the left as an input.

20
00:00:50,900 --> 00:00:52,500
This is after
the training has been conducted.

21
00:00:52,500 --> 00:00:54,766
It is actually running and it's

22
00:00:55,766 --> 00:00:57,233
classifying a certain image.

23
00:00:57,233 --> 00:00:58,766
And so the question here is: how

24
00:00:58,766 --> 00:01:00,766
come these two values add up to one?

25
00:01:00,766 --> 00:01:03,500
Because as far as we know,
from everything that we've learned

26
00:01:03,500 --> 00:01:08,166
about artificial neural networks,
there is nothing to say that these two

27
00:01:08,733 --> 00:01:11,533
final neurons
are connected to each other.

28
00:01:11,533 --> 00:01:15,066
So how would they know?
How would

29
00:01:15,066 --> 00:01:17,233
each one of them know what
the value of the other one is,

30
00:01:17,233 --> 00:01:20,200
and how would they know
to add their values up to one?

31
00:01:20,200 --> 00:01:22,200
Well, the answer is they wouldn't

32
00:01:22,200 --> 00:01:25,900
in the classic version
of an artificial neural network,

33
00:01:26,033 --> 00:01:28,500
and the only reason that they do
is because we

34
00:01:28,500 --> 00:01:29,900
introduce a special function

35
00:01:29,900 --> 00:01:33,566
called the softmax function
in order to help us out of the situation.

36
00:01:33,800 --> 00:01:37,500
So normally
what would happen is the dog and cat

37
00:01:37,500 --> 00:01:40,700
neurons
would have any kind of real values;

38
00:01:41,400 --> 00:01:44,633
they don't have to be between 0 and 1,
and they don't have to add up to one.

39
00:01:45,033 --> 00:01:46,733
But then we would apply

40
00:01:46,733 --> 00:01:50,466
the softmax function,
which is written up over there at the top.

41
00:01:50,766 --> 00:01:54,300
And that would bring these values
to be between 0 and 1.

42
00:01:54,300 --> 00:01:56,233
And it would make them add up to one.

43
00:01:56,233 --> 00:01:57,733
And to quote Wikipedia,

44
00:01:59,100 --> 00:02:00,000
the softmax

45
00:02:00,000 --> 00:02:03,166
function, or the normalized
exponential function, is a generalization

46
00:02:03,166 --> 00:02:06,866
of the logistic function
that, quote unquote, squashes

47
00:02:07,133 --> 00:02:11,166
a K-dimensional vector of arbitrary
real values to a K-dimensional

48
00:02:11,166 --> 00:02:15,266
vector of real values in the range of 0
to 1 that add up to one.

49
00:02:15,266 --> 00:02:17,533
So basically it does exactly what we want.

50
00:02:17,533 --> 00:02:18,633
It brings these values

51
00:02:18,633 --> 00:02:22,366
to be between 0 and 1, and makes sure
that they add up to one.

52
00:02:22,800 --> 00:02:26,433
And the way it works,
the way that this is possible,

53
00:02:26,433 --> 00:02:29,800
is that at the bottom over here
you can see that there's a summation.

54
00:02:29,800 --> 00:02:32,800
So it takes the exponential,

55
00:02:32,800 --> 00:02:36,466
e raised
to the power of z, and adds it up:

56
00:02:36,466 --> 00:02:39,500
z1, z2, across
all of your classes, all of these values,

57
00:02:39,833 --> 00:02:43,266
and so that's your normalization
happening right there.
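
To make the normalization concrete, here is a minimal sketch of the softmax function in plain Python. The function name `softmax`, the raw scores 2.0 and -1.0, and the max-subtraction step are assumptions added for illustration; they do not come from the slide.

```python
import math

def softmax(z):
    """Map a list of real-valued scores to values in (0, 1) that sum to 1."""
    # Subtracting the maximum is a common numerical-stability trick
    # (an addition of this sketch); it does not change the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)  # the summation in the denominator: the normalization
    return [e / total for e in exps]

# Hypothetical raw outputs of the dog and cat neurons (made-up numbers).
probs = softmax([2.0, -1.0])
print(probs, sum(probs))
```

With these made-up scores the two outputs come out at roughly 0.95 and 0.05, so the values land between 0 and 1 and add up to one.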

58
00:02:44,233 --> 00:02:47,300
So that's how the softmax function works.

59
00:02:47,300 --> 00:02:49,066
And it makes sense to

60
00:02:49,066 --> 00:02:53,000
introduce the softmax function
into convolutional neural networks.

61
00:02:53,000 --> 00:02:55,566
Because how strange would it

62
00:02:55,566 --> 00:02:59,933
be if you had possible
classes of a dog and a cat,

63
00:02:59,933 --> 00:03:05,066
and for the dog class
you had a probability of 80%,

64
00:03:05,066 --> 00:03:08,066
and for the cat class
you had a probability of 45%?

65
00:03:08,300 --> 00:03:11,266
Right.
It just doesn't make sense like that.

66
00:03:11,266 --> 00:03:14,700
And therefore it's much better
when you introduce a softmax function.

67
00:03:14,700 --> 00:03:16,333
And that's what you will find happening

68
00:03:16,333 --> 00:03:19,333
most of the time
in convolutional neural networks.

69
00:03:19,600 --> 00:03:21,500
Now, the other thing is that the

70
00:03:21,500 --> 00:03:23,266
softmax function

71
00:03:23,266 --> 00:03:27,000
comes hand in hand with something
called the cross entropy function.

72
00:03:27,400 --> 00:03:28,933
And it's a very handy thing for us.

73
00:03:28,933 --> 00:03:30,466
So let's first look at the formula.

74
00:03:30,466 --> 00:03:33,033
This is what the cross entropy function
looks like.

75
00:03:33,033 --> 00:03:36,933
We're actually going to be
using a different calculation.

76
00:03:36,933 --> 00:03:39,333
I'm going to be using this representation
of the cross entropy.

77
00:03:39,333 --> 00:03:40,566
But the results are basically the same.

78
00:03:40,566 --> 00:03:42,433
This is just easier to calculate.

79
00:03:42,433 --> 00:03:44,433
And I know this

80
00:03:44,433 --> 00:03:47,766
might sound very
unrelated to anything right now,

81
00:03:47,766 --> 00:03:51,766
just formulas on your screen, but
there will be some additional recommended

82
00:03:51,766 --> 00:03:55,433
reading at the end of this section,
so don't worry if you're not picking up

83
00:03:55,433 --> 00:03:58,466
on the math, if we haven't
explained the math right now.

84
00:03:59,066 --> 00:04:01,700
The point here is:
what is the cross entropy?

85
00:04:01,700 --> 00:04:03,566
Well, a cross entropy function.

86
00:04:03,566 --> 00:04:05,400
Remember how previously,

87
00:04:05,400 --> 00:04:08,766
in artificial neural networks,
we had a function

88
00:04:09,100 --> 00:04:12,366
called the mean squared error function,

89
00:04:12,366 --> 00:04:15,433
which we used as the cost function for

90
00:04:15,566 --> 00:04:17,700
assessing our network performance?

91
00:04:17,700 --> 00:04:23,466
And our goal was to minimize the MSE in
order to optimize our network performance.

92
00:04:23,766 --> 00:04:25,366
Well, that was our cost function

93
00:04:25,366 --> 00:04:26,700
back there.

94
00:04:26,700 --> 00:04:30,500
And in convolutional neural networks
you can

95
00:04:30,500 --> 00:04:30,766
still

96
00:04:30,766 --> 00:04:34,200
use MSE,
but a better option in convolutional

97
00:04:34,200 --> 00:04:37,366
neural networks,
after you apply the softmax function,

98
00:04:37,500 --> 00:04:39,666
turns out to be the
cross entropy function.

99
00:04:39,666 --> 00:04:43,933
And in convolutional neural networks,
when you apply the cross

100
00:04:43,933 --> 00:04:44,833
entropy function, it's not

101
00:04:44,833 --> 00:04:46,500
called the cost function anymore.

102
00:04:46,500 --> 00:04:48,166
It's called the loss function.

103
00:04:48,166 --> 00:04:49,366
And they're very similar.

104
00:04:49,366 --> 00:04:52,133
There are just little
terminological differences,

105
00:04:52,133 --> 00:04:55,433
and they're a little
bit different in what they mean.

106
00:04:55,433 --> 00:04:57,800
But for our purposes, it's pretty much

107
00:04:57,800 --> 00:04:59,533
the same thing. And

108
00:04:59,533 --> 00:05:02,266
what happens is the loss function

109
00:05:02,266 --> 00:05:03,866
is, again,

110
00:05:03,866 --> 00:05:06,266
something that we want to minimize
in order

111
00:05:06,266 --> 00:05:09,366
to maximize
the performance of our network.

112
00:05:09,533 --> 00:05:10,766
So let's have a

113
00:05:10,766 --> 00:05:12,266
look at a quick example

114
00:05:12,266 --> 00:05:15,166
of how this function can be
applied.

115
00:05:15,166 --> 00:05:16,800
So let's say we've

116
00:05:16,800 --> 00:05:19,400
put an image of a dog into our network.

117
00:05:19,400 --> 00:05:24,400
The predicted value for dog is 0.9,
and this is during the training,

118
00:05:24,400 --> 00:05:27,166
so we know the label:
it is a dog.

119
00:05:27,166 --> 00:05:29,333
So the predicted value is 0.9.

120
00:05:29,333 --> 00:05:32,233
The predicted value for cat is 0.1.

121
00:05:32,233 --> 00:05:34,066
Then here we have the label. So we know

122
00:05:34,066 --> 00:05:37,566
it's a dog because this is training:
one for dog, zero for cat.

123
00:05:37,800 --> 00:05:39,533
And so in this case

124
00:05:39,533 --> 00:05:42,200
you need to use,

125
00:05:43,300 --> 00:05:43,900
you need to plug

126
00:05:43,900 --> 00:05:47,300
these numbers into your formula
for the cross entropy.

127
00:05:47,666 --> 00:05:51,633
So how you do it
is the values on the left

128
00:05:51,633 --> 00:05:52,766
go into the variable

129
00:05:52,766 --> 00:05:56,700
q, the one that is under the logarithm
on the right side.

130
00:05:56,700 --> 00:05:59,366
And the values from the right
would go into p.

131
00:05:59,366 --> 00:06:00,966
And so it's important to remember

132
00:06:00,966 --> 00:06:02,100
which one goes where,

133
00:06:02,100 --> 00:06:05,400
because if you get them wrong...
you don't want to be taking a logarithm

134
00:06:05,400 --> 00:06:09,466
of a zero value
or a logarithm of a one.

135
00:06:09,466 --> 00:06:11,666
So you just want to

136
00:06:11,666 --> 00:06:13,633
make sure you plug them into the correct

137
00:06:13,633 --> 00:06:14,700
places.

138
00:06:14,700 --> 00:06:16,933
And then you basically add that up.

139
00:06:16,933 --> 00:06:19,366
So that's how the cross entropy works.

140
00:06:19,366 --> 00:06:21,933
And actually, right now
we're just going to look

141
00:06:21,933 --> 00:06:26,633
at a specific step by step example
of applying this function in real life.

142
00:06:26,633 --> 00:06:30,200
And it'll kind of make more sense
what cross-entropy is.

143
00:06:30,200 --> 00:06:32,266
And it'll be less, like...

144
00:06:32,266 --> 00:06:36,333
My goal in this tutorial is to make you
more comfortable with cross-entropy,

145
00:06:36,333 --> 00:06:41,733
because it can sound very convoluted,
no pun intended.

146
00:06:42,800 --> 00:06:44,033
Just like

147
00:06:44,033 --> 00:06:45,566
convolutional neural networks,

148
00:06:45,566 --> 00:06:48,566
it can sound very complex, right?

149
00:06:48,800 --> 00:06:50,700
Scary. But it's not.

150
00:06:50,700 --> 00:06:51,566
That's the point.

151
00:06:51,566 --> 00:06:54,000
So let's go ahead and apply
it just so we know that it's not scary.

152
00:06:54,000 --> 00:06:57,200
So here are the neural nets. And also, this will

153
00:06:57,200 --> 00:06:59,233
explain why we're doing this,

154
00:06:59,233 --> 00:07:01,600
why we're looking at different cost
functions.

155
00:07:01,600 --> 00:07:02,900
So, neural network one,

156
00:07:02,900 --> 00:07:05,633
neural network two:
let's say we have two neural networks.

157
00:07:05,633 --> 00:07:07,766
And then we pass in an image of a dog,

158
00:07:07,766 --> 00:07:11,700
and we know that this is a dog
and not a cat.

159
00:07:12,033 --> 00:07:16,833
And then we have another image,
this time of another animal,

160
00:07:16,833 --> 00:07:17,833
and it's a cat, not a dog.

161
00:07:17,833 --> 00:07:21,833
And here we have a weird looking animal,
which is in fact a dog,

162
00:07:21,866 --> 00:07:23,900
not a cat, if you look very closely.

163
00:07:23,900 --> 00:07:28,300
So we want to see what our neural
networks will predict. In the first case:

164
00:07:28,333 --> 00:07:31,066
Neural network one: 90% dog,

165
00:07:31,066 --> 00:07:33,200
10% cat. Correct.

166
00:07:33,200 --> 00:07:36,433
Neural network number two: 60% dog, 40% cat.

167
00:07:36,600 --> 00:07:38,800
Still correct. Worse, but correct.

168
00:07:40,133 --> 00:07:41,800
Second image:

169
00:07:41,800 --> 00:07:44,533
first neural network, 10%

170
00:07:44,533 --> 00:07:47,233
dog, 90% cat. Correct.

171
00:07:47,233 --> 00:07:49,100
Neural network number two: 30%

172
00:07:49,100 --> 00:07:51,400
dog, 70% cat.

173
00:07:51,400 --> 00:07:53,400
Worse, but still correct.

174
00:07:53,400 --> 00:07:55,266
And then finally,

175
00:07:55,266 --> 00:08:00,266
in image three,
neural network one: 40% dog, 60% cat.

176
00:08:00,566 --> 00:08:01,766
Incorrect.

177
00:08:01,766 --> 00:08:04,100
Neural network number two: 10%

178
00:08:04,100 --> 00:08:06,966
dog, 90% cat. Incorrect,

179
00:08:06,966 --> 00:08:08,100
and worse.

180
00:08:08,100 --> 00:08:10,633
So the key here is that even though

181
00:08:10,633 --> 00:08:12,900
both networks
got it wrong in the last one,

182
00:08:12,900 --> 00:08:15,800
throughout all three images neural

183
00:08:15,800 --> 00:08:18,800
network
one was outperforming neural network two.

184
00:08:18,800 --> 00:08:22,533
So even in the last case,

185
00:08:23,200 --> 00:08:27,300
it gave dog a 40% chance,
as opposed to neural network

186
00:08:27,300 --> 00:08:29,033
two, which only gave dog a 10% chance.

187
00:08:29,033 --> 00:08:32,266
So neural network
one is outperforming across the board

188
00:08:32,700 --> 00:08:35,466
when compared to neural network two.

189
00:08:35,466 --> 00:08:37,633
And so now we're going to look at

190
00:08:37,633 --> 00:08:39,266
the functions

191
00:08:39,266 --> 00:08:42,566
that can measure performance,
which we've kind of talked about already.

192
00:08:42,900 --> 00:08:44,733
So let's put these into a table.

193
00:08:44,733 --> 00:08:46,300
So this is neural network one.

194
00:08:46,300 --> 00:08:48,200
You have the row number,

195
00:08:48,200 --> 00:08:49,400
so that's the image number.

196
00:08:49,400 --> 00:08:53,700
And then for image one you have what it
predicted: 90% dog, 10% cat.

197
00:08:53,700 --> 00:08:55,433
So those are the hat variables.

198
00:08:55,433 --> 00:08:57,266
And then you have the actual values:

199
00:08:57,266 --> 00:09:00,266
dog is correct, cat is incorrect.

200
00:09:00,366 --> 00:09:04,633
Same thing for image number two
and same thing for image number three.

201
00:09:05,100 --> 00:09:07,600
And same for neural network number two.

202
00:09:07,600 --> 00:09:10,966
So, dog 60%, cat 40% in the first image.

203
00:09:10,966 --> 00:09:12,033
That's what it predicted.

204
00:09:12,033 --> 00:09:15,033
The correct answer is dog,
not cat. And so on.

205
00:09:15,033 --> 00:09:17,233
And so now let's see what errors we

206
00:09:17,233 --> 00:09:17,966
can actually get,

207
00:09:17,966 --> 00:09:21,333
so, what errors
we can calculate to estimate

208
00:09:21,333 --> 00:09:24,333
the performance and monitor
the performance of our networks.

209
00:09:24,800 --> 00:09:26,966
So one type of error is

210
00:09:26,966 --> 00:09:28,500
called the classification error.

211
00:09:28,500 --> 00:09:30,966
And that is basically just

212
00:09:30,966 --> 00:09:33,900
asking: did you get it right or not?

213
00:09:33,900 --> 00:09:37,166
Regardless of the probabilities,
it's just: did you get it right or did

214
00:09:37,166 --> 00:09:37,833
you not get it right?

215
00:09:37,833 --> 00:09:40,400
So for both

216
00:09:40,400 --> 00:09:41,500
neural networks,

217
00:09:41,500 --> 00:09:46,200
each of them got one wrong;
this is how many they got wrong.

218
00:09:46,200 --> 00:09:48,366
So they got one out of three wrong.

219
00:09:48,366 --> 00:09:51,933
So a 33% error rate for neural network

220
00:09:51,933 --> 00:09:54,933
one and 33% error
rate for neural network two.

221
00:09:54,933 --> 00:09:56,866
And so basically, from this standpoint,

222
00:09:56,866 --> 00:09:59,066
Both neural networks
perform at the same level.

223
00:09:59,066 --> 00:10:00,066
But we know that's not true.

224
00:10:00,066 --> 00:10:03,966
We know that neural network one
is outperforming neural network two.

225
00:10:05,000 --> 00:10:07,833
That's
why classification error is not a good

226
00:10:07,833 --> 00:10:10,833
measure, especially for the purposes
of backpropagation.

227
00:10:11,500 --> 00:10:13,366
Mean squared error is

228
00:10:13,366 --> 00:10:13,700
different.

229
00:10:13,700 --> 00:10:16,700
And by the way,
I did these calculations in Excel.

230
00:10:16,833 --> 00:10:18,333
I just didn't want to bore you with them,

231
00:10:18,333 --> 00:10:21,900
but you can totally just sit down
and do them on paper or in Excel.

232
00:10:21,900 --> 00:10:23,600
These are very straightforward
calculations.

233
00:10:23,600 --> 00:10:28,000
Just basically take the
sum of squared errors

234
00:10:28,000 --> 00:10:32,800
and then just take the average
across your observations.

235
00:10:32,800 --> 00:10:33,966
And that's pretty much it.

236
00:10:33,966 --> 00:10:38,700
So for neural network one
you get 25%,

237
00:10:38,966 --> 00:10:43,233
and for neural network
two you get a 71% error rate.

238
00:10:43,233 --> 00:10:45,866
So as you can see
this one is more accurate.

239
00:10:45,866 --> 00:10:48,866
It's telling us that neural network
one has a much lower error

240
00:10:48,866 --> 00:10:50,000
rate than neural network two.

241
00:10:51,000 --> 00:10:52,866
And then cross entropy. Again,

242
00:10:52,866 --> 00:10:54,866
we've seen the formula.
You can also calculate this.

243
00:10:54,866 --> 00:10:56,633
This is actually even easier to calculate

244
00:10:56,633 --> 00:10:57,966
than the mean squared error.

245
00:10:57,966 --> 00:11:02,200
Cross entropy
gives you 0.38 for neural network one

246
00:11:02,400 --> 00:11:05,366
and 1.06 for neural network two.
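
For anyone who would rather check these figures in code than in Excel, here is a small sketch that recomputes all three measures from the predictions quoted for the three images. The function and variable names are made up for this example, and the natural logarithm is an assumption (it is the choice that reproduces 0.38 and 1.06).

```python
import math

# (predicted [dog, cat], label [dog, cat]) for the three images.
nn1 = [([0.9, 0.1], [1, 0]), ([0.1, 0.9], [0, 1]), ([0.4, 0.6], [1, 0])]
nn2 = [([0.6, 0.4], [1, 0]), ([0.3, 0.7], [0, 1]), ([0.1, 0.9], [1, 0])]

def classification_error(rows):
    # Fraction of images where the most probable class is not the label.
    wrong = sum(1 for q, p in rows if q.index(max(q)) != p.index(max(p)))
    return wrong / len(rows)

def mean_squared_error(rows):
    # Sum of squared errors over both classes, averaged over the images.
    total = sum((qi - pi) ** 2 for q, p in rows for qi, pi in zip(q, p))
    return total / len(rows)

def mean_cross_entropy(rows):
    # Minus the log of the probability given to the correct class, averaged.
    total = -sum(pi * math.log(qi)
                 for q, p in rows for qi, pi in zip(q, p) if pi > 0)
    return total / len(rows)

for name, rows in (("NN1", nn1), ("NN2", nn2)):
    print(name,
          round(classification_error(rows), 2),
          round(mean_squared_error(rows), 2),
          round(mean_cross_entropy(rows), 2))
```

The classification error comes out the same for both networks, while the mean squared error and the cross entropy both separate them, matching the numbers in the table.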

247
00:11:05,366 --> 00:11:08,133
So you can see
the results are a bit different.

248
00:11:08,133 --> 00:11:11,133
when you look at them like that,
when you look at,

249
00:11:11,600 --> 00:11:17,200
you know, the mean squared error
and cross entropy. The question of

250
00:11:17,200 --> 00:11:20,900
why you would use cross entropy over

251
00:11:21,600 --> 00:11:25,733
mean squared error isn't just about

252
00:11:25,733 --> 00:11:28,600
the numbers that
they spit out. These calculations

253
00:11:28,600 --> 00:11:30,633
were just to show you that this

254
00:11:30,633 --> 00:11:33,600
is all doable.
You can just do it on paper.

255
00:11:33,600 --> 00:11:37,800
These are not very intense mathematics.

256
00:11:37,800 --> 00:11:38,366
These are

257
00:11:38,366 --> 00:11:41,100
pretty simple,
straightforward things.

258
00:11:41,100 --> 00:11:44,466
But the question of
why you would use cross

259
00:11:44,466 --> 00:11:46,166
entropy over mean squared error

260
00:11:46,166 --> 00:11:48,133
is a very, very good question to ask.

261
00:11:48,133 --> 00:11:49,166
I'm glad you asked it.

262
00:11:49,166 --> 00:11:52,166
The answer to that is that

263
00:11:52,166 --> 00:11:57,066
there are several advantages of

264
00:11:57,066 --> 00:12:01,300
cross entropy over mean squared error,
which are not obvious.

265
00:12:01,300 --> 00:12:05,366
And so I'll mention a couple,
but then

266
00:12:05,366 --> 00:12:07,066
I'll let you know
where you can find out more.

267
00:12:07,066 --> 00:12:13,533
So one of them is that if,
for instance, at the very start

268
00:12:13,533 --> 00:12:16,600
of your backpropagation,

269
00:12:16,900 --> 00:12:21,133
your output value is very, very, very,
very tiny,

270
00:12:21,133 --> 00:12:22,233
very tiny,

271
00:12:22,233 --> 00:12:23,533
so it's much smaller

272
00:12:23,533 --> 00:12:25,566
than the actual value that you want,

273
00:12:25,566 --> 00:12:28,400
then at the very start the gradient

274
00:12:28,400 --> 00:12:31,266
in your gradient
descent will be very, very low.

275
00:12:31,266 --> 00:12:35,433
And it won't be enough;
it'll be very hard for

276
00:12:35,466 --> 00:12:38,933
the neural network to actually
start doing

277
00:12:38,933 --> 00:12:41,700
something and start moving around
and start adjusting those weights

278
00:12:41,700 --> 00:12:43,666
so that you start actually

279
00:12:43,666 --> 00:12:45,000
moving in the right direction.

280
00:12:45,000 --> 00:12:46,966
Whereas when you use something like the

281
00:12:46,966 --> 00:12:50,133
cross entropy,
because it's got that logarithm in it, it

282
00:12:50,133 --> 00:12:54,533
actually helps the network assess even

283
00:12:54,533 --> 00:12:57,433
a small error like that
and just do something about it.

284
00:12:57,433 --> 00:12:58,433
Here's how to think about it.

285
00:12:58,433 --> 00:13:03,166
And again, this is a very
intuitive approach.

286
00:13:03,166 --> 00:13:04,833
There's going to be

287
00:13:04,833 --> 00:13:07,833
a link to the mathematics,
and you can derive these things

288
00:13:07,833 --> 00:13:09,400
through the mathematics in more detail.

289
00:13:09,400 --> 00:13:12,400
But a very intuitive approach: let's say

290
00:13:12,633 --> 00:13:13,366
your,

291
00:13:14,400 --> 00:13:14,833
like, your

292
00:13:14,833 --> 00:13:17,566
outcome that you want is one.

293
00:13:17,566 --> 00:13:22,666
And right now you are at
one millionth of one.

294
00:13:22,666 --> 00:13:25,166
Right? So 0.000001.

295
00:13:25,166 --> 00:13:28,000
And then next

296
00:13:28,000 --> 00:13:32,400
time you improve your outcome
from one millionth to one thousandth.

297
00:13:32,700 --> 00:13:37,566
And if you calculate
the squared error, you're just

298
00:13:37,566 --> 00:13:40,800
subtracting one from the other;
basically, in each case

299
00:13:40,800 --> 00:13:43,800
you're calculating the squared error,
and you'll see that the squared error,

300
00:13:43,800 --> 00:13:48,033
when you compare one case versus
the other, didn't change that much.

301
00:13:48,033 --> 00:13:49,266
You didn't improve your

302
00:13:49,266 --> 00:13:51,966
network that much when you're looking
at the mean squared error.

303
00:13:51,966 --> 00:13:55,233
But if you're looking at cross entropy,

304
00:13:55,233 --> 00:13:58,800
because you're taking a logarithm
and then you're comparing the

305
00:13:58,800 --> 00:14:02,433
two, dividing one by the other,
you will see

306
00:14:02,433 --> 00:14:06,066
that you have actually improved
your network significantly.
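
The intuition can be checked with a few lines of arithmetic. This is only a sketch of the single-value example described here: a target of one, and a prediction that moves from one millionth to one thousandth. The variable names are made up and the natural logarithm is an assumption.

```python
import math

target = 1.0
before, after = 1e-6, 1e-3  # predicted value for the correct class

# Squared error before and after the improvement.
se_before = (target - before) ** 2
se_after = (target - after) ** 2

# Cross entropy for the correct class: minus the log of the prediction.
ce_before = -math.log(before)
ce_after = -math.log(after)

print("squared error:", se_before, "->", se_after)
print("cross entropy:", round(ce_before, 2), "->", round(ce_after, 2))
```

The squared error moves by only about 0.002, while the cross entropy is cut in half, which is the "huge improvement in relative terms" that the logarithm makes visible.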

307
00:14:06,066 --> 00:14:10,966
So that jump from
one millionth to one thousandth in mean

308
00:14:10,966 --> 00:14:15,233
squared error terms will be very low,
it will be insignificant, and

309
00:14:15,733 --> 00:14:18,300
it won't guide your gradient

310
00:14:18,300 --> 00:14:21,966
descent process or your backpropagation
in the right direction.

311
00:14:21,966 --> 00:14:24,133
Well, it will
guide it in the right direction,

312
00:14:24,133 --> 00:14:26,666
but it'll be like a very slow guidance.

313
00:14:26,666 --> 00:14:29,466
It won't have enough power.

314
00:14:29,466 --> 00:14:30,066
Whereas

315
00:14:30,066 --> 00:14:32,933
if you do it through cross entropy,
cross entropy will

316
00:14:32,933 --> 00:14:35,400
understand that, oh,
even though these are very small

317
00:14:35,400 --> 00:14:38,400
adjustments that are just,
you know, making

318
00:14:38,400 --> 00:14:43,500
a tiny change in absolute terms, in
relative terms it's a huge improvement,

319
00:14:43,733 --> 00:14:46,000
and we are definitely going
in the right direction.

320
00:14:46,000 --> 00:14:47,133
Let's keep going that way.

321
00:14:47,133 --> 00:14:50,700
So cross entropy
will help your neural network.

322
00:14:52,666 --> 00:14:55,800
get to
the optimal state.

323
00:14:56,700 --> 00:15:01,000
It's a better way for the neural network
to get to an optimal state.

324
00:15:01,000 --> 00:15:02,100
But bear in mind

325
00:15:02,100 --> 00:15:06,433
that the cross entropy
is the preferred

326
00:15:06,533 --> 00:15:08,166
method only for classification.

327
00:15:08,166 --> 00:15:09,133
So if

328
00:15:09,133 --> 00:15:11,266
you're talking about things
like regression,

329
00:15:11,266 --> 00:15:13,800
which we had
in artificial neural networks,

330
00:15:13,800 --> 00:15:15,733
then you would rather

331
00:15:15,733 --> 00:15:17,400
go with mean squared error,

332
00:15:17,400 --> 00:15:18,000
whereas cross

333
00:15:18,000 --> 00:15:20,533
entropy is better for classification.

334
00:15:20,533 --> 00:15:23,600
And again, it has to do with the fact
that we're using the softmax function.

335
00:15:23,600 --> 00:15:26,600
So that's a kind of
intuitive explanation of that.

336
00:15:26,900 --> 00:15:29,533
A good place
to learn a bit more about that, if you're

337
00:15:29,533 --> 00:15:33,466
really interested in, you know,
why we are using cross entropy versus

338
00:15:33,466 --> 00:15:34,233
mean squared error:

339
00:15:34,233 --> 00:15:38,266
Google a video
by Geoffrey Hinton called

340
00:15:38,266 --> 00:15:40,533
The Softmax Output Function.

341
00:15:40,533 --> 00:15:42,800
He explains it very well.

342
00:15:42,800 --> 00:15:43,666
And, you know, being

343
00:15:43,666 --> 00:15:47,600
the godfather of deep learning,
who can explain it better anyway?

344
00:15:47,900 --> 00:15:50,033
And by the way, any video

345
00:15:50,033 --> 00:15:51,600
by Geoffrey Hinton is golden.

346
00:15:51,600 --> 00:15:54,000
He's just got a huge
talent for explaining things.

347
00:15:55,166 --> 00:15:57,233
Anyway, so that's

348
00:15:57,233 --> 00:15:58,533
softmax and cross entropy.

349
00:15:58,533 --> 00:16:00,766
I hope that gives
you kind of like an intuitive

350
00:16:00,766 --> 00:16:02,200
understanding of what's
going on here, but

351
00:16:02,200 --> 00:16:06,300
more importantly, that you're not put off
by the term cross entropy,

352
00:16:06,400 --> 00:16:08,966
because Hadelin will mention it
in the practical tutorials.

353
00:16:08,966 --> 00:16:11,133
And I wanted to make sure
that you're prepared for that.

354
00:16:11,133 --> 00:16:12,866
And it's just another

355
00:16:12,866 --> 00:16:16,266
way of calculating your loss function
and another way

356
00:16:16,266 --> 00:16:19,733
of optimizing your network,
which is specifically tailored to

357
00:16:20,266 --> 00:16:23,533
classification problems,
and therefore convolutional neural

358
00:16:23,533 --> 00:16:27,533
networks, and comes
hand in hand with the softmax function.

359
00:16:28,133 --> 00:16:31,700
So, additional reading.
If you'd like a light introduction

360
00:16:31,700 --> 00:16:35,233
into cross entropy, if you're interested

361
00:16:35,233 --> 00:16:36,366
in cross entropy a bit more,

362
00:16:36,366 --> 00:16:40,266
a good article to check out
is called A Friendly Introduction

363
00:16:40,266 --> 00:16:45,266
to Cross-Entropy Loss by Rob DiPietro, 2016.

364
00:16:45,266 --> 00:16:47,033
The link is below.

365
00:16:47,033 --> 00:16:48,100
Very, very nice,

366
00:16:48,100 --> 00:16:50,400
very soft,

367
00:16:50,400 --> 00:16:52,000
no

368
00:16:52,000 --> 00:16:53,933
super complex math,

369
00:16:53,933 --> 00:16:56,100
good analogies, good examples.

370
00:16:56,100 --> 00:16:57,433
It uses analogies of cars:

371
00:16:57,433 --> 00:17:00,066
it looks at cars and talks
about information and bits

372
00:17:00,066 --> 00:17:03,233
and restrictions, and, you know,
how would you encode this?

373
00:17:03,233 --> 00:17:05,300
How would you encode that?
It's a good article to

374
00:17:05,300 --> 00:17:05,766
have a look at,

375
00:17:05,766 --> 00:17:08,766
and it will give you a good overview
of cross entropy

376
00:17:09,000 --> 00:17:11,766
from an introductory standpoint.

377
00:17:11,766 --> 00:17:12,800
If you want to dig

378
00:17:12,800 --> 00:17:17,466
into the heavy math, like what
you see here, then check out an article,

379
00:17:17,500 --> 00:17:22,466
or a blog post, called How to Implement
a Neural Network, Intermezzo 2.

380
00:17:22,466 --> 00:17:25,600
So an intermezzo
is like an intermediate thing, like an

381
00:17:26,800 --> 00:17:28,333
interim, an intermission,

382
00:17:28,333 --> 00:17:32,066
you know, like when you
go to a theater and you have a break

383
00:17:32,633 --> 00:17:35,966
between
the first part and the second part.

384
00:17:36,133 --> 00:17:38,933
So he's going through
all these steps

385
00:17:38,933 --> 00:17:41,733
and then he says,
I've got to explain this first.

386
00:17:41,733 --> 00:17:44,000
And yeah,
so that's why it's called Intermezzo.

387
00:17:44,000 --> 00:17:46,733
That's the reason,
as far as I understand. The

388
00:17:46,733 --> 00:17:49,133
article is by Peter Roelants,

389
00:17:49,133 --> 00:17:50,633
2016 as well.

390
00:17:50,633 --> 00:17:53,633
So both are quite recent. And yeah,

391
00:17:53,700 --> 00:17:57,433
check this out if you would like to dig
into the mathematics behind

392
00:17:57,833 --> 00:18:02,200
cross entropy. He covers softmax and cross
entropy in this article, actually.

393
00:18:02,733 --> 00:18:03,700
So there we go.

394
00:18:03,700 --> 00:18:07,200
That's all there is to these two.

395
00:18:07,200 --> 00:18:11,966
Hopefully I was able to add
some additional clarity, and good luck

396
00:18:11,966 --> 00:18:12,633
with that.

397
00:18:12,633 --> 00:18:16,833
It's going to be fun,
so enjoy the practical tutorials.

398
00:18:16,833 --> 00:18:19,633
I'll see you next time.
Until then, enjoy deep learning.