Hey everyone, congratulations, first of all. If you understood what I taught you in the last few lectures, you understand neural networks as they are used for prediction purposes. In this lecture, I'm going to talk about several subtopics. These topics came up in the previous lectures, but to maintain the flow I did not discuss them in detail at that time. Also, this extra knowledge is often what is asked in interview questions. In fact, I'm going to cover this lecture in a question-and-answer format, so keep paying attention, as this is also very important.

The first topic is activation functions, and the first question we are going to discuss is: why do we use activation functions?

Let's see what happens if we do not have any activation function. What would the output of a neuron be? The output would be given by this equation: w1·x1 + w2·x2 + b = z, where z is the output. This means that the output could be any real number, with no boundaries. If you're solving a regression problem, this may be acceptable. But when we are doing classification, that is, when we want an output of a yes/no or one/zero type, we need to restrict the output values to get a zero/one type of output.
Also, if there are only linear neurons in the whole network, you can only predict a linear relationship between the input and output variables.

So the answer is: we use an activation function for two reasons. First, to put boundary conditions on our output. For classification, the boundaries are obvious. For regression also, if your output has some boundary, you can use an activation function. The second reason is to introduce non-linearity, so that we can find complex, non-linear patterns as well.

The next question is: what are the different types of activation functions? Earlier, we had discussed two activation functions. One was the step function, which is zero below a threshold value and then jumps straight to one at the threshold. Then we discussed the sigmoid function, which is a continuous S-shaped curve with zero as the lower boundary and one as the upper boundary. For most practical business purposes, the sigmoid function is good enough, but for rare scenarios where computation is an issue, we sometimes use two other activation functions because of their convergence efficiency. The first is the hyperbolic tangent function, or tanh. The graph of this function is almost similar to the sigmoid in shape, but it has different boundaries: an upper boundary of one and a lower boundary of minus one.
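To see the second reason concretely, here is a small sketch (the layer sizes and weight values are invented for illustration) showing that stacking two purely linear layers collapses into a single linear layer, which is why non-linearity must come from the activation function:

```python
import numpy as np

# Hypothetical weights for two linear (no-activation) layers.
W1 = np.array([[0.5, -1.0], [2.0, 0.3]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[1.5, 0.7]])
b2 = np.array([0.05])

x = np.array([1.0, 2.0])

# Two stacked linear layers...
out_stacked = W2 @ (W1 @ x + b1) + b2

# ...equal one single linear layer with combined weights.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
out_single = W_combined @ x + b_combined

print(np.allclose(out_stacked, out_single))  # True
```

However many linear layers you stack, the result is still one linear map, so the network can never capture a non-linear pattern without activations.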
And because it is centered at zero, it almost always has better convergence efficiency than sigmoid.

The second is ReLU, which is short for rectified linear unit. It is a very widely used function, especially in the inner layers of neural networks. This is how the function looks: below zero, the function outputs zero; above zero, the function outputs the same as its input, that is, f(x) = x. So the lower bound is zero, but there is no upper bound. This function performs well because it is very easy to compute. The reason for using this function in hidden layers is that it introduces non-linearity across the different layers. However, on the output layer it is rarely used, because for classification the right side of the function is not bounded, and therefore it cannot be used. On the other hand, for regression, the left side of this function is bounded, and therefore this function cannot be used either. So this function is good for activating hidden layers, but not for activating output layers.

You can find a summary of all these activation functions in the next slide. This is for your reference.

This brings us to the next question, which we have already answered: can hidden layers and output layers have different activation functions? The answer is yes.
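As a quick reference sketch, the four functions discussed above can be written in a few lines each (the threshold value for the step function is a parameter you choose; zero is just a common default):

```python
import numpy as np

def step(z, threshold=0.0):
    # 0 below the threshold, 1 at or above it
    return np.where(z >= threshold, 1.0, 0.0)

def sigmoid(z):
    # continuous S-shaped curve, bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # same shape as sigmoid, but bounded between -1 and 1 and centered at 0
    return np.tanh(z)

def relu(z):
    # 0 for negative inputs, identity for positive inputs; no upper bound
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(step(z))     # [0. 1. 1.]
print(sigmoid(z))  # values strictly between 0 and 1
print(tanh(z))     # values strictly between -1 and 1
print(relu(z))     # [0. 0. 2.]
```

Note how ReLU's cheapness shows up here: it is just a comparison and a maximum, with no exponential to evaluate.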
As I told you, we can implement ReLU in the hidden layers and a sigmoid in the output layer. Any such combination is allowed by our software tools.

The next question is: what is multiclass classification, and is there any special activation function for multiclass classification? So first of all, what is multiclass classification? Suppose you are classifying into yes or no, or one or zero. This is binary classification, because there are only two classes. But what if we have more than two classes? Say we want to classify images into shirts, trousers, ties and socks. Now the output cannot be just zero or one. We cannot simply assign zero for shirts, one for trousers, two for ties, three for socks; that would not give us the right answer. So we have to handle such a situation in a slightly different manner. For this situation, we have an activation function called softmax.

This activation function works similarly to sigmoid, but has an additional step. What we do in multiclass classification is we usually keep as many output neurons as we have classes. So if we have three classes, like shirts, trousers and socks, we keep three neurons in the output layer. You can see in this image that these three output neurons correspond to the three output classes.
These three output neurons have the sigmoid activation function only. So the output of the first neuron would lie between zero and one, and we can say that this output value corresponds to the probability of whether the item is a shirt or not. The second output would also be between zero and one, and we can say that it is the probability of whether it is a trouser or not. And the third output will also be between zero and one, and we can say that it is the probability of whether it is socks or not.

But the thing is, the item can be only one of these, which means the sum of the probabilities should come out to be one: either it is the shirt, or it is the trouser, or it is the socks. To implement this, we put an additional softmax layer, and we feed the results of these three neurons into this softmax layer. The softmax layer simply divides each of the values by the sum of all the values. Now each output can be considered as the probability of that class occurring, and the sum of all these probabilities will be equal to one.

So for multiclass classification, softmax activation is often used on the output layer. That's all about activation functions. The next topic is gradient descent.
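A minimal sketch of the idea: the standard softmax first exponentiates each neuron's raw score and then divides by the sum, which both keeps every output positive and forces the outputs to sum to one (the plain divide-by-the-sum normalization described above captures the same intuition; the exponentiation is the extra detail most libraries use). The class names and raw scores here are invented for illustration:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability, exponentiate, then normalize
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores from the three output neurons
scores = np.array([2.0, 1.0, 0.1])   # shirt, trouser, socks
probs = softmax(scores)

print(probs)              # one probability per class
print(np.sum(probs))      # ~1.0 -- the probabilities always sum to one
print(np.argmax(probs))   # 0 -- "shirt" is the most likely class here
```

The predicted class is simply the output neuron with the highest probability.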
The question I want to discuss here is: what is the difference between gradient descent and stochastic gradient descent? This is a very common question in the minds of students, because they find "stochastic" written in some places, while in other texts it is not written at all. Let me clarify this for you.

What we discussed in our previous lectures was actually stochastic gradient descent, because I told you that we take each individual training record and update our weights and biases with each training record. When we run the whole forward and backward propagation for each individual training record, that is stochastic gradient descent.

But if you run forward propagation for the entire training set in one go, find the average error on the entire set, and then update the weights and biases during backward propagation, that is gradient descent.

There is another variation in which we make small batches out of the training set and use these batches instead of the complete set. This one is called mini-batch gradient descent.

The point is, stochastic gradient descent starts updating weights and biases rapidly, but it finds difficulty in converging, whereas gradient descent is slow because in each pass it has to go through the entire training set.
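The loop structure is really the only difference between the three variants. Here is a minimal sketch using an invented one-weight linear model fitted to toy data by minimizing squared error; a real network does the same thing, with the gradient computed by full forward and backward propagation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)   # toy data: true weight is 3

def grad(w, xb, yb):
    # gradient of mean squared error for a single-weight linear model
    return np.mean(2 * (w * xb - yb) * xb)

lr = 0.1

# Gradient descent: one update per pass over the whole training set
w_gd = 0.0
for _ in range(50):
    w_gd -= lr * grad(w_gd, X, y)

# Stochastic gradient descent: one update per individual record
w_sgd = 0.0
for _ in range(5):                       # 5 epochs
    for xi, yi in zip(X, y):
        w_sgd -= lr * grad(w_sgd, np.array([xi]), np.array([yi]))

# Mini-batch gradient descent: one update per small batch
w_mb = 0.0
batch = 10
for _ in range(5):                       # 5 epochs
    for i in range(0, len(X), batch):
        w_mb -= lr * grad(w_mb, X[i:i+batch], y[i:i+batch])

print(w_gd, w_sgd, w_mb)   # all close to the true weight 3.0
```

Notice that SGD makes 100 noisy updates per epoch while full gradient descent makes exactly one smooth update per epoch; mini-batch sits in between.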
But the good thing about gradient descent is that it converges very well. So we have to select accordingly which optimization technique is to be used. We will see this further in the practical part of this course.

The last thing I want to discuss in this lecture is the epoch. In neural networks, an epoch is one cycle through the full training data. So when we say epochs is equal to five, it means we want the full training data to be fed five times. Now, an epoch is different from an iteration. Suppose you have 1000 training records. If you start feeding the records one by one, like we do in stochastic gradient descent, you will have to iterate 1000 times. So iterations are the number of times you execute the process within one pass over the full training set. On the other hand, if you have set epochs to two, this means that the 1000 training examples will be fed two times, either one by one, or all at the same time, or in mini-batches. The idea of using epochs is to allow the network to readjust its weights on the same data. We will see how we can specify and use epochs in the practical lectures.

That's all in this lecture. In the upcoming video, we will summarize the key parameters that you must know while implementing neural networks in the software.
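The arithmetic relating epochs and iterations can be sketched in a few lines (the 1000 records match the example above; the batch size of 50 is an assumed value for the mini-batch case):

```python
def iterations_per_epoch(n_records, batch_size):
    # number of weight updates in one full pass over the training data
    return -(-n_records // batch_size)   # ceiling division

n_records = 1000

# Stochastic gradient descent: batch size 1 -> one update per record
print(iterations_per_epoch(n_records, 1))          # 1000

# Full-batch gradient descent: one update per epoch
print(iterations_per_epoch(n_records, n_records))  # 1

# Mini-batch with a hypothetical batch size of 50
print(iterations_per_epoch(n_records, 50))         # 20

# With epochs = 2, the total number of iterations doubles
epochs = 2
print(epochs * iterations_per_epoch(n_records, 50))  # 40
```

So "epochs" counts passes over the data, while "iterations" counts weight updates; the batch size is what connects the two.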