1
00:00:00,720 --> 00:00:01,760
Hey everyone.

2
00:00:02,120 --> 00:00:03,370
Congratulations, first of all.

3
00:00:04,280 --> 00:00:09,980
If you understood whatever I taught you in the last few lectures, you understand neural networks in the

4
00:00:09,980 --> 00:00:14,210
way they are used for prediction purposes.

5
00:00:14,210 --> 00:00:21,430
In this lecture I'm going to talk about several subtopics. These subtopics came up in our lectures

6
00:00:21,440 --> 00:00:25,030
previously also, but to maintain the flow,

7
00:00:25,130 --> 00:00:29,760
I did not discuss them in detail at that time.

8
00:00:29,850 --> 00:00:36,760
Also, this extra knowledge is often what is asked in interview questions.

9
00:00:36,820 --> 00:00:43,700
In fact, I'm going to cover this lecture in a question-and-answer format, so keep your attention, as this

10
00:00:43,700 --> 00:00:44,830
is also very important.

11
00:00:47,100 --> 00:00:53,500
The first subtopic is that of activation functions, and the first question that we are going to discuss is:

12
00:00:53,660 --> 00:00:58,640
why do we use activation functions?

13
00:00:58,650 --> 00:01:02,640
Let's see. If we do not have any activation function,

14
00:01:02,640 --> 00:01:04,810
what is the output of a neuron?

15
00:01:04,860 --> 00:01:16,090
The output would be given by this equation: W1·X1 + W2·X2 + B = Z, which is the output.

16
00:01:16,100 --> 00:01:22,990
This means that the output could be any real number, with no boundaries. If you are solving a regression

17
00:01:22,990 --> 00:01:26,380
problem, then this may be acceptable.

18
00:01:26,380 --> 00:01:34,670
But when we are doing classification, that is, when we want an output of yes/no type or 1/0 type,

19
00:01:34,780 --> 00:01:41,780
we need to treat the output Z to get a 0/1 type output.

20
00:01:41,950 --> 00:01:51,130
Also, if there are only linear neurons in the whole network, you can only predict a linear relationship

21
00:01:51,280 --> 00:01:53,440
between input and output variables.

22
00:01:54,930 --> 00:02:00,360
So the answer is: we use an activation function for two reasons.

23
00:02:00,360 --> 00:02:04,860
First, to put boundary conditions on our output. For classification,

24
00:02:04,860 --> 00:02:08,420
the boundaries are obvious. For regression also,

25
00:02:08,500 --> 00:02:14,470
if you have some boundary, you can use an activation function.

26
00:02:14,470 --> 00:02:21,490
The second reason is to introduce nonlinearity, so that we can find complex nonlinear patterns

27
00:02:21,490 --> 00:02:21,880
as well.

28
00:02:25,270 --> 00:02:31,390
The next question is: what are the different types of activation functions? Earlier,

29
00:02:31,400 --> 00:02:34,080
we had discussed two activation functions.

30
00:02:34,100 --> 00:02:39,410
One was the step function, which is zero below a threshold value,

31
00:02:39,800 --> 00:02:45,170
and then it suddenly jumps to one at the threshold value.
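To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the course; all values are made up) of a neuron with no activation versus one with a step activation:

```python
import numpy as np

def neuron_output(x, w, b):
    # Purely linear combination W1*X1 + W2*X2 + B: can be any real number.
    return np.dot(w, x) + b

def step(z, threshold=0.0):
    # Step activation: 0 below the threshold, 1 at or above it.
    return 1 if z >= threshold else 0

x = np.array([0.5, -1.2])   # example inputs X1, X2
w = np.array([0.8, 0.3])    # example weights W1, W2
b = 0.1                     # example bias B
z = neuron_output(x, w, b)
print(z)        # unbounded linear output (0.14 here)
print(step(z))  # bounded 0/1 output suitable for classification
```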
32
00:02:45,170 --> 00:02:52,280
Then we discussed the sigmoid function, which is a continuous S-shaped curve with zero as the lower boundary

33
00:02:53,030 --> 00:03:01,220
and one as the upper boundary. For most practical business purposes the sigmoid function is good enough, but

34
00:03:01,220 --> 00:03:07,280
for rare scenarios where computation is an issue, we sometimes use two other activation functions

35
00:03:07,280 --> 00:03:15,740
as well, because of their convergence efficiency. The first is the hyperbolic tangent function, or tanh.

36
00:03:17,170 --> 00:03:23,640
The graph of this function is almost similar to the sigmoid in shape, but it has different boundaries.

37
00:03:24,040 --> 00:03:31,780
It has an upper boundary of one and a lower boundary of minus one, and because it is centered at zero, it

38
00:03:31,900 --> 00:03:38,060
almost always has better convergence efficiency than the sigmoid.

39
00:03:38,200 --> 00:03:45,310
The second is the ReLU, which is short for rectified linear unit.

40
00:03:45,310 --> 00:03:53,030
It is a very widely used function, especially in the inner layers of deep neural networks.

41
00:03:53,080 --> 00:04:03,550
This is how this function looks: up to 0 the function outputs 0, but after 0 the function outputs

42
00:04:03,880 --> 00:04:10,080
the same as the input, that is, f(x) = x.

43
00:04:10,780 --> 00:04:15,740
So the lower bound is 0, but there is no upper bound.

44
00:04:16,750 --> 00:04:21,880
This function performs well because it is very easy to compute.

45
00:04:22,210 --> 00:04:28,570
The reason for using this function in hidden layers is that it introduces nonlinearity

46
00:04:29,450 --> 00:04:31,920
across the different layers.

47
00:04:31,990 --> 00:04:39,820
However, on the output layer it is rarely used, because for classification the right side of the function

48
00:04:39,910 --> 00:04:46,860
is not bounded, and therefore it cannot be used; on the other hand, for regression,

49
00:04:46,930 --> 00:04:52,700
the left side of this function is bounded at zero, and therefore this function cannot be used either.

50
00:04:54,340 --> 00:05:02,900
So this function is good for activating hidden layers, but not for activating output layers.

51
00:05:03,140 --> 00:05:07,900
You can find a summary of all these activation functions on the next slide.

52
00:05:07,940 --> 00:05:13,390
This is for your reference.

53
00:05:13,470 --> 00:05:20,250
This brings us to the next question, which we have already answered: can hidden layers and output layers

54
00:05:20,460 --> 00:05:23,370
have different activation functions?

55
00:05:23,370 --> 00:05:24,710
The answer is yes.

56
00:05:24,960 --> 00:05:30,690
As I told you, we can implement ReLU in the hidden layers and the sigmoid in the output layer.

57
00:05:32,310 --> 00:05:35,220
Any such combination is allowed by our software too.
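To show what such a combination looks like in code, here is a minimal sketch assuming a TensorFlow/Keras-style API (which may or may not be the software used in this course; the layer sizes are arbitrary examples): ReLU in the hidden layers and sigmoid on the output layer of a binary classifier.

```python
# Hedged sketch: assumes TensorFlow/Keras; layer sizes are made up.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(2,)),               # two input features
    Dense(16, activation="relu"),    # hidden layer: ReLU for nonlinearity
    Dense(8, activation="relu"),     # second hidden layer, also ReLU
    Dense(1, activation="sigmoid"),  # output layer: sigmoid gives a 0-1 output
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
```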
58
00:05:38,240 --> 00:05:46,120
The next question is: what is multi-class classification, and is there any special activation function for

59
00:05:46,160 --> 00:05:49,850
multi-class classification?

60
00:05:49,850 --> 00:05:53,030
So first of all, what is multi-class classification?

61
00:05:53,030 --> 00:05:58,510
Suppose you are classifying into yes or no, or 1 or 0.

62
00:05:58,580 --> 00:06:02,960
This is binary classification, because there are only two classes.

63
00:06:03,370 --> 00:06:05,810
But what if we have more than two classes?

64
00:06:06,260 --> 00:06:12,560
Say we want to classify images into shirts, trousers, ties, and socks.

65
00:06:12,710 --> 00:06:16,140
Now the output cannot be 0 or 1.

66
00:06:16,490 --> 00:06:25,010
We cannot simply assign 0 for shirts, 1 for trousers, 2 for ties, and 3 for socks; that would not

67
00:06:25,010 --> 00:06:26,110
give us the right answer.

68
00:06:27,580 --> 00:06:34,110
So we have to handle the situation in a slightly different manner. For such a situation,

69
00:06:34,130 --> 00:06:36,950
we have an activation function called softmax.

70
00:06:40,650 --> 00:06:44,030
This activation function works similarly to the sigmoid,

71
00:06:44,460 --> 00:06:48,150
but it has an additional step.

72
00:06:48,150 --> 00:06:57,190
What we do in multi-class classification is we usually keep as many output neurons as we have classes.

73
00:06:57,210 --> 00:07:07,550
So if we have three classes, like shirts, trousers, and socks, we keep three neurons at the output layer.

74
00:07:07,650 --> 00:07:17,630
You can see in this image that these three output neurons correspond to the three output classes. These three

75
00:07:17,870 --> 00:07:22,110
output neurons have the sigmoid activation function only.

76
00:07:22,970 --> 00:07:31,940
So the output of the first neuron would lie between 0 and 1, and we can say that the output value

77
00:07:31,940 --> 00:07:37,300
corresponds to the probability of whether it is a shirt or not.

78
00:07:37,400 --> 00:07:42,590
The second output would also be between 0 and 1, and we can say that it is the probability of whether

79
00:07:42,590 --> 00:07:44,640
it is a trouser or not.

80
00:07:44,750 --> 00:07:50,690
And the third output will also be between 0 and 1, and we can say that it is the probability of whether

81
00:07:50,690 --> 00:07:53,090
it is socks or not.

82
00:07:54,830 --> 00:08:02,990
But the thing is, the item can be only one of these; that is, the sum of the probabilities should come out to be

83
00:08:02,990 --> 00:08:03,850
one.

84
00:08:04,430 --> 00:08:09,770
Either it is a shirt, or it is a trouser, or it is socks.

85
00:08:09,950 --> 00:08:19,010
To implement this, we put in an additional softmax layer, and we feed the results of these three neurons

86
00:08:19,190 --> 00:08:22,040
into this softmax layer.

87
00:08:22,340 --> 00:08:30,670
This softmax layer simply divides each of the values by the sum of all the values.

88
00:08:31,610 --> 00:08:37,790
Now this output can be considered the probability of that class occurring, and the sum of all these

89
00:08:37,790 --> 00:08:40,070
probabilities will also be equal to one.

90
00:08:43,780 --> 00:08:52,460
So for multi-class classification, a softmax activation is often used on the output layer.
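For reference, here is a minimal NumPy sketch of the softmax step described above (my own illustration; the scores are made up). One caveat: the lecture describes dividing each value by the sum of all values, while the standard softmax used in most libraries first exponentiates the scores and then normalizes, which is what this sketch does.

```python
import numpy as np

def softmax(scores):
    # Exponentiate, then divide each value by the sum of all values,
    # so the outputs are positive and sum to exactly 1.
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.5])  # made-up raw outputs for shirt, trouser, socks
probs = softmax(scores)
print(probs)        # roughly [0.63, 0.23, 0.14]: per-class probabilities
print(probs.sum())  # 1.0
```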
91
00:08:53,980 --> 00:08:57,400
That's all about the activation functions.

92
00:08:57,400 --> 00:09:01,620
The next topic is gradient descent.

93
00:09:01,750 --> 00:09:07,750
The question I want to discuss here is: what is the difference between gradient descent and stochastic

94
00:09:07,750 --> 00:09:10,460
gradient descent?

95
00:09:10,480 --> 00:09:18,250
This is a very common question in the minds of students, because they find "stochastic" written in some

96
00:09:18,250 --> 00:09:22,870
places, while it is not written in other texts.

97
00:09:22,870 --> 00:09:24,840
Let me clarify this for you.

98
00:09:25,240 --> 00:09:32,620
What we discussed in our previous lectures was actually stochastic gradient descent, because I told you

99
00:09:32,770 --> 00:09:41,080
that we take each individual training record and update our weights and biases with each training record.

100
00:09:43,080 --> 00:09:51,040
When we run the whole forward and backward propagation for each individual training record, that is stochastic

101
00:09:51,040 --> 00:09:51,880
gradient descent.

102
00:09:54,460 --> 00:10:02,500
But if you run forward propagation for the entire training set in one go, find the average error

103
00:10:02,500 --> 00:10:11,300
on the entire set, and then update the weights and biases during backward propagation, that is gradient descent.

104
00:10:11,440 --> 00:10:17,740
There is another variation, in which we make small batches out of the training set and use these batches

105
00:10:17,860 --> 00:10:19,790
instead of the complete set.

106
00:10:20,410 --> 00:10:25,830
This one is called mini-batch gradient descent.

107
00:10:25,880 --> 00:10:37,070
The point is, stochastic gradient descent starts updating the weights and biases rapidly, but it has difficulty

108
00:10:37,070 --> 00:10:46,760
converging, whereas gradient descent is slow, because in each pass it has to go through the entire

109
00:10:46,760 --> 00:10:47,380
training set.

110
00:10:49,100 --> 00:10:56,110
But the good thing about gradient descent is that it converges very well. So we have to select accordingly

111
00:10:56,680 --> 00:10:59,650
which optimization technique is to be used.

112
00:11:00,820 --> 00:11:03,820
We will see this further in the practical part of this course.

113
00:11:06,370 --> 00:11:13,250
The last thing I want to discuss in this lecture is the epoch in neural networks.

114
00:11:13,290 --> 00:11:20,040
An epoch is one cycle through the full training data.

115
00:11:20,040 --> 00:11:29,180
So when we say epochs is equal to five, it means we want the full training data to be fed five times.

116
00:11:29,460 --> 00:11:34,850
Note that an epoch is different from an iteration.

117
00:11:34,860 --> 00:11:39,870
Now suppose you have 1000 training records.

118
00:11:39,960 --> 00:11:47,850
If you start reading the records one by one, as we do during stochastic gradient descent, you will have

119
00:11:47,850 --> 00:11:50,550
to iterate 1000 times.

120
00:11:53,370 --> 00:12:01,700
So iterations are the number of times you execute the update process within one full pass of the training set.

121
00:12:02,220 --> 00:12:11,610
On the other hand, if you have set epochs to two, then this means that the 1000 training examples will

122
00:12:11,610 --> 00:12:19,960
be fed two times, either one by one, or all at the same time, or in mini-batches.

123
00:12:20,220 --> 00:12:28,210
The idea of using epochs is to allow the network to refine its performance on the same data.

124
00:12:28,620 --> 00:12:34,380
We will see how we can specify and use epochs in our practical lectures.

125
00:12:34,410 --> 00:12:36,770
That's all for this lecture. In the upcoming video,

126
00:12:36,780 --> 00:12:43,680
we will summarize the key parameters that you must know while implementing neural networks in the software.
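For reference, here is a minimal, self-contained sketch (my own illustration, not the course's software) that contrasts the three gradient-descent variants on a simple linear model and shows the epoch/iteration distinction for 1000 training records.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))      # 1000 made-up training records
y = X @ np.array([0.8, 0.3]) + 0.1  # made-up linear target
w, b, lr = np.zeros(2), 0.0, 0.01   # weights, bias, learning rate

def grad_step(Xb, yb):
    # One weight/bias update from the average error on the batch (Xb, yb).
    global w, b
    err = Xb @ w + b - yb
    w -= lr * (Xb.T @ err) / len(yb)
    b -= lr * err.mean()

for epoch in range(2):  # epochs = 2: the full data is fed two times
    # Stochastic GD would be 1000 iterations per epoch, one record each:
    #   for i in range(len(X)): grad_step(X[i:i+1], y[i:i+1])
    # Batch GD would be a single iteration per epoch over all 1000 records:
    #   grad_step(X, y)
    # Mini-batch GD (used here): batch size 100 -> 10 iterations per epoch.
    for start in range(0, len(X), 100):
        grad_step(X[start:start+100], y[start:start+100])

print(w, b)  # should move toward the true values [0.8, 0.3] and 0.1
```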