1
00:00:00,720 --> 00:00:01,760
Hey everyone.

2
00:00:02,120 --> 00:00:03,370
Congratulations, first of all.

3
00:00:04,280 --> 00:00:09,980
If you understood whatever I taught you in the last few lectures, you understand neural networks in the

4
00:00:09,980 --> 00:00:14,210
way they are used for prediction purposes.

5
00:00:14,210 --> 00:00:21,430
In this lecture I'm going to talk about several subtopics. These subtopics came up in our lectures

6
00:00:21,440 --> 00:00:25,030
previously also, but to maintain the flow,

7
00:00:25,130 --> 00:00:29,760
I did not discuss them in detail at that time.

8
00:00:29,850 --> 00:00:36,760
Also, this extra knowledge is often what is asked in interview questions.

9
00:00:36,820 --> 00:00:43,700
In fact, I'm going to cover this lecture in a question-and-answer format, so keep your attention, as this

10
00:00:43,700 --> 00:00:44,830
is also very important.

11
00:00:47,100 --> 00:00:53,500
The first subtopic is that of activation functions, and the first question that we are going to discuss is:

12
00:00:53,660 --> 00:00:58,640
why do we use activation functions?

13
00:00:58,650 --> 00:01:02,640
Let's see. If we do not have any activation function,

14
00:01:02,640 --> 00:01:04,810
what is the output of a neuron?

15
00:01:04,860 --> 00:01:16,090
The output would be given by this equation: W1·X1 + W2·X2 + B = Z, which is the output.

16
00:01:16,100 --> 00:01:22,990
This means that the output could be any real number, with no boundaries. If you are solving a regression

17
00:01:22,990 --> 00:01:26,380
problem, then this may be acceptable.

18
00:01:26,380 --> 00:01:34,670
But when we are doing classification, that is, when we want an output of yes/no type or 1/0 type,

19
00:01:34,780 --> 00:01:41,780
we need to treat the output Z to get a 0/1 type output.

20
00:01:41,950 --> 00:01:51,130
Also, if there are only linear neurons in the whole network, you can only predict a linear relationship

21
00:01:51,280 --> 00:01:53,440
between input and output variables.

22
00:01:54,930 --> 00:02:00,360
So the answer is: we use an activation function for two reasons.

23
00:02:00,360 --> 00:02:04,860
First, to put boundary conditions on our output. For classification,

24
00:02:04,860 --> 00:02:08,420
the boundaries are obvious. For regression also,

25
00:02:08,500 --> 00:02:14,470
if you have some boundary, you can use an activation function.

26
00:02:14,470 --> 00:02:21,490
The second reason is to introduce nonlinearity, so that we can find complex nonlinear patterns

27
00:02:21,490 --> 00:02:21,880
as well.

28
00:02:25,270 --> 00:02:31,390
The next question is: what are the different types of activation functions? Earlier,

29
00:02:31,400 --> 00:02:34,080
we had discussed two activation functions.

30
00:02:34,100 --> 00:02:39,410
One was the step function, which is zero below a threshold value,

31
00:02:39,800 --> 00:02:45,170
and then it suddenly jumps to one at the threshold value.
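To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the course; all values are made up) of a neuron with no activation versus one with a step activation:

```python
import numpy as np

def neuron_output(x, w, b):
    # Purely linear combination W1*X1 + W2*X2 + B: can be any real number.
    return np.dot(w, x) + b

def step(z, threshold=0.0):
    # Step activation: 0 below the threshold, 1 at or above it.
    return 1 if z >= threshold else 0

x = np.array([0.5, -1.2])   # example inputs X1, X2
w = np.array([0.8, 0.3])    # example weights W1, W2
b = 0.1                     # example bias B
z = neuron_output(x, w, b)
print(z)        # unbounded linear output (0.14 here)
print(step(z))  # bounded 0/1 output suitable for classification
```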
32
00:02:45,170 --> 00:02:52,280
Then we discussed the sigmoid function, which is a continuous S-shaped curve with zero as the lower boundary

33
00:02:53,030 --> 00:03:01,220
and one as the upper boundary. For most practical business purposes the sigmoid function is good enough, but

34
00:03:01,220 --> 00:03:07,280
for rare scenarios where computation is an issue, we sometimes use two other activation functions

35
00:03:07,280 --> 00:03:15,740
as well, because of their convergence efficiency. The first is the hyperbolic tangent function, or tanh.

36
00:03:17,170 --> 00:03:23,640
The graph of this function is almost similar to the sigmoid in shape, but it has different boundaries.

37
00:03:24,040 --> 00:03:31,780
It has an upper boundary of one and a lower boundary of minus one, and because it is centered at zero, it

38
00:03:31,900 --> 00:03:38,060
almost always has better convergence efficiency than the sigmoid.

39
00:03:38,200 --> 00:03:45,310
The second is the ReLU, which is short for rectified linear unit.

40
00:03:45,310 --> 00:03:53,030
It is a very widely used function, especially in the inner layers of deep neural networks.

41
00:03:53,080 --> 00:04:03,550
This is how this function looks: up to 0 the function outputs 0, but after 0 the function outputs

42
00:04:03,880 --> 00:04:10,080
the same as the input, that is, f(x) = x.

43
00:04:10,780 --> 00:04:15,740
So the lower bound is 0, but there is no upper bound.

44
00:04:16,750 --> 00:04:21,880
This function performs well because it is very easy to compute.

45
00:04:22,210 --> 00:04:28,570
The reason for using this function in hidden layers is that it introduces nonlinearity

46
00:04:29,450 --> 00:04:31,920
across the different layers.

47
00:04:31,990 --> 00:04:39,820
However, on the output layer it is rarely used, because for classification the right side of the function

48
00:04:39,910 --> 00:04:46,860
is not bounded, and therefore it cannot be used; on the other hand, for regression,

49
00:04:46,930 --> 00:04:52,700
the left side of this function is bounded at zero, and therefore this function cannot be used either.

50
00:04:54,340 --> 00:05:02,900
So this function is good for activating hidden layers, but not for activating output layers.

51
00:05:03,140 --> 00:05:07,900
You can find a summary of all these activation functions on the next slide.

52
00:05:07,940 --> 00:05:13,390
This is for your reference.

53
00:05:13,470 --> 00:05:20,250
This brings us to the next question, which we have already answered: can hidden layers and output layers

54
00:05:20,460 --> 00:05:23,370
have different activation functions?

55
00:05:23,370 --> 00:05:24,710
The answer is yes.

56
00:05:24,960 --> 00:05:30,690
As I told you, we can implement ReLU in the hidden layers and the sigmoid in the output layer.

57
00:05:32,310 --> 00:05:35,220
Any such combination is allowed by our software too.
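To show what such a combination looks like in code, here is a minimal sketch assuming a TensorFlow/Keras-style API (which may or may not be the software used in this course; the layer sizes are arbitrary examples): ReLU in the hidden layers and sigmoid on the output layer of a binary classifier.

```python
# Hedged sketch: assumes TensorFlow/Keras; layer sizes are made up.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(2,)),               # two input features
    Dense(16, activation="relu"),    # hidden layer: ReLU for nonlinearity
    Dense(8, activation="relu"),     # second hidden layer, also ReLU
    Dense(1, activation="sigmoid"),  # output layer: sigmoid gives a 0-1 output
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
```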
58
00:05:38,240 --> 00:05:46,120
The next question is: what is multi-class classification, and is there any special activation function for

59
00:05:46,160 --> 00:05:49,850
multi-class classification?

60
00:05:49,850 --> 00:05:53,030
So first of all, what is multi-class classification?

61
00:05:53,030 --> 00:05:58,510
Suppose you are classifying into yes or no, or 1 or 0.

62
00:05:58,580 --> 00:06:02,960
This is binary classification, because there are only two classes.

63
00:06:03,370 --> 00:06:05,810
But what if we have more than two classes?

64
00:06:06,260 --> 00:06:12,560
Say we want to classify images into shirts, trousers, ties, and socks.

65
00:06:12,710 --> 00:06:16,140
Now the output cannot be 0 or 1.

66
00:06:16,490 --> 00:06:25,010
We cannot simply assign 0 for shirts, 1 for trousers, 2 for ties, and 3 for socks; that would not

67
00:06:25,010 --> 00:06:26,110
give us the right answer.

68
00:06:27,580 --> 00:06:34,110
So we have to handle the situation in a slightly different manner. For such a situation,

69
00:06:34,130 --> 00:06:36,950
we have an activation function called softmax.

70
00:06:40,650 --> 00:06:44,030
This activation function works similarly to the sigmoid,

71
00:06:44,460 --> 00:06:48,150
but it has an additional step.

72
00:06:48,150 --> 00:06:57,190
What we do in multi-class classification is we usually keep as many output neurons as we have classes.

73
00:06:57,210 --> 00:07:07,550
So if we have three classes, like shirts, trousers, and socks, we keep three neurons at the output layer.

74
00:07:07,650 --> 00:07:17,630
You can see in this image that these three output neurons correspond to the three output classes. These three

75
00:07:17,870 --> 00:07:22,110
output neurons have the sigmoid activation function only.

76
00:07:22,970 --> 00:07:31,940
So the output of the first neuron would lie between 0 and 1, and we can say that the output value

77
00:07:31,940 --> 00:07:37,300
corresponds to the probability of whether it is a shirt or not.

78
00:07:37,400 --> 00:07:42,590
The second output would also be between 0 and 1, and we can say that it is the probability of whether

79
00:07:42,590 --> 00:07:44,640
it is a trouser or not.

80
00:07:44,750 --> 00:07:50,690
And the third output will also be between 0 and 1, and we can say that it is the probability of whether

81
00:07:50,690 --> 00:07:53,090
it is socks or not.

82
00:07:54,830 --> 00:08:02,990
But the thing is, the item can be only one of these; that is, the sum of the probabilities should come out to be

83
00:08:02,990 --> 00:08:03,850
one.

84
00:08:04,430 --> 00:08:09,770
Either it is a shirt, or it is a trouser, or it is socks.

85
00:08:09,950 --> 00:08:19,010
To implement this, we put in an additional softmax layer, and we feed the results of these three neurons

86
00:08:19,190 --> 00:08:22,040
into this softmax layer.

87
00:08:22,340 --> 00:08:30,670
This softmax layer simply divides each of the values by the sum of all the values.

88
00:08:31,610 --> 00:08:37,790
Now this output can be considered the probability of that class occurring, and the sum of all these

89
00:08:37,790 --> 00:08:40,070
probabilities will also be equal to one.

90
00:08:43,780 --> 00:08:52,460
So for multi-class classification, a softmax activation is often used on the output layer.
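For reference, here is a minimal NumPy sketch of the softmax step described above (my own illustration; the scores are made up). One caveat: the lecture describes dividing each value by the sum of all values, while the standard softmax used in most libraries first exponentiates the scores and then normalizes, which is what this sketch does.

```python
import numpy as np

def softmax(scores):
    # Exponentiate, then divide each value by the sum of all values,
    # so the outputs are positive and sum to exactly 1.
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.5])  # made-up raw outputs for shirt, trouser, socks
probs = softmax(scores)
print(probs)        # roughly [0.63, 0.23, 0.14]: per-class probabilities
print(probs.sum())  # 1.0
```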
91
00:08:53,980 --> 00:08:57,400
That's all about the activation functions.

92
00:08:57,400 --> 00:09:01,620
The next topic is gradient descent.

93
00:09:01,750 --> 00:09:07,750
The question I want to discuss here is: what is the difference between gradient descent and stochastic

94
00:09:07,750 --> 00:09:10,460
gradient descent?

95
00:09:10,480 --> 00:09:18,250
This is a very common question in the minds of students, because they find "stochastic" written in some

96
00:09:18,250 --> 00:09:22,870
places, while it is not written in other texts.

97
00:09:22,870 --> 00:09:24,840
Let me clarify this for you.

98
00:09:25,240 --> 00:09:32,620
What we discussed in our previous lectures was actually stochastic gradient descent, because I told you

99
00:09:32,770 --> 00:09:41,080
that we take each individual training record and update our weights and biases with each training record.

100
00:09:43,080 --> 00:09:51,040
When we run the whole forward and backward propagation for each individual training record, that is stochastic

101
00:09:51,040 --> 00:09:51,880
gradient descent.

102
00:09:54,460 --> 00:10:02,500
But if you run forward propagation for the entire training set in one go, find the average error

103
00:10:02,500 --> 00:10:11,300
on the entire set, and then update the weights and biases during backward propagation, that is gradient descent.

104
00:10:11,440 --> 00:10:17,740
There is another variation, in which we make small batches out of the training set and use these batches

105
00:10:17,860 --> 00:10:19,790
instead of the complete set.

106
00:10:20,410 --> 00:10:25,830
This one is called mini-batch gradient descent.

107
00:10:25,880 --> 00:10:37,070
The point is, stochastic gradient descent starts updating the weights and biases rapidly, but it has difficulty

108
00:10:37,070 --> 00:10:46,760
converging, whereas gradient descent is slow, because in each pass it has to go through the entire

109
00:10:46,760 --> 00:10:47,380
training set.

110
00:10:49,100 --> 00:10:56,110
But the good thing about gradient descent is that it converges very well. So we have to select accordingly

111
00:10:56,680 --> 00:10:59,650
which optimization technique is to be used.

112
00:11:00,820 --> 00:11:03,820
We will see this further in the practical part of this course.

113
00:11:06,370 --> 00:11:13,250
The last thing I want to discuss in this lecture is the epoch in neural networks.

114
00:11:13,290 --> 00:11:20,040
An epoch is one cycle through the full training data.

115
00:11:20,040 --> 00:11:29,180
So when we say epochs is equal to five, it means we want the full training data to be fed five times.

116
00:11:29,460 --> 00:11:34,850
Note that an epoch is different from an iteration.

117
00:11:34,860 --> 00:11:39,870
Now suppose you have 1000 training records.

118
00:11:39,960 --> 00:11:47,850
If you start reading the records one by one, as we do during stochastic gradient descent, you will have

119
00:11:47,850 --> 00:11:50,550
to iterate 1000 times.

120
00:11:53,370 --> 00:12:01,700
So iterations are the number of times you execute the update process within one full pass of the training set.

121
00:12:02,220 --> 00:12:11,610
On the other hand, if you have set epochs to two, then this means that the 1000 training examples will

122
00:12:11,610 --> 00:12:19,960
be fed two times, either one by one, or all at the same time, or in mini-batches.

123
00:12:20,220 --> 00:12:28,210
The idea of using epochs is to allow the network to refine its performance on the same data.

124
00:12:28,620 --> 00:12:34,380
We will see how we can specify and use epochs in our practical lectures.

125
00:12:34,410 --> 00:12:36,770
That's all for this lecture. In the upcoming video,

126
00:12:36,780 --> 00:12:43,680
we will summarize the key parameters that you must know while implementing neural networks in the software.
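For reference, here is a minimal, self-contained sketch (my own illustration, not the course's software) that contrasts the three gradient-descent variants on a simple linear model and shows the epoch/iteration distinction for 1000 training records.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))      # 1000 made-up training records
y = X @ np.array([0.8, 0.3]) + 0.1  # made-up linear target
w, b, lr = np.zeros(2), 0.0, 0.01   # weights, bias, learning rate

def grad_step(Xb, yb):
    # One weight/bias update from the average error on the batch (Xb, yb).
    global w, b
    err = Xb @ w + b - yb
    w -= lr * (Xb.T @ err) / len(yb)
    b -= lr * err.mean()

for epoch in range(2):  # epochs = 2: the full data is fed two times
    # Stochastic GD would be 1000 iterations per epoch, one record each:
    #   for i in range(len(X)): grad_step(X[i:i+1], y[i:i+1])
    # Batch GD would be a single iteration per epoch over all 1000 records:
    #   grad_step(X, y)
    # Mini-batch GD (used here): batch size 100 -> 10 iterations per epoch.
    for start in range(0, len(X), 100):
        grad_step(X[start:start+100], y[start:start+100])

print(w, b)  # should move toward the true values [0.8, 0.3] and 0.1
```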