Hey everyone, congratulations, first of all. If you understood what I taught you in the last few lectures, you understand neural networks as they are used for prediction purposes. In this lecture, I'm going to talk about several subtopics. These topics came up in the previous lectures, but to maintain the flow I did not discuss them in detail at that time. Also, this extra knowledge is often what is asked in interview questions. In fact, I'm going to cover this lecture in a question-and-answer format, so keep paying attention, as this is also very important.

The first topic is activation functions, and the first question we are going to discuss is: why do we use activation functions?

Let's see what happens if we do not have any activation function. What would the output of a neuron be? The output would be given by this equation: w1·x1 + w2·x2 + b = z, where z is the output. This means that the output could be any real number, with no boundaries. If you're solving a regression problem, this may be acceptable. But when we are doing classification, that is, when we want an output of a yes/no or one/zero type, we need to restrict the output values to get a zero/one type of output.
Also, if there are only linear neurons in the whole network, you can only predict a linear relationship between the input and output variables.

So the answer is: we use an activation function for two reasons. First, to put boundary conditions on our output. For classification, the boundaries are obvious. For regression also, if your output has some boundary, you can use an activation function. The second reason is to introduce non-linearity, so that we can find complex, non-linear patterns as well.

The next question is: what are the different types of activation functions? Earlier, we had discussed two activation functions. One was the step function, which is zero below a threshold value and then jumps straight to one at the threshold. Then we discussed the sigmoid function, which is a continuous S-shaped curve with zero as the lower boundary and one as the upper boundary. For most practical business purposes, the sigmoid function is good enough, but for rare scenarios where computation is an issue, we sometimes use two other activation functions because of their convergence efficiency. The first is the hyperbolic tangent function, or tanh. The graph of this function is almost similar to the sigmoid in shape, but it has different boundaries: an upper boundary of one and a lower boundary of minus one.
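To see the second reason concretely, here is a small sketch (the layer sizes and weight values are invented for illustration) showing that stacking two purely linear layers collapses into a single linear layer, which is why non-linearity must come from the activation function:

```python
import numpy as np

# Hypothetical weights for two linear (no-activation) layers.
W1 = np.array([[0.5, -1.0], [2.0, 0.3]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[1.5, 0.7]])
b2 = np.array([0.05])

x = np.array([1.0, 2.0])

# Two stacked linear layers...
out_stacked = W2 @ (W1 @ x + b1) + b2

# ...equal one single linear layer with combined weights.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
out_single = W_combined @ x + b_combined

print(np.allclose(out_stacked, out_single))  # True
```

However many linear layers you stack, the result is still one linear map, so the network can never capture a non-linear pattern without activations.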
And because it is centered at zero, it almost always has better convergence efficiency than sigmoid.

The second is ReLU, which is short for rectified linear unit. It is a very widely used function, especially in the inner layers of neural networks. This is how the function looks: below zero, the function outputs zero; above zero, the function outputs the same as its input, that is, f(x) = x. So the lower bound is zero, but there is no upper bound. This function performs well because it is very easy to compute. The reason for using this function in hidden layers is that it introduces non-linearity across the different layers. However, on the output layer it is rarely used, because for classification the right side of the function is not bounded, and therefore it cannot be used. On the other hand, for regression, the left side of this function is bounded, and therefore this function cannot be used either. So this function is good for activating hidden layers, but not for activating output layers.

You can find a summary of all these activation functions in the next slide. This is for your reference.

This brings us to the next question, which we have already answered: can hidden layers and output layers have different activation functions? The answer is yes.
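As a quick reference sketch, the four functions discussed above can be written in a few lines each (the threshold value for the step function is a parameter you choose; zero is just a common default):

```python
import numpy as np

def step(z, threshold=0.0):
    # 0 below the threshold, 1 at or above it
    return np.where(z >= threshold, 1.0, 0.0)

def sigmoid(z):
    # continuous S-shaped curve, bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # same shape as sigmoid, but bounded between -1 and 1 and centered at 0
    return np.tanh(z)

def relu(z):
    # 0 for negative inputs, identity for positive inputs; no upper bound
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(step(z))     # [0. 1. 1.]
print(sigmoid(z))  # values strictly between 0 and 1
print(tanh(z))     # values strictly between -1 and 1
print(relu(z))     # [0. 0. 2.]
```

Note how ReLU's cheapness shows up here: it is just a comparison and a maximum, with no exponential to evaluate.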
As I told you, we can implement ReLU in the hidden layers and a sigmoid in the output layer. Any such combination is allowed by our software tools.

The next question is: what is multiclass classification, and is there any special activation function for multiclass classification? So first of all, what is multiclass classification? Suppose you are classifying into yes or no, or one or zero. This is binary classification, because there are only two classes. But what if we have more than two classes? Say we want to classify images into shirts, trousers, ties and socks. Now the output cannot be just zero or one. We cannot simply assign zero for shirts, one for trousers, two for ties, three for socks; that would not give us the right answer. So we have to handle such a situation in a slightly different manner. For this situation, we have an activation function called softmax.

This activation function works similarly to sigmoid, but has an additional step. What we do in multiclass classification is we usually keep as many output neurons as we have classes. So if we have three classes, like shirts, trousers and socks, we keep three neurons in the output layer. You can see in this image that these three output neurons correspond to the three output classes.
These three output neurons have the sigmoid activation function only. So the output of the first neuron would lie between zero and one, and we can say that this output value corresponds to the probability of whether the item is a shirt or not. The second output would also be between zero and one, and we can say that it is the probability of whether it is a trouser or not. And the third output will also be between zero and one, and we can say that it is the probability of whether it is socks or not.

But the thing is, the item can be only one of these, which means the sum of the probabilities should come out to be one: either it is the shirt, or it is the trouser, or it is the socks. To implement this, we put an additional softmax layer, and we feed the results of these three neurons into this softmax layer. The softmax layer simply divides each of the values by the sum of all the values. Now each output can be considered as the probability of that class occurring, and the sum of all these probabilities will be equal to one.

So for multiclass classification, softmax activation is often used on the output layer. That's all about activation functions. The next topic is gradient descent.
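A minimal sketch of the idea: the standard softmax first exponentiates each neuron's raw score and then divides by the sum, which both keeps every output positive and forces the outputs to sum to one (the plain divide-by-the-sum normalization described above captures the same intuition; the exponentiation is the extra detail most libraries use). The class names and raw scores here are invented for illustration:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability, exponentiate, then normalize
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores from the three output neurons
scores = np.array([2.0, 1.0, 0.1])   # shirt, trouser, socks
probs = softmax(scores)

print(probs)              # one probability per class
print(np.sum(probs))      # ~1.0 -- the probabilities always sum to one
print(np.argmax(probs))   # 0 -- "shirt" is the most likely class here
```

The predicted class is simply the output neuron with the highest probability.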
The question I want to discuss here is: what is the difference between gradient descent and stochastic gradient descent? This is a very common question in the minds of students, because they find "stochastic" written in some places, while in other texts it is not written at all. Let me clarify this for you.

What we discussed in our previous lectures was actually stochastic gradient descent, because I told you that we take each individual training record and update our weights and biases with each training record. When we run the whole forward and backward propagation for each individual training record, that is stochastic gradient descent.

But if you run forward propagation for the entire training set in one go, find the average error on the entire set, and then update the weights and biases during backward propagation, that is gradient descent.

There is another variation in which we make small batches out of the training set and use these batches instead of the complete set. This one is called mini-batch gradient descent.

The point is, stochastic gradient descent starts updating weights and biases rapidly, but it finds difficulty in converging, whereas gradient descent is slow because in each pass it has to go through the entire training set.
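The loop structure is really the only difference between the three variants. Here is a minimal sketch using an invented one-weight linear model fitted to toy data by minimizing squared error; a real network does the same thing, with the gradient computed by full forward and backward propagation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)   # toy data: true weight is 3

def grad(w, xb, yb):
    # gradient of mean squared error for a single-weight linear model
    return np.mean(2 * (w * xb - yb) * xb)

lr = 0.1

# Gradient descent: one update per pass over the whole training set
w_gd = 0.0
for _ in range(50):
    w_gd -= lr * grad(w_gd, X, y)

# Stochastic gradient descent: one update per individual record
w_sgd = 0.0
for _ in range(5):                       # 5 epochs
    for xi, yi in zip(X, y):
        w_sgd -= lr * grad(w_sgd, np.array([xi]), np.array([yi]))

# Mini-batch gradient descent: one update per small batch
w_mb = 0.0
batch = 10
for _ in range(5):                       # 5 epochs
    for i in range(0, len(X), batch):
        w_mb -= lr * grad(w_mb, X[i:i+batch], y[i:i+batch])

print(w_gd, w_sgd, w_mb)   # all close to the true weight 3.0
```

Notice that SGD makes 100 noisy updates per epoch while full gradient descent makes exactly one smooth update per epoch; mini-batch sits in between.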
But the good thing about gradient descent is that it converges very well. So we have to select accordingly which optimization technique is to be used. We will see this further in the practical part of this course.

The last thing I want to discuss in this lecture is the epoch. In neural networks, an epoch is one cycle through the full training data. So when we say epochs is equal to five, it means we want the full training data to be fed five times. Now, an epoch is different from an iteration. Suppose you have 1000 training records. If you start feeding the records one by one, like we do in stochastic gradient descent, you will have to iterate 1000 times. So iterations are the number of times you execute the process within one pass over the full training set. On the other hand, if you have set epochs to two, this means that the 1000 training examples will be fed two times, either one by one, or all at the same time, or in mini-batches. The idea of using epochs is to allow the network to readjust its weights on the same data. We will see how we can specify and use epochs in the practical lectures.

That's all in this lecture. In the upcoming video, we will summarize the key parameters that you must know while implementing neural networks in the software.
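The arithmetic relating epochs and iterations can be sketched in a few lines (the 1000 records match the example above; the batch size of 50 is an assumed value for the mini-batch case):

```python
def iterations_per_epoch(n_records, batch_size):
    # number of weight updates in one full pass over the training data
    return -(-n_records // batch_size)   # ceiling division

n_records = 1000

# Stochastic gradient descent: batch size 1 -> one update per record
print(iterations_per_epoch(n_records, 1))          # 1000

# Full-batch gradient descent: one update per epoch
print(iterations_per_epoch(n_records, n_records))  # 1

# Mini-batch with a hypothetical batch size of 50
print(iterations_per_epoch(n_records, 50))         # 20

# With epochs = 2, the total number of iterations doubles
epochs = 2
print(epochs * iterations_per_epoch(n_records, 50))  # 40
```

So "epochs" counts passes over the data, while "iterations" counts weight updates; the batch size is what connects the two.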