In this lecture we are going to understand the concept behind how neural networks actually learn. Till this lecture we have covered what a neural network is; now we are starting with how a neural network learns.

Here is a quick recap. A neural network is a network of cells. In our network we are going to use sigmoid neurons, because these learn in a more controllable manner. In a sigmoid neuron, two things happen. First, we multiply the input features with the weights, which are represented by w, and then add a bias b. This value we name z. The second step is the application of the sigmoid, or logistic, function; that is, the cell calculates 1 / (1 + e^(-z)). This value is the output of the cell and is always between 0 and 1. This output becomes the new input for the next layer, and this continues till the last layer, till we get the final output of our network.

Now, the problem we are solving is this: we want to find the weights and biases of all the cells in the network so that the final output of the network is as close as possible to the actual value of the variable to be predicted.

For better understanding, let us calculate the number of variables we need to estimate for this small network. Here we have two neurons in the hidden layer and one neuron in the output layer, and there are two input features, x1 and x2. So for the first neuron we calculate w1·x1 + w2·x2 + b1 = z1. This z1 is put into the activation function, that is the sigmoid function, and that gives the output of this neuron. Let us say the output of this neuron is represented by a1. For the second neuron we have two new weights, w3 and w4. We calculate w3·x1 + w4·x2 + b2 = z2, where b2 is the bias of this neuron. We apply the activation function on z2 to get a2. These a1 and a2 are the inputs to the final output neuron. For these two inputs we need two new weights, w5 and w6, so the equation at the output neuron is w5·a1 + w6·a2 + b3 = z3. When we apply the activation function on z3, we get the predicted output from this output neuron.

So if we look at the variables that we need to estimate: for the weights we have w1, w2, w3, w4, w5 and w6, so we are estimating 6 weights, and for the biases we have b1, b2 and b3, that is 3 biases.
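To make this forward calculation concrete, here is a minimal Python sketch of the same small network. The numeric values of x1, x2, the weights w1 to w6 and the biases b1 to b3 are assumptions chosen only for illustration; in practice they would be learned, as the rest of the lecture explains.

import math

def sigmoid(z):
    # Logistic function: output is always between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

# Two input features (assumed example values)
x1, x2 = 0.5, -1.0

# Six weights and three biases (assumed values, normally learned)
w1, w2, w3, w4, w5, w6 = 0.1, -0.2, 0.4, 0.3, -0.5, 0.6
b1, b2, b3 = 0.0, 0.1, -0.1

# Hidden neuron 1: weighted sum z1, then sigmoid gives a1
z1 = w1 * x1 + w2 * x2 + b1
a1 = sigmoid(z1)

# Hidden neuron 2: weighted sum z2, then sigmoid gives a2
z2 = w3 * x1 + w4 * x2 + b2
a2 = sigmoid(z2)

# Output neuron takes a1 and a2 as its inputs
z3 = w5 * a1 + w6 * a2 + b3
y_pred = sigmoid(z3)

print(y_pred)  # predicted output, always between 0 and 1

Counting the names in this sketch confirms the tally in the lecture: six weights and three biases, nine variables in total.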
So for this small network we need to establish the values of nine variables to make this neural network ready for predictions.

Now, how do we find out the values of these weights and biases? The technique followed for this is called gradient descent. Gradient descent is just another optimization technique to find the minimum of a function. There are other optimization techniques as well, such as ordinary least squares, which is used in linear regression, but for a large number of features and complex relationships, gradient descent shows much better computational performance than the other techniques. This means that if you have a large number of input variables and a very complex relationship between input and output, gradient descent will train the model much faster compared to other optimization techniques.

So let's first discuss the process followed in gradient descent in a stepwise manner. We start by assigning random weight and bias values to all the cells in our network. Since all the weight and bias values are now available, that is, we have randomly assigned all the weights and biases, our model is ready to give an output.

The second step is to input one training example. We use the x values of the training example and calculate the final output of the network using these weight and bias values.

The third step is that we compare the predicted values with the actual values and note the difference between the two using some error function. We will come back to this error function later. Remember that we have the actual y value because this is a training observation, so these actual values are being used to give feedback to our network about how badly it is performing.

The fourth step is that we try to find all those weights and biases, changing which we can reduce this error.

Lastly, we update the values of the weights and biases and repeat the process from step two. This loop goes on till no further reduction in the error function can be achieved.

Among these steps, the first step is called initialization. Here we just give some random initial values to the weights and biases. The second step is called forward propagation. This is because in this step we start with the input values, process them in layer 1, then take the output of layer 1 and process it in layer 2, and so on, till we get one final predicted output. We are simply moving forward through the layers of the network, so this is forward propagation.
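As a rough sketch of this five-step loop, here is a minimal Python example for a single sigmoid neuron with one weight and one bias. The tiny data set, the squared-error function and the numerical estimate of the slope are my own simplifications for illustration; the lecture has not yet specified the error function, and the proper way to find the update direction, backpropagation, is discussed next.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Step 2: forward propagation for one training example
    return sigmoid(w * x + b)

def error(w, b, x, y):
    # Step 3: compare the prediction with the actual y (squared error, assumed)
    return (predict(w, b, x) - y) ** 2

# Assumed toy training data: one input feature x and one target y
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

# Step 1: initialization with random weight and bias values
w, b = random.uniform(-1, 1), random.uniform(-1, 1)

eps, lr = 1e-4, 0.5   # nudge size for the slope estimate, and the update step size
for _ in range(1000):                  # Step 5: repeat from step 2
    for x, y in data:
        # Step 4: estimate how the error changes when w or b is changed slightly
        dw = (error(w + eps, b, x, y) - error(w - eps, b, x, y)) / (2 * eps)
        db = (error(w, b + eps, x, y) - error(w, b - eps, x, y)) / (2 * eps)
        # Move w and b slightly in the direction that reduces the error
        w -= lr * dw
        b -= lr * db

print(round(w, 2), round(b, 2))

The loop here stops after a fixed number of passes; in the lecture's description it would stop once no further reduction in the error function can be achieved.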
The third and fourth steps are called backward propagation. In these steps we already have the final error value, and we look backwards in our network to find out which weights and biases have the maximum impact on this error function. Once we establish which weights and biases have the maximum impact, we update these weights and biases slightly to reduce the error.

So this is the process we follow to implement gradient descent in neural networks, but we have still not discussed the concept behind gradient descent.

Gradient descent is a mathematical technique which is used to find the minimum of a function. Let's see this example. In the graph on the left, on the x axis I have a variable x, and on the y axis I have a function applied on this variable, f(x). This is the plot of this function. Now, if you want to find out the value of x at which the function has its minimum value, there are two ways to do it. One is, if you know the exact relationship between x and the function, you can use calculus to find the minimum of this function. But as you know, in our machine learning problems we do not have this exact relationship. So we use a second technique, which is an iterative technique. In this technique we start at a random point on this plot; say we have this value of x and f(x). Now, instead of focusing on the whole graph, we focus only on this small part of the graph and try to find out what happens if we slightly increase or decrease the value of x. In other words, we are trying to find out which way the slope goes. If the slope is negative, we increase the value of x a little bit, and then we will see that f(x) also decreases. Similarly, if the slope is positive, we decrease the value of x, which will slightly decrease the value of f(x).

We continue taking these small steps till we reach the final minimum point. When we are at this point, moving to either side only increases the value of the function, so we stop the process here. This iterative technique of finding the instantaneous slope, also known as the gradient, and slightly moving down that slope, that is the descent, is called gradient descent.

If you want to picture this, you can think of yourself standing on top of a hill. You cannot see anything around you because it is dark and foggy. Now you want to come down the hill as fast as possible. What do you do? Ideally, if you could see, you would spot the closest downhill point and run to it.
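To see this iterative idea in code, here is a minimal Python sketch of gradient descent on a single variable. The function f(x) = (x - 3)^2, the starting point and the step size are assumptions for illustration only; the function's minimum is at x = 3, so the loop should walk towards that value.

def f(x):
    # Assumed example function, with its minimum at x = 3
    return (x - 3.0) ** 2

def slope(x, eps=1e-5):
    # Instantaneous slope (the gradient), estimated from a tiny change in x
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 10.0       # start at an arbitrary point on the plot
step = 0.1     # size of each small step down the slope
for _ in range(200):
    g = slope(x)
    # Positive slope: move x down a little; negative slope: move x up a little
    x -= step * g

print(x, f(x))  # x ends up very close to 3, where f(x) is minimal

Each pass looks only at the local slope, not at the whole graph, which is exactly the idea described above.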
But since you cannot see, and you know the gradient descent technique, you take a step in each direction, see which direction goes more steeply downhill, and then you move from your current position to that new position. Then you again take a small step in each direction, check which direction has the steepest slope, and move there. You keep doing this, and eventually you will come down the hill. This is the concept behind gradient descent.

In the next lecture we will merge the two ideas: the first is the process that neural networks use to implement gradient descent, and the second is what gradient descent is mathematically. We will merge these two and understand how gradient descent helps us achieve the minimum in neural networks.