1 00:00:00,600 --> 00:00:03,466 Hello and welcome back to the course on Deep Learning. 2 00:00:03,466 --> 00:00:06,466 In today's tutorial we're talking about gradient descent. 3 00:00:06,733 --> 00:00:09,333 What we learned previously was that 4 00:00:09,333 --> 00:00:14,233 in order for a neural network to learn, what needs to happen is backpropagation. 5 00:00:14,233 --> 00:00:18,166 And that is when the error, the difference or 6 00:00:18,366 --> 00:00:23,500 the sum of squared differences between y hat and y is back 7 00:00:23,500 --> 00:00:28,033 propagated through the neural network, and the weights are adjusted accordingly. 8 00:00:28,366 --> 00:00:32,600 So we saw that and today we're going to learn exactly how these. 9 00:00:32,600 --> 00:00:35,600 Weights are adjusted. So let's have a look. 10 00:00:35,966 --> 00:00:39,300 this is our, very simple version. 11 00:00:39,300 --> 00:00:42,566 Of a neural network, a, perceptron or a single. 12 00:00:42,566 --> 00:00:44,700 Layer feedforward neural network. 13 00:00:44,700 --> 00:00:47,033 And what we can see here is. 14 00:00:47,033 --> 00:00:51,600 this whole process in action where, we've got, some input value, 15 00:00:51,933 --> 00:00:56,900 then we've got a weight, then a, activation function is applied. 16 00:00:56,933 --> 00:00:59,766 We have, we get y hat and then we compare it to the actual value. 17 00:00:59,766 --> 00:01:01,700 We calculate the cost function. 18 00:01:01,700 --> 00:01:03,666 So how can we. 19 00:01:03,666 --> 00:01:05,233 Minimize the cost function. 20 00:01:05,233 --> 00:01:07,266 What's, can we do about it. 21 00:01:07,266 --> 00:01:12,033 Well one approach to do it is a brute force approach where we just take. 22 00:01:12,666 --> 00:01:15,600 all. Lots of different possible weights. 23 00:01:15,600 --> 00:01:18,133 And look at them and see which one works best. 24 00:01:18,133 --> 00:01:19,366 And what we. Would do is, for instance. 25 00:01:19,366 --> 00:01:22,966 Would, try out for a, let's say, for example, a thousand weights 26 00:01:23,233 --> 00:01:27,633 and we try them out, we'd get something like this, for the cost function. 27 00:01:27,633 --> 00:01:28,966 And this is a chart, 28 00:01:28,966 --> 00:01:32,700 of, on the y axis, we have cost function on the vertical axis. 29 00:01:32,700 --> 00:01:34,766 On the horizontal axis you have y hat. 30 00:01:34,766 --> 00:01:36,000 And because you can. 31 00:01:36,000 --> 00:01:40,200 See the formula is y hat minus y a square, this is what the cost function 32 00:01:40,200 --> 00:01:41,400 would look something like that. 33 00:01:42,533 --> 00:01:44,800 And basically. 34 00:01:44,800 --> 00:01:47,800 You'd find the best one is over here. 35 00:01:47,800 --> 00:01:50,800 so very simple, very intuitive approach. 36 00:01:50,900 --> 00:01:53,100 why. Not do. This brute force method? 37 00:01:53,100 --> 00:01:56,400 Why not just try out a thousand different, 38 00:01:56,400 --> 00:02:00,000 cost for a thousand different, parameters. 39 00:02:00,000 --> 00:02:01,200 Or inputs for. 40 00:02:01,200 --> 00:02:02,933 Weights and see which one works best. 41 00:02:02,933 --> 00:02:05,333 You'll find the best one that way. Well, if you have. 42 00:02:05,333 --> 00:02:07,600 Just one way to optimize, this might work. 43 00:02:07,600 --> 00:02:09,300 But as you increase the. 44 00:02:09,300 --> 00:02:11,600 Number of weights, increase the. Number of. 45 00:02:11,600 --> 00:02:16,133 Synapses in your network, you have to face the curse of dimensionality. 46 00:02:16,500 --> 00:02:19,333 And so what is. The curse of dimensionality? 47 00:02:19,333 --> 00:02:24,300 the best way to describe this or explain it is to just look at a practical example. 48 00:02:24,466 --> 00:02:27,933 So remember this example we had when we were talking about how. 49 00:02:27,933 --> 00:02:28,733 Neural networks. 50 00:02:28,733 --> 00:02:32,066 Actually work, where we were, building or. 51 00:02:32,333 --> 00:02:33,433 Running a neural network. 52 00:02:33,433 --> 00:02:36,666 For, a for profit evaluation. 53 00:02:36,966 --> 00:02:40,333 So this is what it looked like when it was trained up already. 54 00:02:40,600 --> 00:02:44,100 Well, when it's not trained before, it's trained before we know which 55 00:02:44,233 --> 00:02:45,366 one of the weights. 56 00:02:45,366 --> 00:02:47,833 The actual neural network looks like this, right. 57 00:02:47,833 --> 00:02:50,300 Because we have. All these different. 58 00:02:51,333 --> 00:02:54,800 Possible, synopses, and we still have to train up the weights. 59 00:02:55,133 --> 00:02:57,133 And here we have a total of 25 weights. 60 00:02:57,133 --> 00:03:00,133 So four times five at the start, plus five more. 61 00:03:00,166 --> 00:03:00,766 from the hidden layer. 62 00:03:00,766 --> 00:03:03,533 To the output layer, 20. Five weights total. 63 00:03:03,533 --> 00:03:08,033 And let's see, how we could possibly brute force 64 00:03:08,033 --> 00:03:12,533 25 weights says it's a very simple, neural network right here. 65 00:03:12,533 --> 00:03:14,533 Very simple. Just one hidden layer. 66 00:03:14,533 --> 00:03:19,166 And how could we, brute force our way through a. 67 00:03:19,533 --> 00:03:21,200 Neural network of this size? 68 00:03:21,200 --> 00:03:22,333 Well, let's do some. 69 00:03:22,333 --> 00:03:24,233 Simple mathematical calculations. 70 00:03:24,233 --> 00:03:25,800 We have 25 weights. 71 00:03:25,800 --> 00:03:28,300 So that means if we have a thousand combinations 72 00:03:28,300 --> 00:03:31,200 that we're going to test out for every weight, the total number of combinations. 73 00:03:31,200 --> 00:03:34,433 Is a thousand to the power of 25, or 1000 or 10 to the power 74 00:03:34,433 --> 00:03:37,433 of 25 different combinations. 75 00:03:37,633 --> 00:03:40,533 now let's see how Sunway head to 76 00:03:40,533 --> 00:03:43,633 tail lights the world's fastest supercomputer. 77 00:03:43,633 --> 00:03:46,633 As of, June 2016. 78 00:03:46,866 --> 00:03:47,933 what? 79 00:03:47,933 --> 00:03:50,533 How would it approach this problem? Right. So. 80 00:03:50,533 --> 00:03:53,133 Sunway Tahu light, it looks like 81 00:03:53,133 --> 00:03:56,666 this is a whole huge, building. 82 00:03:56,666 --> 00:03:58,900 Pretty much for this one supercomputer. 83 00:03:58,900 --> 00:04:01,633 And, it got the Guinness World. 84 00:04:01,633 --> 00:04:04,633 Record for being the fastest supercomputer. 85 00:04:05,066 --> 00:04:08,066 Right now, it is the fastest supercomputer in the world. 86 00:04:08,266 --> 00:04:09,733 And Sunway. 87 00:04:09,733 --> 00:04:13,500 Tiger Light can operate at a speed of 93. 88 00:04:13,500 --> 00:04:14,800 Petaflops. 89 00:04:14,800 --> 00:04:19,666 flop stands for, floating operation per second. 90 00:04:19,833 --> 00:04:22,833 So it can do 93 to the power. 91 00:04:22,833 --> 00:04:27,866 Well, times ten to the power of 15 floating operations per second. 92 00:04:27,866 --> 00:04:29,766 That's how quick it is. 93 00:04:29,766 --> 00:04:33,766 in comparison, average computers right now, 94 00:04:34,033 --> 00:04:37,833 they do like just over several gigaflops and so on. 95 00:04:38,066 --> 00:04:40,866 So like kind of those ranges. 96 00:04:40,866 --> 00:04:43,933 Weigh less. Than Thai Sunway tail light. 97 00:04:44,233 --> 00:04:47,633 So Sunway tail light is in the forefront of technology. 98 00:04:48,233 --> 00:04:53,500 And let's say hypothetically that it can do one. 99 00:04:54,966 --> 00:04:56,700 one test, one combination 100 00:04:56,700 --> 00:05:01,200 for our neural network in one, flop basic and one floating operation. 101 00:05:01,533 --> 00:05:03,133 That is not possible. 102 00:05:03,133 --> 00:05:05,266 That is not practical because you need multiple. 103 00:05:05,266 --> 00:05:07,566 Floating operations to test out. 104 00:05:07,566 --> 00:05:09,400 A single weight in your neural network. 105 00:05:09,400 --> 00:05:11,166 But even let's let's give it a head start. 106 00:05:11,166 --> 00:05:15,333 Let's say that it can do in a ideal world, it can do. 107 00:05:15,333 --> 00:05:17,333 That in one, floating operation. 108 00:05:17,333 --> 00:05:19,966 It can do one, test per one floating operation. 109 00:05:19,966 --> 00:05:25,200 That means it will still require ten to the power of 25 divided by 93. 110 00:05:25,200 --> 00:05:26,466 Times ten to about. 111 00:05:26,466 --> 00:05:30,566 15 seconds to come, to run 112 00:05:30,966 --> 00:05:33,966 all of those tests to brute force through that network. 113 00:05:33,966 --> 00:05:37,433 So that means one or approximate ten to the power of 58. 114 00:05:37,433 --> 00:05:39,500 Seconds, and that is the same as ten to the. 115 00:05:39,500 --> 00:05:42,000 Power of 50 years. 116 00:05:42,000 --> 00:05:45,100 That is a huge number that is longer, 117 00:05:46,200 --> 00:05:48,100 than the universe has existed. 118 00:05:48,100 --> 00:05:52,800 And that is definitely not going to simply this number is so huge. 119 00:05:52,800 --> 00:05:58,666 It's just definitely not going to work for us, at all in our, optimization. 120 00:05:58,966 --> 00:06:00,000 So there we go. 121 00:06:00,000 --> 00:06:01,133 This is a no no. 122 00:06:01,133 --> 00:06:05,366 Even on the world's fastest supercomputer, Sunway Tail Light. 123 00:06:05,366 --> 00:06:07,533 So we have to come up with a different approach. 124 00:06:07,533 --> 00:06:10,200 How are we going to. Find the optimal weight. 125 00:06:10,200 --> 00:06:11,133 By the way. 126 00:06:11,133 --> 00:06:13,500 This our neural network was very simple. 127 00:06:13,500 --> 00:06:17,100 What about if the neural networks looks like something like this, right? 128 00:06:17,100 --> 00:06:19,000 Or even greater than that then? 129 00:06:19,000 --> 00:06:21,600 yeah, it's just not going to happen at all. 130 00:06:21,600 --> 00:06:22,600 Ever. 131 00:06:22,600 --> 00:06:26,100 So the method we're going to be looking at is called gradient descent. 132 00:06:26,100 --> 00:06:28,466 And, you may. Have heard of it already. 133 00:06:28,466 --> 00:06:29,133 if not, we will. 134 00:06:29,133 --> 00:06:30,233 Find out what it is right. 135 00:06:30,233 --> 00:06:32,433 Now. So, there's our. 136 00:06:34,200 --> 00:06:36,533 cost function, and now we're 137 00:06:36,533 --> 00:06:40,833 going to see how we can faster come up with a faster way to. 138 00:06:41,000 --> 00:06:43,100 Find the best option. 139 00:06:43,100 --> 00:06:45,833 So let's say we start somewhere. You got to start somewhere. 140 00:06:45,833 --> 00:06:47,233 So we start over there. 141 00:06:47,233 --> 00:06:51,000 And from that point, in the top left. 142 00:06:51,400 --> 00:06:53,333 What are we going to do? Is we're going to look. 143 00:06:53,333 --> 00:06:58,500 At the angle, of our cost function at that point. 144 00:06:58,500 --> 00:06:59,733 So we're just going to basically that's 145 00:06:59,733 --> 00:07:02,000 why it's called gradient because you have to differentiate. 146 00:07:02,000 --> 00:07:04,666 We're not going to look at the mathematical equations. I, we will. 147 00:07:04,666 --> 00:07:08,100 Provide some tips on additional reading at the end of the. 148 00:07:08,100 --> 00:07:09,633 Next lecture. 149 00:07:09,633 --> 00:07:12,900 but. Basically you just need to differentiate. 150 00:07:12,900 --> 00:07:16,133 Find out what the slope is in that specific point, 151 00:07:16,766 --> 00:07:19,200 and find out if the slope is positive or negative. 152 00:07:19,200 --> 00:07:22,466 If the, if the slope is negative, like in this case, means that 153 00:07:22,800 --> 00:07:23,866 you're going downhill. 154 00:07:23,866 --> 00:07:26,966 So, to the right is downhill, to the left is uphill. 155 00:07:27,200 --> 00:07:29,600 And from there it means you need to go, right. 156 00:07:29,600 --> 00:07:31,533 Basically you need to go downhill. 157 00:07:31,533 --> 00:07:32,966 And that's what. We're going to do. 158 00:07:32,966 --> 00:07:35,733 Vroom takes a step. Right to the ball. 159 00:07:35,733 --> 00:07:36,933 Rolls down. 160 00:07:36,933 --> 00:07:39,466 again same thing. You calculate the slope. 161 00:07:39,466 --> 00:07:42,233 This time the slope is positive, meaning right is uphill. 162 00:07:42,233 --> 00:07:44,933 Left is downhill, and you need to go left. 163 00:07:44,933 --> 00:07:46,633 And you roll the ball down. 164 00:07:46,633 --> 00:07:51,533 And again you calculate the slope and your old ball. 165 00:07:51,533 --> 00:07:51,833 Right. 166 00:07:53,033 --> 00:07:53,433 There you go. 167 00:07:53,433 --> 00:07:56,066 So that's how you find in in simple terms. 168 00:07:56,066 --> 00:07:59,100 That's how you find the best. 169 00:07:59,733 --> 00:08:04,233 weights, the best, situation that minimizes your cost function. 170 00:08:04,466 --> 00:08:06,066 Of course, it's not going to be. 171 00:08:06,066 --> 00:08:08,333 Like a ball rolling is going to be a very zig zag type. 172 00:08:08,333 --> 00:08:09,766 Of approach, but it's easier to. 173 00:08:09,766 --> 00:08:13,600 Remember and kind of, it's is more fun to. 174 00:08:13,633 --> 00:08:16,100 Look at it as a ball rolling. But in reality, yes, you. 175 00:08:16,100 --> 00:08:18,466 Just, it is going to be like a step by step approach. 176 00:08:18,466 --> 00:08:21,266 So it's going to be a zigzag type of method. 177 00:08:21,266 --> 00:08:22,200 Yeah. 178 00:08:22,200 --> 00:08:24,866 And also there's, there's lots of other elements to it. 179 00:08:24,866 --> 00:08:26,100 So there's. 180 00:08:26,100 --> 00:08:31,533 things like for instance, why like why does it go down. 181 00:08:31,533 --> 00:08:35,000 Why does it not like go away over the line. 182 00:08:35,000 --> 00:08:39,333 So it could have jumped out of this, going upwards instead of downwards 183 00:08:39,333 --> 00:08:40,366 and things like that. 184 00:08:40,366 --> 00:08:41,866 So there are parameters. That you can. Tweak. 185 00:08:41,866 --> 00:08:45,466 And again we will mention where you can find out more on that. 186 00:08:45,466 --> 00:08:47,566 And plus we'll have this in the practical application. 187 00:08:47,566 --> 00:08:51,600 But in the simplest intuitive approach this is what is happening. 188 00:08:51,600 --> 00:08:54,600 We are getting to the bottom. by. 189 00:08:54,600 --> 00:08:58,000 Just understanding which way we need to go instead of brute forcing through. 190 00:08:58,433 --> 00:09:00,966 Thousands and thousands and millions and billions. 191 00:09:00,966 --> 00:09:04,133 And quadrillions of combinations, we can just simply. 192 00:09:04,133 --> 00:09:05,133 Every time have a look. 193 00:09:05,133 --> 00:09:09,000 At where is where, which way is it sloping. So. 194 00:09:09,000 --> 00:09:10,100 Right, like you. 195 00:09:10,100 --> 00:09:12,800 Imagine you're standing on a. Hill, which way does it feel. 196 00:09:12,800 --> 00:09:15,200 That it's going downwards? And whichever way it is going downwards. 197 00:09:15,200 --> 00:09:15,800 You just keep. 198 00:09:15,800 --> 00:09:19,333 Walking that way you like take 50 steps that way and then you, you assess again. 199 00:09:19,333 --> 00:09:20,966 Okay. Which way is it going? Downwards. 200 00:09:20,966 --> 00:09:21,366 This way. 201 00:09:21,366 --> 00:09:21,666 Okay. 202 00:09:21,666 --> 00:09:24,500 Now take 50 steps or let's take 40 steps that way. 203 00:09:24,500 --> 00:09:27,733 So it gets less and less and less as you get closer. 204 00:09:28,400 --> 00:09:29,100 So here's. 205 00:09:29,100 --> 00:09:30,400 An example. Of gradient. 206 00:09:30,400 --> 00:09:32,633 Descent applied in a two dimensional space. 207 00:09:32,633 --> 00:09:35,633 So that was a one dimensional example. 208 00:09:35,833 --> 00:09:40,166 here we have a two dimensional space for the gradient descent. 209 00:09:40,166 --> 00:09:42,933 As you can see it's getting. Closer to the minimum. 210 00:09:42,933 --> 00:09:44,833 And it's also called gradient descent because you're. 211 00:09:44,833 --> 00:09:49,033 Descending into the minimum of the cost function. 212 00:09:49,533 --> 00:09:53,266 And finally here's the gradient descent applied in three dimensions. 213 00:09:53,266 --> 00:09:56,266 This is what it looks like if you projected onto two dimensions, 214 00:09:56,400 --> 00:09:59,400 you can see it zigzagging its way into the minimum. 215 00:09:59,533 --> 00:10:01,766 So there we go. That was gradient descent. 216 00:10:01,766 --> 00:10:02,933 In next tutorial we'll talk about 217 00:10:02,933 --> 00:10:06,533 stochastic gradient descent is we'll be continuation of this tutorial. 218 00:10:06,866 --> 00:10:08,600 And. I look forward to seeing you there. 219 00:10:08,600 --> 00:10:10,533 Until next time, enjoy deep learning.