1 00:00:01,000 --> 00:00:03,266 Hello and welcome back to the course on Deep Learning. 2 00:00:03,266 --> 00:00:06,300 Today we're talking about stochastic gradient descent. 3 00:00:07,066 --> 00:00:10,100 Previously we learned about gradient descent. 4 00:00:10,100 --> 00:00:15,033 And we found out that it is a very efficient method to solve 5 00:00:15,033 --> 00:00:19,200 our optimization problem, where we are trying to minimize the cost function. 6 00:00:19,500 --> 00:00:21,233 It basically 7 00:00:21,233 --> 00:00:26,600 takes us from ten to the power of 57 years to solving a problem 8 00:00:26,600 --> 00:00:30,633 within minutes or hours, or within a day or so. 9 00:00:30,933 --> 00:00:36,200 And it really helps speed things up, because we can see which way is downhill, 10 00:00:36,200 --> 00:00:41,100 and we can just go in that direction and take steps and get to the minimum faster. 11 00:00:41,466 --> 00:00:45,000 But the thing with gradient descent 12 00:00:45,300 --> 00:00:50,900 is that this method requires the cost function to be convex. 13 00:00:50,966 --> 00:00:54,900 And as you can see here, we've specifically chosen a convex cost 14 00:00:54,900 --> 00:00:55,200 function. 15 00:00:55,200 --> 00:00:59,400 Basically, convex means that the function looks similar 16 00:00:59,400 --> 00:01:03,166 to what we are seeing now, that it's just convex 17 00:01:03,166 --> 00:01:06,166 in one direction, and that it 18 00:01:06,366 --> 00:01:09,233 in essence has one global minimum. 19 00:01:09,233 --> 00:01:11,233 And that's the one that we're going to find. 20 00:01:11,233 --> 00:01:13,966 But what if our function is not convex? 21 00:01:13,966 --> 00:01:15,966 What if our cost function is not convex? 22 00:01:15,966 --> 00:01:17,866 What if it looks something like this? 23 00:01:17,866 --> 00:01:19,700 Well, first of all, how could that happen?
24 00:01:19,700 --> 00:01:21,666 Well, that could happen 25 00:01:21,666 --> 00:01:25,766 if, first of all, we choose a cost function which is not 26 00:01:26,033 --> 00:01:29,200 the squared difference between y hat and y, 27 00:01:29,633 --> 00:01:33,766 or if we do choose a cost function like that, 28 00:01:33,766 --> 00:01:37,500 but then, in a multi-dimensional space, it can actually turn into something 29 00:01:37,600 --> 00:01:39,633 that is not convex. 30 00:01:39,633 --> 00:01:41,966 And so what would happen in this case if we just tried 31 00:01:41,966 --> 00:01:44,966 to apply our normal gradient descent method? 32 00:01:45,000 --> 00:01:46,333 Something like this could happen. 33 00:01:46,333 --> 00:01:51,133 We could find a local minimum of the cost function rather than the global one. 34 00:01:51,133 --> 00:01:53,133 So this one was the best one. 35 00:01:53,133 --> 00:01:54,633 And we found the wrong one. 36 00:01:54,633 --> 00:01:57,633 And therefore we don't have the correct weights. 37 00:01:57,633 --> 00:02:00,166 We don't have an optimized neural network. 38 00:02:00,166 --> 00:02:02,433 We have a subpar neural network. 39 00:02:02,433 --> 00:02:04,500 And so what do we do in this case? 40 00:02:04,500 --> 00:02:09,933 Well, the answer here is stochastic gradient descent. 41 00:02:09,933 --> 00:02:13,366 And it turns out that stochastic gradient descent doesn't require 42 00:02:13,433 --> 00:02:15,233 the cost function to be convex. 43 00:02:15,233 --> 00:02:19,033 So let's have a look at the differences between the normal gradient 44 00:02:19,033 --> 00:02:21,700 descent that we talked about and stochastic gradient descent. 45 00:02:21,700 --> 00:02:25,133 So normal gradient descent is when we take all of our rows 46 00:02:25,400 --> 00:02:27,566 and plug them into our neural network. 47 00:02:27,566 --> 00:02:31,900 And once again, here we've got the neural network copied over several times.
48 00:02:31,900 --> 00:02:35,700 But the rows are being plugged into that same neural network every time. 49 00:02:35,866 --> 00:02:37,100 So there's only one neural network. 50 00:02:37,100 --> 00:02:39,200 This is just for visualization purposes. 51 00:02:39,200 --> 00:02:42,066 And then, once we've plugged them in, we calculate our cost function 52 00:02:42,066 --> 00:02:43,300 based on the formula on the right, 53 00:02:43,300 --> 00:02:45,366 looking at the chart at the bottom. 54 00:02:45,366 --> 00:02:47,400 And then we adjust the weights. 55 00:02:47,400 --> 00:02:49,633 This is called the gradient descent method. 56 00:02:49,633 --> 00:02:54,366 Or, the proper term is the batch gradient descent method. 57 00:02:54,366 --> 00:02:59,866 So we take the whole batch from our sample, we apply it, and then we run 58 00:02:59,866 --> 00:03:03,400 that. The stochastic gradient descent method is a bit different. 59 00:03:03,666 --> 00:03:05,866 Here, we take the rows one by one. 60 00:03:05,866 --> 00:03:07,000 So we take this row. 61 00:03:07,000 --> 00:03:11,200 We run our neural network and then we adjust the weights. 62 00:03:11,866 --> 00:03:13,466 Then we move on to the second row. 63 00:03:13,466 --> 00:03:16,400 We take the second row. We run our neural network. 64 00:03:16,400 --> 00:03:17,766 We look at the cost function. 65 00:03:17,766 --> 00:03:20,033 And then we adjust the weights again. 66 00:03:20,033 --> 00:03:21,100 And then we take another row. 67 00:03:21,100 --> 00:03:22,600 Take row three. 68 00:03:22,600 --> 00:03:23,700 We run our neural network. 69 00:03:23,700 --> 00:03:25,366 We look at the cost function, we adjust the weights. 70 00:03:25,366 --> 00:03:27,666 So basically, 71 00:03:28,800 --> 00:03:31,166 we're adjusting the weights after every single row 72 00:03:31,166 --> 00:03:34,166 rather than doing everything together and then adjusting the weights.
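The two update schedules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "network" is a single weight w with prediction w * x, the per-row cost is half the squared difference between y hat and y, and the toy rows, the learning rate lr, and the function names are all made up for this sketch rather than taken from the course.

```python
# Toy data (made up for illustration): y = 2 * x exactly.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def gradient(w, x, y):
    """Derivative of the per-row cost 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def batch_epoch(w, rows, lr):
    """Batch gradient descent: look at every row, then adjust the weight once."""
    total = sum(gradient(w, x, y) for x, y in rows)
    return w - lr * total / len(rows)

def stochastic_epoch(w, rows, lr):
    """Stochastic gradient descent: adjust the weight after every single row."""
    for x, y in rows:
        w = w - lr * gradient(w, x, y)
    return w

w_batch = w_sgd = 0.0
for _ in range(50):  # 50 passes over the data for each method
    w_batch = batch_epoch(w_batch, rows, lr=0.05)
    w_sgd = stochastic_epoch(w_sgd, rows, lr=0.05)

print(round(w_batch, 3), round(w_sgd, 3))
```

On this easy convex toy problem both schedules head toward the same weight. The only difference shown here is how often the weight is adjusted: once per pass for the batch version, once per row for the stochastic version.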
73 00:03:34,400 --> 00:03:36,066 Two different approaches. 74 00:03:36,066 --> 00:03:39,600 And now we're going to just compare the two side by side. 75 00:03:39,600 --> 00:03:40,500 So here they are. 76 00:03:40,500 --> 00:03:42,766 This is how to visually remember them. 77 00:03:42,766 --> 00:03:46,066 So you've got the batch gradient descent, where you adjust 78 00:03:46,400 --> 00:03:48,966 the weights after you've run 79 00:03:48,966 --> 00:03:52,100 all of the rows through your neural network. 80 00:03:52,866 --> 00:03:54,900 And then, basically, you adjust the weights 81 00:03:54,900 --> 00:03:57,433 and you run the whole thing again. Iteration, iteration, iteration. 82 00:03:57,433 --> 00:04:00,866 In the stochastic gradient descent method, you run one row at a time. 83 00:04:01,433 --> 00:04:04,933 And you adjust the weights, adjust the weights, adjust the weights. 84 00:04:04,933 --> 00:04:07,666 And then you do everything again and again. 85 00:04:07,666 --> 00:04:10,666 And that is called the stochastic gradient descent method. 86 00:04:11,100 --> 00:04:14,700 The two main differences are, first, that the stochastic gradient 87 00:04:14,700 --> 00:04:19,166 descent method helps you avoid the problem 88 00:04:19,166 --> 00:04:24,000 where you find those local extrema, or local minimums, 89 00:04:24,000 --> 00:04:28,200 rather than the overall global minimum. 90 00:04:28,900 --> 00:04:32,700 And the reason for that, in simple terms, is that SGD, 91 00:04:32,900 --> 00:04:36,833 or the stochastic gradient descent method, has much higher fluctuations 92 00:04:36,833 --> 00:04:38,100 because it can afford them. 93 00:04:38,100 --> 00:04:42,233 It's doing one iteration, or one row, at a time, and therefore 94 00:04:42,233 --> 00:04:43,366 the fluctuations are much higher. 95 00:04:43,366 --> 00:04:46,366 And it's much more likely to find the global 96 00:04:46,800 --> 00:04:49,333 minimum, rather than just a local minimum.
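The local-minimum problem itself is easy to reproduce. The sketch below is an assumption-laden illustration, not the lecture's own example: the cost function w**4 - 3*w**2 + w is made up because it has one shallow valley and one deeper valley, and the learning rate and step count are arbitrary. It runs plain (deterministic) gradient descent from two starting points and shows that where you end up depends entirely on where you start.

```python
def cost(w):
    """A made-up non-convex cost with two valleys (not from the lecture)."""
    return w**4 - 3 * w**2 + w

def slope(w):
    """Derivative of cost(w) with respect to w."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, lr=0.01, steps=2000):
    """Plain gradient descent: repeatedly step downhill from the starting w."""
    for _ in range(steps):
        w = w - lr * slope(w)
    return w

w_left = gradient_descent(-2.0)   # starts on the left-hand slope
w_right = gradient_descent(2.0)   # starts on the right-hand slope

print(f"start -2.0 -> w = {w_left:.3f}, cost = {cost(w_left):.3f}")
print(f"start +2.0 -> w = {w_right:.3f}, cost = {cost(w_right):.3f}")
```

Both runs stop at a point where the slope is zero, but the run that starts on the right settles in the shallower valley, with a higher cost than the one on the left. With nothing to shake it loose, plain gradient descent stays there, and that is the situation where the extra fluctuations of the stochastic method can help.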
97 00:04:49,333 --> 00:04:52,566 And the other thing about stochastic gradient descent, 98 00:04:52,566 --> 00:04:56,400 compared to batch gradient descent, is that it's faster. 99 00:04:56,433 --> 00:04:58,566 The first impression that you might have is that, 100 00:04:58,566 --> 00:05:00,700 because it's doing every single row one at a time, 101 00:05:00,700 --> 00:05:04,533 it is slower. But in fact, it is faster, because 102 00:05:04,833 --> 00:05:09,000 it doesn't have to load up all the data into memory 103 00:05:09,000 --> 00:05:12,433 and wait until all of those rows are run all together. 104 00:05:12,566 --> 00:05:15,000 You can just run them one by one, so it's a much lighter 105 00:05:15,000 --> 00:05:16,733 algorithm. It's much faster in that sense. 106 00:05:16,733 --> 00:05:21,133 So in those senses, 107 00:05:21,133 --> 00:05:25,033 it has more advantages over the batch gradient descent method. 108 00:05:25,266 --> 00:05:29,633 The main advantage, or the main pro, of the batch 109 00:05:29,633 --> 00:05:32,633 gradient descent method is that it is a deterministic algorithm, 110 00:05:32,733 --> 00:05:36,866 as opposed to stochastic gradient descent, which is a stochastic algorithm, 111 00:05:36,866 --> 00:05:40,500 meaning it's random. And with the batch gradient descent method, 112 00:05:40,666 --> 00:05:43,666 as long as you have the same starting weights 113 00:05:44,233 --> 00:05:45,433 for your neural network, 114 00:05:45,433 --> 00:05:49,000 every time you run the batch gradient descent method, you will get 115 00:05:49,000 --> 00:05:54,200 the same iterations, the same results for the way 116 00:05:54,200 --> 00:05:57,600 your weights are being updated, whereas for the stochastic gradient 117 00:05:57,600 --> 00:06:01,066 descent method, you won't get that, because it is a stochastic method.
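The deterministic-versus-stochastic contrast can be checked directly. The sketch below rests on assumptions that are not in the lecture: a one-weight model with prediction w * x, made-up noisy rows, and a stochastic version that shuffles the row order on every pass using Python's standard random module. The seeded run at the end is an extra, showing that fixing the random seed makes a stochastic run repeatable.

```python
import random

# Toy data (made up): roughly y = 2 * x with a little noise.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def gradient(w, x, y):
    """Derivative of the per-row cost 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def batch_history(w, rows, lr, epochs):
    """Record the weight after each batch update (one update per pass)."""
    history = []
    for _ in range(epochs):
        w = w - lr * sum(gradient(w, x, y) for x, y in rows) / len(rows)
        history.append(w)
    return history

def sgd_history(w, rows, lr, epochs, rng):
    """Record the weight after each per-row update, with rows in random order."""
    history = []
    for _ in range(epochs):
        order = list(rows)
        rng.shuffle(order)  # the random row order is the stochastic part
        for x, y in order:
            w = w - lr * gradient(w, x, y)
            history.append(w)
    return history

batch_a = batch_history(0.0, rows, lr=0.05, epochs=5)
batch_b = batch_history(0.0, rows, lr=0.05, epochs=5)
print("batch runs identical:", batch_a == batch_b)

# Unseeded generators: the two runs will usually visit rows in different orders.
sgd_a = sgd_history(0.0, rows, lr=0.05, epochs=5, rng=random.Random())
sgd_b = sgd_history(0.0, rows, lr=0.05, epochs=5, rng=random.Random())
print("SGD runs identical:", sgd_a == sgd_b)

# Same seed for both runs: the row orders, and so the updates, match again.
seeded_a = sgd_history(0.0, rows, lr=0.05, epochs=5, rng=random.Random(42))
seeded_b = sgd_history(0.0, rows, lr=0.05, epochs=5, rng=random.Random(42))
print("seeded SGD runs identical:", seeded_a == seeded_b)
```

Both methods start from the same weight here, 0.0. The batch histories always match, while the unseeded stochastic histories depend on the row order each run happens to draw.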
118 00:06:01,066 --> 00:06:03,800 You are picking your rows, possibly at random. 119 00:06:03,800 --> 00:06:08,666 And you are updating your neural network in a stochastic manner, and therefore, 120 00:06:08,900 --> 00:06:12,433 every single time you run the stochastic gradient 121 00:06:12,433 --> 00:06:13,233 descent method, even 122 00:06:13,233 --> 00:06:16,500 if you have the same weights at the start, you're going to have a different 123 00:06:17,000 --> 00:06:20,233 process, different iterations, to get there. 124 00:06:20,633 --> 00:06:23,100 So that's, in a nutshell, 125 00:06:23,100 --> 00:06:27,766 what stochastic gradient descent is. Also, there's a method in between 126 00:06:27,766 --> 00:06:30,400 the two called the mini-batch gradient descent method, 127 00:06:30,400 --> 00:06:34,100 where you combine the two and, basically, 128 00:06:34,100 --> 00:06:37,500 rather than running the whole batch or running one row at a time, 129 00:06:37,500 --> 00:06:40,500 you run batches of rows, maybe five, ten, 100, 130 00:06:40,666 --> 00:06:44,933 however many rows you decide to set. You run that number of rows at a time, 131 00:06:45,066 --> 00:06:47,766 then you update your weights, and so on. 132 00:06:47,766 --> 00:06:50,433 And that's called the mini-batch gradient descent method. 133 00:06:50,433 --> 00:06:52,933 If you'd like to learn more about gradient descent, 134 00:06:52,933 --> 00:06:56,500 there's a great article which you can have a look at. 135 00:06:56,533 --> 00:07:00,300 It's called A Neural Network in 13 Lines of Python, 136 00:07:00,300 --> 00:07:03,300 Part 2: Gradient Descent, by Andrew Trask. 137 00:07:03,700 --> 00:07:05,700 And the link is below. 138 00:07:05,700 --> 00:07:07,200 It's on GitHub. 139 00:07:07,200 --> 00:07:08,600 It's a 2015 article, 140 00:07:08,600 --> 00:07:12,300 very well written, in very simple terms.
141 00:07:12,800 --> 00:07:17,666 It's got some interesting philosophical, or just interesting, 142 00:07:17,666 --> 00:07:21,400 thoughts on how to apply gradient descent, 143 00:07:21,400 --> 00:07:26,133 what the advantages and disadvantages are, and 144 00:07:26,166 --> 00:07:28,066 how to do things in certain situations. 145 00:07:28,066 --> 00:07:30,600 So he's got some very cool tips, tricks and hacks. 146 00:07:30,600 --> 00:07:31,966 Very easy to read. 147 00:07:31,966 --> 00:07:33,633 So definitely check that out. 148 00:07:33,633 --> 00:07:36,866 And another one, a bit of a heavier read, 149 00:07:36,933 --> 00:07:40,766 for those of you who are into mathematics, who want to get to the bottom 150 00:07:40,766 --> 00:07:46,066 of the mathematics: why gradient descent specifically, what the formulas are 151 00:07:46,066 --> 00:07:49,066 that are driving gradient descent, how it is calculated, and so on. 152 00:07:49,166 --> 00:07:51,533 Check out the article, or actually the book. 153 00:07:51,533 --> 00:07:52,233 It's a free online 154 00:07:52,233 --> 00:07:55,566 book called Neural Networks and Deep Learning by Michael Nielsen, 155 00:07:56,233 --> 00:07:57,033 a 2015 book. 156 00:07:57,033 --> 00:07:59,533 Basically, it's all online. 157 00:07:59,533 --> 00:08:02,933 You can go ahead and check it out there. 158 00:08:02,933 --> 00:08:05,766 Again, a very soft introduction to the mathematics. 159 00:08:05,766 --> 00:08:07,200 But then, 160 00:08:07,200 --> 00:08:09,966 the mathematics gets pretty heavy 161 00:08:09,966 --> 00:08:13,033 as you go along, as you read through it. 162 00:08:13,500 --> 00:08:17,233 But at the same time, it gets you into that mood. 163 00:08:17,266 --> 00:08:20,100 I think he has, like, a warm-up chapter 164 00:08:20,100 --> 00:08:22,566 where you first warm up with the math and then you jump into it. 165 00:08:22,566 --> 00:08:23,866 So if you're interested in math, 
166 00:08:23,866 --> 00:08:26,400 then this is the one to go to. 167 00:08:26,400 --> 00:08:29,033 And there we go. So that's, in a nutshell, 168 00:08:29,033 --> 00:08:32,733 the difference between gradient descent and stochastic gradient descent, 169 00:08:32,733 --> 00:08:36,266 and how the two work. 170 00:08:36,266 --> 00:08:39,733 And on that note, we're going to wrap up today's tutorial. 171 00:08:39,733 --> 00:08:41,900 I look forward to seeing you on the next one. 172 00:08:41,900 --> 00:08:44,033 And until then, enjoy deep learning.
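As a closing sketch, the mini-batch method mentioned earlier in this tutorial can be written as one function with a batch_size setting. The assumptions are the same kind as before and none of them come from the course: a one-weight model with prediction w * x, made-up rows, and hypothetical names such as minibatch_epoch. Setting batch_size to 1 gives the row-by-row stochastic schedule (here without shuffling), and setting it to the number of rows gives the batch schedule.

```python
# Toy data (made up for illustration): y = 2 * x exactly.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def gradient(w, x, y):
    """Derivative of the per-row cost 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def minibatch_epoch(w, rows, lr, batch_size):
    """One pass over the data, adjusting the weight once per mini-batch."""
    updates = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        mean_grad = sum(gradient(w, x, y) for x, y in batch) / len(batch)
        w = w - lr * mean_grad
        updates += 1
    return w, updates

for batch_size in (1, 2, len(rows)):
    w = 0.0
    for _ in range(50):  # 50 passes over the data
        w, updates = minibatch_epoch(w, rows, lr=0.05, batch_size=batch_size)
    print(f"batch_size={batch_size}: {updates} updates per pass, w = {w:.4f}")
```

The batch size only changes how many rows are averaged before each weight update, which is why the same function covers all three methods.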