Welcome back. In this lesson we're going to talk about one of the key inputs to the gradient descent function, and this is the learning rate. So let's add a big section heading in a markdown cell. I'm going to click on this cell here, change it from Code to Markdown, add one pound symbol and then write "The Learning Rate".

Scrolling back up to where we've defined our gradient descent function, let's take another look at our inputs. Previously we modified our Python code so that the inputs would include not only the derivative function and the initial guess, but also a multiplier, a precision and the maximum number of iterations. So far we've changed up the initial guesses and analyzed the impact that this had on the algorithm in different situations, on different cost functions. But we've not really messed with the multiplier, and this is what we're going to do now.

So what does the learning rate actually do in our algorithm? If we look at our update step, namely this line right here, we can see that our learning rate, which I've called multiplier, is multiplied by the gradient. The gradient is the value of the slope of the cost function and the multiplier is just a constant, but together they determine how big of a step we take. If you remember, when the slope was steep, the gradient was a large number and we took a big step. Similarly, if the multiplier is large, we also take a big step.

At the moment we've got the multiplier set to a default value of 0.02. Now, this might be a good time to pause the video and think about what would happen if we picked a different value. What would happen if this value was really, really small? And what would happen if the multiplier value was really, really large?

Now I'm going to illustrate the effect of the multiplier using our second example. This was the g(x) function. I'm going to take this cell right here - the cell that generated our graphs and plotted our scatter plot for gradient descent on the g(x) function. I'm going to copy this cell with the shortcut, scroll all the way down to our learning rate section and paste the cell below. I probably don't need the cell up here right at the moment, so I'm just going to take this one, move it up and modify it a little bit.
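To keep that update step in view while we play with the learning rate, here is a minimal sketch of the kind of gradient descent function being discussed. Apart from list_x and the 0.02 multiplier default, the names and values here (the stand-in derivative dg, the precision and max_iter defaults, the unpacked variable names) are assumptions for illustration and may not match the notebook exactly.

```python
def dg(x):
    # Hypothetical derivative of a cost function g(x); a stand-in for the
    # notebook's actual dg(x) so this sketch runs on its own.
    return 4 * x**3 - 8 * x

def gradient_descent(derivative_func, initial_guess, multiplier=0.02,
                     precision=0.001, max_iter=300):
    new_x = initial_guess
    x_list = [new_x]
    slope_list = [derivative_func(new_x)]

    for n in range(max_iter):
        previous_x = new_x
        gradient = derivative_func(previous_x)

        # The update step: the learning rate (multiplier) times the gradient
        # decides how big a step we take.
        new_x = previous_x - multiplier * gradient

        x_list.append(new_x)
        slope_list.append(derivative_func(new_x))

        # Stop early once the step size drops below the requested precision.
        if abs(new_x - previous_x) < precision:
            break

    return new_x, x_list, slope_list

# Example call with the default learning rate of 0.02:
local_min, list_x, deriv_list = gradient_descent(dg, initial_guess=1.9)
print('Number of steps:', len(list_x))
```

The only line that matters for this lesson is `new_x = previous_x - multiplier * gradient`: the multiplier scales the gradient, so the two of them together set the step size.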
I'm going to add some print statements below our charts at the bottom. The first print statement I'm going to add is the number of steps, and if you recall, this was the length of this list - the length of list_x. So we'll write "len(list_x)". The other thing that I'm going to modify is the initial guess that we've got, so I'm going to change it to 1.9, and I'm also going to add a multiplier here. I'll add "multiplier=", but I'm not going to change the default value just yet - let's just have it at 0.02.

And since I haven't run my notebook in a little while, I'm actually going to go to "Cell" and "Run All" instead of running just my latest cell. I'm going to have to scroll all the way down to the bottom, and there's our output. So it works as expected: if I start at 1.9, which is right here, then the first step that we take is fairly large, right here, and then we slowly move down to our minimum, and at the moment it takes only 14 steps to get to the bottom. Okay, so this works as expected.

Now let's have a little bit more fun with this. I'm going to change our function call to have a maximum of five iterations, so I only want our loop to run five times, and then I'm also going to change our multiplier from 0.02 to 0.25. Then I'm going to rerun the cell and we can have a look at our chart. So what are we seeing here now? Our multiplier is increasing our step size, and we see our algorithm bouncing around on this function. This is very, very different behavior from what we've seen before.

But let's take this to an extreme. I'm going to scroll back up to our function call, change the maximum number of times that our loop will run from 5 to 500 and hit Shift+Enter. Scrolling down, we see the whole chart is turning red. Yeah, we're bouncing around all over the place, but our algorithm is never converging. In fact, our loop at this point has run 500 times. In contrast, earlier we were at, what, 14 steps. So this is very interesting, right? What can we learn from this example? In almost every situation that we've illustrated previously, we converged within something like 20 to 60 steps, remember?
This time our loop ran 500 times and the algorithm still didn't find the minimum. So what we're seeing here is that if we're not careful, we can get into a situation where our algorithm isn't converging and it just might continue going and going and going. And this kind of brings us back to our discussion about Python for loops and Python while loops. Let me scroll back up to where we wrote that code. Here we go.

Now, you might remember that with a while loop you, as the developer, have to take extra care with your terminating condition. What's the terminating condition? It's whatever follows the while keyword, right here. If the logic in this terminating condition isn't formulated well, then it's very easy to get into a scenario where you never exit the loop, and your Python program accidentally ends up in an infinite loop. In other words, with a while loop you as the programmer need to think of the corner cases and include the logic to stop your loop from running longer than you intended.

I know that in this example, if you're looking at this code, this seems really, really trivial, but infinite loops happen maybe more often than you'd expect. Let me show you an example. We could have written our gradient descent function with a while loop instead of a for loop. This is our gradient descent function as it currently stands - with a for loop and a cutoff point determined by the maximum number of iterations. And this is how one could imagine running our gradient descent with a while loop.

At first glance this while loop would seem to make sense. The condition for running the loop is: run the loop as long as the step size is greater than the precision. The step size is always the difference between the new x and the previous x. So at some point that calculation would get more and more precise, at which point we would exit the loop. Fair enough, but this is exactly the kind of code that risks running into the infinite loop problem, because in the case where we're not converging and the step sizes between the values get larger, this loop just continues running.

So I'm not going to keep this cell around - I'm going to go "Edit" > "Delete Cell". And this is why the for loop provides quite a big contrast, right? It's got this safety net, by forcing you to explicitly state the number of times it will run ahead of time.
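For reference, a while-loop version along the lines described above might look something like this - a minimal sketch assuming the same kind of inputs as the earlier gradient descent sketch, not the notebook's exact code:

```python
def gradient_descent_while(derivative_func, initial_guess,
                           multiplier=0.02, precision=0.001):
    # While-loop variant: keep going as long as the step size is bigger
    # than the precision. If the algorithm diverges, the step size never
    # shrinks below the precision and this loop never exits.
    new_x = initial_guess
    step_size = precision + 1  # make sure the loop runs at least once

    while step_size > precision:  # the terminating condition
        previous_x = new_x
        new_x = previous_x - multiplier * derivative_func(previous_x)
        step_size = abs(new_x - previous_x)

    return new_x
```

With a sensible learning rate this behaves like the for-loop version, but with something like multiplier=0.25 on g(x) the step size never drops below the precision, so the call would simply never return - which is exactly the safety net the for loop gives you for free.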
Now let's crank up that learning rate even more. I'll scroll back down here, increase our learning rate from 0.25 to 0.3, leave everything else the same and rerun this thing. See what happens. This is our old friend, the overflow error: the result is too large. We've shot off to infinity and beyond.

What this little exercise is showing us is that there's another quirk we have to be aware of with our optimization algorithms: we, as the machine learning experts, have to choose an appropriate learning rate. So that begs the question - how do we know what the right learning rate actually is? Because that's what we've seen: if we pick a very large learning rate, then our algorithm doesn't converge, and if we pick a very, very small learning rate, then our algorithm might actually take forever. What do I mean by forever? Let me show you.

I'm going to scroll back up to where we've called our function and change this again to something that doesn't crash our Python program, so 0.02, and then I'm going to copy this bit of code right here. That's the code where we're calling our function and the code where we're plotting our first chart. I don't like seeing that error, so I'm going to rerun the cell, and then I'm going to paste our code down here. Now I'm going to change this comment here - I'm going to call it "Run gradient descent 3 times".

But before we run it three times, let's run it one time. So here's what I'm going to do. Instead of having the max iterations set to 500 here, I'm going to set this equal to a variable. I'm going to say "n=100", and then instead of having our max iterations set to 500, I'm going to set it equal to n. So I'm going to specify the maximum number of iterations up here. Then I'm going to change the multiplier from 0.02 to 0.0005, and then I'm going to add a precision - I'm going to make the calculation quite precise, so precision should be equal to 0.0001 - comma, and then have the maximum number of iterations back here. And for the initial guess we'll start off at 3. So this is going to be our first call to the gradient descent function.
But instead of having this sequence unpacking code here, we're going to store all of this information in a single tuple. This is going to be called low_gamma - gamma is often used for the learning rate, so we'll call our tuple low_gamma.

Now let's change up the code for our plot. I'm going to change the comment here to "Plotting reduction in cost for each iteration", because that's now going to be our goal. In terms of the figure size, we're going to go with a single plot, so I'm going to size this plot differently from before - I'm going to make it 20 by 10, so it's going to be quite large. Then I'm going to delete this subplot code here; we don't need that. For the axes, on the y axis we're going to go from 0 to 50 - this is going to be our cost; cost is going to be on the y axis, we'll see this in a bit. And our x axis is going to go from 0 to the number of iterations, so from 0 to n, and n is going to be a hundred - we're going to have the iterations on our x axis. The title of the chart is going to be "Effect of the learning rate", for the x label we'll have the number of iterations, and for the y label we're going to have the cost.

Now we need the data to populate our chart, so we need the values for our chart, and we need two things. The first thing is going to be what we have on our y axis. That's going to require us to convert the lists to numpy arrays, the reason being that we can feed an array into our g(x) function, but we cannot feed a list into our g(x) function. We've done this previously. I'm going to call our first array low_values and set that equal to np.array(low_gamma[1]) - it was the second item in our tuple. Why did I just put a one there? Because the second item is at index 1, because we start counting from 0: the first item is at index 0, the second item is at index 1. Okay, so that's our y axis data.

Time to get our x axis data. Now we just need our x axis to go from 0 to, like, 100, right?
So what we're going to do is create a list from 0 to n+1. Why n+1? Because we've got that extra initial guess. So even though our loop is going to run 100 times, our extra guess means we should have 100 plus 1 values. And I'm going to store this information in a variable called iteration_list.

So how did we create a list in the past? Let me scroll down here a little bit. In the past we had our square brackets and we said 0, 1, 2, 3 - right, we just populated the list with values. But this isn't what we're going to do, because we're not going to type out 100 different values in this list. Instead of creating the list manually, we're going to make use of the range function that we saw in the for loop. Our range has a starting value and an ending value, so it's going to create all the values from 0 up to n for us. But we can't do it exactly like this. The reason is that if I press Shift+Tab on this function, I can see that this thing here will actually spit out a range object. So the range function will give us a range object, but what we need is a list. How do we convert a range object to a list? Well, we can call the list function and nest the call to the range function inside our call to the list function. So now we'll have a list starting at 0 and going up to n, and that's going to be stored inside our variable called iteration_list.

And this is what we're going to take and put right here on our plot. So I'm going to put that here, and then for the y axis on this plot we're going to use, well, low_values - our array of values for our cost function. We also don't have to stick to the blue color; there are a lot of colors available, including light green for example, and we can make the line thicker - change the line width from 3 to 5 - and we're going to get rid of the alpha as well. Now I'm going to comment out the plt.scatter code and just show our plot for a change. So I'm going to say plt.show, parentheses at the end, and hit Shift+Enter to see what we get. Okay, so we get a nice line plot right here, just like this.
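Here is a rough sketch of the whole cell as described so far: the low-learning-rate run, the iteration_list and the line plot. It reuses the assumed gradient_descent and dg from the earlier sketch, and the cost function g(x) below is a hypothetical stand-in chosen only so the example runs; in the notebook you would of course use its own g(x) and dg(x).

```python
import numpy as np
import matplotlib.pyplot as plt

def g(x):
    # Hypothetical cost function standing in for the notebook's g(x).
    return x**4 - 4 * x**2 + 5

n = 100  # maximum number of iterations

# Run gradient descent with the low learning rate and keep the whole
# returned tuple together instead of unpacking it.
low_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                             multiplier=0.0005, precision=0.0001, max_iter=n)

# Y axis data: the list of x values sits at index 1 of the tuple; convert
# it to an array so it can be fed into g(x).
low_values = np.array(low_gamma[1])

# X axis data: 0..n, i.e. the n loop runs plus the extra initial guess.
# (At this low learning rate the loop uses all n iterations, so the lengths match.)
iteration_list = list(range(0, n + 1))

# Plotting reduction in cost for each iteration
plt.figure(figsize=[20, 10])
plt.xlim(0, n)
plt.ylim(0, 50)
plt.title('Effect of the learning rate')
plt.xlabel('Nr of iterations')
plt.ylabel('Cost')

plt.plot(iteration_list, g(low_values), color='lightgreen', linewidth=5)
plt.show()
```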
So this is cool, but you know what? We're going to take our scatter functionality and use it as well. That way we get a little bubble each time our loop was run, so we can see the step size a bit more clearly. Our x axis is going to be the iteration_list again, and for our y axis it's going to be our low_values array. I'm going to change it to the same color as our line plot, so it's going to be light green, and for the dot size we'll change it from 100 to 80 and get rid of the alpha as well. So let me rerun this and see what it looks like. Aha! This is pretty good. We have a larger step size in the beginning, getting smaller and smaller until the steps are very, very close together at the end. So what we're looking at here is the decrease in the cost with each iteration of our loop.

I think this is really, really neat, but it'll be even neater if we can plot two different learning rates - or three different learning rates - on the same chart next to each other and see how they compare. So I'm going to scroll back up here and correct the typo in my title so that it reads "Effect of learning rate" with a space. I'm going to add a little comment here as well; I'm going to say "Plotting low learning rate". Then I'm going to scroll back up to the beginning and put my cursor right here.

Now, as a challenge, can you plot two more learning rates on our chart? So pause the video, add a tuple called mid_gamma and keep all the inputs to the call to the gradient descent function the same, but change the multiplier to double the learning rate, to 0.001. Then create another tuple called high_gamma and change the learning rate there to 0.002. You're going to have to extract the values from the tuples that you get back and throw those onto our chart. I'll give you a few seconds to pause the video and give this a try.

And here's the solution. The quickest way to do this is to copy this code here, paste it two times, change the name of this tuple to mid_gamma, change the name of this tuple to high_gamma, and then this multiplier is going to be 0.001.
This multiplier is going to be 0.002. Then we're going to come down here, take this code right here, copy it, paste it twice, and I'm going to change my comments to "Plotting mid learning rate" and "Plotting high learning rate". Then, instead of feeding the low values in here, I'm going to make a call to np.array and put in our tuple: mid_gamma[1]. I'm going to take this, copy it, put it here as well: mid_gamma[1]. And this is going to be np.array(high_gamma[1]), and the same with this one - it's going to be np.array(high_gamma[1]).

But one thing that I'm going to change as well is the color. I don't want all of these graphs to have the very same color, otherwise I can't tell them apart on the chart. One nice color to contrast with the light green is steel blue - this is going to be for the slightly higher learning rate - and then for the highest learning rate on this chart I'm going to pick my favorite color: hot pink. Why not? Now again, I'm not making these color names up; I'm taking them from the official matplotlib documentation. These are the names that I can put into that argument, so there's a predefined set of names that you can use, and the spelling has to match exactly, otherwise they won't be recognized.

Okay, so I've got my charts. Now, the proof is in the pudding, as they say. Hit Shift+Enter and let's see what we get. I think this chart is beautiful, because it illustrates so nicely how our learning rate affects our algorithm. We've run our gradient descent three separate times with different learning rates, and now we can see how the cost decreases with each iteration of the loop. What we're seeing here is that the highest learning rate, the one in hot pink, converges the fastest. In other words, the higher the multiplier, the faster the convergence. And in contrast, we've got the really low multiplier converging very, very slowly towards the minimum.
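For reference, here is a hedged sketch of what the finished comparison cell might look like, again reusing the assumed gradient_descent, dg and g from the earlier sketches; the notebook's actual cell may differ in the details.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 100

# Run gradient descent 3 times with different learning rates (gamma)
low_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                             multiplier=0.0005, precision=0.0001, max_iter=n)
mid_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                             multiplier=0.001, precision=0.0001, max_iter=n)
high_gamma = gradient_descent(derivative_func=dg, initial_guess=3,
                              multiplier=0.002, precision=0.0001, max_iter=n)

# Plotting reduction in cost for each iteration
iteration_list = list(range(0, n + 1))
plt.figure(figsize=[20, 10])
plt.xlim(0, n)
plt.ylim(0, 50)
plt.title('Effect of learning rate')
plt.xlabel('Nr of iterations')
plt.ylabel('Cost')

# Plotting low learning rate
plt.plot(iteration_list, g(np.array(low_gamma[1])), color='lightgreen', linewidth=5)
plt.scatter(iteration_list, g(np.array(low_gamma[1])), color='lightgreen', s=80)

# Plotting mid learning rate
plt.plot(iteration_list, g(np.array(mid_gamma[1])), color='steelblue', linewidth=5)
plt.scatter(iteration_list, g(np.array(mid_gamma[1])), color='steelblue', s=80)

# Plotting high learning rate
plt.plot(iteration_list, g(np.array(high_gamma[1])), color='hotpink', linewidth=5)
plt.scatter(iteration_list, g(np.array(high_gamma[1])), color='hotpink', s=80)

plt.show()
```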
Now, if we picked an even lower multiplier, it would reach the minimum even more slowly. But as we've shown before, a higher multiplier is only better up to a point, right? Remember how in our earlier example our entire graph was red? Clearly, with a really high multiplier you either get an overflow error or you don't converge on the minimum. So higher multipliers don't always work out.

In fact, let's plot that crazy behavior that we had earlier on this chart as well. I'm going to go up here, add a comment and call it "Experiment", and then I'm going to take my Python code, copy it, paste it here and maybe call this one insane_gamma. Now, my initial guess was 1.9 in our earlier example, and the multiplier was 0.25, if I remember correctly. All I need to do now is come down here, take this bit of code, paste it and modify it so it reads "Plotting insane learning rate", and this is going to be our insane_gamma. In terms of color, let's go for good old red to, uh, show that this is not a good thing. Now I can hit Shift+Enter and see how that behaves.

So what have we got? Whew, that looks really bad. In our fourth example we started out much closer to the minimum: we started with an initial guess of 1.9, compared to the initial guess of 3 that we used for the other runs. So even though our initial guess was actually much better and our cost was much lower, that doesn't help us when our learning rate is all screwed up - we can see here that our cost doesn't actually come down. Even though the algorithm has run 100 times, the cost just bounces around, decreasing, increasing, decreasing again, and in the end, after a hundred iterations, our cost is actually much higher than the cost for all the other examples. All the other learning rates had a lower cost after 100 iterations than this really, really high multiplier of 0.25. So I think that illustrates the problem really nicely, when the algorithm is bouncing around with no clear direction.
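The extra run and plot described above might look roughly like this, under the same assumptions as the previous sketches. In the notebook these lines would sit inside the same cell as the other three runs, before plt.show(), so everything lands on one chart.

```python
# Experiment: a deliberately unstable learning rate
insane_gamma = gradient_descent(derivative_func=dg, initial_guess=1.9,
                                multiplier=0.25, precision=0.0001, max_iter=n)

# Plotting insane learning rate
plt.plot(iteration_list, g(np.array(insane_gamma[1])), color='red', linewidth=5)
plt.scatter(iteration_list, g(np.array(insane_gamma[1])), color='red', s=80)
```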
And I think a takeaway from this exercise is that picking the right learning rate is both a bit of an art and a science, so it shouldn't really come as a surprise that there isn't one perfect solution for picking a good learning rate either. Even if you go to the machine learning literature, you'll find that there are different approaches for picking a good learning rate.

Now, I've talked a little bit about two of the things that this optimization algorithm is sensitive to - these two knobs, right, that we can turn: the initial guess and the learning rate. And you might come away thinking that, oh, you know, gradient descent is a bad algorithm because it can have certain problems. It has pros and cons, but one of the really, really big advantages of gradient descent is that it is an incredibly simple algorithm, and it's also quite fast - it's quite a fast thing to run to train your machine learning model - and that's an advantage that not every competing algorithm can claim for itself. So, given its relative simplicity and speed, you can actually try out different learning rates for your cost function and see what works best.

But of course there are more elegant approaches to picking a learning rate as well. For example, we could adjust our learning rate while the algorithm runs, meaning the learning rate doesn't have to be fixed - it could change after each step in our loop. So why would we do that, you might ask? Why would we have a learning rate that isn't fixed at a particular value? Why would you want to update the learning rate every time the loop runs? Well, the idea behind that is that the further you are from the optimal value, the faster you should move towards that minimum, and thus the ideal value of your learning rate should be larger. On the other hand, once you start getting closer and closer to that minimum, the learning rate should come down as well, because you don't want to overshoot. And this is why some machine learning practitioners create a predefined schedule for the learning rate ahead of time. In the schedule, the learning rate starts off large and then gradually gets smaller. But there's a whole host of other techniques as well that people are using. For example, one quite simple technique is called the bold driver, and it works like this.
If your error rate - yeah, your cost - was reduced since the last iteration, then you can try increasing the learning rate by 5 percent. And if your error rate in fact increased, meaning that you skipped over the optimal point, then you should reset the values of your parameters to the values of the previous iteration and decrease the learning rate by 50 percent. So you can see with this kind of rule how the learning rate would change: increase it by 5 percent if you had a reduction in your error rate, and go back one step and decrease the learning rate by 50 percent if you had an increase in your cost.

Okay, wow, so this was quite a theoretical lesson and we've covered quite a bit. We've covered how our algorithm is sensitive both to the initial guess and to the learning rate, and now we can start to tackle some more complicated cost functions. In particular, so far we've only been working with estimating a single value, right? When you look at g(x), there's only one thing in there - there it is, x, right? But this is only one dimension. Let's turn our attention to how we can tackle two dimensions in our gradient descent algorithm, and then you'll see how you can tackle more than two very easily. I'll see you in the next lesson. Have a good one.
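One last aside on those learning-rate techniques: here is a small sketch of the bold driver rule and a simple decaying schedule. Everything in it (the names, the way the 5% / 50% adjustment sits inside a plain update loop) is just an illustration of the rules described in this lesson, not code from the notebook.

```python
def gradient_descent_bold_driver(derivative_func, cost_func, initial_guess,
                                 learning_rate=0.002, max_iter=100):
    # Bold driver: if the cost went down, grow the learning rate by 5%;
    # if it went up, undo the step and cut the learning rate by 50%.
    x = initial_guess
    cost = cost_func(x)

    for _ in range(max_iter):
        previous_x, previous_cost = x, cost
        x = previous_x - learning_rate * derivative_func(previous_x)
        cost = cost_func(x)

        if cost < previous_cost:
            learning_rate *= 1.05                # cost reduced: speed up a little
        else:
            x, cost = previous_x, previous_cost  # overshot: go back one step
            learning_rate *= 0.5                 # and slow down by half

    return x

# A predefined schedule is even simpler, e.g. start large and decay each step:
# learning_rate = 0.1 * 0.95 ** iteration
```

With the hypothetical g and dg from the earlier sketches, you would call this as gradient_descent_bold_driver(dg, g, initial_guess=3).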