So we're slowly coming up to the best part, namely the part where we're about to run our gradient descent algorithm on our mean squared error cost function. But before we dive into the Python code and calculate in which direction our algorithm should move, we need to work out the slope of our cost function first, namely our gradient. And this is where I've got some great news for you, because we just have to apply some of the same calculus tricks that we've covered so far, and the partial derivatives that are coming up are really not that hard to figure out. So let's dive in.

Now, you'll recall that our mean squared error function looks like this: "(1/n) * sum((y - y_hat)^2)". So, actual values minus predicted values, squared; you sum them all up and you take the average. But the thing is, since we're running a very simple linear regression with one variable only, namely x, our y hat actually takes the form "theta_0 + theta_1*x". This is the linear regression model that we're using currently. It's got one variable and two parameters: theta_0 and theta_1.

So what does this mean for our mean squared error? Well, if we take our equation and simply substitute our linear regression model into it for y hat, we get something like this. And by removing those inner parentheses, we can simplify it to the following form: our mean squared error for this particular linear regression model actually looks like "(1/n) * sum((y - theta_0 - theta_1*x)^2)".

But you know what, we can take this even further. Check it out: what we're going to do now is write out all the terms in this equation. So this is the opposite of simplifying it, but it's going to make calculating our partial derivatives a lot easier. If we multiply out all the terms in this equation, we get quite a few terms, starting with y squared. Of course, there are a few terms in this long list that we can combine, so it gets a bit shorter once we simplify a little. Now, I know that doesn't look very pretty, but the great thing about having the equation written out like this is that we can calculate the partial derivatives very, very easily.

So what I want to do is start this lesson out with a challenge.
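For reference, here's a sketch of that fully written-out mean squared error, with the little (i) superscripts on x and y left off, just as on the slide:

```latex
MSE = \frac{1}{n} \sum \big( y - \theta_0 - \theta_1 x \big)^2
    = \frac{1}{n} \sum \big( y^2 - 2\theta_0 y - 2\theta_1 x y
                           + \theta_0^2 + 2\theta_0 \theta_1 x + \theta_1^2 x^2 \big)
```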
I would like you to get out pencil and paper, apply the power rule to the equation above, our mean squared error equation, and take the partial derivative with respect to theta 0, so our intercept. I'll give you a few seconds to pause the video before I show you the solution.

Ready? Here we go. Looking at this equation, the first thing you'll notice is that there are quite a few terms that don't depend on theta 0, namely y^2, 2*theta_1*x*y and theta_1^2*x^2. For a partial derivative, these terms are treated as constants and they drop out of the equation. This is what we've talked about before. So what are we left with? Well, we're left with the following sum: "-2y + 2*theta_0 + 2*theta_1*x". And this comes simply from applying the power rule that we covered in a previous lesson. Looking at this, we can simplify it a little bit to make it look a bit prettier. The first thing I'm going to do is factor out the 2 that all three terms in the sum have in common. In fact, I'm actually going to factor out a -2, which leaves "(-2)*(y - theta_0 - theta_1*x)", and then I can simply move this constant outside of the sum. And that's really it. That's the partial derivative with respect to theta 0.

One thing you might have noticed is that I've left out the little i's in the superscript in this derivation; I'm going to put those back in now. I left them out earlier because otherwise the notation would have gotten too busy on the slide.

So now that we've worked out the partial derivative with respect to our first parameter, we can work out the partial derivative with respect to our second parameter, namely theta 1. Once again, I'm going to pose this as a challenge, because once you've worked out this one, working out the other is very, very similar. You go through exactly the same steps, but you'll get a slightly different result. I'll give you a few seconds to pause the video and scribble this down with pencil and paper.

Ready? Here's the solution. The equation that you get at the end, when you go through all the steps and simplify, will look something like this.
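Sketched out side by side, with the (i) superscripts restored and the sum taken over all n data points, the two results look like this:

```latex
\frac{\partial MSE}{\partial \theta_0} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big)

\frac{\partial MSE}{\partial \theta_1} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big) \, x^{(i)}
```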
It'll be very, very similar to the partial derivative with respect to theta 0, except that you're multiplying the entire thing by the x values at the end of the sum.

With these two equations in front of us, we can now add them to the Jupyter notebook. Once again, the first thing I'm going to do is add a section heading with some markdown and our LaTeX equations. This section heading I'm going to call "Partial Derivatives of the MSE w.r.t.", then I'm going to add a dollar sign, a backslash, write "theta_0", close it with another dollar sign, and add "$\theta_1$" as well. Using the dollar signs, I'm including some LaTeX notation inline in our section heading, and it's going to look like this when I press Shift+Enter.

But let's add our partial derivatives in LaTeX notation as well. I'm going to add two hashtags, that is, two pound symbols, then two dollar signs, and write our fraction. It's going to be "\frac{}{}", and within the first pair of curly braces I'm going to write "\partial MSE", and in the second pair of curly braces I'm going to write "\partial \theta_0". That whole thing is going to be equal to something, but before I add that bit, let's take a quick look at what this looks like. I'm going to add my two dollar signs at the end, press Shift+Enter, and there I can see my fraction with the partial derivative symbols in front.

Okay, so just so we have our equation in the Jupyter notebook as well, let's write it out here together. It's going to be a minus sign, then "\frac{2}{n}", a space, then "\sum_{i=1}^{n}", a caret and curly braces for the upper limit, and then "\big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big)". Let's see what this looks like. I might say that's pretty spot on. I'm just going to click inside here and add the second one as well. This is the easy part. We copy what we've just written, paste it below, change the theta_0 to theta_1, and then at the end we add "\big( x^{(i)} \big)". That's it. Now we've got our partial derivative equations displayed beautifully in the Jupyter notebook.
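If you want something to check your cell against, the finished markdown cell might look roughly like this. The exact heading level and wording are my assumption here; the two equations themselves are the ones we just dictated:

```markdown
# Partial Derivatives of the MSE w.r.t. $\theta_0$ and $\theta_1$

## $$\frac{\partial MSE}{\partial \theta_0} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big)$$

## $$\frac{\partial MSE}{\partial \theta_1} = -\frac{2}{n} \sum_{i=1}^{n} \big( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \big) x^{(i)}$$
```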
So one thing I'll note here is that these partial derivatives are going to depend on what kind of equation we're using for y hat. At this point, we're using linear regression with one variable, so we substituted that in and then derived our partial derivatives from it. If we had a different model, say linear regression with two variables, or something that estimates our y hat a little differently, then we'd simply substitute that equation into our mean squared error and do the same derivation, if we're so inclined. This means the mean squared error cost function lends itself very well to all sorts of regression problems, and it adapts very, very well to all kinds of models as well.

So, having written out the partial derivatives in this form, we can create a function that calculates the slopes of the parameters in Python code. I'm going to add a little section heading here, call it "MSE & Gradient Descent", and then add that Python function below it. What I'm going to do is create a function called grad: I'm going to write "def grad()" and give it three inputs, the x values, the y values and an array of thetas, then add a colon, and inside the body of this function I want to work out these two partial derivatives. So what are my inputs here? My inputs are the x values, so the data; the y values, which are also data; and then an array of theta parameters. These are the bits that we're actually optimizing in our gradient descent algorithm. This array is going to have theta 0 at index 0 and theta 1 at index 1. So this is going to be my function. The number of samples, n, I can work out by saying "n = y.size". This function is going to receive a whole list of y values, and by calling y.size I can work out how many samples were given to it.

Now, as a challenge, can you create two variables, theta0_slope and theta1_slope? What I want you to do is translate these LaTeX equations into Python code. I'll give you a few seconds to pause the video and work this out.

Ready? Here's the solution. "theta0_slope" is going to be equal to "(-2/n)*sum(y - thetas[0] - thetas[1]*x)".
We're expecting that this function will receive an array of theta parameters, with theta 0 at index 0, which is what we're using in the first term, and theta 1 at index 1, which is what we're using in the second. Working out theta1_slope is going to be trivial, because I can just copy this line, change the name, add another set of parentheses around the contents of the sum and multiply the whole thing by x again. That way I capture the extra x term in the equation. So that's really it.

The only thing left to do is output these values, and we're going to output them as an array. I'll show you three ways we can do this; it's a little bit of a review. First, we can write "return np.array([theta0_slope])", and because theta0_slope is an array as well, we just have to pull out its first value, then write a comma, then theta1_slope, and grab the first value of that too. That's one way to do it: we've calculated these two things separately, so we combine them into an array like so. But I'm going to comment this out and show you a second way. We can also return "np.append()", where the first argument is our theta0_slope and the second argument is our theta1_slope. That's the second way: we append one array to the other and return that, combining the two pieces of data that were calculated separately. The last way I want to show you is with the concatenate function, which also comes from numpy. It'll be "np.concatenate()" with another set of parentheses inside, where we supply theta0_slope, comma, theta1_slope, and then how we're going to concatenate them, namely along the rows, so "axis=0". So these are three ways you can write the Python code to achieve the very same output.

Now it's time to run our gradient descent and actually call this function. I hope I didn't make any typos, so let's do that now. This is where the rubber meets the road, as they say.
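Before we run it, here's a sketch of what the finished grad() function might look like, pulled together from the steps above. It assumes, as in the earlier lessons, that x and y arrive as NumPy column vectors of shape (n, 1), which is why each slope comes out as a one-element array that we index into; the commented-out lines show the np.append and np.concatenate alternatives.

```python
import numpy as np

def grad(x, y, thetas):
    """Slopes of the MSE with respect to theta_0 and theta_1.

    x, y   -- the data (assumed here to be (n, 1) NumPy arrays)
    thetas -- array holding [theta_0, theta_1]
    """
    n = y.size  # number of samples

    # Partial derivative of the MSE with respect to theta_0 (the intercept)
    theta0_slope = (-2 / n) * sum(y - thetas[0] - thetas[1] * x)

    # Partial derivative with respect to theta_1: same sum, multiplied by x
    theta1_slope = (-2 / n) * sum((y - thetas[0] - thetas[1] * x) * x)

    # Three equivalent ways to return both slopes as one array:
    # return np.append(arr=theta0_slope, values=theta1_slope)
    # return np.concatenate((theta0_slope, theta1_slope), axis=0)
    return np.array([theta0_slope[0], theta1_slope[0]])
```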
I'm going to set my multiplier to 0.01 and set my initial guesses, so my thetas, equal to an np.array where the initial guesses are 2.9, comma, 2.9, all in square brackets. Then our gradient descent is going to look like this: it's going to be "for i in range", we're going to run this a thousand times, then a colon, and in the body of our for loop we're going to have some very terse Python code that calculates our gradient and updates our thetas array all in one go. It's going to be "thetas = thetas - multiplier*grad()". This is where we're calling our function, so now we have to supply our data, right? "x_5, y_5", the data that we generated earlier, and then the last input is going to be our thetas array, just like that.

After our loop has run, we're going to print out the results. This is where we can check whether the whole thing actually works, and where I'll find out if I made any typos along the way. So I'm going to print "Min occurs at Theta 0: ", comma, and then "thetas[0]", because that's where our intercept is going to live: at index 0, the first value in our thetas array. Let's print out the minimum at theta 1 as well; that one is going to be a print statement with "Min occurs at Theta 1:", comma, "thetas[1]". And finally, we're going to print out our mean squared error, using that mse function we created earlier. So I'm going to say "print('MSE is:', mse())", and inside the function call we have to supply two things, remember? y_5, so the actual y values, and then y_hat. What's y_hat? Well, after our loop runs, it's going to be "thetas[0] + thetas[1]*x_5", so our x data.

All right. We've just written a whole bunch of code without testing it for a little while, so let's see if it works. I'm going to hit Shift+Enter now, and I'm pleasantly surprised. After a thousand iterations, we get a theta 0 value of about 0.85, a theta 1 value of about 0.122 and a mean squared error of approximately 0.95. This very much ties in with all the calculations we've done previously, so we've definitely done this correctly.
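For reference, here's roughly what that cell looks like as code. It leans on the grad() sketch above and on x_5, y_5 and the mse() function from the earlier lessons, so treat those names as assumptions carried over rather than something defined here.

```python
import numpy as np

multiplier = 0.01                # the learning rate
thetas = np.array([2.9, 2.9])    # initial guesses for theta_0 and theta_1

# Run gradient descent: calculate the gradient and update both thetas
# in one go, a thousand times over.
for i in range(1000):
    thetas = thetas - multiplier * grad(x_5, y_5, thetas)

print('Min occurs at Theta 0:', thetas[0])
print('Min occurs at Theta 1:', thetas[1])

# y_hat for the final thetas, fed into the mse() function from earlier
print('MSE is:', mse(y_5, thetas[0] + thetas[1] * x_5))
```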
We've worked out the partial derivatives of our cost function and then run our gradient descent algorithm. That algorithm started pretty far off, at 2.9 for both our theta 0 and theta 1 values, and then, having run a thousand times in that for loop, it converged on the values that minimize the mean squared error, that minimize our cost function. So this is brilliant. Now all that's left to do is to plot it.