0 1 00:00:00,390 --> 00:00:07,430 Now I hope that you can't wait to calculate the mean squared error. From the previous cell up here, 1 2 00:00:07,530 --> 00:00:15,220 we already know what the lowest cost theta zero and theta one parameters are for our regression. 2 3 00:00:15,420 --> 00:00:18,760 And this is thanks to the scikit learn package. 3 4 00:00:18,930 --> 00:00:26,400 But let's work out what the mean squared error actually is for these two values. 4 5 00:00:26,640 --> 00:00:30,960 And for that we're going to need our estimated y values. 5 6 00:00:31,290 --> 00:00:37,340 Looking at our equation up here we need to calculate our y hat. 6 7 00:00:37,950 --> 00:00:51,070 What's this y hat equal to? Well, our y hat is gonna be equal to theta_0 + theta_1*x. 7 8 00:00:51,160 --> 00:00:58,780 This is the equation that underlies our linear regression in this case - our y hat is based on the intercept 8 9 00:00:59,110 --> 00:01:00,970 plus the slope, 9 10 00:01:01,000 --> 00:01:03,050 times our X values. 10 11 00:01:03,190 --> 00:01:08,420 Let's translate that to Python code. So I'm going to create a variable called y_hat 11 12 00:01:08,750 --> 00:01:15,320 and I'm gonna set it equal to our theta_0 value which was this one here. 12 13 00:01:15,370 --> 00:01:16,920 I'm going to copy this. 13 14 00:01:17,220 --> 00:01:17,820 Come down here. 14 15 00:01:17,840 --> 00:01:20,970 Paste it in, add a plus sign. 15 16 00:01:21,300 --> 00:01:23,970 Take the theta_1 value, 16 17 00:01:23,970 --> 00:01:28,690 copy and paste it in and then write 17 18 00:01:28,740 --> 00:01:32,270 "* x_5". 18 19 00:01:32,460 --> 00:01:36,800 These, after all, are all the X's in our dataset. 19 20 00:01:37,230 --> 00:01:43,830 And that way we can calculate all the predicted values in our dataset. 20 21 00:01:43,920 --> 00:01:45,590 Let's print these out for good measure.
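The y hat calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the transcript does not list the data or the fitted parameters, so the seven x values and the rounded `theta_0` and `theta_1` below are illustrative stand-ins for what the earlier cells would show.

```python
import numpy as np

# Assumed stand-in data: the transcript does not list the actual x values.
x_5 = np.array([[0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5]]).transpose()  # shape (7, 1)

# Assumed, rounded stand-ins for the intercept and slope that would be
# copied from the output of the earlier regression cell.
theta_0 = 0.8475
theta_1 = 1.2227

# NumPy broadcasting applies intercept + slope * x to every x at once,
# so y_hat holds one prediction per data point.
y_hat = theta_0 + theta_1 * x_5
print(y_hat)
```

Because `x_5` is a NumPy array, no loop is needed: the scalar intercept and slope are applied to each element in one expression.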
21 22 00:01:45,600 --> 00:01:53,980 So I'm going to say "print" and then the string "'Est values y_hat are: ', 22 23 00:01:54,090 --> 00:01:57,610 y_hat". 23 24 00:01:57,840 --> 00:02:00,670 Let me hit Shift+Enter and let's see what these look like. 24 25 00:02:01,440 --> 00:02:02,120 So here they are. 25 26 00:02:02,130 --> 00:02:05,790 These are all the predicted values. Now, 26 27 00:02:06,330 --> 00:02:15,030 one thing that you can do is if you don't want this first value here on the same line as your string 27 28 00:02:15,810 --> 00:02:21,700 what you can do is you can come in here where the string is and insert a special character - 28 29 00:02:21,840 --> 00:02:23,430 the new line character. 29 30 00:02:23,430 --> 00:02:25,730 So that's backslash and then 30 31 00:02:25,990 --> 00:02:36,180 "n". When I hit Shift+Enter now that 0.969 will move to a new line. "\n" is 31 32 00:02:36,180 --> 00:02:40,020 like a special character for hitting return on your keyboard. 32 33 00:02:40,050 --> 00:02:44,170 So how do these predicted values compare to our actual values? 33 34 00:02:44,280 --> 00:02:45,410 Let's print those out as well. 34 35 00:02:45,430 --> 00:02:59,590 So I'm going to say "print" and then write "'In comparison, the actual y values are \n', y_ 35 36 00:02:59,750 --> 00:03:00,720 5". 36 37 00:03:00,750 --> 00:03:03,690 These are the actual values. 37 38 00:03:03,750 --> 00:03:05,040 There we go. 38 39 00:03:05,040 --> 00:03:12,750 Ideally we want these estimated values to be as close to these actual values as possible. 39 40 00:03:12,900 --> 00:03:14,750 And I'm looking at this. 40 41 00:03:14,850 --> 00:03:16,470 They're actually not too far off. 41 42 00:03:16,470 --> 00:03:18,270 This is not too bad. 42 43 00:03:18,420 --> 00:03:22,700 Given this theta_0 and this theta_1. 43 44 00:03:22,710 --> 00:03:30,420 So now that we know how to calculate our y hat values, let's work out the mean squared error of the regression.
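The effect of the newline character on the printed output can be sketched like this. The data, intercept and slope are assumed stand-ins, since the transcript does not reproduce them; only the two label strings come from the lesson.

```python
import numpy as np

# Assumed stand-in data and parameters (not listed in the transcript).
x_5 = np.array([[0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5]]).transpose()
y_5 = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2]).reshape(7, 1)
y_hat = 0.8475 + 1.2227 * x_5  # assumed intercept and slope, rounded

# Without '\n', the first row of the array shares a line with the label.
print('Est values y_hat are: ', y_hat)

# With '\n' at the end of the label, the array starts on its own line.
label = 'Est values y_hat are: \n'
print(label, y_hat)
print('In comparison, the actual y values are \n', y_5)
```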
44 45 00:03:30,780 --> 00:03:33,480 And I think this actually makes a really good challenge. 45 46 00:03:33,480 --> 00:03:45,780 So can you write a Python function that takes two inputs y and y hat and returns the mean squared error? 46 47 00:03:46,680 --> 00:03:54,380 And after you've done that, after you've written this MSE function can you call this function and print 47 48 00:03:54,400 --> 00:04:04,710 out the mean squared error with the y hat calculated above, so feed in this y hat value into your MSE 48 49 00:04:04,710 --> 00:04:08,790 function and print out what the mean squared error is. 49 50 00:04:08,790 --> 00:04:15,290 Now remember the mean squared error equation looks like this. 50 51 00:04:15,540 --> 00:04:22,050 We've got that formula in LaTeX notation, so all you need to do is translate it into Python code and 51 52 00:04:22,170 --> 00:04:27,400 look up what code to use for that pesky summation symbol right here. 52 53 00:04:28,290 --> 00:04:33,410 I'll give you a few seconds to pause the video and try this on your own. 53 54 00:04:36,280 --> 00:04:36,610 All right. 54 55 00:04:36,610 --> 00:04:37,210 Ready? 55 56 00:04:37,250 --> 00:04:39,060 So I hope you figured this out. 56 57 00:04:39,160 --> 00:04:40,950 Now in terms of the solution. 57 58 00:04:41,020 --> 00:04:42,980 I'm not just going to show you one solution. 58 59 00:04:43,060 --> 00:04:49,810 I'm going to show you three different ways that you can implement this, because there are actually many 59 60 00:04:49,810 --> 00:04:53,520 ways to skin a cat as they say. 60 61 00:04:53,530 --> 00:04:58,060 So while you had the video paused and were solving the challenge, 61 62 00:04:58,060 --> 00:05:01,570 I was busy looking up possible uses for all that cat skin. 62 63 00:05:02,200 --> 00:05:08,060 And the best one I came across was a Japanese instrument called the shamisen. 63 64 00:05:08,080 --> 00:05:13,120 Apparently the drum on this banjo was actually traditionally made from cat skin.
64 65 00:05:13,120 --> 00:05:19,510 So uh there's your random fact of the day. If you've come across any other uses do let me know in the 65 66 00:05:19,570 --> 00:05:21,610 comments section. 66 67 00:05:21,610 --> 00:05:27,910 The Python code that you will have written will probably look something like this. 67 68 00:05:27,910 --> 00:05:36,760 You're going to have "def mse():" and then for the parameters you'll have y and say y_hat, colon. And the 68 69 00:05:36,820 --> 00:05:39,490 first possible approach would look something like this. 69 70 00:05:39,500 --> 00:05:48,490 You might have a variable, say call it "mse_calc", and that would have been equal to 1 divided by 7 because 70 71 00:05:48,490 --> 00:05:54,290 we've got 7 data points and you multiply that by the sum, 71 72 00:05:54,340 --> 00:05:56,680 this is an inbuilt Python function; 72 73 00:05:56,680 --> 00:06:04,420 "sum()" and then inside the sum you'll have another set of parentheses and there you'll have 73 74 00:06:04,630 --> 00:06:10,550 "y-y_hat" to the power of 2. 74 75 00:06:10,600 --> 00:06:13,030 So **2. 75 76 00:06:13,250 --> 00:06:20,090 And because this function needs an output you would return the result of your calculation. 76 77 00:06:20,110 --> 00:06:25,650 So mse_calc and this is one possible way to do it. 77 78 00:06:25,660 --> 00:06:31,050 Let me call this function and print out the mean squared error of y hat that we calculated. 78 79 00:06:31,060 --> 00:06:39,070 So it'll be "mse(y_5,", because these are the actual y values and then I'll supply y_hat that we 79 80 00:06:39,070 --> 00:06:49,160 calculated a few cells above. I'm going to hit Shift+Enter and see what we get. We get 0.95 approximately. 80 81 00:06:49,180 --> 00:06:53,830 Now there is one improvement that we can make to this function.
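The first version of the function, as dictated above, might look like the sketch below. The data and the rounded intercept and slope are assumed stand-ins (the transcript does not list them); with these values the result comes out at roughly 0.95, in line with the figure quoted in the lesson.

```python
import numpy as np

# Assumed stand-in data and parameters (not listed in the transcript).
x_5 = np.array([[0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5]]).transpose()
y_5 = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2]).reshape(7, 1)
y_hat = 0.8475 + 1.2227 * x_5  # assumed intercept and slope, rounded


def mse(y, y_hat):
    # 1/7 is hardcoded, so this is only correct for exactly 7 data points.
    # Python's built-in sum() adds up the rows of the squared differences.
    mse_calc = 1/7 * sum((y - y_hat)**2)
    return mse_calc


print(mse(y_5, y_hat))
```

Since `y` and `y_hat` are column arrays, the built-in `sum()` returns a one-element array rather than a plain float.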
81 82 00:06:53,830 --> 00:07:02,050 So this 1/7 will probably be bothering you a little bit because it's hardcoded and 82 83 00:07:02,110 --> 00:07:05,450 it would only work or calculate the mean squared error correctly 83 84 00:07:05,620 --> 00:07:12,520 if we had 7 data points - if we had 8 or 9 then the mean squared error formula wouldn't be apt 84 85 00:07:12,520 --> 00:07:15,250 anymore, wouldn't be correct anymore with this Python code. 85 86 00:07:15,670 --> 00:07:18,450 So I'm going to comment this out. 86 87 00:07:18,520 --> 00:07:21,160 But I'm going to copy, paste it below, 87 88 00:07:21,520 --> 00:07:28,580 and what I want to do is I'm going to replace this 1/7 with some different code. 88 89 00:07:28,690 --> 00:07:38,230 I'm going to say "1/y.size" - y.size will return the length or the number of 89 90 00:07:38,230 --> 00:07:43,380 samples that are fed into the placeholder for our y values. 90 91 00:07:43,420 --> 00:07:46,860 So this will also work. 91 92 00:07:46,900 --> 00:07:56,500 Let me hit Shift+Enter on this cell and the one below to prove just that. Writing the code this way makes 92 93 00:07:56,680 --> 00:08:03,590 the MSE function much more general because it figures out the number of samples inside of the function. 93 94 00:08:03,670 --> 00:08:11,170 It doesn't have to be supplied as an extra parameter and it's not hardcoded in the instructions themselves. 94 95 00:08:11,170 --> 00:08:17,710 So this is already quite nice but there is another way you could have done this as well, maybe you did 95 96 00:08:17,710 --> 00:08:26,410 a bit of googling and you figured out that numpy actually has a function called "average" and this does 96 97 00:08:26,410 --> 00:08:31,950 both the summation and the dividing by the number of samples for us. 97 98 00:08:31,990 --> 00:08:33,460 So let's see what this would look like.
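Before moving on to the numpy "average" variant, the y.size version described above can be sketched as follows. The eight-point arrays are hypothetical and chosen only to show that the function no longer depends on having exactly 7 samples.

```python
import numpy as np


def mse(y, y_hat):
    # y.size counts the samples inside the function, so nothing is hardcoded.
    mse_calc = 1/y.size * sum((y - y_hat)**2)
    return mse_calc


# Hypothetical eight-point example: every prediction is off by 0.5,
# so every squared error is 0.25 and so is their mean.
y = np.arange(8.0).reshape(8, 1)
y_hat = y + 0.5

print(mse(y, y_hat))  # prints [0.25]
```

Note that `y.size` is the total number of elements in the array, which equals the number of samples here because `y` has a single column.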
98 99 00:08:33,550 --> 00:08:43,180 This would be "mse_calc = np.average()" and then it needs two things - 99 100 00:08:43,210 --> 00:08:50,560 it needs the function itself that it's averaging and if it should be averaging the rows, the columns 100 101 00:08:50,770 --> 00:08:52,130 or the whole thing. 101 102 00:08:52,210 --> 00:09:00,960 So first things first, the function it's averaging would be (y-y_hat)**2 102 103 00:09:01,620 --> 00:09:07,390 and the way to tell it whether it should average the rows or the columns would be by supplying the "axis" 103 104 00:09:07,750 --> 00:09:15,820 argument, so "axis = 0" averages over the rows instead of both the rows and columns which is not what 104 105 00:09:15,820 --> 00:09:16,670 we want. 105 106 00:09:16,690 --> 00:09:24,900 So when I hit Shift+Enter on this and then run the cell below again, we can see that this works as well. 106 107 00:09:25,540 --> 00:09:32,580 But just because we've printed out a mean squared error in this cell here doesn't mean it's correct. 107 108 00:09:32,620 --> 00:09:37,450 So let's check it against an inbuilt function from scikit learn. 108 109 00:09:37,560 --> 00:09:40,690 So we wrap this in a print statement. Write 109 110 00:09:40,690 --> 00:09:42,280 print and then the string 110 111 00:09:44,920 --> 00:09:57,510 "Manually calculated MSE is:" and then we have our mse function in here. To use the inbuilt function, 111 112 00:09:57,750 --> 00:10:01,260 that inbuilt mean squared error function from scikit learn, 112 113 00:10:01,500 --> 00:10:03,130 we're gonna have to import it. 113 114 00:10:03,200 --> 00:10:07,980 I'll have to scroll to the very top and then here we're gonna add another import statement, we're going to say 114 115 00:10:07,980 --> 00:10:22,770 "from sklearn.metrics import mean_squared_error". 115 116 00:10:23,300 --> 00:10:24,720 Sorry sorry.
116 117 00:10:24,930 --> 00:10:31,380 I hope you're not too upset and will forgive me that I made you do all this work even though there is 117 118 00:10:31,380 --> 00:10:40,300 an inbuilt function for this already. Let's call this mean_squared_error function with both our 118 119 00:10:40,300 --> 00:10:47,110 manually calculated y_hat as well as the output of the regression directly. 119 120 00:10:47,110 --> 00:10:53,440 So where we were adding our print statement let's add another one, let's add "print('MSE 120 121 00:10:56,660 --> 00:11:08,090 regression using manual calc is:', )", and then "mean_squared_error()" and we'll 121 122 00:11:08,090 --> 00:11:16,250 supply two inputs: y_5, our actual y values and y_hat, which we've calculated 122 123 00:11:16,790 --> 00:11:18,500 above just here. 123 124 00:11:22,250 --> 00:11:27,530 And now let me copy this line and paste it below. 124 125 00:11:27,530 --> 00:11:35,230 And this one will read "'MSE regression is', mean_squared_error(y_5,)" and instead of our manually calculated 125 126 00:11:35,230 --> 00:11:45,390 one we can use "regr.predict(x_5)" and when we hit Shift+Enter, 126 127 00:11:45,880 --> 00:11:53,530 we see that indeed all the outputs agree with one another. Our manually calculated mean squared error is 127 128 00:11:53,530 --> 00:11:58,240 the same as what we get from the machine learning package. 128 129 00:11:58,240 --> 00:12:03,700 So we've dissected and implemented this cost function correctly. 129 130 00:12:03,700 --> 00:12:04,840 Brilliant. 130 131 00:12:04,840 --> 00:12:11,770 Now it's time to plot our cost function and visualize it. We'll do that in the next lesson. 131 132 00:12:11,830 --> 00:12:12,670 I'll see you there.
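To recap the last two steps, here is a minimal sketch of the np.average variant checked against scikit-learn's mean_squared_error. The data values are assumed stand-ins, and `regr` is assumed to be a `LinearRegression` fitted on them, since the transcript only shows the name `regr` and the call `regr.predict(x_5)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assumed stand-in data (not listed in the transcript).
x_5 = np.array([[0.1, 1.2, 2.4, 3.2, 4.1, 5.7, 6.5]]).transpose()
y_5 = np.array([1.7, 2.4, 3.5, 3.0, 6.1, 9.4, 8.2]).reshape(7, 1)


def mse(y, y_hat):
    # np.average does both the summation and the division by the sample count.
    # axis=0 averages down the rows, giving one value per column.
    return np.average((y - y_hat)**2, axis=0)


# Assumption: regr is a plain LinearRegression fitted on the data above.
regr = LinearRegression()
regr.fit(x_5, y_5)

# Rebuild y_hat by hand from the fitted intercept (theta_0) and slope (theta_1).
y_hat = regr.intercept_[0] + regr.coef_[0][0] * x_5

manual_mse = mse(y_5, y_hat)
sklearn_manual = mean_squared_error(y_5, y_hat)
sklearn_direct = mean_squared_error(y_5, regr.predict(x_5))

print('Manually calculated MSE is:', manual_mse)
print('MSE regression using manual calc is:', sklearn_manual)
print('MSE regression is', sklearn_direct)
```

All three printed values should agree up to floating-point rounding; the manual one is shown as a one-element array because `axis=0` keeps one entry per column.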