In the coming lessons we're going to be introducing some more realism and building out our intuition for the machine learning problems that are to come. We're now going to have a look at how you'd run a gradient descent algorithm given some data points and using a real cost function. This cost function is called the Mean Squared Error, or MSE.

Previously we've been using example cost functions to learn more about how our gradient descent algorithm behaves. That was fine at the time, but now we turn our attention to a cost function that is actually used in practice for many machine learning problems.

Now, to recap a little bit about linear regression: we said that the goal of linear regression was to minimize the distances between the points and the line, and that a good line was one that minimizes the residual sum of squares. When we were talking about the distances, we meant the distances between the data points, so the actual values, the ys, and our fitted values, or our hypothesis. To calculate a residual sum of squares, we took the difference between each actual value and its fitted value and then added them all together. But because some points would have been below the line, some of those differences would be negative, and adding them up just like this would create a problem. So what we have to do instead is square all the differences and then add them up. Thus our goal was to choose a line, that is, choose the parameters theta zero and theta one, that would minimize the sum of the squared differences, and the name for this equation was the Residual Sum of Squares, or RSS.

The way to interpret this number, this residual sum of squares, is as follows: it's how much of the dependent variable's variation our model did not explain. It's the sum of the squared differences between the actual y values and the predicted y values; the smaller the residual sum of squares, the better our model fits our data, and the greater the residual sum of squares, the poorer our model fits our data. Thinking about it this way, what would be a perfect fit? What kind of model would have a perfect prediction? Well, it would be one where the residual sum of squares is zero. Right? A value of zero for the residual sum of squares would imply a perfect fit.
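To make that recap concrete, here's a tiny numeric sketch in Python; the numbers are made-up illustration values, and the point is that the raw differences can cancel each other out while the squared differences cannot:

import numpy as np

y = np.array([3.0, 5.0, 7.0])       # actual values
fitted = np.array([4.0, 5.0, 6.0])  # fitted values from some candidate line

residuals = y - fitted              # [-1.0, 0.0, 1.0]
print(residuals.sum())              # 0.0 -> the negatives cancel the positives
print((residuals ** 2).sum())       # 2.0 -> the RSS still reveals the misfit
print(((y - y) ** 2).sum())         # 0.0 -> a perfect fit has an RSS of zero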
So I know that at this point you're probably thinking: would the residual sum of squares make a good cost function? And the answer is yes, it would, but with one small modification, and that modification is dividing the whole thing by the number of data points. Then the residual sum of squares gets a new name: this equation is called the Mean Squared Error, or MSE, and that's the cost function we're going to be using in this last example.

Back in our Jupyter notebook, let's add some section headings. I'm going to change my cell to Markdown here, put in one pound symbol, and then write "Example 5 - Working with Data & a Real Cost Function". Below that I'm going to add a subheading that reads "Mean Squared Error: a cost function for regression problems".

This is also a really good opportunity to try your hand at writing some more LaTeX notation in Jupyter. Let's add the formula for the residual sum of squares first. I'm going to make this fairly large, so I'm going to give it three pound symbols, then two dollar signs, "RSS = ", and now we need that summation symbol, which we get with the \sum tag. Let me add another two dollar signs as the closing tag, hit Shift+Enter, and show you what I mean. So "\sum" gives us the summation symbol.

Now we need to add some things to the top and the bottom of the summation symbol. I can add things to the bottom with the underscore and a pair of curly braces; inside them I'm going to write "i=1". Then I'm going to use the caret symbol and another pair of curly braces with an "n" inside to put something on top. Let me hit Shift+Enter and show you what this looks like.

Now I'm going to add the rest of the equation in some big parentheses, so I'm going to write a backslash and then "big" for the big tag, an open parenthesis, then a space, another backslash, "big", and a closing parenthesis. Everything inside the parentheses will be raised to the power of two, so after the closing parenthesis I'll add a caret and then a 2. Let's take a look at what this looks like now.

All right. Now I'm just going to add the final bit, and that's going to be "y^{(i)}", the i-th value of y, minus "h_\theta x^{(i)}", our hypothesis at the i-th value of x. I'm going to put another space here at the end to make it a bit more readable, hit Shift+Enter, and take a look at this equation.
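Putting all of those pieces together, the Markdown cell for the residual sum of squares ends up containing something like this (the exact spacing is just how I happen to type it):

### $$RSS = \sum_{i=1}^{n} \big( y^{(i)} - h_\theta x^{(i)} \big)^2 $$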
Okay. So I think that looks pretty solid, and in the process we've discovered a few more LaTeX tricks. One is that the underscore adds things to the bottom, like the "i = 1" below the summation symbol, or the Greek letter theta as a subscript on the h for our hypothesis. We've also seen the big parentheses in action with the big tag.

Now, getting the LaTeX equation for the mean squared error is actually going to be very easy. I'm just going to copy this, paste it, and modify it. Instead of RSS it's going to read MSE, and then just in front I'm going to add a fraction, "\frac{1}{n}". When I hit Shift+Enter I can now see my mean squared error equation, like so.

One thing I'll point out is that you might see an alternative way of writing this equation. Sometimes it's written with a y hat, a little caret symbol above the y, instead of this notation with the hypothesis function, this h theta x i. So let's add the alternative notation as well, so that we've got it on here. I'm going to select the part that reads "h_\theta x^{(i)}", delete it, and instead show you another LaTeX tag, which is "\hat" followed by a pair of curly braces with a y inside. When you hit Shift+Enter now, you should see this. The other thing I'll do is remove the caret, the curly braces, and the i from the preceding y value. So I'm going to delete that, hit Shift+Enter, and let's see what this looks like.

So these are the two notations that you're going to come across most often: the predicted values are either represented with the hypothesis function, using this h notation, or with this y hat notation, which is also very, very common. And y hat is also going to be the way I'll be referring to our predicted values in the Python code that we're going to write.
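Written out in full, the two forms of the mean squared error equation we've just discussed look like this:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \big( y^{(i)} - h_\theta x^{(i)} \big)^2$$

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \big( y - \hat{y} \big)^2$$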
Now, with the notation out of the way and just looking at these equations, you might be wondering: why is the mean squared error more useful as a cost function than the residual sum of squares? I mean, they look pretty similar, right? But let's think about it. What would happen when our dataset gets large? We're going to have a y and a y hat value for each and every data point, right? So as we add more and more data points to our dataset, that sum starts to get pretty big, because we're adding up all the squared differences. And we've already seen what happens when computers are confronted with very, very large numbers. That's right: when working with very, very large numbers we might encounter our old friend, the overflow error, and this is where the mean squared error comes to the rescue.

Because what happens when we divide by the number of samples? By dividing that sum by the number of data points we're taking an average, a mean, hence the name mean squared error. And by dividing by the number of samples you can handle very large datasets, datasets with tens of thousands of data points, and the cost function isn't going to run into an overflow error, because you're keeping that mean squared error manageably small.

I hope that explains why the mean squared error is a very, very popular cost function: it allows people to handle datasets of all sorts of sizes, and your gradient descent algorithm won't cause an overflow as you're scaling up.
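As a preview of the Python we'll be writing, here's a minimal sketch of the mean squared error as a function; the names y and y_hat are just my own placeholders for the actual and predicted values:

import numpy as np

def mse(y, y_hat):
    # Mean squared error: the average of the squared differences
    # between the actual values y and the predicted values y_hat.
    return np.mean((y - y_hat) ** 2)

# Toy usage example with made-up numbers.
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([3.5, 4.8, 7.1])
print(mse(y, y_hat))  # a small number, since the predictions are close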