1 00:00:00,366 --> 00:00:01,300 Hello and welcome back. 2 00:00:01,300 --> 00:00:03,333 Today we're talking about R squared, 3 00:00:03,333 --> 00:00:07,166 a very important concept when it comes to evaluating the goodness 4 00:00:07,166 --> 00:00:08,333 of fit of our models. 5 00:00:08,333 --> 00:00:09,433 Now, in order to understand 6 00:00:09,433 --> 00:00:13,233 R squared, we're going to need to look at two versions of our chart. 7 00:00:13,366 --> 00:00:14,900 So here is our data set. 8 00:00:14,900 --> 00:00:19,433 And what we're going to do here is plot the regression as we were doing before. 9 00:00:19,900 --> 00:00:21,633 And here's our data set again. 10 00:00:21,633 --> 00:00:24,333 And this time we're going to just plot an average line. 11 00:00:24,333 --> 00:00:27,033 And we'll see that in a second. So we'll start with the regression. 12 00:00:27,033 --> 00:00:29,933 Let's draw our regression line as usual. 13 00:00:29,933 --> 00:00:33,366 Let's project our data points vertically onto it. 14 00:00:33,633 --> 00:00:37,200 And for each data point we're going to look at the difference between y, 15 00:00:37,600 --> 00:00:41,000 the actual value, and y hat, the predicted value. 16 00:00:41,700 --> 00:00:44,133 Now, as we discussed before, the way we built this line is 17 00:00:44,133 --> 00:00:47,166 we're minimizing this sum over here. 18 00:00:47,566 --> 00:00:50,866 And that is the ordinary least squares method. 19 00:00:51,266 --> 00:00:53,566 Well, actually, this sum has a name. 20 00:00:53,566 --> 00:00:55,466 It's called the residual sum of squares. 21 00:00:55,466 --> 00:00:58,800 And it's indicated by this abbreviation. 22 00:00:59,666 --> 00:01:03,366 Now on the right here we're going to draw an average line. 23 00:01:03,366 --> 00:01:09,000 And this is simply taking all of the y values, 24 00:01:09,000 --> 00:01:12,666 so the actual values of our data set, the y values, 25 00:01:12,900 --> 00:01:15,900 and taking their average.
26 00:01:15,933 --> 00:01:19,900 Again we're going to project our data points vertically onto this line. 27 00:01:20,133 --> 00:01:23,866 And for each data point we're going to look at y i. 28 00:01:24,766 --> 00:01:27,933 And here we're going to calculate another total. 29 00:01:27,933 --> 00:01:28,800 And this one is called 30 00:01:28,800 --> 00:01:32,166 the total sum of squares and is indicated by this abbreviation. 31 00:01:33,066 --> 00:01:36,000 And it's similar to the residual sum of squares. 32 00:01:36,000 --> 00:01:39,866 But instead of using y i hat we're looking at the difference 33 00:01:39,866 --> 00:01:45,000 between y i, the actual value, and y average, the average value 34 00:01:45,000 --> 00:01:48,000 of our data set. 35 00:01:48,000 --> 00:01:51,200 And now we can calculate R squared. 36 00:01:51,400 --> 00:01:55,466 R squared is defined as one minus the ratio 37 00:01:55,466 --> 00:01:58,933 between the residual sum of squares and the total sum of squares. 38 00:01:58,966 --> 00:02:01,966 Now let's pause here for a second and discuss this a bit. 39 00:02:02,133 --> 00:02:05,766 So we know that we're minimizing the residual sum of squares. 40 00:02:05,766 --> 00:02:07,600 We want to make it as small as possible. 41 00:02:07,600 --> 00:02:10,800 And from these two images you can already see, just by judging 42 00:02:11,400 --> 00:02:13,900 based on the lengths of these 43 00:02:13,900 --> 00:02:18,233 blue dashed lines, we can see that 44 00:02:18,433 --> 00:02:22,233 they're generally longer on the right with the average 45 00:02:22,466 --> 00:02:23,700 and they're shorter where the regression is. 46 00:02:23,700 --> 00:02:28,600 And that's because we've designed our line to minimize these lengths. 47 00:02:28,600 --> 00:02:30,600 So the sum of squares is smaller.
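Written out, the three quantities described so far look roughly like this. The symbols SS_res, SS_tot and n (the number of data points) are assumed names, since the transcript only points at the on-screen abbreviations without spelling them out.

```latex
\begin{align}
  SS_{res} &= \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
    && \text{residual sum of squares (distance to the regression line)} \\
  SS_{tot} &= \sum_{i=1}^{n} \left( y_i - y_{avg} \right)^2
    && \text{total sum of squares (distance to the average line)} \\
  R^2 &= 1 - \frac{SS_{res}}{SS_{tot}}
\end{align}
```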
48 00:02:30,600 --> 00:02:33,900 And so what that means is that the residual 49 00:02:33,900 --> 00:02:39,433 sum of squares is, in most cases, less than the total sum of squares. 50 00:02:39,600 --> 00:02:42,866 So the way to think about it is, in the total sum of squares on the right, 51 00:02:42,866 --> 00:02:45,433 you're just putting an average line. You're not modeling anything. 52 00:02:45,433 --> 00:02:49,633 This is the most rudimentary thing that we can do: 53 00:02:49,633 --> 00:02:53,133 just put our average line and approximate our data with that average line. 54 00:02:53,766 --> 00:02:55,400 Of course it will be, 55 00:02:55,400 --> 00:02:58,633 and it should be, worse than any 56 00:02:58,633 --> 00:03:01,866 thought-out model that we create, which is the example on the left. 57 00:03:01,866 --> 00:03:07,066 So, unless our regression model is facing absolutely 58 00:03:07,066 --> 00:03:11,000 the wrong way, for example, a downward slope on the left over here, 59 00:03:11,400 --> 00:03:16,000 if our model was sloping downwards, then the residual 60 00:03:16,000 --> 00:03:19,666 sum of squares would be huge because our model's just incorrect. 61 00:03:19,866 --> 00:03:23,400 But in all other cases, the residual sum of squares is less than the total 62 00:03:23,400 --> 00:03:23,900 sum of squares. 63 00:03:23,900 --> 00:03:27,900 And what that means is that the ratio is less than one. 64 00:03:27,900 --> 00:03:30,966 So R squared is somewhere between 0 and 1. 65 00:03:30,966 --> 00:03:33,766 And the better our model fits the data, the smaller 66 00:03:33,766 --> 00:03:37,500 the residual sum of squares will be, and that means the greater R squared will be. 67 00:03:37,666 --> 00:03:40,900 So here's a quick rule of thumb for R squared. 68 00:03:40,933 --> 00:03:44,866 Now bear in mind that it highly depends on the context.
69 00:03:44,866 --> 00:03:48,500 And this rule of thumb is just for the practical tutorials 70 00:03:48,500 --> 00:03:51,466 that we're looking at in this section of the course. 71 00:03:51,466 --> 00:03:52,800 And here we go. 72 00:03:52,800 --> 00:03:57,066 So if you have an R squared of one, that's a perfect fit. 73 00:03:57,066 --> 00:04:00,766 That means the residual sum of squares is zero. 74 00:04:00,766 --> 00:04:02,566 And basically your line is going through 75 00:04:02,566 --> 00:04:04,800 all the data points, which is virtually impossible. 76 00:04:04,800 --> 00:04:08,333 So it's very suspicious. If your R squared is about 0.9, 77 00:04:08,333 --> 00:04:10,500 that's a very good model. 78 00:04:10,500 --> 00:04:12,266 If your R squared is less than zero point. 79 00:04:12,266 --> 00:04:13,700 So it's not a great model. 80 00:04:13,700 --> 00:04:15,633 It's not the end of the world but not great. 81 00:04:15,633 --> 00:04:18,500 If it's less than 0.4, it's quite a terrible model. 82 00:04:18,500 --> 00:04:22,400 And if it's less than zero then the model makes no sense for this data, 83 00:04:22,400 --> 00:04:23,266 as we discussed. 84 00:04:24,266 --> 00:04:24,866 So there we go. 85 00:04:24,866 --> 00:04:28,700 That's how R squared works, a very important concept to understand 86 00:04:28,700 --> 00:04:31,200 because it's used quite a lot to evaluate models. 87 00:04:31,200 --> 00:04:32,366 I look forward to seeing you next time. 88 00:04:32,366 --> 00:04:34,133 And until then, enjoy machine learning.
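To make the definition concrete, here is a minimal sketch in plain Python of the calculation described in this video: fit a line with ordinary least squares, then compare its residual sum of squares with the total sum of squares around the average line. The data points and the helper names (`mean`, `fit_line`, `r_squared`) are made up for illustration and are not taken from the course materials.

```python
def mean(values):
    """Average of a list of numbers (the 'average line' for y)."""
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(y, y_hat):
    """One minus residual sum of squares over total sum of squares."""
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data with a clear upward trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)

# Regression line: small residuals, so R squared is close to 1.
r2_fit = r_squared(y, [slope * xi + intercept for xi in x])

# Average line used as the "model": the two sums are equal, so R squared is 0.
r2_mean = r_squared(y, [mean(y)] * len(y))

# Line sloping the wrong way: residuals exceed the total sum, so R squared < 0.
r2_wrong = r_squared(y, [-slope * xi + intercept for xi in x])

print(r2_fit, r2_mean, r2_wrong)
```

The three results line up with the cases discussed above: a well-fitting regression line scores close to one, the bare average line scores zero, and a line sloping the wrong way scores below zero.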