1 00:00:00,400 --> 00:00:01,433 Hello and welcome back. 2 00:00:01,433 --> 00:00:04,766 Today we're continuing our saga of model evaluation. 3 00:00:04,766 --> 00:00:07,233 And we're talking about adjusted R squared. 4 00:00:07,233 --> 00:00:09,533 Previously we discussed R squared, which is defined 5 00:00:09,533 --> 00:00:12,733 as one minus the residual sum of squares divided by the total sum of squares. 6 00:00:12,733 --> 00:00:17,500 And R squared is a goodness of fit measure for our model: the greater, the better. 7 00:00:18,066 --> 00:00:19,300 Just a quick reminder. 8 00:00:19,300 --> 00:00:23,866 We mentioned that R squared is between 0 and 1 and that 9 00:00:23,866 --> 00:00:26,866 there is no overarching rule of thumb. 10 00:00:27,000 --> 00:00:30,433 The values really depend on the industry and use case. 11 00:00:30,433 --> 00:00:36,133 So for some industries and use cases, 0.9 might be a great R squared and 0.4 12 00:00:36,133 --> 00:00:39,266 might be a terrible R squared. For other industries 13 00:00:39,266 --> 00:00:41,833 and use cases, 0.4 might be a great R squared. 14 00:00:41,833 --> 00:00:43,266 So it really depends. 15 00:00:43,266 --> 00:00:46,700 However, there is an overarching problem across the board, 16 00:00:46,966 --> 00:00:50,500 which has to do with adding new independent variables. 17 00:00:50,500 --> 00:00:56,533 So let's say we have a linear regression with two independent variables. 18 00:00:56,533 --> 00:00:57,866 And we decide to add a third one. 19 00:00:57,866 --> 00:01:00,866 For example, we got new data, 20 00:01:00,933 --> 00:01:05,366 a new column of data, or we're just trying to explore 21 00:01:05,366 --> 00:01:11,100 and see what other variables might be helpful in the explanation in our model.
22 00:01:12,300 --> 00:01:15,400 So what happens when we add another variable is that the total 23 00:01:15,400 --> 00:01:20,300 sum of squares doesn't change, because it only depends on the average 24 00:01:20,300 --> 00:01:24,233 of the actual y values and doesn't depend on the y hat values. 25 00:01:25,200 --> 00:01:27,800 But the residual sum of squares will change, 26 00:01:27,800 --> 00:01:31,833 and in fact it will only either decrease or stay the same. 27 00:01:32,266 --> 00:01:34,566 The problem we're facing is that the residual 28 00:01:34,566 --> 00:01:38,366 sum of squares will never increase when we add 29 00:01:38,366 --> 00:01:41,633 another variable, and this might not be intuitive at first. 30 00:01:41,633 --> 00:01:44,600 So let's talk about it a little bit. 31 00:01:44,600 --> 00:01:48,333 The main reason for this is that we are using the ordinary 32 00:01:48,333 --> 00:01:52,666 least squares method to build our models, and what the ordinary least squares 33 00:01:52,666 --> 00:01:53,533 method does is 34 00:01:53,533 --> 00:01:57,200 it aims to minimize the residual sum of squares. 35 00:01:57,300 --> 00:02:02,166 So let's try to see this in action. When we add this new variable x3, 36 00:02:02,766 --> 00:02:05,533 the ordinary least squares method is going to look for a coefficient 37 00:02:05,533 --> 00:02:10,533 b3 that improves the y hat predicted values. 38 00:02:11,333 --> 00:02:15,266 As long as it finds a coefficient b3 where 39 00:02:15,266 --> 00:02:19,533 the y hat values are better than they were before, closer to the actual values, 40 00:02:19,533 --> 00:02:23,033 then the residual sum of squares will improve. 41 00:02:23,433 --> 00:02:27,833 It could improve by a lot if the prediction is much better, 42 00:02:27,833 --> 00:02:30,833 or it can improve by just a tiny little bit 43 00:02:31,266 --> 00:02:33,833 if the prediction is a little bit better.
44 00:02:33,833 --> 00:02:37,733 Now, in the situation where the ordinary least squares 45 00:02:37,733 --> 00:02:43,000 method cannot find a coefficient b3 that improves the predictions, 46 00:02:43,000 --> 00:02:46,533 where all possible nonzero coefficients b3 make the predictions worse, 47 00:02:46,800 --> 00:02:49,666 then the ordinary least squares method is just going to be 48 00:02:49,666 --> 00:02:53,400 very smart, or sneaky, you can call it. 49 00:02:53,500 --> 00:02:56,733 And it is going to turn b3 into zero. 50 00:02:56,733 --> 00:03:00,700 It's just going to say, okay, we're going to set b3 to zero. 51 00:03:01,000 --> 00:03:04,800 And that means even though we technically added an extra variable, 52 00:03:04,800 --> 00:03:06,866 it's not participating at all in the predictions 53 00:03:06,866 --> 00:03:08,666 because its coefficient is zero. 54 00:03:08,666 --> 00:03:12,233 So in that case the residual sum of squares won't 55 00:03:12,266 --> 00:03:14,533 change; it will be exactly as it was before. 56 00:03:14,533 --> 00:03:16,200 So we end up with a situation where 57 00:03:17,300 --> 00:03:19,600 we can just keep adding more and more variables 58 00:03:19,600 --> 00:03:22,933 that maybe have nothing to do with our problem at hand. 59 00:03:22,933 --> 00:03:26,800 But by virtue of some random correlations, our 60 00:03:26,800 --> 00:03:30,733 R squared in some cases will keep 61 00:03:30,733 --> 00:03:32,966 improving. It never gets worse. 62 00:03:32,966 --> 00:03:34,966 So the residual sum of squares will decrease. 63 00:03:34,966 --> 00:03:36,900 That means R squared will increase. 64 00:03:36,900 --> 00:03:39,166 And that's a problem because we don't want to end up 65 00:03:39,166 --> 00:03:42,300 with models that have lots of variables 66 00:03:43,166 --> 00:03:46,733 that have nothing to do with the model, that are not really adding a lot of value, 67 00:03:47,200 --> 00:03:50,200 but are just increasing R squared.
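To make the point above concrete, here is a minimal sketch in plain Python. It is not from the video: the helper names `solve` and `r_squared`, the synthetic data, and the random seed are all assumptions made for illustration. It fits ordinary least squares with two variables, then again with a third column of pure noise, and compares the two R squared values.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """Fit OLS with an intercept via the normal equations and return R squared."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    y_hat = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    y_mean = sum(y) / n
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)              # total sum of squares
    return 1 - ss_res / ss_tot

# Synthetic data (an assumption for this sketch): y depends on x1 and x2 only.
random.seed(0)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]  # unrelated to y
y = [3 + 2 * u - v + random.gauss(0, 1) for u, v in zip(x1, x2)]

r2_two = r_squared([x1, x2], y)
r2_three = r_squared([x1, x2, noise], y)
print("R squared with x1, x2:       ", r2_two)
print("R squared with x1, x2, noise:", r2_three)
```

Under these assumptions the second value should come out no smaller than the first, even though the extra column carries no information about y.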
68 00:03:50,300 --> 00:03:51,900 So what is the solution? 69 00:03:51,900 --> 00:03:57,266 The solution is a new version of R squared, an adjusted R squared. 70 00:03:57,266 --> 00:03:58,900 And that's exactly what it's called. 71 00:03:58,900 --> 00:04:02,800 It is calculated with this scary looking formula. 72 00:04:02,800 --> 00:04:03,500 But as you'll see 73 00:04:03,500 --> 00:04:07,066 from the practical tutorials, it's not scary at all, 74 00:04:07,800 --> 00:04:12,166 and you'll be able to recreate this manually. 75 00:04:12,166 --> 00:04:15,400 So here there are a couple of new parameters. 76 00:04:15,700 --> 00:04:19,400 k is the number of independent variables that are in our model, 77 00:04:19,633 --> 00:04:22,566 and n is the sample size. 78 00:04:22,566 --> 00:04:25,200 And the important thing here is to look at k. 79 00:04:25,200 --> 00:04:26,333 So 80 00:04:27,400 --> 00:04:29,233 if k increases over 81 00:04:29,233 --> 00:04:32,333 here, then the denominator decreases. 82 00:04:32,333 --> 00:04:35,366 That means the whole ratio increases. 83 00:04:35,700 --> 00:04:40,600 And because it's being subtracted, that means adjusted R squared decreases. 84 00:04:40,900 --> 00:04:44,966 So that's the important point here, that this new formula 85 00:04:44,966 --> 00:04:49,800 penalizes us for adding additional variables. 86 00:04:49,966 --> 00:04:55,633 So basically it's only worth adding an extra variable if R squared, 87 00:04:55,766 --> 00:04:59,800 this original R squared, increases substantially 88 00:04:59,800 --> 00:05:03,333 enough to compensate for this penalty. 89 00:05:03,333 --> 00:05:06,333 So, you know, adding new variables 90 00:05:06,533 --> 00:05:09,600 becomes something that has to be justified. 91 00:05:09,900 --> 00:05:12,900 If it's not justified, then the new variable is not worth adding. 92 00:05:13,300 --> 00:05:15,466 And that's what adjusted R squared is about.
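The formula shown on screen is not reproduced in this transcript. A small sketch, assuming it is the standard form, adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - k - 1), which matches the description of k sitting in the denominator of a ratio that is subtracted. The function name `adjusted_r_squared` and the example numbers are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R squared, assuming the standard formula.

    r2: ordinary R squared of the fitted model
    n:  sample size
    k:  number of independent variables in the model
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for adjusted R squared to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers: adding a third variable nudges R squared from 0.800 to 0.801.
before = adjusted_r_squared(0.800, n=50, k=2)
after = adjusted_r_squared(0.801, n=50, k=3)
print("adjusted R squared with 2 variables:", before)
print("adjusted R squared with 3 variables:", after)
print("worth adding the third variable?", after > before)
```

With these made-up numbers, the tiny gain in R squared does not compensate for the penalty from the larger k, so the adjusted value goes down.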
93 00:05:15,466 --> 00:05:19,166 It's about making sure that we only add variables 94 00:05:19,166 --> 00:05:22,166 when they bring substantial improvement to our model. 95 00:05:22,866 --> 00:05:24,666 So there we go, that's adjusted R squared. 96 00:05:24,666 --> 00:05:27,066 Hope you enjoyed this tutorial and I'll see you next time. 97 00:05:27,066 --> 00:05:29,033 Until then, enjoy machine learning.