In this lesson we will look at the residuals of our regression, check for problems, and validate our regression model.

At the moment our model looks something like this: we dropped two features, so we only have 11 features instead of the original 13, and we're also using log prices, because we transformed our target data.

Let me change the notation here a little bit and write y_hat on the left-hand side of the equation, so it reads something like y_hat = θ₀ + θ₁x₁ + ... + θ₁₁x₁₁. What we're looking at here is the equation to predict a property price given some information about that property. In our notation, y_hat is that property's predicted value, and it is calculated from all the theta parameters and all the values of the individual features.

It's important to remember that in our notation y is not equal to y_hat: the predicted value is not the same as the target value. As a matter of fact, there will usually be a difference between the two, between the predicted value and the actual, true value. What we can do is subtract the predicted value from the observed target value, and what we're left with is called the residual.

So if the observed target value for a property is, say, 50 and our predicted value from our model is 48, then the residual is equal to 2: 50 minus 48 equals 2. The math here isn't going to get anybody excited, but this is what we do for all 400-odd individual data points in our training dataset: this very unexciting bit of arithmetic for all the predicted values and all the target values. Since we have 404 target values in our training dataset, we have 404 predicted, or fitted, values as well, which means we have 404 residuals.
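To make this concrete, here is a minimal sketch of that calculation in Python. The names are assumptions for illustration, not the lesson's exact code: X_train holding the 11 features and y_train holding the 404 log prices from our train-test split, with statsmodels doing the fitting.

```python
import statsmodels.api as sm

# Assumed names for illustration: X_train (the 11 features) and
# y_train (the 404 log prices) from our train-test split.
X_incl_const = sm.add_constant(X_train)        # adds the intercept column for theta_0
results = sm.OLS(y_train, X_incl_const).fit()  # fit the linear regression

predicted_values = results.fittedvalues        # y_hat: one fitted value per data point
residuals = y_train - predicted_values         # residual = observed minus predicted
# statsmodels also exposes exactly this as results.resid
```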
Now, the key is what these 404 residuals look like as a group. But why? I hear you asking across time and space: why are we looking at these residuals? Why do the residuals matter?

Here's the thing: our regression relies on certain assumptions. It's like you're back at your real estate office in Boston, looking out the window and thinking, "Man, the real world is a really complicated place. I can't model all this. I know what I'm going to do. I'm going to make my life easier by making some very, very crude assumptions. Maybe something like a simple linear model will be good enough."

If these crude assumptions that we're making more or less hold up, then our simplified model is useful. It's not 100% correct, but it will be something we can use in practice, and the results we get back, the numbers that our Python code spits out, have meaning. But if the assumptions we're making don't hold at all, then all we get back is a whole bunch of rubbish.

Now, I'm talking a lot about the simplifying assumptions we're making, but what are we talking about here, concretely? We actually covered one of the key simplifying assumptions already, in the lesson on data transformations: we are assuming linearity, for starters. We are currently fitting a linear model to our data. We are assuming that a linear model is more or less appropriate, and we even transformed our data to make it better fit our linearity assumption.

But there are some other key assumptions as well. For one, we like to think that our model can more or less explain what's actually going on in the real world. After all, we have a high r-squared: our 11 features together explain about 75 percent of the variance in property prices. This means that what's left unexplained, yes, the residuals, the difference between our predictions and the actual values, should be random. The part that our model is missing shouldn't have a clear pattern.

And where can we see what's missing from our model? Well, in the residuals, right? We can spot what's missing in the differences between the target values and our predicted values. If there's a pattern in the residuals, then there's also some predictive information in the residuals. And if there's predictive information in the residuals, then that predictive information is missing from our model. Makes sense, right?

The only thing you're asking now is: well, what kind of patterns do I have to watch out for? What am I talking about with patterns? Let me show you a few examples: a few plots of patterns that are indicative of problems in our regression.
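First, though, a quick word on how plots like these are made. As a rough sketch, reusing the `results` object assumed in the snippet above, a residuals-versus-predicted plot takes only a few lines of matplotlib:

```python
import matplotlib.pyplot as plt

# Residuals versus predicted values: the workhorse diagnostic plot.
plt.scatter(results.fittedvalues, results.resid, color='navy', alpha=0.6)
plt.axhline(0, color='grey', linestyle='--')  # residuals should straddle zero
plt.xlabel('Predicted log prices')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.show()
```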
Now, every dataset is different, and there are many, many possible variations, but there are also some common types of problems with particular patterns that you might see. So I'm going to show you some examples of problematic residual plots, so you can get a feel for what's coming up in our Python analysis.

Here we see a plot of the residual values versus our predicted values, and there's clearly a relationship being traced out in the residuals. It's not a linear relationship, but it's definitely a pattern. And of course the same thing holds if you see this pattern the other way around.

If you plot your residuals and you see a cone shape like this, where the residuals are small for smaller predictions and large for larger predictions, then you also have a problem. In this plot the residuals get larger and larger, the larger the prediction is, so this is also something you should watch out for in your residuals.

Now, what if you see this kind of plot? This looks more random already, but you can see that there are these kinds of vertical clusters in the plot: the residuals are grouping together. You might see this kind of pattern when some very important features are missing from your model, or when there are interactions between the features that you're not capturing.

Here's another plot that you might see at some point. This is the classic outlier plot. If you see this kind of thing, then you should take a look at what that lonely data point in the top right corner actually represents and why it's there.

And lastly in our gallery of bad residual plots, we've got this one here. Here we see an unbalanced y-axis: most of the residuals are sitting right at the bottom, but there are also a few massive ones at the top of the chart. Again, if you see this kind of shape, you should take another look at your data and maybe consider a data transformation for a better model fit.

The point I'm trying to make with this gallery is that any non-random pattern in the residuals indicates that there is an issue, because, just like humans, regressions can suffer from many illnesses, and these have to be diagnosed on a case-by-case basis.
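One simple way to begin that diagnosis, sketched here with the assumed `results` and `X_train` names from before, is to pull out the data points with the largest absolute residuals and inspect them individually:

```python
# Pull out the five data points our model misses by the widest margin.
worst = results.resid.abs().sort_values(ascending=False).head(5)
print(worst)                     # index labels and residual sizes
print(X_train.loc[worst.index])  # feature values of the worst offenders
```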
So what might the diagnosis turn up? It's possible, for example, that you're missing an important feature from your data. It's possible that you need to transform your data. It's possible that there's some interaction between the features that you need to capture. And it's possible that the features in the model are not capturing some kind of important explanatory information, which is then leaking into the residuals and creating these patterns.

So the question you might ask is: okay, you've shown me a lot of these terrible plots that, hopefully, I'll never see in my own work. What would a healthy residual plot look like? What do we want to see from our plot of residuals?

Well, we don't want to see patterns, that's for sure. So here are two examples without clear patterns. Here we've got a plot of, well, not that many data points, but they're scattered about in a pretty random fashion. This is what you would expect to see. Would you like to venture a guess as to what you should see when there are loads and loads of data points? You'll want to see something like this: a kind of cloud shape, where most of the residuals are centered around zero and the cloud is more or less symmetric. There isn't a bias towards, say, high predictions or low predictions. You want symmetry there, and you want the residuals to be centered around zero, because in an ideal world, with the perfect dataset fitted by the perfect model, the residuals are actually normally distributed, or at least close to normally distributed.

This normality assumption is also pretty important. The thing you have to remember is that the normality assumption doesn't apply to our target values, it doesn't apply to our house prices, and it most certainly does not apply to our features. Normality is something that we look for in our residuals.

Do you remember what characterizes a normal distribution? What are the things we can look at? Well, we can look at the skew and the mean, right? Both the skew and the mean should be equal to zero, and checking them in Python takes just a couple of lines.
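Here is a minimal sketch of that check, again assuming the `results` object from earlier. The exact values will differ from run to run, but both numbers should sit close to zero:

```python
from scipy.stats import skew

# For roughly normal residuals, both values should be close to zero.
print('Mean of residuals:', results.resid.mean())
print('Skew of residuals:', skew(results.resid))
```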
Now, in truth, normality is maybe not as important as not having a pattern in the residuals. There's actually a famous quote on this by the statistician George Box. He said that "...the statistician knows... that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world". That's a really long quote, but the gist of it is this: all models are wrong, but some models are useful.