0 1 00:00:00,270 --> 00:00:08,400 So now that we've plotted our actual prices versus our predictions and we've generated this chart here, 1 2 00:00:09,210 --> 00:00:19,470 let's move on in our residual analysis to the sister chart of this one, because in this chart we can't 2 3 00:00:19,590 --> 00:00:23,580 actually see the residuals explicitly, right? 3 4 00:00:23,640 --> 00:00:28,350 We can see kind of how far the data points are from the cyan line here, 4 5 00:00:28,620 --> 00:00:32,910 but our residuals are actually not on one of the axes. 5 6 00:00:33,000 --> 00:00:41,870 Let's make this much more explicit and plot our residuals versus the predicted values. I'm going to come up here, 6 7 00:00:42,000 --> 00:00:50,510 add a comment that reads "Residuals vs Predicted values". 7 8 00:00:50,610 --> 00:00:54,890 I'm going to take these lines of code, copy them, paste them below 8 9 00:00:55,020 --> 00:00:58,350 and now I'm going to modify them a little bit. 9 10 00:00:58,350 --> 00:01:01,520 So I'm going to get rid of this line on the x axis, 10 11 00:01:01,590 --> 00:01:14,010 I want the label to read "Predicted log prices" and then "\hat y_i", "fontsize = 14". On the 11 12 00:01:14,010 --> 00:01:15,090 y axis 12 13 00:01:15,150 --> 00:01:28,030 I want it to read just "Residuals". And for the title I want it to read "Residuals vs Fitted Values". 13 14 00:01:28,030 --> 00:01:30,780 Now let's update the arguments in the scatter plot. 14 15 00:01:30,850 --> 00:01:39,220 So for the x axis, you've already guessed - it's gonna be "results.fittedvalues" and for the y axis, 15 16 00:01:39,850 --> 00:01:42,590 it's going to read "results. 16 17 00:01:42,610 --> 00:01:46,430 resid". For the color, 17 18 00:01:46,430 --> 00:01:49,530 I'm going to go with "navy". For the alpha, 18 19 00:01:49,550 --> 00:01:51,530 I'll leave it at 0.6. 19 20 00:01:51,560 --> 00:01:53,410 Now let me hit Shift+Enter. 
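The steps just described can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual cell: the arrays `fitted` and `resid` are synthetic stand-ins (built with NumPy least squares) for the course's `results.fittedvalues` and `results.resid`, and the font sizes other than 14 are assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption): an intercept plus two made-up features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([3.0, 0.4, -0.2]) + rng.normal(scale=0.2, size=200)

# Least-squares fit; these play the role of results.fittedvalues / results.resid
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
resid = y - fitted

# Residuals vs Predicted values
fig, ax = plt.subplots()
ax.scatter(x=fitted, y=resid, c="navy", alpha=0.6)
ax.set_xlabel(r"Predicted log prices $\hat y_i$", fontsize=14)
ax.set_ylabel("Residuals", fontsize=14)
ax.set_title("Residuals vs Fitted Values", fontsize=17)
```

In a notebook you would finish the cell with `plt.show()`; the `Agg` backend line is only there so the sketch also runs as a plain script.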
20 21 00:01:54,080 --> 00:02:01,700 So this is the chart that you can compare to the charts that you've seen in the slides earlier on in 21 22 00:02:01,700 --> 00:02:02,560 the lesson. 22 23 00:02:02,570 --> 00:02:05,220 How do we interpret what we're looking at here? 23 24 00:02:05,270 --> 00:02:07,600 Are you seeing any obvious patterns? 24 25 00:02:07,610 --> 00:02:15,710 I think it's actually pretty ok, the residuals look fairly random for the most part and the residuals 25 26 00:02:15,830 --> 00:02:18,800 are actually centered around zero. 26 27 00:02:18,800 --> 00:02:27,050 So looking at the y axis here, we can see that a lot of the residuals are centered around zero. The residuals 27 28 00:02:27,050 --> 00:02:29,980 are also fairly symmetric. 28 29 00:02:30,080 --> 00:02:33,980 They don't seem to be systematically high or systematically low. 29 30 00:02:34,040 --> 00:02:37,040 So the model is kind of correct on, on average. 30 31 00:02:37,640 --> 00:02:44,090 But in this chart we also see the issue with the high price bracket homes filtering through. 31 32 00:02:44,090 --> 00:02:45,530 No surprise there. 32 33 00:02:45,530 --> 00:02:46,710 Can you spot them? 33 34 00:02:46,970 --> 00:02:54,200 Which data points on this chart correspond to all the fifty thousand dollar homes? 34 35 00:02:54,200 --> 00:02:55,790 They're actually right here. 35 36 00:02:55,880 --> 00:03:03,320 You see how this almost traces out a line? These data points on this chart correspond to these data points 36 37 00:03:03,890 --> 00:03:10,010 on this chart. Those fifty thousand dollar properties that we're bad at predicting seemed to be lining 37 38 00:03:10,010 --> 00:03:10,820 up. 38 39 00:03:10,850 --> 00:03:15,410 Now let's do our next check on the residuals that we talked about. 39 40 00:03:15,410 --> 00:03:18,750 Let's check for normality. 
40 41 00:03:18,770 --> 00:03:26,210 Let's see if our normality assumption is satisfied or close to satisfied, because in the beginning of 41 42 00:03:26,210 --> 00:03:30,980 the lesson we said, well we kind of want these residuals to be normally distributed. 42 43 00:03:30,980 --> 00:03:34,070 Let's check out if they really are or not. 43 44 00:03:34,070 --> 00:03:44,390 In the cell below, I'm going to add a comment and that's going to read "Distribution of residuals (log prices) 44 45 00:03:44,900 --> 00:03:55,690 - checking for normality". A normal distribution, if you remember, has a mean and a skew of what? Zero, 45 46 00:03:55,710 --> 00:03:56,760 right? 46 47 00:03:56,820 --> 00:03:59,850 The skew should be zero and the mean should be zero. 47 48 00:03:59,850 --> 00:04:02,520 How do we print out the mean and the skew? 48 49 00:04:02,550 --> 00:04:09,570 We'll take our results object, "results.resid", get the residuals and then we can chain a method 49 50 00:04:09,570 --> 00:04:18,730 onto this. For the mean, we would use ".mean()". Let me hit Shift+Enter. What's printed out here 50 51 00:04:18,870 --> 00:04:21,190 is in scientific notation. 51 52 00:04:21,420 --> 00:04:30,990 So let's round this, let's say "round(results.resid.mean()", comma and then let's 52 53 00:04:30,990 --> 00:04:38,430 round to 3 decimal places and have the closing parenthesis at the end, Shift+Enter, and we see here 53 54 00:04:38,430 --> 00:04:44,570 that the mean of our residuals is indeed very, very close to zero. 54 55 00:04:44,780 --> 00:04:49,350 Let me store this in a variable called "resid_mean". 55 56 00:04:49,780 --> 00:04:59,880 And now let's print out the skew, so "results.resid.skew()" should be our skew. 56 57 00:04:59,880 --> 00:05:01,390 See what that is. 57 58 00:05:01,580 --> 00:05:02,060 Huh. 58 59 00:05:02,100 --> 00:05:02,610 Okay. 59 60 00:05:02,610 --> 00:05:11,740 0.12 approximately. I can also round that, round to three decimal places. 
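To make the mean and skew check concrete without the course's dataset, here is a small standard-library sketch. The residual values are made up, and `sample_skew` is a hypothetical helper (not part of any library) implementing the bias-adjusted Fisher-Pearson skewness, which to my understanding is the same kind of estimate that `.skew()` on a pandas Series reports.

```python
from math import sqrt

def sample_skew(values):
    """Bias-adjusted Fisher-Pearson skewness (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

# Made-up residuals standing in for results.resid (assumption)
resid = [-0.31, -0.12, -0.05, 0.02, 0.04, 0.09, 0.33]

# Round to 3 decimal places so we avoid scientific notation in the output
resid_mean = round(sum(resid) / len(resid), 3)
resid_skew = round(sample_skew(resid), 3)
print(resid_mean, resid_skew)
```

A perfectly symmetric sample gives a skew of zero; a long right tail pushes the number above zero and a long left tail pushes it below.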
60 61 00:05:11,740 --> 00:05:18,530 And I'm also going to store this in a variable, "resid_skew" is equal to this whole thing. 61 62 00:05:18,550 --> 00:05:19,170 So fair enough. 62 63 00:05:19,180 --> 00:05:20,620 The mean is equal to zero. 63 64 00:05:20,620 --> 00:05:25,690 The skew is not equal to zero, but it's not too far off. 64 65 00:05:25,740 --> 00:05:34,450 Now looking at these two numbers is helpful but it's even better if we complement this with a plot, with 65 66 00:05:34,450 --> 00:05:35,470 a graphic. 66 67 00:05:35,470 --> 00:05:43,630 So I'm going to use seaborn here. "sns.distplot()", distribution plot parentheses and then as the arguments 67 68 00:05:44,290 --> 00:05:45,870 we'll provide our residuals, 68 69 00:05:45,890 --> 00:05:48,820 so "results.resid". 69 70 00:05:50,050 --> 00:05:55,310 And for the color we'll go with "navy" again. 70 71 00:05:55,750 --> 00:06:03,820 I think every plot needs a title so "plt.title()" and then as a title we'll say 71 72 00:06:04,330 --> 00:06:10,440 'Log price model: residuals'. 72 73 00:06:10,450 --> 00:06:17,740 Now let's go with "plt.show()" and see what this looks like. 73 74 00:06:17,740 --> 00:06:18,770 Here we go. 74 75 00:06:18,880 --> 00:06:25,840 Here we see the distribution of our residuals using seaborn's distplot function. 75 76 00:06:25,840 --> 00:06:31,750 I can come back up here, make this an f-string by putting the little f in front and add our residuals' 76 77 00:06:31,750 --> 00:06:42,880 mean and the skew into the title, so I'll go with "Skew ({resid_skew})" 77 78 00:06:43,570 --> 00:06:55,280 and "Mean ({resid_mean})". We didn't calculate the mean and the skew and round them 78 79 00:06:55,400 --> 00:06:59,060 for nothing after all. Let's show it in our chart. 79 80 00:06:59,090 --> 00:06:59,560 There we go. 80 81 00:07:00,900 --> 00:07:02,880 So how are we doing? 81 82 00:07:02,880 --> 00:07:06,980 Well, the mean is equal to zero, but that's no surprise. 
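Why a zero mean is no surprise can be checked with a tiny example. The sketch below fits a straight line by ordinary least squares (with an intercept) to made-up data that is clearly curved, so the line is a poor fit. `fit_line` is a hypothetical helper written for this illustration, not the course's code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data: y = x squared, so a straight line is the wrong model
xs = [float(x) for x in range(0, 11)]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
resid_mean = sum(resid) / len(resid)

print(resid)       # U-shaped pattern: positive, negative, then positive again
print(resid_mean)  # zero up to floating-point rounding
```

The residuals trace an obvious U shape, yet they still average out to zero. That is why a zero mean on its own says little about model quality; it only holds this way when the regression includes an intercept.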
82 83 00:07:06,990 --> 00:07:09,030 That's actually by design. 83 84 00:07:09,060 --> 00:07:13,950 That's how the regression model's best fit line is calculated. 84 85 00:07:13,950 --> 00:07:20,220 No matter how bad your regression line is, the mean is gonna be equal to zero by design, but I think the 85 86 00:07:20,280 --> 00:07:27,680 skew being close to zero is a result of our data transformation and I'm going to prove this to you shortly. 86 87 00:07:27,780 --> 00:07:36,570 Looking at this histogram and the estimated distribution for the residuals by seaborn, what's really 87 88 00:07:36,570 --> 00:07:44,550 comforting to see is that the residuals are fairly symmetrical, right, and they have a fairly constant 88 89 00:07:44,820 --> 00:07:47,240 spread throughout the range. 89 90 00:07:47,280 --> 00:07:50,450 So I think we're doing pretty ok. 90 91 00:07:50,670 --> 00:07:56,940 The thing that you do notice however is that this distribution in contrast to a normal distribution 91 92 00:07:57,300 --> 00:07:59,350 has much longer tails. 92 93 00:07:59,400 --> 00:08:07,680 So there are more values in the extreme left and the extreme right than what you would see with a normal 93 94 00:08:07,680 --> 00:08:08,640 distribution. 94 95 00:08:08,670 --> 00:08:13,880 You've got a bigger peak in the middle and then you've got longer tails on either end. 95 96 00:08:13,920 --> 00:08:21,120 So this is where the similarity to the normal distribution is much, much weaker. 96 97 00:08:21,180 --> 00:08:21,580 Okay. 97 98 00:08:21,600 --> 00:08:28,650 So we've looked at three charts of our residuals, but I think what we really, really need to do is, we 98 99 00:08:28,650 --> 00:08:34,920 need to compare what these charts look like for different models, because if these three charts are 99 100 00:08:35,010 --> 00:08:38,220 all we've ever seen we don't really have much context, right? 100 101 00:08:39,480 --> 00:08:44,520 And so on that note I'd like to pose a challenge to you. 
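For reference, the residual distribution chart with the skew and mean in its title could be sketched like this. It is a stand-in, not the notebook's cell: it uses matplotlib's `hist` instead of `sns.distplot` (which newer seaborn releases deprecate in favour of `histplot`/`displot`), the residuals are synthetic draws from a heavy-tailed t-distribution chosen only to mimic the "bigger peak, longer tails" shape, and the skew here is the simple moment-based version.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic heavy-tailed residuals (assumption), centred so the mean is zero
rng = np.random.default_rng(1)
resid = rng.standard_t(df=4, size=500) * 0.1
resid = resid - resid.mean()

# "+ 0.0" turns a possible -0.0 from rounding into 0.0 for a cleaner title
resid_mean = round(float(resid.mean()), 3) + 0.0
# Simple moment-based skewness: third central moment over variance ** 1.5
resid_skew = round(float((resid ** 3).mean() / (resid ** 2).mean() ** 1.5), 3)

fig, ax = plt.subplots()
ax.hist(resid, bins=40, density=True, color="navy", alpha=0.6)
ax.set_title(f"Log price model: residuals Skew ({resid_skew}) Mean ({resid_mean})")
```

The f-string is what lets the two rounded numbers appear directly in the title.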
101 102 00:08:44,610 --> 00:08:47,840 I want you to generate these three plots, right. 102 103 00:08:47,850 --> 00:08:57,570 So this distribution, the residuals vs the fitted values and the fitted values vs the observed 103 104 00:08:57,570 --> 00:09:03,240 values for the original model that we had. 104 105 00:09:03,240 --> 00:09:11,630 So this was the model with all the features using normal prices not the transformed log prices. 105 106 00:09:11,910 --> 00:09:19,200 And after you've generated those charts, I want you to analyze and interpret the results that you're 106 107 00:09:19,200 --> 00:09:25,960 getting back, so I'll give you a few seconds to pause the video and give this a shot. 107 108 00:09:28,390 --> 00:09:29,970 OK, ready? 108 109 00:09:29,980 --> 00:09:32,130 Here's the solution. 109 110 00:09:32,160 --> 00:09:45,550 Use the lazy man's approach and copy this entire cell, I'm going to then come here and paste it in and 110 111 00:09:45,550 --> 00:09:51,340 I'm going to modify the code a little bit. I'm going to change my comment here, 111 112 00:09:51,340 --> 00:09:53,640 say "Original model" 112 113 00:09:56,940 --> 00:10:04,320 "normal prices & all features". To use normal prices, 113 114 00:10:04,320 --> 00:10:12,960 I have to, not just get rid of this comment, but I'm going to have to get rid of this "np.log()" here and to use 114 115 00:10:13,620 --> 00:10:25,690 all the features, I'm going to delete "INDUS" and "AGE" from the arguments under the drop method. Scrolling 115 116 00:10:25,690 --> 00:10:26,170 down, 116 117 00:10:26,890 --> 00:10:29,830 don't need this comment anymore. 117 118 00:10:29,970 --> 00:10:34,050 Don't need these comments anymore and then for the scatter plot, 118 119 00:10:34,060 --> 00:10:35,910 I'm gonna go with a different color. 119 120 00:10:36,040 --> 00:10:40,300 I'm gonna go with Indigo. 
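The two changes for the challenge (no `np.log()` on the prices, nothing besides the target dropped) might look like this on a toy table. The DataFrame below is invented for illustration, and the column name `PRICE` is an assumption about how the course's data is laid out.

```python
import numpy as np
import pandas as pd

# Tiny made-up table; values and the PRICE column name are placeholders
data = pd.DataFrame({
    "PRICE": [24.0, 21.6, 34.7, 50.0],
    "RM": [6.6, 6.4, 7.2, 8.0],
    "INDUS": [2.3, 7.1, 7.1, 18.1],
    "AGE": [65.2, 78.9, 61.1, 90.0],
})

# Log-price model: log-transformed target, INDUS and AGE left out
log_target = np.log(data["PRICE"])
log_features = data.drop(["PRICE", "INDUS", "AGE"], axis=1)

# Original model: normal prices and all features
target = data["PRICE"]
features = data.drop("PRICE", axis=1)

print(list(log_features.columns), list(features.columns))
```

Everything downstream (fitting, the scatter plots, the distribution chart) can then be reused unchanged with `target` and `features`.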
For the labels on this chart, 120 121 00:10:41,570 --> 00:10:51,850 I'm going to say "Actual prices 000s", "Predicted prices 000s". For the title, 121 122 00:10:51,850 --> 00:10:56,410 I'm going to say "Actual vs Predicted prices". Coming down, 122 123 00:10:56,410 --> 00:11:03,310 I'm going to delete this line of code, which we don't need. For our second chart, 123 124 00:11:03,310 --> 00:11:11,830 I'm also gonna go with indigo, and I'm going to update the labels and now all I have to do is add the distribution 124 125 00:11:11,830 --> 00:11:12,360 graph. 125 126 00:11:12,430 --> 00:11:21,100 So that's gonna be a "Residual Distribution Chart" which I'm going to grab from up here. 126 127 00:11:21,190 --> 00:11:23,820 I'm going to grab these lines of code here, 127 128 00:11:23,920 --> 00:11:28,820 copy them, put them down here, paste them in. Again, 128 129 00:11:28,850 --> 00:11:32,800 change the color to indigo to set them apart a little bit, 129 130 00:11:32,810 --> 00:11:42,770 update my title, let's have it read "Residuals" and print out the skew and the mean in the title. 130 131 00:11:43,130 --> 00:11:44,480 And that's pretty much it. 131 132 00:11:44,540 --> 00:11:49,910 The coding side of this challenge is pretty trivial because we're reusing a lot of the code. 132 133 00:11:49,910 --> 00:11:57,260 But let's take a look at what the charts look like and see what the differences are between what we 133 134 00:11:57,260 --> 00:12:00,910 are doing here and what we did earlier. 134 135 00:12:00,920 --> 00:12:05,110 First up, our actual versus our predicted prices. 135 136 00:12:05,510 --> 00:12:12,020 Now visually the first graph and this one here actually seem quite similar. 136 137 00:12:12,170 --> 00:12:18,320 And that's no surprise given that the correlation between the fitted values and the observed values 137 138 00:12:18,830 --> 00:12:20,810 is around the same. 
138 139 00:12:20,810 --> 00:12:26,690 Yes, it's a bit worse, we know that from the r-squared that we calculated, and it has a slightly lower 139 140 00:12:26,690 --> 00:12:33,980 correlation, but the difference is not super dramatic. The predicted and the actual values are actually 140 141 00:12:33,980 --> 00:12:42,110 fairly close to the cyan line as they were with the log prices. Now coming down to the second chart here. 141 142 00:12:42,130 --> 00:12:46,880 This one is much more interesting. Here we're definitely starting to see a little bit of a difference. 142 143 00:12:47,540 --> 00:12:50,380 Compared with the log prices, 143 144 00:12:50,720 --> 00:12:58,690 the cloud of residuals looks almost like it's got a little bit of a parabolic shape to it. 144 145 00:12:58,880 --> 00:13:02,490 It's kind of subtle and you almost have to kind of squint a little bit. 145 146 00:13:02,660 --> 00:13:07,730 But what we're seeing here doesn't look entirely random. 146 147 00:13:07,730 --> 00:13:15,830 This provides further justification that the log transformation for the target values that we did was 147 148 00:13:15,830 --> 00:13:18,350 indeed appropriate. 148 149 00:13:18,350 --> 00:13:19,940 Now, what about the third chart? 149 150 00:13:19,940 --> 00:13:27,200 What about the histogram and the distribution of the residuals? Coming down 150 151 00:13:27,200 --> 00:13:27,980 we see that 151 152 00:13:28,010 --> 00:13:28,540 yeah, 152 153 00:13:28,640 --> 00:13:30,090 the mean is equal to zero. 153 154 00:13:30,170 --> 00:13:32,680 But what about the skew? 154 155 00:13:32,840 --> 00:13:40,600 And here we see that with a skew of 1.5 approximately the distribution of the residuals 155 156 00:13:40,760 --> 00:13:43,180 is actually fairly lopsided. 
156 157 00:13:43,280 --> 00:13:51,170 This makes this distribution a lot more dissimilar from a normal distribution, because the skew of a 157 158 00:13:51,170 --> 00:13:52,480 normal distribution is zero 158 159 00:13:52,520 --> 00:13:59,350 and we've got 1.5 approximately. A distribution of residuals, 159 160 00:13:59,420 --> 00:14:07,910 that's not close to a normal distribution makes things much more difficult when it comes to making predictions 160 161 00:14:07,910 --> 00:14:12,820 and making forecasts, which is ultimately what we wanted to do, right? 161 162 00:14:12,830 --> 00:14:20,920 This is the assignment that our boss gave us in our imaginary real estate office. 162 163 00:14:21,110 --> 00:14:25,370 So I hope this was a helpful contrast to what we saw earlier and provides a bit more context. 163 164 00:14:25,580 --> 00:14:31,610 But I want to show you one more example, because before we finish this lesson I want to show you the 164 165 00:14:31,610 --> 00:14:39,950 kind of pattern that you could see in your residuals when you're missing important features or omitting 165 166 00:14:40,550 --> 00:14:44,180 kind of key variables in your regression. 166 167 00:14:44,450 --> 00:14:53,150 So let me come back up here, copy this, paste it and then I'm going to update my comment here. 167 168 00:14:53,150 --> 00:14:54,920 I'm going to say "Model 168 169 00:14:58,170 --> 00:15:05,130 Omitting Key Features using log prices", 169 170 00:15:07,870 --> 00:15:15,110 and now what I'm going to do is start dropping quite a few features from our dataset. 170 171 00:15:15,220 --> 00:15:26,550 I'm going to drop, not just INDUS and AGE but I'm also going to drop LSTAT, I'm going to drop 171 172 00:15:26,560 --> 00:15:34,270 RM, I'm going to drop NOX and I'm going to drop Crime. 
172 173 00:15:34,270 --> 00:15:36,600 Now we said we'll use log prices, 173 174 00:15:36,730 --> 00:15:45,580 so I'm going to add "np.log" back here where we're getting our prices and then just as a review, 174 175 00:15:46,330 --> 00:15:51,730 you don't actually have to stick to the named colors that are in matplotlib, 175 176 00:15:51,850 --> 00:15:54,750 you can actually specify any color you want, any shade you want. 176 177 00:15:55,510 --> 00:16:02,680 If you go to a website like flatuicolors.com you can grab a particular hex code that identifies 177 178 00:16:02,680 --> 00:16:10,210 a particular shade of a color. The hex codes always start with this pound symbol and then there are six letters 178 179 00:16:10,330 --> 00:16:12,150 or numbers following that. 179 180 00:16:12,600 --> 00:16:20,600 So I'm going to take Alizarin here, which I can then paste in here where I've referenced Indigo. 180 181 00:16:20,850 --> 00:16:31,620 So "c='#e74c3c'". This is the shade of Alizarin that we've copied from the other website. 181 182 00:16:32,620 --> 00:16:33,590 Coming down, 182 183 00:16:33,650 --> 00:16:38,860 I'm also gonna replace the color in our second chart, so that way each of our models has a certain theme 183 184 00:16:38,860 --> 00:16:44,890 going on, and I'm also gonna delete this block of code at the bottom. 184 185 00:16:44,890 --> 00:16:47,880 Finally just gonna update the title here. 185 186 00:16:48,070 --> 00:16:58,030 So I want that title to read "Actual vs Predicted prices with omitted variables". And the very 186 187 00:16:58,030 --> 00:17:03,520 last thing we have to do on the labeling front is change our x and y labels. 187 188 00:17:03,550 --> 00:17:13,640 So these are gonna be back to log prices, xlabel is gonna read "Actual log prices" and our ylabel is gonna 188 189 00:17:13,650 --> 00:17:19,260 read "Predicted log prices". 189 190 00:17:19,430 --> 00:17:22,290 Let's take a look at our charts. 190 191 00:17:22,320 --> 00:17:24,000 There we go. 
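Here is a rough, self-contained sketch of the omitted-features experiment and the hex color. Everything in it is an assumption made for illustration: the data is synthetic, `fit_ols` is a hypothetical helper built on NumPy least squares rather than the course's statsmodels call, and "dropping" features simply means slicing columns off the array.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

def fit_ols(X, y):
    """Least squares with an intercept; returns fitted values and r-squared."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return fitted, r2

# Synthetic data (assumption): three strong features and one weak one
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = 3 + X @ np.array([0.5, -0.4, 0.3, 0.05]) + rng.normal(scale=0.1, size=300)

fitted_full, r2_full = fit_ols(X, y)
fitted_omit, r2_omit = fit_ols(X[:, 3:], y)  # keep only the weak feature
print(round(r2_full, 3), round(r2_omit, 3))

fig, ax = plt.subplots()
ax.scatter(x=y, y=fitted_omit, c="#e74c3c", alpha=0.6)  # hex code, not a named color
ax.plot(y, y, color="cyan")  # perfect-prediction reference line
ax.set_xlabel("Actual log prices", fontsize=14)
ax.set_ylabel("Predicted log prices", fontsize=14)
ax.set_title("Actual vs Predicted prices with omitted variables", fontsize=17)
```

Because the reduced model is nested inside the full one, its r-squared can never come out higher, so the weaker fit in the chart is expected.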
191 192 00:17:24,000 --> 00:17:26,250 So this is interesting, right? 192 193 00:17:26,500 --> 00:17:34,610 As before, we see this banding here on the top right with our very expensive properties at fifty thousand. 193 194 00:17:34,620 --> 00:17:43,080 We also see that, as expected, the correlation between our fitted values and our observed values is much, 194 195 00:17:43,080 --> 00:17:50,340 much lower because we're leaving out a lot of information, a lot of explanatory features from our model 195 196 00:17:51,780 --> 00:17:55,260 but not only that, we see this kind of like banding here. 196 197 00:17:55,350 --> 00:18:05,570 So you've got all these data points lining up here and here and even inside this cloud here. Scrolling 197 198 00:18:05,570 --> 00:18:13,790 down, we see that this is even more extreme when we look at the residuals vs the fitted values. 198 199 00:18:13,790 --> 00:18:21,770 Here you can see the banding very, very clearly in the residual chart. Instead of a completely random 199 200 00:18:22,010 --> 00:18:24,450 distribution of residuals, 200 201 00:18:24,530 --> 00:18:29,240 what we see in this chart here are clusters. 201 202 00:18:29,300 --> 00:18:35,450 This is a very, very clear pattern and it's telling us that there's some important information that's 202 203 00:18:35,570 --> 00:18:42,790 missing from our model and this information has somehow found its way into the residuals. 203 204 00:18:43,050 --> 00:18:51,380 And this kind of brings me to my final thoughts on the banding that we see with the fifty thousand dollar 204 205 00:18:51,380 --> 00:18:52,710 homes. 
205 206 00:18:52,820 --> 00:19:01,220 My hypothesis as to why we see these properties lining up like this is because there's something maybe 206 207 00:19:01,220 --> 00:19:08,870 missing from our model, maybe there is some feature that these homes all have in common or there 207 208 00:19:08,870 --> 00:19:14,900 was something in the way that the data was collected or there is some sort of interaction between 208 209 00:19:14,900 --> 00:19:20,900 features of these homes that we're not capturing in our model. 209 210 00:19:20,900 --> 00:19:27,380 If I wanted to kind of dig into this further and improve this model that we have further, this would 210 211 00:19:27,380 --> 00:19:29,510 be one of the things I would be looking at. 211 212 00:19:29,510 --> 00:19:31,950 This would be something I could dig into. 212 213 00:19:32,630 --> 00:19:35,160 But we have more important things to do. 213 214 00:19:35,450 --> 00:19:39,950 You and I, we're gonna be moving on to bigger and better things. 214 215 00:19:39,980 --> 00:19:45,490 We're gonna be moving on to making predictions from our regression model. 215 216 00:19:45,560 --> 00:19:48,340 This is what we ultimately set out to do, right? 216 217 00:19:48,440 --> 00:19:50,480 I'll see you in the next lesson. 217 218 00:19:50,570 --> 00:19:51,050 Take care.