0
1
00:00:00,270 --> 00:00:08,400
So now that we've plotted our actual prices versus our predictions and we've generated this chart here,
1

2
00:00:09,210 --> 00:00:19,470
let's move on in our residual analysis to the sister chart of this one, because in this chart we can't
2

3
00:00:19,590 --> 00:00:23,580
actually see the residuals explicitly, right?
3

4
00:00:23,640 --> 00:00:28,350
We can see kind of how far the data points are from the cyan line here,
4

5
00:00:28,620 --> 00:00:32,910
but our residuals are actually not on one of the axes.
5

6
00:00:33,000 --> 00:00:41,870
Let's make this much more explicit and plot our residuals versus the predicted values. I'm going to come up here,
6

7
00:00:42,000 --> 00:00:50,510
add a comment that reads "Residuals vs Predicted values".
7

8
00:00:50,610 --> 00:00:54,890
I'm going to take these lines of code, copy them, paste them below
8

9
00:00:55,020 --> 00:00:58,350
and now I'm going to modify them a little bit.
9

10
00:00:58,350 --> 00:01:01,520
So I'm going to get rid of this line on the x axis,
10

11
00:01:01,590 --> 00:01:14,010
I want the label to read "Predicted log prices" and then "\hat y_i", "fontsize = 14". On the
11

12
00:01:14,010 --> 00:01:15,090
y axis
12

13
00:01:15,150 --> 00:01:28,030
I want it to read just "Residuals". And for the title I want it to read "Residuals vs Fitted Values".
13

14
00:01:28,030 --> 00:01:30,780
Now let's update the arguments in the scatter plot.
14

15
00:01:30,850 --> 00:01:39,220
So for the x axis, you've already guessed - it's gonna be "results.fittedvalues" and for the y axis,
15

16
00:01:39,850 --> 00:01:42,590
it's going to read "results.
16

17
00:01:42,610 --> 00:01:46,430
resid". For the color,
17

18
00:01:46,430 --> 00:01:49,530
I'm going to go with "navy". For the alpha,
18

19
00:01:49,550 --> 00:01:51,530
I'll leave it at 0.6.
19

20
00:01:51,560 --> 00:01:53,410
Now let me hit Shift+Enter.
20

21
00:01:54,080 --> 00:02:01,700
So this is the chart that you can compare to the charts that you've seen in the slides earlier on in
21

22
00:02:01,700 --> 00:02:02,560
the lesson.
22

23
00:02:02,570 --> 00:02:05,220
How do we interpret what we're looking at here?
23

24
00:02:05,270 --> 00:02:07,600
Are you seeing any obvious patterns?
24

25
00:02:07,610 --> 00:02:15,710
I think it's actually pretty ok, the residuals look fairly random for the most part and the residuals
25

26
00:02:15,830 --> 00:02:18,800
are actually centered around zero.
26

27
00:02:18,800 --> 00:02:27,050
So looking at the y axis here, we can see that a lot of the residuals are centered around zero. The residuals
27

28
00:02:27,050 --> 00:02:29,980
are also fairly symmetric.
28

29
00:02:30,080 --> 00:02:33,980
They don't seem to be systematically high or systematically low.
29

30
00:02:34,040 --> 00:02:37,040
So the model is kind of correct on average.
30

31
00:02:37,640 --> 00:02:44,090
But in this chart we also see the issue with the high price bracket homes filtering through.
31

32
00:02:44,090 --> 00:02:45,530
No surprise there.
32

33
00:02:45,530 --> 00:02:46,710
Can you spot them?
33

34
00:02:46,970 --> 00:02:54,200
Which data points on this chart correspond to all the fifty thousand dollar homes?
34

35
00:02:54,200 --> 00:02:55,790
They're actually right here.
35

36
00:02:55,880 --> 00:03:03,320
You see how this almost traces out a line? These data points on this chart correspond to these data points
36

37
00:03:03,890 --> 00:03:10,010
on this chart. Those fifty thousand dollar properties that we're bad at predicting seem to be lining
37

38
00:03:10,010 --> 00:03:10,820
up.
38

39
00:03:10,850 --> 00:03:15,410
Now let's do our next check on the residuals that we talked about.
39

40
00:03:15,410 --> 00:03:18,750
Let's check for normality.
40

41
00:03:18,770 --> 00:03:26,210
Let's see if our normality assumption is satisfied or close to satisfied, because in the beginning of
41

42
00:03:26,210 --> 00:03:30,980
the lesson we said, well we kind of want these residuals to be normally distributed.
42

43
00:03:30,980 --> 00:03:34,070
Let's check out if they really are or not.
43

44
00:03:34,070 --> 00:03:44,390
In the cell below, I'm going to add a comment and that's going to read "Distribution of residuals (log prices) 
44

45
00:03:44,900 --> 00:03:55,690
- checking for normality". A normal distribution, if you remember, has a mean and a skew of what? Zero,
45

46
00:03:55,710 --> 00:03:56,760
Right?
46

47
00:03:56,820 --> 00:03:59,850
The skew should be zero and the mean should be zero.
47

48
00:03:59,850 --> 00:04:02,520
How do we print out the mean and the skew?
48

49
00:04:02,550 --> 00:04:09,570
We'll take our results object, "results.resid", get the residuals and then we can chain a method
49

50
00:04:09,570 --> 00:04:18,730
onto this. For the mean, we would use ".mean()". Let me hit Shift+Enter. What's printed out here
50

51
00:04:18,870 --> 00:04:21,190
is in scientific notation.
51

52
00:04:21,420 --> 00:04:30,990
So let's round this, let's say "round(results.resid.mean()", comma, and then let's
52

53
00:04:30,990 --> 00:04:38,430
round to 3 decimal places and have the closing parentheses at the end, Shift+Enter, and we see here
53

54
00:04:38,430 --> 00:04:44,570
that the mean of our residuals is indeed very, very close to zero.
54

55
00:04:44,780 --> 00:04:49,350
Let me store this in a variable called "resid_mean".
55

56
00:04:49,780 --> 00:04:59,880
And now let's print out the skew, so "results.resid.skew()" should be our skew.
56

57
00:04:59,880 --> 00:05:01,390
See what that is.
57

58
00:05:01,580 --> 00:05:02,060
Huh.
58

59
00:05:02,100 --> 00:05:02,610
Okay.
59

60
00:05:02,610 --> 00:05:11,740
0.12 approximately. I can also round that, round to three decimal places.
60

61
00:05:11,740 --> 00:05:18,530
And I'm also going to store this in a variable, "resid_skew" is equal to this whole thing.
61

62
00:05:18,550 --> 00:05:19,170
So fair enough.
62

63
00:05:19,180 --> 00:05:20,620
The mean is equal to zero.
63

64
00:05:20,620 --> 00:05:25,690
The skew is not equal to zero, but it's not too far off.
64
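
A minimal sketch of the two summary numbers, assuming the residuals are a pandas Series (which is what gives us the ".mean()" and ".skew()" methods). The residuals here are randomly generated placeholders rather than the lesson's "results.resid", and the lognormal series is only added to show what a clearly skewed sample looks like.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical residuals (in the lesson these would be results.resid)
resid = pd.Series(rng.normal(loc=0.0, scale=0.2, size=500))

resid_mean = round(resid.mean(), 3)
resid_skew = round(resid.skew(), 3)
print(f"Mean: {resid_mean}, Skew: {resid_skew}")

# For contrast: a right-skewed sample has a skew well above zero
skewed = pd.Series(rng.lognormal(size=500))
print(f"Skew of a lognormal sample: {round(skewed.skew(), 3)}")
```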

65
00:05:25,740 --> 00:05:34,450
Now looking at these two numbers is helpful but it's even better if we complement this with a plot, with
65

66
00:05:34,450 --> 00:05:35,470
a graphic.
66

67
00:05:35,470 --> 00:05:43,630
So I'm going to use seaborn here. "sns.distplot()", distribution plot parentheses and then as the arguments
67

68
00:05:44,290 --> 00:05:45,870
we'll provide our residuals,
68

69
00:05:45,890 --> 00:05:48,820
so "results.resid".
69

70
00:05:50,050 --> 00:05:55,310
And for the color we'll go with the "navy" again.
70

71
00:05:55,750 --> 00:06:03,820
I think every plot needs a title so "plt.title()" and then as a title we'll say 
71

72
00:06:04,330 --> 00:06:10,440
'Log price model: residuals'.
72

73
00:06:10,450 --> 00:06:17,740
Now let's go with "plt.show()" and see what this looks like.
73

74
00:06:17,740 --> 00:06:18,770
Here we go.
74

75
00:06:18,880 --> 00:06:25,840
Here we see the distribution of our residuals using seaborn's distplot function.
75

76
00:06:25,840 --> 00:06:31,750
I can come back up here, make this an f-string by putting the little f in front and add our residuals
76

77
00:06:31,750 --> 00:06:42,880
mean and the skew into the title, so I'll go with "Skew ({resid_skew})" 
77

78
00:06:43,570 --> 00:06:55,280
and "Mean ({resid_mean})". We didn't calculate the mean and the skew and round them
78

79
00:06:55,400 --> 00:06:59,060
for nothing after all. Let's show it in our chart.
79

80
00:06:59,090 --> 00:06:59,560
There we go.
80

81
00:07:00,900 --> 00:07:02,880
So how are we doing?
81

82
00:07:02,880 --> 00:07:06,980
Well, the mean is equal to zero, but that's no surprise.
82

83
00:07:06,990 --> 00:07:09,030
That's actually by design.
83

84
00:07:09,060 --> 00:07:13,950
That's how the regression model's best fit line is calculated.
84

85
00:07:13,950 --> 00:07:20,220
No matter how bad your regression line, the mean is gonna be equal to zero by design, but I think the
85

86
00:07:20,280 --> 00:07:27,680
skew being close to zero is a result of our data transformation and I'm going to prove this to you shortly.
86
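
To illustrate the "by design" point, here is a small NumPy-only sketch with invented data. It fits a deliberately poor straight line to a clearly non-linear target. One caveat worth stating: the zero-mean guarantee relies on the model including an intercept (a constant column), which is why the second fit, without one, is shown for contrast.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 5 * np.sin(x) + rng.normal(size=300)  # clearly non-linear target

# Deliberately poor model: a straight line, but WITH an intercept column
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
print(resid.mean())  # zero up to floating-point error

# Same line forced through the origin: the guarantee no longer holds
X_no_const = x[:, None]
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
resid_no_const = y - X_no_const @ coef_no_const
print(resid_no_const.mean())
```

So a residual mean of zero says nothing about how good the fit is; it only confirms the intercept did its job.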

87
00:07:27,780 --> 00:07:36,570
Looking at this histogram and the estimated distribution for the residuals by seaborn, what's really
87

88
00:07:36,570 --> 00:07:44,550
comforting to see is that the residuals are fairly symmetrical, right, and they have a fairly constant
88

89
00:07:44,820 --> 00:07:47,240
spread throughout the range.
89

90
00:07:47,280 --> 00:07:50,450
So I think we're doing pretty ok.
90

91
00:07:50,670 --> 00:07:56,940
The thing that you do notice however is that this distribution in contrast to a normal distribution
91

92
00:07:57,300 --> 00:07:59,350
has much longer tails.
92

93
00:07:59,400 --> 00:08:07,680
So there's more values in the extreme left and the extreme right than what you would see with a normal
93

94
00:08:07,680 --> 00:08:08,640
distribution.
94

95
00:08:08,670 --> 00:08:13,880
You've got a bigger peak in the middle and then you've got longer tails on either end.
95

96
00:08:13,920 --> 00:08:21,120
So this is where the similarity to the normal distribution is much, much weaker.
96

97
00:08:21,180 --> 00:08:21,580
Okay.
97

98
00:08:21,600 --> 00:08:28,650
So we've looked at three charts of our residuals, but I think what we really, really need to do is, we
98

99
00:08:28,650 --> 00:08:34,920
need to compare what these charts look like for different models, because if these three charts are
99

100
00:08:35,010 --> 00:08:38,220
all we've ever seen we don't really have much context, right?
100

101
00:08:39,480 --> 00:08:44,520
And so on that note I'd like to pose a challenge to you.
101

102
00:08:44,610 --> 00:08:47,840
I want you to generate these three plots, right.
102

103
00:08:47,850 --> 00:08:57,570
So this distribution, the residuals vs the fitted values and the fitted values vs the observed
103

104
00:08:57,570 --> 00:09:03,240
values for the original model that we had.
104

105
00:09:03,240 --> 00:09:11,630
So this was the model with all the features using normal prices not the transformed log prices.
105

106
00:09:11,910 --> 00:09:19,200
And after you've generated those charts, I want you to analyze and interpret the results that you're
106

107
00:09:19,200 --> 00:09:25,960
getting back, so I'll give you a few seconds to pause the video and give this a shot.
107

108
00:09:28,390 --> 00:09:29,970
OK, ready?
108

109
00:09:29,980 --> 00:09:32,130
Here's the solution.
109

110
00:09:32,160 --> 00:09:45,550
I'll use the lazy man's approach and copy this entire cell. I'm going to then come here and paste it in and
110

111
00:09:45,550 --> 00:09:51,340
I'm going to modify the code a little bit. I'm going to change my comment here,
111

112
00:09:51,340 --> 00:09:53,640
say "Original model"
112

113
00:09:56,940 --> 00:10:04,320
"normal prices & all features". To use normal prices,
113

114
00:10:04,320 --> 00:10:12,960
I have to, not just get rid of this comment, but I'm going to have to get rid of this "np.log()" here and to use
114

115
00:10:13,620 --> 00:10:25,690
all the features, I'm going to delete "INDUS" and "AGE" from the arguments under the drop method. Scrolling
115

116
00:10:25,690 --> 00:10:26,170
down,
116

117
00:10:26,890 --> 00:10:29,830
don't need this comment anymore.
117

118
00:10:29,970 --> 00:10:34,050
Don't need these comments anymore and then for the scatter plot,
118

119
00:10:34,060 --> 00:10:35,910
I'm gonna go with a different color.
119

120
00:10:36,040 --> 00:10:40,300
I'm gonna go with Indigo. For the labels on this chart,
120

121
00:10:41,570 --> 00:10:51,850
I'm going to say "Actual prices 000s", "Predicted prices 000s". For the title,
121

122
00:10:51,850 --> 00:10:56,410
I'm going to say "Actual vs Predicted prices". Coming down,
122

123
00:10:56,410 --> 00:11:03,310
I'm going to delete this line of code, which we don't need. For our second chart,
123

124
00:11:03,310 --> 00:11:11,830
I'm also gonna go with indigo, and I'm going to update the labels and now all I have to do is add the distribution
124

125
00:11:11,830 --> 00:11:12,360
graph.
125

126
00:11:12,430 --> 00:11:21,100
So that's gonna be a "Residual Distribution Chart" which I'm going to grab from up here.
126

127
00:11:21,190 --> 00:11:23,820
I'm going to grab these lines of code here,
127

128
00:11:23,920 --> 00:11:28,820
copy them, put them down here, paste them in. Again,
128

129
00:11:28,850 --> 00:11:32,800
change the color to indigo to set them apart a little bit,
129

130
00:11:32,810 --> 00:11:42,770
update my title, let's have it read "Residuals" and print out the skew and the mean in the title.
130

131
00:11:43,130 --> 00:11:44,480
And that's pretty much it.
131

132
00:11:44,540 --> 00:11:49,910
The coding side of this challenge is pretty trivial because we're reusing a lot of the code.
132

133
00:11:49,910 --> 00:11:57,260
But let's take a look at what the charts look like and see what the differences are between what we
133

134
00:11:57,260 --> 00:12:00,910
are doing here and what we did earlier.
134

135
00:12:00,920 --> 00:12:05,110
First up, our actual versus our predicted prices.
135

136
00:12:05,510 --> 00:12:12,020
Now visually the first graph and this one here actually seem quite similar.
136

137
00:12:12,170 --> 00:12:18,320
And that's no surprise given that the correlation between the fitted values and the observed values
137

138
00:12:18,830 --> 00:12:20,810
is around the same.
138

139
00:12:20,810 --> 00:12:26,690
Yes, it's a bit worse, we know that from the r-squared that we calculated, and it has a little bit lower
139

140
00:12:26,690 --> 00:12:33,980
correlation, but the difference is not super dramatic. The predicted and the actual values are actually
140

141
00:12:33,980 --> 00:12:42,110
fairly close to the cyan line as they were with the log prices. Now coming down on the second chart here.
141

142
00:12:42,130 --> 00:12:46,880
This one is much more interesting. Here we're definitely starting to see a little bit of a difference.
142

143
00:12:47,540 --> 00:12:50,380
Compared with the log prices,
143

144
00:12:50,720 --> 00:12:58,690
the cloud of residuals looks almost like it's got a little bit of a parabolic shape to it.
144

145
00:12:58,880 --> 00:13:02,490
It's kind of subtle and you almost have to kind of squint a little bit.
145

146
00:13:02,660 --> 00:13:07,730
But what we're seeing here doesn't look entirely random.
146

147
00:13:07,730 --> 00:13:15,830
This provides further justification that the log transformation for the target values that we did was
147

148
00:13:15,830 --> 00:13:18,350
indeed appropriate.
148

149
00:13:18,350 --> 00:13:19,940
Now, what about the third chart?
149

150
00:13:19,940 --> 00:13:27,200
What about the histogram and the distribution of the residuals? Coming down
150

151
00:13:27,200 --> 00:13:27,980
we see that
151

152
00:13:28,010 --> 00:13:28,540
yeah,
152

153
00:13:28,640 --> 00:13:30,090
the mean is equal to zero.
153

154
00:13:30,170 --> 00:13:32,680
But what about the skew?
154

155
00:13:32,840 --> 00:13:40,600
And here we can see that with a skew of 1.5 approximately the distribution of the residuals
155

156
00:13:40,760 --> 00:13:43,180
is actually fairly lopsided.
156

157
00:13:43,280 --> 00:13:51,170
This makes this distribution a lot more dissimilar from a normal distribution, because the skew of a
157

158
00:13:51,170 --> 00:13:52,480
normal distribution is zero
158

159
00:13:52,520 --> 00:13:59,350
and we've got 1.5 approximately. A distribution of residuals
159

160
00:13:59,420 --> 00:14:07,910
that's not close to a normal distribution makes things much more difficult when it comes to making predictions
160

161
00:14:07,910 --> 00:14:12,820
and making forecasts, which is ultimately what we wanted to do, right?
161

162
00:14:12,830 --> 00:14:20,920
This is the assignment that our boss gave us in our imaginary real estate office.
162

163
00:14:21,110 --> 00:14:25,370
So I hope this was a helpful contrast to what we saw earlier and provides a bit more context.
163

164
00:14:25,580 --> 00:14:31,610
But I want to show you one more example, because before we finish this lesson I want to show you the
164

165
00:14:31,610 --> 00:14:39,950
kind of pattern that you could see in your residuals when you're missing important features or omitting
165

166
00:14:40,550 --> 00:14:44,180
kind of key variables in your regression.
166

167
00:14:44,450 --> 00:14:53,150
So let me come back up here, copy this, paste it and then I'm going to update my comment here.
167

168
00:14:53,150 --> 00:14:54,920
I'm going to say "Model
168

169
00:14:58,170 --> 00:15:05,130
Omitting Key Features using log prices",
169

170
00:15:07,870 --> 00:15:15,110
and now what I'm going to do is start dropping quite a few features from our dataset.
170

171
00:15:15,220 --> 00:15:26,550
I'm going to drop, not just INDUS and AGE but I'm also going to drop LSTAT, I'm going to drop
171

172
00:15:26,560 --> 00:15:34,270
RM, I'm going to drop NOX and I'm going to drop Crime.
172
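
A small sketch of dropping these columns with pandas. The one-row DataFrame is invented just to make the example run, and the column names are assumptions: "PRICE" is a hypothetical target column and "CRIM" is assumed to be how the crime-rate column is spelled in the data.

```python
import pandas as pd

# Hypothetical one-row frame; column names are assumptions for illustration
data = pd.DataFrame({
    "CRIM": [0.006], "INDUS": [2.31], "NOX": [0.538], "RM": [6.575],
    "AGE": [65.2], "TAX": [296.0], "PTRATIO": [15.3], "LSTAT": [4.98],
    "PRICE": [24.0],
})

# Drop the target plus the features we are deliberately omitting
features = data.drop(
    ["PRICE", "INDUS", "AGE", "LSTAT", "RM", "NOX", "CRIM"], axis=1
)
print(list(features.columns))
```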

173
00:15:34,270 --> 00:15:36,600
Now we said we'll use log prices,
173

174
00:15:36,730 --> 00:15:45,580
so I'm going to add "np.log" back here where we're getting our prices and then just as a review,
174

175
00:15:46,330 --> 00:15:51,730
you don't actually have to stick to the named colors that are in matplotlib,
175

176
00:15:51,850 --> 00:15:54,750
you can actually specify any color you want, any shade you want.
176

177
00:15:55,510 --> 00:16:02,680
If you go to a web site like flatuicolors.com you can grab a particular hex code that identifies
177

178
00:16:02,680 --> 00:16:10,210
a particular shade of a color. The hex codes always start with this pound symbol and then there are six letters
178

179
00:16:10,330 --> 00:16:12,150
or numbers following that.
179

180
00:16:12,600 --> 00:16:20,600
So I'm going to take Alizarin here, which I can then paste in here where I've referenced Indigo.
180

181
00:16:20,850 --> 00:16:31,620
So "c='#e74c3c'". This is the shade of Alizarin that we've copied from the other website.
181
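
A quick sketch of passing a hex code to matplotlib. The hex code has to be given as a string, quotes included. The three data points are placeholders, and "to_rgb" is only used here to show what the six hex digits decode to.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgb

alizarin = "#e74c3c"  # pound symbol followed by six hex digits

# Two hex digits each for red, green and blue, returned as fractions of 1
red, green, blue = to_rgb(alizarin)
print(red, green, blue)

# Any matplotlib colour argument accepts the hex string directly
plt.scatter(x=[1, 2, 3], y=[2, 1, 3], c=alizarin, alpha=0.6)
plt.show()
```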

182
00:16:32,620 --> 00:16:33,590
Coming down,
182

183
00:16:33,650 --> 00:16:38,860
I'm also gonna replace the color in our second chart, so that way each of our models has a certain theme
183

184
00:16:38,860 --> 00:16:44,890
going on, and I'm also gonna delete this block of code at the bottom.
184

185
00:16:44,890 --> 00:16:47,880
Finally just gonna update the title here.
185

186
00:16:48,070 --> 00:16:58,030
So I want that title to read "Actual vs Predicted prices with omitted variables". And the very
186

187
00:16:58,030 --> 00:17:03,520
last thing we have to do on the labeling front is change our x and y labels.
187

188
00:17:03,550 --> 00:17:13,640
So these are gonna be back to log prices, xlabel is gonna read "Actual log prices" and our ylabel is gonna
188

189
00:17:13,650 --> 00:17:19,260
read "Predicted log prices".
189

190
00:17:19,430 --> 00:17:22,290
Let's take a look at our charts.
190

191
00:17:22,320 --> 00:17:24,000
There we go.
191

192
00:17:24,000 --> 00:17:26,250
So this is interesting, right?
192

193
00:17:26,500 --> 00:17:34,610
As before, we see this banding here on the top right with our very expensive properties at fifty thousand.
193

194
00:17:34,620 --> 00:17:43,080
We also see that, as expected, the correlation between our fitted values and our observed values is much,
194

195
00:17:43,080 --> 00:17:50,340
much lower because we're leaving out a lot of information, a lot of explanatory features from our model
195

196
00:17:51,780 --> 00:17:55,260
but not only that, we see this kind of like banding here.
196

197
00:17:55,350 --> 00:18:05,570
So you've got all these data points lining up here and here and even inside this cloud here. Scrolling
197

198
00:18:05,570 --> 00:18:13,790
down, we see that this is even more extreme when we look at the residuals vs the fitted values.
198

199
00:18:13,790 --> 00:18:21,770
Here you can see the banding very, very clearly in the residual chart. Instead of a completely random
199

200
00:18:22,010 --> 00:18:24,450
distribution of residuals,
200

201
00:18:24,530 --> 00:18:29,240
what we see in this chart here are clusters.
201

202
00:18:29,300 --> 00:18:35,450
This is a very, very clear pattern and it's telling us that there's some important information that's
202

203
00:18:35,570 --> 00:18:42,790
missing from our model and this information has somehow found its way into the residuals.
203

204
00:18:43,050 --> 00:18:51,380
And this kind of brings me to my final thoughts on the banding that we see with the fifty thousand dollar
204

205
00:18:51,380 --> 00:18:52,710
homes.
205

206
00:18:52,820 --> 00:19:01,220
My hypothesis as to why we see these properties lining up like this is because there's something maybe
206

207
00:19:01,220 --> 00:19:08,870
missing from our model, maybe there is some feature that these homes all have in common or there
207

208
00:19:08,870 --> 00:19:14,900
was something in the way that the data was collected or there is some sort of interaction between
208

209
00:19:14,900 --> 00:19:20,900
features of these homes that we're not capturing in our model.
209

210
00:19:20,900 --> 00:19:27,380
If I wanted to kind of dig into this further and improve this model that we have further, this would
210

211
00:19:27,380 --> 00:19:29,510
be one of the things I would be looking at.
211

212
00:19:29,510 --> 00:19:31,950
This would be something I could dig into.
212

213
00:19:32,630 --> 00:19:35,160
But we have more important things to do.
213

214
00:19:35,450 --> 00:19:39,950
You and I, we're gonna be moving on to bigger and better things.
214

215
00:19:39,980 --> 00:19:45,490
We're gonna be moving on to making predictions from our regression model.
215

216
00:19:45,560 --> 00:19:48,340
This is what we ultimately set out to do, right?
216

217
00:19:48,440 --> 00:19:50,480
I'll see you in the next lesson.
217

218
00:19:50,570 --> 00:19:51,050
Take care.