In this lesson we will look at the residuals of our regression, check for problems, and validate our regression model.

At the moment our model looks something like this: we dropped two features, so we only have 11 features instead of the original 13, and we're also using log prices, because we transformed our target data.

Let me change the notation here a little bit and write y_hat on the left-hand side of the equation, so it reads something like y_hat = θ₀ + θ₁x₁ + ... + θ₁₁x₁₁. What we're looking at here is the equation to predict a property price given some information about that property. In our notation, y_hat is that property's predicted value, and it is calculated from all the theta parameters and all the values of the individual features.

It's important to remember that in our notation y is not equal to y_hat: the predicted value is not the same as the target value. As a matter of fact, there will usually be a difference between the two, between the predicted value and the actual, true value. What we can do is subtract the predicted value from the observed target value, and what we're left with is called the residual.

So if the observed target value for a property is, say, 50 and our predicted value from our model is 48, then the residual is equal to 2: 50 minus 48 equals 2. The math here isn't going to get anybody excited, but this is what we do for all 400-odd individual data points in our training dataset: this very unexciting bit of arithmetic for all the predicted values and all the target values. Since we have 404 target values in our training dataset, we have 404 predicted, or fitted, values as well, which means we have 404 residuals.
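To make this concrete, here is a minimal sketch of that calculation in Python. The names are assumptions for illustration, not the lesson's exact code: X_train holding the 11 features and y_train holding the 404 log prices from our train-test split, with statsmodels doing the fitting.

```python
import statsmodels.api as sm

# Assumed names for illustration: X_train (the 11 features) and
# y_train (the 404 log prices) from our train-test split.
X_incl_const = sm.add_constant(X_train)        # adds the intercept column for theta_0
results = sm.OLS(y_train, X_incl_const).fit()  # fit the linear regression

predicted_values = results.fittedvalues        # y_hat: one fitted value per data point
residuals = y_train - predicted_values         # residual = observed minus predicted
# statsmodels also exposes exactly this as results.resid
```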
Now, the key is what these 404 residuals look like as a group. But why? I hear you asking across time and space: why are we looking at these residuals? Why do the residuals matter?

Here's the thing: our regression relies on certain assumptions. It's like you're back at your real estate office in Boston, looking out the window and thinking, "Man, the real world is a really complicated place. I can't model all this. I know what I'm going to do. I'm going to make my life easier by making some very, very crude assumptions. Maybe something like a simple linear model will be good enough."

If these crude assumptions that we're making more or less hold up, then our simplified model is useful. It's not 100% correct, but it will be something we can use in practice, and the results we get back, the numbers that our Python code spits out, have meaning. But if the assumptions we're making don't hold at all, then all we get back is a whole bunch of rubbish.

Now, I'm talking a lot about the simplifying assumptions we're making, but what are we talking about here, concretely? We actually covered one of the key simplifying assumptions already, in the lesson on data transformations: we are assuming linearity, for starters. We are currently fitting a linear model to our data. We are assuming that a linear model is more or less appropriate, and we even transformed our data to make it better fit our linearity assumption.

But there are some other key assumptions as well. For one, we like to think that our model can more or less explain what's actually going on in the real world. After all, we have a high r-squared: our 11 features together explain about 75 percent of the variance in property prices. This means that what's left unexplained, yes, the residuals, the difference between our predictions and the actual values, should be random. The part that our model is missing shouldn't have a clear pattern.

And where can we see what's missing from our model? Well, in the residuals, right? We can spot what's missing in the differences between the target values and our predicted values. If there's a pattern in the residuals, then there's also some predictive information in the residuals. And if there's predictive information in the residuals, then that predictive information is missing from our model. Makes sense, right?

The only thing you're asking now is: well, what kind of patterns do I have to watch out for? What am I talking about with patterns? Let me show you a few examples: a few plots of patterns that are indicative of problems in our regression.
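First, though, a quick word on how plots like these are made. As a rough sketch, reusing the `results` object assumed in the snippet above, a residuals-versus-predicted plot takes only a few lines of matplotlib:

```python
import matplotlib.pyplot as plt

# Residuals versus predicted values: the workhorse diagnostic plot.
plt.scatter(results.fittedvalues, results.resid, color='navy', alpha=0.6)
plt.axhline(0, color='grey', linestyle='--')  # residuals should straddle zero
plt.xlabel('Predicted log prices')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.show()
```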
Now, every dataset is different, and there are many, many possible variations, but there are also some common types of problems with particular patterns that you might see. So I'm going to show you some examples of problematic residual plots, so you can get a feel for what's coming up in our Python analysis.

Here we see a plot of the residual values versus our predicted values, and there's clearly a relationship being traced out in the residuals. It's not a linear relationship, but it's definitely a pattern. And of course the same thing holds if you see this pattern the other way around.

If you plot your residuals and you see a cone shape like this, where the residuals are small for smaller predictions and large for larger predictions, then you also have a problem. In this plot the residuals get larger and larger, the larger the prediction is, so this is also something you should watch out for in your residuals.

Now, what if you see this kind of plot? This looks more random already, but you can see that there are these kinds of vertical clusters in the plot: the residuals are grouping together. You might see this kind of pattern when some very important features are missing from your model, or when there are interactions between the features that you're not capturing.

Here's another plot that you might see at some point. This is the classic outlier plot. If you see this kind of thing, then you should take a look at what that lonely data point in the top right corner actually represents and why it's there.

And lastly in our gallery of bad residual plots, we've got this one here. Here we see an unbalanced y-axis: most of the residuals are sitting right at the bottom, but there are also a few massive ones at the top of the chart. Again, if you see this kind of shape, you should take another look at your data and maybe consider a data transformation for a better model fit.

The point I'm trying to make with this gallery is that any non-random pattern in the residuals indicates that there is an issue, because, just like humans, regressions can suffer from many illnesses, and these have to be diagnosed on a case-by-case basis.
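One simple way to begin that diagnosis, sketched here with the assumed `results` and `X_train` names from before, is to pull out the data points with the largest absolute residuals and inspect them individually:

```python
# Pull out the five data points our model misses by the widest margin.
worst = results.resid.abs().sort_values(ascending=False).head(5)
print(worst)                     # index labels and residual sizes
print(X_train.loc[worst.index])  # feature values of the worst offenders
```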
So what might the diagnosis turn up? It's possible, for example, that you're missing an important feature from your data. It's possible that you need to transform your data. It's possible that there's some interaction between the features that you need to capture. And it's possible that the features in the model are not capturing some kind of important explanatory information, which is then leaking into the residuals and creating these patterns.

So the question you might ask is: okay, you've shown me a lot of these terrible plots that, hopefully, I'll never see in my own work. What would a healthy residual plot look like? What do we want to see from our plot of residuals?

Well, we don't want to see patterns, that's for sure. So here are two examples without clear patterns. Here we've got a plot of, well, not that many data points, but they're scattered about in a pretty random fashion. This is what you would expect to see. Would you like to venture a guess as to what you should see when there are loads and loads of data points? You'll want to see something like this: a kind of cloud shape, where most of the residuals are centered around zero and the cloud is more or less symmetric. There isn't a bias towards, say, high predictions or low predictions. You want symmetry there, and you want the residuals to be centered around zero, because in an ideal world, with the perfect dataset fitted by the perfect model, the residuals are actually normally distributed, or at least close to normally distributed.

This normality assumption is also pretty important. The thing you have to remember is that the normality assumption doesn't apply to our target values, it doesn't apply to our house prices, and it most certainly does not apply to our features. Normality is something that we look for in our residuals.

Do you remember what characterizes a normal distribution? What are the things we can look at? Well, we can look at the skew and the mean, right? Both the skew and the mean should be equal to zero, and checking them in Python takes just a couple of lines.
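Here is a minimal sketch of that check, again assuming the `results` object from earlier. The exact values will differ from run to run, but both numbers should sit close to zero:

```python
from scipy.stats import skew

# For roughly normal residuals, both values should be close to zero.
print('Mean of residuals:', results.resid.mean())
print('Skew of residuals:', skew(results.resid))
```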
Now, in truth, normality is maybe not as important as not having a pattern in the residuals. There's actually a famous quote on this by the statistician George Box. He said that "...the statistician knows... that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world". That's a really long quote, but the gist of it is this: all models are wrong, but some models are useful.