All right, so how do we get to making a prediction or a forecast? We have to provide two things: the estimated property price and the range that goes with that price. This is what the pros do as well; they provide an estimate of what the home is worth plus a degree of uncertainty around that number.

The estimated price is easy enough, right? Once we have the theta parameters for the model, all we have to do is plug them in together with the values for the individual features, and we get an estimate, our y_hat. But what about the range? Where does that come from? The range actually depends on the shape of the distribution that you're working with. If we know the distribution, we can estimate the range very accurately.

And here's the thing: our go-to distribution is usually the normal distribution, because the very, very nice thing about the normal distribution is that we know its shape. We know that for a normal distribution, 68% of the observations lie between these two points; 68% of all the values in this distribution are within this purple range right here. And for a normal distribution, we also know that around 95% of the values lie between these two points; 95% of the observations fall within this pink range that I've highlighted.

The individual points that I've drawn on this histogram actually have a name. They quantify the amount of variation around the mean. The mean is right here in the middle of the distribution, and the distance from the mean to that bright purple point is called one standard deviation. For a normal distribution, you'll usually see the Greek letter sigma used to denote the standard deviation.

Now what about the other points that I drew on here earlier? Well, the left purple point is at minus one standard deviation from the mean, and for our normal distribution, as we said before, around 68% of all the observations lie between minus one and plus one standard deviation. OK, what about the pink points? Well, the right pink point is at plus two standard deviations and the left pink point is at minus two standard deviations. And as we've said before, approximately 95% of observations lie between plus two and minus two standard deviations for a normal distribution.
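If you want to see those 68% and 95% figures come out of actual numbers, here is a minimal sketch that uses simulated data only, nothing from our notebook. It draws samples from a normal distribution with numpy and counts how many fall within one and two standard deviations of the mean; the seed and sample size are arbitrary.

```python
import numpy as np

# Simulated data only (nothing from the notebook): draw from a normal
# distribution and check how many values fall within 1 and 2 standard deviations.
rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

mu, sigma = samples.mean(), samples.std()
within_1sd = np.mean(np.abs(samples - mu) <= 1 * sigma)   # ~0.68
within_2sd = np.mean(np.abs(samples - mu) <= 2 * sigma)   # ~0.95

print(f'Within 1 standard deviation: {within_1sd:.1%}')
print(f'Within 2 standard deviations: {within_2sd:.1%}')
```

Run it with different seeds and the proportions stay very close to 68% and 95%.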
Now let me ask you this. If this green point here, right in the middle, is our estimate for the property price from our model, our y_hat, what is the distribution that we're going to be looking at here? Any guess? What's the distribution that tells us something about the variance in our price estimates? Well, we're coming full circle here: it's actually the distribution of the residuals from our regression. This is the reason why we cared so much about whether that distribution is normal or not.

The next question you might ask at this point is: well, okay, if the distribution is the distribution of the residuals, then what do the purple and pink dots represent? How do we get our range? Do you remember our mean squared error? And no, the mean squared error is not the purple dot, but we can make one small modification to the mean squared error and get something very, very handy for calculating the range and making predictions. That small modification is taking the square root. By taking the square root of the mean squared error, we get another metric, and this one is called, yes, surprise, surprise, the Root Mean Squared Error, or RMSE. It's this metric, the RMSE, that has a really, really nice interpretation: the Root Mean Squared Error represents one standard deviation of the differences between our actual and our predicted values. In other words, the Root Mean Squared Error is one standard deviation in the distribution of our residuals.

So let's look at the chart again. To create our range around our estimated price, our so-called prediction interval, the first thing we choose is how wide we want that interval to be. Say we want to cover around 95% of the observations; then we would use two standard deviations on either side. This means taking our prediction and adding two times the Root Mean Squared Error to it for the upper bound, and subtracting two times the Root Mean Squared Error for the lower bound on our prediction. That's how we get the range. In our Jupyter notebook, this simply means taking the square root of the mean squared error that we've already calculated.
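Before we jump into the notebook, here is a small standalone sketch of that recipe. The arrays and the y_hat value are made-up numbers for illustration only; the point is that the RMSE is the square root of the mean squared residual, and the 95% prediction interval is roughly y_hat plus or minus two times the RMSE.

```python
import numpy as np

# Made-up numbers for illustration; in the notebook these come from the regression.
y_actual = np.array([21.0, 34.5, 19.8, 27.1, 23.4])   # observed target values
y_pred = np.array([22.3, 31.9, 21.0, 26.4, 24.8])     # fitted values (y_hat)

residuals = y_actual - y_pred
mse = np.mean(residuals**2)    # mean squared error
rmse = np.sqrt(mse)            # one standard deviation of the residuals

# ~95% prediction interval around a new estimate: y_hat +/- 2 * RMSE
y_hat = 25.0
upper_bound = y_hat + 2 * rmse
lower_bound = y_hat - 2 * rmse
print(f'Estimate: {y_hat:.1f}, interval: [{lower_bound:.2f}, {upper_bound:.2f}]')
```

Strictly speaking, 95% corresponds to about 1.96 standard deviations; two is the usual rule of thumb, and it's what we use in this lesson.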
Coming to the cell where we've got our dataframe with our Mean Squared Error, what I'm going to do is add another column. So I'll put a comma here, go to the next line, put RMSE in single quotes and then a colon, and now I can take this entire list here and copy it. Then what I'm going to do is use numpy and call the square root function from numpy, and as an argument I'm going to pass in the list that I just copied. What this will do is take the square root of every item in the list. Let me refresh the cell. Here we go.

Now that we've done that, I want to give you a challenge. Suppose we have an estimate from our model for a house price of 30,000 dollars. Can you calculate the upper and lower bound for a 95% prediction interval using the reduced log model? In other words, can you calculate the upper and the lower bound for the range around this estimate? I'll give you a few seconds to pause the video before I show you the solution.

All right, let's take it from the top. I'm going to add a print statement and spell it out, so I'm going to say "1 s.d. in log prices", because we've got units, is, and that will be the square root of our log mean squared error, so np.sqrt(reduced_log_mse). Agreed? What's that equal to? It's equal to this much. Now what about two standard deviations? I can calculate that simply by taking the first print statement, copying it, and then multiplying the whole thing by two. Here we go. This is two standard deviations.

The upper bound for the prediction interval will be equal to our y_hat plus two times the root mean squared error. Now, I've been pretty sneaky and given you the y_hat in dollar values, so you actually have to use a log transformation, np.log(30), since our model works in thousands, and then you have to add 2*np.sqrt(reduced_log_mse). So this is our y_hat plus two times the root mean squared error. Let me print that out: print('The upper bound in log prices for a 95% prediction interval is', upper_bound). That's equal to approximately 3.78.
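Put together, that upper-bound calculation looks roughly like this. The reduced_log_mse value of 0.035 is only a stand-in that roughly reproduces the numbers quoted in the video; in the notebook you would use the mean squared error you already computed for the reduced log model.

```python
import numpy as np

# reduced_log_mse is assumed to exist from earlier in the notebook (the MSE of
# the reduced log-price model). 0.035 is only a stand-in for illustration.
reduced_log_mse = 0.035

print('1 s.d. in log prices is', np.sqrt(reduced_log_mse))
print('2 s.d. in log prices is', 2 * np.sqrt(reduced_log_mse))

# y_hat of $30,000 is 30 in the model's units (thousands), then log-transformed.
upper_bound = np.log(30) + 2 * np.sqrt(reduced_log_mse)
print('The upper bound in log prices for a 95% prediction interval is', upper_bound)
```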
Now if we wanted to see this in dollar values, I can copy that print statement, paste it in, and change it to say 'The upper bound in normal prices is', np.e**upper_bound. So let's see what that is. I can make it more explicit by multiplying the whole thing by 1000 and putting a little dollar sign here, and there I've got my upper bound.

The lower bound is very, very similar. You can even copy these three lines, paste them in, and change this to lower_bound = np.log(30) minus two times the root mean squared error. Then I change the print statements to read 'The lower bound in log prices for a 95% prediction interval is', lower_bound, and the lower bound in normal prices is np.e**lower_bound * 1000. Let's see what this reads. The lower bound in this case is 20,635 dollars.

Now, the trick with this challenge is to do the addition of the root mean squared error and the transformation in the right order, because otherwise you'll get a very, very different result. The incorrect way of calculating the upper bound would have been to take two times the root mean squared error and simply say, 'Well, okay, we've got an estimate of thirty thousand, and then we're going to add to it the transformed value, np.e** two times the root mean squared error, multiplied by 1000'. So this was the little trick in how I phrased this challenge.
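For reference, here is a sketch of the whole solution in one place, including the incorrect ordering for comparison. Again, the reduced_log_mse value of 0.035 is only a stand-in that roughly reproduces the numbers quoted above.

```python
import numpy as np

reduced_log_mse = 0.035                      # stand-in for the notebook's value
two_rmse = 2 * np.sqrt(reduced_log_mse)

log_estimate = np.log(30)                    # $30,000 in log(thousands)
upper_bound = log_estimate + two_rmse
lower_bound = log_estimate - two_rmse

# Correct: add the 2 * RMSE in log space, then transform back to dollars.
print('The upper bound in normal prices is $', np.e**upper_bound * 1000)
print('The lower bound in normal prices is $', np.e**lower_bound * 1000)

# Incorrect: transform the 2 * RMSE first and add it in dollar space.
print('Incorrectly ordered upper bound: $', 30_000 + np.e**two_rmse * 1000)
```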
In summary, we often look towards the root mean squared error when we're interested in the predictive power of our models, and to some extent we can also use the root mean squared error to compare models. The reason is that the root mean squared error is a very good measure of how accurately the model predicts the target, because it lets us determine a range, and the width of that range is a very important criterion if the main purpose of the model is prediction. This is a big contrast to something like R-squared, because R-squared says absolutely nothing about the predictive power of the model or the prediction error.

Okay, so we're slowly coming towards the end of the section. In the next lesson, we're going to finish it up by building a valuation tool for our boss in our real estate office, and that's probably going to involve updating the prices a little bit to reflect today's dollar values. I think the Boston upper price of fifty thousand dollars is not really that accurate anymore. We're also going to be looking at how we can create a Python function with optional arguments, that is, arguments that already have default values set, similar to what we've seen with seaborn and matplotlib, and we're going to cover the Python syntax for creating functions like this ourselves; there's a tiny preview of that syntax below. I'll see you in the next lesson. Take care.
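As that preview of optional arguments, here is a minimal sketch. The function name and parameters are made up for illustration; this is not the valuation tool we'll actually build in the next lesson. The idea is simply that parameters with default values can be left out or overridden by keyword when the function is called.

```python
# Hypothetical example, not the actual valuation tool from the next lesson.
def describe_property(rooms, students_per_class=15, next_to_river=False):
    """Toy function showing optional arguments with default values."""
    print(f'rooms={rooms}, students_per_class={students_per_class}, '
          f'next_to_river={next_to_river}')

describe_property(3)                               # both defaults used
describe_property(3, next_to_river=True)           # override one default by keyword
describe_property(rooms=5, students_per_class=20)  # everything passed by keyword
```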