The first topic that we're going to talk about on the evaluation front is data transformations.

Let's take another, closer look at the values that we're trying to predict: our target values, the property prices. Check out this histogram. This is what we plotted earlier. What we see here is that there are quite a few very, very expensive properties on the right-hand side of this graph. There's a bunch of properties in the right tail of this distribution. And here's what a normal distribution looks like in comparison: the normal distribution has very, very few occurrences in the tails. And here's the thing: having more data points in one of the tails is called skew. Our distribution of house prices is skewed to the right.

Now, we can actually put a number to the skew. We can measure the skew of our target values. Let's add a markdown cell to this section and commemorate what we're going to do; I'm going to write down "Data Transformations". There we go. To calculate the skew of our target values, all we have to do is grab our price series, "data['PRICE']", put a dot after it, and call the skew method. Hitting Shift+Enter, we see that the skew is 1.1.

Now, what does that mean, and how does that compare to a normal distribution? Well, a normal distribution is completely symmetrical, right? The right tail and the left tail are exactly the same size. So the skew of a normal distribution is equal to 0, and this leads me to think that there's something we can try to improve our model: we can try transforming our price data and then running our regression.

Now, what do I mean by transforming our data? Well, a transformation would be applying some sort of calculation to all the prices in the dataset, something like multiplying all the prices by two or dividing all the prices in half. However, dividing or multiplying our prices isn't the transformation that I have in mind. I want to do something to our prices that will shift our distribution, something that will affect the large house prices in the tail more than the rest. And to accomplish this, I will use a log transformation on our property prices.
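If you want to run that skew check yourself, here it is as a minimal sketch. It assumes "data" is the pandas DataFrame we loaded earlier in the notebook, with the property prices in its 'PRICE' column:

```python
# Skew of the target values; a perfectly symmetrical (normal)
# distribution has a skew of 0. Assumes `data` is the DataFrame
# loaded earlier, with prices in a 'PRICE' column.
price_skew = data['PRICE'].skew()
print(price_skew)  # around 1.1 for our house prices
```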
So what does that mean? If one of the target values, one of the prices, is equal to 7, then the log of this price is equal to 1.95. You can calculate this simply by grabbing a calculator and using the ln function. But here's the interesting part: if a property price is listed as 50, so one of the highest values that we have in our dataset, then the log price of this property would be 3.91. Thus, what the log transformation achieves is this: you get a small change of around 5 for the small price of 7, but a large change of around 46 for the large values in the dataset.

Now, a very reasonable question to ask is: why does this matter? Why should you care? Well, we have to remember that what we're doing here is fitting a linear model to our data. So say we have some data that's distributed like this. If you were to try to fit a straight line to this data, you don't tend to get a very good fit. But if you were to transform this data using the log transformation, then those blue dots would line up like so, and then you could fit a linear regression very, very easily and get a very, very good fit. In other words, based on the skew that I've seen in the distribution, I want to try transforming our data and then fitting the linear regression.

Now, one thing I'm noticing is that I've actually got a typo in my notebook title: this should read multivariable regression, not multivariate regression. There you go. OK.

So how do we apply a log transformation to our target values? Well, what we could do is write something like "from math import log" and use the log function from the math module. However, the log function from the math module doesn't particularly enjoy being applied to an entire data series. Lucky for us, numpy solves this problem. We can access the log function through numpy, so "np.log(data['PRICE'])" will transform our entire series of prices into log prices. I'm going to create another variable called y_log and set it equal to the output of numpy's log function. Let's take a look at what y_log actually looks like. "y_log.head()" will show the first few values, and they look something like this, and "y_log.tail()" will show the last couple of values, and they look something like this.
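Here's that cell as a sketch, again assuming "data" is our DataFrame from earlier, with the ln(7) and ln(50) spot checks from above included:

```python
import numpy as np

# math.log expects a single number, but np.log is vectorised:
# applied to a pandas Series it returns the element-wise logs.
y_log = np.log(data['PRICE'])

print(y_log.head())  # first few log prices
print(y_log.tail())  # last few log prices

# Spot checks from above
print(np.log(7))   # ~1.95
print(np.log(50))  # ~3.91
```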
So what we saw here is that the log transformation was applied successfully to the entire data series. We now have the logs of the property prices stored in a variable called y_log.

Now, I can already hear you asking: where can we see the benefit of using log prices? And I think that's a really, really good question. We're going to look at three things. First off, let's look at the skew of the distribution of the log prices. "y_log.skew()" will output -0.33. So that's nice: with a skew of -0.33, we're a lot closer to zero than with a skew of 1.1.

But what does this distribution look like graphically? Let's pull up our old friend "sns.distplot(y_log)", then "plt.title(f'Log price with skew {y_log.skew()}')", and on the next line "plt.show()". Here we're using seaborn's distplot function and feeding in our log prices. We're going to give this little chart a title, and as an argument we're feeding in an f-string. I've misspelled price, so I'm going to fix that now. Our f-string will take the part that's in between the curly braces and actually calculate the skew, so this bit will show -0.33, and then we're showing our chart. Let me hit Shift+Enter and see what this looks like. Voila! This distribution already looks a lot more symmetrical, and with the skew being so much closer to zero, the numbers confirm it. We can see this difference very, very clearly when we look at the two charts side by side. I'd say this transformation worked really, really well from this point of view.
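Here's that plotting cell in one piece, as a sketch. One caveat: distplot has since been deprecated in seaborn, so on a newer version you'd reach for histplot instead:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of the log prices with the skew in the title via an
# f-string. Note: distplot is deprecated in recent seaborn releases;
# sns.histplot(y_log, kde=True) is the modern replacement.
sns.distplot(y_log)
plt.title(f'Log price with skew {y_log.skew()}')
plt.show()
```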
Now, having looked at the skew, a good question is: can we visualize, graphically, how this transformation makes our data more linear? And the answer is: yes, but it's a little hard to see. It's a little hard to make out on real data. You're just not going to get a graph that's as clear as day as that kind of hypothetical example. The biggest difference that you'll actually be able to spot just by inspecting it is on a plot of property prices versus the LSTAT feature.

So, using seaborn, "sns.lmplot()" will give us our scatter plot with a regression line. This scatter plot is going to have LSTAT, the LSTAT feature, on the x axis; on the y axis, with "y = " as the second argument, it's going to have PRICE. So we're going to compare our non-log prices first and our log prices afterwards. As a third argument we're going to give our data; as a fourth one we're going to say "size = 7" to make the thing a bit larger; as the fifth argument I'm going to add some transparency to our data points with "scatter_kws = " and then a dictionary, "{'alpha': 0.6}"; and then I'm going to add a colored regression line, which I can do with "line_kws = {'color': 'darkred'}". There we go. I'm going to spread this over two lines, add "plt.show()", and hit Shift+Enter.

Tada! This is a larger version of something we've already seen in our pairplot a little earlier, and one thing you can see is that this regression line doesn't fit the data as well as it could. Just by looking at this, we can see that the relationship between LSTAT and price might not be a linear one.

But what about our log prices? Let me take this cell, copy it, paste it, and change the color to 'cyan'. What I'm going to do now is create a new variable, a new dataframe, called "transformed_data", and I'm going to set it equal to features, which we created up here. Our features variable was equal to our dataframe minus the price column, so we're going to reuse that down here. Then we'll add another column to the transformed_data dataframe. I'm going to call that column 'LOG_PRICE', all caps, and it's going to be equal to "y_log", which we created up here, so we can reuse this variable as well.

Okay, so we've created a new dataframe that has all the features and our log price, and in our lmplot function we're going to feed in the transformed data, and on the y axis we're going to be plotting LOG_PRICE, the LOG_PRICE column from the transformed_data dataframe. So let's hit Shift+Enter and see what we get. Tada! Here's our LOG_PRICE versus LSTAT, and here is our untransformed, normal price versus LSTAT. Now, we can see somewhat of an improvement based on this particular scatter plot, but trying to spot the differences visually against all the individual features isn't really all that useful.
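For reference, here are both plotting cells together as one sketch, assuming "features" is the inputs DataFrame from earlier; note that seaborn later renamed lmplot's size argument to height:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Untransformed PRICE vs LSTAT, with a fitted regression line.
# Note: `size` was renamed to `height` in later seaborn releases.
sns.lmplot(x='LSTAT', y='PRICE', data=data, size=7,
           scatter_kws={'alpha': 0.6},
           line_kws={'color': 'darkred'})
plt.show()

# Log prices vs LSTAT. `features` is the inputs DataFrame (our data
# without the PRICE column) created earlier; copying it here keeps
# the original frame unchanged when we add the new column.
transformed_data = features.copy()
transformed_data['LOG_PRICE'] = y_log

sns.lmplot(x='LSTAT', y='LOG_PRICE', data=transformed_data, size=7,
           scatter_kws={'alpha': 0.6},
           line_kws={'color': 'cyan'})
plt.show()
```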
What we really want to do is rerun our regression and see the combined effect of using log prices instead of the standard prices.

The thing to note is that now we've got a different model, right? We're actually going to be changing our regression model; we're going to be using log prices instead of normal prices. So our previous equation looked like this, and our new equation will look like this. And because it's a very different model that we're using, all the theta values will in fact change, and the interpretation will also have to take into account that a unit change in, say, distance or the number of rooms now reflects a change in the log price. So that's just something to be aware of.

Let's see this in action. I'm going to come up here in our notebook to where we were splitting our dataset, copy this cell, and paste it down here. The first thing I'm going to do is change this line of code: I'm going to say "prices = np.log(data['PRICE'])", so here we're using log prices when it comes to shuffling and splitting our dataset. The next thing I'm going to do is delete this line of code, and then I'm going to come back up to where we were running our regression, copy these cells, come down here, paste them in, and delete this comment. This cell here I'm going to change to markdown, put two hash tags there, and write down "Regression using log prices".

Now I'm going to hit Shift+Enter, and this is the output that we get. So the question is: how do we interpret this? Let's look at the r-squared values first. The r-squared on our training data is 0.79, which is actually an increase from before; before, we had 0.75. And we also see an increase on the testing data: here, the value went from 0.67 to 0.74. So the performance of our model improved on both counts, and based on this, I'd say our little data transformation experiment was a success. Reducing the skew in our distribution of target values allowed us to improve our model, and as a result we have a higher r-squared and a better fit.
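Here's a sketch of that re-run, assuming the earlier cells used scikit-learn's train_test_split and LinearRegression; the test_size and random_state shown are placeholders that should match whatever the original split cell used:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Re-run the regression on log prices instead of raw prices.
prices = np.log(data['PRICE'])

# Placeholder split parameters; match them to the earlier split cell.
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=10)

regr = LinearRegression()
regr.fit(X_train, y_train)

print('Training r-squared:', regr.score(X_train, y_train))  # ~0.79
print('Test r-squared:', regr.score(X_test, y_test))        # ~0.74
```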
But I also said that the interpretation of the parameters has changed. Previously, the coefficient for the Charles River dummy variable was around 2, so people were willing to pay 2,000 dollars more to live close to the river. Now the coefficient on the Charles River dummy variable is 0.08. In order to figure out how much more people are willing to pay to be close to the river according to this model, what we have to do is reverse the log transformation, because not even mathematicians think in log prices when they go to the supermarket. So let's see what this value translates into in actual dollar prices.

I'm going to copy this whole thing, and in this cell down here I'm going to add a comment, "Charles River Property Premium". In this cell I'm going to reverse the log calculation so we can see what the dollar value is. Now, as a math refresher, the way the log transformation worked was that we took the log of our prices: if the price was 12, then the log price would be equal to approximately 2.485. To reverse this transformation, we have to raise Euler's number, which is approximately 2.7, to the power of 2.485, and then we get back our starting price of 12. So this is how we're going to reverse that calculation. We need to get hold of this number e, and we can get hold of it through numpy: "np.e" will give us access to this irrational number, and with ** we can raise e to the power of 0.080475, our coefficient. Now we can see how to interpret this coefficient: what our new model is telling us is that people are willing to pay approximately 1,084 dollars more to live close to the river.
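As a sketch, that cell comes down to a single expression:

```python
import numpy as np

# Charles River Property Premium: reverse the log transformation on
# the coefficient. np.e is Euler's number; ** raises it to a power.
premium = np.e ** 0.080475
print(premium)  # ~1.0838; with prices quoted in thousands of dollars,
                # the lesson reads this as roughly $1,084
```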
Okay, so that was quite a big lesson; quite a lot going on. Through our process of transforming our target values, we've created a whole new regression, and this new and improved regression fits our dataset even better. But because we're now working with log prices in our model, our interpretation of the coefficients has also changed: to get back to changes in dollar values, we have to reverse the log transformation when it comes to actually interpreting the meaning of our coefficients. And speaking of coefficients, in the next lessons we're going to be looking at them in more detail. We're going to be evaluating our coefficients and looking at their significance, their p-values. I'll see you there.