0 1 00:00:00,530 --> 00:00:07,730 In this lesson we're gonna put everything that we've learned together and build a quick and dirty Property 1 2 00:00:07,730 --> 00:00:13,050 valuation tool for Boston using our existing data set. 2 3 00:00:13,400 --> 00:00:20,330 And this means that we're going to apply all the concepts that we've discussed previously, but also it 3 4 00:00:20,330 --> 00:00:25,940 gives us a chance to expand our knowledge of Python programming techniques. 4 5 00:00:25,940 --> 00:00:32,810 In fact, we're going to package our Boston property valuation tool as a Python module that you can then 5 6 00:00:32,870 --> 00:00:39,140 import into any other notebook just as we've been importing say pandas or numpy. 6 7 00:00:39,590 --> 00:00:45,220 And also, we're going to cover how to write Python functions that have default values for arguments. 7 8 00:00:45,410 --> 00:00:51,780 And we're going to cover how we can include helpful documentation in our Python code as well. 8 9 00:00:51,960 --> 00:00:53,460 So how will this tool work? 9 10 00:00:53,610 --> 00:00:57,040 How will it find a price for a property? 10 11 00:00:57,060 --> 00:01:01,140 Well, it will make use of our existing model. 11 12 00:01:01,410 --> 00:01:08,280 Here we get the theta values from our regression and then all we need to do is plug in custom values 12 13 00:01:08,370 --> 00:01:15,540 for all the features like RM, NOX, LSTAT, CHAS and so on. 13 14 00:01:15,540 --> 00:01:21,860 And once we've done that, we have our y_hat for a property that is not in the dataset. 14 15 00:01:21,900 --> 00:01:24,480 So that's pretty simple, right? Now, 15 16 00:01:24,600 --> 00:01:30,090 of course there are certain limitations of the data set that we've been using and we're gonna have to 16 17 00:01:30,090 --> 00:01:32,190 work around these limitations. 
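[Editor's note] The pricing idea described above can be sketched in a few lines of Python. The theta and feature values below are made-up placeholders, not the actual coefficients from the lesson's regression:

```python
import numpy as np

# Hypothetical theta values from a fitted regression: an intercept followed
# by one weight per feature (placeholders, NOT the lesson's actual estimates)
theta = np.array([2.5, -0.01, 0.05, 0.8])

# Custom values for a property that is not in the dataset: a leading 1.0
# for the intercept term, then made-up values for three features
property_values = np.array([1.0, 0.02, 15.0, 6.5])

# The estimate y_hat is the dot product of the thetas and the feature values
y_hat = theta @ property_values
```

With real coefficients and all eleven features the arrays would simply be longer; the dot product itself does not change.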
17 18 00:01:32,310 --> 00:01:38,640 For starters, we don't have a column with the location of the homes to assist us in pricing homes depending 18 19 00:01:38,640 --> 00:01:40,160 on an area. 19 20 00:01:40,200 --> 00:01:45,270 Also if you are searching for properties online you're not going to be able to input some pieces of 20 21 00:01:45,270 --> 00:01:50,580 information that are just very, very abstract, like nobody is really going to know what the correct 21 22 00:01:50,580 --> 00:01:51,410 value for 22 23 00:01:51,470 --> 00:01:58,290 LSTAT is in the area that they're looking to buy a home in, or what the proportion of non-retail business 23 24 00:01:58,350 --> 00:01:59,110 acres is, 24 25 00:01:59,610 --> 00:02:02,370 which was the INDUS feature in our model. 25 26 00:02:02,370 --> 00:02:07,020 In other words, we'll be working around these limitations and also we'll be making some generous assumptions, 26 27 00:02:07,320 --> 00:02:12,090 but it's all in good fun and we can learn a few things while doing this as well. 27 28 00:02:12,150 --> 00:02:15,270 So let's get started writing some code in a Jupyter notebook. 28 29 00:02:15,510 --> 00:02:21,420 Let's create a new Python 3 notebook to hold our code for our valuation tool. 29 30 00:02:21,450 --> 00:02:31,380 I'm going to call this notebook "04 Valuation Tool" and then I'm going to get started with our import 30 31 00:02:31,380 --> 00:02:32,820 statements. 31 32 00:02:32,820 --> 00:02:34,980 So we're gonna need a couple of things. 32 33 00:02:35,340 --> 00:02:44,550 We're gonna need "from sklearn.datasets import load_boston". We're gonna need scikit- 33 34 00:02:44,580 --> 00:02:57,450 learn's regression capability, so that's "from sklearn.linear_model import LinearRegression", 34 35 00:02:59,400 --> 00:03:13,950 "from sklearn.metrics import mean_squared_error" and then we're gonna import 35 36 00:03:14,310 --> 00:03:22,500 pandas as pd and we're gonna import numpy as np.
36 37 00:03:22,680 --> 00:03:25,250 These are all the import statements that we need for now. 37 38 00:03:25,500 --> 00:03:34,640 Let me hit Shift+Enter. Now I'm going to add a comment in the next cell and it's gonna read "Gather Data". 38 39 00:03:34,830 --> 00:03:39,090 It's time to create our target and our features. 39 40 00:03:39,090 --> 00:03:46,290 If you recall, we can grab our data set by calling the "load_boston()" function. I'm going to store our data set in 40 41 00:03:46,290 --> 00:03:55,530 a variable called "boston_dataset" and that's going to be equal to the return value from "load 41 42 00:03:55,590 --> 00:03:58,560 _boston()". 42 43 00:03:58,560 --> 00:04:07,290 Now let me create a dataframe, I'm going to say "data = pd.DataFrame()" and in the parentheses 43 44 00:04:07,320 --> 00:04:15,800 I'm going to set the data of this data frame equal to "boston_dataset.data", 44 45 00:04:15,810 --> 00:04:20,970 this, if you recall, is not a dataframe, which is why we're extracting the pieces of information that 45 46 00:04:20,970 --> 00:04:30,150 we need, namely our features data, by using that data attribute on the boston_dataset object. Our data 46 47 00:04:30,150 --> 00:04:36,660 frame should also have some columns and these columns have names, so "columns = boston_ 47 48 00:04:36,660 --> 00:04:42,600 dataset.feature_names". 48 49 00:04:42,600 --> 00:04:48,360 I think this is all a little bit of review, but we're just going to convert our data into a format that 49 50 00:04:48,360 --> 00:04:49,730 we need. 50 51 00:04:49,740 --> 00:04:54,260 So what does our dataframe look like at the moment? "data.head()" 51 52 00:04:54,930 --> 00:04:57,850 will show us the first five rows. 52 53 00:04:58,140 --> 00:04:59,240 So that's fair enough.
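[Editor's note] Here is a runnable sketch of this "Gather Data" step. Because scikit-learn removed load_boston in version 1.2, the snippet fakes a tiny stand-in object with the same .data, .feature_names and .target attributes, so the DataFrame construction itself can be tried anywhere:

```python
import numpy as np
import pandas as pd

# Stand-in for the object load_boston() used to return; the attribute names
# match the real one, but the numbers here are made up and much smaller
class FakeBostonDataset:
    data = np.array([[0.01, 18.0],
                     [0.03, 0.0],
                     [0.07, 12.5]])
    feature_names = np.array(['CRIM', 'ZN'])
    target = np.array([24.0, 21.6, 34.7])

boston_dataset = FakeBostonDataset()

# Same construction as in the lesson: the raw .data array supplies the cell
# values and .feature_names supplies the column labels
data = pd.DataFrame(data=boston_dataset.data,
                    columns=boston_dataset.feature_names)
```

With the real dataset the resulting dataframe has 506 rows and 13 columns instead of 3 and 2.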
53 54 00:04:59,280 --> 00:05:06,730 We've got a dataframe with all the features, but we're only gonna use a subset of them, so "features" is 54 55 00:05:06,750 --> 00:05:09,510 gonna be equal to a new data frame, 55 56 00:05:09,540 --> 00:05:10,390 so it's gonna be "data. 56 57 00:05:10,410 --> 00:05:14,580 drop()" and then in the parentheses, 57 58 00:05:14,580 --> 00:05:21,980 it's gonna have a list of things we want to drop - we want to drop INDUS and we want to drop AGE. 58 59 00:05:22,590 --> 00:05:27,160 Both of these are columns, so I'm going to say "axis = 1". 59 60 00:05:27,240 --> 00:05:34,110 Let's take a look at what the first five rows of our features dataset look like. We should be missing 60 61 00:05:34,350 --> 00:05:41,250 this column here and we should be missing this column here, "features.head()" will show us just 61 62 00:05:41,490 --> 00:05:44,650 that. Brilliant. 62 63 00:05:44,680 --> 00:05:47,070 This is what we had before. 63 64 00:05:47,080 --> 00:05:51,820 Now let me delete this line and work out our prices. 64 65 00:05:51,820 --> 00:05:57,490 We're gonna be working with log prices, so I'll create a variable called "log_prices", set that 65 66 00:05:57,490 --> 00:06:01,460 equal to "np.log( 66 67 00:06:01,460 --> 00:06:06,430 boston_dataset.target)". 67 68 00:06:07,210 --> 00:06:14,950 Let's take a look at what this variable looks like. So "log_prices" is an array with 68 69 00:06:14,950 --> 00:06:17,560 506 rows. 69 70 00:06:17,590 --> 00:06:20,150 We can see this by saying "log_ 70 71 00:06:20,140 --> 00:06:30,940 prices.shape". This confirms that we have an array with 506 rows, but this 71 72 00:06:30,940 --> 00:06:32,100 array is flat. 72 73 00:06:32,140 --> 00:06:34,600 It's just one dimensional. 73 74 00:06:34,600 --> 00:06:42,880 In contrast, the shape of our features data frame is 506 by 11. 74 75 00:06:44,200 --> 00:06:51,840 So I'm planning to work with prices that are two dimensional, so 506 by 1.
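[Editor's note] The dropping and log-transform steps look like this in a self-contained sketch; the dataframe and prices below are tiny made-up stand-ins for the real 506-row data:

```python
import numpy as np
import pandas as pd

# Stand-in data with the two columns the lesson drops plus two that it keeps
data = pd.DataFrame({'CRIM': [0.01, 0.03],
                     'INDUS': [2.3, 7.1],
                     'AGE': [65.2, 78.9],
                     'RM': [6.5, 6.4]})
raw_prices = np.array([24.0, 21.6])  # stand-in for boston_dataset.target

# Drop the two unused columns; axis=1 tells pandas these are column labels,
# not row labels
features = data.drop(['INDUS', 'AGE'], axis=1)

# Take the natural log of the prices; the result is a flat, one-dimensional array
log_prices = np.log(raw_prices)
```

Note that drop() returns a new dataframe and leaves the original untouched, which is why the result is assigned to a fresh variable.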
75 76 00:06:51,990 --> 00:07:01,570 I'm going to get there by converting our log prices into a dataframe, so I'll say "target = 76 77 00:07:01,900 --> 00:07:06,990 pd.DataFrame( 77 78 00:07:07,170 --> 00:07:10,290 log_prices, 78 79 00:07:10,520 --> 00:07:16,080 columns = [ 79 80 00:07:16,180 --> 00:07:20,670 'PRICE'])" 80 81 00:07:20,680 --> 00:07:21,940 Here we go. 81 82 00:07:21,940 --> 00:07:25,710 Now if I say "target.shape", 82 83 00:07:26,260 --> 00:07:28,110 let's see what we get. 83 84 00:07:28,300 --> 00:07:31,060 506 by 1. 84 85 00:07:31,090 --> 00:07:32,770 Perfect. 85 86 00:07:32,770 --> 00:07:40,840 Now, as we've said in the introduction, if we want to get an estimate for the value of a property, we basically 86 87 00:07:40,840 --> 00:07:47,350 have to create something that looks like another row of data, something that's structured exactly the 87 88 00:07:47,350 --> 00:07:50,770 way the features dataframe is structured. 88 89 00:07:50,770 --> 00:07:57,190 So 1 row and 11 columns with a value for each column. 89 90 00:07:57,280 --> 00:07:58,760 How could we do this? 90 91 00:07:58,870 --> 00:08:08,980 Say we create a variable called "property_stats", set that equal to an empty ndarray 91 92 00:08:09,010 --> 00:08:09,780 from numpy, 92 93 00:08:09,860 --> 00:08:18,580 so "np.ndarray()" and we want that array to be 1 row by 11 columns. 93 94 00:08:18,610 --> 00:08:19,700 So we'll say 94 95 00:08:19,750 --> 00:08:25,930 "shape = (1, 11)". 95 96 00:08:26,020 --> 00:08:29,140 Okay, so now we have an empty array. 
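[Editor's note] The two objects built in this stretch, the two-dimensional target dataframe and the empty 1 by 11 template row, can be sketched like this (the log prices are stand-in values):

```python
import numpy as np
import pandas as pd

log_prices = np.log([24.0, 21.6, 34.7])  # stand-in for the 506 log prices

# Wrapping the flat array in a one-column dataframe makes it two-dimensional,
# matching the 506-by-1 shape the lesson wants
target = pd.DataFrame(log_prices, columns=['PRICE'])

# A 1-row-by-11-column template for a single property. np.ndarray allocates
# WITHOUT initialising, so until values are assigned it holds whatever
# leftover bytes were in memory - which is what the lesson sees on screen
property_stats = np.ndarray(shape=(1, 11))
```

The uninitialised contents explain the strange near-zero numbers discussed just below.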
96 97 00:08:29,140 --> 00:08:35,470 Now what we can do is give a value for every single column, so we can write something like "property_ 97 98 00:08:35,470 --> 00:08:40,940 stats" and then access the very, very first column, 98 99 00:08:40,990 --> 00:08:43,390 so that'll be in row number 0, 99 100 00:08:43,420 --> 00:08:47,250 so the first row and the first column, column number 0, 100 101 00:08:47,260 --> 00:08:55,780 so "[0][0]" and I can set that to a very particular value, so I can set that 101 102 00:08:55,780 --> 00:08:58,850 to say 0.02. 102 103 00:08:59,050 --> 00:09:02,860 This is now my crime per capita. 103 104 00:09:02,860 --> 00:09:04,380 Let's see what this looks like. 104 105 00:09:04,450 --> 00:09:08,270 "property_stats", Shift+Enter will 105 106 00:09:08,880 --> 00:09:11,200 now show us something like this. 106 107 00:09:11,200 --> 00:09:12,610 This is scientific notation. 107 108 00:09:12,640 --> 00:09:12,970 Yeah. 108 109 00:09:13,000 --> 00:09:19,530 So 0.02 will be 2*10^(-2). 109 110 00:09:19,810 --> 00:09:25,010 And these other values are "10^(-314)". 110 111 00:09:25,030 --> 00:09:32,240 This looks really strange, but what you're looking at is pretty much equal to zero. If I change this value 111 112 00:09:32,240 --> 00:09:39,550 here to say 83 and hit Shift+Enter then you'll see the array displayed like that, you have 83 112 113 00:09:39,550 --> 00:09:43,880 and then 0, 0, 0, 0, 0, 0, right? 113 114 00:09:44,180 --> 00:09:48,890 So I know this might seem confusing, but before we were looking at the output in scientific notation 114 115 00:09:49,850 --> 00:09:58,100 and here we're looking at it more normally. Now a reasonable thing to ask is "How do you know that this very 115 116 00:09:58,100 --> 00:10:01,350 first column here is the crime column?" 116 117 00:10:01,730 --> 00:10:02,950 Yeah, so this value here. 117 118 00:10:02,950 --> 00:10:06,380 How do I know that? This should be around 0.02.
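[Editor's note] The indexing step looks like this; note the sketch swaps in np.zeros for np.ndarray so the unset entries really are zero and the printout is predictable, which the lesson's uninitialised array is not:

```python
import numpy as np

# np.zeros instead of np.ndarray (a deliberate substitution for this sketch):
# same shape, but every entry starts as an actual 0.0
property_stats = np.zeros(shape=(1, 11))

# [0][0] means row 0, column 0 - the crime-per-capita slot in the lesson
property_stats[0][0] = 0.02
```

An equivalent and more idiomatic spelling of the same access is `property_stats[0, 0]`, which indexes both axes in one step.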
118 119 00:10:06,500 --> 00:10:14,600 Yeah, well the answer is that our property_stats variable, our 1 by 11 array, will have the same 119 120 00:10:14,600 --> 00:10:20,990 structure as our features dataframe, so "features.head()", 120 121 00:10:21,260 --> 00:10:26,420 if you recall, will show us that the first column is Crime, 121 122 00:10:26,540 --> 00:10:28,860 the second column is the zones, 122 123 00:10:28,880 --> 00:10:34,070 the third column is the Charles River dummy variable. 123 124 00:10:34,070 --> 00:10:36,490 So one thing that we might do, right, 124 125 00:10:36,530 --> 00:10:42,430 one thing that we might find helpful is if we give these different indices a name. 125 126 00:10:43,250 --> 00:10:48,480 So if we want to set the value of our second column and our third column we could do it like this, 126 127 00:10:48,500 --> 00:10:55,760 I could copy this, paste it twice, change the second zero here in property_stats to 1, 127 128 00:10:55,820 --> 00:11:02,060 this would now be the zone, and if I want the zone to be equal to say 15, then I can do it like this. 128 129 00:11:02,290 --> 00:11:08,000 And if I want the Charles River dummy variable to be equal to, say 1, then I would have to pick index 129 130 00:11:08,200 --> 00:11:11,700 2 and set that equal to 1. 130 131 00:11:12,080 --> 00:11:13,070 You get the idea, right? 131 132 00:11:13,340 --> 00:11:21,740 So property_stats now looks like so, we've got crime, we've got our ZN feature and we have our Charles 132 133 00:11:21,740 --> 00:11:24,280 River dummy variable. 133 134 00:11:24,380 --> 00:11:31,280 Now personally, I find accessing these indices by number very, very confusing, because I'm going to come 134 135 00:11:31,280 --> 00:11:37,820 back in a week's time and I'm not going to remember that crime is at zero or ZN is at 1 and Charles 135 136 00:11:37,820 --> 00:11:39,110 River is at 2.
136 137 00:11:39,120 --> 00:11:45,560 I only know that because I've worked with this dataset and I'm looking at my features dataframe 137 138 00:11:45,980 --> 00:11:47,680 below. 138 139 00:11:47,720 --> 00:11:54,520 So one thing that might be quite handy is if we give these numbers names, right? 139 140 00:11:54,550 --> 00:12:06,150 So I can come up here and say "CRIME_IDX = 0" and I can say "ZN_ 140 141 00:12:06,140 --> 00:12:12,180 IDX = 1" and "CHAS_ 141 142 00:12:12,260 --> 00:12:15,270 IDX = 2" 142 143 00:12:15,380 --> 00:12:16,410 and so on. 143 144 00:12:16,520 --> 00:12:25,370 Now I can come in here and instead of having a zero there, I'll say "CRIME_IDX", instead of having 144 145 00:12:25,370 --> 00:12:33,080 a 1 here, I'll say "ZN_IDX" and so on. 145 146 00:12:33,080 --> 00:12:33,580 Right? 146 147 00:12:33,890 --> 00:12:37,460 "CHAS_IDX". 147 148 00:12:37,670 --> 00:12:45,560 In other words, this is a technique for giving certain hard-coded values a descriptive name, that way when you're 148 149 00:12:45,560 --> 00:12:51,840 using them in your code later on it's a little more clear, a little easier to read. 149 150 00:12:52,070 --> 00:12:58,640 Since we're not really going to change these values here, I've written them in all caps and separated 150 151 00:12:58,640 --> 00:13:01,010 them with an underscore. 151 152 00:13:01,010 --> 00:13:04,230 Now I'm going to add two more named indices here. 152 153 00:13:04,250 --> 00:13:06,680 The first one is going to be for the number of rooms, 153 154 00:13:06,710 --> 00:13:12,480 so "RM_IDX" and that's at index number 4 154 155 00:13:12,860 --> 00:13:22,460 and the next one is "PTRATIO_IDX" and that's at index number 8. Scrolling down you can verify 155 156 00:13:22,460 --> 00:13:23,260 this. 156 157 00:13:23,410 --> 00:13:29,130 0, 1, 2, 3, 4, "RM", 157 158 00:13:29,210 --> 00:13:31,330 5, 6, 7, 8 for PTRATIO. 158 159 00:13:31,500 --> 00:13:33,640 Brilliant.
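[Editor's note] Pulled together, the named-constant technique from this stretch looks like the sketch below (again using np.zeros so the array starts out predictable):

```python
import numpy as np

# Descriptive names for the column positions, written in ALL_CAPS with
# underscores because they are constants we never intend to reassign
CRIME_IDX = 0
ZN_IDX = 1
CHAS_IDX = 2
RM_IDX = 4
PTRATIO_IDX = 8

property_stats = np.zeros(shape=(1, 11))
property_stats[0][CRIME_IDX] = 0.02  # crime per capita
property_stats[0][ZN_IDX] = 15       # residential zone proportion
property_stats[0][CHAS_IDX] = 1      # Charles River dummy variable
```

A week later, `property_stats[0][CHAS_IDX] = 1` still reads clearly, whereas `property_stats[0][2] = 1` would need you to remember the column order.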
159 160 00:13:33,730 --> 00:13:38,930 Now remember how this property_stats array is mostly empty at the moment, 160 161 00:13:39,100 --> 00:13:46,520 it's got zeros for most of these values and it's only got three of these values defined. 161 162 00:13:47,080 --> 00:13:54,520 Now, to be honest, we're not going to be customizing all these values, right, because something like crime 162 163 00:13:54,550 --> 00:14:01,760 per capita is quite hard to know, and the acres of industrial land in a particular area, 163 164 00:14:01,780 --> 00:14:05,300 that's also really hard to know. We're gonna make some assumptions. 164 165 00:14:05,590 --> 00:14:10,390 In other words, for the property that we're looking at, we're just gonna go with the average for all of 165 166 00:14:10,390 --> 00:14:14,830 Boston, for now at least. To get the average, 166 167 00:14:14,830 --> 00:14:23,860 we can simply grab it from our features dataframe, so "features['CRIM'] 167 168 00:14:24,520 --> 00:14:35,170 .mean()" will give us the average and I can of course take this, I can do the same thing for our zones. 168 169 00:14:35,350 --> 00:14:42,610 So "features['ZN'].mean()" and I could do the same thing for Charles River, "features['CHAS'].mean()" and I could do the 169 170 00:14:42,610 --> 00:14:46,820 same thing for all the other features. 170 171 00:14:46,990 --> 00:14:49,420 Now, what would this look like at the moment? 171 172 00:14:49,420 --> 00:14:55,780 If I refresh, I can see this is the average crime per capita, 172 173 00:14:55,910 --> 00:15:02,150 this is the average value for the ZN index and this is the average value for Charles River. 173 174 00:15:02,150 --> 00:15:06,640 I'm going to stop copy pasting code and making this super repetitive. 174 175 00:15:06,980 --> 00:15:13,400 Instead, I'm going to grab the mean value for all the features at the same time.
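[Editor's note] The column-by-column averaging looks like this; the features dataframe here is a tiny stand-in with made-up numbers rather than the real 506-row one:

```python
import pandas as pd

# Tiny stand-in for the features dataframe
features = pd.DataFrame({'CRIM': [0.01, 0.03],
                         'ZN': [18.0, 0.0],
                         'CHAS': [0.0, 1.0]})

# One column at a time, exactly as the lesson does before switching to the
# all-at-once version
crim_avg = features['CRIM'].mean()
zn_avg = features['ZN'].mean()
chas_avg = features['CHAS'].mean()
```

Repeating this for eleven columns is exactly the copy-paste tedium the lesson abandons next in favour of a single `features.mean()` call.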
175 176 00:15:13,490 --> 00:15:18,280 Check it out, "features.mean()" 176 177 00:15:18,490 --> 00:15:24,120 will give me all the mean values, all the average values for all the features. 177 178 00:15:24,140 --> 00:15:33,500 My goal is to populate our property_stats with all these values, our property_stats is an ndarray at 178 179 00:15:33,500 --> 00:15:34,320 the moment. 179 180 00:15:34,610 --> 00:15:35,620 But what is this? 180 181 00:15:35,630 --> 00:15:38,070 Let's take a look at what this is. 181 182 00:15:38,150 --> 00:15:45,140 "type(features.mean())" shows us that this is a Series. 182 183 00:15:45,230 --> 00:15:46,890 It's a different kind of object. 183 184 00:15:47,000 --> 00:15:49,370 So we have to do a little bit of conversion here. 184 185 00:15:49,400 --> 00:15:58,350 We have to make the series object play nice with our array, so "features.mean()" gives us a Series, 185 186 00:15:58,550 --> 00:16:04,760 but the series object has an attribute called values. 186 187 00:16:05,060 --> 00:16:11,310 So, I'm going to copy this, paste it in and show you what this type is 187 188 00:16:11,310 --> 00:16:22,190 by adding ".values" at the end - we can see that the values attribute on a series will give us an 188 189 00:16:22,310 --> 00:16:29,570 ndarray, so array and array, the two things should play nice because they're the same type of object. 189 190 00:16:29,740 --> 00:16:34,570 But remember how this is a 1 by 11 array? 190 191 00:16:34,810 --> 00:16:44,820 Let's double check what the dimensions are of this array here, "features.mean().values.shape" 191 192 00:16:45,460 --> 00:16:48,370 will tell us exactly that. 192 193 00:16:48,370 --> 00:16:52,200 This thing here it turns out is completely flat. 193 194 00:16:52,210 --> 00:16:54,080 It's a one dimensional array. 194 195 00:16:54,280 --> 00:16:59,240 Unlike our property_stats array it is not two dimensional.
195 196 00:16:59,370 --> 00:17:09,200 It means that we have to reshape this array from a flat array with 11 values to a 1 by 11 array. 196 197 00:17:09,240 --> 00:17:14,650 The easiest way to do this is to call the "reshape" method, 197 198 00:17:14,790 --> 00:17:26,160 so "features.mean().values.reshape(1, 11)" will give us exactly what 198 199 00:17:26,210 --> 00:17:26,760 it is 199 200 00:17:26,790 --> 00:17:27,890 we're looking for. 200 201 00:17:27,900 --> 00:17:29,550 Check it out. 201 202 00:17:29,640 --> 00:17:39,290 Brilliant. So I'm going to take this here and I'm going to say "property_stats = features.mean(). 202 203 00:17:39,290 --> 00:17:48,650 values.reshape(1,11)" and this means I do not have to do any of this. I can comment out 203 204 00:17:48,860 --> 00:17:58,670 all of these lines of code and save us all of this work, because we now have a property with some starting 204 205 00:17:58,670 --> 00:18:00,440 characteristics, right. 205 206 00:18:00,830 --> 00:18:05,950 So we have a property, a single row, 11 features, they all have a value 206 207 00:18:06,140 --> 00:18:14,700 and in this case, the value is just the average of all the 506 properties in the dataset. 207 208 00:18:14,720 --> 00:18:20,830 In other words, property_stats is kind of our template for making our prediction. 208 209 00:18:20,860 --> 00:18:23,620 This is the object that we're going to be working with.
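[Editor's note] The whole Series-to-template chain condenses to one line. The sketch below uses a two-column stand-in dataframe, so it reshapes to (1, 2) where the lesson uses (1, 11):

```python
import numpy as np
import pandas as pd

# Stand-in features dataframe; the real one has 11 columns and 506 rows
features = pd.DataFrame({'CRIM': [0.01, 0.03],
                         'ZN': [18.0, 0.0]})

# .mean() gives a Series of column averages, .values extracts the flat
# ndarray underneath, and .reshape turns it into a single two-dimensional
# row - (1, 2) here, (1, 11) in the lesson
property_stats = features.mean().values.reshape(1, 2)
```

The result is a one-row array with the same column order as the dataframe, which is what makes it a valid template row for predictions.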