0 1 00:00:00,530 --> 00:00:07,730 In this lesson we're gonna put everything that we've learned together and build a quick and dirty Property 1 2 00:00:07,730 --> 00:00:13,050 valuation tool for Boston using our existing data set. 2 3 00:00:13,400 --> 00:00:20,330 And this means that we're going to apply all the concepts that we've discussed previously, but also it 3 4 00:00:20,330 --> 00:00:25,940 gives us a chance to expand our knowledge of Python programming techniques. 4 5 00:00:25,940 --> 00:00:32,810 In fact, we're going to package our Boston property valuation tool as a Python module that you can then 5 6 00:00:32,870 --> 00:00:39,140 import into any other notebook just as we've been importing say pandas or numpy. 6 7 00:00:39,590 --> 00:00:45,220 And also, we're going to cover how to write Python functions that have default values for arguments. 7 8 00:00:45,410 --> 00:00:51,780 And we're going to cover how we can include helpful documentation in our Python code as well. 8 9 00:00:51,960 --> 00:00:53,460 So how will this tool work? 9 10 00:00:53,610 --> 00:00:57,040 How will it find a price for a property? 10 11 00:00:57,060 --> 00:01:01,140 Well, it will make use of our existing model. 11 12 00:01:01,410 --> 00:01:08,280 Here we get the theta values from our regression and then all we need to do is plug in custom values 12 13 00:01:08,370 --> 00:01:15,540 for all the features like RM, NOX, LSTAT, CHAS and so on. 13 14 00:01:15,540 --> 00:01:21,860 And once we've done that, we have our y_hat for a property that is not in the dataset. 14 15 00:01:21,900 --> 00:01:24,480 So that's pretty simple, right? Now, 15 16 00:01:24,600 --> 00:01:30,090 of course there are certain limitations of the data set that we've been using and we're gonna have to 16 17 00:01:30,090 --> 00:01:32,190 work around these limitations. 
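[Editor's note] The pricing idea described above can be sketched in a few lines of Python. The theta and feature values below are made-up placeholders, not the actual coefficients from the lesson's regression:

```python
import numpy as np

# Hypothetical theta values from a fitted regression: an intercept followed
# by one weight per feature (placeholders, NOT the lesson's actual estimates)
theta = np.array([2.5, -0.01, 0.05, 0.8])

# Custom values for a property that is not in the dataset: a leading 1.0
# for the intercept term, then made-up values for three features
property_values = np.array([1.0, 0.02, 15.0, 6.5])

# The estimate y_hat is the dot product of the thetas and the feature values
y_hat = theta @ property_values
```

With real coefficients and all eleven features the arrays would simply be longer; the dot product itself does not change.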
17 18 00:01:32,310 --> 00:01:38,640 For starters, we don't have a column with the location of the homes to assist us in pricing homes depending 18 19 00:01:38,640 --> 00:01:40,160 on an area. 19 20 00:01:40,200 --> 00:01:45,270 Also if you are searching for properties online you're not going to be able to input some pieces of 20 21 00:01:45,270 --> 00:01:50,580 information that are just very, very abstract, like nobody is really going to know what the correct 21 22 00:01:50,580 --> 00:01:51,410 value for 22 23 00:01:51,470 --> 00:01:58,290 LSTAT is in the area that they're looking to buy a home in, or what the proportion of non-retail business 23 24 00:01:58,350 --> 00:01:59,110 acres is, 24 25 00:01:59,610 --> 00:02:02,370 which was the INDUS feature in our model. 25 26 00:02:02,370 --> 00:02:07,020 In other words, we'll be working around these limitations and also we'll be making some generous assumptions, 26 27 00:02:07,320 --> 00:02:12,090 but it's all in good fun and we can learn a few things while doing this as well. 27 28 00:02:12,150 --> 00:02:15,270 So let's get started writing some code in a Jupyter notebook. 28 29 00:02:15,510 --> 00:02:21,420 Let's create a new Python 3 notebook to hold our code for our valuation tool. 29 30 00:02:21,450 --> 00:02:31,380 I'm going to call this notebook "04 Valuation Tool" and then I'm going to get started with our import 30 31 00:02:31,380 --> 00:02:32,820 statements. 31 32 00:02:32,820 --> 00:02:34,980 So we're gonna need a couple of things. 32 33 00:02:35,340 --> 00:02:44,550 We're gonna need "from sklearn.datasets import load_boston". We're gonna need scikit- 33 34 00:02:44,580 --> 00:02:57,450 learn's regression capability, so that's "from sklearn.linear_model import LinearRegression", 34 35 00:02:59,400 --> 00:03:13,950 "from sklearn.metrics import mean_squared_error" and then we're gonna import 35 36 00:03:14,310 --> 00:03:22,500 pandas as pd and we're gonna import numpy as np.
36 37 00:03:22,680 --> 00:03:25,250 These are all the import statements that we need for now. 37 38 00:03:25,500 --> 00:03:34,640 Let me hit Shift+Enter. Now I'm going to add a comment in the next cell and it's gonna read "Gather Data". 38 39 00:03:34,830 --> 00:03:39,090 It's time to create our target and our features. 39 40 00:03:39,090 --> 00:03:46,290 If you recall, we can grab our data set by calling the "load_boston()" function. I'm going to store our data set in 40 41 00:03:46,290 --> 00:03:55,530 a variable called "boston_dataset" and that's going to be equal to the return value from "load 41 42 00:03:55,590 --> 00:03:58,560 _boston()". 42 43 00:03:58,560 --> 00:04:07,290 Now let me create a dataframe, I'm going to say "data = pd.DataFrame()" and in the parentheses 43 44 00:04:07,320 --> 00:04:15,800 I'm going to set the data of this data frame equal to "boston_dataset.data", 44 45 00:04:15,810 --> 00:04:20,970 this, if you recall, is not a dataframe, which is why we're extracting the pieces of information that 45 46 00:04:20,970 --> 00:04:30,150 we need, namely our features data, by using that data attribute on the boston_dataset object. Our data 46 47 00:04:30,150 --> 00:04:36,660 frame should also have some columns and these columns have names, so "columns = boston_ 47 48 00:04:36,660 --> 00:04:42,600 dataset.feature_names". 48 49 00:04:42,600 --> 00:04:48,360 I think this is all a little bit of review, but we're just going to convert our data into a format that 49 50 00:04:48,360 --> 00:04:49,730 we need. 50 51 00:04:49,740 --> 00:04:54,260 So what does our dataframe look like at the moment? "data.head()" 51 52 00:04:54,930 --> 00:04:57,850 will show us the first five rows. 52 53 00:04:58,140 --> 00:04:59,240 So that's fair enough.
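[Editor's note] Here is a runnable sketch of this "Gather Data" step. Because scikit-learn removed load_boston in version 1.2, the snippet fakes a tiny stand-in object with the same .data, .feature_names and .target attributes, so the DataFrame construction itself can be tried anywhere:

```python
import numpy as np
import pandas as pd

# Stand-in for the object load_boston() used to return; the attribute names
# match the real one, but the numbers here are made up and much smaller
class FakeBostonDataset:
    data = np.array([[0.01, 18.0],
                     [0.03, 0.0],
                     [0.07, 12.5]])
    feature_names = np.array(['CRIM', 'ZN'])
    target = np.array([24.0, 21.6, 34.7])

boston_dataset = FakeBostonDataset()

# Same construction as in the lesson: the raw .data array supplies the cell
# values and .feature_names supplies the column labels
data = pd.DataFrame(data=boston_dataset.data,
                    columns=boston_dataset.feature_names)
```

With the real dataset the resulting dataframe has 506 rows and 13 columns instead of 3 and 2.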
53 54 00:04:59,280 --> 00:05:06,730 We've got a dataframe with all the features, but we're only gonna use a subset of them, so "features" is 54 55 00:05:06,750 --> 00:05:09,510 gonna be equal to a new data frame, 55 56 00:05:09,540 --> 00:05:10,390 so it's gonna be "data. 56 57 00:05:10,410 --> 00:05:14,580 drop()" and then in the parentheses, 57 58 00:05:14,580 --> 00:05:21,980 it's gonna have a list of things we want to drop - we want to drop INDUS and we want to drop AGE. 58 59 00:05:22,590 --> 00:05:27,160 Both of these are columns, so I'm going to say "axis = 1". 59 60 00:05:27,240 --> 00:05:34,110 Let's take a look at what the first five rows of our features dataset look like. We should be missing 60 61 00:05:34,350 --> 00:05:41,250 this column here and we should be missing this column here, "features.head()" will show us just 61 62 00:05:41,490 --> 00:05:44,650 that. Brilliant. 62 63 00:05:44,680 --> 00:05:47,070 This is what we had before. 63 64 00:05:47,080 --> 00:05:51,820 Now let me delete this line and work out our prices. 64 65 00:05:51,820 --> 00:05:57,490 We're gonna be working with log prices, so I'll create a variable called "log_prices", set that 65 66 00:05:57,490 --> 00:06:01,460 equal to "np.log( 66 67 00:06:01,460 --> 00:06:06,430 boston_dataset.target)". 67 68 00:06:07,210 --> 00:06:14,950 Let's take a look at what this variable looks like. So "log_prices" is an array with 68 69 00:06:14,950 --> 00:06:17,560 506 rows. 69 70 00:06:17,590 --> 00:06:20,150 We can see this by saying "log_ 70 71 00:06:20,140 --> 00:06:30,940 prices.shape". This confirms that we have an array with 506 rows, but this 71 72 00:06:30,940 --> 00:06:32,100 array is flat. 72 73 00:06:32,140 --> 00:06:34,600 It's just one dimensional. 73 74 00:06:34,600 --> 00:06:42,880 In contrast, the shape of our features data frame is 506 by 11. 74 75 00:06:44,200 --> 00:06:51,840 So I'm planning to work with prices that are two dimensional, so 506 by 1.
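[Editor's note] The dropping and log-transform steps look like this in a self-contained sketch; the dataframe and prices below are tiny made-up stand-ins for the real 506-row data:

```python
import numpy as np
import pandas as pd

# Stand-in data with the two columns the lesson drops plus two that it keeps
data = pd.DataFrame({'CRIM': [0.01, 0.03],
                     'INDUS': [2.3, 7.1],
                     'AGE': [65.2, 78.9],
                     'RM': [6.5, 6.4]})
raw_prices = np.array([24.0, 21.6])  # stand-in for boston_dataset.target

# Drop the two unused columns; axis=1 tells pandas these are column labels,
# not row labels
features = data.drop(['INDUS', 'AGE'], axis=1)

# Take the natural log of the prices; the result is a flat, one-dimensional array
log_prices = np.log(raw_prices)
```

Note that drop() returns a new dataframe and leaves the original untouched, which is why the result is assigned to a fresh variable.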
75 76 00:06:51,990 --> 00:07:01,570 I'm going to get there by converting our log prices into a dataframe, so I'll say "target = 76 77 00:07:01,900 --> 00:07:06,990 pd.DataFrame( 77 78 00:07:07,170 --> 00:07:10,290 log_prices, 78 79 00:07:10,520 --> 00:07:16,080 columns = [ 79 80 00:07:16,180 --> 00:07:20,670 'PRICE'])" 80 81 00:07:20,680 --> 00:07:21,940 Here we go. 81 82 00:07:21,940 --> 00:07:25,710 Now if I say "target.shape", 82 83 00:07:26,260 --> 00:07:28,110 let's see what we get. 83 84 00:07:28,300 --> 00:07:31,060 506 by 1. 84 85 00:07:31,090 --> 00:07:32,770 Perfect. 85 86 00:07:32,770 --> 00:07:40,840 Now, as we've said in the introduction, if we want to get an estimate for the value of a property, we basically 86 87 00:07:40,840 --> 00:07:47,350 have to create something that looks like another row of data, something that's structured exactly the 87 88 00:07:47,350 --> 00:07:50,770 way the features dataframe is structured. 88 89 00:07:50,770 --> 00:07:57,190 So 1 row and 11 columns with a value for each column. 89 90 00:07:57,280 --> 00:07:58,760 How could we do this? 90 91 00:07:58,870 --> 00:08:08,980 Say we create a variable called "property_stats", set that equal to an empty ndarray 91 92 00:08:09,010 --> 00:08:09,780 from numpy, 92 93 00:08:09,860 --> 00:08:18,580 so "np.ndarray()" and we want that array to be 1 row by 11 columns. 93 94 00:08:18,610 --> 00:08:19,700 So we'll say 94 95 00:08:19,750 --> 00:08:25,930 "shape = (1, 11)". 95 96 00:08:26,020 --> 00:08:29,140 Okay, so now we have an empty array. 
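[Editor's note] The two objects built in this stretch, the two-dimensional target dataframe and the empty 1 by 11 template row, can be sketched like this (the log prices are stand-in values):

```python
import numpy as np
import pandas as pd

log_prices = np.log([24.0, 21.6, 34.7])  # stand-in for the 506 log prices

# Wrapping the flat array in a one-column dataframe makes it two-dimensional,
# matching the 506-by-1 shape the lesson wants
target = pd.DataFrame(log_prices, columns=['PRICE'])

# A 1-row-by-11-column template for a single property. np.ndarray allocates
# WITHOUT initialising, so until values are assigned it holds whatever
# leftover bytes were in memory - which is what the lesson sees on screen
property_stats = np.ndarray(shape=(1, 11))
```

The uninitialised contents explain the strange near-zero numbers discussed just below.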
96 97 00:08:29,140 --> 00:08:35,470 Now what we can do is give a value for every single column, so we can write something like "property_ 97 98 00:08:35,470 --> 00:08:40,940 stats" and then access the very, very first column, 98 99 00:08:40,990 --> 00:08:43,390 so that'll be in row number 0, 99 100 00:08:43,420 --> 00:08:47,250 so the first row and the first column, column number 0, 100 101 00:08:47,260 --> 00:08:55,780 so "[0][0]" and I can set that to a very particular value, so I can set that 101 102 00:08:55,780 --> 00:08:58,850 to say 0.02. 102 103 00:08:59,050 --> 00:09:02,860 This is now my crime per capita. 103 104 00:09:02,860 --> 00:09:04,380 Let's see what this looks like. 104 105 00:09:04,450 --> 00:09:08,270 "property_stats", Shift+Enter will 105 106 00:09:08,880 --> 00:09:11,200 now show us something like this. 106 107 00:09:11,200 --> 00:09:12,610 This is scientific notation. 107 108 00:09:12,640 --> 00:09:12,970 Yeah. 108 109 00:09:13,000 --> 00:09:19,530 So 0.02 will be 2*10^(-2). 109 110 00:09:19,810 --> 00:09:25,010 And these other values are "10^(-314)". 110 111 00:09:25,030 --> 00:09:32,240 This looks really strange, but what you're looking at is pretty much equal to zero. If I change this value 111 112 00:09:32,240 --> 00:09:39,550 here to say 83 and hit Shift+Enter then you'll see the array displayed like that, you have 83 112 113 00:09:39,550 --> 00:09:43,880 and then 0, 0, 0, 0, 0, 0, right? 113 114 00:09:44,180 --> 00:09:48,890 So I know this might seem confusing, but before we were looking at the output in scientific notation 114 115 00:09:49,850 --> 00:09:58,100 and here we're looking at it more normally. Now a reasonable thing to ask is "How do you know that this very 115 116 00:09:58,100 --> 00:10:01,350 first column here is the crime column?" 116 117 00:10:01,730 --> 00:10:02,950 Yeah, so this value here. 117 118 00:10:02,950 --> 00:10:06,380 How do I know that? This should be around 0.02.
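[Editor's note] The indexing step looks like this; note the sketch swaps in np.zeros for np.ndarray so the unset entries really are zero and the printout is predictable, which the lesson's uninitialised array is not:

```python
import numpy as np

# np.zeros instead of np.ndarray (a deliberate substitution for this sketch):
# same shape, but every entry starts as an actual 0.0
property_stats = np.zeros(shape=(1, 11))

# [0][0] means row 0, column 0 - the crime-per-capita slot in the lesson
property_stats[0][0] = 0.02
```

An equivalent and more idiomatic spelling of the same access is `property_stats[0, 0]`, which indexes both axes in one step.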
118 119 00:10:06,500 --> 00:10:14,600 Yeah, well the answer is that our property_stats variable, our 1 by 11 array, will have the same 119 120 00:10:14,600 --> 00:10:20,990 structure as our features dataframe, so "features.head()", 120 121 00:10:21,260 --> 00:10:26,420 if you recall, will show us that the first column is Crime, 121 122 00:10:26,540 --> 00:10:28,860 the second column is the zones, 122 123 00:10:28,880 --> 00:10:34,070 the third column is the Charles River dummy variable. 123 124 00:10:34,070 --> 00:10:36,490 So one thing that we might do, right, 124 125 00:10:36,530 --> 00:10:42,430 one thing that we might find helpful is if we give these different indices a name. 125 126 00:10:43,250 --> 00:10:48,480 So if we want to set the value of our second column and our third column we could do it like this, 126 127 00:10:48,500 --> 00:10:55,760 I could copy this, paste it twice, change the second zero here in property_stats to 1, 127 128 00:10:55,820 --> 00:11:02,060 this would now be the zone, and if I want the zone to be equal to say 15, then I can do it like this. 128 129 00:11:02,290 --> 00:11:08,000 And if I want the Charles River dummy variable to be equal to, say 1, then I would have to pick index 129 130 00:11:08,200 --> 00:11:11,700 2 and set that equal to 1. 130 131 00:11:12,080 --> 00:11:13,070 You get the idea, right? 131 132 00:11:13,340 --> 00:11:21,740 So property_stats now looks like so, we've got crime, we've got our ZN feature and we have our Charles 132 133 00:11:21,740 --> 00:11:24,280 River dummy variable. 133 134 00:11:24,380 --> 00:11:31,280 Now personally, I find accessing these indices by number very, very confusing, because I'm going to come 134 135 00:11:31,280 --> 00:11:37,820 back in a week's time and I'm not going to remember that crime is at zero or ZN is at 1 and Charles 135 136 00:11:37,820 --> 00:11:39,110 River is at 2.
136 137 00:11:39,120 --> 00:11:45,560 I only know that because I've worked with this dataset and I'm looking at my features dataframe 137 138 00:11:45,980 --> 00:11:47,680 below. 138 139 00:11:47,720 --> 00:11:54,520 So one thing that might be quite handy is if we give these numbers names, right? 139 140 00:11:54,550 --> 00:12:06,150 So I can come up here and say "CRIME_IDX = 0" and I can say "ZN_ 140 141 00:12:06,140 --> 00:12:12,180 IDX = 1" and "CHAS_ 141 142 00:12:12,260 --> 00:12:15,270 IDX = 2" 142 143 00:12:15,380 --> 00:12:16,410 and so on. 143 144 00:12:16,520 --> 00:12:25,370 Now I can come in here and instead of having a zero there, I'll say "CRIME_IDX", instead of having 144 145 00:12:25,370 --> 00:12:33,080 a 1 here, I'll say "ZN_IDX" and so on. 145 146 00:12:33,080 --> 00:12:33,580 Right? 146 147 00:12:33,890 --> 00:12:37,460 "CHAS_IDX". 147 148 00:12:37,670 --> 00:12:45,560 In other words, this is a technique for giving certain hard-coded values a descriptive name, that way when you're 148 149 00:12:45,560 --> 00:12:51,840 using them in your code later on it's a little more clear, a little easier to read. 149 150 00:12:52,070 --> 00:12:58,640 Since we're not really going to change these values here, I've written them in all caps and separated 150 151 00:12:58,640 --> 00:13:01,010 them with an underscore. 151 152 00:13:01,010 --> 00:13:04,230 Now I'm going to add two more named indices here. 152 153 00:13:04,250 --> 00:13:06,680 The first one is going to be for the number of rooms, 153 154 00:13:06,710 --> 00:13:12,480 so "RM_IDX" and that's at index number 4 154 155 00:13:12,860 --> 00:13:22,460 and the next one is "PTRATIO_IDX" and that's at index number 8. Scrolling down you can verify 155 156 00:13:22,460 --> 00:13:23,260 this. 156 157 00:13:23,410 --> 00:13:29,130 0, 1, 2, 3, 4, "RM", 157 158 00:13:29,210 --> 00:13:31,330 5, 6, 7, 8 for PTRATIO. 158 159 00:13:31,500 --> 00:13:33,640 Brilliant.
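[Editor's note] Pulled together, the named-constant technique from this stretch looks like the sketch below (again using np.zeros so the array starts out predictable):

```python
import numpy as np

# Descriptive names for the column positions, written in ALL_CAPS with
# underscores because they are constants we never intend to reassign
CRIME_IDX = 0
ZN_IDX = 1
CHAS_IDX = 2
RM_IDX = 4
PTRATIO_IDX = 8

property_stats = np.zeros(shape=(1, 11))
property_stats[0][CRIME_IDX] = 0.02  # crime per capita
property_stats[0][ZN_IDX] = 15       # residential zone proportion
property_stats[0][CHAS_IDX] = 1      # Charles River dummy variable
```

A week later, `property_stats[0][CHAS_IDX] = 1` still reads clearly, whereas `property_stats[0][2] = 1` would need you to remember the column order.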
159 160 00:13:33,730 --> 00:13:38,930 Now remember how this property_stats array is mostly empty at the moment, 160 161 00:13:39,100 --> 00:13:46,520 it's got zeros for most of these values and it's only got three of these values defined. 161 162 00:13:47,080 --> 00:13:54,520 Now, to be honest, we're not going to be customizing all these values, right, because something like crime 162 163 00:13:54,550 --> 00:14:01,760 per capita is quite hard to know, and the acres of industrial land in a particular area, 163 164 00:14:01,780 --> 00:14:05,300 that's also really hard to know. We're gonna make some assumptions. 164 165 00:14:05,590 --> 00:14:10,390 In other words, for the property that we're looking at, we're just gonna go with the average for all of 165 166 00:14:10,390 --> 00:14:14,830 Boston, for now at least. To get the average, 166 167 00:14:14,830 --> 00:14:23,860 we can simply grab it from our features dataframe, so "features['CRIM'] 167 168 00:14:24,520 --> 00:14:35,170 .mean()" will give us the average and I can of course take this, I can do the same thing for our zones. 168 169 00:14:35,350 --> 00:14:42,610 So "features['ZN'].mean()" and I could do the same thing for Charles River, "features['CHAS'].mean()" and I could do the 169 170 00:14:42,610 --> 00:14:46,820 same thing for all the other features. 170 171 00:14:46,990 --> 00:14:49,420 Now, what would this look like at the moment? 171 172 00:14:49,420 --> 00:14:55,780 If I refresh, I can see this is the average crime per capita, 172 173 00:14:55,910 --> 00:15:02,150 this is the average value for the ZN index and this is the average value for Charles River. 173 174 00:15:02,150 --> 00:15:06,640 I'm going to stop copy pasting code and making this super repetitive. 174 175 00:15:06,980 --> 00:15:13,400 Instead, I'm going to grab the mean value for all the features at the same time.
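[Editor's note] The column-by-column averaging looks like this; the features dataframe here is a tiny stand-in with made-up numbers rather than the real 506-row one:

```python
import pandas as pd

# Tiny stand-in for the features dataframe
features = pd.DataFrame({'CRIM': [0.01, 0.03],
                         'ZN': [18.0, 0.0],
                         'CHAS': [0.0, 1.0]})

# One column at a time, exactly as the lesson does before switching to the
# all-at-once version
crim_avg = features['CRIM'].mean()
zn_avg = features['ZN'].mean()
chas_avg = features['CHAS'].mean()
```

Repeating this for eleven columns is exactly the copy-paste tedium the lesson abandons next in favour of a single `features.mean()` call.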
175 176 00:15:13,490 --> 00:15:18,280 Check it out, "features.mean()" 176 177 00:15:18,490 --> 00:15:24,120 will give me all the mean values, all the average values for all the features. 177 178 00:15:24,140 --> 00:15:33,500 My goal is to populate our property_stats with all these values, our property_stats is an ndarray at 178 179 00:15:33,500 --> 00:15:34,320 the moment. 179 180 00:15:34,610 --> 00:15:35,620 But what is this? 180 181 00:15:35,630 --> 00:15:38,070 Let's take a look at what this is. 181 182 00:15:38,150 --> 00:15:45,140 "type(features.mean())" shows us that this is a Series. 182 183 00:15:45,230 --> 00:15:46,890 It's a different kind of object. 183 184 00:15:47,000 --> 00:15:49,370 So we have to do a little bit of conversion here. 184 185 00:15:49,400 --> 00:15:58,350 We have to make the series object play nice with our array, so "features.mean()" gives us a Series, 185 186 00:15:58,550 --> 00:16:04,760 but the series object has an attribute called values. 186 187 00:16:05,060 --> 00:16:11,310 So, I'm going to copy this, paste it in and show you what this type is 187 188 00:16:11,310 --> 00:16:22,190 by adding ".values" at the end - we can see that the values attribute on a series will give us an 188 189 00:16:22,310 --> 00:16:29,570 ndarray, so array and array, the two things should play nice because they're the same type of object. 189 190 00:16:29,740 --> 00:16:34,570 But remember how this is a 1 by 11 array? 190 191 00:16:34,810 --> 00:16:44,820 Let's double check what the dimensions are of this array here, "features.mean().values.shape" 191 192 00:16:45,460 --> 00:16:48,370 will tell us exactly that. 192 193 00:16:48,370 --> 00:16:52,200 This thing here it turns out is completely flat. 193 194 00:16:52,210 --> 00:16:54,080 It's a one dimensional array. 194 195 00:16:54,280 --> 00:16:59,240 Unlike our property_stats array it is not two dimensional.
195 196 00:16:59,370 --> 00:17:09,200 It means that we have to reshape this array from a flat array with 11 values to a 1 by 11 array. 196 197 00:17:09,240 --> 00:17:14,650 The easiest way to do this is to call the "reshape" method, 197 198 00:17:14,790 --> 00:17:26,160 so "features.mean().values.reshape(1, 11)" will give us exactly what 198 199 00:17:26,210 --> 00:17:26,760 it is 199 200 00:17:26,790 --> 00:17:27,890 we're looking for. 200 201 00:17:27,900 --> 00:17:29,550 Check it out. 201 202 00:17:29,640 --> 00:17:39,290 Brilliant. So I'm going to take this here and I'm going to say "property_stats = features.mean(). 202 203 00:17:39,290 --> 00:17:48,650 values.reshape(1,11)" and this means I do not have to do any of this. I can comment out 203 204 00:17:48,860 --> 00:17:58,670 all of these lines of code and save us all of this work, because we now have a property with some starting 204 205 00:17:58,670 --> 00:18:00,440 characteristics, right. 205 206 00:18:00,830 --> 00:18:05,950 So we have a property, a single row, 11 features, they all have a value 206 207 00:18:06,140 --> 00:18:14,700 and in this case, the value is just the average of all the 506 properties in the dataset. 207 208 00:18:14,720 --> 00:18:20,830 In other words, property_stats is kind of our template for making our prediction. 208 209 00:18:20,860 --> 00:18:23,620 This is the object that we're going to be working with.
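[Editor's note] The whole Series-to-template chain condenses to one line. The sketch below uses a two-column stand-in dataframe, so it reshapes to (1, 2) where the lesson uses (1, 11):

```python
import numpy as np
import pandas as pd

# Stand-in features dataframe; the real one has 11 columns and 506 rows
features = pd.DataFrame({'CRIM': [0.01, 0.03],
                         'ZN': [18.0, 0.0]})

# .mean() gives a Series of column averages, .values extracts the flat
# ndarray underneath, and .reshape turns it into a single two-dimensional
# row - (1, 2) here, (1, 11) in the lesson
property_stats = features.mean().values.reshape(1, 2)
```

The result is a one-row array with the same column order as the dataframe, which is what makes it a valid template row for predictions.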