0 1 00:00:00,570 --> 00:00:01,170 All right. 1 2 00:00:01,320 --> 00:00:05,210 So now let's change track a little bit. 2 3 00:00:05,220 --> 00:00:09,900 We now have our template for our property that we want to get an estimate for. 3 4 00:00:10,170 --> 00:00:14,660 That template is currently populated with the average values in the dataset, 4 5 00:00:14,670 --> 00:00:19,830 so now let's get the estimated theta values for our model. 5 6 00:00:20,540 --> 00:00:27,120 And while we're doing that, we can also calculate the root mean squared error for our prediction interval. 6 7 00:00:28,180 --> 00:00:30,720 I'm going to be using scikit-learn to do this. 7 8 00:00:31,210 --> 00:00:42,060 Let's store our regression in a variable called "regr" and that's going to be equal to LinearRegression(), 8 9 00:00:42,100 --> 00:00:50,950 so this is scikit-learn's linear regression, and we can use that linear regression to already fit our values, 9 10 00:00:51,010 --> 00:00:58,870 so we can calculate all the theta parameters simply by calling the fit method on this and the fit method 10 11 00:00:59,320 --> 00:01:08,820 needs two inputs, namely the features and our target. If I hit Shift+Enter, 11 12 00:01:09,010 --> 00:01:16,930 this calculates all the theta values in the background. To get the predicted values or the fitted values, 12 13 00:01:16,930 --> 00:01:23,410 we can use "regr.predict(features)", 13 14 00:01:23,410 --> 00:01:31,270 so based on our features dataframe we are calculating all the predicted values using the thetas from 14 15 00:01:31,270 --> 00:01:32,620 our model. 15 16 00:01:32,640 --> 00:01:40,780 Let me store this in a variable called "fitted_vals". "fitted_vals" 16 17 00:01:40,840 --> 00:01:42,840 now has all our predictions. 17 18 00:01:42,940 --> 00:01:49,580 The next step is calculating our mean squared error and our root mean squared error. 18 19 00:01:50,680 --> 00:01:56,950 And this is where I'm going to throw it over to you. 
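The fit and predict steps described above can be sketched as follows. This is only a minimal sketch on a small made-up dataset, not the housing data from the lesson: the names "features", "target", "regr" and "fitted_vals" follow the transcript, while the numbers are invented so that the correct thetas are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data (an assumption for illustration): target = 1 + 2*x1 + 3*x2 exactly.
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0],
                     [5.0, 6.0]])
target = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]

regr = LinearRegression()
regr.fit(features, target)            # estimates the intercept and the thetas

fitted_vals = regr.predict(features)  # one fitted value per row of features

print(regr.intercept_, regr.coef_)
print(fitted_vals)
```

Because the made-up target is an exact linear function of the features, the fitted values should match the target almost perfectly; real data will leave residuals.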
Can you calculate the mean squared error and the root 19 20 00:01:57,190 --> 00:02:01,740 mean squared error using our scikit-learn module? 20 21 00:02:02,140 --> 00:02:05,730 I'll give you a few seconds to pause the video and give this a go. 21 22 00:02:07,270 --> 00:02:08,100 Ready? 22 23 00:02:08,110 --> 00:02:09,330 Here's the solution. 23 24 00:02:09,460 --> 00:02:16,170 The MSE is going to be equal to the return value from our mean_squared_error 24 25 00:02:16,180 --> 00:02:25,000 function that we've imported at the top, "mean_squared_error" needs two inputs - the 25 26 00:02:25,000 --> 00:02:27,790 target and the fitted values. 26 27 00:02:30,880 --> 00:02:37,830 If we output the mean_squared_error for our regression it's currently equal to 0.035 27 28 00:02:38,060 --> 00:02:39,640 approximately. 28 29 00:02:39,760 --> 00:02:41,860 Now what about the root mean squared error, 29 30 00:02:41,950 --> 00:02:50,380 the RMSE? That's simply going to be equal to the square root of the mean squared error. 30 31 00:02:50,380 --> 00:02:52,060 Now I'm going to use numpy again, 31 32 00:02:52,240 --> 00:02:57,390 so "np.sqrt(MSE)" 32 33 00:02:57,610 --> 00:03:00,280 will give me the root mean squared error. 33 34 00:03:00,550 --> 00:03:09,780 The value of the RMSE is equal to roughly 0.187, and the units for both, 34 35 00:03:09,780 --> 00:03:12,090 the mean squared error and the root mean squared error 35 36 00:03:12,220 --> 00:03:16,450 are still log dollar prices in thousands. 36 37 00:03:16,450 --> 00:03:20,120 Now we have everything in place to make our predictions. 37 38 00:03:20,280 --> 00:03:25,730 I'm going to delete this here, hit Shift+Enter and go down to the next cell. 38 39 00:03:25,840 --> 00:03:35,770 What I want to do here is I want to create a Python function which will estimate the log house prices 39 40 00:03:35,950 --> 00:03:38,940 for a specific property. 
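The MSE and RMSE calculation from the solution above can be sketched like this. The arrays here are invented stand-ins for "target" and "fitted_vals", so the printed numbers will not match the ones quoted in the lesson; the manual cross-check is an addition for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented stand-ins for the lesson's target and fitted values.
target = np.array([3.0, 2.5, 4.0])
fitted_vals = np.array([2.5, 3.0, 4.0])

MSE = mean_squared_error(target, fitted_vals)  # average of the squared errors
RMSE = np.sqrt(MSE)                            # back on the scale of the target

# The same MSE computed by hand, as a cross-check.
manual_mse = np.mean((target - fitted_vals) ** 2)

print(MSE, RMSE, manual_mse)
```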
40 41 00:03:39,040 --> 00:03:45,670 We'll get the log prices first, the log estimate, using our data set and afterwards we'll do step 2 where 41 42 00:03:45,670 --> 00:03:50,860 we convert this output into today's dollar values. 42 43 00:03:50,860 --> 00:03:54,140 Now we've not written a Python function of our own for a little while. 43 44 00:03:54,190 --> 00:03:56,980 So this is a good chance to review this as well. 44 45 00:03:57,220 --> 00:04:03,680 The way you create a function is with the "def" keyword, "def" and then the function name, 45 46 00:04:03,680 --> 00:04:04,800 I'm going to call this function 46 47 00:04:04,810 --> 00:04:12,140 "get_log_estimate():". 47 48 00:04:12,160 --> 00:04:21,040 Let's have this function return a dummy value for now, so I'm going to say "log_estimate = 48 49 00:04:21,480 --> 00:04:24,310 21" and then down here 49 50 00:04:24,550 --> 00:04:33,350 I'll specify a return value for this function, so I'll say "return log_estimate". 50 51 00:04:33,810 --> 00:04:37,650 Now that we've defined our function, we can use it. 51 52 00:04:37,680 --> 00:04:43,460 We can call it, so "get_log_estimate()" 52 53 00:04:43,680 --> 00:04:48,840 will output the value 21. 53 54 00:04:48,860 --> 00:04:49,910 Fair enough. 54 55 00:04:49,910 --> 00:04:52,820 So this is pretty much all the review. 55 56 00:04:52,820 --> 00:04:57,950 The thing I actually want to show you in this lesson is not just how to create your own Python function, 56 57 00:04:58,310 --> 00:05:06,590 but how to create a Python function with arguments some of which have a default value, because once you 57 58 00:05:06,590 --> 00:05:14,970 do that when calling the function the arguments that already have a default value are optional. 58 59 00:05:14,970 --> 00:05:16,600 Let me show you what I mean by that. 
59 60 00:05:16,860 --> 00:05:25,290 If I come in here into my function definition and I specify an argument, say "nr_rooms", 60 61 00:05:26,220 --> 00:05:33,810 and hit Shift+Enter, when it comes to calling our Python function, I now need to supply this argument, 61 62 00:05:33,810 --> 00:05:41,400 so if I hit Shift+Enter now, I'll get a type error, because I'm missing one required positional argument, 62 63 00:05:42,060 --> 00:05:43,480 nr_rooms. 63 64 00:05:44,580 --> 00:05:51,860 Calling the function now requires that I have some value for the number of rooms. 64 65 00:05:52,260 --> 00:05:59,840 Only then will the code execute. Let's add four arguments to this function total, number of rooms being 65 66 00:05:59,840 --> 00:06:07,610 1, the next one being "students_per_classroom", 66 67 00:06:07,610 --> 00:06:18,560 so this will be our "PTRATIO", comma, and then we'll have "next_to_river", 67 68 00:06:18,560 --> 00:06:24,060 comma, and then we'll have a fourth argument, "high_ 68 69 00:06:24,060 --> 00:06:31,260 confidence", and this last argument will be for whether or not we want to have a 95% prediction 69 70 00:06:31,260 --> 00:06:32,880 interval. 70 71 00:06:32,920 --> 00:06:33,300 All right. 71 72 00:06:33,300 --> 00:06:34,710 So now we have four arguments, 72 73 00:06:34,830 --> 00:06:43,010 meaning we need to specify values for all of these arguments here. To make one of these arguments optional, 73 74 00:06:43,050 --> 00:06:44,830 say next_to_river, 74 75 00:06:45,030 --> 00:06:50,910 I can give it a default value in our function signature, in our function definition. 75 76 00:06:50,970 --> 00:06:59,860 So next_to_river can be equal to say False and high_confidence can be equal to say True. 76 77 00:07:00,450 --> 00:07:07,770 What we've just done is made two of these arguments optional, meaning we only need to specify the number 77 78 00:07:07,770 --> 00:07:10,710 of rooms and the students per classroom. 
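The effect of default values on which arguments are required can be sketched as follows. The function name and argument names are the ones from the lesson, and the body still returns the dummy value 21; the try/except around the failing call is an addition so the snippet runs to the end.

```python
def get_log_estimate(nr_rooms,
                     students_per_classroom,
                     next_to_river=False,
                     high_confidence=True):
    # Dummy return value, as at this point in the lesson.
    log_estimate = 21
    return log_estimate


# Only the two arguments without a default are required.
print(get_log_estimate(5, 19))

# An argument with a default can still be overridden by keyword.
print(get_log_estimate(5, 19, next_to_river=True))

# Leaving out a required argument raises a TypeError.
try:
    get_log_estimate(5)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

One rule worth keeping in mind: in a function definition, arguments with default values have to come after the arguments without them.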
78 79 00:07:11,540 --> 00:07:19,920 So coming down here where I'm calling this function, I can have 5 and say 19 and this cell will now 79 80 00:07:19,950 --> 00:07:24,050 execute just fine. All right, 80 81 00:07:24,080 --> 00:07:27,680 so we can't have this function return the value 21 all the time. 81 82 00:07:27,680 --> 00:07:29,520 That's, that's just silly. 82 83 00:07:29,540 --> 00:07:40,330 What we actually want is we want this function to return the price estimate for a particular property, 83 84 00:07:40,940 --> 00:07:47,720 so we're going to use our regression object, regr, and the predict method on that object 84 85 00:07:47,720 --> 00:07:55,790 and then, as an argument to the predict method, we need to supply a single row of features. That single 85 86 00:07:55,790 --> 00:08:03,500 row will be our property_stats, which currently holds on to the average values for all the features in 86 87 00:08:03,590 --> 00:08:05,120 our dataset. 87 88 00:08:05,120 --> 00:08:12,170 Check it out, if I press Shift+Enter to refresh the cell and then pull up the values stored inside property_ 88 89 00:08:12,170 --> 00:08:12,980 stats, 89 90 00:08:13,220 --> 00:08:20,150 I can see an array with 11 features and all the average values. 90 91 00:08:20,150 --> 00:08:26,380 Okay, let's see what the log price estimate for this particular property is. 91 92 00:08:26,690 --> 00:08:36,770 "regr.predict(property_stats)", Shift+Enter and Shift+Enter again, we see that we get 92 93 00:08:36,770 --> 00:08:39,140 an array of two dimensions, 93 94 00:08:39,140 --> 00:08:47,000 mind you, it has two sets of square brackets, and then we get a price estimate in log dollars. 
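The two sets of square brackets mentioned above can be reproduced in a small sketch. The assumption here is that the model was fitted on a target kept as a single column (two-dimensional, like a one-column DataFrame), in which case scikit-learn also returns two-dimensional predictions. The data is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented features; target = 1 + 2*x1 + 3*x2, stored as one column (shape (4, 1)).
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0]])
target = (1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]).reshape(-1, 1)

regr = LinearRegression().fit(features, target)

# A single row holding the column averages, shape (1, 2).
property_stats = features.mean(axis=0).reshape(1, -1)

estimate = regr.predict(property_stats)
print(estimate.shape)  # one row, one column
print(estimate)
```

Note that predict expects a two-dimensional input even for a single property, which is why the averages are reshaped into one row.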
94 95 00:08:47,000 --> 00:08:52,880 Now if I wanted to get the raw number instead of an array we can access that value with 95 96 00:08:52,880 --> 00:08:55,800 "[0][0]", 96 97 00:08:55,820 --> 00:09:05,240 so first row, first column, Shift+Enter and Shift+Enter again will give us the log price estimate of the 97 98 00:09:05,330 --> 00:09:06,300 array. 98 99 00:09:06,650 --> 00:09:08,550 Now of course this is fairly boring, right? 99 100 00:09:08,570 --> 00:09:15,200 Because even though we've inputted some values here as arguments the number of rooms and students per 100 101 00:09:15,200 --> 00:09:20,680 classroom, we're not really making use of them inside our function. 101 102 00:09:20,750 --> 00:09:26,900 So I'm going to add a comment here that reads "Configure property" 102 103 00:09:30,330 --> 00:09:39,450 and then I'll add a second comment here that reads "Make prediction". When it comes to configuring our 103 104 00:09:39,450 --> 00:09:47,150 property, I'm thinking we can do it in a very, very similar way to the way we did it up here, so check 104 105 00:09:47,150 --> 00:09:47,680 it out. 105 106 00:09:47,960 --> 00:09:55,250 We can make use of this input, the number of rooms, by changing a particular index of our n-dimensional 106 107 00:09:55,250 --> 00:10:02,550 array, so "property_stats[0]", 107 108 00:10:02,730 --> 00:10:12,900 so row number 1 and then column number will be "RM_IDX", which is column number 4, 108 109 00:10:13,530 --> 00:10:18,710 and we're gonna set that equal to "nr_rooms". 109 110 00:10:18,800 --> 00:10:21,290 Now we have this number here 110 111 00:10:21,410 --> 00:10:28,940 when calling the function, filtering through into configuring the feature that we're then going to make 111 112 00:10:28,940 --> 00:10:30,230 the prediction on. 112 113 00:10:30,620 --> 00:10:36,550 So Shift+Enter and Shift+Enter, you'll see that number change. 
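The indexing used above can be sketched with NumPy alone. The 11 values in "property_stats" are invented placeholders rather than the real dataset averages, and "fake_prediction" is a hypothetical stand-in for the output of "regr.predict(property_stats)".

```python
import numpy as np

RM_IDX = 4  # position of the rooms feature, as stated in the lesson

# Invented placeholder averages for 11 features, stored as a single row.
property_stats = np.array([[3.6, 11.4, 0.07, 0.55, 6.3, 3.8,
                            9.5, 408.0, 18.5, 356.7, 12.7]])

nr_rooms = 8
property_stats[0][RM_IDX] = nr_rooms  # row 0, column RM_IDX
print(property_stats[0][RM_IDX])

# Indexing a two-dimensional result with [0][0] gives a single number.
fake_prediction = np.array([[3.05]])  # stand-in for regr.predict(property_stats)
log_estimate = fake_prediction[0][0]
print(log_estimate)
```

Only the rooms entry changes; the other ten entries keep their placeholder averages. The assignment modifies the array in place, so the new value stays there until it is overwritten again.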
113 114 00:10:36,650 --> 00:10:47,430 Now as that changes from 5 to say 8, to 9, to 3, we can now get price estimates for properties 114 115 00:10:47,820 --> 00:10:56,340 that have the average values for Crime and LSTAT and Pollution from the dataset for all the features 115 116 00:10:56,550 --> 00:10:59,560 except for one feature, namely the number of rooms. 116 117 00:10:59,880 --> 00:11:07,520 This now depends on the input to our function. Let's do the same thing for PTRATIO, 117 118 00:11:07,520 --> 00:11:15,880 the students per classroom. I'll let you pause the video to add that line of Python code. Ready? 118 119 00:11:15,890 --> 00:11:16,660 Here it goes. 119 120 00:11:16,670 --> 00:11:22,620 So "property_stats[0][ 120 121 00:11:22,730 --> 00:11:37,500 PTRATIO_IDX] = students_per 121 122 00:11:37,510 --> 00:11:40,150 _classroom" 122 123 00:11:40,210 --> 00:11:48,910 Now we can adjust the valuation of our property, based on how many kids there are per teacher, Shift+Enter, 123 124 00:11:49,930 --> 00:11:54,430 and Shift+Enter again and changing this and hitting Shift+ 124 125 00:11:54,440 --> 00:11:55,830 Enter again, 125 126 00:11:56,530 --> 00:12:04,810 we can see that for a PTRATIO of 10 to 1, the log price estimate is 3.05 and then for 126 127 00:12:04,810 --> 00:12:06,790 a PTRATIO of say 20, 127 128 00:12:06,820 --> 00:12:10,700 so a much bigger class, the log price estimate will go down. 128 129 00:12:10,720 --> 00:12:11,090 Right? 129 130 00:12:11,110 --> 00:12:16,270 So it'll be 2.68. Brilliant. 130 131 00:12:16,280 --> 00:12:17,720 Now you can see where this is going. 131 132 00:12:17,720 --> 00:12:18,820 Right? 132 133 00:12:19,340 --> 00:12:24,070 Next up is this "next_to_river" argument that I've given here. 133 134 00:12:24,140 --> 00:12:27,740 This is gonna be all about our CHAS variable. Now, 134 135 00:12:27,800 --> 00:12:29,270 CHAS was a dummy variable, right? 135 136 00:12:29,270 --> 00:12:33,500 It's either equal to 0 or it's equal to 1. 
136 137 00:12:33,500 --> 00:12:38,740 The way we're gonna configure our features is using an if-else clause, 137 138 00:12:38,930 --> 00:12:51,570 so "if next_to_river:", new line, "property_stats[0] 138 139 00:12:52,370 --> 00:13:00,660 [CHAS_IDX] = 1", 139 140 00:13:00,680 --> 00:13:13,030 new line, "else:", "property_stats[0][CHAS_ 140 141 00:13:13,570 --> 00:13:14,100 IDX] 141 142 00:13:16,930 --> 00:13:19,330 = 0" 142 143 00:13:19,330 --> 00:13:19,860 What code 143 144 00:13:19,860 --> 00:13:22,110 did I just write here? In English, 144 145 00:13:22,120 --> 00:13:25,790 this would read "if the property is next to the river, 145 146 00:13:25,930 --> 00:13:35,370 if next to the river is true, then we're going to change one feature, namely the column for Charles River, 146 147 00:13:35,890 --> 00:13:39,390 to 1, otherwise, 147 148 00:13:39,870 --> 00:13:48,060 if next to river is false which is the default in this case, property_stats for the Charles 148 149 00:13:48,060 --> 00:13:51,600 River column should be equal to 0". 149 150 00:13:51,780 --> 00:14:00,900 The value of this point in our array is conditional on the value of next_to_river, which should either 150 151 00:14:00,900 --> 00:14:04,140 be true or false. 151 152 00:14:04,280 --> 00:14:05,790 We're calling our function 152 153 00:14:05,790 --> 00:14:13,320 like so. This value here that we're getting for a three room dwelling with PTRATIO of 20, we're getting 153 154 00:14:13,320 --> 00:14:15,400 a value of 2.68. 154 155 00:14:15,480 --> 00:14:18,750 This property here is not next to the river. 155 156 00:14:19,350 --> 00:14:25,220 If I hit Shift+Enter on this cell to refresh and refresh this cell, then we see what this property would be priced 156 157 00:14:25,230 --> 00:14:27,270 at if it was not next to the river. 
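The if-else clause for the river dummy can be sketched on its own. The helper "configure_river", the value chosen for "CHAS_IDX" and the short five-feature row are all hypothetical, introduced only so the snippet is self-contained; in the lesson the same lines sit inside "get_log_estimate".

```python
import numpy as np

CHAS_IDX = 2  # hypothetical position of the Charles River dummy


def configure_river(property_stats, next_to_river=False):
    """Set the river dummy to 1 or 0 in a single-row feature array."""
    if next_to_river:
        property_stats[0][CHAS_IDX] = 1
    else:
        property_stats[0][CHAS_IDX] = 0
    return property_stats


# Invented single-row feature array; the dummy starts at its "average".
stats = np.array([[3.6, 11.4, 0.07, 0.55, 6.3]])

on_river = configure_river(stats.copy(), next_to_river=True)
off_river = configure_river(stats.copy())

print(on_river[0][CHAS_IDX], off_river[0][CHAS_IDX])
```

The else branch does real work here. The average of a 0/1 dummy is a fraction, and the array is modified in place, so without an explicit reset to 0 the entry could keep a fractional average or a 1 left over from an earlier call.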
157 158 00:14:28,110 --> 00:14:38,610 If however, "next_to_river = True" and I refresh the cell, then we can see 158 159 00:14:38,610 --> 00:14:41,550 the premium of this property reflected in the price 159 160 00:14:41,820 --> 00:14:43,420 if it was next to the river. 160 161 00:14:43,780 --> 00:14:48,390 And that's because again living next to the river is a desirable thing. 161 162 00:14:48,430 --> 00:14:50,710 All right, so you can see where this is going. 162 163 00:14:50,820 --> 00:14:58,830 What we've just done is we've created a function and we're configuring the property that we're getting 163 164 00:14:58,830 --> 00:15:02,940 the estimate for using the arguments to that function. 164 165 00:15:02,940 --> 00:15:11,550 So number of rooms, students per classroom and whether the property is next to the river. For all the 165 166 00:15:11,550 --> 00:15:14,860 other features, like pollution and highway accessibility, 166 167 00:15:14,940 --> 00:15:19,210 we're just using the average values of the dataset. 167 168 00:15:19,290 --> 00:15:25,440 If we wanted to, we could add more lines of code here and increase the number of arguments that are being 168 169 00:15:25,440 --> 00:15:28,950 supplied to this function to set those as well. 169 170 00:15:28,950 --> 00:15:34,780 But for this exercise, let's keep it simple, let's keep it to 3 custom inputs. 170 171 00:15:34,890 --> 00:15:41,100 Now, what about this fourth one that I made you add earlier, "high_confidence"? What I had in 171 172 00:15:41,100 --> 00:15:44,880 mind for this one was to calculate the range. 172 173 00:15:44,880 --> 00:15:53,400 So let's add another comment here and I'm going to have that read "Calc Range". 173 174 00:15:53,400 --> 00:15:55,840 This is our prediction interval. 174 175 00:15:56,160 --> 00:16:12,850 I'm going to adopt the same pattern, so "if high_confidence:", "Do X", "else:", "Do Y". 175 176 00:16:13,750 --> 00:16:15,370 Now what am I trying to get at here? 
176 177 00:16:16,420 --> 00:16:24,060 If high confidence is equal to True, then I want to calculate the two standard deviation prediction interval, 177 178 00:16:24,160 --> 00:16:33,250 I want to calculate the 95% prediction interval and otherwise, I want to calculate the one standard 178 179 00:16:33,250 --> 00:16:35,250 deviation prediction interval, 179 180 00:16:35,330 --> 00:16:41,560 68% prediction interval. The prediction intervals of course have an upper and a lower 180 181 00:16:41,560 --> 00:16:42,250 bound, 181 182 00:16:42,250 --> 00:16:53,640 so "upper_bound = log_estimate + 2*RMSE", 182 183 00:16:53,860 --> 00:16:55,900 yes that's our root mean squared error 183 184 00:16:55,900 --> 00:17:07,480 coming back in here, and the "lower_bound = log_estimate - 2* 184 185 00:17:07,660 --> 00:17:10,940 RMSE" 185 186 00:17:10,990 --> 00:17:15,530 Now what about if we don't want a high confidence prediction interval? 186 187 00:17:15,550 --> 00:17:17,980 What if we don't want a large range? 187 188 00:17:17,980 --> 00:17:25,990 Well, I can copy these two lines and I can paste them in here and instead of using two times the root 188 189 00:17:25,990 --> 00:17:34,010 mean squared error, we can simply use 1 times the root mean squared error to calculate our range. 189 190 00:17:34,030 --> 00:17:38,590 So this is a wide range and this is a narrow range. 190 191 00:17:38,800 --> 00:17:49,310 The last thing I'm going to do is I want to make this very explicit, so I'll say "interval = 95" and here 191 192 00:17:49,340 --> 00:17:53,650 I'm going to say "interval = 68". 192 193 00:17:53,710 --> 00:17:58,800 I'm going to use this variable here in a print statement later on. 
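The range calculation described above can be sketched in isolation. The values for "log_estimate" and "RMSE" are hypothetical, chosen only to keep the arithmetic easy to follow; in the lesson they come from the fitted model.

```python
# Hypothetical values for illustration; in the lesson these come from the model.
log_estimate = 3.0
RMSE = 0.2
high_confidence = True

if high_confidence:
    # Wide range: two RMSEs on either side of the estimate.
    upper_bound = log_estimate + 2 * RMSE
    lower_bound = log_estimate - 2 * RMSE
    interval = 95
else:
    # Narrow range: one RMSE on either side of the estimate.
    upper_bound = log_estimate + RMSE
    lower_bound = log_estimate - RMSE
    interval = 68

print(lower_bound, upper_bound, interval)
```

The 68 and 95 labels come from the usual rule of thumb for a normal distribution, so they are approximations that lean on the residuals being roughly normally distributed.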
193 194 00:17:59,110 --> 00:18:05,050 For now what we're gonna do is return all of these calculations, so check it out, we're going to return our "log_ 194 195 00:18:05,050 --> 00:18:08,740 estimate", we're going to return our "upper_bound", 195 196 00:18:08,740 --> 00:18:16,810 we're gonna return our "lower_bound" and we're going to return our interval. Hitting Shift+Enter, I 196 197 00:18:17,110 --> 00:18:21,110 refresh the cell and hitting Shift+Enter on this one, 197 198 00:18:21,460 --> 00:18:24,150 I can now see that I get a tuple. 198 199 00:18:24,340 --> 00:18:31,480 This is what the parentheses are telling us and that tuple consists of four things, the log price estimate, 199 200 00:18:33,150 --> 00:18:40,290 plus two standard deviations and minus two standard deviations for the range - an upper and a lower bound, 200 201 00:18:40,290 --> 00:18:46,400 plus an indication if this is a wide range or a narrow range. 201 202 00:18:46,530 --> 00:18:57,930 In this case, it is a wide range. Why? Because confidence is set to True by default. To get a narrow range, 202 203 00:18:58,100 --> 00:19:07,810 I can use "high_confidence = False" in which case we will enter the else block of 203 204 00:19:07,810 --> 00:19:16,960 the function and get the following output. We get the same log estimate, but the upper and the lower bound 204 205 00:19:17,200 --> 00:19:26,200 will be plus and minus one standard deviation, one times the root mean squared error, and our prediction 205 206 00:19:26,200 --> 00:19:26,650 interval 206 207 00:19:26,650 --> 00:19:32,300 in this case, will be 68%, so this is the 68% prediction interval 207 208 00:19:32,570 --> 00:19:36,850 and this is the 95% prediction interval. 208 209 00:19:37,000 --> 00:19:44,290 Now that we have a price estimate and an upper and a lower bound and a prediction interval, the next 209 210 00:19:44,290 --> 00:19:49,560 thing to do is to convert all of these log prices into dollar values.
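Putting the pieces together, the function built in this lesson might look roughly like the sketch below. So that it runs without the dataset, the fitted model is replaced by a hypothetical stand-in: the "predict" helper, the index values, "THETAS", "INTERCEPT", "RMSE" and the three-feature "property_stats" row are all invented for illustration and are not the lesson's real values.

```python
import numpy as np

# Hypothetical stand-ins for objects the lesson builds from the real dataset.
RM_IDX, PTRATIO_IDX, CHAS_IDX = 0, 1, 2
property_stats = np.array([[6.0, 18.0, 0.0]])
THETAS = np.array([0.1, -0.04, 0.1])
INTERCEPT = 3.0
RMSE = 0.2


def predict(row):
    """Stand-in for regr.predict: returns a 2D array of log prices."""
    return (row @ THETAS + INTERCEPT).reshape(-1, 1)


def get_log_estimate(nr_rooms, students_per_classroom,
                     next_to_river=False, high_confidence=True):
    # Configure property
    property_stats[0][RM_IDX] = nr_rooms
    property_stats[0][PTRATIO_IDX] = students_per_classroom
    # Compact form of the if-else clause from the lesson.
    property_stats[0][CHAS_IDX] = 1 if next_to_river else 0

    # Make prediction
    log_estimate = predict(property_stats)[0][0]

    # Calc Range
    if high_confidence:
        upper_bound = log_estimate + 2 * RMSE
        lower_bound = log_estimate - 2 * RMSE
        interval = 95
    else:
        upper_bound = log_estimate + RMSE
        lower_bound = log_estimate - RMSE
        interval = 68

    # Several values after "return" are packed into one tuple.
    return log_estimate, upper_bound, lower_bound, interval


result = get_log_estimate(3, 20)
print(type(result), len(result))

# The tuple can be unpacked into separate names when calling the function.
log_est, upper, lower, interval = get_log_estimate(3, 20, high_confidence=False)
print(log_est, upper, lower, interval)
```

Unpacking the tuple into four names is optional, but it tends to make the later conversion to dollar values easier to read than indexing into the tuple.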