0 1 00:00:00,720 --> 00:00:06,780 In the upcoming lessons we're going to talk about how to make predictions using our model. 1 2 00:00:07,170 --> 00:00:16,140 And believe it or not, this gives us a chance to link together several concepts of the past lessons. In 2 3 00:00:16,140 --> 00:00:19,080 our technical section on gradient descent 3 4 00:00:19,080 --> 00:00:26,910 we've talked extensively about cost functions and the role of the mean squared error cost function 4 5 00:00:27,140 --> 00:00:29,130 in machine learning. 5 6 00:00:29,130 --> 00:00:37,590 We've also covered how an algorithm that finds the best fit line minimizes this particular equation. 6 7 00:00:37,590 --> 00:00:45,510 And we've also talked a lot about metrics, like r-squared, and we touched on how to compare different 7 8 00:00:45,510 --> 00:00:46,590 models. 8 9 00:00:46,620 --> 00:00:51,390 And finally, we've talked a lot about how to analyze the residuals. 9 10 00:00:51,480 --> 00:00:53,660 So how does this all tie together? 10 11 00:00:53,730 --> 00:01:01,890 Let's go back to the equation of the mean squared error for a little bit. y_hat in this equation is the value 11 12 00:01:01,950 --> 00:01:09,600 that is estimated by our model. Since we're using a multivariable regression with 11 features, our equation 12 13 00:01:09,600 --> 00:01:13,800 for y_hat will actually break down into something like this. 13 14 00:01:13,850 --> 00:01:18,230 We have all the thetas and then all the features. Taking it a step further, 14 15 00:01:18,270 --> 00:01:26,610 we've also encountered the "y - y_hat" before. "y - y_hat" were the residuals, and the residuals are 15 16 00:01:26,610 --> 00:01:30,300 the actual values minus the fitted values. 16 17 00:01:30,330 --> 00:01:37,560 So what does that make the mean squared error? The mean squared error is just the residuals squared and 17 18 00:01:37,570 --> 00:01:45,000 then summed up and then divided by the number of observations. 
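To make that calculation concrete, here's a quick sketch of exactly that in Python: square the residuals, sum them up, and divide by the number of observations. The y and y_hat values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical actual values (y) and model estimates (y_hat)
y = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_hat = np.array([25.0, 20.0, 33.0, 35.0, 36.0])

# Residuals: the actual values minus the fitted values
residuals = y - y_hat

# Mean squared error: square the residuals, sum them up,
# and divide by the number of observations
mse = np.sum(residuals ** 2) / len(y)

# The same thing as a one-liner
assert np.isclose(mse, np.mean((y - y_hat) ** 2))
```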
And now do you see the squaring that's 18 19 00:01:45,000 --> 00:01:46,320 happening here? 19 20 00:01:46,320 --> 00:01:53,880 We're squaring the differences between the actual and the predicted values - we're squaring the residuals. 20 21 00:01:53,880 --> 00:01:59,100 And this is nice because we're treating the positive and the negative residuals equally, which means 21 22 00:01:59,160 --> 00:02:06,540 we can then add them up in the summation, but this mathematical operation of squaring the differences 22 23 00:02:06,900 --> 00:02:08,670 also has another effect. 23 24 00:02:08,670 --> 00:02:09,900 Think about this. 24 25 00:02:09,900 --> 00:02:16,310 What if the difference between the predicted value and the actual value was large? 25 26 00:02:16,350 --> 00:02:24,000 What if we have a large residual? What happens to the mean squared error? In this case, the squaring of 26 27 00:02:24,000 --> 00:02:27,360 a big number will increase the mean squared error massively. 27 28 00:02:27,360 --> 00:02:28,370 Right? 28 29 00:02:28,380 --> 00:02:35,940 In other words, the mean squared error weighs the large differences more heavily and thus it punishes large 29 30 00:02:35,970 --> 00:02:42,720 residuals. The effect of this is that it makes the mean squared error very sensitive to outliers. 30 31 00:02:42,780 --> 00:02:44,160 So that's one thing to be aware of. 31 32 00:02:44,760 --> 00:02:51,690 Now let's have a think about how the mean squared error compares to r-squared, and to do that we're 32 33 00:02:51,690 --> 00:02:56,940 going to check out the differences between the two in a Jupyter notebook. 33 34 00:02:56,940 --> 00:03:03,660 In the past lessons, we've written some code where we calculated our residuals, plotted them and compared 34 35 00:03:03,720 --> 00:03:06,770 the residual plots of three different models. 35 36 00:03:06,780 --> 00:03:13,700 Now let's look at these models' r-squared and their mean squared errors. 
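You can see this outlier sensitivity with a tiny sketch: take the same set of residuals, swap one of them for a single large residual, and watch what happens to the MSE. The residual values here are made up for illustration.

```python
import numpy as np

# The same residuals, once without and once with a single large outlier
small_residuals = np.array([1.0, -1.0, 2.0, -2.0, 1.0])
with_outlier = np.array([1.0, -1.0, 2.0, -2.0, 20.0])

mse_small = np.mean(small_residuals ** 2)   # (1 + 1 + 4 + 4 + 1) / 5 = 2.2
mse_outlier = np.mean(with_outlier ** 2)    # (1 + 1 + 4 + 4 + 400) / 5 = 82.0

# One big residual, squared, dominates the whole average
```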
In some of the previous lessons, 36 37 00:03:13,710 --> 00:03:20,730 the way we've worked with the mean squared error was by importing a function from scikit-learn. 37 38 00:03:20,850 --> 00:03:26,260 So we said "from sklearn.metrics import mean_squared_error" 38 39 00:03:26,550 --> 00:03:30,180 and then we used this function to calculate the MSE. 39 40 00:03:30,180 --> 00:03:37,230 But in this case, we're using our statsmodels module here and it actually makes accessing the mean squared 40 41 00:03:37,240 --> 00:03:43,390 error very, very easy because again we just have to use our results object to get hold of it. 41 42 00:03:43,950 --> 00:03:44,840 Here's how. 42 43 00:03:45,120 --> 00:03:50,670 Coming down here, I'm going to add a comment that reads "Mean Squared Error". 43 44 00:03:51,720 --> 00:04:00,460 And I can access the mean squared error from our results object by typing "results.mse_ 44 45 00:04:01,310 --> 00:04:02,460 resid". 45 46 00:04:02,520 --> 00:04:05,970 This will give us our mean squared error. 46 47 00:04:05,970 --> 00:04:12,390 And what I'll do is I'll store this in a variable, I'm going to call this variable "reduced_ 47 48 00:04:12,450 --> 00:04:13,740 log_ 48 49 00:04:13,800 --> 00:04:19,620 mse", and set that equal to the rounded value of this whole thing, 49 50 00:04:19,620 --> 00:04:20,010 so 50 51 00:04:20,140 --> 00:04:28,230 I'll add a round function and have the "results.mse_resid" inside the parentheses, put a comma 51 52 00:04:28,230 --> 00:04:33,300 after it, round it to three decimal places and close it all off. 52 53 00:04:33,480 --> 00:04:36,450 I'm going to do the same thing for the r-squared. 53 54 00:04:36,450 --> 00:04:41,810 So in our comment I'll add "R-squared" and 54 55 00:04:41,920 --> 00:04:45,590 I'm going to copy this line, paste it down here, 55 56 00:04:45,590 --> 00:04:52,340 change the variable name to r-squared and change the attribute as well. 
56 57 00:04:52,370 --> 00:05:02,760 So "results.rsquared", rounded to three decimal places, will be stored in this variable "reduced 57 58 00:05:02,790 --> 00:05:04,780 _log_ 58 59 00:05:04,840 --> 00:05:13,680 rsquared = round(results.rsquared, 3)". I'm going to 59 60 00:05:13,720 --> 00:05:14,990 take this here, 60 61 00:05:15,310 --> 00:05:22,570 these three lines of code and the comment, copy them and I'm going to come down here to model number 61 62 00:05:22,800 --> 00:05:23,990 2 - 62 63 00:05:24,220 --> 00:05:30,620 the original model using normal prices and all features. And what I'll do here is 63 64 00:05:30,680 --> 00:05:36,280 I'm going to come down to the very bottom, paste it in and change the variable names, 64 65 00:05:36,280 --> 00:05:41,010 the first one I'll call "full_normal_mse" 65 66 00:05:41,260 --> 00:05:45,160 and the second one I'm going to call "full_normal_ 66 67 00:05:45,190 --> 00:05:47,270 rsquared". 67 68 00:05:47,320 --> 00:05:52,490 Now, I'm going to hit Shift+Enter on both of these cells since I've added new code, I think I didn't do it on 68 69 00:05:52,490 --> 00:05:59,390 the first one, but I'm going to do it on this one, Shift+Enter and then coming up here, I'm going to hit Shift+Enter 69 70 00:05:59,540 --> 00:06:00,570 as well. 70 71 00:06:00,590 --> 00:06:07,760 This leaves us with our third model, the one where we omitted some variables and are still using log 71 72 00:06:07,760 --> 00:06:08,860 prices. 72 73 00:06:09,290 --> 00:06:15,270 And I'm going to paste in the three lines, the comment reading "Mean Squared Error and R-squared" 73 74 00:06:15,270 --> 00:06:18,340 and then I'm going to change the variable names, 74 75 00:06:18,540 --> 00:06:27,990 once again. I'm going to call this one "omitted_var_ 75 76 00:06:28,020 --> 00:06:36,590 mse" and "omitted_var_rsquared". 
I'm going to hit Shift+ 76 77 00:06:36,600 --> 00:06:45,860 Enter, and now that we've added this code, calculated all these values, we can look at them side by side, 77 78 00:06:46,190 --> 00:06:49,000 which is what I wanted to do in the first place. 78 79 00:06:49,220 --> 00:07:00,470 So I'll put them in a DataFrame "pd.DataFrame({'R-Squared':" 79 80 00:07:01,370 --> 00:07:07,010 and now I'm going to provide a list - a list of all the r-squared values. 80 81 00:07:07,010 --> 00:07:15,510 So the first one was "reduced_log_rsquared", the second one was "full_ 81 82 00:07:15,990 --> 00:07:27,300 normal_rsquared", and the third one was "omitted_var_rsquared". 82 83 00:07:27,780 --> 00:07:30,110 Those are all the r-squared values. 83 84 00:07:30,140 --> 00:07:38,250 Now I'm going to put a comma here after the square brackets, go down to the next line and add our second key for 84 85 00:07:38,250 --> 00:07:45,810 our dictionary, it's gonna be "mse", mean squared error, colon and again we're gonna be putting in a list of 85 86 00:07:45,990 --> 00:07:49,050 values, so I can probably take this list here, 86 87 00:07:49,050 --> 00:07:52,510 copy it, paste it in and then change that 87 88 00:07:52,530 --> 00:07:55,460 last thing here in the name to 88 89 00:07:55,470 --> 00:08:04,490 mse. This is how I've named all my mean squared error variables. So that's easy enough, 89 90 00:08:04,530 --> 00:08:11,910 let's take a look at what we've got. I'm going to hit Shift+Enter and what we see is two columns, one reading "MSE" 90 91 00:08:12,240 --> 00:08:20,360 and the other one reading "R-Squared". Now the rows are still indexed by 0, 1 and 2, but we can change that 91 92 00:08:20,360 --> 00:08:24,440 by adding an additional argument here. 
92 93 00:08:24,530 --> 00:08:31,160 So I'm going to put a comma up here and then I'm going to supply the index argument, and I'm going to make this 93 94 00:08:31,160 --> 00:08:32,660 equal to, again, a list, 94 95 00:08:32,960 --> 00:08:42,080 so square brackets, and I'm going to put three values, the first one is gonna be our "Reduced Log Model", 95 96 00:08:42,080 --> 00:08:51,170 the second one is going to be our "Full Normal Price Model" and then the third one is going to be our 96 97 00:08:52,190 --> 00:08:52,790 "Omitted 97 98 00:08:55,410 --> 00:08:59,790 Var Model". Refreshing our output, 98 99 00:09:00,000 --> 00:09:04,860 we see the names listed here very nicely. Okay, 99 100 00:09:05,070 --> 00:09:07,170 so let's interpret what we're looking at. 100 101 00:09:09,050 --> 00:09:11,750 First off, the R-squared - 101 102 00:09:11,750 --> 00:09:20,480 we know that R-squared is always between 0 and 1 for every single regression model out there. 102 103 00:09:20,480 --> 00:09:21,440 But what does this mean? 103 104 00:09:22,610 --> 00:09:31,100 R-squared is a relative measure of fit and R-Squared does not have any units, right? 104 105 00:09:31,100 --> 00:09:36,840 This means that the R-Squared doesn't scale with the data, it's always between 0 and 1. 105 106 00:09:37,060 --> 00:09:46,040 And in our table we can see that the simplified model with the log prices has a higher fit than our 106 107 00:09:46,040 --> 00:09:48,980 full model with the normal prices or, 107 108 00:09:49,070 --> 00:09:51,890 of course, the model where we left out some variables. 108 109 00:09:51,890 --> 00:09:55,760 This one in fact has the lowest fit of them all. 109 110 00:09:56,330 --> 00:09:58,830 But what about the mean squared error? 110 111 00:09:59,360 --> 00:10:06,670 In contrast to R-Squared, the mean squared error actually is an absolute measure of fit. 
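Put together, the comparison DataFrame looks roughly like the sketch below. Apart from the 19.9 mentioned later for the full normal price model, the metric values here are hypothetical placeholders, not the lesson's actual results; only the variable names and the table layout follow the notebook.

```python
import pandas as pd

# Hypothetical metric values standing in for the three fitted models
reduced_log_rsquared, reduced_log_mse = 0.793, 0.035
full_normal_rsquared, full_normal_mse = 0.751, 19.920
omitted_var_rsquared, omitted_var_mse = 0.747, 0.047

# One row per model, with readable index labels instead of 0, 1 and 2
comparison = pd.DataFrame(
    {'R-Squared': [reduced_log_rsquared, full_normal_rsquared, omitted_var_rsquared],
     'MSE': [reduced_log_mse, full_normal_mse, omitted_var_mse]},
    index=['Reduced Log Model', 'Full Normal Price Model', 'Omitted Var Model'])

print(comparison)
```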
111 112 00:10:06,680 --> 00:10:13,880 It's not a relative measure, it's an absolute measure and it has units, namely the squared units of the target, 112 113 00:10:13,880 --> 00:10:22,070 our y variable, that 19.9 that you see here for the full normal price model, 113 114 00:10:22,070 --> 00:10:22,260 yeah, 114 115 00:10:22,280 --> 00:10:29,600 the second one, is in the squared units of the target, which in this case is thousands of dollars. 115 116 00:10:29,600 --> 00:10:34,790 So strictly speaking this mean squared error here is about 19.9 thousand-dollars squared, and taking its square root gives a typical error of roughly four and a half thousand dollars. 116 117 00:10:34,790 --> 00:10:39,340 Is this third one better than the second one because it has a lower mean squared error? 117 118 00:10:39,520 --> 00:10:40,830 And the answer is no, 118 119 00:10:40,850 --> 00:10:41,140 right? 119 120 00:10:41,150 --> 00:10:45,980 Because you can't compare the two because they're in different units. The model in the first row and 120 121 00:10:45,980 --> 00:10:50,710 the model in the third row are using log prices, right? 121 122 00:10:50,750 --> 00:10:53,050 Log prices in thousands. 122 123 00:10:53,060 --> 00:11:00,470 In other words the scale of the MSE is different depending on the data that you're using. 123 124 00:11:00,470 --> 00:11:09,770 But of course lower values of MSE indicate a better fit and an MSE of 0 indicates a perfect fit. 124 125 00:11:10,430 --> 00:11:16,850 So you have to remember that when you're comparing models. Okay, so now that we've calculated our mean 125 126 00:11:16,850 --> 00:11:24,970 squared error, let's talk about how we would go about making a prediction for a house price. 126 127 00:11:25,040 --> 00:11:30,590 Now I think that before we go ahead and do that we should actually check out how the pros predict house 127 128 00:11:30,590 --> 00:11:40,830 prices. 
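Here's a tiny sketch, with made-up prices, of why MSEs on different scales can't be compared: the exact same 5% error on every house produces wildly different MSE magnitudes on raw prices versus log prices, even though the model is equally wrong in both cases.

```python
import numpy as np

# Hypothetical prices in thousands of dollars, each over-predicted by 5%
prices = np.array([250.0, 320.0, 410.0, 500.0])
predicted = prices * 1.05

# MSE on the raw prices vs. MSE on the log prices:
# identical relative errors, entirely different magnitudes
mse_normal = np.mean((prices - predicted) ** 2)
mse_log = np.mean((np.log(prices) - np.log(predicted)) ** 2)
```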
Two websites that are very good at this are Zoopla in the UK and Zillow in the US, but of course 128 129 00:11:40,920 --> 00:11:45,740 you can go and hunt around on the Internet to find an equivalent website 129 130 00:11:45,820 --> 00:11:51,150 that's a bit closer to home depending on where you live, and you can either maybe like pause the video 130 131 00:11:51,240 --> 00:11:56,310 and try out one of these websites to get an estimate for a home 131 132 00:11:56,310 --> 00:11:58,840 or you can watch me go through the process. 132 133 00:11:58,980 --> 00:12:05,340 One word of warning though, Zillow thinks they're being very cute or clever by calling their estimate 133 134 00:12:05,610 --> 00:12:07,100 a Zestimate, 134 135 00:12:07,310 --> 00:12:07,840 mm hmm, 135 136 00:12:08,520 --> 00:12:08,910 yeah, 136 137 00:12:08,970 --> 00:12:14,600 so, just something to note. So how did that go? 137 138 00:12:14,600 --> 00:12:16,040 Did you give it a shot? 138 139 00:12:16,670 --> 00:12:22,850 So yesterday I actually tried using Zoopla to get a price estimate for Buckingham Palace or Big 139 140 00:12:22,850 --> 00:12:28,360 Ben, but I didn't have much luck providing the postcodes for these landmarks, 140 141 00:12:28,520 --> 00:12:35,870 but funnily enough there were some properties listed in Windsor Castle of all places, including like the 141 142 00:12:35,870 --> 00:12:39,750 Windsor Castle library. 142 143 00:12:40,170 --> 00:12:43,010 So yeah I'm really not sure why, 143 144 00:12:43,070 --> 00:12:50,030 but I tell you what, I did dig around and I want to show you how to get a property price estimate for 144 145 00:12:50,300 --> 00:12:51,350 this particular home. 145 146 00:12:51,350 --> 00:12:53,600 Let me show you the brochure. 146 147 00:12:54,080 --> 00:12:58,170 So this is an old brochure and it's for this house in Kenwood. 147 148 00:12:58,220 --> 00:13:01,730 And as you can see it's a very, very nice house. 
148 149 00:13:01,770 --> 00:13:10,160 I think the agent Knight Frank was looking to sell it and the story with this house is that it used 149 150 00:13:10,160 --> 00:13:13,340 to belong to John Lennon from the Beatles. 150 151 00:13:13,340 --> 00:13:13,920 That's right. 151 152 00:13:13,940 --> 00:13:15,040 This John Lennon. 152 153 00:13:15,260 --> 00:13:23,110 Contrary to popular belief, he did not live on Abbey Road nor did he live in a yellow submarine. 153 154 00:13:23,120 --> 00:13:23,510 OK. 154 155 00:13:23,580 --> 00:13:28,330 So let me show you how to get an estimate for his previous crib. 155 156 00:13:28,530 --> 00:13:35,870 Now to get a valuation for this home on Zoopla, what I have to do is I have to go to "Get your home 156 157 00:13:35,870 --> 00:13:43,710 valued" and then I think it's like come down here and I think I have to click on "What's my home worth", 157 158 00:13:43,790 --> 00:13:49,820 so I think the first place they send you is trying to get Agent valuations, but I'm actually interested 158 159 00:13:49,910 --> 00:13:53,880 in the Zoopla estimate, so how much Zoopla thinks it's worth. 159 160 00:13:53,920 --> 00:13:59,240 So I'm going to click on this "Get a Zoopla estimate" and then I'm going to punch in the postcode, 160 161 00:13:59,240 --> 00:14:06,440 so this was KT13 0JU. I looked this up. 161 162 00:14:06,480 --> 00:14:16,010 So John Lennon used to live at this postcode and from that property brochure we know that it's 162 163 00:14:16,430 --> 00:14:17,910 Kenwood, Wood Lane. 163 164 00:14:18,000 --> 00:14:28,900 So let's click on "Get estimate". Now we can supply some additional information about this property, the 164 165 00:14:28,960 --> 00:14:39,400 "Property type" is it's a "Detached house"; the "Property style" is that it's a "Period" home and it's a "Freehold". 
165 166 00:14:39,400 --> 00:14:45,310 I don't know how many floors it has, but I did check the brochure and it was six bedrooms, 166 167 00:14:45,550 --> 00:14:49,300 six bathrooms and six receptions. 167 168 00:14:49,480 --> 00:14:55,630 I think they also had like a big garden and a swimming pool and for the internal area according to the 168 169 00:14:55,630 --> 00:14:59,440 brochure it's 1,110 square metres, 169 170 00:14:59,440 --> 00:15:08,990 so absolutely enormous. Now I can press "Continue" and then I'm going to tick "Swimming pool" here and I'm gonna leave 170 171 00:15:09,000 --> 00:15:09,810 these other ones alone, 171 172 00:15:09,810 --> 00:15:19,200 I can't be bothered. But what's very interesting is that Zoopla knows how much this home sold for previously. 172 173 00:15:19,200 --> 00:15:22,800 This is actually public knowledge, right? 173 174 00:15:22,830 --> 00:15:28,740 If you're wondering where they're getting this from, they're getting it from the UK Government's land 174 175 00:15:28,740 --> 00:15:29,780 registry. 175 176 00:15:30,090 --> 00:15:37,740 So the government basically tracks all the property transactions in the UK and you can search for house 176 177 00:15:37,740 --> 00:15:40,750 prices of the previous transactions. 177 178 00:15:41,040 --> 00:15:43,520 And if I put in the postcode here, 178 179 00:15:43,530 --> 00:15:56,130 so KT13 0JU and then say "Show results", then I can see here that for Kenwood, Wood Lane in 2007 179 180 00:15:56,340 --> 00:16:00,290 this home sold for 5.8 million pounds. 180 181 00:16:00,310 --> 00:16:00,620 Yeah. 181 182 00:16:01,080 --> 00:16:06,990 So this is where Zoopla is getting its price data from. 182 183 00:16:06,990 --> 00:16:07,220 Yeah. 183 184 00:16:07,230 --> 00:16:09,380 They're getting it directly from the government. 184 185 00:16:09,450 --> 00:16:15,510 So that's one of the data sources. 
Now I'm going to confirm the information above is accurate and I agree 185 186 00:16:15,510 --> 00:16:18,480 to the terms of use and I'm going to say "Get estimate" 186 187 00:16:21,520 --> 00:16:29,440 and what we see is that Zoopla estimates John Lennon's former home to be worth 187 188 00:16:29,440 --> 00:16:31,440 8.75 million pounds. 188 189 00:16:32,200 --> 00:16:34,000 So it sold for 5.8 189 190 00:16:34,000 --> 00:16:41,140 back in '07 and today Zoopla reckons it's worth about 8.75. 190 191 00:16:41,140 --> 00:16:44,140 But here's the part I want to draw your attention to. 191 192 00:16:44,350 --> 00:16:48,620 I want to draw your attention to this value range that they're providing here. 192 193 00:16:48,640 --> 00:16:49,770 So we don't just get a price, 193 194 00:16:49,780 --> 00:16:50,520 yeah. 194 195 00:16:50,810 --> 00:16:55,420 Zoopla isn't telling us "Oh it's worth 8.75 million on the nose", 195 196 00:16:55,420 --> 00:16:59,740 what we also get from them is a range around that price. 196 197 00:16:59,740 --> 00:17:07,360 We see that Zoopla thinks the property is worth between 8.16 and 9.34 197 198 00:17:07,570 --> 00:17:09,330 million pounds. 198 199 00:17:09,430 --> 00:17:15,560 What Zoopla is actually telling us is that they are estimating this home to be worth 8.75 199 200 00:17:15,560 --> 00:17:20,360 plus or minus 590,000 pounds. 200 201 00:17:20,470 --> 00:17:25,690 They're giving us an upper and a lower bound to their estimate. 201 202 00:17:25,990 --> 00:17:29,620 And this is a key component for making a good prediction. 202 203 00:17:29,770 --> 00:17:32,330 Why? Well, 203 204 00:17:32,340 --> 00:17:37,770 because it tells us how confident Zoopla is in their estimate. 204 205 00:17:37,800 --> 00:17:44,100 I mean they're giving us this little bar here "Confidence level: high" but we can also infer how confident 205 206 00:17:44,130 --> 00:17:49,870 they are by how wide this range is. 
In other words, 206 207 00:17:49,890 --> 00:17:56,010 if Zoopla says "Oh it's 8.7 million plus or minus 600,000", that's very, 207 208 00:17:56,010 --> 00:18:02,130 very different from saying "Oh it's worth 8.7 million plus or minus 2 million", 208 209 00:18:02,130 --> 00:18:03,180 right? 209 210 00:18:03,270 --> 00:18:09,600 The point I'm trying to make is that, when it comes to our predictions, our forecasting that we're going 210 211 00:18:09,600 --> 00:18:16,740 to be doing, we're going to have to provide not just a house price but we're also going to have to provide 211 212 00:18:16,860 --> 00:18:19,830 a range around that price. 212 213 00:18:19,830 --> 00:18:27,810 And believe it or not, this range actually has something to do with the residuals and the mean squared 213 214 00:18:27,870 --> 00:18:30,040 error. And 214 215 00:18:30,200 --> 00:18:33,710 that's what we're going to be looking at in the next lesson. 215 216 00:18:33,800 --> 00:18:37,640 But before you head over there let me, let me ask you a question. 216 217 00:18:37,790 --> 00:18:44,080 Have you ever used a website like Zoopla or Zillow to get an estimate for a property? And, 217 218 00:18:44,090 --> 00:18:51,160 if so, please post the link of the service that you used in the comments section for this video. 218 219 00:18:51,250 --> 00:18:56,900 I'd actually be quite curious to know for which places on earth we can get estimates like this and 219 220 00:18:56,960 --> 00:19:00,870 for which ones we can't. At the time of recording, 220 221 00:19:00,920 --> 00:19:06,880 I tried finding the equivalent of Zoopla for Austria but no dice. 221 222 00:19:06,890 --> 00:19:07,130 Yeah. 222 223 00:19:07,150 --> 00:19:12,950 The property market in Austria is just not as transparent as in the UK. 
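As a preview of the idea only, not necessarily the next lesson's exact method: one common way to turn the MSE into a range like Zoopla's is to take its square root (the RMSE) and go two RMSEs either side of the prediction, assuming the residuals are roughly normal. The predicted log price and MSE below are hypothetical numbers, made up for illustration.

```python
import numpy as np

# Hypothetical predicted log price and a hypothetical MSE of the residuals
log_prediction = 3.03
mse = 0.035
rmse = np.sqrt(mse)

# A rough ~95% range: two RMSE either side of the prediction
# (assuming roughly normal residuals), converted back from log prices
upper_price = np.exp(log_prediction + 2 * rmse)   # in thousands of dollars
lower_price = np.exp(log_prediction - 2 * rmse)
```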
223 224 00:19:13,190 --> 00:19:19,820 And this property price transaction data that Zoopla is piggybacking off of to make their estimates, 224 225 00:19:20,050 --> 00:19:25,130 you know, that they're getting from the land registry is just not publicly available. 225 226 00:19:25,530 --> 00:19:28,090 Anyhow, I'll see you in the next video. 226 227 00:19:28,100 --> 00:19:28,760 Take care.