All right, so how do we get to making a prediction or a forecast? We have to provide two things: the estimated property price and the range that goes with that price. This is what the pros do as well; they provide an estimate of what the home is worth plus a degree of uncertainty around that number.

The estimated price is easy enough, right? Once we have the theta parameters for the model, all we have to do is plug them in together with the values for the individual features, and we get an estimate, our y_hat. But what about the range? Where does that come from? The range actually depends on the shape of the distribution that you're working with. If we know the distribution, we can estimate the range very accurately.

And here's the thing: our go-to distribution is usually the normal distribution, because the very, very nice thing about the normal distribution is that we know its shape. We know that for a normal distribution, 68% of the observations lie between these two points; 68% of all the values in this distribution are within this purple range right here. And for a normal distribution, we also know that around 95% of the values lie between these two points; 95% of the observations fall within this pink range that I've highlighted.

The individual points that I've drawn on this histogram actually have a name. They quantify the amount of variation around the mean. The mean is right here in the middle of the distribution, and the distance from the mean to that bright purple point is called one standard deviation. For a normal distribution, you'll usually see the Greek letter sigma used to denote the standard deviation.

Now what about the other points that I drew on here earlier? Well, the left purple point is at minus one standard deviation from the mean, and for our normal distribution, as we said before, around 68% of all the observations lie between minus one and plus one standard deviation. OK, what about the pink points? Well, the right pink point is at plus two standard deviations and the left pink point is at minus two standard deviations. And as we've said before, approximately 95% of observations lie between plus two and minus two standard deviations for a normal distribution.
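If you want to see those 68% and 95% figures come out of actual numbers, here is a minimal sketch that uses simulated data only, nothing from our notebook. It draws samples from a normal distribution with numpy and counts how many fall within one and two standard deviations of the mean; the seed and sample size are arbitrary.

```python
import numpy as np

# Simulated data only (nothing from the notebook): draw from a normal
# distribution and check how many values fall within 1 and 2 standard deviations.
rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

mu, sigma = samples.mean(), samples.std()
within_1sd = np.mean(np.abs(samples - mu) <= 1 * sigma)   # ~0.68
within_2sd = np.mean(np.abs(samples - mu) <= 2 * sigma)   # ~0.95

print(f'Within 1 standard deviation: {within_1sd:.1%}')
print(f'Within 2 standard deviations: {within_2sd:.1%}')
```

Run it with different seeds and the proportions stay very close to 68% and 95%.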
Now let me ask you this. If this green point here, right in the middle, is our estimate for the property price from our model, our y_hat, what is the distribution that we're going to be looking at here? Any guess? What's the distribution that tells us something about the variance in our price estimates? Well, we're coming full circle here: it's actually the distribution of the residuals from our regression. This is the reason why we cared so much about whether that distribution is normal or not.

The next question you might ask at this point is: well, okay, if the distribution is the distribution of the residuals, then what do the purple and pink dots represent? How do we get our range? Do you remember our mean squared error? And no, the mean squared error is not the purple dot, but we can make one small modification to the mean squared error and get something very, very handy for calculating the range and making predictions. That small modification is taking the square root. By taking the square root of the mean squared error, we get another metric, and this one is called, yes, surprise, surprise, the Root Mean Squared Error, or RMSE. It's this metric, the RMSE, that has a really, really nice interpretation: the Root Mean Squared Error represents one standard deviation of the differences between our actual and our predicted values. In other words, the Root Mean Squared Error is one standard deviation in the distribution of our residuals.

So let's look at the chart again. To create our range around our estimated price, our so-called prediction interval, the first thing we choose is how wide we want that interval to be. Say we want to cover around 95% of the observations; then we would use two standard deviations on either side. This means taking our prediction and adding two times the Root Mean Squared Error to it for the upper bound, and subtracting two times the Root Mean Squared Error for the lower bound on our prediction. That's how we get the range. In our Jupyter notebook, this simply means taking the square root of the mean squared error that we've already calculated.
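Before we jump into the notebook, here is a small standalone sketch of that recipe. The arrays and the y_hat value are made-up numbers for illustration only; the point is that the RMSE is the square root of the mean squared residual, and the 95% prediction interval is roughly y_hat plus or minus two times the RMSE.

```python
import numpy as np

# Made-up numbers for illustration; in the notebook these come from the regression.
y_actual = np.array([21.0, 34.5, 19.8, 27.1, 23.4])   # observed target values
y_pred = np.array([22.3, 31.9, 21.0, 26.4, 24.8])     # fitted values (y_hat)

residuals = y_actual - y_pred
mse = np.mean(residuals**2)    # mean squared error
rmse = np.sqrt(mse)            # one standard deviation of the residuals

# ~95% prediction interval around a new estimate: y_hat +/- 2 * RMSE
y_hat = 25.0
upper_bound = y_hat + 2 * rmse
lower_bound = y_hat - 2 * rmse
print(f'Estimate: {y_hat:.1f}, interval: [{lower_bound:.2f}, {upper_bound:.2f}]')
```

Strictly speaking, 95% corresponds to about 1.96 standard deviations; two is the usual rule of thumb, and it's what we use in this lesson.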
Coming to the cell where we've got our dataframe with our Mean Squared Error, what I'm going to do is add another column. So I'll put a comma here, go to the next line, put RMSE in single quotes and then a colon, and now I can take this entire list here and copy it. Then what I'm going to do is use numpy and call the square root function from numpy, and as an argument I'm going to pass in the list that I just copied. What this will do is take the square root of every item in the list. Let me refresh the cell. Here we go.

Now that we've done that, I want to give you a challenge. Suppose we have an estimate from our model for a house price of 30,000 dollars. Can you calculate the upper and lower bound for a 95% prediction interval using the reduced log model? In other words, can you calculate the upper and the lower bound for the range around this estimate? I'll give you a few seconds to pause the video before I show you the solution.

All right, let's take it from the top. I'm going to add a print statement and spell it out, so I'm going to say "1 s.d. in log prices", because we've got units, is, and that will be the square root of our log mean squared error, so np.sqrt(reduced_log_mse). Agreed? What's that equal to? It's equal to this much. Now what about two standard deviations? I can calculate that simply by taking the first print statement, copying it, and then multiplying the whole thing by two. Here we go. This is two standard deviations.

The upper bound for the prediction interval will be equal to our y_hat plus two times the root mean squared error. Now, I've been pretty sneaky and given you the y_hat in dollar values, so you actually have to use a log transformation, np.log(30), since our model works in thousands, and then you have to add 2*np.sqrt(reduced_log_mse). So this is our y_hat plus two times the root mean squared error. Let me print that out: print('The upper bound in log prices for a 95% prediction interval is', upper_bound). That's equal to approximately 3.78.
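Put together, that upper-bound calculation looks roughly like this. The reduced_log_mse value of 0.035 is only a stand-in that roughly reproduces the numbers quoted in the video; in the notebook you would use the mean squared error you already computed for the reduced log model.

```python
import numpy as np

# reduced_log_mse is assumed to exist from earlier in the notebook (the MSE of
# the reduced log-price model). 0.035 is only a stand-in for illustration.
reduced_log_mse = 0.035

print('1 s.d. in log prices is', np.sqrt(reduced_log_mse))
print('2 s.d. in log prices is', 2 * np.sqrt(reduced_log_mse))

# y_hat of $30,000 is 30 in the model's units (thousands), then log-transformed.
upper_bound = np.log(30) + 2 * np.sqrt(reduced_log_mse)
print('The upper bound in log prices for a 95% prediction interval is', upper_bound)
```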
Now if we wanted to see this in dollar values, I can copy that print statement, paste it in, and change it to say 'The upper bound in normal prices is', np.e**upper_bound. So let's see what that is. I can make it more explicit by multiplying the whole thing by 1000 and putting a little dollar sign here, and there I've got my upper bound.

The lower bound is very, very similar. You can even copy these three lines, paste them in, and change this to lower_bound = np.log(30) minus two times the root mean squared error. Then I change the print statements to read 'The lower bound in log prices for a 95% prediction interval is', lower_bound, and the lower bound in normal prices is np.e**lower_bound * 1000. Let's see what this reads. The lower bound in this case is 20,635 dollars.

Now, the trick with this challenge is to do the addition of the root mean squared error and the transformation in the right order, because otherwise you'll get a very, very different result. The incorrect way of calculating the upper bound would have been to take two times the root mean squared error and simply say, 'Well, okay, we've got an estimate of thirty thousand, and then we're going to add to it the transformed value, np.e** two times the root mean squared error, multiplied by 1000'. So this was the little trick in how I phrased this challenge.
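For reference, here is a sketch of the whole solution in one place, including the incorrect ordering for comparison. Again, the reduced_log_mse value of 0.035 is only a stand-in that roughly reproduces the numbers quoted above.

```python
import numpy as np

reduced_log_mse = 0.035                      # stand-in for the notebook's value
two_rmse = 2 * np.sqrt(reduced_log_mse)

log_estimate = np.log(30)                    # $30,000 in log(thousands)
upper_bound = log_estimate + two_rmse
lower_bound = log_estimate - two_rmse

# Correct: add the 2 * RMSE in log space, then transform back to dollars.
print('The upper bound in normal prices is $', np.e**upper_bound * 1000)
print('The lower bound in normal prices is $', np.e**lower_bound * 1000)

# Incorrect: transform the 2 * RMSE first and add it in dollar space.
print('Incorrectly ordered upper bound: $', 30_000 + np.e**two_rmse * 1000)
```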
In summary, we often look towards the root mean squared error when we're interested in the predictive power of our models, and to some extent we can also use the root mean squared error to compare models. The reason is that the root mean squared error is a very good measure of how accurately the model predicts the target, because it lets us determine a range, and the width of that range is a very important criterion if the main purpose of the model is prediction. This is a big contrast to something like R-squared, because R-squared says absolutely nothing about the predictive power of the model or the prediction error.

Okay, so we're slowly coming towards the end of the section. In the next lesson, we're going to finish it up by building a valuation tool for our boss in our real estate office, and that's probably going to involve updating the prices a little bit to reflect today's dollar values. I think the Boston upper price of fifty thousand dollars is not really that accurate anymore. We're also going to be looking at how we can create a Python function with optional arguments, that is, arguments that already have default values set, similar to what we've seen with seaborn and matplotlib, and we're going to cover the Python syntax for creating functions like this ourselves; there's a tiny preview of that syntax below. I'll see you in the next lesson. Take care.
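As that preview of optional arguments, here is a minimal sketch. The function name and parameters are made up for illustration; this is not the valuation tool we'll actually build in the next lesson. The idea is simply that parameters with default values can be left out or overridden by keyword when the function is called.

```python
# Hypothetical example, not the actual valuation tool from the next lesson.
def describe_property(rooms, students_per_class=15, next_to_river=False):
    """Toy function showing optional arguments with default values."""
    print(f'rooms={rooms}, students_per_class={students_per_class}, '
          f'next_to_river={next_to_river}')

describe_property(3)                               # both defaults used
describe_property(3, next_to_river=True)           # override one default by keyword
describe_property(rooms=5, students_per_class=20)  # everything passed by keyword
```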