The first topic that we're going to talk about on the evaluation front is data transformations.

Let's take another, closer look at the values that we're trying to predict: our target values, the property prices. Check out this histogram. This is what we plotted earlier. What we see here is that there are quite a few very, very expensive properties on the right-hand side of this graph. There's a bunch of properties in the right tail of this distribution. And here's what a normal distribution looks like in comparison: the normal distribution has very, very few occurrences in the tails. And here's the thing: having more data points in one of the tails is called skew. Our distribution of house prices is skewed to the right.

Now, we can actually put a number to the skew. We can measure the skew of our target values. Let's add a markdown cell to this section and commemorate what we're going to do; I'm going to write down "Data Transformations". There we go. To calculate the skew of our target values, all we have to do is grab our price series, "data['PRICE']", put a dot after it, and call the skew method. Hitting Shift+Enter, we see that the skew is 1.1.

Now, what does that mean, and how does that compare to a normal distribution? Well, a normal distribution is completely symmetrical, right? The right tail and the left tail are exactly the same size. So the skew of a normal distribution is equal to 0, and this leads me to think that there's something we can try to improve our model: we can try transforming our price data and then running our regression.

Now, what do I mean by transforming our data? Well, a transformation would be applying some sort of calculation to all the prices in the dataset, something like multiplying all the prices by two or dividing all the prices in half. However, dividing or multiplying our prices isn't the transformation that I have in mind. I want to do something to our prices that will shift our distribution, something that will affect the large house prices in the tail more than the rest. And to accomplish this, I will use a log transformation on our property prices.
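If you want to run that skew check yourself, here it is as a minimal sketch. It assumes "data" is the pandas DataFrame we loaded earlier in the notebook, with the property prices in its 'PRICE' column:

```python
# Skew of the target values; a perfectly symmetrical (normal)
# distribution has a skew of 0. Assumes `data` is the DataFrame
# loaded earlier, with prices in a 'PRICE' column.
price_skew = data['PRICE'].skew()
print(price_skew)  # around 1.1 for our house prices
```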
So what does that mean? If one of the target values, one of the prices, is equal to 7, then the log of this price is equal to 1.95. You can calculate this simply by grabbing a calculator and using the ln function. But here's the interesting part: if a property price is listed as 50, so one of the highest values that we have in our dataset, then the log price of this property would be 3.91. Thus, what the log transformation achieves is this: you get a small change of around 5 for the small price of 7, but a large change of around 46 for the large values in the dataset.

Now, a very reasonable question to ask is: why does this matter? Why should you care? Well, we have to remember that what we're doing here is fitting a linear model to our data. So say we have some data that's distributed like this. If you were to try to fit a straight line to this data, you don't tend to get a very good fit. But if you were to transform this data using the log transformation, then those blue dots would line up like so, and then you could fit a linear regression very, very easily and get a very, very good fit. In other words, based on the skew that I've seen in the distribution, I want to try transforming our data and then fitting the linear regression.

Now, one thing I'm noticing is that I've actually got a typo in my notebook title: this should read multivariable regression, not multivariate regression. There you go. OK.

So how do we apply a log transformation to our target values? Well, what we could do is write something like "from math import log" and use the log function from the math module. However, the log function from the math module doesn't particularly enjoy being applied to an entire data series. Lucky for us, numpy solves this problem. We can access the log function through numpy, so "np.log(data['PRICE'])" will transform our entire series of prices into log prices. I'm going to create another variable called y_log and set it equal to the output of numpy's log function. Let's take a look at what y_log actually looks like. "y_log.head()" will show the first few values, and they look something like this, and "y_log.tail()" will show the last couple of values, and they look something like this.
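Here's that cell as a sketch, again assuming "data" is our DataFrame from earlier, with the ln(7) and ln(50) spot checks from above included:

```python
import numpy as np

# math.log expects a single number, but np.log is vectorised:
# applied to a pandas Series it returns the element-wise logs.
y_log = np.log(data['PRICE'])

print(y_log.head())  # first few log prices
print(y_log.tail())  # last few log prices

# Spot checks from above
print(np.log(7))   # ~1.95
print(np.log(50))  # ~3.91
```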
So what we saw here is that the log transformation was applied successfully to the entire data series. We now have the logs of the property prices stored in a variable called y_log.

Now, I can already hear you asking: where can we see the benefit of using log prices? And I think that's a really, really good question. We're going to look at three things. First off, let's look at the skew of the distribution of the log prices. "y_log.skew()" will output -0.33. So that's nice: with a skew of -0.33, we're a lot closer to zero than with a skew of 1.1.

But what does this distribution look like graphically? Let's pull up our old friend "sns.distplot(y_log)", then "plt.title(f'Log price with skew {y_log.skew()}')", and on the next line "plt.show()". Here we're using seaborn's distplot function and feeding in our log prices. We're going to give this little chart a title, and as an argument we're feeding in an f-string. I've misspelled price, so I'm going to fix that now. Our f-string will take the part that's in between the curly braces and actually calculate the skew, so this bit will show -0.33, and then we're showing our chart. Let me hit Shift+Enter and see what this looks like. Voila! This distribution already looks a lot more symmetrical, and with the skew being so much closer to zero, the numbers confirm it. We can see this difference very, very clearly when we look at the two charts side by side. I'd say this transformation worked really, really well from this point of view.
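Here's that plotting cell in one piece, as a sketch. One caveat: distplot has since been deprecated in seaborn, so on a newer version you'd reach for histplot instead:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of the log prices with the skew in the title via an
# f-string. Note: distplot is deprecated in recent seaborn releases;
# sns.histplot(y_log, kde=True) is the modern replacement.
sns.distplot(y_log)
plt.title(f'Log price with skew {y_log.skew()}')
plt.show()
```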
Now, having looked at the skew, a good question is: can we visualize, graphically, how this transformation makes our data more linear? And the answer is: yes, but it's a little hard to see. It's a little hard to make out on real data. You're just not going to get a graph that's as clear as day as that kind of hypothetical example. The biggest difference that you'll actually be able to spot just by inspecting it is on a plot of property prices versus the LSTAT feature.

So, using seaborn, "sns.lmplot()" will give us our scatter plot with a regression line. This scatter plot is going to have LSTAT, the LSTAT feature, on the x axis; on the y axis, with "y = " as the second argument, it's going to have PRICE. So we're going to compare our non-log prices first and our log prices afterwards. As a third argument we're going to give our data; as a fourth one we're going to say "size = 7" to make the thing a bit larger; as the fifth argument I'm going to add some transparency to our data points with "scatter_kws = " and then a dictionary, "{'alpha': 0.6}"; and then I'm going to add a colored regression line, which I can do with "line_kws = {'color': 'darkred'}". There we go. I'm going to spread this over two lines, add "plt.show()", and hit Shift+Enter.

Tada! This is a larger version of something we've already seen in our pairplot a little earlier, and one thing you can see is that this regression line doesn't fit the data as well as it could. Just by looking at this, we can see that the relationship between LSTAT and price might not be a linear one.

But what about our log prices? Let me take this cell, copy it, paste it, and change the color to 'cyan'. What I'm going to do now is create a new variable, a new dataframe, called "transformed_data", and I'm going to set it equal to features, which we created up here. Our features variable was equal to our dataframe minus the price column, so we're going to reuse that down here. Then we'll add another column to the transformed_data dataframe. I'm going to call that column 'LOG_PRICE', all caps, and it's going to be equal to "y_log", which we created up here, so we can reuse this variable as well.

Okay, so we've created a new dataframe that has all the features and our log price, and in our lmplot function we're going to feed in the transformed data, and on the y axis we're going to be plotting LOG_PRICE, the LOG_PRICE column from the transformed_data dataframe. So let's hit Shift+Enter and see what we get. Tada! Here's our LOG_PRICE versus LSTAT, and here is our untransformed, normal price versus LSTAT. Now, we can see somewhat of an improvement based on this particular scatter plot, but trying to spot the differences visually against all the individual features isn't really all that useful.
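For reference, here are both plotting cells together as one sketch, assuming "features" is the inputs DataFrame from earlier; note that seaborn later renamed lmplot's size argument to height:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Untransformed PRICE vs LSTAT, with a fitted regression line.
# Note: `size` was renamed to `height` in later seaborn releases.
sns.lmplot(x='LSTAT', y='PRICE', data=data, size=7,
           scatter_kws={'alpha': 0.6},
           line_kws={'color': 'darkred'})
plt.show()

# Log prices vs LSTAT. `features` is the inputs DataFrame (our data
# without the PRICE column) created earlier; copying it here keeps
# the original frame unchanged when we add the new column.
transformed_data = features.copy()
transformed_data['LOG_PRICE'] = y_log

sns.lmplot(x='LSTAT', y='LOG_PRICE', data=transformed_data, size=7,
           scatter_kws={'alpha': 0.6},
           line_kws={'color': 'cyan'})
plt.show()
```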
What we really want to do is rerun our regression and see the combined effect of using log prices instead of the standard prices.

The thing to note is that now we've got a different model, right? We're actually going to be changing our regression model; we're going to be using log prices instead of normal prices. So our previous equation looked like this, and our new equation will look like this. And because it's a very different model that we're using, all the theta values will in fact change, and the interpretation will also have to take into account that a unit change in, say, distance or the number of rooms now reflects a change in the log price. So that's just something to be aware of.

Let's see this in action. I'm going to come up here in our notebook to where we were splitting our dataset, copy this cell, and paste it down here. The first thing I'm going to do is change this line of code: I'm going to say "prices = np.log(data['PRICE'])", so here we're using log prices when it comes to shuffling and splitting our dataset. The next thing I'm going to do is delete this line of code, and then I'm going to come back up to where we were running our regression, copy these cells, come down here, paste them in, and delete this comment. This cell here I'm going to change to markdown, put two hash tags there, and write down "Regression using log prices".

Now I'm going to hit Shift+Enter, and this is the output that we get. So the question is: how do we interpret this? Let's look at the r-squared values first. The r-squared on our training data is 0.79, which is actually an increase from before; before, we had 0.75. And we also see an increase on the testing data: here, the value went from 0.67 to 0.74. So the performance of our model improved on both counts, and based on this, I'd say our little data transformation experiment was a success. Reducing the skew in our distribution of target values allowed us to improve our model, and as a result we have a higher r-squared and a better fit.
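Here's a sketch of that re-run, assuming the earlier cells used scikit-learn's train_test_split and LinearRegression; the test_size and random_state shown are placeholders that should match whatever the original split cell used:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Re-run the regression on log prices instead of raw prices.
prices = np.log(data['PRICE'])

# Placeholder split parameters; match them to the earlier split cell.
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=10)

regr = LinearRegression()
regr.fit(X_train, y_train)

print('Training r-squared:', regr.score(X_train, y_train))  # ~0.79
print('Test r-squared:', regr.score(X_test, y_test))        # ~0.74
```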
But I also said that the interpretation of the parameters has changed. Previously, the coefficient for the Charles River dummy variable was around 2, so people were willing to pay 2,000 dollars more to live close to the river. Now the coefficient on the Charles River dummy variable is 0.08. In order to figure out how much more people are willing to pay to be close to the river according to this model, what we have to do is reverse the log transformation, because not even mathematicians think in log prices when they go to the supermarket. So let's see what this value translates into in actual dollar prices.

I'm going to copy this whole thing, and in this cell down here I'm going to add a comment, "Charles River Property Premium". In this cell I'm going to reverse the log calculation so we can see what the dollar value is. Now, as a math refresher, the way the log transformation worked was that we took the log of our prices: if the price was 12, then the log price would be equal to approximately 2.485. To reverse this transformation, we have to raise Euler's number, which is approximately 2.7, to the power of 2.485, and then we get back our starting price of 12. So this is how we're going to reverse that calculation. We need to get hold of this number e, and we can get hold of it through numpy: "np.e" will give us access to this irrational number, and with ** we can raise e to the power of 0.080475, our coefficient. Now we can see how to interpret this coefficient: what our new model is telling us is that people are willing to pay approximately 1,084 dollars more to live close to the river.
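As a sketch, that cell comes down to a single expression:

```python
import numpy as np

# Charles River Property Premium: reverse the log transformation on
# the coefficient. np.e is Euler's number; ** raises it to a power.
premium = np.e ** 0.080475
print(premium)  # ~1.0838; with prices quoted in thousands of dollars,
                # the lesson reads this as roughly $1,084
```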
Okay, so that was quite a big lesson; quite a lot going on. Through our process of transforming our target values, we've created a whole new regression, and this new and improved regression fits our dataset even better. But because we're now working with log prices in our model, our interpretation of the coefficients has also changed: to get back to changes in dollar values, we have to reverse the log transformation when it comes to actually interpreting the meaning of our coefficients. And speaking of coefficients, in the next lessons we're going to be looking at them in more detail. We're going to be evaluating our coefficients and looking at their significance, their p-values. I'll see you there.