In the previous lesson we talked about how the Pearson correlation should only really be calculated on continuous data, and also about how outliers can distort the picture and mislead us. The example I used to illustrate this point was Anscombe's quartet, whose four charts show how very different patterns in the data can give you very similar descriptive statistics if you're not careful. What I want to show you now is that this isn't just a hypothetical example. We can see something similar happening in our housing dataset when we don't apply the right tools to the right kind of data and aren't careful with our interpretations.

Remember what a correlation of 0.9 or 1 is meant to look like? You'd expect to see a chart where the data points almost form a straight line. Looking at our correlation matrix, we've actually got a very high number here: 0.91 as the correlation between RAD, our access to radial highways, and our TAX feature. So given our expectation of what a high correlation between two variables should look like, let's visualize this relationship and see what we get.

I'm going to copy this cell with our jointplot from seaborn, paste it in here, move the cell down, and change the references from distance and nitrous oxide to TAX and RAD. For the color I'll go with, I don't know, maybe dark red, and then hit Shift+Enter. Voila! This is what we get. Now, this picture is probably not what we would have expected had we only looked at the correlation. But we've said before that the very high correlation between these two variables is a result of the data not being continuous, and of the big outliers in the top right corner that are driving the correlation.

One very interesting exercise is to plot the linear regression onto this chart and see how it's affected by the data. So let's see what happens if we run a linear regression between these two features. Previously, we've run our regressions separately using the scikit-learn module and plotted our best fit lines on a matplotlib chart. Let me show you a shortcut for doing this very quickly with seaborn. Coming down here, in this cell I'm going to make use of a function called "lmplot".
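For reference, here's roughly what these two cells look like. This is a sketch rather than the exact notebook code: it assumes the DataFrame is named data and that seaborn and matplotlib are already imported in earlier cells, and note that newer seaborn releases renamed the size parameter to height.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot with marginal distributions: TAX vs RAD
sns.jointplot(x=data['TAX'], y=data['RAD'], color='darkred', size=7)
plt.show()

# The same relationship with a fitted regression line overlaid
sns.lmplot(x='TAX', y='RAD', data=data, size=7)
plt.show()
```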
So: "sns.lmplot()". The x is going to be our TAX, the y is going to be RAD, and for the data we're going to feed in, the data keyword argument is going to be our DataFrame. So it's "data=data, size=7". What this line of code does is look inside our DataFrame, which is this bit of Python code here, for a column called RAD, and also for a column called TAX. The thing that might be a little confusing is this "data=data" Python code: the first part, "data=", is the keyword argument, and the "data" after the equals sign is the name of our Python DataFrame.

I'm going to add another line, "plt.show()", and hit Shift+Enter. Voila! Just like that, we've plotted a linear regression between our two features on this chart. Seaborn has done a lot of work for us here, and I hope you're seeing just how useful this little Python module really is.

Now let's interpret and make sense of what we're looking at on this chart. The data points in the top right corner are affecting the slope of our regression line significantly. It's as if they're pulling the regression line up to make it steeper. Looking at this, you can see why a linear regression between RAD and TAX might not be such a good idea: the regression line is meant to represent the data, right? Here we're forcing a linear regression model onto a dataset that isn't really suited for it. Our computers will happily do the calculation, but we end up with a model that isn't very useful for capturing the true relationship between accessibility to highways and tax.

But luckily we're in the business of estimating house prices, right? So what we should actually be looking at is how our features relate to our target, namely our Boston property values. Coming down here, I want to give you another challenge. Pause the video and create a few more scatter plots. The first one should be between the number of rooms, RM, and our target, which is the PRICE series in our DataFrame. You can use either matplotlib or seaborn to accomplish this. I'll give you a few seconds to pause the video.

Ready? Here's the solution.
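Here's a sketch of one possible solution. The column names, the skyblue color, the chart title, and the axis labels come straight from the lesson; the figure size, point transparency, marker size, and font sizes are my guesses at the cell being copied, and the same size-versus-height caveat applies to lmplot.

```python
# Correlation between the number of rooms and the price
rm_tgt_corr = round(data['RM'].corr(data['PRICE']), 3)

# Matplotlib version
plt.figure(figsize=(9, 6))
plt.scatter(x=data['RM'], y=data['PRICE'], color='skyblue', alpha=0.6, s=80)
plt.title(f'RM vs PRICE (Correlation {rm_tgt_corr})', fontsize=14)
plt.xlabel('RM - median number of rooms', fontsize=14)
plt.ylabel('PRICE - property price in 000s', fontsize=14)
plt.show()

# Seaborn version, with the regression line pulled up as well
sns.lmplot(x='RM', y='PRICE', data=data, size=7)
plt.show()
```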
Now I'm going to be very, very lazy: I'm going to come up here, copy the code I've previously written, and paste it down here. Then I want to change all the references from NOX and DIS to RM and our target. So it's "data['RM']" and "data['PRICE']" for the correlation, and of course I'm changing the inputs to the scatter plot as well. For the color, I'm going with "skyblue" to differentiate a little bit, and the title is going to read "RM vs PRICE". For the correlation, of course, I'm going to use "rm_tgt_corr". The x label is going to be RM, the median number of rooms, and the y label is going to be "PRICE - property price in 000s". Let's see what we get. Voila!

Now let me do the same thing with seaborn, and you know what, I'm going to use "lmplot" to pull up the regression line as well. So: "sns.lmplot(x='RM', y='PRICE', data=data, size=7)" followed by "plt.show()".

Okay, so how do we interpret what we're seeing here? Well, the first thing is that our positive correlation of approximately 0.7 ties out really nicely with the relationship we're seeing on the scatter plot. This means that room size alone has a very clear relationship with our house price, and I think that's kind of neat to see.

One of the other things I find quite interesting about this chart is that there almost seems to be a ceiling on the property prices at the 50,000 dollar mark, because a lot of data points sit on this line here. I find that quite curious. It could be coincidence, but it might also have something to do with how the data was collected in the 1970s. For the purposes of this tutorial, though, I'm not going to dig into those kinds of details; there's so much other stuff I want to focus on.

One thing you might already be thinking at this point is: we've got 13 variables, and there are just too many combinations to graph individually. We can't possibly copy and paste our cells to graph all these scatter plots one by one, right? And yeah, we're not going to do that, because remember, we have to think like lazy programmers. And what would lazy programmers do?
They would graph every single combination at once, and this is where seaborn comes to the rescue once again, because seaborn has a function called "pairplot" that will do just that. So: "sns.pairplot()", and as an argument it needs our entire DataFrame. "sns.pairplot(data)" followed by "plt.show()" will create scatter plots between all our features and our target.

But I hope you didn't hit Shift+Enter just now and run this thing already, because there's something I want to show you before I execute this cell. I want to show you a little bit of Jupyter-notebook-specific code, a little bit of Jupyter notebook magic if you will, because the thing is, pairplot has to do a lot of calculations and takes some time to run. What I want to show you is how you can benchmark, or time, specific bits of code. People call this microbenchmarking, because we're benchmarking a few lines of code, an individual cell. There's a magic command in Jupyter notebook for this, and that command is "%%time".

Now I'm going to hit Shift+Enter to run the cell. Ready? Here we go. My computer has produced all these plots. Check it out. And because I added that Jupyter notebook magic, I've got some additional information printed out here as well, showing me how long my code took to run. When you're running this at home, I don't know how long it'll take you: maybe 19 seconds like it took me, maybe 40 seconds, maybe 5 seconds, depending on the kind of machine you've got.

But I think this is a really neat Jupyter notebook feature, because suppose you want to choose between two different algorithms or two different ways of running your code. If you've got something that takes a while and you're looking to optimize it, or you want to run a horse race between two algorithms, this timing capability can come in quite handy. We've done this partly out of curiosity, and partly so you can see that your computer didn't freeze up, because this bit of code took some time to run for me as well. But that said, it has some other practical applications too.
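Put together, the timed cell looks like this. One thing to keep in mind: the "%%time" magic has to be the very first line of the cell.

```python
%%time
# Scatter plots for every pair of columns; pairplot also puts
# histograms on the diagonal where a column meets itself
sns.pairplot(data)
plt.show()
```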
Another thing to know about Jupyter notebook, on this note, is that up here in the right-hand corner there's a little dot, which for me is currently not filled in. When I hover over it, it says "Kernel idle". This means that Jupyter notebook is currently not doing anything. If I were to execute this cell again, you'd see that during the execution the dot is filled in and it says "Kernel busy". This gives you an indication that the notebook is doing a lot of work, and that even though it might look like nothing is happening, your computer is actually working. That's another handy little thing to know about Jupyter notebook.

Now let's take a look at our output. What you see here is a cropped, zoomed-out version of the image at its actual size; the image is in fact a lot larger than what's being displayed. If you want to see the full-size image, which I encourage you to try, you can right-click on it, click "Save As", stick it in your Downloads folder as pairplot or what have you, and then open it from your Downloads folder in your graphics program of choice. I've got a Mac and I'm using Preview here, so I can go to "Actual Size", and here you see a zoomed-in, 100 percent version of this image.

The other thing you can do is open the image in a new tab. If I right-click on the image (I'm using Firefox here, so this will be a little different in Chrome, Safari or Internet Explorer), I can open it directly, and then I don't have to download the image and dig it out of my Downloads folder. I think that's quite handy. Now I've got two tabs: one with the full-sized version of the image and one with my notebook. And what we can see here is that we've got scatter plots for every single column in our DataFrame.
So we've got Price vs Crime, Price vs Zone, Price vs Industry, Price vs CHAS, Price vs NOX, Price vs RM, and then we've also got LSTAT vs Crime, LSTAT vs Zone, LSTAT vs Industry. But the really neat thing is that on the diagonal down the middle, where a column would be plotted against itself (say, Crime vs Crime), we get a histogram instead. So this is the histogram for the ZN feature, this is the histogram for INDUS, this is the histogram for CHAS, and so on. And down here we've got the histogram for Price. So the diagonal shows us the histograms, and everything else shows us a scatter plot.

Scrolling down to the last row, namely the Price row, we can check for which features there is a very clear relationship between property prices and the feature. Here we have Crime and Price. Going a bit to the right, we've got Price and Industry, and we can definitely make out some sort of relationship here, right? Going a bit further, we've got Price and Room size; this is the scatter plot we created earlier, and there was definitely a relationship there too. Going all the way to the right, we see this one here: Price and LSTAT. This scatter plot is showing us a very clear relationship between these two things. It almost reminds me of Pollution and Distance in terms of its shape, right?

But what is LSTAT? We haven't really talked about this feature before, so let's take a look at the description. Coming back to the top of the Jupyter notebook, where we printed out the description, we see that for LSTAT it says "% lower status of the population". Now, this wording might not be the politically correct phrasing people would use these days, but in a nutshell this feature measures the socioeconomic background of the people in the neighborhood.
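If you'd rather not scroll, here's one way to pull that description back up. I'm assuming here, as earlier in the course, that the data came from scikit-learn's built-in Boston dataset; note that load_boston existed at the time of recording but has since been removed from recent scikit-learn releases.

```python
from sklearn.datasets import load_boston  # assumption: how the data was loaded earlier

# Print the dataset documentation, which includes the one-line
# description of each feature, such as LSTAT
boston_dataset = load_boston()
print(boston_dataset.DESCR)
```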
I actually wanted a little more detail on LSTAT, so I checked back with what it said in the original research paper to get a more complete description. What it says there is that by "lower status", the data is actually capturing things like the percentage of people who don't have a high school education and the proportion of male workers classified as laborers.

When I read this description, I immediately looked back at my correlation matrix, because I wanted to check the correlation between LSTAT and Industry: that is, between the proportion of industry in a neighborhood and the proportion of people who are classified as laborers or who don't have a high school education. That correlation, it turns out, is 0.6, so it's both positive and fairly high. Having read the description, this correlation all of a sudden made a whole lot more sense. And coming back to the scatter plot between LSTAT and house prices, what we can also clearly see is that property prices tend to be lower in areas with a high proportion of people who have no high school education or who work in factories. In other words, the property prices appear to be affected by the kind of people who live in the neighborhood. But we shouldn't jump to conclusions just yet; our multivariable regression will tell us a lot more.

Speaking of regression: looking at these charts and the pairplot, wouldn't it be nice to plot a regression line onto all of these plots as well? I think that would make things a lot clearer, don't you think? Let's go back to our Jupyter notebook and go down into the next cell, because seaborn will happily oblige if we add an extra argument to the powerful pairplot function: "sns.pairplot(data, kind='reg')". All we have to do is change the kind argument from "scatter", which is the default (pressing Shift+Tab to bring up the documentation, we can see that "kind='scatter'" is this function's default), to regression, which has the code "reg".

Now, I know what this regression will look like, because I've already run this code before. The thing is, the regression line will be colored exactly the same way as our data points. It'll be blue on blue, which won't be very helpful. We want to color our regression line a little differently.
And I want to do that now. I don't want to execute this code, wait 20 seconds, then add more code and wait another 20 seconds. So I'm going to add another keyword argument here, and that's going to be "plot_kws". I'm going to set it equal to a dictionary; this is how we're going to add a splash of color to our regression line. A dictionary always has curly braces, and then it has a key and a value. The key is going to be "line_kws", and the value comes after the colon. So what's the value we're going to put here? It turns out it's going to be another dictionary. So we have curly braces again, then another key, which is going to be "color", then a colon, and here's the value: "cyan", which is the color I'm going to use.

What we're looking at now is a dictionary, "{'color': 'cyan'}", which is itself the value for the "line_kws" key of the outer dictionary. In other words, we've got a nested dictionary, a dictionary inside a dictionary. It's like the most boring version of the movie Inception, but this is how we're going to get a color for our regression line. All that's left to add is "plt.show()". And because I'm a very curious person, I'm also going to add the "%%time" Jupyter notebook magic.

Now I'm going to hit Shift+Enter. *drum roll*... What is happening? I should really get a coffee while this is running. There we go, now it's finished. How long did it take to run? 47 seconds. Absolutely brutal. But let's take a look at the result.

Again, we can see just how big this image is: it's about 2,500 by 2,500 pixels, and this is at 25 percent; 100 percent looks more like this. With the help of the regression line on our chart, we can now also see the positive relationship between our Industry variable and LSTAT visually. We know the correlation was around 0.6, and now we can see that borne out in the positively sloped regression line right here. And in the last row, we can see a regression line between each feature and our target variable individually.
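Assembled, the whole cell looks like this. Both "plot_kws" and "line_kws" are genuine seaborn arguments: pairplot forwards plot_kws to the underlying regplot call when kind='reg', and regplot in turn passes line_kws to the regression line it draws.

```python
%%time
# Regression pairplot; the nested dictionary recolors the fitted
# lines cyan so they stand out against the blue data points
sns.pairplot(data, kind='reg', plot_kws={'line_kws': {'color': 'cyan'}})
plt.show()
```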
This is really neat, because we can see whether each slope is positive or negative; it's basically doing a univariate regression 13 times. But the power of our analysis isn't going to come from looking at these regressions against the target in isolation. The power comes from combining all our explanatory variables and running a multivariable regression. Now, I have a feeling this air-pollution-themed cartoon reference is betraying my age a little bit. The point I'm trying to make is that we're going to combine all the explanatory power in our features to estimate our property prices in Boston, and our model of choice is called multivariable regression. After we've done all that, we're going to evaluate our results and see if we can improve our model further. I'll see you in the next lesson.