In the meantime, let's do a little bit of work in our Python code to make this table a little bit more clear. Let's visualize our correlations in a way that we could put into a really snazzy report. And to do this, we're going to represent our correlations as a triangle instead of this whole table here. We don't need to show all these duplicate values; showing them doesn't really add anything and it just makes the whole thing look really, really busy.

So my goal is to hide half of this table, and to accomplish this, I will create an array which will help me separate the values that I don't want to show from the values that I do want to show. I'm going to call this filter array "mask" and I'm going to set it equal to an array that's identical in size to this table of correlations, this correlation matrix that we've got up here. The module that I will use to help me do this is called numpy, and I'm going to have to add it to my notebook imports at the top in order to use it. So I'm going to say "import numpy as np", hit Shift+Enter on this to import the module, and scroll back down here. Then I'm going to use the "zeros_like" function from the numpy module, so I'm going to write "np.zeros_like()", and this function will create an array of zeros that is like whatever array is passed into it as a parameter; in our case that's going to be the return value from calling the correlation method on our dataframe. So let's have a look at what this mask array looks like at the moment. I'm going to hit Shift+Enter here and we can see that we have an array of, well, just zeros.

Now I need to make another modification. To filter on the values in the top triangle, I first need to know the indices of these cells in my array. Thankfully there is another numpy function that will help me find these. So I'm going to say "triangle_indices", which is going to hold on to all the indices in the top triangle of my array, and I'm going to set that equal to "np.triu_indices_from()", passing in my mask. This will retrieve the indices for the top triangle of the array. And now that I've got my indices, I can use my mask array to select just those cells and change their values. So I'm going to say "mask[triangle_indices]" and set those equal to True.
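Here is a minimal sketch of the masking steps so far, assuming the housing data lives in a pandas DataFrame called data, as it does elsewhere in this notebook:

```python
import numpy as np

# Correlation matrix of all the features (Pearson by default)
corr = data.corr()

# An array of zeros with the same shape as the correlation matrix
mask = np.zeros_like(corr)

# Indices of the upper triangle, i.e. the duplicate half we want to hide
triangle_indices = np.triu_indices_from(mask)

# Flag those cells; True is stored as 1, the rest stay 0
mask[triangle_indices] = True
```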
Let me show you what our filter looks like now. So I'm going to say "mask", hit Shift+Enter, and then you can see here that the top triangle in this array has the value 1 and the bottom triangle has the value 0, and that's because True is mapped to the value 1 and False is mapped to the numerical value 0. So with this in hand I can now move on to creating this beautiful visualization that I keep talking about.

We're going to use our old friends seaborn and matplotlib to accomplish this. The first thing I'm going to do is set the size of our figure, so I'm going to say "plt.figure(figsize=(16, 10))". And then I'm going to use seaborn's heatmap function to generate a heat map of our correlations. We imported the seaborn module as "sns", so I put a dot after it, write "heatmap()", and then within the parentheses I provide our correlations. This was the value returned by calling the corr method on our dataframe. So I'll leave it like this, "sns.heatmap(data.corr())", and then I'm going to show our plot with "plt.show()". Let me hit Shift+Enter to see what this looks like.

Voila! Look at that. We're almost there. What we can see already is that the different colors show us that strong positive correlations have a dark red color and strong negative correlations have a dark blue color. Anything that's close to zero is pale or white. So this color scheme is actually conveying quite a lot of information already, which is really, really neat on the visualization front.

Now if you're having trouble reading what it says down the sides and at the bottom of this chart, we can increase the font size of these labels with "plt.xticks(fontsize=14)", and I can do the same for the y axis with "plt.yticks(fontsize=14)". Hitting Shift+Enter, we see it updated like so. So now it's a bit easier to read.
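At this point the plotting cell looks roughly like this; a sketch, assuming matplotlib and seaborn are already imported as plt and sns in the notebook imports:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 10))

# Heat map of the correlations: dark red for strong positive,
# dark blue for strong negative, pale for values near zero
sns.heatmap(data.corr())

# Larger fonts for the feature names along both axes
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.show()
```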
So "mask = mask" and this might 64 65 00:06:41,230 --> 00:06:51,580 look very confusing but this mask here refers to our variable in this cell here and this Python code 65 66 00:06:51,580 --> 00:06:58,980 reading "mask = " refers to the name of the key word in this function. 66 67 00:06:58,990 --> 00:07:01,810 Let me Shift+Enter and show you what this looks like. 67 68 00:07:03,140 --> 00:07:08,850 Voila! Now we've effectively hidden half of our chart. 68 69 00:07:09,640 --> 00:07:17,710 So I'm going to modify this even further, I'm gonna add the actual values of our correlations on our heat map 69 70 00:07:18,150 --> 00:07:26,050 because what I want to do is I want to display these numbers here on our chart with the colors, 70 71 00:07:26,560 --> 00:07:36,910 so I'm going to say "annot = True" and hit Shift+Enter. Now you'll see the values of the correlations 71 72 00:07:37,240 --> 00:07:40,570 being displayed in the heat map. 72 73 00:07:40,570 --> 00:07:46,050 Of course, by default these numbers actually get really small and difficult to read. 73 74 00:07:46,050 --> 00:07:46,600 I don't know why, 74 75 00:07:46,600 --> 00:07:55,220 it's just how it is. So we can increase their font size with another keyword argument, so we can say 'annot 75 76 00:07:55,650 --> 00:07:56,730 _kws = 76 77 00:07:56,780 --> 00:08:04,200 {" 77 78 00:08:04,360 --> 00:08:10,530 size": 14}'; 14 78 79 00:08:10,570 --> 00:08:15,580 is gonna be the font size of our annotations. 79 80 00:08:15,580 --> 00:08:22,550 The value of this annot_kws argument is given as a dictionary. 80 81 00:08:22,560 --> 00:08:28,750 It's a Python dictionary that we're looking at here and you can always spot Python dictionaries very 81 82 00:08:28,750 --> 00:08:39,640 very easily with this kind of curly bracket notation and a key value pair or some key value pairs inside. 82 83 00:08:39,790 --> 00:08:48,900 The key here is the string "size" and the value is 14. 83 84 00:08:48,980 --> 00:08:52,530 These are always separated by this colon. 84 85 00:08:52,730 --> 00:08:54,440 Let me hit Shift+Enter and update 85 86 00:08:54,440 --> 00:08:55,270 the heat map now. 86 87 00:08:57,150 --> 00:08:59,060 Voila! Brilliant! 87 88 00:08:59,120 --> 00:09:06,380 Now the only thing I find a little bit strange is why this background here is not all white, because 88 89 00:09:06,380 --> 00:09:09,840 I expected the styling to be a little bit different. 89 90 00:09:09,860 --> 00:09:14,850 I expected this to be a white background instead of this gray here. 90 91 00:09:15,020 --> 00:09:19,790 Now if you're also seeing something a little bit unexpected like this on the styling front, you can 91 92 00:09:19,790 --> 00:09:29,030 always set the style manually of seaborn with "sns.set_style()" and then 92 93 00:09:29,180 --> 00:09:31,260 provide the name of a style. 93 94 00:09:31,280 --> 00:09:39,540 So I'm going to go with white and hit Shift+Enter and line of code should force this background color 94 95 00:09:39,540 --> 00:09:42,580 here to be set to white. 95 96 00:09:42,660 --> 00:09:48,210 But you know, the thing is all in all writing this Python code with the mask and with seaborn and the 96 97 00:09:48,210 --> 00:09:51,040 heat map it's kind of like the easy part actually. 97 98 00:09:51,870 --> 00:09:57,900 The much harder part is making sense of what it is that we're actually looking at here. 98 99 00:09:59,080 --> 00:10:01,880 What is it that we can learn from this correlation matrix? 
But you know, the thing is, all in all, writing this Python code with the mask and seaborn and the heat map is kind of the easy part. The much harder part is making sense of what it is that we're actually looking at here. What is it that we can learn from this correlation matrix?

So first off, you and I said we're going to be looking at two things: strength and direction. An example of a strong positive correlation would be something like NOX and INDUS. The INDUS feature measures the proportion of non-retail business acres per town, and the NOX feature measures the nitric oxide concentration in parts per 10 million; at least that's me reading it off the documentation on the feature descriptions. These two features have a correlation of 0.76. So the question is: does this make sense? And I think, yeah, yeah it does. I would expect the pollution to be higher in industrial areas. The amount of industry and the amount of pollution should be positively correlated.

But looking at this table a little bit more, you know what I found quite interesting? It's the correlation of TAX and the industry variable: higher tax levels are apparently associated with more industrial areas. I actually found this quite surprising. Coming across these kinds of relationships is why the correlation matrix is a useful tool for data exploration, but there are of course, as with everything, some limitations. Looking at this heat map, we can see that the highest correlation of all is the one between TAX and RAD, access to radial highways. This is a positive correlation of 0.91, which seems super high. Now, remember how we looked at the documentation of this correlation function? We went up here, hit Shift+Tab and learned that the default method for calculating this correlation is the Pearson method.
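If you want to double-check any of these individual numbers, or confirm which method is being used, you can do that directly in pandas. A small sketch, using my own loc lookups and assuming the feature columns are named NOX, INDUS, TAX and RAD as in this notebook:

```python
# Pearson is the default method for DataFrame.corr()
corr = data.corr(method='pearson')

# The pairs discussed above
print(corr.loc['NOX', 'INDUS'])   # around 0.76 for this dataset
print(corr.loc['TAX', 'RAD'])     # around 0.91 for this dataset

# Or compute a single pair straight from the columns
print(data['NOX'].corr(data['INDUS']))
```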
Now it turns out that one of the things you have to know about this type of correlation is that it makes some assumptions about the kind of data it's running on. This correlation calculation is actually only valid for continuous variables, and this means that it's not valid for, say, a dummy variable like whether a property is on the Charles River or not, because that's not a continuous variable; it's only got two values, 0 or 1. And looking back up here where we created our histogram for accessibility to radial highways, we can see that this is not a continuous variable either. This feature was an index, if you remember. And what this means is that our correlation calculation is actually not valid for the RAD feature, because RAD is not a continuous variable. This goes to show that it's very important to know how the individual features are measured, what units they're in and what the distribution of the data looks like for these features, because we can only use statistical tools that are appropriate for the kind of data we're working with.

Okay, so let's look at this last row down here, the row that reads price, which is our target value. On this row you see the correlation of all the features in our model with the price, with our target. One of the things I'm interested in looking for here is the features for which we don't find a relationship, the features for which the correlation is close to zero. The lowest correlation of course is with the Charles River dummy variable. But as we've just said, CHAS is a dummy variable that only takes the values 0 and 1, so the correlation measure is actually not appropriate. But what about the next lowest one? The next lowest one is this one called DIS, and DIS is defined as the distance from employment centers.

Now, that's interesting. So DIS is not very correlated with price, but DIS is very highly correlated with the industry feature. Looking here we see that there is a correlation of -0.71 between DIS and INDUS. The reason I suspect this is the case is that many industrial areas are probably employment centers, so being far away from an employment center is associated with a low amount of industry. And this discovery adds something to my to-do list for the regression analysis stage: what we should probably do is check if our distance feature adds explanatory value to our regression model. In other words, does having both the industry feature and the distance feature included in the regression make our model better or worse? Can we get away with just having the industry feature, for example? Because the thing is, if a feature is not adding any explanatory value, it's often better to exclude it and try running the regression without it, because by excluding features you might end up with a simpler model, and simplicity is usually a good thing.
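To read that last row without squinting at the heat map, you can also pull it straight out of the correlation matrix and sort it. A sketch, assuming the target column in this notebook is called PRICE; adjust the name if yours differs:

```python
# Correlation of every feature with the target, sorted from lowest to highest
# (assumes the target column is named 'PRICE')
print(data.corr()['PRICE'].sort_values())

# The specific pair discussed above
print(data['DIS'].corr(data['INDUS']))   # around -0.71 for this dataset
```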
Okay, so where does this leave us? The correlation matrix is no silver bullet for data exploration. While it may not answer all our questions, it can give us a bit more perspective, and it has its pros and cons, its strengths and its limitations, just like every other tool.

Regarding the pros: we've learned something about our data, namely that the amount of tax and the amount of industry are correlated, and we've added something to our to-do list for later, namely that we should investigate whether we really need the DIS feature in our model or not. Another pro is that we've learned that certain features with high correlations are possible sources of multicollinearity. Now, I emphasize the word possible, and this is another thing for our to-do list. High correlations don't necessarily imply the problem of multicollinearity, but we will revisit this issue during the regression analysis stage by running a formal test for it.

We're also learning a few things about the weaknesses of looking at correlations. For example, we've learned that the correlation calculations assume continuous data. The Pearson correlation calculation that we've looked at is not valid if the data is not continuous, as is the case with our accessibility index or our Charles River dummy variable. And a second limitation that everybody likes to harp on about is that correlation does not imply causation. Just because two things move together doesn't mean that one thing causes the other. In other words, everybody who drank water in 1850 is now dead, but this doesn't mean that drinking water will kill you. In fact, if you look at enough data and you look hard enough, you will find all sorts of weird correlations out there. Just google "funny correlations" or "spurious correlations" and you'll find a bunch of great examples of completely unrelated things that move together purely by chance. And if you do this, you'll probably come across Tyler Vigen's website, which uses census data and data from the U.S. Department of Agriculture to show that divorce rates in Maine and margarine consumption are in fact highly correlated. So the earlier chart of mine showing a zero correlation between these two things was in fact a lie; Tyler's chart shows us how it actually works.
Now, another limitation of correlations is that they only check for linear relationships, and it turns out that a low Pearson correlation coefficient does not mean that there is no relationship between two variables. Let me show you some examples so you can actually see what I mean. Here's some fictional data on a chart showing x and y values; X and Y have a correlation of 0.816. And let me show you a different chart. This is some more fictional data, and the correlation between X2 and Y2 is in fact also 0.816. And on this third chart here, you guessed it, the correlation is also 0.816. And the same goes for this fourth chart: X4 and Y4 also have a correlation of 0.816.

In fact, these graphs are very famous. They're called Anscombe's Quartet, and they're named after an English statistician who came up with them (if you'd like to play with these four datasets yourself, there's a short sketch at the end of this lesson). These four graphs have very, very similar descriptive statistics and a very, very similar regression line, but of course they're showing us completely different relationships. They're showing us that outliers and non-linear relationships often only become apparent after visualizing the data. And this is what this implies: it means that it's important to look at these correlations and these descriptive statistics in conjunction with some charts. With this in mind, we're going to be complementing our analysis of the correlations with some more graphical analysis. That way we can discover if there are any hidden non-linear relationships or outliers in our data. As such, we're going to be visiting our old friend again, the scatter plot.

But before we move on, I can't resist showing you this infamous comic strip from XKCD. If this is the kind of humor that appeals to you more than you'd care to admit, then I highly recommend subscribing to XKCD's RSS feed and getting your dose of geeky web comics on a regular basis. I'll see you in the next lessons. Take care.
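As promised, a short sketch for anyone who wants to poke at Anscombe's quartet themselves: seaborn happens to ship a copy of it as one of its built-in example datasets (load_dataset downloads it the first time it's called, so it needs an internet connection):

```python
import seaborn as sns

# Columns are 'dataset' (I to IV), 'x' and 'y'
anscombe = sns.load_dataset("anscombe")

# All four datasets have (nearly) the same Pearson correlation of about 0.816...
for name, group in anscombe.groupby("dataset"):
    print(name, round(group["x"].corr(group["y"]), 3))

# ...but plotting them reveals four completely different relationships
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, ci=None)
```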