0 1 00:00:00,390 --> 00:00:08,190 First let's add a markdown cell and some LaTeX notation, so I'm going to change this to "Markdown", put two pound 1 2 00:00:08,190 --> 00:00:09,100 symbols there, 2 3 00:00:09,140 --> 00:00:10,170 write 3 4 00:00:10,470 --> 00:00:12,040 "Correlation". 4 5 00:00:12,050 --> 00:00:16,500 The LaTeX notation we're gonna add is gonna look like this. Two dollar signs and then I'm going 5 6 00:00:16,500 --> 00:00:23,640 to have the Greek symbol for rho and then we're going to have an "_{ 6 7 00:00:23,640 --> 00:00:24,590 XY}", 7 8 00:00:24,630 --> 00:00:34,710 and then " = corr(X, Y)". 8 9 00:00:34,860 --> 00:00:37,030 And then two dollar signs. 9 10 00:00:37,080 --> 00:00:39,010 Let's see what this looks like. 10 11 00:00:39,030 --> 00:00:42,190 I think this makes a nice section heading. Just for completeness, 11 12 00:00:42,210 --> 00:00:50,220 let's add two more pound symbols, two dollar signs, -1.0 and then let's have the following 12 13 00:00:50,220 --> 00:01:00,210 tag "\leq" and then we'll have "\rho_{ 13 14 00:01:00,330 --> 00:01:03,510 XY}" and then another leq tag, 14 15 00:01:03,550 --> 00:01:11,440 so "\leq +1.0" and then two dollar signs at the end for the closing tag. 15 16 00:01:12,900 --> 00:01:20,970 Now we've got the minimum and maximum values of the correlation in our heading as well, and in the LaTeX 16 17 00:01:20,970 --> 00:01:27,510 notation the Greek symbol rho is given like this, with the backslash and then the keyword. We create the subscript 17 18 00:01:27,690 --> 00:01:31,590 XY with the underscore and then the curly braces. 18 19 00:01:31,590 --> 00:01:38,730 And finally we've used the backslash and then "leq" for the less than or equal to symbol. 19 20 00:01:38,820 --> 00:01:44,280 So now that we've added our section heading, let's calculate the correlation between the average number 20 21 00:01:44,280 --> 00:01:47,510 of rooms and the house price.
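For reference, here are the two heading expressions dictated above, written out in one piece. This is a sketch of what the markdown cell holds, assuming the Greek letter is typed as "\rho"; in the notebook each expression sits between a pair of double dollar signs and follows the two pound symbols.

```latex
\rho_{XY} = corr(X, Y)

-1.0 \leq \rho_{XY} \leq +1.0
```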
21 22 00:01:47,580 --> 00:01:52,740 Now before we just punch in the Python code, what would you expect to see? 22 23 00:01:52,830 --> 00:01:56,820 Do you think this correlation should be positive or negative? 23 24 00:01:56,910 --> 00:02:03,200 Do you think that the correlation between the number of rooms and the property price is strong or weak? 24 25 00:02:03,270 --> 00:02:09,930 So looking at the LaTeX notation in our section heading here, pick out a number in your head between 25 26 00:02:09,930 --> 00:02:13,150 -1 and 1. Ready? 26 27 00:02:13,170 --> 00:02:14,450 Did you make your guess? 27 28 00:02:14,610 --> 00:02:14,930 Okay. 28 29 00:02:14,970 --> 00:02:24,300 So to get the answer you simply call the "corr" method on the series object, so "data[' 29 30 00:02:24,300 --> 00:02:35,580 PRICE'].corr(data['RM'])" will retrieve the 30 31 00:02:35,580 --> 00:02:38,760 price column from our data frame, 31 32 00:02:38,760 --> 00:02:46,710 so this is our series object, and then call the corr method on that series object, and then as an argument 32 33 00:02:46,770 --> 00:02:53,220 between the parentheses for this method we supply a single piece of information, namely the series that 33 34 00:02:53,220 --> 00:02:56,610 we want to calculate the correlation against. 34 35 00:02:56,610 --> 00:03:00,690 In this case the average number of rooms. Hitting Shift+Enter, 35 36 00:03:01,140 --> 00:03:08,520 you see that the correlation is indeed positive and around 0.7. And that makes sense, right? 36 37 00:03:08,550 --> 00:03:13,020 The larger the property is, the more rooms it has and the more expensive it should be. 37 38 00:03:13,020 --> 00:03:21,750 Now as a challenge, I want you to do the same thing for the property prices and the pupil teacher ratio. 38 39 00:03:21,750 --> 00:03:27,570 I want you to calculate the correlation between this feature and the property price.
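If you want to try the corr call from above without the housing data loaded, here is a minimal sketch. The data frame below is synthetic stand-in data, not the real dataset: the column names "RM" and "PRICE" match the ones used in the lesson, but the numbers, the seed and the slope are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real housing dataset):
# prices rise with the number of rooms, plus some random noise.
rng = np.random.default_rng(seed=0)
rooms = rng.normal(loc=6.0, scale=0.7, size=500)
price = 9.0 * rooms - 30.0 + rng.normal(loc=0.0, scale=6.0, size=500)
data = pd.DataFrame({'RM': rooms, 'PRICE': price})

# Series.corr takes the other Series as its single required argument.
rm_price_corr = data['PRICE'].corr(data['RM'])
print(round(rm_price_corr, 2))
```

The order of the two series does not matter here: calling corr on the "RM" column with the "PRICE" column as the argument gives the same number.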
But before you 39 40 00:03:27,570 --> 00:03:29,230 type in the Python code, 40 41 00:03:29,310 --> 00:03:32,890 I want you to make another guess at the correlation. 41 42 00:03:33,120 --> 00:03:37,800 So pick a number between -1 and 1 and then write your Python code. 42 43 00:03:37,860 --> 00:03:43,740 Do you think that the house prices will be positively or negatively correlated with this feature? 43 44 00:03:43,770 --> 00:03:44,240 Now, 44 45 00:03:44,250 --> 00:03:48,410 figuring out the reason for this is probably harder than typing in the Python code. 45 46 00:03:48,470 --> 00:03:52,440 I'll give you a few seconds to pause the video and just give it a shot. 46 47 00:03:53,980 --> 00:03:55,410 All right, ready? 47 48 00:03:55,440 --> 00:03:57,300 Here's the solution. 48 49 00:03:57,300 --> 00:04:05,790 I can simply copy what I've written in the cell above, paste it in and then change RM to the name of 49 50 00:04:05,970 --> 00:04:09,820 the feature that I want to calculate my correlation against, 50 51 00:04:09,870 --> 00:04:14,310 and this is "PTRATIO" in all caps. 51 52 00:04:14,310 --> 00:04:20,990 And when I hit Shift+Enter I see that the correlation here is -0.5. 52 53 00:04:21,040 --> 00:04:22,560 But why is that? 53 54 00:04:22,590 --> 00:04:25,740 What does this number actually mean? 54 55 00:04:25,740 --> 00:04:27,690 So the first one was a positive correlation, 55 56 00:04:27,690 --> 00:04:28,610 it was fairly strong, 56 57 00:04:28,650 --> 00:04:30,040 0.7, 57 58 00:04:30,040 --> 00:04:34,200 and this is a fairly strong negative correlation, -0.5. 58 59 00:04:34,650 --> 00:04:37,780 Now let's have a little think about why this might be. 59 60 00:04:38,040 --> 00:04:47,130 If you had one teacher and two students, what would the value of PTRATIO be equal to? The PTRATIO 60 61 00:04:47,310 --> 00:04:55,570 would be equal to 2, because there's 2 pupils and 1 teacher.
Now, 2 students and 1 teacher, 61 62 00:04:55,590 --> 00:04:57,570 it's kind of like having private tuition. 62 63 00:04:58,070 --> 00:05:04,560 So if we take this thinking further, what if you had 15 students per teacher? Then the value inside PTRATIO 63 64 00:05:04,560 --> 00:05:10,620 would be equal to 15, and with 15 students instead of 2, 64 65 00:05:10,620 --> 00:05:14,280 each student is getting less attention from the teacher. 65 66 00:05:14,280 --> 00:05:16,410 So what if the class size grows? 66 67 00:05:16,410 --> 00:05:21,940 What if you have 30 students? Then each student is getting even less attention from the teacher. 67 68 00:05:22,020 --> 00:05:27,540 You can imagine the students in the back of the class giggling, passing notes, playing Clash Royale on 68 69 00:05:27,540 --> 00:05:30,590 their cell phones and throwing paper airplanes, right? 69 70 00:05:30,600 --> 00:05:39,000 In other words, this PTRATIO feature measures the quality of the education, the quality of the schools. 70 71 00:05:39,690 --> 00:05:47,700 The schools with many kids and few teachers are under-resourced and they tend to have a lower quality 71 72 00:05:47,700 --> 00:05:53,840 of education, and this quality of education is reflected in the house price. 72 73 00:05:53,880 --> 00:06:00,990 If the PTRATIO goes up, which is a bad thing because we have so many pupils per class, then the property 73 74 00:06:00,990 --> 00:06:07,850 prices tend to go down, and this is why we see a negative correlation. 74 75 00:06:08,060 --> 00:06:08,370 Okay. 75 76 00:06:08,400 --> 00:06:12,060 So we've picked out two correlations one by one 76 77 00:06:12,270 --> 00:06:15,240 against our target, against our house price. 77 78 00:06:15,270 --> 00:06:21,360 What if we wanted to calculate all the correlations at the same time, because doing this one by one is 78 79 00:06:21,360 --> 00:06:23,610 gonna be pretty painful, right? 79 80 00:06:23,760 --> 00:06:28,230 Well, pandas has got us covered.
In this next cell here, 80 81 00:06:28,230 --> 00:06:33,630 I'm going to take my whole data frame, data, and call the correlation method on it. 81 82 00:06:33,630 --> 00:06:40,360 So "data.corr()" and Shift+Enter will produce this. 82 83 00:06:40,460 --> 00:06:46,640 This is an entire table that doesn't just show the correlations with the house prices; 83 84 00:06:46,770 --> 00:06:49,820 those are in the last column here. 84 85 00:06:49,920 --> 00:06:55,570 It also shows the correlations amongst all the different features. 85 86 00:06:55,620 --> 00:07:03,410 For example, how the average number of rooms is correlated with the amount of crime. In this table, 86 87 00:07:03,420 --> 00:07:05,610 you can still find these two values. 87 88 00:07:05,670 --> 00:07:09,870 So Price versus Room Size and Price vs. PT ratio. 88 89 00:07:09,870 --> 00:07:18,630 If I go down here, the first value, 0.7, is here and that second value, -0.5, 89 90 00:07:19,140 --> 00:07:20,440 is right here. 90 91 00:07:20,970 --> 00:07:28,920 But the other interesting thing about this table is this diagonal here. You see these cells of ones? Along 91 92 00:07:28,920 --> 00:07:33,250 the diagonal all the correlations are equal to one. 92 93 00:07:33,250 --> 00:07:39,220 And that's because the correlation of a variable with itself will always be equal to one. 93 94 00:07:39,240 --> 00:07:42,090 So crime correlated with crime is equal to one. 94 95 00:07:42,090 --> 00:07:46,070 The correlation of number of rooms versus number of rooms is equal to one. 95 96 00:07:46,140 --> 00:07:48,590 The correlation of age against itself is equal to one. 96 97 00:07:48,840 --> 00:07:49,950 And so on. 97 98 00:07:50,040 --> 00:07:51,640 So you can pretty much ignore this diagonal. 98 99 00:07:51,660 --> 00:07:54,340 It's not telling us anything interesting. 99 100 00:07:54,720 --> 00:08:00,560 In fact, you can also pretty much ignore half of this entire table.
100 101 00:08:00,660 --> 00:08:02,810 You see, this table is symmetric. 101 102 00:08:02,850 --> 00:08:03,800 This value here, 102 103 00:08:03,810 --> 00:08:14,550 Crime versus Zoning, is the same as Zoning versus Crime, and Industry versus Zoning is the same as Zoning 103 104 00:08:14,640 --> 00:08:16,500 versus Industry. 104 105 00:08:16,500 --> 00:08:22,280 In other words, the two halves of this table split along the diagonal are the same. 105 106 00:08:22,290 --> 00:08:29,430 Let's come back up here to this correlation method and hit Shift+Tab on our keyboard to bring up the 106 107 00:08:29,490 --> 00:08:34,270 quick documentation. Hitting the little plus sign expands 107 108 00:08:34,290 --> 00:08:39,390 this whole thing and we can see something kind of interesting. 108 109 00:08:39,390 --> 00:08:45,240 Now it turns out that we didn't supply any arguments to this correlation method. 109 110 00:08:45,240 --> 00:08:47,130 There's no value here. 110 111 00:08:47,160 --> 00:08:50,440 We didn't put any inputs between these two parentheses. 111 112 00:08:50,440 --> 00:08:59,570 And what this means is we're kind of going with the correlation method's defaults, and it turns out that there 112 113 00:08:59,570 --> 00:09:07,360 are multiple ways you can calculate a correlation and we're using that default way of doing this calculation. 113 114 00:09:07,640 --> 00:09:13,360 Our specific type of correlation is the Pearson correlation. 114 115 00:09:13,430 --> 00:09:20,180 There's two other ones down here which we could have picked, but Pearson is the default correlation that 115 116 00:09:20,180 --> 00:09:21,940 we're going to be looking at. 116 117 00:09:21,980 --> 00:09:26,000 So I'm going to add this as a comment here in this cell. 117 118 00:09:26,000 --> 00:09:33,470 I'm going to say "Pearson Correlation Coefficients". 118 119 00:09:33,840 --> 00:09:36,140 That way there's no ambiguity here.
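The points above about the table (ones on the diagonal, two mirrored halves, Pearson as the default) can be checked with a short sketch. Again the data frame is synthetic stand-in data with made-up numbers, not the real housing dataset; only the column names are borrowed from the lesson. Spelling out method='pearson' is optional, since it is the default, and the two alternatives pandas accepts as strings are 'kendall' and 'spearman'.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real housing dataset).
rng = np.random.default_rng(seed=1)
n = 300
rm = rng.normal(6.0, 0.7, n)
ptratio = rng.normal(18.0, 2.0, n)
price = 9.0 * rm - 1.5 * ptratio + rng.normal(0.0, 5.0, n)
data = pd.DataFrame({'RM': rm, 'PTRATIO': ptratio, 'PRICE': price})

# Pearson Correlation Coefficients (the default, written out explicitly)
corr_matrix = data.corr(method='pearson')
print(corr_matrix)

# Every variable is perfectly correlated with itself: the diagonal is all ones.
diagonal_is_one = np.allclose(np.diag(corr_matrix.to_numpy()), 1.0)

# corr(X, Y) equals corr(Y, X): the table equals its own transpose.
is_symmetric = np.allclose(corr_matrix.to_numpy(), corr_matrix.to_numpy().T)

print(diagonal_is_one, is_symmetric)
```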
119 120 00:09:36,210 --> 00:09:42,600 Now since we brought up this table here, this kind of brings me to our next point. 120 121 00:09:42,600 --> 00:09:47,540 We spoke about kind of two things that we look for with these correlation calculations. 121 122 00:09:47,580 --> 00:09:53,940 We look for the strength and we look for the direction of the correlations. But there's actually also 122 123 00:09:54,030 --> 00:10:02,160 a third thing that we kind of care about, because in this table we didn't just have the correlations 123 124 00:10:02,160 --> 00:10:06,750 between our features and our target - the house prices. 124 125 00:10:06,750 --> 00:10:11,820 We also had the correlations of the features with each other. 125 126 00:10:11,820 --> 00:10:13,760 So let me ask you a question. 126 127 00:10:14,220 --> 00:10:23,330 If two features were perfectly correlated, would that be a good thing or would that be a bad thing for 127 128 00:10:23,450 --> 00:10:25,930 our regression model? 128 129 00:10:26,000 --> 00:10:29,210 And the answer is: It depends. 129 130 00:10:29,210 --> 00:10:36,260 But it could be a bad thing and something that we probably want to discover early on. 130 131 00:10:36,290 --> 00:10:43,550 Let me explain the problem that high correlations between features could pose for our regression model 131 132 00:10:43,850 --> 00:10:51,830 with an example. Suppose you're a medical researcher and you're analyzing the bone densities of people. 132 133 00:10:51,830 --> 00:10:58,970 Your goal is trying to figure out what makes people have strong healthy bones, and you have all this 133 134 00:10:58,970 --> 00:11:04,250 data on people and you're running your regression, and the kind of data that you're feeding into your 134 135 00:11:04,250 --> 00:11:14,470 regression model includes a person's age, a person's body fat and a person's weight. 135 136 00:11:14,470 --> 00:11:17,640 These are your explanatory variables. 136 137 00:11:17,710 --> 00:11:19,240 Now here's the catch.
137 138 00:11:19,240 --> 00:11:22,810 It turns out no one looks like this. 138 139 00:11:22,900 --> 00:11:30,200 The people who actually do look like this get put up on a stage and cast in movies like Conan the Barbarian. 139 140 00:11:30,400 --> 00:11:37,150 And the thing about these people is that they have a very high weight but very, very low body fat. 140 141 00:11:37,870 --> 00:11:47,530 But in this world there are very, very few tall, heavy but lean bodybuilders. For most of the population 141 142 00:11:47,980 --> 00:11:53,020 body fat and weight are actually really highly correlated. 142 143 00:11:53,050 --> 00:12:00,160 Most people who weigh a lot tend to have high body fat and most people who are very, very skinny tend 143 144 00:12:00,160 --> 00:12:02,620 to weigh very, very little. 144 145 00:12:02,620 --> 00:12:08,230 So given this pattern in the data, what does this imply for the medical research that you're doing? 145 146 00:12:08,230 --> 00:12:18,100 Because body fat and weight move together, you're going to have difficulty telling apart their effects. 146 147 00:12:18,100 --> 00:12:27,280 In other words, body fat and weight are highly correlated and it becomes very difficult to see the individual 147 148 00:12:27,280 --> 00:12:33,220 contributions to bone density of either of these two explanatory variables. 148 149 00:12:33,220 --> 00:12:40,660 One of these features is redundant, and this problem that you've just encountered in your hypothetical 149 150 00:12:40,660 --> 00:12:43,520 medical research actually has a name. 150 151 00:12:43,600 --> 00:12:50,860 It's called multicollinearity. Now multicollinearity is a word that only a statistician 151 152 00:12:50,860 --> 00:12:58,420 could love, but put simply, multicollinearity is when two or more predictors in a regression are 152 153 00:12:58,420 --> 00:13:00,730 highly related to one another.
153 154 00:13:01,180 --> 00:13:06,310 So for this example it was body fat and weight which are very highly related. 154 155 00:13:06,340 --> 00:13:14,230 Neither of these provides unique and independent information to the regression. 155 156 00:13:14,260 --> 00:13:21,370 Now the result of your regression having this problem of multicollinearity is that the estimates 156 157 00:13:21,610 --> 00:13:27,000 start becoming unreliable and your findings stop making sense. 157 158 00:13:27,010 --> 00:13:30,790 Put simply, the model starts getting confused. 158 159 00:13:30,790 --> 00:13:37,270 But the thing to remember is that high correlations don't automatically mean that you have this problem 159 160 00:13:37,270 --> 00:13:38,920 of multicollinearity. 160 161 00:13:38,920 --> 00:13:45,530 I wish it was that easy. However, what it does imply is that we should be looking at these correlations 161 162 00:13:45,530 --> 00:13:52,520 between our features, we should be investigating if our features are highly correlated, and if we find 162 163 00:13:52,760 --> 00:14:00,500 a high correlation between our features, we should investigate why that is. The high correlations can 163 164 00:14:00,500 --> 00:14:07,040 be an early warning sign that there is some sort of problem. High correlations between your features 164 165 00:14:07,550 --> 00:14:13,760 are kind of like a toothache that makes you go to the dentist; the dentist will then investigate further 165 166 00:14:13,760 --> 00:14:16,330 to see what the actual problem is. 166 167 00:14:16,820 --> 00:14:23,960 So having introduced this topic, we're going to keep this in mind for our regression analysis stage. 167 168 00:14:23,960 --> 00:14:28,100 This is when we're going to be revisiting this question of multicollinearity.
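As a rough sketch of that "early warning" check, here is one way you might list the feature pairs whose correlation is high. Everything in it is an assumption made for illustration: the age, body fat and weight numbers are synthetic, the helper name highly_correlated_pairs is hypothetical, and the 0.8 cutoff is an arbitrary choice rather than a rule. A flagged pair is only a prompt to investigate further, not proof of multicollinearity.

```python
import numpy as np
import pandas as pd

# Synthetic explanatory variables (made-up numbers): body fat is built
# to move together with weight, while age is independent of both.
rng = np.random.default_rng(seed=2)
n = 400
age = rng.uniform(20, 80, n)
weight = rng.normal(75, 12, n)
body_fat = 0.4 * weight - 5 + rng.normal(0, 1.5, n)
features = pd.DataFrame({'age': age, 'body_fat': body_fat, 'weight': weight})


def highly_correlated_pairs(frame, threshold=0.8):
    """Hypothetical helper: return (feature_a, feature_b, correlation)
    for every pair whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        # Start at i + 1: skip the diagonal of ones and the mirrored half.
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs


flagged = highly_correlated_pairs(features)
print(flagged)
```

Looping over only the upper half of the table uses the symmetry from earlier: each pair of features shows up once instead of twice.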