0
1
00:00:00,390 --> 00:00:08,190
First let's add a markdown cell and some LaTeX notation, so I'm going to change this to "Markdown", put two pound
1

2
00:00:08,190 --> 00:00:09,100
symbols there,
2

3
00:00:09,140 --> 00:00:10,170
write
3

4
00:00:10,470 --> 00:00:12,040
"Correlation".
4

5
00:00:12,050 --> 00:00:16,500
The LaTeX notation we're gonna add its gonna look like this good. Two dollar signs and then I'm going
5

6
00:00:16,500 --> 00:00:23,640
to have the Greek symbol for rho and then we're going have an "_{
6

7
00:00:23,640 --> 00:00:24,590
XY}",
7

8
00:00:24,630 --> 00:00:34,710
and then " = corr(X, Y)".
8

9
00:00:34,860 --> 00:00:37,030
And then to two dollar signs.
9

10
00:00:37,080 --> 00:00:39,010
Let's see what this looks like.
10

11
00:00:39,030 --> 00:00:42,190
I think this makes a nice section heading. Just for completion,
11

12
00:00:42,210 --> 00:00:50,220
let's add two more pound symbols, two dollar signs -1.0 and then let's have the following
12

13
00:00:50,220 --> 00:01:00,210
tag "\leq" and then we'll have "row _{
13

14
00:01:00,330 --> 00:01:03,510
XY}" and then another leq tag,
14

15
00:01:03,550 --> 00:01:11,440
so "\leq +1.0" and then two dollar signs at the end for the closing tag.
15

16
00:01:12,900 --> 00:01:20,970
Now we've got the minimum and maximum values of the correlation in our heading as well, and in the latest
16

17
00:01:20,970 --> 00:01:27,510
notation the Greek symbol row is given like this with the slash and then the keyword we create the subscript
17

18
00:01:27,690 --> 00:01:31,590
XY with the underscore and then the curly braces.
18

19
00:01:31,590 --> 00:01:38,730
And finally we've used the slash and then "leq" for the less than or equal to symbol.
19

20
00:01:38,820 --> 00:01:44,280
So now that we've added our section heading let's calculate the correlation between the average number
20

21
00:01:44,280 --> 00:01:47,510
of rooms and the house price.
21

22
00:01:47,580 --> 00:01:52,740
Now before we just punch in the Python code, what would you expect to see?
22

23
00:01:52,830 --> 00:01:56,820
Do you think this correlation should be positive or negative?
23

24
00:01:56,910 --> 00:02:03,200
Do you think that the correlation between the number of rooms and the property price is strong or weak?
24

25
00:02:03,270 --> 00:02:09,930
So looking at the LaTeX notation in our section heading here, pick out a number in your head between
25

26
00:02:09,930 --> 00:02:13,150
-1 and 1? Ready?
26

27
00:02:13,170 --> 00:02:14,450
Did you make your guess?
27

28
00:02:14,610 --> 00:02:14,930
Okay.
28

29
00:02:14,970 --> 00:02:24,300
So to get the answer you simply call the "corr" method on the series object, so "data['
29

30
00:02:24,300 --> 00:02:35,580
PRICE'].corr(data['RM'])" will retrieve the
30

31
00:02:35,580 --> 00:02:38,760
price column from our data frame,
31

32
00:02:38,760 --> 00:02:46,710
so this is our series object and then call the corr method on that series object and then as an argument
32

33
00:02:46,770 --> 00:02:53,220
between the parentheses for this method we supply a single piece of information, namely the series that
33

34
00:02:53,220 --> 00:02:56,610
we want to calculate the correlation against.
34

35
00:02:56,610 --> 00:03:00,690
In this case the average room size. Hitting Shift+Enter,
35

36
00:03:01,140 --> 00:03:08,520
you see that the correlation is indeed positive and around 0.7. And that makes sense, right?
36

37
00:03:08,550 --> 00:03:13,020
The larger the property is the more rooms it has, the more expensive it should be.
37

38
00:03:13,020 --> 00:03:21,750
Now as a challenge, I want to do the same thing for the property prices and the pupil teacher ratio.
38

39
00:03:21,750 --> 00:03:27,570
I want you to calculate the correlation between this feature and the property price. But before you
39

40
00:03:27,570 --> 00:03:29,230
type in the Python code,
40

41
00:03:29,310 --> 00:03:32,890
I want you to make another guess at the correlation.
41

42
00:03:33,120 --> 00:03:37,800
So pick a number between -1 and 1 and then write your Python code.
42

43
00:03:37,860 --> 00:03:43,740
Do you think that the house prices will be positively or negatively correlated with this feature?
43

44
00:03:43,770 --> 00:03:44,240
Now,
44

45
00:03:44,250 --> 00:03:48,410
figuring out the reason for this is probably harder than typing in the Python code.
45

46
00:03:48,470 --> 00:03:52,440
I'll give you a few seconds to pause the video and just give it a shot.
46

47
00:03:53,980 --> 00:03:55,410
All right, ready?
47

48
00:03:55,440 --> 00:03:57,300
Here's the solution.
48

49
00:03:57,300 --> 00:04:05,790
I can simply copy what I've written in the cell above, paste it in and then change RM to the name of
49

50
00:04:05,970 --> 00:04:09,820
the feature that I want to calculate my correlation against
50

51
00:04:09,870 --> 00:04:14,310
and this is "PTRATIO" in all caps.
51

52
00:04:14,310 --> 00:04:20,990
And when I hit Shift+Enter I see that the correlation here is -0.5.
52

53
00:04:21,040 --> 00:04:22,560
But why is that?
53

54
00:04:22,590 --> 00:04:25,740
What does this number actually mean?
54

55
00:04:25,740 --> 00:04:27,690
So the first one was a positive correlation,
55

56
00:04:27,690 --> 00:04:28,610
it was fairly strong,
56

57
00:04:28,650 --> 00:04:30,040
0.7,
57

58
00:04:30,040 --> 00:04:34,200
and this is a fairly strong negative correlation - 0.5.
58

59
00:04:34,650 --> 00:04:37,780
Now let's have a little think about why this might be.
59

60
00:04:38,040 --> 00:04:47,130
If you had one teacher and two students, what would the value of PTRATIO be equal to? The PTRATIO
60

61
00:04:47,310 --> 00:04:55,570
would be equal to 2, because there's 2 pupils and 1 teacher. Now, 2 students and 1 teacher,
61

62
00:04:55,590 --> 00:04:57,570
it's kind of like having private tuition.
62

63
00:04:58,070 --> 00:05:04,560
So if we take this thinking further, what if you had 15 students per teacher? Then the value inside PTRATIO
63

64
00:05:04,560 --> 00:05:10,620
would be equal to 15, and with 15 students instead of 2,
64

65
00:05:10,620 --> 00:05:14,280
each student is getting less attention from the teacher.
65

66
00:05:14,280 --> 00:05:16,410
So what if the class size grows?
66

67
00:05:16,410 --> 00:05:21,940
What if you have 30 students? Then each student is getting even less attention from the teacher.
67

68
00:05:22,020 --> 00:05:27,540
You can imagine the students in the back of the class giggling, passing notes, playing clash Royale on
68

69
00:05:27,540 --> 00:05:30,590
their cell phones and throwing paper airplanes, right?
69

70
00:05:30,600 --> 00:05:39,000
In other words, this PTRATio feature measures the quality of the education, the quality of the schools.
70

71
00:05:39,690 --> 00:05:47,700
The schools with many kids and few teachers are under-resourced and they tend to have a lower quality
71

72
00:05:47,700 --> 00:05:53,840
of education, and this quality of education is reflected in the house price.
72

73
00:05:53,880 --> 00:06:00,990
If the PTRATIO goes up, which is a bad thing because we have so many pupils per class, then the property
73

74
00:06:00,990 --> 00:06:07,850
prices tend to go down and this is why we see a negative correlation.
74

75
00:06:08,060 --> 00:06:08,370
Okay.
75

76
00:06:08,400 --> 00:06:12,060
So we've picked out two correlations one by one
76

77
00:06:12,270 --> 00:06:15,240
against our target, against our house price.
77

78
00:06:15,270 --> 00:06:21,360
What if we wanted to calculate all the correlations at the same time, because doing this one by one is
78

79
00:06:21,360 --> 00:06:23,610
gonna be pretty painful, right?
79

80
00:06:23,760 --> 00:06:28,230
Well pandas has got us covered. In this next cell here,
80

81
00:06:28,230 --> 00:06:33,630
I'm going to take my whole data frame, data, and call the correlation method on it.
81

82
00:06:33,630 --> 00:06:40,360
So "data.corr()" and Shift+Enter will produce this.
82

83
00:06:40,460 --> 00:06:46,640
This is an entire table that doesn't just show the correlations with the house prices,
83

84
00:06:46,770 --> 00:06:49,820
those are in the last column here.
84

85
00:06:49,920 --> 00:06:55,570
It also shows the correlations amongst all the different features.
85

86
00:06:55,620 --> 00:07:03,410
For example, how the average number of rooms is correlated with the amount of crime. In this table,
86

87
00:07:03,420 --> 00:07:05,610
you can still find these two values.
87

88
00:07:05,670 --> 00:07:09,870
So Price versus Room Size and Price vs. PT ratio.
88

89
00:07:09,870 --> 00:07:18,630
If I go down here the first value, 0.7 is here and that second value -0.5
89

90
00:07:19,140 --> 00:07:20,440
is right here.
90

91
00:07:20,970 --> 00:07:28,920
But the other interesting thing about this table is this diagonal here. You see these cells of ones? along
91

92
00:07:28,920 --> 00:07:33,250
the diagonal all the correlations are equal to one.
92

93
00:07:33,250 --> 00:07:39,220
And that's because the correlation of a variable with itself will always be equal to one.
93

94
00:07:39,240 --> 00:07:42,090
So crime correlated with crime is equal to one.
94

95
00:07:42,090 --> 00:07:46,070
The correlation of number of rooms versus number of rooms is equal to one.
95

96
00:07:46,140 --> 00:07:48,590
The correlation of age against itself is equal to one.
96

97
00:07:48,840 --> 00:07:49,950
And so on.
97

98
00:07:50,040 --> 00:07:51,640
So you can pretty much ignore this diagonal.
98

99
00:07:51,660 --> 00:07:54,340
It's not telling us anything interesting.
99

100
00:07:54,720 --> 00:08:00,560
In fact, you pretty much can also ignore half of this entire table.
100

101
00:08:00,660 --> 00:08:02,810
You see, this table is symmetric.
101

102
00:08:02,850 --> 00:08:03,800
This value here,
102

103
00:08:03,810 --> 00:08:14,550
Crime versus Zoning is the same as Zoning versus Crime and Industry versus Zoning is the same as Zoning
103

104
00:08:14,640 --> 00:08:16,500
versus Industry.
104

105
00:08:16,500 --> 00:08:22,280
In other words, the two halves of this table split along the diagonal are the same.
105

106
00:08:22,290 --> 00:08:29,430
Let's come back up here to this correlation method and hit Shift+Tab on our keyboard to bring up the
106

107
00:08:29,490 --> 00:08:34,270
quick documentation. Hitting the little plus sign expands
107

108
00:08:34,290 --> 00:08:39,390
this whole thing and we can see something kind of interesting.
108

109
00:08:39,390 --> 00:08:45,240
Now it turns out that we didn't supply any arguments to this correlation method.
109

110
00:08:45,240 --> 00:08:47,130
There's no value here.
110

111
00:08:47,160 --> 00:08:50,440
We didn't put any inputs between these two parentheses.
111

112
00:08:50,440 --> 00:08:59,570
And what this means is kind of going with the correlation methods defaults and it turns out that there
112

113
00:08:59,570 --> 00:09:07,360
are multiple ways you can calculate a correlation and we're using that default way of doing this calculation.
113

114
00:09:07,640 --> 00:09:13,360
Our specific type of correlation is the Pearson correlation.
114

115
00:09:13,430 --> 00:09:20,180
There's two other ones down here which we could have picked, but Pearson is the default correlation that
115

116
00:09:20,180 --> 00:09:21,940
we're going to be looking at.
116

117
00:09:21,980 --> 00:09:26,000
So I'm going to add this as a comment here in this cell.
117

118
00:09:26,000 --> 00:09:33,470
I'm going to say "Pearson Correlation Coefficients".
118

119
00:09:33,840 --> 00:09:36,140
That way there's no ambiguity here.
119

120
00:09:36,210 --> 00:09:42,600
Now since we brought up this table here, this kind of brings me to our next point.
120

121
00:09:42,600 --> 00:09:47,540
We spoke about kind of two things that we look for with these correlation calculations.
121

122
00:09:47,580 --> 00:09:53,940
We look for the strength and we look for the direction of the correlations. But there's actually also
122

123
00:09:54,030 --> 00:10:02,160
a third thing that we kind of care about, because in this table we didn't just have the correlations
123

124
00:10:02,160 --> 00:10:06,750
between our features and our target - the house prices.
124

125
00:10:06,750 --> 00:10:11,820
We also had the correlations of the features with each other.
125

126
00:10:11,820 --> 00:10:13,760
So let me ask you a question.
126

127
00:10:14,220 --> 00:10:23,330
If two features were perfectly correlated would that be a good thing or would that be a bad thing for
127

128
00:10:23,450 --> 00:10:25,930
our regression model?
128

129
00:10:26,000 --> 00:10:29,210
And the answer is: It depends.
129

130
00:10:29,210 --> 00:10:36,260
But it could be a bad thing and something that we probably want to discover early on.
130

131
00:10:36,290 --> 00:10:43,550
Let me explain the problem that high correlations between features could pose for our regression model
131

132
00:10:43,850 --> 00:10:51,830
with an example. Suppose you're medical researcher and you're analyzing the bone densities of people.
132

133
00:10:51,830 --> 00:10:58,970
Your goal is trying to figure out what makes people have strong healthy bones and you have all this
133

134
00:10:58,970 --> 00:11:04,250
data on people and you're running your regression and the kind of data that you're feeding into your
134

135
00:11:04,250 --> 00:11:14,470
regression model includes a person's age, a person's body fat and a person's weight.
135

136
00:11:14,470 --> 00:11:17,640
These are your explanatory variables.
136

137
00:11:17,710 --> 00:11:19,240
Now here's the catch.
137

138
00:11:19,240 --> 00:11:22,810
It turns out no one looks like this.
138

139
00:11:22,900 --> 00:11:30,200
The people who actually do look like this get put up on a stage and cast in movies like Conan the Barbarian.
139

140
00:11:30,400 --> 00:11:37,150
And the thing about these people is that they have a very high weight but very, very low body fat.
140

141
00:11:37,870 --> 00:11:47,530
But in this world they are very, very few tall heavy but lean bodybuilders. For most of the population
141

142
00:11:47,980 --> 00:11:53,020
body fat and weight are actually really highly correlated.
142

143
00:11:53,050 --> 00:12:00,160
Most people who weigh a lot tend to have high body fat and most people who are very, very skinny tend
143

144
00:12:00,160 --> 00:12:02,620
to weigh very, very little.
144

145
00:12:02,620 --> 00:12:08,230
So given this pattern in the data, what does this imply for the medical research that you're doing?
145

146
00:12:08,230 --> 00:12:18,100
Because body fat and weight move together, you're going to have difficulty telling apart their effects.
146

147
00:12:18,100 --> 00:12:27,280
In other words, body fat and weight are highly correlated and it becomes very difficult to see the individual
147

148
00:12:27,280 --> 00:12:33,220
contributions to bone density of either of these two explanatory variables.
148

149
00:12:33,220 --> 00:12:40,660
One of these features is redundant and this problem that you've just encountered in your hypothetical
149

150
00:12:40,660 --> 00:12:43,520
medical research actually has a name.
150

151
00:12:43,600 --> 00:12:50,860
It's called multicollinearity. Now multicollinearity is a word that only a statistician
151

152
00:12:50,860 --> 00:12:58,420
could love, but put simply, multicollinearity is when two or more predictors in a regression are
152

153
00:12:58,420 --> 00:13:00,730
highly related to one another.
153

154
00:13:01,180 --> 00:13:06,310
So for this example it was body fat and weight which are very highly related.
154

155
00:13:06,340 --> 00:13:14,230
Each of these do not provide unique and independent information to the regression.
155

156
00:13:14,260 --> 00:13:21,370
Now the result of your regression having this problem of multicollinearity means that the estimates
156

157
00:13:21,610 --> 00:13:27,000
start becoming unreliable and your findings stop making sense.
157

158
00:13:27,010 --> 00:13:30,790
Put simply, the model starts getting confused.
158

159
00:13:30,790 --> 00:13:37,270
But the thing to remember is that high correlations don't automatically mean that you have this problem
159

160
00:13:37,270 --> 00:13:38,920
of multicollinearity.
160

161
00:13:38,920 --> 00:13:45,530
I wish it was that easy. However, what it does imply is that we should be looking at these correlations
161

162
00:13:45,530 --> 00:13:52,520
between our features, we should be investigating if our features are highly correlated and if we find
162

163
00:13:52,760 --> 00:14:00,500
a high correlation between our features, we should investigate why that is. The high correlations can
163

164
00:14:00,500 --> 00:14:07,040
be an early warning sign that there is some sort of problem. High correlations between your features
164

165
00:14:07,550 --> 00:14:13,760
are kind of like a toothache that make you go to the dentist, the dentist is well then investigate further
165

166
00:14:13,760 --> 00:14:16,330
to see what the actual problem is.
166

167
00:14:16,820 --> 00:14:23,960
So having introduced this topic, we're going to keep this in mind for our regression analysis stage.
167

168
00:14:23,960 --> 00:14:28,100
This is when we're going to be revisiting this question of multicollinearity.