0
1
00:00:00,890 --> 00:00:08,180
In the meantime let's do a little bit of work in our Python code to make this table a little bit more
1

2
00:00:08,180 --> 00:00:09,110
clear.
2

3
00:00:09,170 --> 00:00:18,130
Let's visualize our correlations in a way that we could put into a really snazzy report. And to do this,
3

4
00:00:18,190 --> 00:00:24,610
we're going to represent our correlations as a triangle instead of this whole table here.
4

5
00:00:24,640 --> 00:00:29,560
We don't need to show all these duplicate values. Showing all these duplicate values doesn't really add
5

6
00:00:29,560 --> 00:00:30,090
anything
6

7
00:00:30,100 --> 00:00:33,710
and it just makes the whole thing look really, really busy.
7

8
00:00:33,730 --> 00:00:42,160
So my goal is to hide half of this table, and to accomplish this, I will create an array which will help
8

9
00:00:42,160 --> 00:00:47,850
me filter out the values that I don't want to show and the values that I want to show.
9

10
00:00:48,020 --> 00:00:55,090
I'm going to call this filter array "mask" and I'm going to set it equal to an array that's identical
10

11
00:00:55,300 --> 00:01:00,000
in size to this table of correlations,
11

12
00:01:00,010 --> 00:01:02,710
this correlation matrix that we've got up here.
12

13
00:01:03,160 --> 00:01:07,740
The module that I will use to help me do this is called numpy.
13

14
00:01:07,750 --> 00:01:13,740
I'm going to have to add this to my notebook imports at the top in order to use it.
14

15
00:01:13,830 --> 00:01:25,300
So I'm going to say "import numpy as np" hit Shift+Enter on this, import this module, scroll back down here
15

16
00:01:26,110 --> 00:01:35,640
and then I'm going to use the "zeros_like" function from the numpy module, so I'm going to write "np.zeros_
16

17
00:01:35,640 --> 00:01:43,300
like()" and this function will create an array of zeros that is like whatever array is passed
17

18
00:01:43,690 --> 00:01:51,550
into this function as a parameter, in our case that's gonna be the return value from calling the correlation
18

19
00:01:51,550 --> 00:01:54,790
method on our data frame.
19

20
00:01:54,790 --> 00:02:01,330
So let's have a look at what this mask array looks like at the moment. I'm going to hit Shift+Enter here and we can
20

21
00:02:01,330 --> 00:02:07,790
see that we have an array of, well just zeros.
21

22
00:02:07,930 --> 00:02:14,070
Now I need to make another modification. To filter on the values in the top triangle, I
22

23
00:02:14,080 --> 00:02:19,940
first need to know the indices of these cells in my array.
23

24
00:02:20,020 --> 00:02:25,020
Thankfully there is another numpy function that will help me find these.
24

25
00:02:25,020 --> 00:02:34,000
So I'm going to say "triangle_indices", which is going to hold on to all the indices in the top
25

26
00:02:34,000 --> 00:02:45,340
triangle of my array and I'm going to set that equal to "numpy.triu_indices_
26

27
00:02:45,560 --> 00:02:56,550
from()", and then I'm going to pass in my mask. This will retrieve the indices for the top triangle of the
27

28
00:02:56,670 --> 00:02:57,780
array.
28

29
00:02:57,780 --> 00:03:07,090
And now that I've got my indices, I can use my mask array to select just those cells and change their values.
29

30
00:03:07,290 --> 00:03:09,030
So I'm going to say "mask[
30

31
00:03:09,030 --> 00:03:19,280
triangle_indices]" and set those equal to True.
31

32
00:03:19,360 --> 00:03:21,720
Let me show you what our filter looks like now.
32

33
00:03:21,740 --> 00:03:31,230
So I'm going to say "mask", hit Shift+Enter and then you can see here that the top triangle in this array has
33

34
00:03:31,230 --> 00:03:36,680
the value 1 and the bottom triangle has the value 0,
34

35
00:03:36,930 --> 00:03:46,080
and that's because True is mapped to the value 1 and False is mapped to the numerical value of 0. So with
35

36
00:03:46,080 --> 00:03:54,140
this in hand I can now move on to creating this beautiful visualization that I keep talking about.
36

37
00:03:54,240 --> 00:04:00,300
We're gonna use our old friends seaborn and matplotlib to accomplish this.
37

38
00:04:00,460 --> 00:04:04,320
The first thing I'm going to do is I'm going to set the size of our figure,
38

39
00:04:04,370 --> 00:04:05,610
so I'm going to say "plt.
39

40
00:04:05,670 --> 00:04:15,940
figure(figsize = (16,10))".
40

41
00:04:15,940 --> 00:04:23,010
And then I'm going to use our seaborn's heatmap function to generate a heat map of our correlations.
41

42
00:04:23,080 --> 00:04:30,580
We had the seaborne module as "sns", then put a dot after it and write "heatmap()" and then within
42

43
00:04:30,580 --> 00:04:35,220
the parentheses I'm going to provide our correlations.
43

44
00:04:35,230 --> 00:04:42,590
So this was the value returned by calling the corr method on our dataframe.
44

45
00:04:42,610 --> 00:04:50,230
So when I leave it like this, "sns.heatmap(data.corr())" and then I'm going
45

46
00:04:50,230 --> 00:04:52,400
to show our plot with "plt.
46

47
00:04:52,660 --> 00:04:57,860
show()". Let me hit Shift+Enter to see what this looks like.
47

48
00:04:57,890 --> 00:05:00,220
Voila! Look at that.
48

49
00:05:00,290 --> 00:05:02,000
We're almost there.
49

50
00:05:02,120 --> 00:05:08,900
What we can see already is that the different colors show us a strong positive correlations have a dark
50

51
00:05:09,080 --> 00:05:15,140
red color and the negative strong correlations have a dark blue color.
51

52
00:05:15,320 --> 00:05:22,280
Anything that's close to zero is pale or white. So this color scheme is actually conveying quite a lot
52

53
00:05:22,280 --> 00:05:28,860
of information already which is really, really neat on the visualization front.
53

54
00:05:28,910 --> 00:05:36,560
Now if you're having trouble reading what it says down the sides and at the bottom of this chart, we
54

55
00:05:36,560 --> 00:05:49,860
can increase the font size of these labels with "plt.xticks(fontsize = 14)" and
55

56
00:05:49,880 --> 00:05:53,250
I can do the same for the y axis with "plt.
56

57
00:05:53,420 --> 00:06:02,030
yticks(fontsize = 14)". Hitting Shift+Enter,
57

58
00:06:02,420 --> 00:06:04,490
we see it updated like so.
58

59
00:06:04,730 --> 00:06:07,640
So now it's a bit easier to read.
59

60
00:06:07,700 --> 00:06:10,940
Now it's time to add that mask that we created.
60

61
00:06:11,360 --> 00:06:19,340
We want to hide the correlations on this chart that are duplicates. Coming back up here, inside the heat
61

62
00:06:19,340 --> 00:06:20,490
map method,
62

63
00:06:20,530 --> 00:06:28,640
I'm going to add another argument - we're gonna say "mask", so the argument called mask is equal to, well,
63

64
00:06:31,510 --> 00:06:41,110
the mask that we've so painstakingly created in the cell above. So "mask = mask" and this might
64

65
00:06:41,230 --> 00:06:51,580
look very confusing but this mask here refers to our variable in this cell here and this Python code
65

66
00:06:51,580 --> 00:06:58,980
reading "mask = " refers to the name of the key word in this function.
66

67
00:06:58,990 --> 00:07:01,810
Let me Shift+Enter and show you what this looks like.
67

68
00:07:03,140 --> 00:07:08,850
Voila! Now we've effectively hidden half of our chart.
68

69
00:07:09,640 --> 00:07:17,710
So I'm going to modify this even further, I'm gonna add the actual values of our correlations on our heat map
69

70
00:07:18,150 --> 00:07:26,050
because what I want to do is I want to display these numbers here on our chart with the colors,
70

71
00:07:26,560 --> 00:07:36,910
so I'm going to say "annot = True" and hit Shift+Enter. Now you'll see the values of the correlations
71

72
00:07:37,240 --> 00:07:40,570
being displayed in the heat map.
72

73
00:07:40,570 --> 00:07:46,050
Of course, by default these numbers actually get really small and difficult to read.
73

74
00:07:46,050 --> 00:07:46,600
I don't know why,
74

75
00:07:46,600 --> 00:07:55,220
it's just how it is. So we can increase their font size with another keyword argument, so we can say 'annot
75

76
00:07:55,650 --> 00:07:56,730
_kws =
76

77
00:07:56,780 --> 00:08:04,200
{"
77

78
00:08:04,360 --> 00:08:10,530
size": 14}'; 14
78

79
00:08:10,570 --> 00:08:15,580
is gonna be the font size of our annotations.
79

80
00:08:15,580 --> 00:08:22,550
The value of this annot_kws argument is given as a dictionary.
80

81
00:08:22,560 --> 00:08:28,750
It's a Python dictionary that we're looking at here and you can always spot Python dictionaries very
81

82
00:08:28,750 --> 00:08:39,640
very easily with this kind of curly bracket notation and a key value pair or some key value pairs inside.
82

83
00:08:39,790 --> 00:08:48,900
The key here is the string "size" and the value is 14.
83

84
00:08:48,980 --> 00:08:52,530
These are always separated by this colon.
84

85
00:08:52,730 --> 00:08:54,440
Let me hit Shift+Enter and update
85

86
00:08:54,440 --> 00:08:55,270
the heat map now.
86

87
00:08:57,150 --> 00:08:59,060
Voila! Brilliant!
87

88
00:08:59,120 --> 00:09:06,380
Now the only thing I find a little bit strange is why this background here is not all white, because
88

89
00:09:06,380 --> 00:09:09,840
I expected the styling to be a little bit different.
89

90
00:09:09,860 --> 00:09:14,850
I expected this to be a white background instead of this gray here.
90

91
00:09:15,020 --> 00:09:19,790
Now if you're also seeing something a little bit unexpected like this on the styling front, you can
91

92
00:09:19,790 --> 00:09:29,030
always set the style manually of seaborn with "sns.set_style()" and then
92

93
00:09:29,180 --> 00:09:31,260
provide the name of a style.
93

94
00:09:31,280 --> 00:09:39,540
So I'm going to go with white and hit Shift+Enter and line of code should force this background color
94

95
00:09:39,540 --> 00:09:42,580
here to be set to white.
95

96
00:09:42,660 --> 00:09:48,210
But you know, the thing is all in all writing this Python code with the mask and with seaborn and the
96

97
00:09:48,210 --> 00:09:51,040
heat map it's kind of like the easy part actually.
97

98
00:09:51,870 --> 00:09:57,900
The much harder part is making sense of what it is that we're actually looking at here.
98

99
00:09:59,080 --> 00:10:01,880
What is it that we can learn from this correlation matrix?
99

100
00:10:02,740 --> 00:10:10,570
So first off, you and I we said we're gonna be looking at two things - strength and direction.
100

101
00:10:10,840 --> 00:10:19,770
An example of a strong positive correlation would be something like NOX and INDUS.
101

102
00:10:20,170 --> 00:10:28,750
Now this INDUS feature measures the proportion of non retail business acres per town and this NOX
102

103
00:10:28,750 --> 00:10:35,640
feature measures the nitric oxide concentration in parts per 10 million.
103

104
00:10:35,660 --> 00:10:42,070
At least that's me reading it off the documentation on the feature descriptions.
104

105
00:10:42,070 --> 00:10:47,840
These two features have a correlation of 0.76.
105

106
00:10:47,920 --> 00:10:52,160
So the question is: does this make sense? And I think,
106

107
00:10:52,510 --> 00:10:52,750
yeah,
107

108
00:10:52,780 --> 00:10:54,070
yeah it does.
108

109
00:10:54,070 --> 00:10:59,200
I would expect the pollution to be higher in industrial areas.
109

110
00:10:59,200 --> 00:11:07,200
The amount of industry and the amount of pollution should be positively correlated. But looking at this
110

111
00:11:07,200 --> 00:11:08,310
table a little bit more,
111

112
00:11:08,370 --> 00:11:10,170
you know what I found quite interesting?
112

113
00:11:10,890 --> 00:11:21,690
It's the correlation of TAX and the industry variable, higher tax levels are apparently associated with
113

114
00:11:21,690 --> 00:11:23,810
more industrial areas.
114

115
00:11:23,820 --> 00:11:26,250
I actually found this quite surprising.
115

116
00:11:26,250 --> 00:11:34,950
So coming across these kind of relationships is why the correlation matrix is a useful tool for data
116

117
00:11:34,980 --> 00:11:44,420
exploration, but there are of course, as with everything, some limitations. Looking at this heat map here,
117

118
00:11:44,630 --> 00:11:53,330
we can see that the highest correlation of all is the correlation between TAX and RAD - access to radial
118

119
00:11:53,330 --> 00:11:54,480
highways.
119

120
00:11:54,500 --> 00:12:01,190
This is a positive correlation of 0.91 which seems super high.
120

121
00:12:01,260 --> 00:12:01,860
Now,
121

122
00:12:01,880 --> 00:12:08,010
remember how we looked at the documentation of this correlation function?
122

123
00:12:08,120 --> 00:12:16,730
We went up here and we hit Shift+Tab and we learned that the default method for calculating this correlation
123

124
00:12:16,880 --> 00:12:19,640
is the Pearson method.
124

125
00:12:19,640 --> 00:12:27,950
Now it turns out that one of the things that you have to know about this type of correlation is that
125

126
00:12:27,950 --> 00:12:32,160
it makes some assumptions about the kind of data that it's running on.
126

127
00:12:32,390 --> 00:12:39,620
This correlation calculation is actually only valid for continuous variables and this means that it's
127

128
00:12:39,620 --> 00:12:46,790
not valid for, say like a dummy variable, like whether a property is on the Charles River or not because
128

129
00:12:46,790 --> 00:12:51,050
this is not a continuous variable, it's only got two values, right,
129

130
00:12:51,050 --> 00:12:59,390
0 or 1. And looking back up here where we've created our histogram for accessibility to radial highways,
130

131
00:12:59,930 --> 00:13:05,060
we can also see that this is not a continuous variable.
131

132
00:13:05,120 --> 00:13:07,100
This feature was an index,
132

133
00:13:07,100 --> 00:13:13,070
if you remember. And what this means is that our correlation calculation is actually not valid for the
133

134
00:13:13,070 --> 00:13:21,710
RAD feature because RAD is not a continuous variable, which goes to show that it's very important to
134

135
00:13:21,710 --> 00:13:28,250
know how the individual features are measured, what units they're in and what the distribution of the
135

136
00:13:28,250 --> 00:13:35,750
data looks like for these features. Because we can only use statistical tools that are appropriate for
136

137
00:13:35,750 --> 00:13:37,220
the kind of data you're working with.
137

138
00:13:37,950 --> 00:13:40,920
Okay, so let's look at this last row down here.
138

139
00:13:41,000 --> 00:13:49,310
The row that reads price which is our target value. On this row you see the correlation of all the features
139

140
00:13:49,820 --> 00:13:54,610
in our model with the price, with our target.
140

141
00:13:54,890 --> 00:14:00,170
One of the things that I'm interested in looking for here is for which features we don't find a
141

142
00:14:00,170 --> 00:14:07,700
relationship, for which of the features is the correlation close to zero. The lowest correlation of course
142

143
00:14:07,820 --> 00:14:11,600
is with the Charles River dummy variable.
143

144
00:14:11,600 --> 00:14:17,600
But as we've just said, CHAS is a dummy variable with only values between 1 and 0,
144

145
00:14:17,870 --> 00:14:21,340
so the correlation measure is actually not appropriate.
145

146
00:14:21,620 --> 00:14:23,660
But what about the next lowest one?
146

147
00:14:24,050 --> 00:14:35,000
The next lowest one is this one called DIS and DIS is defined as the distance from employment centers.
147

148
00:14:36,410 --> 00:14:37,010
Now,
148

149
00:14:37,280 --> 00:14:37,930
that's interesting.
149

150
00:14:37,930 --> 00:14:47,360
So DIS is not very correlated with price, but DIS is very highly correlated with the industry feature.
150

151
00:14:47,660 --> 00:14:56,250
Looking here we see that there is a correlation of -0.71 between DIS and INDUS.
151

152
00:14:56,270 --> 00:15:04,340
The reason I suspect this is the case is because many industrial areas are probably employment centers,
152

153
00:15:04,700 --> 00:15:14,660
so being far away from an employment center is associated with a low amount of industry and this discovery
153

154
00:15:15,710 --> 00:15:22,030
adds something to my to do list for the regression analysis stage.
154

155
00:15:22,130 --> 00:15:32,720
What we should probably do is we should check if our distance feature adds explanatory value to our
155

156
00:15:32,720 --> 00:15:42,160
regression model. In other words, does having both the industry feature and the distance feature included
156

157
00:15:42,310 --> 00:15:46,850
in the regression make our model better or worse?
157

158
00:15:47,740 --> 00:15:52,830
Can we get away with just having the industry feature for example?
158

159
00:15:53,140 --> 00:16:00,220
Because the thing is, if a feature is not adding any explanatory value, it's often better to exclude it
159

160
00:16:00,490 --> 00:16:07,990
and trying to run the regression without it, because by excluding features you might end up with a simpler
160

161
00:16:07,990 --> 00:16:12,570
model and simplicity is usually a good thing.
161

162
00:16:12,580 --> 00:16:15,670
Okay so where does this leave us?
162

163
00:16:15,730 --> 00:16:21,520
The correlation matrix is no silver data exploration bullet.
163

164
00:16:21,640 --> 00:16:28,990
While it may not answer all our questions it can give us a bit more perspective. And the correlation
164

165
00:16:28,990 --> 00:16:30,800
matrix has its pros and cons.
165

166
00:16:30,910 --> 00:16:33,520
It has strengths and it has limitations,
166

167
00:16:33,520 --> 00:16:40,280
just like every other tool. Regarding the pros, we've learned something about our data,
167

168
00:16:40,330 --> 00:16:46,810
we've learned that the amount of tax and the amount of industry are correlated and we've added something
168

169
00:16:46,810 --> 00:16:53,530
to our to do list for later, namely that we should investigate if we really need the DIS feature in our
169

170
00:16:53,530 --> 00:16:55,630
model or not.
170

171
00:16:55,630 --> 00:17:02,380
Another pro is that we've learned that certain features with high correlations are possible sources
171

172
00:17:02,680 --> 00:17:04,830
of multicollinearity.
172

173
00:17:04,870 --> 00:17:10,270
Now I emphasize the word possible and this is another thing for our to do list.
173

174
00:17:10,420 --> 00:17:17,920
High correlations don't necessarily imply this problem of multicollinearity but we will revisit this
174

175
00:17:17,920 --> 00:17:26,150
issue during the regression analysis stage by running a formal test for this problem.
175

176
00:17:26,160 --> 00:17:32,950
Now we're also learning a few things about some weaknesses of looking at correlations.
176

177
00:17:33,030 --> 00:17:39,420
For example, we've learned that the correlation calculations assume continuous data.
177

178
00:17:39,480 --> 00:17:44,700
This Pearson correlation calculation that we've looked at is not valid
178

179
00:17:44,820 --> 00:17:52,620
if the data is not continuous as it is the case with our accessibility index or our Charles River dummy
179

180
00:17:52,620 --> 00:18:00,690
variable. And a second limitation that everybody likes to harp on about is that correlation does not
180

181
00:18:00,840 --> 00:18:02,790
imply causation.
181

182
00:18:03,000 --> 00:18:09,210
Just because two things move together doesn't mean that one thing causes another.
182

183
00:18:09,210 --> 00:18:14,300
In other words, everybody who drank water in 1850 is now dead,
183

184
00:18:14,520 --> 00:18:17,940
but this doesn't mean that drinking water will kill you.
184

185
00:18:17,970 --> 00:18:24,330
In fact if you look at enough data and you look hard enough you will find that there are all sorts of
185

186
00:18:24,330 --> 00:18:26,770
weird correlations out there.
186

187
00:18:26,970 --> 00:18:34,740
Just google "funny correlations" or "spurious correlations" and you'll find a bunch of great examples of
187

188
00:18:34,740 --> 00:18:40,270
completely unrelated things that move together purely by chance.
188

189
00:18:40,420 --> 00:18:46,680
And if you do this, you'll probably come across Tyler Vigen's Web site who uses census data and data
189

190
00:18:46,710 --> 00:18:54,360
from the U.S. Department of Agriculture to show that divorce rates in Maine and margarine consumption
190

191
00:18:54,690 --> 00:18:59,490
are in fact highly correlated. So the earlier chart of mine showing a zero correlation between these
191

192
00:18:59,490 --> 00:19:05,890
two things was in fact a lie, Tyler's chart shows us how it actually works. Now,
192

193
00:19:05,940 --> 00:19:14,190
another limitation of correlations is that they only check for linear relationships and it turns out,
193

194
00:19:14,400 --> 00:19:20,460
just because there's a low Pearson correlation coefficient, does not mean that there is no relationship
194

195
00:19:20,550 --> 00:19:22,940
between two variables.
195

196
00:19:22,950 --> 00:19:26,960
Let me show you some examples so you can actually see what I mean.
196

197
00:19:27,000 --> 00:19:35,070
Here's some fictional data on a chart showing the x and y values, X and Y have a correlation of
197

198
00:19:35,070 --> 00:19:37,950
0.816.
198

199
00:19:37,980 --> 00:19:44,980
And let me show you a different chart. This is some more fictional data and the correlation between X2
199

200
00:19:45,030 --> 00:19:52,030
and Y2 is in fact also 0.816.
200

201
00:19:52,490 --> 00:19:54,170
And on this third chart here,
201

202
00:19:54,450 --> 00:20:02,880
you guessed it, the correlation is also 0.816. And the same goes for this 4th chart here,
202

203
00:20:03,690 --> 00:20:09,800
X4 and Y4 also have a correlation of 0.816.
203

204
00:20:09,810 --> 00:20:12,860
In fact these photographs are very famous.
204

205
00:20:12,960 --> 00:20:18,960
They're called Anscombe's Quartet and they're named after an English statistician who came up with
205

206
00:20:18,960 --> 00:20:19,830
them.
206

207
00:20:20,010 --> 00:20:26,670
These four graphs actually have very, very similar descriptive statistics and a very, very similar regression
207

208
00:20:26,670 --> 00:20:26,940
line.
208

209
00:20:27,960 --> 00:20:32,100
But of course, they're showing us completely different relationships.
209

210
00:20:32,220 --> 00:20:41,130
They're showing us that outliers and non-linear relationships often only become apparent after visualizing
210

211
00:20:41,280 --> 00:20:42,470
the data.
211

212
00:20:42,690 --> 00:20:44,440
And this is what this implies.
212

213
00:20:44,580 --> 00:20:51,750
It means that it's important to look at these correlations and these descriptive statistics in conjunction
213

214
00:20:51,960 --> 00:20:59,020
with some charts. And with this in mind we're gonna be complementing our analysis of the correlation
214

215
00:20:59,410 --> 00:21:02,040
with some more graphical analysis.
215

216
00:21:02,080 --> 00:21:08,620
That way we can discover if there's any hidden linear relationships or outliers in our data.
216

217
00:21:08,620 --> 00:21:16,210
As such we're gonna be visiting our old friend again, the scatter plot. But before we move on, I can't
217

218
00:21:16,210 --> 00:21:20,900
resist showing you this infamous comic strip from XKCD.
218

219
00:21:20,950 --> 00:21:24,480
If this is the kind of humor that appeals to you more than you'd care to admit,
219

220
00:21:24,520 --> 00:21:31,960
then I highly recommend subscribing to XKCD's RSS feed and get your dose of geeky web comics on a
220

221
00:21:31,960 --> 00:21:33,590
regular basis.
221

222
00:21:33,670 --> 00:21:35,360
I'll see you in the next lessons.
222

223
00:21:35,380 --> 00:21:35,950
Take care.