In the previous lesson we talked about how the Pearson correlation should only really be calculated on continuous data, and also about how outliers can distort the picture and mislead us. The example I used to illustrate this point was Anscombe's quartet, whose four charts show how very different patterns in the data can give you very similar descriptive statistics if you're not careful. What I want to show you now is that this isn't just a hypothetical example. We can see something similar happening in our housing dataset when we don't apply the right tools to the right kind of data and aren't careful with our interpretations.

Remember what a correlation of 0.9 or 1 is meant to look like? You'd expect to see a chart where the data points almost form a straight line. Looking at our correlation matrix, we've actually got a very high number here: 0.91 as the correlation between RAD, our access to radial highways, and our TAX feature. So given our expectation of what a high correlation between two variables should look like, let's visualize this relationship and see what we get.

I'm going to copy this cell with our jointplot from seaborn, paste it in here, move the cell down, and change the references from distance and nitrous oxide to TAX and RAD. For the color I'll go with, I don't know, maybe dark red, and then hit Shift+Enter. Voila! This is what we get. Now, this picture is probably not what we would have expected had we only looked at the correlation. But we've said before that the very high correlation between these two variables is a result of the data not being continuous, and of the big outliers in the top right corner that are driving the correlation.

One very interesting exercise is to plot the linear regression onto this chart and see how it's affected by the data. So let's see what happens if we run a linear regression between these two features. Previously, we've run our regressions separately using the scikit-learn module and plotted our best fit lines on a matplotlib chart. Let me show you a shortcut for doing this very quickly with seaborn. Coming down here, in this cell I'm going to make use of a function called "lmplot".
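For reference, here's roughly what these two cells look like. This is a sketch rather than the exact notebook code: it assumes the DataFrame is named data and that seaborn and matplotlib are already imported in earlier cells, and note that newer seaborn releases renamed the size parameter to height.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot with marginal distributions: TAX vs RAD
sns.jointplot(x=data['TAX'], y=data['RAD'], color='darkred', size=7)
plt.show()

# The same relationship with a fitted regression line overlaid
sns.lmplot(x='TAX', y='RAD', data=data, size=7)
plt.show()
```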
So: "sns.lmplot()". The x is going to be our TAX, the y is going to be RAD, and for the data we're going to feed in, the data keyword argument is going to be our DataFrame. So it's "data=data, size=7". What this line of code does is look inside our DataFrame, which is this bit of Python code here, for a column called RAD, and also for a column called TAX. The thing that might be a little confusing is this "data=data" Python code: the first part, "data=", is the keyword argument, and the "data" after the equals sign is the name of our Python DataFrame.

I'm going to add another line, "plt.show()", and hit Shift+Enter. Voila! Just like that, we've plotted a linear regression between our two features on this chart. Seaborn has done a lot of work for us here, and I hope you're seeing just how useful this little Python module really is.

Now let's interpret and make sense of what we're looking at on this chart. The data points in the top right corner are affecting the slope of our regression line significantly. It's as if they're pulling the regression line up to make it steeper. Looking at this, you can see why a linear regression between RAD and TAX might not be such a good idea: the regression line is meant to represent the data, right? Here we're forcing a linear regression model onto a dataset that isn't really suited for it. Our computers will happily do the calculation, but we end up with a model that isn't very useful for capturing the true relationship between accessibility to highways and tax.

But luckily we're in the business of estimating house prices, right? So what we should actually be looking at is how our features relate to our target, namely our Boston property values. Coming down here, I want to give you another challenge. Pause the video and create a few more scatter plots. The first one should be between the number of rooms, RM, and our target, which is the PRICE series in our DataFrame. You can use either matplotlib or seaborn to accomplish this. I'll give you a few seconds to pause the video.

Ready? Here's the solution.
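Here's a sketch of one possible solution. The column names, the skyblue color, the chart title, and the axis labels come straight from the lesson; the figure size, point transparency, marker size, and font sizes are my guesses at the cell being copied, and the same size-versus-height caveat applies to lmplot.

```python
# Correlation between the number of rooms and the price
rm_tgt_corr = round(data['RM'].corr(data['PRICE']), 3)

# Matplotlib version
plt.figure(figsize=(9, 6))
plt.scatter(x=data['RM'], y=data['PRICE'], color='skyblue', alpha=0.6, s=80)
plt.title(f'RM vs PRICE (Correlation {rm_tgt_corr})', fontsize=14)
plt.xlabel('RM - median number of rooms', fontsize=14)
plt.ylabel('PRICE - property price in 000s', fontsize=14)
plt.show()

# Seaborn version, with the regression line pulled up as well
sns.lmplot(x='RM', y='PRICE', data=data, size=7)
plt.show()
```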
Now I'm going to be very, very lazy: I'm going to come up here, copy the code I've previously written, and paste it down here. Then I want to change all the references from NOX and DIS to RM and our target. So it's "data['RM']" and "data['PRICE']" for the correlation, and of course I'm changing the inputs to the scatter plot as well. For the color, I'm going with "skyblue" to differentiate a little bit, and the title is going to read "RM vs PRICE". For the correlation, of course, I'm going to use "rm_tgt_corr". The x label is going to be RM, the median number of rooms, and the y label is going to be "PRICE - property price in 000s". Let's see what we get. Voila!

Now let me do the same thing with seaborn, and you know what, I'm going to use "lmplot" to pull up the regression line as well. So: "sns.lmplot(x='RM', y='PRICE', data=data, size=7)" followed by "plt.show()".

Okay, so how do we interpret what we're seeing here? Well, the first thing is that our positive correlation of approximately 0.7 ties out really nicely with the relationship we're seeing on the scatter plot. This means that room size alone has a very clear relationship with our house price, and I think that's kind of neat to see.

One of the other things I find quite interesting about this chart is that there almost seems to be a ceiling on the property prices at the 50,000 dollar mark, because a lot of data points sit on this line here. I find that quite curious. It could be coincidence, but it might also have something to do with how the data was collected in the 1970s. For the purposes of this tutorial, though, I'm not going to dig into those kinds of details; there's so much other stuff I want to focus on.

One thing you might already be thinking at this point is: we've got 13 variables, and there are just too many combinations to graph individually. We can't possibly copy and paste our cells to graph all these scatter plots one by one, right? And yeah, we're not going to do that, because remember, we have to think like lazy programmers. And what would lazy programmers do?
They would graph every single combination at once, and this is where seaborn comes to the rescue once again, because seaborn has a function called "pairplot" that will do just that. So: "sns.pairplot()", and as an argument it needs our entire DataFrame. "sns.pairplot(data)" followed by "plt.show()" will create scatter plots between all our features and our target.

But I hope you didn't hit Shift+Enter just now and run this thing already, because there's something I want to show you before I execute this cell. I want to show you a little bit of Jupyter-notebook-specific code, a little bit of Jupyter notebook magic if you will, because the thing is, pairplot has to do a lot of calculations and takes some time to run. What I want to show you is how you can benchmark, or time, specific bits of code. People call this microbenchmarking, because we're benchmarking a few lines of code, an individual cell. There's a magic command in Jupyter notebook for this, and that command is "%%time".

Now I'm going to hit Shift+Enter to run the cell. Ready? Here we go. My computer has produced all these plots. Check it out. And because I added that Jupyter notebook magic, I've got some additional information printed out here as well, showing me how long my code took to run. When you're running this at home, I don't know how long it'll take you: maybe 19 seconds like it took me, maybe 40 seconds, maybe 5 seconds, depending on the kind of machine you've got.

But I think this is a really neat Jupyter notebook feature, because suppose you want to choose between two different algorithms or two different ways of running your code. If you've got something that takes a while and you're looking to optimize it, or you want to run a horse race between two algorithms, this timing capability can come in quite handy. We've done this partly out of curiosity, and partly so you can see that your computer didn't freeze up, because this bit of code took some time to run for me as well. But that said, it has some other practical applications too.
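Put together, the timed cell looks like this. One thing to keep in mind: the "%%time" magic has to be the very first line of the cell.

```python
%%time
# Scatter plots for every pair of columns; pairplot also puts
# histograms on the diagonal where a column meets itself
sns.pairplot(data)
plt.show()
```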
Another thing to know about Jupyter notebook, on this note, is that up here in the right-hand corner there's a little dot, which for me is currently not filled in. When I hover over it, it says "Kernel idle". This means that Jupyter notebook is currently not doing anything. If I were to execute this cell again, you'd see that during the execution the dot is filled in and it says "Kernel busy". This gives you an indication that the notebook is doing a lot of work, and that even though it might look like nothing is happening, your computer is actually working. That's another handy little thing to know about Jupyter notebook.

Now let's take a look at our output. What you see here is a cropped, zoomed-out version of the image at its actual size; the image is in fact a lot larger than what's being displayed. If you want to see the full-size image, which I encourage you to try, you can right-click on it, click "Save As", stick it in your Downloads folder as pairplot or what have you, and then open it from your Downloads folder in your graphics program of choice. I've got a Mac and I'm using Preview here, so I can go to "Actual Size", and here you see a zoomed-in, 100 percent version of this image.

The other thing you can do is open the image in a new tab. If I right-click on the image (I'm using Firefox here, so this will be a little different in Chrome, Safari or Internet Explorer), I can open it directly, and then I don't have to download the image and dig it out of my Downloads folder. I think that's quite handy. Now I've got two tabs: one with the full-sized version of the image and one with my notebook. And what we can see here is that we've got scatter plots for every single column in our DataFrame.
So we've got Price vs Crime, Price vs Zone, Price vs Industry, Price vs CHAS, Price vs NOX, Price vs RM, and then we've also got LSTAT vs Crime, LSTAT vs Zone, LSTAT vs Industry. But the really neat thing is that on the diagonal down the middle, where a column would be plotted against itself (say, Crime vs Crime), we get a histogram instead. So this is the histogram for the ZN feature, this is the histogram for INDUS, this is the histogram for CHAS, and so on. And down here we've got the histogram for Price. So the diagonal shows us the histograms, and everything else shows us a scatter plot.

Scrolling down to the last row, namely the Price row, we can check for which features there is a very clear relationship between property prices and the feature. Here we have Crime and Price. Going a bit to the right, we've got Price and Industry, and we can definitely make out some sort of relationship here, right? Going a bit further, we've got Price and Room size; this is the scatter plot we created earlier, and there was definitely a relationship there too. Going all the way to the right, we see this one here: Price and LSTAT. This scatter plot is showing us a very clear relationship between these two things. It almost reminds me of Pollution and Distance in terms of its shape, right?

But what is LSTAT? We haven't really talked about this feature before, so let's take a look at the description. Coming back to the top of the Jupyter notebook, where we printed out the description, we see that for LSTAT it says "% lower status of the population". Now, this wording might not be the politically correct phrasing people would use these days, but in a nutshell this feature measures the socioeconomic background of the people in the neighborhood.
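If you'd rather not scroll, here's one way to pull that description back up. I'm assuming here, as earlier in the course, that the data came from scikit-learn's built-in Boston dataset; note that load_boston existed at the time of recording but has since been removed from recent scikit-learn releases.

```python
from sklearn.datasets import load_boston  # assumption: how the data was loaded earlier

# Print the dataset documentation, which includes the one-line
# description of each feature, such as LSTAT
boston_dataset = load_boston()
print(boston_dataset.DESCR)
```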
I actually wanted a little more detail on LSTAT, so I checked back with what it said in the original research paper to get a more complete description. What it says there is that by "lower status", the data is actually capturing things like the percentage of people who don't have a high school education and the proportion of male workers classified as laborers.

When I read this description, I immediately looked back at my correlation matrix, because I wanted to check the correlation between LSTAT and Industry: that is, between the proportion of industry in a neighborhood and the proportion of people who are classified as laborers or who don't have a high school education. That correlation, it turns out, is 0.6, so it's both positive and fairly high. Having read the description, this correlation all of a sudden made a whole lot more sense. And coming back to the scatter plot between LSTAT and house prices, what we can also clearly see is that property prices tend to be lower in areas with a high proportion of people who have no high school education or who work in factories. In other words, the property prices appear to be affected by the kind of people who live in the neighborhood. But we shouldn't jump to conclusions just yet; our multivariable regression will tell us a lot more.

Speaking of regression: looking at these charts and the pairplot, wouldn't it be nice to plot a regression line onto all of these plots as well? I think that would make things a lot clearer, don't you think? Let's go back to our Jupyter notebook and go down into the next cell, because seaborn will happily oblige if we add an extra argument to the powerful pairplot function: "sns.pairplot(data, kind='reg')". All we have to do is change the kind argument from "scatter", which is the default (pressing Shift+Tab to bring up the documentation, we can see that "kind='scatter'" is this function's default), to regression, which has the code "reg".

Now, I know what this regression will look like, because I've already run this code before. The thing is, the regression line will be colored exactly the same way as our data points. It'll be blue on blue, which won't be very helpful. We want to color our regression line a little differently.
And I want to do that now. I don't want to execute this code, wait 20 seconds, then add more code and wait another 20 seconds. So I'm going to add another keyword argument here, and that's going to be "plot_kws". I'm going to set it equal to a dictionary; this is how we're going to add a splash of color to our regression line. A dictionary always has curly braces, and then it has a key and a value. The key is going to be "line_kws", and the value comes after the colon. So what's the value we're going to put here? It turns out it's going to be another dictionary. So we have curly braces again, then another key, which is going to be "color", then a colon, and here's the value: "cyan", which is the color I'm going to use.

What we're looking at now is a dictionary, "{'color': 'cyan'}", which is itself the value for the "line_kws" key of the outer dictionary. In other words, we've got a nested dictionary, a dictionary inside a dictionary. It's like the most boring version of the movie Inception, but this is how we're going to get a color for our regression line. All that's left to add is "plt.show()". And because I'm a very curious person, I'm also going to add the "%%time" Jupyter notebook magic.

Now I'm going to hit Shift+Enter. *drum roll*... What is happening? I should really get a coffee while this is running. There we go, now it's finished. How long did it take to run? 47 seconds. Absolutely brutal. But let's take a look at the result.

Again, we can see just how big this image is: it's about 2,500 by 2,500 pixels, and this is at 25 percent; 100 percent looks more like this. With the help of the regression line on our chart, we can now also see the positive relationship between our Industry variable and LSTAT visually. We know the correlation was around 0.6, and now we can see that borne out in the positively sloped regression line right here. And in the last row, we can see a regression line between each feature and our target variable individually.
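Assembled, the whole cell looks like this. Both "plot_kws" and "line_kws" are genuine seaborn arguments: pairplot forwards plot_kws to the underlying regplot call when kind='reg', and regplot in turn passes line_kws to the regression line it draws.

```python
%%time
# Regression pairplot; the nested dictionary recolors the fitted
# lines cyan so they stand out against the blue data points
sns.pairplot(data, kind='reg', plot_kws={'line_kws': {'color': 'cyan'}})
plt.show()
```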
This is really neat, because we can see whether each slope is positive or negative; it's basically doing a univariate regression 13 times. But the power of our analysis isn't going to come from looking at these regressions against the target in isolation. The power comes from combining all our explanatory variables and running a multivariable regression. Now, I have a feeling this air-pollution-themed cartoon reference is betraying my age a little bit. The point I'm trying to make is that we're going to combine all the explanatory power in our features to estimate our property prices in Boston, and our model of choice is called multivariable regression. After we've done all that, we're going to evaluate our results and see if we can improve our model further. I'll see you in the next lesson.