0
1
00:00:00,420 --> 00:00:06,450
Let's revisit our old friend the scatter plot. We saw in the previous lessons
1

2
00:00:06,540 --> 00:00:14,430
how important it is to use plots in conjunction with descriptive statistics to spot patterns and outliers.
2

3
00:00:15,030 --> 00:00:16,910
Using both of these tools together
3

4
00:00:17,100 --> 00:00:20,920
we get a more complete picture of what's actually going on.
4

5
00:00:21,210 --> 00:00:27,990
So far we've been visualizing our data and looking at the distribution of values of some individual features,
5

6
00:00:28,380 --> 00:00:36,270
like RM, the average number of rooms, or RAD, the index of accessibility to highways.
6

7
00:00:36,270 --> 00:00:43,950
We can dig deeper into the relationships between these feature pairs as well as between the features
7

8
00:00:44,160 --> 00:00:47,760
and our target value with some scatter plots.
8

9
00:00:47,760 --> 00:00:53,520
The correlation matrix that we created already hinted at the fact that there are relationships amongst
9

10
00:00:53,520 --> 00:01:01,360
the features that we can visualize. So on that note, I'd like to start this lesson off with a challenge.
10

11
00:01:01,380 --> 00:01:09,650
First, I want you to picture what the relationship would look like between the NOX and DIS features.
11

12
00:01:09,930 --> 00:01:17,910
If you recall, NOX was a measure of pollution and DIS was a measure of distance from employment centers.
12

13
00:01:17,910 --> 00:01:25,610
Picture a graph in your head and then write the two lines of Python code to visualize the scatter plot.
13

14
00:01:25,630 --> 00:01:29,030
I'll give you a few seconds to pause the video before I show you the solution.
14

15
00:01:31,870 --> 00:01:32,770
Ready?
15

16
00:01:32,770 --> 00:01:33,580
Here we go.
16

17
00:01:33,580 --> 00:01:40,750
So we've created a scatter plot many times before with "plt", which is our matplotlib module,
17

18
00:01:41,470 --> 00:01:47,930
".scatter()" and then we have to supply two things, the data for the x axis and the data for the y axis.
18

19
00:01:47,950 --> 00:01:52,960
So on the x axis we're gonna have our "data['
19

20
00:01:53,260 --> 00:01:54,290
DIS']",
20

21
00:01:54,520 --> 00:01:59,020
so this is going to be our distance which is going to be on the x axis, and on the y axis,
21

22
00:01:59,020 --> 00:02:07,030
we're going to have our pollution measure, which is going to be "data['NOX']" and then
22

23
00:02:07,030 --> 00:02:10,150
finally "plt.show()".
23

24
00:02:10,150 --> 00:02:14,560
So hitting Shift+Enter just gave me this error because I've come back to this notebook and I haven't
24

25
00:02:14,560 --> 00:02:16,260
run the cells above it yet.
25

26
00:02:16,420 --> 00:02:25,270
So I'm going to go to "Cell" > "Run All" and then I'm going to wait a little bit, scroll all the way down,
26

27
00:02:25,600 --> 00:02:26,950
and here we go.
27

28
00:02:26,950 --> 00:02:29,610
Was this the relationship that you imagined in your head?
28

29
00:02:29,790 --> 00:02:32,230
A kind of downward sloping line?
29

30
00:02:32,560 --> 00:02:35,740
Let me add some labels to this graph before I give you my interpretation.
30

31
00:02:35,890 --> 00:02:51,760
So "plt.xlabel('DIS - Distance from employment', fontsize = 14)", and then for
31

32
00:02:51,760 --> 00:02:56,100
the Y label, I'm going to copy this line, paste it in,
32

33
00:02:56,140 --> 00:02:57,510
change it to the Y label,
33

34
00:02:57,580 --> 00:03:09,480
change that to read "NOX - Nitric Oxide Pollution", and then my figure size I want to change as well.
34

35
00:03:09,700 --> 00:03:19,230
So I'm going to make it a bit bigger, so I'm going to say "plt.figure(figsize = ())",
35

36
00:03:19,770 --> 00:03:22,720
say maybe 9 and 6.
36

37
00:03:22,800 --> 00:03:32,730
And then I'm also going to add a title here, I'm going to say "plt.title('DIS vs
37

38
00:03:32,970 --> 00:03:42,300
NOX', fontsize = 14)". We're going to refresh this graph, see what we get.
38
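
A minimal sketch of the cell described so far, with the figure size, title and axis labels in place. The DataFrame here is synthetic stand-in data (an assumption, since the real notebook already has `data` loaded), so only the shape of the code matters, not the exact picture.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (assumption: in the real notebook,
# `data` is the already-loaded DataFrame with DIS and NOX columns).
rng = np.random.default_rng(0)
dis = rng.uniform(1, 12, size=200)
nox = 0.9 - 0.2 * np.log(dis) + rng.normal(0, 0.03, size=200)
data = pd.DataFrame({"DIS": dis, "NOX": nox})

fig = plt.figure(figsize=(9, 6))            # make the chart a bit bigger
plt.scatter(data["DIS"], data["NOX"])       # x axis: distance, y axis: pollution

plt.title("DIS vs NOX", fontsize=14)
plt.xlabel("DIS - Distance from employment", fontsize=14)
plt.ylabel("NOX - Nitric Oxide Pollution", fontsize=14)
plt.show()
```

Note that "plt.figure()" comes before "plt.scatter()", otherwise the scatter plot would be drawn on a different figure than the one being resized.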

39
00:03:43,800 --> 00:03:50,760
Okay so this makes the relationship between distance from employment centers and NOX, our nitric oxide
39

40
00:03:50,760 --> 00:03:57,660
pollution much, much more clear. What we can see here is that as distance increases, as we go more to the
40

41
00:03:57,660 --> 00:04:03,580
right of this chart here, pollution goes down and this makes sense, right?
41

42
00:04:03,750 --> 00:04:10,380
The city center of Boston is going to be an employment center but city centers would also have much
42

43
00:04:10,380 --> 00:04:14,510
more air pollution than in the suburbs or on the outskirts of the city.
43

44
00:04:15,540 --> 00:04:21,930
Now one thing that might be quite interesting to add to this graph is a little bit of transparency on
44

45
00:04:21,960 --> 00:04:29,190
these data points, as well as maybe putting down the correlation that we've calculated up here and
45

46
00:04:29,190 --> 00:04:31,530
including that in our title.
46

47
00:04:31,530 --> 00:04:35,670
So let's do that now. To calculate the correlation,
47

48
00:04:35,670 --> 00:04:48,890
I'm going to add a nox_dis_corr variable, set that equal to "data['NOX'].
48

49
00:04:49,280 --> 00:04:59,060
corr(data['DIS'])" and then what I'm going to do is in the title I'm going to
49

50
00:04:59,060 --> 00:05:08,420
use this variable here and I'm going to include it in my string and I'm going to use Python's f-string notation
50

51
00:05:08,600 --> 00:05:09,710
to accomplish this.
51

52
00:05:09,710 --> 00:05:16,400
So I'm going to put f in front of the single quote and then I'm going to modify my string as follows, I'm
52

53
00:05:16,400 --> 00:05:22,160
going to say "(Correlation )" and here's the key,
53

54
00:05:22,160 --> 00:05:27,800
"{nox_dis_corr}".
54

55
00:05:27,860 --> 00:05:33,750
So this is going to grab our variable from up here,
55

56
00:05:33,760 --> 00:05:41,680
it's gonna grab our correlation between distance and pollution and it's going to insert it into our string.
56

57
00:05:41,710 --> 00:05:44,100
And that's thanks to the fact that we have
57

58
00:05:44,100 --> 00:05:50,750
the curly bracket notation around the variable name and this little f in front.
58

59
00:05:50,770 --> 00:05:53,550
So let me hit Shift+Enter and see what this looks like.
59

60
00:05:54,870 --> 00:05:55,320
Voila!
60

61
00:05:56,210 --> 00:06:00,400
Now we've got a graphical representation of our data and the correlation,
61

62
00:06:00,470 --> 00:06:06,500
all in one place. And the correlation is indeed negative and it's quite strong actually at -0.77.
62

63
00:06:06,500 --> 00:06:09,440
Now in terms of styling,
63

64
00:06:09,450 --> 00:06:13,350
you might say to yourself: You know what this number is way too precise,
64

65
00:06:13,350 --> 00:06:17,710
it's difficult to read because it's got too many values after the decimal point.
65

66
00:06:17,760 --> 00:06:23,790
So why don't we round it? And we can do this with the Python round function.
66

67
00:06:23,790 --> 00:06:29,760
So I'm going to do it up here where I've actually calculated the correlation and I'm just going to surround
67

68
00:06:30,030 --> 00:06:37,590
my correlation calculation with this Python function, so "round", comma at the end and then a value for
68

69
00:06:37,590 --> 00:06:43,920
how many decimal places I want to round it to. So I'm going to round it to three decimal places and close my
69

70
00:06:43,920 --> 00:06:45,580
parentheses at the end.
70

71
00:06:45,810 --> 00:06:52,020
If I press Shift+Enter now it should refresh and we should get something like this, we should get
71

72
00:06:52,170 --> 00:06:56,760
-0.769.
72
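
As a rough sketch, the correlation, the rounding and the f-string title fit together like this. The data is synthetic (an assumption standing in for the notebook's `data`), so the number it prints will not match the one in the video.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (assumption, not the real dataset).
rng = np.random.default_rng(0)
dis = rng.uniform(1, 12, size=200)
nox = 0.9 - 0.2 * np.log(dis) + rng.normal(0, 0.03, size=200)
data = pd.DataFrame({"DIS": dis, "NOX": nox})

# Pearson correlation between the two columns, rounded to three decimal places.
nox_dis_corr = round(data["NOX"].corr(data["DIS"]), 3)

# The f prefix plus the curly braces insert the variable's value into the string.
title = f"DIS vs NOX (Correlation {nox_dis_corr})"
print(title)
```

The resulting string can then be passed to "plt.title()" in place of the plain title.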

73
00:06:56,760 --> 00:07:02,460
The other thing I quite like doing with scatter plots is adding a little bit of transparency to the
73

74
00:07:02,460 --> 00:07:10,050
data points so that we can get a better feel for how dense particular areas of the chart are.
74

75
00:07:10,050 --> 00:07:14,700
So in my line of code where I'm creating my scatter plot, namely this one, I'm going to add some other
75

76
00:07:14,730 --> 00:07:16,290
keyword arguments.
76

77
00:07:16,410 --> 00:07:25,430
The transparency is set with the alpha keyword, and I'm going to set it to a value of 0.6.
77

78
00:07:25,610 --> 00:07:32,140
Let me hit Shift+Enter and we can clearly see that there are a lot more data points here than over here.
78

79
00:07:32,150 --> 00:07:38,900
I think this is a nice touch, but we can make this even more explicit by changing the size of our dots
79

80
00:07:39,230 --> 00:07:40,830
and making them a little bit larger.
80

81
00:07:40,830 --> 00:07:48,740
So if I choose something like 80, "s = 80" as a keyword argument, changing the size, then I've got slightly
81

82
00:07:48,740 --> 00:07:51,150
larger dots for my data points.
82
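
A short sketch of the two keyword arguments just mentioned, again on synthetic stand-in data (an assumption): "alpha" controls transparency and "s" controls the marker size.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (assumption, not the real dataset).
rng = np.random.default_rng(0)
dis = rng.uniform(1, 12, size=200)
nox = 0.9 - 0.2 * np.log(dis) + rng.normal(0, 0.03, size=200)
data = pd.DataFrame({"DIS": dis, "NOX": nox})

plt.figure(figsize=(9, 6))
# alpha=0.6 makes overlapping points look darker, s=80 enlarges the dots.
points = plt.scatter(data["DIS"], data["NOX"], alpha=0.6, s=80)
plt.show()
```

Further keyword arguments can be appended to the same "plt.scatter()" call in the same way.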

83
00:07:51,170 --> 00:07:58,340
Now of course we can continue adding keyword arguments here to style the graph as we see fit, famously
83

84
00:07:58,520 --> 00:08:01,900
color and there's quite a few to choose from.
84

85
00:08:02,060 --> 00:08:08,980
I'm going to go with indigo and give my scatter plot a purple makeover. Okay,
85

86
00:08:09,010 --> 00:08:13,970
so I think creating a scatterplot with matplotlib is pretty straightforward,
86

87
00:08:14,470 --> 00:08:21,160
but now let's do the same thing with the seaborn module to mix it up a little bit, because remember
87

88
00:08:21,220 --> 00:08:24,430
I said that seaborn builds upon matplotlib?
88

89
00:08:24,460 --> 00:08:30,390
Well you're gonna see in a minute how seaborn really adds some nice little touches to these visualizations.
89

90
00:08:30,400 --> 00:08:31,600
Check this out.
90

91
00:08:31,600 --> 00:08:37,770
So I'm going to come down here, add a few more cells and then I'm going to write the following code,
91

92
00:08:37,960 --> 00:08:47,980
I'm going to say "sns.jointplot" so sns being the name for the seaborn module and then jointplot
92

93
00:08:48,130 --> 00:08:52,620
being the function to create our scatter plot.
93

94
00:08:52,810 --> 00:09:04,870
So I'm going to say "jointplot(x=data['DIS'], y=data['NOX'])" and
94

95
00:09:04,870 --> 00:09:14,340
then on the next line I'm going to say "plt.show()", hit Shift+Enter and what we get is something like this.
95

96
00:09:14,410 --> 00:09:20,870
Now again I've only specified two parameters in my function call here, but you can already see that there
96

97
00:09:20,870 --> 00:09:27,250
is some sort of histogram on the side and there's some additional data being provided here in this corner.
97

98
00:09:27,330 --> 00:09:32,480
Now if you can't read this on your screen, this is actually the Pearson correlation coefficient down
98

99
00:09:32,480 --> 00:09:36,310
to two decimal places, -0.77.
99

100
00:09:36,530 --> 00:09:43,730
I can make the chart a little larger so that it's a bit more clear by going to the arguments and providing
100

101
00:09:43,730 --> 00:09:46,220
the size argument.
101

102
00:09:46,280 --> 00:09:53,170
So I'm going to say "size = 7", increase the size a little bit but not too much.
102

103
00:09:53,330 --> 00:09:59,010
Now you should see the chart appear a little bit larger on your screen, but I think these histograms
103

104
00:09:59,300 --> 00:10:05,420
and the correlation coefficient and the fact that it adds some labels for the y axis and the x axis
104

105
00:10:05,840 --> 00:10:12,960
automatically straight out of the box are a really, really nice touch. In terms of styling,
105

106
00:10:12,980 --> 00:10:19,010
one thing that you might notice is that the Jupyter notebook remembers how you've styled charts
106

107
00:10:19,130 --> 00:10:20,420
previously.
107

108
00:10:20,540 --> 00:10:25,130
So if you're working in a new cell and you want a new look for the chart you might have to reset the
108

109
00:10:25,130 --> 00:10:31,670
styling. The way to reset the styling for seaborn is with a function called "set".
109

110
00:10:31,820 --> 00:10:37,540
So "sns.set()" will reset the styling to the default styling.
110

111
00:10:37,540 --> 00:10:44,780
So now if I press Shift+Enter I get the default parameters for the styling and we kind of get this look
111

112
00:10:44,900 --> 00:10:46,640
right here.
112

113
00:10:46,670 --> 00:10:52,100
This set function is a good function to remember if you've ever got like a little bit of a longer notebook
113

114
00:10:52,130 --> 00:10:55,670
like the one we've got here and you might have written some code up above
114

115
00:10:55,670 --> 00:11:01,320
that changes the styling of these charts and you want to do something different and your notebook
115

116
00:11:01,330 --> 00:11:08,710
is behaving a little bit unexpectedly, so "sns.set()" resets the styling to default and 
116

117
00:11:09,170 --> 00:11:18,730
"sns.set_style()" allows us to choose kind of like a template style to use for the chart.
117

118
00:11:18,770 --> 00:11:25,370
So there's a couple of templates to choose from, one of them is called white, and this template
118

119
00:11:25,400 --> 00:11:29,510
will make our chart look like so which is kind of what we had before.
119

120
00:11:29,930 --> 00:11:36,440
But there's another template called whitegrid, which then adds these grid lines to the chart like so.
120

121
00:11:36,460 --> 00:11:43,820
Now of course there's also like darkgrid and dark and pressing Shift+Tab on this function will actually
121

122
00:11:43,820 --> 00:11:49,690
show us what some of the options are - darkgrid, whitegrid, dark, white, ticks,
122

123
00:11:49,880 --> 00:11:57,020
got a couple to choose from if we want. And you even got some examples on how you would use them.
123

124
00:11:57,030 --> 00:12:03,870
So for example if you wanted to use ticks you can even provide the tick size as an additional argument.
124

125
00:12:03,870 --> 00:12:04,250
All right.
125

126
00:12:04,290 --> 00:12:11,280
So that's a little bit more detail on how you can control the aesthetics of your seaborn chart in your
126

127
00:12:11,280 --> 00:12:12,440
notebook.
127
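
A small sketch of resetting and then choosing a style. It inspects the active style through "sns.axes_style()" rather than drawing a chart; that inspection step is an addition for illustration and is not something shown in the video.

```python
import seaborn as sns

sns.set()                      # reset seaborn styling to its defaults

sns.set_style("whitegrid")     # white background with grid lines
whitegrid_has_grid = sns.axes_style()["axes.grid"]

sns.set_style("white")         # white background, no grid lines
white_has_grid = sns.axes_style()["axes.grid"]

print(whitegrid_has_grid, white_has_grid)
```

Any chart drawn after a "set_style" call picks up that style, which is why an earlier cell can change how a later chart looks.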

128
00:12:12,600 --> 00:12:17,880
But the last thing I want to mention is that there is an additional template that you can mix and match
128

129
00:12:17,940 --> 00:12:27,540
with say whitegrid or darkgrid and these templates are called contexts, if you will.
129

130
00:12:27,570 --> 00:12:37,830
So "sns.set_context()" will allow us to put in a template here for how this
130

131
00:12:37,830 --> 00:12:46,710
chart is gonna be used. For example, a context might be "talk", and if I use that then you can see that the
131

132
00:12:46,710 --> 00:12:52,170
font size is a lot larger and the dots are a little bit more clear.
132

133
00:12:52,200 --> 00:12:58,170
So this is presumably because you want to present this chart somewhere.
133

134
00:12:58,200 --> 00:13:00,810
Now there's a couple of other contexts as well.
134

135
00:13:00,900 --> 00:13:05,560
You can use "notebook" which will make the chart look like this.
135

136
00:13:05,670 --> 00:13:10,230
This is a template that's quite good if you're viewing this kind of stuff on a monitor and you're not
136

137
00:13:10,500 --> 00:13:17,730
having to throw it up on a screen or like a presentation and pressing Shift+Tab on context shows
137

138
00:13:17,730 --> 00:13:20,920
us that there's a couple of other options as well.
138

139
00:13:21,060 --> 00:13:27,550
So there's "paper", there's "poster" and there's "talk" and "notebook" which we've already looked at.
139
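
As a sketch, switching contexts looks like this. Reading the font size back with "sns.plotting_context()" is an illustrative addition (not shown in the video) to make the scaling visible without drawing a chart.

```python
import seaborn as sns

sns.set_context("notebook")    # sized for viewing on a monitor
notebook_font = sns.plotting_context()["font.size"]

sns.set_context("talk")        # larger elements, meant for presentations
talk_font = sns.plotting_context()["font.size"]

print(notebook_font, talk_font)
```

The context can be combined with any style, for example "sns.set_style('whitegrid')" followed by "sns.set_context('talk')".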

140
00:13:27,570 --> 00:13:34,800
I'm going to go with "talk" just to make it a little bit more readable on the video. The very last thing
140

141
00:13:34,800 --> 00:13:43,260
I'm going to mention on the styling front is how you can get the similar sort of transparency and the color
141

142
00:13:43,950 --> 00:13:46,290
that we have here on matplotlib.
142

143
00:13:46,350 --> 00:13:52,860
So I'm going to show you how you can set that by supplying certain arguments. The color argument is pretty
143

144
00:13:52,860 --> 00:13:53,450
straightforward,
144

145
00:13:53,460 --> 00:14:05,530
so "color = 'indigo'" will give us a purple chart but when it comes to the transparency, you have to supply
145

146
00:14:05,530 --> 00:14:12,190
the argument in a different way, because jointplot doesn't take an argument called alpha, that's only
146

147
00:14:12,190 --> 00:14:14,180
for matplotlib.
147

148
00:14:14,320 --> 00:14:23,880
Instead you have to go to the keyword arguments, so "joint_kws = "
148

149
00:14:24,070 --> 00:14:31,450
and here you provide a dictionary, a Python dictionary, so that uses the curly braces notation and then
149

150
00:14:31,450 --> 00:14:42,130
a key value pair - "alpha" for the key, colon, and then say 0.5 for the value. If I press Shift+
150

151
00:14:42,130 --> 00:14:47,260
Enter now we'll have the transparency applied to our data points.
151

152
00:14:47,860 --> 00:14:53,920
So I hope you find it useful to see two different ways of generating this chart with different modules.
152

153
00:14:53,920 --> 00:14:58,780
The first one matplotlib and the second one being seaborn.
153

154
00:14:58,920 --> 00:15:01,390
Now there's a really cool thing I want to show you next.
154

155
00:15:01,800 --> 00:15:05,560
And that's to do with the fact that this jointplot method is actually incredibly powerful.
155

156
00:15:05,680 --> 00:15:08,840
So let me copy this cell and paste it below.
156

157
00:15:08,950 --> 00:15:16,270
So I have two copies of it now and what I'm going to do is just for comparison I'm going to modify how these
157

158
00:15:16,270 --> 00:15:19,000
data points are represented here.
158

159
00:15:19,090 --> 00:15:24,100
So I'm going to go with a blue color to set them apart.
159

160
00:15:24,100 --> 00:15:27,050
So I've got my blue one here, purple one here.
160

161
00:15:27,400 --> 00:15:34,300
And then what I'm going to do is I'm going to show you this keyword argument that I'm going to change in the
161

162
00:15:34,300 --> 00:15:43,630
quick documentation. So pressing Shift+Tab on jointplot shows us that this keyword argument, "kind",
162

163
00:15:44,170 --> 00:15:47,200
is set to scatter by default.
163

164
00:15:47,380 --> 00:15:57,250
But there's other values that this can take, for example "kde", "reg", "resid" and "hex".
164

165
00:15:57,340 --> 00:16:02,290
So we've actually got a choice between five different values.
165

166
00:16:02,290 --> 00:16:04,390
Let me show you what one of them does.
166

167
00:16:04,540 --> 00:16:10,960
I'm going to go ahead and delete this argument here where we've set our alpha value.
167

168
00:16:10,960 --> 00:16:18,060
And if I press Shift+Enter you can see that we no longer have any alpha values on our chart.
168

169
00:16:18,400 --> 00:16:27,910
But if I come in here and I change the kind to "hex", so writing "kind = 'hex'"
169

170
00:16:28,830 --> 00:16:36,540
and then I hit Shift+Enter, we get the following. We get a chart that looks like this and what this chart is
170

171
00:16:36,540 --> 00:16:43,620
doing is that it's aggregating the data points that all fall in a certain area and then it shades them
171

172
00:16:43,620 --> 00:16:51,900
in depending on how many data points there are in that particular sector. So you're aggregating the data
172

173
00:16:51,900 --> 00:16:53,980
points over like a little 2D area.
173

174
00:16:54,120 --> 00:17:00,030
And again the shading gives us a very good idea of the density of the data points in that particular
174

175
00:17:00,030 --> 00:17:02,520
part of the plot.
175

176
00:17:02,520 --> 00:17:07,350
In other words we're aggregating the data in a hexagonal grid.
176

177
00:17:07,350 --> 00:17:11,280
And I think this is quite a beautiful visualization actually.
177

178
00:17:11,370 --> 00:17:16,290
And it's one that you don't tend to see that often but it does remind me a little bit of a board game
178

179
00:17:16,290 --> 00:17:17,970
called Settlers of Catan.
179

180
00:17:18,030 --> 00:17:19,110
But maybe that's just me.