1
00:00:00,630 --> 00:00:06,980
The next feature I want you to investigate is called RAD.
1

2
00:00:07,030 --> 00:00:16,390
This is a measure of the accessibility to highways for the property, and I want to challenge you to use
2

3
00:00:16,480 --> 00:00:22,690
matplotlib to generate a meaningful histogram for this RAD feature.
3

4
00:00:22,720 --> 00:00:28,900
This might be a little tricky and require some thought, so pause the video, play with the Python code
4

5
00:00:29,410 --> 00:00:33,170
and have a think about what this feature is actually telling us.
5

6
00:00:33,430 --> 00:00:39,650
Oh, and for the histogram, pick a beautiful royal purple color while you're at it.
6

7
00:00:39,950 --> 00:00:42,400
I'll give you a few seconds to pause the video.
7

8
00:00:44,300 --> 00:00:45,980
Here's the solution.
8

9
00:00:46,160 --> 00:00:51,200
Let's check what would happen if we took this code here,
9

10
00:00:51,680 --> 00:00:53,060
pasted it in,
10

11
00:00:53,060 --> 00:01:06,600
changed RM to RAD, changed the x label to "Accessibility to Highways" and changed the hex code to
11

12
00:01:06,660 --> 00:01:12,440
a nice purple from materialpalette.com, change that here,
12

13
00:01:12,460 --> 00:01:15,780
paste it in and hit Shift+Enter.
13

14
00:01:16,160 --> 00:01:19,800
We get something like this. Now,
14

15
00:01:20,000 --> 00:01:23,090
this looks a little strange to me.
15

16
00:01:23,450 --> 00:01:31,610
It seems like the histogram's bins are hiding some information from us, because the bins for this histogram
16

17
00:01:31,880 --> 00:01:36,960
seem a little too broad. If I look at the Python code,
17

18
00:01:37,070 --> 00:01:38,610
"plt.hist()",
18

19
00:01:39,120 --> 00:01:47,610
we haven't supplied any bins as an argument to this function call and this means we're using automatic
19

20
00:01:47,870 --> 00:01:48,850
binning.
20

21
00:01:48,870 --> 00:01:54,610
We're letting matplotlib decide on how to show us the histogram.
21
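Here's a rough sketch of what automatic binning does. It uses a small made-up series as a stand-in for data['RAD'] (not the real column), and the hex code and y label are my own assumptions. Unless your configuration says otherwise, matplotlib falls back to 10 equal-width bins, so several index values end up sharing a bar.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for data['RAD']; the values are illustrative only.
rad = pd.Series([1, 2, 3, 3, 4, 5, 5, 5, 7, 8, 24, 24, 24, 24])

# No bins argument: matplotlib chooses the bin edges itself.
counts, edges, _ = plt.hist(rad, ec="black", color="#7b1fa2")  # hex code is an assumption
plt.xlabel("Accessibility to Highways")
plt.ylabel("Nr. of Houses")  # label text is an assumption

print(len(edges) - 1)        # how many bins were chosen automatically
print(edges[1] - edges[0])   # width of each bin, wider than one index step
```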

22
00:01:54,760 --> 00:02:03,000
Maybe what we need to do is investigate what RAD actually is and how accessibility to highways is actually
22

23
00:02:03,000 --> 00:02:03,930
measured.
23

24
00:02:03,930 --> 00:02:06,670
For example, what are the units in RAD?
24

25
00:02:06,720 --> 00:02:13,200
Perhaps we should try to understand this before creating our visualization, so let's output RAD to our
25

26
00:02:13,200 --> 00:02:22,410
Jupyter notebook, so I'm going to say "data['RAD']", all caps, and hit Shift+Enter.
26

27
00:02:23,310 --> 00:02:25,740
And I'm going to scroll down and just take a look at this.
27

28
00:02:29,570 --> 00:02:37,870
So I've got 506 entries and all of these seem to be whole numbers.
28

29
00:02:38,140 --> 00:02:42,090
So it starts out with 1, 2, 3,
29

30
00:02:42,160 --> 00:02:44,740
some of them have 5,
30

31
00:02:44,770 --> 00:02:46,950
some of them have 24.
31

32
00:02:47,090 --> 00:02:47,470
Hmm,
32

33
00:02:48,130 --> 00:02:56,930
okay, so in contrast to the house prices, RAD is a bunch of distinct integer values and all the
33

34
00:02:56,930 --> 00:02:59,000
values seem to be whole numbers.
34

35
00:02:59,000 --> 00:03:06,080
A better way that we can see this and just look at how many unique values there are in the series is to
35

36
00:03:06,080 --> 00:03:11,100
use the value_counts method on this series.
36

37
00:03:11,120 --> 00:03:21,500
So I'm going to put a dot after "data['RAD']" and write "value_counts()"
37

38
00:03:22,030 --> 00:03:24,730
and hit Shift+Enter.
38

39
00:03:24,740 --> 00:03:33,770
So this gives me a beautiful summary of how many observations in this column, in RAD, have a particular
39

40
00:03:33,770 --> 00:03:34,870
value.
40
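As a minimal sketch of what value_counts gives you, here it is on a small made-up series standing in for data['RAD']. The numbers are illustrative, not the real counts from the dataset.

```python
import pandas as pd

# Made-up stand-in for data['RAD']; not the real column.
rad = pd.Series([1, 2, 3, 3, 4, 5, 5, 5, 7, 8, 24, 24, 24, 24])

# Index = each unique value, values = how many rows have that value.
counts = rad.value_counts()
print(counts)

print(counts[24])      # label-based lookup: rows with the value 24
print(rad.nunique())   # number of distinct values in the series
```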

41
00:03:34,940 --> 00:03:39,760
So, for example, we can see that 17 observations,
41

42
00:03:39,780 --> 00:03:48,680
yeah 17 properties, in the dataset have a RAD value of 7 and there are 132 dwellings
42

43
00:03:48,950 --> 00:03:53,780
that have the highway accessibility value of 24.
43

44
00:03:53,840 --> 00:04:01,940
So keeping this in mind and scrolling back up to the description, RAD actually refers to an index
44

45
00:04:02,210 --> 00:04:05,070
of accessibility to radial highways.
45

46
00:04:05,240 --> 00:04:06,500
So that's what we're looking at.
46

47
00:04:06,620 --> 00:04:08,930
We're looking at an index.
47

48
00:04:09,290 --> 00:04:16,780
In other words, accessibility to highways is ranked from 1 to 24.
48

49
00:04:17,060 --> 00:04:23,900
1 is the value for low accessibility and 24 is the value for high accessibility.
49

50
00:04:23,960 --> 00:04:33,620
In other words, a property with poor accessibility to transport scores low on this index; and a property
50

51
00:04:33,800 --> 00:04:39,420
that has good accessibility to transport has a high value on this index.
51

52
00:04:39,500 --> 00:04:41,450
So looking at our histogram again.
52

53
00:04:41,660 --> 00:04:49,700
So what we probably want is for this histogram to reflect these index values instead of this automatic
53

54
00:04:49,870 --> 00:04:50,930
binning.
54

55
00:04:51,410 --> 00:04:56,300
We want to show these index values and we don't want to bin several of the index values together,
55

56
00:04:57,270 --> 00:05:03,470
and that's because the data in this RAD feature already has pretty much our bins mapped out for us.
56

57
00:05:03,590 --> 00:05:05,960
So we're gonna use these.
57

58
00:05:05,960 --> 00:05:13,760
I can modify the histogram code right here to take this into account simply by adding the bins argument
58

59
00:05:14,840 --> 00:05:18,660
and setting it equal to the value 24.
59

60
00:05:18,710 --> 00:05:24,380
Now let me refresh my histogram. Voila! All right.
60
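A minimal sketch of the same fix, again on a made-up stand-in for data['RAD'], with an assumed hex code. Note that bins=24 only lines up with the index because the values happen to span 1 to 24, so every whole number lands in a bin of its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for data['RAD']; the values are illustrative only.
rad = pd.Series([1, 2, 3, 3, 4, 5, 5, 5, 7, 8, 24, 24, 24, 24])

# 24 equal-width bins between the smallest (1) and largest (24) value.
counts, edges, _ = plt.hist(rad, bins=24, ec="black", color="#7b1fa2")  # hex code is an assumption
plt.xlabel("Accessibility to Highways")

print(len(edges) - 1)   # number of bins
print(counts[-1])       # the last bin holds only the rows with value 24
plt.show()
```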

61
00:05:24,410 --> 00:05:31,650
So that completes the challenge, we plotted our histogram for the RAD feature and what we can see is
61

62
00:05:31,650 --> 00:05:37,010
that there are quite a few properties in the 1 to 7 range on the index.
62

63
00:05:37,180 --> 00:05:43,970
And there's also a whole bunch of properties for the value 24 on the index. But, you know what this
63

64
00:05:43,970 --> 00:05:54,600
histogram kind of looks like? It looks like a bar chart, and a bar chart is a histogram's cousin. Histograms
64

65
00:05:54,740 --> 00:05:58,440
and bar charts can be used to pretty much show the same information.
65

66
00:05:58,460 --> 00:06:06,020
So let me show you the Python code for creating a bar chart using matplotlib as well.
66

67
00:06:06,020 --> 00:06:12,770
This is another data visualization technique that's really handy to have in your tool belt. So I'm going to
67

68
00:06:12,770 --> 00:06:20,800
come down here, add a few more cells and I'm gonna make use of this value_counts method, so I'm going to copy
68

69
00:06:20,890 --> 00:06:31,450
this line of code and I'm going to store the output, the result from this code, in a variable called frequency.
69

70
00:06:34,790 --> 00:06:39,550
"frequency = data[
70

71
00:06:39,550 --> 00:06:44,010
'RAD'].value_counts()".
71

72
00:06:44,050 --> 00:06:48,390
Now, frequency is also a pandas series.
72

73
00:06:48,460 --> 00:06:58,440
You can see this if I write the code "type(frequency)", hit Shift+Enter, so data['RAD']
73

74
00:06:58,770 --> 00:07:06,190
is a series, but the return value of this value_counts method is also a series.
74
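A quick sketch of that check, assuming a made-up series in place of data['RAD']:

```python
import pandas as pd

# Made-up stand-in for data['RAD']; not the real column.
rad = pd.Series([1, 2, 3, 3, 4, 5, 5, 5, 7, 8, 24, 24, 24, 24])

frequency = rad.value_counts()

print(type(rad))         # the column itself is a pandas Series
print(type(frequency))   # and so is the result of value_counts()
```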

75
00:07:07,080 --> 00:07:12,840
And the reason I'm showing you this is because I want to draw your attention to something. I'm going to comment 
75

76
00:07:12,840 --> 00:07:19,650
this out, and what I want to do is I want to access these values right here.
76

77
00:07:19,710 --> 00:07:25,310
I just want to access the labels for these unique index values.
77

78
00:07:25,750 --> 00:07:27,790
I can do this in one of two ways.
78

79
00:07:27,820 --> 00:07:33,760
Check it out. If I say frequency.index,
79

80
00:07:36,500 --> 00:07:41,210
then I'll get a collection of all these index values in my series.
80

81
00:07:41,540 --> 00:07:47,020
So this is one way of doing it. I'm going to comment this out and I'll show you the second way.
81

82
00:07:48,910 --> 00:07:52,500
"frequency.axes[
82

83
00:07:52,540 --> 00:08:00,470
0]". If I hit Shift+Enter, then I get exactly the same result.
83
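Here are the two ways side by side, as a small sketch on made-up data standing in for data['RAD']. The equals call at the end is my own addition, just to confirm that both give back the same labels.

```python
import pandas as pd

# Made-up stand-in for data['RAD']; not the real column.
rad = pd.Series([1, 2, 3, 3, 4, 5, 5, 5, 7, 8, 24, 24, 24, 24])
frequency = rad.value_counts()

labels_a = frequency.index     # first way: the index attribute
labels_b = frequency.axes[0]   # second way: first (and only) entry of the axes list

print(labels_a)
print(labels_b)
print(labels_a.equals(labels_b))   # both hold the same unique RAD values
```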

84
00:08:01,910 --> 00:08:11,350
The axes attribute of the series can also be used to retrieve the row axis labels. And the reason I'm
84

85
00:08:11,350 --> 00:08:18,430
interested in these in the first place is because we're going to use these to label the x axis on the
85

86
00:08:18,430 --> 00:08:20,950
bar chart that we're going to create.
86

87
00:08:20,950 --> 00:08:23,100
So check it out. I'm going to comment this out
87

88
00:08:23,560 --> 00:08:31,660
and then to create the bar chart I'm going to take my matplotlib object, "plt.bar()",
88

89
00:08:32,260 --> 00:08:34,900
and then I have to supply two things.
89

90
00:08:34,930 --> 00:08:37,720
The first is what I want on the x axis.
90

91
00:08:37,720 --> 00:08:43,860
And this is gonna be "frequency.index".
91

92
00:08:44,090 --> 00:08:50,020
And the second thing I have to supply for the bar chart is the height of the individual bars.
92

93
00:08:50,050 --> 00:08:54,210
So this will be an argument called height and I'm going to set that equal to,
93

94
00:08:54,730 --> 00:09:00,040
well this would just be the values inside my frequency variable.
94

95
00:09:00,190 --> 00:09:01,890
That'll be these values here.
95

96
00:09:02,780 --> 00:09:03,690
So I'm going to say "height=
96

97
00:09:03,710 --> 00:09:12,670
frequency" and then we put "plt.show()" afterwards and scroll down and hit Shift+Enter.
97

98
00:09:12,700 --> 00:09:22,060
And this is what we get. As it is, there are no labels on the axes and there's also the default color
98

99
00:09:22,070 --> 00:09:23,360
being used.
99

100
00:09:23,360 --> 00:09:29,210
So what I'm going to do is I'm going to make this bar chart a little larger.
100

101
00:09:29,230 --> 00:09:34,470
Let me grab this code up here that we have, come down here,
101

102
00:09:34,480 --> 00:09:36,260
paste it in.
102

103
00:09:36,420 --> 00:09:43,830
I'm going to delete this line here and then I'm going to leave my x label and y label as they are and
103

104
00:09:43,830 --> 00:09:46,680
hit Shift+Enter.
104

105
00:09:46,850 --> 00:09:48,190
There we go.
105
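Put together, the bar chart code might look roughly like this. It's a sketch on made-up data standing in for data['RAD']; the figure size and the y label text are assumptions, since the pasted styling code isn't spelled out here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for data['RAD']; the values are illustrative only.
rad = pd.Series([1, 2, 3, 3, 4, 5, 5, 5, 7, 8, 24, 24, 24, 24])
frequency = rad.value_counts()

plt.figure(figsize=(10, 6))          # figure size is an assumption
plt.xlabel("Accessibility to Highways")
plt.ylabel("Nr. of Houses")          # label text is an assumption

# x positions come from the unique index values, bar heights from their counts.
bars = plt.bar(frequency.index, height=frequency)

print(len(bars))   # one bar per unique value, no bin count needed
plt.show()
```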

106
00:09:48,200 --> 00:09:54,470
So this is a bar chart, but I want to draw your attention to one thing. The neat thing about the code
106

107
00:09:54,470 --> 00:10:00,560
we've just written is that we haven't had to specify the number of bins ahead of time,
107

108
00:10:00,560 --> 00:10:03,840
we haven't had to write "bins=24".
108

109
00:10:04,070 --> 00:10:08,280
We haven't had to hard code the number 24 for the number of bins.
109

110
00:10:08,330 --> 00:10:17,310
Instead we wrote some Python code using value_counts which figured out the best way to draw the x and
110

111
00:10:17,310 --> 00:10:20,960
y axes for our bar chart for us.
111

112
00:10:21,220 --> 00:10:26,260
So this is a technique that you can apply to other types of indexed data as well.
112

113
00:10:26,320 --> 00:10:33,060
It makes the code that we've just written a lot more flexible than hard coding particular integer values.
113

114
00:10:33,160 --> 00:10:35,100
And that's a good thing.
114

115
00:10:35,380 --> 00:10:41,860
You're also gonna be looking at this chart here and you might be thinking: Hmmm this looks a lot better
115

116
00:10:41,980 --> 00:10:45,990
than the histogram just because it's got these spaces in between the bars.
116

117
00:10:46,300 --> 00:10:51,520
Because if we look at our histogram, it kind of looks like this at the moment - all the bins, all the bars,
117

118
00:10:51,850 --> 00:10:54,100
are jam packed together.
118

119
00:10:54,130 --> 00:10:59,000
So let me give you a little challenge so you can familiarize yourself with the histogram function 
119

120
00:10:59,020 --> 00:11:01,180
a little better as well.
120

121
00:11:01,180 --> 00:11:06,970
I want you to modify this histogram so that it's also got some spaces between the bars.
121

122
00:11:06,970 --> 00:11:13,180
The trick will be to look at the documentation by, say, pulling up the quick documentation in the notebook
122

123
00:11:13,540 --> 00:11:21,690
and looking for the right argument to supply to the function call. You can pull up the quick documentation
123

124
00:11:21,690 --> 00:11:27,780
by pressing Shift and then Tab on your keyboard and hitting this little plus sign and scrolling down
124

125
00:11:27,870 --> 00:11:29,200
and taking a look at this
125

126
00:11:29,290 --> 00:11:35,100
here. I'll give you a few seconds to pause the video so you can find the parameter that you have to modify
126

127
00:11:35,490 --> 00:11:40,340
and give the bars a little bit more breathing room.
127

128
00:11:40,350 --> 00:11:44,920
How did you get on? Did you solve it? Here's the solution.
128

129
00:11:44,920 --> 00:11:53,260
The argument that we need to specify in this method call is "rwidth". By default,
129

130
00:11:53,260 --> 00:11:55,480
this has the value None.
130

131
00:11:55,480 --> 00:11:59,290
But let's check out what the description says for rwidth.
131

132
00:12:03,080 --> 00:12:09,570
If I scroll down in the quick documentation I can see that rwidth is an optional argument and that
132

133
00:12:09,570 --> 00:12:18,500
it is a number that specifies the relative width of the bars as a fraction of the total bin width. And
133

134
00:12:18,510 --> 00:12:19,990
the first time I read this,
134

135
00:12:20,490 --> 00:12:23,600
that didn't make a whole lot of sense to me.
135

136
00:12:23,610 --> 00:12:31,030
So what I had to do is try out a couple of different numbers and see how the chart turned out.
136

137
00:12:31,080 --> 00:12:42,030
So if we write "rwidth = 1" and hit Shift+Enter and see what we get, no change. But if we change
137

138
00:12:42,030 --> 00:12:50,590
that to say 0.5 and hit Shift+Enter, our histogram starts looking like this.
138

139
00:12:51,690 --> 00:12:59,310
So what this rwidth argument is doing is this: if it's set to 0.5, our bar width will be approximately
139

140
00:12:59,490 --> 00:13:09,420
0.5 of the bin width and on either side of the bar we'll have a space of about 0.25. If we make this
140

141
00:13:09,570 --> 00:13:19,780
0.7, the gaps will get smaller and if we make this 0.3 then the gaps will
141

142
00:13:19,780 --> 00:13:29,760
get wider. So, in essence, you can pass a value between 0 and 1 to this rwidth argument and you'll get
142

143
00:13:29,760 --> 00:13:30,660
different results.
143

144
00:13:31,120 --> 00:13:40,670
If I put in the value 10 then I get exactly the same as if I put in the value 1. All good? I'm going to
144

145
00:13:40,670 --> 00:13:44,110
leave it at 0.5. Cool.
145
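If you'd rather check this with numbers than by eye, here's a small sketch on made-up data standing in for data['RAD']. It draws the histogram with a few rwidth values and measures how wide the first bar is relative to its bin. The ratios dictionary is just a helper name I'm introducing for the illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for data['RAD']; the values are illustrative only.
rad = pd.Series([1, 2, 3, 3, 4, 5, 5, 5, 7, 8, 24, 24, 24, 24])

ratios = {}  # rwidth value -> drawn bar width as a fraction of the bin width
for rw in (1, 0.7, 0.5, 0.3, 10):
    plt.figure()
    _, edges, patches = plt.hist(rad, bins=24, rwidth=rw)
    bin_width = edges[1] - edges[0]
    ratios[rw] = patches[0].get_width() / bin_width
    plt.close()

# Smaller rwidth means narrower bars and wider gaps;
# a value above 1 behaves like 1.
for rw, ratio in ratios.items():
    print(rw, round(ratio, 3))
```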

146
00:13:44,660 --> 00:13:51,470
So we've looked at the average number of rooms per dwelling, we've looked at access to radial highways
146

147
00:13:52,130 --> 00:13:58,400
and we've looked at the property prices in our visualizations. So both the number of rooms and the house
147

148
00:13:58,400 --> 00:14:01,540
prices were quite easy to understand, right?
148

149
00:14:01,670 --> 00:14:07,820
Measuring how good the transport links were on the other hand was a little bit more complex given that
149

150
00:14:08,090 --> 00:14:15,800
it was measured as an index value with accessibility to radial highways. But there's actually another
150

151
00:14:15,950 --> 00:14:24,470
very nifty technique that the researchers are using to capture some information about these Boston properties.
151

152
00:14:24,800 --> 00:14:31,340
You see, there's a river running through Boston and this river is called the Charles River and it looks
152

153
00:14:31,340 --> 00:14:39,700
something like this. Imagine for a second that you were conducting the original research and collating
153

154
00:14:39,880 --> 00:14:41,790
the Boston housing data.
154

155
00:14:42,100 --> 00:14:47,920
You want to be able to differentiate between the houses that are located right on the river and those
155

156
00:14:47,920 --> 00:14:50,210
that are located elsewhere.
156

157
00:14:50,230 --> 00:14:52,110
How would you go about doing this?
157

158
00:14:53,510 --> 00:14:57,900
And this brings us to our next challenge. And for this challenge,
158

159
00:14:57,900 --> 00:15:01,530
I want you to answer a very, very simple question.
159

160
00:15:01,800 --> 00:15:10,320
Tell me, out of the 506 properties in the dataset, how many properties are located on the Charles River?
160

161
00:15:10,410 --> 00:15:13,820
This challenge isn't going to be about data visualization.
161

162
00:15:13,920 --> 00:15:18,930
I just need a cold hard number from you. To solve this challenge,
162

163
00:15:18,930 --> 00:15:25,830
take a close look at the description of the features and then write a single line of code that will
163

164
00:15:25,830 --> 00:15:28,150
spit out the answer for you.
164

165
00:15:28,380 --> 00:15:30,810
And also while you're at it, have a think
165

166
00:15:30,840 --> 00:15:38,400
if you expect that the properties on the river will be worth more or less than properties that are away
166

167
00:15:38,400 --> 00:15:41,820
from the river. Is living next to the Charles River
167

168
00:15:41,880 --> 00:15:45,950
a good thing for house prices? Because we'll find out later.
168

169
00:15:46,200 --> 00:15:54,370
In the meantime, I'll give you a few seconds to pause the video so you can solve this challenge.
169

170
00:15:54,520 --> 00:15:58,390
Did you get it? Here is the solution.
170

171
00:15:58,470 --> 00:16:06,650
So the trick was looking for the feature description that would likely contain the answer, and maybe you
171

172
00:16:06,660 --> 00:16:15,030
discovered that there is a feature called CHAS and this is the Charles River dummy variable
172

173
00:16:15,540 --> 00:16:21,080
which equals 1 if the tract bounds the river and 0 otherwise.
173

174
00:16:21,240 --> 00:16:27,410
In other words, CHAS captures whether the property is on the river or not.
174

175
00:16:27,410 --> 00:16:30,840
Now let's scroll back down and write the Python code.
175

176
00:16:31,020 --> 00:16:36,370
We're going to be using our old friend value_counts to solve this.
176

177
00:16:36,400 --> 00:16:46,890
If I write "data['CHAS'].value_counts()" and hit
177

178
00:16:46,890 --> 00:16:50,570
Shift+Enter I'm going to get the following output.
178

179
00:16:50,570 --> 00:17:00,050
I can see here that CHAS only has one of two values, 0 or 1, which ties out exactly with what they've
179

180
00:17:00,050 --> 00:17:01,550
said in the description.
180

181
00:17:01,550 --> 00:17:07,440
0 means not on the river and 1 means located on the Charles River.
181
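Here's a minimal sketch of counting a dummy variable, using a made-up 0/1 series as a stand-in for data['CHAS']. The counts are illustrative, not the real ones. Summing the column is an alternative I'm adding that works only because the flag is coded as 0 and 1.

```python
import pandas as pd

# Made-up stand-in for data['CHAS']: 1 = tract bounds the river, 0 = otherwise.
chas = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

counts = chas.value_counts()
print(counts)

print(counts[1])         # rows flagged as being on the river
print(int(chas.sum()))   # same number, since every river row contributes 1
```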

182
00:17:07,460 --> 00:17:16,020
So the answer to the challenge's question is that there are 35 properties on the river. This type
182

183
00:17:16,020 --> 00:17:25,590
of feature is called a dummy variable and you'll find researchers using dummy variables to capture binary
183

184
00:17:25,770 --> 00:17:27,000
information.
184

185
00:17:27,060 --> 00:17:31,770
So this is a good example -  is the property on the river or not on the river?
185

186
00:17:31,770 --> 00:17:33,890
Are we dealing with a man or a woman?
186

187
00:17:33,900 --> 00:17:36,030
Are they unemployed or employed?
187

188
00:17:36,030 --> 00:17:37,140
Is it a homeowner
188

189
00:17:37,140 --> 00:17:38,530
or are they renting?
189

190
00:17:38,550 --> 00:17:42,940
This is the kind of information that you can capture with dummy variables.
190

191
00:17:43,000 --> 00:17:48,150
In other words, working with dummy variables is actually very similar to working with an index, except
191

192
00:17:48,150 --> 00:17:51,780
that a dummy variable can only have one of two values.
192

193
00:17:51,780 --> 00:17:52,620
Good stuff.
193

194
00:17:52,620 --> 00:17:56,120
So we're really getting into the nitty gritty. In the next lessons
194

195
00:17:56,250 --> 00:18:02,250
we're gonna be looking at descriptive statistics, outliers and scatter plots.
195

196
00:18:02,250 --> 00:18:03,150
I'll see you there.