In the previous lessons we imported our Boston house price dataset into our Jupyter notebook. Now that we've successfully gathered our data, it's time to take a good, hard look at it. This is really the next step in our workflow: we've formulated our question, we've gathered our data, and now it's time to explore our dataset in depth. Oftentimes we'll be exploring, visualizing and cleaning this data more or less at the same time, simply because the problems in a dataset only become apparent after you start digging into it.

Okay, so imagine that you're back at your real estate job and the office intern has just plonked a big dataset down on your desk. What are the first things you'd want to understand about a fresh dataset? What's a good starting point for getting to grips with a dataset you've never seen before? What kinds of questions would you want to ask when you're first starting out? Let me show you my own starting point: the first six questions I ask myself whenever I start working with a new dataset I haven't seen before.

The first question I ask myself is: where does the data come from? What's the source of the data?
The second question is: can I find some sort of short description of what's in the dataset? This is important for understanding the all-important context in which the data was collected, and also how it was collected.

Third: how big is the dataset, actually? How many individual data points are there? Am I dealing with an enormous dataset or a small one? This matters from a practical point of view, because working with a dataset of 10 million data points will require very different techniques than working with a dataset of, say, 10 data points. For starters, my aging laptop will totally struggle to crunch a huge dataset, so it's important to figure out what sort of beast you're going to be dealing with. But dataset size isn't just important from a practical point of view; it also matters from a theoretical perspective, because many statistical tests that you'll be using become a lot more powerful as the sample size increases.

The fourth question is: how many features are there in the dataset? What do I mean by features? Well, for each data point, how many aspects were measured? How many entries are there for each row in the table? How many columns are there? That's what I mean by features. You and I are going to be looking at house prices shortly.
Each house is going to be a row in this dataset, and the number of features will tell us how much information we have about each house. It will help us figure out how many characteristics we're going to base our prediction of the house value on.

The next two questions we're going to ask ourselves are: "What are the names of the features?" and "What is the description of each feature?". These questions are crucial because we need to understand what the dataset is actually measuring. Sometimes you'll get datasets with pretty unintuitive measurements, so it's important to dig in and understand exactly what's contained in the data. For starters, you'll probably want to check the units used in each column. For example, is our price given in dollars or in thousands of dollars? These are just some of the basics to get right.

Okay, so now that we've got our to-do list, let's return to the Python code and see if we can answer these initial questions and check them off one by one. There's a handy little Python function called dir() that we can use to look at a Python object's attributes. Check it out: typing dir(boston_dataset) and hitting Shift+Enter brings up the following output.
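As a minimal, self-contained sketch of what dir() is doing here: the real boston_dataset is a scikit-learn Bunch loaded in an earlier lesson (note that load_boston was removed in scikit-learn 1.2), so this tiny stand-in class is an assumption purely for illustration.

```python
# Stand-in for scikit-learn's Bunch: a dict whose keys can also be
# read as attributes, and which reports its keys through dir().
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

    def __dir__(self):
        # Make dir() list the stored keys, as scikit-learn's Bunch does
        return list(self.keys())

boston_dataset = Bunch(
    data=[[0.006, 6.575], [0.027, 6.421]],  # toy rows, not the real data
    DESCR="Boston house prices dataset (toy stand-in)",
)

# dir() reveals which attributes we can explore next
print(dir(boston_dataset))  # ['DESCR', 'data']
```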
What we're looking at here is a list of attributes for this Python object. The first attribute, DESCR, is a shorthand for, I'm guessing, description. So let's pull it out and print it. I'm going to write print(boston_dataset.DESCR), all caps, and let's take a look at what we see.

Using this attribute, we do indeed get a description of the dataset. The description was included in the Python object, and we were able to access it with the attribute we discovered through the dir() function.

Okay, so let's take a look at these notes. We've already seen that there are 506 instances, or rows, in this dataset, and that there are 13 attributes: 13 categories, or columns. These 13 columns are as follows: we've got per capita crime; we've got the concentration of nitric oxides, which is a proxy for pollution; we've got the average number of rooms per dwelling; we've got the pupil-teacher ratio by town, which is a proxy for the quality of the schools; and a whole bunch of other attributes.

Scrolling down a little more, we see that the two researchers who collated this dataset are Harrison and Rubinfeld, and that it's based on a research paper.
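For quick reference, the 13 feature abbreviations can be collected into a dictionary. The descriptions below are paraphrased from the dataset's DESCR text, so treat them as a summary rather than a verbatim quote.

```python
# The 13 feature abbreviations from the Boston dataset's DESCR text,
# paraphrased as a quick-reference dictionary (not a verbatim quote).
BOSTON_FEATURES = {
    "CRIM":    "per capita crime rate by town",
    "ZN":      "proportion of residential land zoned for large lots",
    "INDUS":   "proportion of non-retail business acres per town",
    "CHAS":    "Charles River dummy variable (1 if tract bounds the river)",
    "NOX":     "nitric oxides concentration (a proxy for pollution)",
    "RM":      "average number of rooms per dwelling",
    "AGE":     "proportion of owner-occupied units built before 1940",
    "DIS":     "weighted distances to five Boston employment centres",
    "RAD":     "index of accessibility to radial highways",
    "TAX":     "full-value property-tax rate per $10,000",
    "PTRATIO": "pupil-teacher ratio by town (a proxy for school quality)",
    "B":       "a transformed measure of the town's Black population",
    "LSTAT":   "percentage of the population of lower socioeconomic status",
}

print(len(BOSTON_FEATURES))  # 13, matching the documentation
```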
In fact, we can actually see the title of the original research paper here: "Hedonic prices and the demand for clean air", published in the Journal of Environmental Economics and Management, Vol. 5, in 1978. So this already answers a lot of the initial questions about our dataset.

It's actually fairly interesting that the original purpose of the researchers was to figure out how high the demand for clean air was in Boston. That's what they were trying to accomplish: they were trying to figure out how much more people are willing to pay to be able to breathe clean air in the city. The other important thing is that this housing data dates back to 1978, and that we're working with 506 different entries.

So let's check off the questions on our list. We've figured out the source of the data. We've read a brief description of the dataset. We've also managed to figure out the number of data points in the dataset, which was 506, and the number of features, which was 13. And lucky for us, the descriptions of the features were also given. They're called attributes, if you remember, and their names were given in all caps as abbreviations. Finally, we had a very brief description of each feature as well.
So I think for starters this is pretty good going, but let's crack on. As an aside, if you're curious about how this dataset was originally used, you can actually pull up the original research paper mentioned in the description. I just googled it and was sent to the University of Michigan library website, where I was able to pull up the PDF of the original paper for free. I think that's because Daniel Rubinfeld was actually at the University of Michigan, while his co-author David Harrison was at Harvard at the time.

Now, if you've googled this as well, let me show you how you can embed a link in your Jupyter notebook very, very easily. I can copy the URL here, go back to my Jupyter notebook, and then, in one of the markdown cells, say the gathered-data cell I've got here, I can use some square brackets and some parentheses to insert my URL. The URL goes between the two parentheses, so I'm going to paste in the URL that I copied from the other tab in my browser. Then, in the square brackets, I can include the text that I want to display instead of this long and unwieldy URL. So I'm going to write "Source: Original research paper", and when I hit Shift+Enter it gets displayed like so, and this is now an active link in our Jupyter notebook.
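The markdown link syntax described above looks like this. The URL here is just a placeholder, since the actual address depends on where you found the paper:

```markdown
[Source: Original research paper](https://example.com/path-to-paper.pdf)
```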
Now, you might not always get a nice description along with your dataset like this. So let's have a think about how we might look at the number of data points and the number of features manually, in case it isn't presented to us on a silver platter like this. Going down to the bottom, I'm going to insert another markdown cell and add a subheading that reads "Data points and features".

To look at the number of features in this dataset, I'm going to first access the Bunch object's data attribute; remember, we saw this above when we used the dir() function. So we can write boston_dataset.data, and this is what it looks like. From the output we can see that it's an array, and we can verify this by writing type(boston_dataset.data). Here we can see that it is in fact a numpy n-dimensional array; that's the type of object we're accessing. Now, if we want to see the number of rows and columns, the easiest way is to write boston_dataset.data.shape; shape is an attribute of a numpy array, if you recall. Hitting Shift+Enter, we can see that this array has 506 rows and 13 columns, which is good because it ties out with the documentation we read earlier.
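These checks can be sketched in a self-contained way. Since the real boston_dataset.data was loaded in the earlier lessons, the zero-filled array below is a stand-in with the same dimensions, purely for illustration:

```python
import numpy as np

# Stand-in for boston_dataset.data: a zero-filled array with the
# same dimensions the real dataset has (506 rows, 13 columns).
data = np.zeros((506, 13))

print(type(data))   # confirms it's a numpy n-dimensional array
print(data.shape)   # (506, 13): 506 data points, 13 features

rows, cols = data.shape  # shape is a tuple, so it unpacks cleanly
```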
One thing you'll also notice in this line of Python code is that we are chaining our attributes; I'll just add that as a comment here on the right-hand side. This is quite important to understand, because it's a good example of how objects and data can be nested inside one another. Scrolling to the very top, we see that when we work with boston_dataset, we are in fact working with an object of type Bunch, and this Bunch has a number of attributes, including that data attribute. The data attribute is in turn an object of type ndarray, a numpy n-dimensional array. And the n-dimensional array in turn also has attributes, including a shape attribute. When we call on an ndarray's shape attribute, we get back a tuple.

So when you see this dot notation being used to chain things together, you can think of it almost like a Russian matryoshka doll, where each doll contains another object. And if that's not your kind of thing, then think of it maybe like the movie Inception, where you had dreams within dreams, just with less gunfire. And on that note, I'll see you in the next lesson.
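The matryoshka-doll nesting above can be made explicit, one layer per line. As before, the Bunch here is a minimal stand-in assumed for illustration, since the real object was loaded with scikit-learn in an earlier lesson:

```python
import numpy as np

# A tiny stand-in for scikit-learn's Bunch: a dict whose keys can
# be read as attributes, so the dot-chaining below works.
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

boston_dataset = Bunch(data=np.zeros((506, 13)))

outer = boston_dataset              # layer 1: the Bunch itself
middle = boston_dataset.data        # layer 2: the ndarray inside it
inner = boston_dataset.data.shape   # layer 3: the tuple on the array

print(type(middle).__name__, inner)  # ndarray (506, 13)
```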