All right, so let's take a closer look at what we've got inside our boston_dataset bunch. Earlier we were talking about the attributes of this object. The thing is, in machine learning this word attribute is used in a different context. In machine learning, the features of a dataset are typically represented as the columns in a table, and these columns are often referred to as the attributes of a dataset. In other words, when we use the word attribute in the machine learning context, we're referring to a feature, or an independent variable. And this is what we're going to be using to predict a house price. So yes, the word attribute is used both in Python and in machine learning, but unfortunately it means completely different things. Speaking of features, let's pull up the features of our dataset in the Python notebook. We can do this again using the boston_dataset object, which has an attribute called "feature_names". Let's print this out. Here we can see all the feature names of our dataset in a nice little array. Again, the dir function was really handy in ascertaining that our bunch has an attribute called feature_names. These are the kinds of niceties that you get with a toy dataset like this. Now, taking a look at these feature names, the question you might ask is: where is the house price?
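As a sketch of this attribute lookup, here is a minimal stand-in for sklearn's Bunch object (a dict whose keys double as attributes), since load_boston is no longer available in recent scikit-learn releases; the feature names and targets below are just a small illustrative subset:

```python
# Minimal stand-in for sklearn's Bunch: a dict whose keys are also attributes.
# In the lesson, boston_dataset = load_boston() would produce the real thing.
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __dir__(self):
        # Listing the keys is what makes dir() reveal the bunch's contents
        return self.keys()

boston_dataset = Bunch(
    feature_names=["CRIM", "ZN", "INDUS", "RM", "NOX", "DIS"],  # subset, for illustration
    target=[24.0, 21.6, 34.7],
)

print(dir(boston_dataset))           # shows 'feature_names' and 'target'
print(boston_dataset.feature_names)  # attribute access works like in the lesson
```

The key point mirrors the lesson: dir() told us the attribute existed, and dot notation pulls it out.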
Where is the price of the houses? We've got a bunch of abbreviations here, and none of them seems to suggest anything about the values of the houses that we're looking to estimate. This means the price is hidden somewhere else. The key thing we're trying to predict is found in an attribute called "target". So boston_dataset.target will bring up the actual prices of the houses. This is why we didn't get a separate column for the house prices earlier: the house prices are actually found somewhere else in our bunch object. Now, looking at these house prices, you might be wondering about values like 24, 21 and 34. These look like they're prices for toy houses or something. They don't look high enough to be the dollar values of actual houses, because no house could possibly cost 24 dollars, right? Unless of course you buy it off AliExpress or something. The thing to note is that these units are actually in thousands. So these are the actual prices, in thousands of dollars. I'm going to add this here as a comment, so that if you come back to this notebook in three months' time it will still make sense.
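To make the units explicit, the comment in the notebook might look something like this; the three values below are simply the first few target prices mentioned above:

```python
# House prices in boston_dataset.target are in thousands of dollars,
# so a target value of 24.0 really means $24,000.
prices_in_thousands = [24.0, 21.6, 34.7]  # first few target values

prices_in_dollars = [p * 1000 for p in prices_in_thousands]
print(prices_in_dollars)  # [24000.0, 21600.0, 34700.0]
```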
Now, working with a bunch object in the notebook is all well and good, but one of the most common types of objects that you're actually going to encounter in your work as a machine learning expert or as a data scientist is the pandas dataframe. The pandas dataframe is going to be our main workhorse. So let me add a little section heading here to commemorate this. I'll add a subheading that reads "Data exploration with Pandas dataframes", and then I'm going to create a variable called "data" and have this variable hold on to our pandas dataframe object. The way we're going to create this dataframe is by using the pandas module. So before writing any more code here, I'm going to have to import the pandas module, right? I can't just write pd.DataFrame without importing pandas first. So I'm going to pause here for a second, go back up to the top where I've got all my import statements, write "import pandas as pd" and hit Shift+Enter. Now I can come back down here and actually make use of the module. To construct our dataframe from our boston_dataset, we're going to supply some arguments between the parentheses. The first argument is called "data", and we're going to set that equal to boston_dataset.data.
This is going to be the numpy array contained inside our boston_dataset bunch. The next argument, columns, is the argument for the column names, and we're going to set that equal to boston_dataset.feature_names. What this code will do is create a pandas dataframe. Now, remember how our house prices will not be included in this, so we're going to add those separately. I'm going to add a column with the price, our target, to the dataframe. The way we do this is by using our dataframe variable, which is called data, with some square brackets after it, and in those square brackets I supply a column name. I'm going to call this column "PRICE", all caps, and set it equal to boston_dataset.target. Okay, so let's hit Shift+Enter together and see if we get any errors. All good. Let me add a few more cells here, and then we can continue to explore our data, and I'll show you a couple of tricks using the pandas dataframe. The thing is, oftentimes your dataframe will be huge. This one has just 506 rows, but often you're going to be working with dataframes with many thousands of rows, or tens of thousands.
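Put together, the dataframe construction described above looks like this. Since the real bunch may not be loadable in newer scikit-learn, a small made-up numpy array stands in for boston_dataset.data (the real one is 506 rows by 13 columns); the three columns and their first values are taken from the actual Boston data for flavor:

```python
import numpy as np
import pandas as pd

# Stand-ins for the arrays inside the bunch; in the lesson these would be
# boston_dataset.data, boston_dataset.feature_names and boston_dataset.target.
feature_names = np.array(["RM", "NOX", "DIS"])
features = np.array([[6.575, 0.538, 4.0900],
                     [6.421, 0.469, 4.9671],
                     [7.185, 0.469, 4.9671]])
target = np.array([24.0, 21.6, 34.7])

# Same pattern as the lesson: build the frame from the features first...
data = pd.DataFrame(data=features, columns=feature_names)

# ...then add the target as its own column, named PRICE in all caps
data["PRICE"] = target

print(data)
```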
So the question is: how can you get a glimpse of the data inside a huge dataframe without printing out all of the values? For that, pandas gives us two dataframe methods: the first one is called "head" and the second one is called "tail". Let me show you how you'd use them. If we were to write "data" and hit Shift+Enter, our notebook would output a whole bunch of rows. But if we wanted to just take a gander at the first couple of rows in the dataframe, say rows 0 through 4 for example, we could write "data.head()", and hitting Shift+Enter, what we see instead are rows 0 through 4. This gives us an idea of the kind of values contained in our rows and columns without having to look at an enormous amount of data. So let me add a little comment here that says "The top rows look like this". Now, it follows that "data.tail()" will show us the rows at the bottom of the dataframe, right? "Rows at bottom of dataframe look like this". Scrolling down, we can see that rows 501 through 505 have this kind of data inside them. I personally really like these two methods for looking at the top part and the bottom part of the data, just to get an idea of what we're working with.
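A quick sketch of head() and tail() on a small made-up frame; both default to five rows, and both also accept an explicit row count:

```python
import pandas as pd

# A toy frame with 10 rows, so head() and tail() show different slices
data = pd.DataFrame({"RM": range(10), "PRICE": range(10, 20)})

print(data.head())   # rows 0 through 4 (first 5 rows by default)
print(data.tail())   # last 5 rows
print(data.head(2))  # an explicit count works too: just rows 0 and 1
```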
Now, if we wanted to figure out how many rows our dataframe has, or rather how many entries there are in each column, there's a handy little method called "count". So "data.count()" will show us the number of rows, for each column. Check it out. In the output below you see the number of entries per column: each column has 506 entries. Now, coming back to this topic of language, lingo and jargon: you'll often hear the number of data points, or rows, referred to as the number of instances. So here we've got 506 instances. This is how the word instance is used in the context of machine learning, and the important thing to note here is that this word means something completely different to a programmer. To a Python programmer, an instance is an object. In other words, our data object right here is an instance of a dataframe. A dataframe would be the general category, and a particular dataframe, namely the one we've stored inside our variable here, would then be referred to as an instance. So yes, the word instance again has a different meaning in machine learning and in programming. Again, this is just something to be aware of.
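A small sketch of count() on a toy frame. One detail worth knowing: count() tallies non-null entries per column, so if a value is missing, that column's count comes up short, which also quietly previews the missing-data check coming next:

```python
import numpy as np
import pandas as pd

# A toy frame with one deliberately missing value in PRICE
data = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                     "PRICE": [24.0, np.nan, 34.7]})

# count() returns one number per column: the non-null entries in that column
print(data.count())  # RM: 3, PRICE: 2 (the NaN is not counted)
```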
Moving on to our next topic, let's add a few more cells here and make the first one a markdown cell, where we're going to add a subheading called "Cleaning data - check for missing values". So when you're doing your data exploration, oftentimes you're going to look for problems in your dataset, and dealing with missing data is definitely a kind of problem that you have to address, because I guarantee you that your machine learning algorithm is going to get really confused and give you really terrible answers if you haven't addressed this ahead of time and aren't feeding clean data to your algorithm. You might remember how we addressed missing values when we were analyzing our movie revenues. The problem we're confronted with at the moment is: how do we find the missing values, and how do we find them quickly? Pandas actually has a function called "isnull", and this function will return a table showing whether any of the values are missing. Now let me show you how to use it. Since this function comes from the pandas module, we're going to access it through "pd.isnull()", and as an argument we have to pass in the data that we want the function to check.
We've stored all of this inside our data dataframe, and hitting Shift+Enter on this will now return a whole table where each entry is either False, meaning no missing value, or True, which means a missing value. You can see this is a huge table; the Jupyter notebook is not even showing us the entire thing here. So the question is: how would we know if there are any missing values in this entire table of 506 entries? And the answer is: we can chain another method call onto this one. We've got our table back, and it's a table of True and False entries. If we chain a method called "any()" and hit Shift+Enter, then pandas checks all the columns and tells us if there are any missing values in any of them. Now, I don't know if you've heard the word null before, but null does not mean the value 0. If a variable is equal to null, it contains nothing, which is very, very different from the variable having the value 0. And the Internet has decided that the best way to summarize this is as a picture of a problem you wouldn't wish upon anyone in a public restroom: on the left, we have the dispenser containing the value 0, and on the right we have the dispenser containing the value null.
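The full check chains the two calls together. A sketch on a toy frame with one deliberately missing value, so there is actually something for the check to find:

```python
import numpy as np
import pandas as pd

# A toy frame with one missing value (NaN) in the PRICE column
data = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                     "PRICE": [24.0, np.nan, 34.7]})

print(pd.isnull(data))        # full table of True/False; True marks a missing value
print(pd.isnull(data).any())  # one answer per column: RM False, PRICE True
```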
This isnull() function chained with the any() method is super handy for figuring out if there are any missing values in your dataset, because if there is a missing value in one of the columns, then instead of the word False being printed here you will see the word True, and then you might have to dig into the data and fix the problem. Now, let me show you an alternative way of doing this check, because this first approach, isnull() and any(), belongs to the pandas module. The alternative I'm going to show you next belongs to the dataframe instead. Typing "data.info()", so using our data object and calling the info method on it, will show us not only whether there are any null values, but also a whole bunch of other information, including the number of entries or rows, the number of columns, the names of the columns, whether any of the columns have a null value, and also the type of object that each column contains. In our case, all of the columns contain float64 type objects. Now, if you're new to programming this will look super jargony, so let me explain what it means. A float is programming jargon for a floating point number. What's a floating point number? Nothing special; it just refers to a decimal number. A floating point number has a decimal point, but something like an integer does not.
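A sketch of info() on a toy frame; alongside it, the dtypes attribute is a quick way to see just the per-column types that info() reports at the bottom of its output:

```python
import pandas as pd

# A toy frame with only decimal values, so both columns come out as float64
data = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                     "PRICE": [24.0, 21.6, 34.7]})

# info() prints row count, column names, non-null counts and dtypes in one go
data.info()

# dtypes gives just the per-column types
print(data.dtypes)  # both columns are float64
```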
In other words, Python has different categories for different types of numbers. The number 64 at the end of the word float shows us that the category we're working with here is a large, precise decimal number. In this case, we've got a 64-bit floating point number that takes up 64 bits of memory, and that is in contrast to less precise numbers like a float32 or a float16. A float64 number holds roughly twice as many significant digits and takes up twice as much memory as a float32. So that's what that means. In any case, the good news is that we have no missing values, which is great. One less thing to worry about. So now we can start to explore the features contained in the dataset. It's time to demystify these mysterious-sounding columns like RM, NOX and DIS. Can't wait to see you in the next lesson.