0
1
00:00:00,390 --> 00:00:05,700
In this lesson I want to talk to you a little bit more about cleaning data.
1

2
00:00:05,700 --> 00:00:09,770
We've already gathered all our data and we've started on this process.
2

3
00:00:10,050 --> 00:00:17,640
In particular what we've done so far is we've extracted the relevant data, namely the e-mail bodies from
3

4
00:00:17,970 --> 00:00:25,770
all our text files and we've also converted all the text files into a pandas dataframe.
4

5
00:00:25,800 --> 00:00:32,190
In other words, we've taken over 5000 different files, we've extracted the actual pieces of data that
5

6
00:00:32,190 --> 00:00:40,020
we need from it and we've converted all of that into a format that we can work with in our Python code.
6

7
00:00:40,020 --> 00:00:46,500
Now it's time for the next steps, namely checking for empty emails, checking for missing values or no
7

8
00:00:46,500 --> 00:00:52,470
values in our data that we've imported into our Python notebook.
8

9
00:00:52,500 --> 00:00:55,890
So without further ado, let me show you how to do that
9

10
00:00:55,890 --> 00:01:05,670
in Jupyter notebook. I'm going to add another markdown cell right here that reads "Data Cleaning: Checking
10

11
00:01:05,670 --> 00:01:10,850
for Missing Values".
11

12
00:01:10,960 --> 00:01:13,770
Of course it would help if I could spell a bit better.
12

13
00:01:13,870 --> 00:01:15,820
But there we go.
13

14
00:01:16,240 --> 00:01:21,890
The handy thing is that we've put all our data into a pandas dataframe.
14

15
00:01:22,060 --> 00:01:32,680
The first thing I want to do is check if any message bodies are equal to null. By null I mean that the value
15

16
00:01:32,680 --> 00:01:33,950
is missing.
16

17
00:01:34,240 --> 00:01:41,830
The way I can access a particular column in my dataframe, namely the messages column, is by writing
17

18
00:01:41,830 --> 00:01:45,040
"data.MESSAGE".
18

19
00:01:45,040 --> 00:01:52,660
This will access my message column which looks like so. The alternative syntax that you'll see to accomplish
19

20
00:01:52,660 --> 00:02:00,940
the exact same thing is with the square brackets and then the actual column name between single quotes.
20

21
00:02:00,940 --> 00:02:05,750
So that reads "MESSAGE". Voila!
21

22
00:02:05,930 --> 00:02:06,870
Great.
22

23
00:02:06,880 --> 00:02:12,980
Now there are 5800 rows in this column.
23

24
00:02:12,980 --> 00:02:22,620
So if I wanted to check if any of them has a missing value, then I can use the method "isnull". What we
24

25
00:02:22,620 --> 00:02:34,010
get in this case is a value True or False based on whether or not any particular value was null or not.
25

26
00:02:34,020 --> 00:02:42,450
Now this is a very, very handy method, but it doesn't help me check if any of these values is equal to
26

27
00:02:42,450 --> 00:02:43,430
null.
27

28
00:02:43,540 --> 00:02:47,000
I mean part of the reason is is that, yes, I can scroll through here,
28

29
00:02:47,490 --> 00:02:56,240
but Jupyter will actually only display so many rows. It will hide the middle part of my dataframe.
29

30
00:02:56,400 --> 00:03:00,990
So what I'm going to do is I'm going to chain some more methods.
30

31
00:03:00,990 --> 00:03:04,860
The first one is this values attribute.
31

32
00:03:04,860 --> 00:03:12,600
And what this will do is instead of giving me the index name and the value here, it will just give me
32

33
00:03:13,410 --> 00:03:15,060
numpy array.
33

34
00:03:15,210 --> 00:03:18,760
It will just give me the values without the index.
34

35
00:03:18,810 --> 00:03:24,870
The nice thing about this is that if I wanted to check if any of the 5800 rows
35

36
00:03:25,200 --> 00:03:33,740
has a missing value, I can use the "any" method and chain that onto here and get my result.
36

37
00:03:33,780 --> 00:03:36,980
So the good news for us is that the result is False.
37

38
00:03:37,020 --> 00:03:46,380
We have no missing values in our message column. Now, missing values are actually not the same thing as
38

39
00:03:46,560 --> 00:03:57,010
empty emails, right? An empty email would be something like this, an empty string. If I check what this
39

40
00:03:57,010 --> 00:04:02,560
is I will actually see that is of type string.
40

41
00:04:02,590 --> 00:04:11,840
This would simply be an empty string, a string of length zero. So what would be an example of a missing
41

42
00:04:11,840 --> 00:04:21,140
value? Say I had a variable called "my_var" and we don't want to put any value inside this variable, we
42

43
00:04:21,140 --> 00:04:24,780
simply want it to be empty or contain nothing.
43

44
00:04:24,950 --> 00:04:28,010
How would we accomplish this in Python?
44

45
00:04:28,010 --> 00:04:33,650
Now you might think that given that we've used isnull as a method on our dataframe, you could simply
45

46
00:04:33,650 --> 00:04:39,800
write something like "myvar = null" and create your empty variable.
46

47
00:04:39,800 --> 00:04:45,770
However, even though this null keyword is commonly used in other programming languages like say Java,
47

48
00:04:46,340 --> 00:04:51,400
in Python there is another keyword that has the same meaning
48

49
00:04:51,500 --> 00:04:53,450
and this is the keyword None.
49

50
00:04:53,720 --> 00:04:58,250
And you can see that Jupyter notebook is instantly picking up the fact that this is a keyword with the
50

51
00:04:58,250 --> 00:05:00,500
syntax highlighting.
51

52
00:05:00,530 --> 00:05:05,750
So in this case, if I was to check the type of my variable,
52

53
00:05:08,560 --> 00:05:12,100
I would see that it is indeed a NoneType,
53

54
00:05:12,100 --> 00:05:14,950
my_var is truly empty.
54

55
00:05:14,950 --> 00:05:17,480
It truly contains nothing.
55

56
00:05:17,620 --> 00:05:18,830
Now conceptually,
56

57
00:05:18,970 --> 00:05:26,170
the way I like to remember this difference between zero or an empty string with a null value is with
57

58
00:05:26,170 --> 00:05:27,400
this picture.
58

59
00:05:27,400 --> 00:05:31,900
The Internet has truly blessed us with lots of terrible programming jokes and I'm going to do my
59

60
00:05:31,900 --> 00:05:33,640
best to share them with you.
60

61
00:05:33,820 --> 00:05:39,520
But in all seriousness, this is actually quite a helpful way about thinking about null values or None
61

62
00:05:39,520 --> 00:05:41,470
values as they're known in Python.
62

63
00:05:42,490 --> 00:05:48,730
Okay, so even though that we've established that there's no entry in our message column with the value
63

64
00:05:48,790 --> 00:05:57,740
null or None, it doesn't mean that we don't have empty emails that could be in our dataset.
64

65
00:05:57,790 --> 00:06:04,030
Okay, so let me show you how we would write a check for discovering if there are any strings of length
65

66
00:06:04,180 --> 00:06:07,870
zero in this entire message column in our dataframe.
66

67
00:06:09,440 --> 00:06:21,440
I'll add a little comment here that reads "check if there are empty emails (string length zero)".
67

68
00:06:21,450 --> 00:06:27,420
The way I would go about doing this is grabbing our dataframe, then grabbing the message column, then
68

69
00:06:27,450 --> 00:06:35,150
converting that to a string and then grabbing the length of these strings.
69

70
00:06:35,150 --> 00:06:37,410
Let's take a look at what this looks like.
70

71
00:06:37,460 --> 00:06:44,210
Here we can see how many characters each email body has in our messages column.
71

72
00:06:46,320 --> 00:06:50,950
Now if we wanted to check if any of them are equal to zero,
72

73
00:06:51,240 --> 00:06:57,630
in other words if any of them look like this, we can simply use the double equal sign and then zero for
73

74
00:06:57,630 --> 00:06:59,170
the logical test.
74

75
00:06:59,290 --> 00:07:06,710
If I press Shift+Enter now we again get an entire column of True or False values.
75

76
00:07:06,890 --> 00:07:07,320
All right.
76

77
00:07:07,320 --> 00:07:13,380
We can't look through 5800 different messages, so the any method is our best
77

78
00:07:13,380 --> 00:07:20,370
friend again to see if there are any rows where the value is equal to True instead of False.
78

79
00:07:20,370 --> 00:07:21,700
Let's take a look.
79

80
00:07:21,810 --> 00:07:22,830
Huh.
80

81
00:07:22,980 --> 00:07:28,620
So there are some empty emails somewhere in our dataframe.
81

82
00:07:28,640 --> 00:07:34,470
What we don't know yet is how many of these rows have a string length of zero.
82

83
00:07:34,530 --> 00:07:36,320
So how would we check that?
83

84
00:07:36,550 --> 00:07:40,170
Well we could take our logical test here,
84

85
00:07:40,170 --> 00:07:45,300
remember this returned 5800 entries of True or False.
85

86
00:07:45,360 --> 00:07:54,120
Now the Boolean value of False is equated with the numerical value of 0 and the Boolean value of True
86

87
00:07:54,540 --> 00:07:57,010
equates to the value 1.
87

88
00:07:57,030 --> 00:08:04,200
So if we wanted to know how many rows have a string length of 0, all we would have to do is sum them up.
88

89
00:08:04,200 --> 00:08:11,940
This sums up all the zeros and any ones, any true values in the 5800 entries.
89

90
00:08:13,560 --> 00:08:14,270
Okay.
90

91
00:08:14,440 --> 00:08:17,320
So we're going deeper down the rabbit hole.
91

92
00:08:17,530 --> 00:08:24,970
We now know that among the 5800 entries there are 4 strings of length
92

93
00:08:24,970 --> 00:08:32,360
zero. Now, we have to find out where in our enormous dataframe these strings are.
93

94
00:08:32,530 --> 00:08:36,100
We have to find the rows that contain these empty strings.
94

95
00:08:36,100 --> 00:08:39,090
But before we do that I want to pose a challenge to you.
95

96
00:08:39,910 --> 00:08:45,790
Namely, I'd like you to pause the video and write the Python code that would check for the number of
96

97
00:08:45,850 --> 00:08:49,720
entries with null or None values.
97

98
00:08:49,720 --> 00:08:51,570
Did you solve it?
98

99
00:08:51,580 --> 00:08:52,860
Here's the solution.
99

100
00:08:52,900 --> 00:08:59,620
Now I know we don't have any null values in this dataframe, but that doesn't mean that your future projects
100

101
00:08:59,830 --> 00:09:02,630
won't be having this issue.
101

102
00:09:02,680 --> 00:09:14,020
So with "data.MESSAGE" we can pull up our message column and then with "isnull()" we can find out if
102

103
00:09:14,080 --> 00:09:16,410
there are any null values.
103

104
00:09:16,690 --> 00:09:23,980
And by summing up all the True and False values that we get out of this bit of code here, we can find
104

105
00:09:23,980 --> 00:09:27,230
out the number of null values.
105

106
00:09:27,310 --> 00:09:32,380
The answer of zero shouldn't be surprising, because we've already established that there are no null
106

107
00:09:32,380 --> 00:09:34,300
values in the message column.
107

108
00:09:35,200 --> 00:09:41,710
But I do think that this is quite an important piece of the data cleaning workflow, so it's good to have
108

109
00:09:41,710 --> 00:09:45,430
this down and practice this once or twice.
109

110
00:09:45,430 --> 00:09:49,480
Let me add some more empty cells here.
110

111
00:09:49,480 --> 00:10:01,140
Now I'll add a markdown cell and what I'll do here is I'll just write "Locate empty emails".
111

112
00:10:01,710 --> 00:10:06,580
Let's find out the row names of our empty emails.
112

113
00:10:06,810 --> 00:10:11,010
As you can probably guess this condition is gonna be key.
113

114
00:10:11,010 --> 00:10:17,760
This is what we're going to be using to get a pandas series of True or False values for where these
114

115
00:10:17,760 --> 00:10:19,100
rows are empty.
115

116
00:10:20,300 --> 00:10:30,110
Now if I wanted to pull out the index names for where these empty values are, then I could select based
116

117
00:10:30,260 --> 00:10:41,060
on this criteria where these values are True from our dataframe and then access the index attribute
117

118
00:10:41,660 --> 00:10:43,070
from the dataframe.
118

119
00:10:43,070 --> 00:10:52,120
So "data[]" and then put the logical condition in here and then have ".index" afterwards.
119

120
00:10:52,120 --> 00:10:54,140
And now we'll get the index names.
120

121
00:10:54,330 --> 00:10:58,400
The row names for where this condition is True.
121

122
00:10:58,500 --> 00:10:59,030
Check it out.
122

123
00:11:00,000 --> 00:11:03,500
OK, so this is interesting right?
123

124
00:11:03,510 --> 00:11:11,100
"cmds, cmds, cmds" and then ".DS_Store".
124

125
00:11:11,190 --> 00:11:13,910
These are the index names. Now,
125

126
00:11:13,910 --> 00:11:19,340
does anything strike you as odd with these index names?
126

127
00:11:19,340 --> 00:11:22,630
Let's see how they compare to all the other file names.
127

128
00:11:22,760 --> 00:11:30,260
So just scrolling through our dataset here, we can see that most of these emails actually have file
128

129
00:11:30,260 --> 00:11:40,190
names that looks something like this. The "cmds" and ".DS_Store" as a file name does seem to be a little bit
129

130
00:11:40,280 --> 00:11:41,590
odd.
130

131
00:11:41,600 --> 00:11:45,080
Let's see if we can find out what these files are.
131

132
00:11:45,140 --> 00:11:52,460
If I go to my spam_1 folder and I scroll all the way down then I indeed see one of these
132

133
00:11:52,460 --> 00:11:54,280
files here.
133

134
00:11:54,320 --> 00:11:59,970
This actually does not seem to be an email, right?
134

135
00:12:00,020 --> 00:12:07,620
Taking a look at this file here in my Atom text editor I can see that it looks like this.
135

136
00:12:07,720 --> 00:12:11,140
This is definitely not part of our dataset.
136

137
00:12:11,190 --> 00:12:19,720
This "cmds" file is actually a system file that you'll see both on Windows and Mac when you unzip the
137

138
00:12:19,780 --> 00:12:22,270
original email archive.
138

139
00:12:22,270 --> 00:12:27,530
This is not an email file that we should include in our dataframe.
139

140
00:12:27,580 --> 00:12:35,460
You'll also find this cmds file in the spam_2 folder and in the easy_ham_1
140

141
00:12:35,460 --> 00:12:39,330
folder. Now what about this ".DS_
141

142
00:12:39,430 --> 00:12:48,030
Store" file? Now, because the file name is preceded by a full stop, by a dot,
142

143
00:12:48,030 --> 00:12:52,000
this "DS_Store" file is actually a hidden file.
143

144
00:12:52,180 --> 00:13:00,460
This is a Mac OS specific file that stores the attributes of a particular folder. And this system file
144

145
00:13:00,490 --> 00:13:05,560
was created when I downloaded, unzipped and then rezipped the course materials.
145

146
00:13:05,620 --> 00:13:07,750
In other words, this is not an email.
146

147
00:13:07,750 --> 00:13:13,130
This is an artifact of me preparing the lesson resources for you.
147

148
00:13:13,390 --> 00:13:17,920
Now obviously we don't want any of these four files in our dataframe.
148

149
00:13:17,980 --> 00:13:20,110
We just want our email data.
149

150
00:13:20,110 --> 00:13:23,490
But why are these files in our dataframe in the first place
150

151
00:13:23,620 --> 00:13:24,870
if we don't want them there?
151

152
00:13:24,880 --> 00:13:27,100
What, what happened?
152

153
00:13:27,100 --> 00:13:30,310
Well let's take a look at our file reading function.
153

154
00:13:30,310 --> 00:13:39,130
Looking at the e-mail body generator code and this walk function from the os library, we can see that
154

155
00:13:39,340 --> 00:13:45,200
this function actually picks up all the files in a particular folder
155

156
00:13:45,430 --> 00:13:50,710
and this includes the system files that don't belong in the email corpus.
156

157
00:13:50,710 --> 00:13:53,790
That's how they got included in the dataframe.
157

158
00:13:53,800 --> 00:13:58,750
Now we know the row names are index names for undesirables in our dataframe.
158

159
00:13:58,750 --> 00:14:02,560
But what we don't know yet is where they are. Now,
159

160
00:14:02,590 --> 00:14:09,940
let me show you how to get the location for this ".DS_Store" entry.
160

161
00:14:09,940 --> 00:14:15,190
What we can do with our dataframe is get the index,
161

162
00:14:15,190 --> 00:14:17,140
so "index.get_
162

163
00:14:20,120 --> 00:14:31,220
loc", get_location, loc, parentheses and then single quotes, and then ".DS_
163

164
00:14:31,820 --> 00:14:40,970
Store" in the single quotes will give us the row number where this entry lives. So we can search the entire
164

165
00:14:40,970 --> 00:14:41,720
index,
165

166
00:14:41,840 --> 00:14:47,940
so all the names for a particular name which is our argument in this "get_location" method here.
166

167
00:14:48,800 --> 00:14:50,550
Let's take a look.
167

168
00:14:50,570 --> 00:14:58,700
So here we have it, 4609 is where this entry is in our dataframe. If I say
168

169
00:14:58,700 --> 00:15:09,630
"data[4609:
169

170
00:15:09,630 --> 00:15:17,870
4611]" then I can pull up this section of the dataframe and take a look at which rows precede
170

171
00:15:18,200 --> 00:15:27,190
and follow this ".DS_Store" entry and there we can see it just happily living between two
171

172
00:15:27,190 --> 00:15:35,410
non-spam emails. In the next lesson I'll show you how to remove these system files from our dataframe.
172

173
00:15:35,680 --> 00:15:41,200
I'll show you how to drop these entries and clean things up a little bit and much, much more. I'll see
173

174
00:15:41,200 --> 00:15:41,580
you there.