Okay, so in this lesson I want to talk about the topic of tokenizing. Tokenizing just means splitting up a sentence into its individual words.

The good thing for us is that we don't have to do this manually; we can get the NLTK module to do it for us. However, the NLTK module needs a certain component to be downloaded onto our local machine, and we can grab this component with "nltk.download('punkt')".

Before I hit Shift+Enter on this, let me show you where this will go. On both Mac and PC it will go under your username. So when I hit Shift+Enter on this cell, I'll be downloading the package to Users, then my username, and then it creates a folder called "nltk_data". This folder will have been created right here, and it now includes a subfolder called "tokenizers" with a "punkt" subfolder that will help us tokenize, or split up, a string into its individual words. The very first time you run this, it will download a zip file and unzip it for you. If you rerun this cell, it won't actually do anything; it already knows that you've got the package and says "Package is already up-to-date".

With this component downloaded, all we need to do to split up our example sentence "All work and no play makes Jack a dull boy." is to copy this line of code, paste it down here and then run a method on it called "word_tokenize", which only needs our message as an input. Let's print that. Hitting Shift+Enter, it splits up all the individual words in this string. Now, if you're wondering where word_tokenize came from, it's from up here: we've imported this functionality from nltk.tokenize.

All right, now suppose we want to do two things: convert this to lowercase as well as tokenize the words. In that case, all we need to do is use our message string, call the lower() method on it and nest this bit of code between the parentheses of word_tokenize. That'll get the job done. So as you can see, tokenizing isn't very complicated.
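Pieced together, the code from this part of the lesson looks roughly like this; a minimal sketch assuming NLTK is installed, with "msg" standing in for whatever the example message variable is called:

    import nltk
    from nltk.tokenize import word_tokenize

    # One-time download of the Punkt tokenizer models
    # (lands under nltk_data/tokenizers/punkt)
    nltk.download('punkt')

    msg = 'All work and no play makes Jack a dull boy.'

    # Split the raw message into individual words (and punctuation tokens)
    print(word_tokenize(msg))

    # Lowercase first, then tokenize, by nesting lower() inside word_tokenize()
    print(word_tokenize(msg.lower()))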
Let's move on to the next pre-processing step, namely removing stop words. Now, what do I mean by stop words? "Stop words" is a piece of jargon that refers to the most common words in a language. I'm talking about words like "the", "I", "of", "a", "at", "which", "on" and so on. These words are very, very important for grammar, but they don't convey a lot of meaning on their own. In particular, these words aren't going to be very helpful for differentiating between spam and non-spam messages.

So what we're going to do is exclude these stop words from our message text, and the reason is that we're going to be looking at words in isolation. Remember, this is the naive Bayes classifier: it won't look at a sentence as a whole, and it won't look at a phrase. The naive Bayes model and the bag-of-words approach look at words individually, so if you have a phrase like "Flights to London", it will look at the word "Flights", the word "to" and the word "London". The meaning is actually lost on the algorithm, and this is why we filter out stop words like the word "to". So we're just piping the word "Flights" and the word "London" into our algorithm.

Now, this is a bit of a contrast to how modern search engines work. They're not quite as naive as our spam classifier, and they don't actually filter out these stop words. This is why, when you search for the phrase "Let it be", it will bring up the Beatles song and you'll actually get meaningful results. The same is true if you search for "take that" (again, more stop words) or "to be or not to be", which is a sentence comprised entirely of stop words.

So let's insert a cell here, where we're downloading our resources, and actually download these stop words, seeing as I've been talking a lot about them: "nltk.download('stopwords')", all lowercase and in one word. If I hit Shift+Enter here, I can see that a new folder has appeared under nltk_data. The folder is called "corpora" and it has a subfolder called "stopwords". It's actually got the stop words for quite a few different languages, and I can take a look at which stop words are included in the English section. Here is the list. You'll notice that they're all lowercase and there are about 179 of them, starting with "i", "me", "my", "myself", "we", "our", "ours" and so on.
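As a sketch, that download and a quick look at the English list might go like this (the exact word count varies a little between NLTK versions):

    import nltk
    from nltk.corpus import stopwords

    # One-time download of the stop word lists
    # (lands under nltk_data/corpora/stopwords)
    nltk.download('stopwords')

    english_stop_words = stopwords.words('english')
    print(len(english_stop_words))  # about 179 in recent NLTK versions
    print(english_stop_words[:7])   # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours']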
Okay, so we've downloaded the resource, and you're probably wondering how we exclude the stop words "and" and "a" from this string. Well, you can think of this string as a collection of words, and you can think of our list of stop words as just another collection. So the question then becomes: how do you efficiently check whether a particular value is contained in a collection? This is actually a very generic programming problem that you'll face in many, many situations.

Now, one way you can do this is to create a list of all the words and then check them one by one, looking for a match. However, there is another way: Python has a fantastic data structure that is very, very well suited to this particular type of problem, and that is the Python set.

A set is an unordered collection, so you couldn't say something like "give me the item at position 1" or "give me the item at position 3"; that's what you would do with an array or a list. With a set, there is no order to the items. Also, every single item in a set occurs only once, and this makes a set super handy for checking membership or looking for differences. And what you'll find is that the larger your set, or the more data you're working with, the more you'll notice the advantage of this data structure, because as soon as you have to iterate through an enormous array or an enormous list and check one by one ("Is there a match? Is there a match? Is there a match?"), the computation will start to take quite a long time.

Here's how you can visualize the words in our example sentence as a set. I'll draw one big circle, and in that circle we'll have the words "work", "play", "makes", "Jack", "and", "a" and some of the others. Similarly, our stop words can comprise another set, and here we would have words like "we", "I", "until", "you" and "while", and also the words "and" and "a". The words "and" and "a" are members of both sets; they sit at the intersection of the two circles. So if you're a visual person like myself, this is how you can think about the set data structure.
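To make that two-circles picture concrete, here is a small sketch with two hand-picked sets (the variable names are mine):

    # A few words from the example sentence, and a few stop words
    sentence_words = {'work', 'play', 'makes', 'jack', 'and', 'a', 'dull', 'boy'}
    some_stop_words = {'we', 'i', 'until', 'you', 'while', 'and', 'a'}

    # Duplicates collapse: each item occurs only once in a set
    print({'spam', 'spam', 'ham'})           # {'spam', 'ham'} (order may vary)

    # Membership checks are fast, even for very large sets
    print('work' in sentence_words)          # True

    # The overlap of the two circles: words that belong to both sets
    print(sentence_words & some_stop_words)  # {'and', 'a'}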
Come to think of it, we've actually covered quite a few different types of collections so far, right? We've covered lists, dictionaries, tuples, arrays, and now we've also covered sets. These are probably the most important data structures that you'll find in Python. The lists, dictionaries, tuples and sets are built-in Python data structures; the arrays, on the other hand, came from numpy, just the same way that pandas gave us the dataframe. But I think that's enough theory. Let's write some code and put our sets into action.

Let's access the stop words first: "stopwords.words('english')" will give us our stop words. Let me hit Shift+Enter and see what we get here. These are the stop words in that file. At the moment, if we check the type of this particular object, we get a list. So if we use this bit of Python code, we will be working with a list. If we wanted to work with a set instead, we would use Python's built-in set() and feed our list of stop words in between the parentheses. Here's what it would look like. Similar to a dictionary, the Python set has this curly-bracket notation, but in this case all the values inside are simply separated by commas. With a dictionary, you always have a key and value pair with a colon in between, but with a set you have the curly brackets and then the values separated by commas.

So what I'm going to do now is create a variable called "stop_words" and store our set inside this variable. And if you're in doubt that we are indeed working with a set now, we can check the type of our stop_words variable. There you go, this should be the proof.

Now let's use our set to check for membership. Here's how you can do it: if the string "this" is in stop_words, then print "Found it". Here we're using the Python "in" keyword to check whether a particular string is contained inside our set. Let's see what we get. Yeah, so the word "this" is contained inside our stop_words. What about the word "hello"? A common word, right? Let's see if that's contained. Nope, not among them; the print statement does not execute. So the word "this" was found, but the word "hello" was not.

Now, as a challenge, can you print out "Nope. Not in here." if the word "hello" is not contained in stop_words? Did you give it a quick go? The solution is very simple: I just copy-paste my previous line of code, substitute "hello" for the word "this", add the word "not" (which is a Python keyword) to modify the condition in our if statement, and then lastly change the text in the print statement to "Nope. Not in here.".
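Put together, those membership checks look something like this; a minimal sketch assuming the stopwords resource from the earlier cell has been downloaded:

    from nltk.corpus import stopwords

    # Turn the list of English stop words into a set
    stop_words = set(stopwords.words('english'))
    print(type(stop_words))          # <class 'set'>

    if 'this' in stop_words:
        print('Found it')            # runs: 'this' is a stop word

    if 'hello' not in stop_words:
        print('Nope. Not in here.')  # runs: 'hello' is not a stop word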
So we now already have two patterns: one using the "in" keyword and a set to check for membership, and the other using "not in" to check whether something is not contained in the set. Both of these techniques are very, very handy, and we're going to make use of them later on in our code.

But before we tackle all of our 5,800 email messages, let's tackle an example sentence first. Let's write some Python code that tokenizes, converts to lowercase and removes stop words from an example sentence. I'm quickly going to copy this line of code here, paste it in and just add something at the end, maybe "To be or not to be", just a bunch of stop words. Now, we said we'd tokenize, lowercase and remove the stop words from this string. We can lowercase with msg.lower(), we can tokenize by providing msg.lower() as an argument to word_tokenize(), and we can store this list of words in, say, a variable called "words".

Okay, fine. So far so good. To filter out all the stop words, I'm going to use a loop, and I'm also going to use an empty list to hold on to the results. So I'll say "filtered_words = []", and then I want to write my loop. But I think, given what we've covered so far, you can probably write this yourself already. So as a challenge, can you write a loop that appends all the non-stop words to this currently empty list of filtered_words? I'll give you a few seconds to pause the video and give this a go.

Did you figure it out? Here's the solution. I'm going to use a for loop: "for word in words:", so going through all the words in this list one by one; if the word is not in stop_words, then take the list of filtered_words and append the word. And just so we can take a look at our work, I'm going to print out our list of filtered words. Let's see what we have.
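Here is the whole example as one sketch, combining the lowercasing, the tokenizing and the filtering loop (assuming the punkt and stopwords resources are already downloaded, and with the appended text ending in a full stop):

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    msg = 'All work and no play makes Jack a dull boy. To be or not to be.'
    stop_words = set(stopwords.words('english'))

    # Lowercase and tokenize in one step
    words = word_tokenize(msg.lower())

    # Keep only the words that are not stop words
    filtered_words = []
    for word in words:
        if word not in stop_words:
            filtered_words.append(word)

    print(filtered_words)
    # ['work', 'play', 'makes', 'jack', 'dull', 'boy', '.', '.']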
Okay, so this is interesting, right? The word "all" disappears. The word "work" is included. The word "and" is not, and the word "no" is also excluded. "play" remains, "makes" remains, and "jack" remains but is lowercase, because we converted it to lowercase here. "a" is excluded, while "dull" and "boy" are included. And then we have the punctuation: the full stop here is included, all of the words in "to be or not to be" are excluded because they're all stop words, and the full stop at the end is included as well. So our list of filtered words contains all the non-stop words plus the punctuation.

All right, so we've covered quite a few of the pre-processing steps already: we've covered how to convert to lowercase, tokenizing, and removing stop words. We still have to learn how to strip out HTML tags, do the word stemming and remove the punctuation. The word stemming and the punctuation removal are what I'm going to cover next, so I'll see you in the next lesson.