Okay, so in this lesson I want to talk about the topic of tokenizing. Tokenizing just means splitting up a sentence into its individual words.

The good thing for us is that we don't have to do this manually; we can get the NLTK module to do it for us. However, the NLTK module needs a certain component to be downloaded onto our local machine, and we can grab this component with "nltk.download('punkt')".

Before I hit Shift+Enter on this, let me show you where this will go. On both Mac and PC it will go under your username. So when I hit Shift+Enter on this cell, I'll be downloading the package to Users, then my username, and then it creates a folder called "nltk_data". This folder will have been created right here, and it now includes a subfolder called "tokenizers" with a "punkt" subfolder that will help us tokenize, or split up, a string into its individual words. The very first time you run this, it will download a zip file and unzip it for you. If you rerun this cell, it won't actually do anything; it already knows that you've got the package and says "Package is already up-to-date".

With this component downloaded, all we need to do to split up our example sentence "All work and no play makes Jack a dull boy." is to copy this line of code, paste it down here and then run a method on it called "word_tokenize", which only needs our message as an input. Let's print that. Hitting Shift+Enter, it splits up all the individual words in this string. Now, if you're wondering where word_tokenize came from, it's from up here: we've imported this functionality from nltk.tokenize.

All right, now suppose we want to do two things: convert this to lowercase as well as tokenize the words. In that case, all we need to do is use our message string, call the lower() method on it and nest this bit of code between the parentheses of word_tokenize. That'll get the job done. So as you can see, tokenizing isn't very complicated.
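Pieced together, the code from this part of the lesson looks roughly like this; a minimal sketch assuming NLTK is installed, with "msg" standing in for whatever the example message variable is called:

    import nltk
    from nltk.tokenize import word_tokenize

    # One-time download of the Punkt tokenizer models
    # (lands under nltk_data/tokenizers/punkt)
    nltk.download('punkt')

    msg = 'All work and no play makes Jack a dull boy.'

    # Split the raw message into individual words (and punctuation tokens)
    print(word_tokenize(msg))

    # Lowercase first, then tokenize, by nesting lower() inside word_tokenize()
    print(word_tokenize(msg.lower()))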
Let's move on to the next pre-processing step, namely removing stop words. Now, what do I mean by stop words? "Stop words" is a piece of jargon that refers to the most common words in a language. I'm talking about words like "the", "I", "of", "a", "at", "which", "on" and so on. These words are very, very important for grammar, but they don't convey a lot of meaning on their own. In particular, these words aren't going to be very helpful for differentiating between spam and non-spam messages.

So what we're going to do is exclude these stop words from our message text, and the reason is that we're going to be looking at words in isolation. Remember, this is the naive Bayes classifier: it won't look at a sentence as a whole, and it won't look at a phrase. The naive Bayes model and the bag-of-words approach look at words individually, so if you have a phrase like "Flights to London", it will look at the word "Flights", the word "to" and the word "London". The meaning is actually lost on the algorithm, and this is why we filter out stop words like the word "to". So we're just piping the word "Flights" and the word "London" into our algorithm.

Now, this is a bit of a contrast to how modern search engines work. They're not quite as naive as our spam classifier, and they don't actually filter out these stop words. This is why, when you search for the phrase "Let it be", it will bring up the Beatles song and you'll actually get meaningful results. The same is true if you search for "take that" (again, more stop words) or "to be or not to be", which is a sentence comprised entirely of stop words.

So let's insert a cell here, where we're downloading our resources, and actually download these stop words, seeing as I've been talking a lot about them: "nltk.download('stopwords')", all lowercase and in one word. If I hit Shift+Enter here, I can see that a new folder has appeared under nltk_data. The folder is called "corpora" and it has a subfolder called "stopwords". It's actually got the stop words for quite a few different languages, and I can take a look at which stop words are included in the English section. Here is the list. You'll notice that they're all lowercase and there are about 179 of them, starting with "i", "me", "my", "myself", "we", "our", "ours" and so on.
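As a sketch, that download and a quick look at the English list might go like this (the exact word count varies a little between NLTK versions):

    import nltk
    from nltk.corpus import stopwords

    # One-time download of the stop word lists
    # (lands under nltk_data/corpora/stopwords)
    nltk.download('stopwords')

    english_stop_words = stopwords.words('english')
    print(len(english_stop_words))  # about 179 in recent NLTK versions
    print(english_stop_words[:7])   # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours']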
Okay, so we've downloaded the resource, and you're probably wondering how we exclude the stop words "and" and "a" from this string. Well, you can think of this string as a collection of words, and you can think of our list of stop words as just another collection. So the question then becomes: how do you efficiently check whether a particular value is contained in a collection? This is actually a very generic programming problem that you'll face in many, many situations.

Now, one way you can do this is to create a list of all the words and then check them one by one, looking for a match. However, there is another way: Python has a fantastic data structure that is very, very well suited to this particular type of problem, and that is the Python set.

A set is an unordered collection, so you couldn't say something like "give me the item at position 1" or "give me the item at position 3"; that's what you would do with an array or a list. With a set, there is no order to the items. Also, every single item in a set occurs only once, and this makes a set super handy for checking membership or looking for differences. And what you'll find is that the larger your set, or the more data you're working with, the more you'll notice the advantage of this data structure, because as soon as you have to iterate through an enormous array or an enormous list and check one by one ("Is there a match? Is there a match? Is there a match?"), the computation will start to take quite a long time.

Here's how you can visualize the words in our example sentence as a set. I'll draw one big circle, and in that circle we'll have the words "work", "play", "makes", "Jack", "and", "a" and some of the others. Similarly, our stop words can comprise another set, and here we would have words like "we", "I", "until", "you" and "while", and also the words "and" and "a". The words "and" and "a" are members of both sets; they sit at the intersection of the two circles. So if you're a visual person like myself, this is how you can think about the set data structure.
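To make that two-circles picture concrete, here is a small sketch with two hand-picked sets (the variable names are mine):

    # A few words from the example sentence, and a few stop words
    sentence_words = {'work', 'play', 'makes', 'jack', 'and', 'a', 'dull', 'boy'}
    some_stop_words = {'we', 'i', 'until', 'you', 'while', 'and', 'a'}

    # Duplicates collapse: each item occurs only once in a set
    print({'spam', 'spam', 'ham'})           # {'spam', 'ham'} (order may vary)

    # Membership checks are fast, even for very large sets
    print('work' in sentence_words)          # True

    # The overlap of the two circles: words that belong to both sets
    print(sentence_words & some_stop_words)  # {'and', 'a'}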
Come to think of it, we've actually covered quite a few different types of collections so far, right? We've covered lists, dictionaries, tuples, arrays, and now we've also covered sets. These are probably the most important data structures that you'll find in Python. The lists, dictionaries, tuples and sets are built-in Python data structures; the arrays, on the other hand, came from numpy, just the same way that pandas gave us the dataframe. But I think that's enough theory. Let's write some code and put our sets into action.

Let's access the stop words first: "stopwords.words('english')" will give us our stop words. Let me hit Shift+Enter and see what we get here. These are the stop words in that file. At the moment, if we check the type of this particular object, we get a list. So if we use this bit of Python code, we will be working with a list. If we wanted to work with a set instead, we would use Python's built-in set() and feed our list of stop words in between the parentheses. Here's what it would look like. Similar to a dictionary, the Python set has this curly-bracket notation, but in this case all the values inside are simply separated by commas. With a dictionary, you always have a key and value pair with a colon in between, but with a set you have the curly brackets and then the values separated by commas.

So what I'm going to do now is create a variable called "stop_words" and store our set inside this variable. And if you're in doubt that we are indeed working with a set now, we can check the type of our stop_words variable. There you go, this should be the proof.

Now let's use our set to check for membership. Here's how you can do it: if the string "this" is in stop_words, then print "Found it". Here we're using the Python "in" keyword to check whether a particular string is contained inside our set. Let's see what we get. Yeah, so the word "this" is contained inside our stop_words. What about the word "hello"? A common word, right? Let's see if that's contained. Nope, not among them; the print statement does not execute. So the word "this" was found, but the word "hello" was not.

Now, as a challenge, can you print out "Nope. Not in here." if the word "hello" is not contained in stop_words? Did you give it a quick go? The solution is very simple: I just copy-paste my previous line of code, substitute "hello" for the word "this", add the word "not" (which is a Python keyword) to modify the condition in our if statement, and then lastly change the text in the print statement to "Nope. Not in here.".
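Put together, those membership checks look something like this; a minimal sketch assuming the stopwords resource from the earlier cell has been downloaded:

    from nltk.corpus import stopwords

    # Turn the list of English stop words into a set
    stop_words = set(stopwords.words('english'))
    print(type(stop_words))          # <class 'set'>

    if 'this' in stop_words:
        print('Found it')            # runs: 'this' is a stop word

    if 'hello' not in stop_words:
        print('Nope. Not in here.')  # runs: 'hello' is not a stop word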
So we now already have two patterns: one using the "in" keyword and a set to check for membership, and the other using "not in" to check whether something is not contained in the set. Both of these techniques are very, very handy, and we're going to make use of them later on in our code.

But before we tackle all of our 5,800 email messages, let's tackle an example sentence first. Let's write some Python code that tokenizes, converts to lowercase and removes stop words from an example sentence. I'm quickly going to copy this line of code here, paste it in and just add something at the end, maybe "To be or not to be", just a bunch of stop words. Now, we said we'd tokenize, lowercase and remove the stop words from this string. We can lowercase with msg.lower(), we can tokenize by providing msg.lower() as an argument to word_tokenize(), and we can store this list of words in, say, a variable called "words".

Okay, fine. So far so good. To filter out all the stop words, I'm going to use a loop, and I'm also going to use an empty list to hold on to the results. So I'll say "filtered_words = []", and then I want to write my loop. But I think, given what we've covered so far, you can probably write this yourself already. So as a challenge, can you write a loop that appends all the non-stop words to this currently empty list of filtered_words? I'll give you a few seconds to pause the video and give this a go.

Did you figure it out? Here's the solution. I'm going to use a for loop: "for word in words:", so going through all the words in this list one by one; if the word is not in stop_words, then take the list of filtered_words and append the word. And just so we can take a look at our work, I'm going to print out our list of filtered words. Let's see what we have.
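Here is the whole example as one sketch, combining the lowercasing, the tokenizing and the filtering loop (assuming the punkt and stopwords resources are already downloaded, and with the appended text ending in a full stop):

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    msg = 'All work and no play makes Jack a dull boy. To be or not to be.'
    stop_words = set(stopwords.words('english'))

    # Lowercase and tokenize in one step
    words = word_tokenize(msg.lower())

    # Keep only the words that are not stop words
    filtered_words = []
    for word in words:
        if word not in stop_words:
            filtered_words.append(word)

    print(filtered_words)
    # ['work', 'play', 'makes', 'jack', 'dull', 'boy', '.', '.']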
Okay, so this is interesting, right? The word "all" disappears. The word "work" is included. The word "and" is not, and the word "no" is also excluded. "play" remains, "makes" remains, and "jack" remains but is lowercase, because we converted it to lowercase here. "a" is excluded, while "dull" and "boy" are included. And then we have the punctuation: the full stop here is included, all of the words in "to be or not to be" are excluded because they're all stop words, and the full stop at the end is included as well. So our list of filtered words contains all the non-stop words plus the punctuation.

All right, so we've covered quite a few of the pre-processing steps already: we've covered how to convert to lowercase, tokenizing, and removing stop words. We still have to learn how to strip out HTML tags, do the word stemming and remove the punctuation. The word stemming and the punctuation removal are what I'm going to cover next, so I'll see you in the next lesson.