Okay, so in this lesson I want to talk about tokenizing. Tokenizing just means splitting up a sentence into its individual words. The good thing for us is that we don't have to do this manually. We can get the NLTK module to do this for us. However, the NLTK module needs a certain component to be downloaded onto our local machines and we can grab this component with "nltk.download('punkt')". Before I hit Shift+Enter on this, let me show you where this will go. On both Mac and PC it will go under your username. So when I hit Shift+Enter on this cell, the output says it's downloading the package to Users, then my username, and then it creates a folder called "nltk_data". This folder has been created right here and it now includes a sub folder called "tokenizers" with this "punkt" sub folder that will help us tokenize, or split up, a string into the individual words. The very first time you run this it will download a zip file and unzip it for you. If you rerun this cell, it will actually not do anything. It already knows that you've got the package and it says "Package is already up to date". With this component downloaded, all we need to do to split up our example sentence "All work and no play makes Jack a dull boy." is to copy this line of code, paste it down here and then call a function called "word_tokenize", and all it needs as an input is our message. Let's print that. Hitting Shift+Enter, it splits this string up into its individual words. Now if you're wondering where word_tokenize came from, it's from up here. We've imported this functionality from nltk.tokenize. All right, now suppose we want to do two things. Suppose we want to convert this to lower case as well as tokenize the words. In that case all we need to do is use our message string, call the lower() method on it and nest this bit of code between the parentheses of word_tokenize. That'll get the job done. So as you can see, tokenizing isn't very complicated.
Let's move on to the next pre-processing step, namely removing stop words. Now what do I mean by stop words? "Stop words" is a piece of jargon that refers to the most common words in a language. I'm talking about words like "the", "I", "of", "a", "at", "which", "on" etc. These words are very, very important for grammar, but they don't convey a lot of meaning on their own. In particular, these words aren't gonna be very helpful for differentiating between spam and non-spam messages. So what we're gonna do is exclude these stop words from our message text, and the reason is that we're going to be looking at words in isolation. Remember, this is the naive Bayes classifier. It will not look at a sentence as a whole, it won't look at a phrase. The naive Bayes model and the bag of words approach will look at words individually, so if you have a phrase like "Flights to London", then it will look at the word "Flights", the word "to" and the word "London". The meaning of the phrase is actually lost on the algorithm and this is why we filter out stop words like the word "to". So we're just piping the word "Flights" and the word "London" into our algorithm. Now this is a little bit of a contrast to how modern search engines work, right? They're not quite as naive as our spam classifier and they don't actually filter out these stop words. This is why when you're searching for the phrase "Let it be" it will bring up the Beatles song and you'll actually get meaningful results. The same is true if you search for "Take That", again more stop words, or "to be or not to be", which is a phrase made up entirely of stop words. So let's insert a cell here where we're downloading our resources and actually download these stop words, as I've been talking a lot about them. So "nltk.download('stopwords')", all lowercase and in one word. If I hit Shift+Enter here, I can see that a new folder has appeared under nltk_data.
The folder is called "corpora" and it has a sub folder called "stopwords". Now it's actually got the stop words for quite a few different languages and I can take a look at which stop words are included in the English section. Here is the list. You'll notice that they're all lowercase and there's about 179 of them, starting with "i", "me", "my", "myself", "we", "our", "ours" and so on. Okay, so we've downloaded the resource and you're probably wondering how we exclude the stop words "and" and "a" from this string. You can think of this string as a collection of words and you can think of our list of stop words as just another collection. So the question then becomes: How do you efficiently check if a particular value is contained in a collection? This is actually a very generic programming problem that you'll face in many, many situations. Now one way you can do this is to create a list of all the words and then check one by one and look for a match. However, there is another way to do this and Python has a fantastic data structure that is very, very well suited for this particular type of problem and that is the Python set. A set is an unordered collection, so you couldn't say something like "Give me the item at position number 1" or "Give me the item at position number 3". This is what you would do with an array or a list. With a set there is no order to the items. Also, every single item in a set occurs only one time and this makes a set super handy for checking membership or looking for differences. And what you'll find is that the larger your set or the more data that you're actually working with, the more you'll notice the advantage of working with this data structure, because as soon as you have to iterate through an enormous array or enormous list and check one by one, "Is there a match? Is there a match? Is there a match?", then the computation will start to take quite a long time.
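To make those properties of a set concrete, here is a small sketch in plain Python. The word list and the collection size are made up for illustration, and the timings will differ from machine to machine, so treat the comparison as a rough demonstration rather than a benchmark.

```python
import timeit

# A list keeps duplicates and order; a set keeps each item only once.
words_list = ['to', 'be', 'or', 'not', 'to', 'be']
words_set = set(words_list)
print(len(words_list), len(words_set))

# Membership checks use the same syntax for both collections.
print('not' in words_list, 'not' in words_set)

# A set has no positions, so asking for "item number 1" fails.
try:
    words_set[1]
    indexing_worked = True
except TypeError:
    indexing_worked = False
print(indexing_worked)

# Illustrative timing: look for the last item in a large list versus a set.
big_list = list(range(200_000))
big_set = set(big_list)
list_time = timeit.timeit(lambda: 199_999 in big_list, number=20)
set_time = timeit.timeit(lambda: 199_999 in big_set, number=20)
print(f'list: {list_time:.6f}s, set: {set_time:.6f}s')
```

The list has to be scanned item by item until a match turns up, whereas the set can look the value up directly, which is why the gap grows with the size of the collection.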
Here's how you can visualize the words in our example sentence as a set. I'll draw one big circle and in that circle we'll have the words "work", "play", "makes", "Jack", "and", "a" and some of the other ones. Similarly our stop words can make up another set and here we would have words like "we", "I", "until", "you", "while" and also the words "and" and "a". The words "and" and "a" are members of both sets. They're at the intersection of the two circles. So if you're a visual person like myself, then this is how you can kind of think about the set data structure. Come to think of it, we've actually covered quite a few different types of collections so far, right? We've covered lists, we've covered dictionaries, we've covered tuples, we've covered arrays and now we've also covered sets. Now these are probably the most important data structures that you'll find in Python. The lists, the dictionaries, the tuples and the sets are actually built-in data structures in Python. The arrays on the other hand came from NumPy, just the same way that pandas gave us the DataFrame. But I think that's enough theory. Let's write some code and put our sets into action. Let's access the stop words first, so "stopwords.words('english')" will give us our stop words. Let me hit Shift+Enter and see what we get here. These are the stop words in that file. At the moment, if we check what the type is of this particular object, we get a list. So if we use this bit of Python code, we will be working with a list. If we wanted to work with a set instead, then we would use Python's built-in "set" function and then between the parentheses feed in our list of stop words. Here's what it would look like. Similar to a dictionary, the Python set has this curly brackets notation, but in this case all the values inside are simply separated by a comma.
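The two circles and the list-to-set conversion can be sketched in plain Python like this. The stop words here are a small hand-picked sample for illustration, not the full NLTK list, and all the variable names are placeholders.

```python
# The words from the example sentence, drawn as the first circle.
sentence_words = {'work', 'play', 'makes', 'jack', 'and', 'a'}

# Hypothetical hand-picked sample standing in for the second circle.
sample_stop_words = {'we', 'i', 'until', 'you', 'while', 'and', 'a'}

# Intersection: the words that sit in both circles.
shared = sentence_words & sample_stop_words
print(sorted(shared))

# Difference: the sentence words that are not stop words.
kept = sentence_words - sample_stop_words
print(sorted(kept))

# Converting a list into a set with the built-in set() function.
as_list = ['we', 'i', 'until']
as_set = set(as_list)
print(type(as_list), type(as_set))

# Curly brackets with plain values give a set,
# curly brackets with key: value pairs give a dictionary.
a_set = {'and', 'a'}
a_dict = {'and': 1, 'a': 2}
print(type(a_set), type(a_dict))
```

One thing to watch out for: empty curly brackets on their own create a dictionary, so an empty set has to be written as set().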
With a dictionary, you always have the key and value pair with the colon in between, but with a set you have the curly brackets and then the values separated by a comma. So what I'm going to do now is create a variable called "stop_words" and store our set inside this variable. And if you're in doubt that we indeed are working with a set now, we can check the type of our stop_words variable. There you go. This should be the proof. Now let's use our set to check for membership. Here's how you can do it. If the string "this" is in stop_words, then print "Found it". Here we're using the Python "in" keyword to check if a particular string is contained inside our set. See what we get. Yeah. So the word "this" is contained inside our stop_words. What about the word "hello"? A common word, right? Let's see if that's contained. Nope, it's not in there. The print statement does not execute. So the word "this" was found but the word "hello" was not. Now as a challenge, can you print out "Nope. Not in here." if the word "hello" is not contained in stop_words? Did you give it a quick go? The solution is very simple. I just copy-paste my previous line of code, substitute "hello" for the word "this" and then add the word "not", which is a Python keyword, to modify the condition in our if statement. Then lastly I just modify the bit in the print statement to "Nope. Not in here." So we can see now that we've got two patterns already, one using the "in" keyword and a set to check for membership, and the other using "not in" if you want to check whether something is not contained in the set. Both these techniques are very, very handy and we're gonna make use of them later on in our code. But before we tackle all of our 5800 email messages, let's tackle an example sentence first. Let's write some Python code that tokenizes, converts to lower case and removes stop words from an example sentence.
I'm quickly going to copy this line of code here, paste it in and just add something at the end, maybe "To be or not to be", just a bunch of stop words. Now, we said we'd tokenize, lower case and remove the stop words from this string. We can lower case with "msg.lower()", we can tokenize with "word_tokenize()" and then provide "msg.lower()" as an argument, and we can store this list of words in, say, a variable called "words". Okay fine. So far so good. To filter out all the stop words I'm going to use a loop and I'm also gonna use an empty list to hold on to the results. So I'll say "filtered_words = []" and then I want to write my loop, but I think, given what we've covered so far, you can probably write this yourself already. So as a challenge, can you write a loop that appends all the words that aren't stop words to this currently empty list of "filtered_words"? I'll give you a few seconds to pause the video and give this a go. Did you figure it out? Here's the solution. So I'm going to use a for loop. I'm going to say "for word in words:", so going through all the words in this list one by one, and if the word is not in stop_words, then take the list of filtered words and append the word. And just so we can take a look at our work, I'm going to print out our list of filtered words. Let's see what we have. Okay, so this is interesting, right? The word "all" disappears. The word "work" is included. The word "and" is not. The word "no" is also excluded. "play" remains, "makes" remains, "jack" remains, but it's lowercase because we've converted it here to lower case. "a" is excluded. "dull" and "boy" are included. And then we have the punctuation. We have the full stop here, and all of these words "to be or not to be" are excluded because they're all stop words, and then the full stop at the end is included as well. So our list of filtered words contains all the words that aren't stop words, plus the punctuation. All right. So we've covered quite a few of the pre-processing steps already.
We've covered how to convert to lowercase, we've covered tokenizing and removing stop words. We still have to learn how to strip out HTML tags, do the word stemming and remove the punctuation. The word stemming and the punctuation removal are what I'm going to cover next. So I'll see you in the next lesson.