Welcome back. In the next couple of lessons I've got a really exciting topic for you. We're gonna be talking about NLP, natural language processing. Natural language processing is a huge field. It used to be a subfield of artificial intelligence actually, but it has moved more and more into the domain of machine learning. NLP is big business too, and all sorts of things fall into its domain. For example, it covers things like search, sentiment analysis of tweets or reviews, Google AdWords, automatic translation, spellcheck, auto-correct, Siri, Alexa, you name it. As you can guess, a large share of Google's earnings depends, in one way or another, on NLP.
Now how are we going to use NLP for our Naive Bayes classifier? Well, we're gonna use it to prepare a piece of text for our learning algorithm. We have to convert our email bodies into a form that the algorithm can understand, and this means pre-processing our text. Now what kind of things do I mean by pre-processing? Here's the high-level overview.
First off, we're gonna start by converting all our text to lowercase. Second, we're gonna tokenize our text, meaning we're gonna split a sentence up into its individual words. Third, we're gonna be removing the stop words. By stop words I mean very common English words, like the word "the", which are there to convey grammar rather than meaning. Next, we're also going to strip out the HTML tags that are in the emails. A lot of the emails are not written in plain text, but contain a lot of HTML formatting, which we're not going to feed into our algorithm. Next, we're gonna do some word stemming, and that means converting the individual words to their stem word. So for example, if you have the words "going", "goes" and "go", then all of these words actually share the same word stem; it's only really the grammar that changes their spelling. By stemming the words we're able to treat them all as the same word.
And lastly, we're also going to remove the punctuation, and that is because, as you can tell, our Naive Bayes classifier will ignore the grammar. Now without further ado, let's get started.
All right. So I'm going to add a few markdown cells once again in Jupyter, so that we can find this section really easily when we're scrolling through the notebook. So I'll call the first heading "Natural Language Processing", with two s's not three, and then I'll add a subheading that reads "Text Pre-Processing".
Now the first step is normalizing the casing of the letters. Very often the case of the words should not matter. If I search for "what is the airspeed velocity of an unladen swallow?", and I then type "wHaT iS thE AirSPEed VeLocITy of An UnLaDen SWaLloW?" instead, then even though that is horrible to read, the answer to this vitally important question should not depend on the upper or lower casing of my letters. And you can also verify at home that when you type in a search query, Google completely ignores the casing of your letters. The casing of the letters doesn't affect our search results. And similarly, for our spam classifier we will treat the words "loan" or "Viagra" the same way regardless of whether they're spelled with uppercase letters or lowercase letters.
So coming back to our Python code, suppose we have a message. We have some sort of string that reads "All work and no play makes Jack a dull boy.". How can we convert all of these letters to lowercase? How can we ignore the casing of the words in this string? Well, Python strings have a handy little method called "lower()", so "msg.lower()" will give us back a copy of the string with all the letters converted to lowercase. You can see that "Jack" becomes lowercase and the word "All" also becomes lowercase. So converting to lowercase is one kind of text pre-processing that you can do.
Now for a lot of the other pre-processing that we're going to do, we're going to use a Python package called the Natural Language Toolkit, or NLTK.
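The lowercasing step described above can be sketched in a few lines. This is only a minimal illustration, not necessarily the exact notebook cell: "msg" and the example sentence come from the lesson, while "lower_msg" and "mixed_msg" are names made up here to show that two differently cased strings become identical once both are lowercased.

```python
# Example sentence from the lesson
msg = "All work and no play makes Jack a dull boy."

# str.lower() returns a new string; msg itself is left unchanged
lower_msg = msg.lower()
print(lower_msg)  # all work and no play makes jack a dull boy.

# Hypothetical second string with erratic casing (name chosen for illustration)
mixed_msg = "aLL WoRK aNd No pLAy MaKeS JaCk A dULL bOy."

# After lowercasing, both strings compare equal
print(lower_msg == mixed_msg.lower())  # True
```

Because "lower()" returns a copy rather than modifying the string in place, the result has to be assigned to a variable if it is needed later.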
The website for this package looks like this; it's at nltk.org. This is actually a package that almost every professional in the NLP field will use at some point for their natural language processing needs. The NLTK package can do a huge number of things, and we're gonna start with some of the fundamentals, namely pre-processing our text so that our machine learning algorithm can use it.
Now since we're gonna be using the NLTK resources, I'm going to add a very quick section heading here, "Download the NLTK Resources". Those resources include something called a tokenizer and a list of stop words, amongst other things. But before I do that, I'm going to import the package itself along with a couple of the tools. So I'm going to come up here to my notebook imports and then I'm going to say "import nltk". Then from "nltk.stem" we're going to import the "PorterStemmer", from "nltk.corpus" we're going to import "stopwords" and from "nltk.tokenize" we're going to import "word_tokenize". I think this will do for now. So we import the package as a whole, plus three additional pieces of functionality: the PorterStemmer, stopwords and a word tokenizer.
So I'm going to hit Shift+Enter on this cell and then scroll down here to our section where I'm going to show you how to download the NLTK resources. That's what we're going to do in the next lesson, so that we can tokenize our words. I'll see you there.
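For reference, the import cell described in this lesson might look like the sketch below. It assumes NLTK is already installed in the notebook's environment. Importing these names does not by itself fetch the tokenizer data or the stop word list; downloading those resources is the subject of the next lesson. The final print is just a sanity check added here, not something the lesson types.

```python
import nltk                               # the package as a whole
from nltk.stem import PorterStemmer       # reduces words to their stem
from nltk.corpus import stopwords         # access to lists of common stop words
from nltk.tokenize import word_tokenize   # splits text into individual words

# Quick sanity check that the import worked
print(nltk.__version__)
```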