Welcome back. In the next couple of lessons I've got a really exciting topic for you: we're going to be talking about NLP, natural language processing. Natural language processing is a huge field. It used to be a subfield of artificial intelligence, actually, but it has moved more and more into the domain of machine learning, and NLP is big business too. All sorts of things fall under natural language processing. For example, it covers things like search, sentiment analysis of tweets or reviews, Google AdWords, automatic translation, spellcheck, autocorrect, Siri, Alexa, you name it. As you can guess, NLP is what most of Google's earnings actually depend on.

Now, how are we going to use NLP for our naive Bayes classifier? Well, we're going to use it to prepare a piece of text for our learning algorithm. We have to convert our email bodies into a form that the algorithm can understand, and this means pre-processing our text. Now what kind of things do I mean by pre-processing? Here's the high-level overview. First off, we're going to start by converting all our text to lowercase. Second, we're going to tokenize our text, meaning we're going to split up the individual words in a sentence. Third, we're going to remove the stop words.
By stop words I mean very common English words, like the word "the", which are there to convey grammar rather than meaning. Next, we're also going to strip out the HTML tags that are in the emails. A lot of the emails are not written in plain text but contain a lot of HTML formatting, which we're not going to feed into our algorithm. Next, we're going to do some word stemming, and that means converting each individual word to its word stem. So, for example, if you have the words "going", "goes" and "go", then all of these words actually share the same word stem; it's really only the grammar that changes their spelling. By stemming the words we're able to treat them all as the same word. And lastly, we're also going to remove the punctuation, and that is because, as you can tell, our naive Bayes classifier will ignore the grammar.

Now, without further ado, let's get started. All right. So I'm going to add a few markdown cells once again in Jupyter, so that we can find this section really easily when we're scrolling through it. I'll call the first heading "Natural Language Processing", with two s's, not three, and then I'll add a subheading that reads "Text Pre-Processing". Now the first step is normalizing the casing of the letters. Very often the case of the words should not matter.
If I search for "what is the airspeed velocity of an unladen swallow?", then even if I were to type "wHaT iS thE AirSPEed VeLocITy of An UnLaDen SWaLloW?", horrible as that is to read, the answer to this vitally important question should not depend on the upper or lower casing of my letters. And you can verify at home that when you type in a search query, Google completely ignores the casing of your letters; the casing doesn't affect the search results. Similarly, for our spam classifier we will treat words like "loan" or "Viagra" the same way regardless of whether they're spelled with uppercase or lowercase letters.

So, coming back to our Python code, suppose we have a message, some sort of string that reads "All work and no play makes Jack a dull boy.". How can we convert all of these letters to lowercase? How can we ignore the casing of the words in this string? Well, Python strings have a handy little method called "lower()", so "msg.lower()" will convert all the letters in the string to lowercase. You can see that "Jack" becomes lowercase and the word "All" also becomes lowercase. So converting to lowercase is one kind of text pre-processing that you can do. Now, for a lot of the other pre-processing that we're going to do, we're going to use a Python package called the Natural Language Toolkit, or NLTK.
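The lowercasing step just described can be sketched as a quick, runnable example (the message string is the one from the lesson):

```python
# Normalize casing with Python's built-in str.lower()
msg = "All work and no play makes Jack a dull boy."

print(msg.lower())  # all work and no play makes jack a dull boy.
```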
The website for this package is nltk.org, and this is actually a package that almost every professional in the NLP field will use at some point for their natural language processing needs. The NLTK package can do a huge number of things, and we're going to start with some of the fundamentals, namely pre-processing our text so that our machine learning algorithm can use it.

Now, since we're going to be using the NLTK resources, I'm going to add a very quick section heading here, "Download the NLTK Resources". Those resources include something called a tokenizer and a list of stop words, amongst other things. But before I do that, I'm going to import the package itself along with a couple of the tools. So I'm going to come up here to my notebook imports and say "import nltk"; then from nltk.stem we're going to import the "PorterStemmer", from nltk.corpus we're going to import "stopwords", and from nltk.tokenize we're going to import "word_tokenize". I think this will do for now. We're importing the package as a whole and then three additional pieces of functionality: the PorterStemmer, stopwords, and a word tokenizer. So I'm going to hit Shift+Enter on this cell and then scroll down to the section where I'm going to show you how to download the NLTK resources. And this is what we're going to do in the next lesson, to tokenize our words. I'll see you there.