0 1 00:00:00,690 --> 00:00:02,380 Welcome back. 1 2 00:00:02,400 --> 00:00:10,720 In this video we're going to talk about word stems and word stemming, as well as removing punctuation. 2 3 00:00:10,740 --> 00:00:16,970 These are gonna be the next two steps in our pre-processing stage. Again, we'll work with an example 3 4 00:00:16,970 --> 00:00:21,330 sentence before applying all of this to our email dataset. 4 5 00:00:21,560 --> 00:00:32,040 I'm going to add a quick markdown cell here that reads "Word Stems and Stemming". Now, what do I mean when 5 6 00:00:32,040 --> 00:00:39,870 I say word stems and word stemming? You see, stemming is the process of reducing words to their base 6 7 00:00:39,990 --> 00:00:42,000 or their root form. 7 8 00:00:42,000 --> 00:00:48,370 The idea behind word stemming is to treat inflected or derived words in the same way. 8 9 00:00:48,420 --> 00:00:56,010 So for example, the words "fishing", "fished", "fisher" and "fishlike" are all reduced by the stemmer to the word 9 10 00:00:56,010 --> 00:01:04,860 "fish". Endings like "ing" in "fishing" or "ed" in "fished" are removed by the stemming software. 10 11 00:01:04,860 --> 00:01:12,760 Now the thing to note about stemming is that the stemmer might not produce "proper words". 11 12 00:01:12,900 --> 00:01:17,010 That is to say, you might not end up with a real word 12 13 00:01:17,010 --> 00:01:18,780 after removing the ending. 13 14 00:01:19,110 --> 00:01:29,310 So for example, the words "argue", "argued", "argues" and "arguing" are all stemmed to the word "argu". Now this 14 15 00:01:29,310 --> 00:01:30,760 is not an error. 15 16 00:01:30,780 --> 00:01:38,730 The purpose of the stemming algorithm is to bring the variant forms of the word together and not to 16 17 00:01:38,730 --> 00:01:43,010 map a word to its paradigm form, if you will.
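To make the "argue" example concrete, here is a minimal sketch using the PorterStemmer from NLTK, which this lesson goes on to use. It assumes NLTK is installed; the variable names are only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Variant forms of the same word, taken from the example above
variants = ['argue', 'argued', 'argues', 'arguing']

# Stem every variant; the result is not a dictionary word, and that is fine
stems = [stemmer.stem(word) for word in variants]
print(stems)  # ['argu', 'argu', 'argu', 'argu']
```

All four variants collapse to the same token, which is the whole point: a classifier then sees them as one word.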
17 18 00:01:43,680 --> 00:01:49,110 The stemmer that I'd like to introduce to you is the de facto standard stemmer for the English language, 18 19 00:01:49,650 --> 00:01:51,660 the Porter Stemmer. 19 20 00:01:51,780 --> 00:01:57,840 This algorithm was written by Martin Porter all the way back in the 1980s at the University of Cambridge. 20 21 00:01:59,130 --> 00:02:00,280 In our previous lesson, 21 22 00:02:00,330 --> 00:02:08,600 we already imported the PorterStemmer functionality from NLTK at the top of our Jupyter notebook. 22 23 00:02:08,610 --> 00:02:11,560 Now it's time to put it to use. 23 24 00:02:11,580 --> 00:02:20,820 I'm going to copy this cell here and then I'm going to paste it below my markdown cell. What I can do now is 24 25 00:02:20,820 --> 00:02:29,250 simply save the PorterStemmer to a variable, so I'll just say "stemmer = PorterStemmer()" 25 26 00:02:30,450 --> 00:02:32,730 and then to use this stemmer, 26 27 00:02:33,000 --> 00:02:40,230 I'm going to go inside my for loop and, just before appending the words, I'll create another variable called 27 28 00:02:40,980 --> 00:02:49,660 "stemmed_word" and this will be equal to the result of the "stem" method from the stemmer. 28 29 00:02:49,800 --> 00:02:57,240 So I'm going to use the stemmer, put a dot after it, call the stem method and then here between the parentheses 29 30 00:02:57,540 --> 00:03:05,220 I'm going to supply the word that our loop is looping over, stem the word, store it inside this variable 30 31 00:03:05,220 --> 00:03:12,940 here and then, instead of appending the original word, I will simply append the stemmed word. 31 32 00:03:13,020 --> 00:03:20,190 Now looking at our example sentence here, the word "makes" is a clear candidate for stemming. Let 32 33 00:03:20,190 --> 00:03:28,070 me hit Shift+Enter and see what it will be stemmed to. "Makes" is stemmed to "make".
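The steps just described could look roughly like the sketch below. The names "stemmer" and "stemmed_word" come from the video; everything else is an assumption made to keep the example self-contained: the sentence, the tiny hand-written stop word set, the list name "filtered_words", and the use of str.split() in place of the notebook's own tokenizer and NLTK stop word list.

```python
from nltk.stem import PorterStemmer

# Hypothetical stand-ins for the notebook's stop word list and example sentence
stop_words = {'all', 'and', 'no', 'a'}
msg = 'All work and no play makes Jack a dull boy'

stemmer = PorterStemmer()
filtered_words = []

for word in msg.lower().split():
    if word not in stop_words:
        # Stem the word first, then append the stem instead of the original
        stemmed_word = stemmer.stem(word)
        filtered_words.append(stemmed_word)

print(filtered_words)
```

In the printed list, "makes" shows up as "make", while the stop words are gone.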
33 34 00:03:28,070 --> 00:03:32,650 Now it turns out this was the only word in our example sentence that was stemmed. 34 35 00:03:32,780 --> 00:03:38,930 Perhaps we should add another word that's a stemming candidate at the very end just to try out how the 35 36 00:03:38,930 --> 00:03:40,390 stemmer works. 36 37 00:03:40,700 --> 00:03:50,240 I'm going to expand the example sentence with a few more words, so I'm gonna wrap my line across two 37 38 00:03:50,240 --> 00:03:55,490 lines in Python so I don't have a very, very long sentence all on the same line. 38 39 00:03:55,910 --> 00:04:01,910 So I'm going to use that backslash, that escape character which escapes me pressing Enter on my keyboard. 39 40 00:04:03,090 --> 00:04:05,910 Now I'm going to add a few more words to my example sentence - 40 41 00:04:06,010 --> 00:04:11,110 "Nobody expects the Spanish Inquisition". 41 42 00:04:13,340 --> 00:04:13,850 There we go. 42 43 00:04:14,270 --> 00:04:17,350 Let's see how our Porter stemmer handles this. 43 44 00:04:17,600 --> 00:04:27,440 So quite interesting, "nobody" gets stemmed to "nobodi" with an "i". With "expects", the stemmer drops the "s" and 44 45 00:04:27,440 --> 00:04:34,340 with the word "inquisition" the stemmer drops the letters "ion" at the end. Now one thing I'll say is 45 46 00:04:34,340 --> 00:04:41,150 that you're not actually limited to using the PorterStemmer from the NLTK toolbox. There's quite a few 46 47 00:04:41,150 --> 00:04:46,130 to choose from, there's almost like a menu. The reason you might want to use a different stemmer other 47 48 00:04:46,130 --> 00:04:53,280 than the Porter stemmer for example is if you're stemming a different language.
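Here is a small sketch of the two things shown in this part: wrapping a long string across two lines with a backslash, and what the Porter stemmer does with the new words. The first half of the sentence is an assumed placeholder, not necessarily the exact sentence in the notebook.

```python
from nltk.stem import PorterStemmer

# The backslash at the end of the first line continues the string literal
# on the next line, so msg is still one string without a line break in it.
msg = 'All work and no play makes Jack a dull boy. \
Nobody expects the Spanish Inquisition'

stemmer = PorterStemmer()

# The three words discussed above, lowercased
for word in ['nobody', 'expects', 'inquisition']:
    print(word, '->', stemmer.stem(word))
# nobody -> nobodi
# expects -> expect
# inquisition -> inquisit
```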
Scrolling up to the top, to our 48 49 00:04:53,300 --> 00:05:03,560 imports, a popular choice for other stemmers is the Snowball stemmer, so from "nltk.stem" we can also import 49 50 00:05:03,680 --> 00:05:11,930 the "SnowballStemmer" and the nice thing with the Snowball stemmer is that if I come down here, copy this 50 51 00:05:11,930 --> 00:05:22,890 line, comment this out, paste it in and substitute the "SnowballStemmer" here I can choose a language, for 51 52 00:05:22,890 --> 00:05:32,220 example, yeah, English obviously, we can use the Snowball stemmer with English, but if we go to the documentation 52 53 00:05:32,220 --> 00:05:41,850 here from NLTK and we scroll down a bit to the Snowball stemmer, then what you'll see is that there's 53 54 00:05:41,880 --> 00:05:49,890 other options too, right, there's Arabic, there's Finnish, there's French, there's German, there's Hungarian, 54 55 00:05:50,610 --> 00:05:57,330 Swedish, Norwegian, quite a few, Romanian, like the list goes on, right, Russian, so you can have a look, I'll 55 56 00:05:57,330 --> 00:06:03,810 put the link in the lesson resources, so yeah if you ever want to stem words and use this tool on text 56 57 00:06:03,840 --> 00:06:05,220 that is not English, 57 58 00:06:05,220 --> 00:06:12,810 the Snowball stemmer is your friend. Okay, so that pretty much covers stemming. The next thing that we'll 58 59 00:06:12,810 --> 00:06:20,730 do to clean up the email text and the words is to remove the punctuation. Our spam classifier is not 59 60 00:06:20,730 --> 00:06:26,520 gonna be very interested in the punctuation in the sentences. We can see at the moment, we still have 60 61 00:06:26,520 --> 00:06:33,750 these full stops in our output and an exclamation mark and if we add question marks or anything else 61 62 00:06:34,110 --> 00:06:35,860 it'll show up as well.
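Going back to the Snowball stemmer for a moment before moving on to punctuation: swapping it in could look like the sketch below. It assumes NLTK is installed; the variable name "german_stemmer" is purely illustrative.

```python
from nltk.stem import SnowballStemmer

# The supported languages are listed on the class itself
print(SnowballStemmer.languages)

# Same stem() interface as the Porter stemmer, but the language is chosen up front
stemmer = SnowballStemmer('english')
print(stemmer.stem('makes'))  # make

# For text that is not English, pass a different language name
german_stemmer = SnowballStemmer('german')
```

Because both stemmers expose the same "stem" method, the rest of the loop does not need to change when you switch from one to the other.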
62 63 00:06:36,730 --> 00:06:45,550 To remove the punctuation I'm going to copy this cell, paste it below and then also just quickly add a markdown 63 64 00:06:45,550 --> 00:06:48,550 cell here to commemorate what we're doing. 64 65 00:06:48,700 --> 00:06:58,650 So I'll say "Removing Punctuation" and then I'll delete a few of these comments here, format this slightly 65 66 00:06:58,650 --> 00:07:06,560 differently. Maybe I'll add the odd question mark here, hit Shift+Enter and now we're ready to go. 66 67 00:07:06,830 --> 00:07:13,280 Removing punctuation is, well I think there's like an easy way and there's a hard way and I'll show you 67 68 00:07:13,280 --> 00:07:21,620 the easiest way you can do this. You see, Python strings have a fantastic method called "isalpha", so if 68 69 00:07:21,620 --> 00:07:29,300 you've got a string, say a single character, the letter "p", and you put a dot after it and then type 69 70 00:07:30,170 --> 00:07:40,310 "isalpha()" just like so, then this will check if you've got a letter or punctuation. In this case, the method 70 71 00:07:40,310 --> 00:07:51,730 returns True, but if we had say a question mark and wrote "isalpha()", then this would return False. I'm going to move 71 72 00:07:51,740 --> 00:08:00,620 these cells up slightly, so we've got them up here and I want to maybe pose a mini challenge to you. Can 72 73 00:08:00,620 --> 00:08:08,900 you modify our code in this cell so that all these special characters here, all these punctuation characters, 73 74 00:08:09,020 --> 00:08:16,460 full stops, question marks, exclamation marks get excluded from the output? What would you change in our 74 75 00:08:16,460 --> 00:08:24,530 code here to accomplish this? I'll give you a few seconds to pause the video and then I'll show you the 75 76 00:08:24,530 --> 00:08:34,780 solution. Did you have a go? What I would do is to modify this condition here.
Not only would I check if 76 77 00:08:34,780 --> 00:08:43,390 the word is part of the stop words, but I would also say that punctuation should not be included in our 77 78 00:08:43,390 --> 00:08:53,560 list. So we can take the word, put a dot after it and say "isalpha()". This bit of code will only 78 79 00:08:53,560 --> 00:09:03,040 return True if it hits an actual word, like "boy" or "adult". It will not return True for the full stops or 79 80 00:09:03,040 --> 00:09:07,290 the question marks. Let's check if it works. 80 81 00:09:07,290 --> 00:09:11,530 Surprise, surprise, I've planned out this tutorial and it does work. 81 82 00:09:11,700 --> 00:09:13,860 So there you go. 82 83 00:09:13,860 --> 00:09:19,950 This is how you can use a built-in method of Python strings to check for punctuation and exclude it 83 84 00:09:20,160 --> 00:09:23,210 if necessary. In the next lesson, 84 85 00:09:23,280 --> 00:09:27,610 I'm going to show you how to tackle the HTML tags in the emails. 85 86 00:09:27,660 --> 00:09:28,500 I'll see you there. 86 87 00:09:28,500 --> 00:09:29,100 Stay tuned.
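As a recap of the mini challenge, here is a minimal sketch of the modified condition. The token list and the small stop word set are hypothetical stand-ins for the notebook's tokenizer output and NLTK stop word list, and stemming is left out to keep the focus on the punctuation check.

```python
# isalpha() is True only when every character in the string is a letter
print('p'.isalpha())  # True
print('?'.isalpha())  # False

# Hypothetical stand-ins for the notebook's stop words and tokenized sentence
stop_words = {'all', 'and', 'no', 'a', 'to', 'be', 'or', 'not'}
tokens = ['all', 'work', 'and', 'no', 'play', '.', 'to', 'be', 'or',
          'not', 'to', 'be', '?', 'nobody', 'expects', '!']

filtered_words = []
for word in tokens:
    # Keep the word only if it is not a stop word AND consists of letters only
    if word not in stop_words and word.isalpha():
        filtered_words.append(word)

print(filtered_words)  # ['work', 'play', 'nobody', 'expects']
```

One thing to keep in mind: "isalpha" also returns False for tokens that mix letters with digits or apostrophes, so those get dropped along with the punctuation.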