0
1
00:00:00,690 --> 00:00:02,380
Welcome back.
1

2
00:00:02,400 --> 00:00:10,720
In this video we're going to talk about word stems and word stemming, as well as removing punctuation.
2

3
00:00:10,740 --> 00:00:16,970
These are gonna be the next two steps in our pre-processing stage. Again, we'll work with an example
3

4
00:00:16,970 --> 00:00:21,330
sentence before applying all of this to our email dataset.
4

5
00:00:21,560 --> 00:00:32,040
I'm going to add a quick markdown cell here that reads "Word Stems and Stemming". Now, what do I mean when
5

6
00:00:32,040 --> 00:00:39,870
I say word stems and word stemming? You see, stemming is the process of reducing words to their base
6

7
00:00:39,990 --> 00:00:42,000
or their root form.
7

8
00:00:42,000 --> 00:00:48,370
The idea behind word stemming is to treat inflected or derived words in the same way.
8

9
00:00:48,420 --> 00:00:56,010
So for example, the words "fishing", "fished", "fisher" and "fishlike" are all reduced by the stemmer to the word
9

10
00:00:56,010 --> 00:01:04,860
"fish". Endings like "ing" in "fishing" or "ed" in "fished" are removed by the stemming software.
10

11
00:01:04,860 --> 00:01:12,760
Now the thing to note about stemming is that the stemmer might not produce "proper words".
11

12
00:01:12,900 --> 00:01:17,010
That is to say, you might not end up with a real word
12

13
00:01:17,010 --> 00:01:18,780
after removing the ending.
13

14
00:01:19,110 --> 00:01:29,310
So for example, the words "argue", "argued", "argues" and "arguing" are all stemmed to the word "argu". Now this
14

15
00:01:29,310 --> 00:01:30,760
is not an error.
15

16
00:01:30,780 --> 00:01:38,730
The purpose of the stemming algorithm is to bring the variant forms of the word together and not to
16

17
00:01:38,730 --> 00:01:43,010
map a word to its paradigm form if you will.
17

18
00:01:43,680 --> 00:01:49,110
The stemmer that I'd like to introduce to you is the de facto standard stemmer for the English language,
18

19
00:01:49,650 --> 00:01:51,660
the Porter Stemmer.
19

20
00:01:51,780 --> 00:01:57,840
This algorithm was written by Martin Porter all the way back in the 1980s at the University of Cambridge.
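To make the idea of bringing variant forms together concrete, here is a minimal sketch using NLTK's PorterStemmer. It assumes NLTK is installed; the variable names (argue_family, argue_stems, fish_stems) are made up for this illustration and are not from the lesson's notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The four variants mentioned above should all map to one shared stem,
# even though that stem is not a real English word.
argue_family = ["argue", "argued", "argues", "arguing"]
argue_stems = {stemmer.stem(word) for word in argue_family}
print(argue_stems)

# Endings such as "ing" and "ed" are stripped off.
fish_stems = [stemmer.stem(word) for word in ["fishing", "fished"]]
print(fish_stems)
```

Collecting the stems in a set is just a quick way to see that the variants collapse to a single value.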
20

21
00:01:59,130 --> 00:02:00,280
In our previous lesson,
21

22
00:02:00,330 --> 00:02:08,600
we've already imported the PorterStemmer functionality from NLTK at the top of our Jupyter notebook.
22

23
00:02:08,610 --> 00:02:11,560
Now it's time to put it to use.
23

24
00:02:11,580 --> 00:02:20,820
I'm going to copy this cell here and then I'm going to paste it below my markdown cell. What I can do now is
24

25
00:02:20,820 --> 00:02:29,250
simply save the PorterStemmer to a variable, so I'll just say "stemmer = PorterStemmer()"
25

26
00:02:30,450 --> 00:02:32,730
and then to use this stemmer,
26

27
00:02:33,000 --> 00:02:40,230
I'm going to go inside my for loop, just before appending the words, I'll create another variable called
27

28
00:02:40,980 --> 00:02:49,660
"stemmed_word" and this will be equal to the result of the "stem" method from the stemmer.
28

29
00:02:49,800 --> 00:02:57,240
So I'm going to use the stemmer, put a dot after it, call the stem method and then here between the parentheses
29

30
00:02:57,540 --> 00:03:05,220
I'm going to supply the word that our loop is looping over, stem the word, store it inside this variable
30

31
00:03:05,220 --> 00:03:12,940
here and then, instead of appending the original word, I will simply append the stemmed word.
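The loop described above might look roughly like the sketch below. The example sentence, the tiny stop-word set and the use of split() are stand-ins chosen for this illustration, since the notebook cell from the previous lesson is not reproduced here; only PorterStemmer, stem() and the stemmed_word variable come from the walkthrough.

```python
from nltk.stem import PorterStemmer

# Hypothetical stand-ins for the sentence and stop words from the previous lesson.
example_sentence = "Too much work makes Jack tired"
stop_words = {"too", "much"}

stemmer = PorterStemmer()

filtered_words = []
for word in example_sentence.lower().split():
    if word not in stop_words:
        # Stem the word first, then append the stem instead of the original word.
        stemmed_word = stemmer.stem(word)
        filtered_words.append(stemmed_word)

print(filtered_words)
```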
31

32
00:03:13,020 --> 00:03:20,190
Now looking at our example sentence here, the word "makes" is a clear candidate for stemming. Let
32

33
00:03:20,190 --> 00:03:28,070
me hit Shift+Enter and see what it will be stemmed to. "Makes" is stemmed to "make".
33

34
00:03:28,070 --> 00:03:32,650
Now it turns out this was the only word in our example sentence that was stemmed.
34

35
00:03:32,780 --> 00:03:38,930
Perhaps we should add another word that's a stemming candidate at the very end just to try out how the
35

36
00:03:38,930 --> 00:03:40,390
stemmer works.
36

37
00:03:40,700 --> 00:03:50,240
I'm going to expand the example sentence with a few more words, so I'm gonna wrap my line across two
37

38
00:03:50,240 --> 00:03:55,490
lines in Python so I don't have a very, very long sentence all in the same line.
38

39
00:03:55,910 --> 00:04:01,910
So I'm going to use that backslash, that escape character which escapes me pressing Enter on my keyboard.
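As a small sketch of the backslash trick: a backslash at the very end of a line tells Python that the statement carries on onto the next line. The first half of the sentence below is a made-up placeholder, and splitting the text into two adjacent string literals is one possible way to do it, not necessarily how the notebook cell is written.

```python
# The backslash must be the last character on the line.
# The two adjacent string literals are joined into a single string.
example_sentence = "Too much work makes Jack tired. " \
                   "Nobody expects the Spanish Inquisition!"

print(example_sentence)
```

Wrapping the two literals in parentheses would achieve the same thing without a backslash.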
39

40
00:04:03,090 --> 00:04:05,910
Now I'm going to add a few more words to my example sentence - 
40

41
00:04:06,010 --> 00:04:11,110
"Nobody expects the Spanish Inquisition".
41

42
00:04:13,340 --> 00:04:13,850
There we go.
42

43
00:04:14,270 --> 00:04:17,350
Let's see how our Porter stemmer handles this.
43

44
00:04:17,600 --> 00:04:27,440
So quite interesting, "nobody" gets stemmed to "nobodi" with an "i". With "expects", the stemmer drops the "s" and
44

45
00:04:27,440 --> 00:04:34,340
with the word "inquisition" the stemmer drops the letters "ion" at the end. Now one thing I'll say is
45

46
00:04:34,340 --> 00:04:41,150
that you're not actually limited to using the PorterStemmer from the NLTK tool box. There's quite a few
46

47
00:04:41,150 --> 00:04:46,130
to choose from, there's almost like a menu. The reason you might want to use a different stemmer other
47

48
00:04:46,130 --> 00:04:53,280
than the Porter stemmer for example is if you're stemming a different language. Scrolling up to the top, to our
48

49
00:04:53,300 --> 00:05:03,560
imports, a popular choice for other stemmers is the Snowball stemmer, so from "nltk.stem" we can also import
49

50
00:05:03,680 --> 00:05:11,930
the "SnowballStemmer" and the nice thing with the Snowball stemmer is that if I come down here, copy this
50

51
00:05:11,930 --> 00:05:22,890
line, comment this out, paste it in and substitute the "SnowballStemmer" here I can choose a language, for
51

52
00:05:22,890 --> 00:05:32,220
example, yeah, English obviously, we can use the Snowball stemmer with English, but if we go to the documentation
52

53
00:05:32,220 --> 00:05:41,850
here from NLTK and we scroll down a bit to the Snowball stemmer, then what you'll see is that there's
53

54
00:05:41,880 --> 00:05:49,890
other options too, right, there's Arabic, there's Finnish, there's French, there's German, there's Hungarian,
54

55
00:05:50,610 --> 00:05:57,330
Swedish, Norwegian, quite a few, Romanian, like the list goes on, right, Russian, so you can have a look, I'll
55

56
00:05:57,330 --> 00:06:03,810
put the link in the lesson resources, so yeah if you ever want to stem words and use this tool on text
56

57
00:06:03,840 --> 00:06:05,220
that is not English,
57

58
00:06:05,220 --> 00:06:12,810
the Snowball stemmer is your friend. Okay, so that pretty much covers stemming. The next thing that we'll
58

59
00:06:12,810 --> 00:06:20,730
do to clean up the email text and the words is to remove the punctuation. Our spam classifier is not
59

60
00:06:20,730 --> 00:06:26,520
gonna be very interested in the punctuation for the sentences. We can see at the moment, we still have
60

61
00:06:26,520 --> 00:06:33,750
these full stops in our output and an exclamation mark and if we add question marks or anything else
61

62
00:06:34,110 --> 00:06:35,860
it'll show up as well.
62

63
00:06:36,730 --> 00:06:45,550
To remove the punctuation I'm going to copy this cell, paste it below and then also just quickly add a markdown
63

64
00:06:45,550 --> 00:06:48,550
cell here to note what we're doing.
64

65
00:06:48,700 --> 00:06:58,650
So I'll say "Removing Punctuation" and then I'll delete a few of these comments here, format this slightly
65

66
00:06:58,650 --> 00:07:06,560
differently. Maybe I add the odd question mark here, hit Shift+Enter and now we're ready to go.
66

67
00:07:06,830 --> 00:07:13,280
Removing punctuation is, well I think there's like an easy way and there's a hard way and I'll show you
67

68
00:07:13,280 --> 00:07:21,620
the easiest way you can do this. You see, Python strings have a fantastic method called "isalpha", so if
68

69
00:07:21,620 --> 00:07:29,300
you've got a string say a single character, the letter "p" and you put a dot after it and then type
69

70
00:07:30,170 --> 00:07:40,310
"isalpha()" just like so, then this will check whether the string consists only of letters. In this case, the method
70

71
00:07:40,310 --> 00:07:51,730
returns True, but if we had say a question mark and wrote "isalpha()", then this would return False. I'm going to move
71

72
00:07:51,740 --> 00:08:00,620
these cells up slightly, so we've got them up here and I want to maybe pose a mini challenge to you. Can
72

73
00:08:00,620 --> 00:08:08,900
you modify our code in this cell so that all these special characters here, all these punctuation characters,
73

74
00:08:09,020 --> 00:08:16,460
full stops, question marks, exclamation marks get excluded from the output? What would you change in our
74

75
00:08:16,460 --> 00:08:24,530
code here to accomplish this? I'll give you a few seconds to pause the video and then I'll show you the
75

76
00:08:24,530 --> 00:08:34,780
solution. Did you have a go? What I would do is to modify this condition here. Not only would I check if
76

77
00:08:34,780 --> 00:08:43,390
the word is part of the stop words, but I would also say that punctuation should not be included in our
77

78
00:08:43,390 --> 00:08:53,560
list. So we can take the word, put a dot after it and say "isalpha()". This bit of code will only
78

79
00:08:53,560 --> 00:09:03,040
return True if it hits an actual word, like "boy" or "adult". It will not return True for the full stops or
79

80
00:09:03,040 --> 00:09:07,290
the question marks. Let's check if it works.
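The modified condition could be sketched like this. The token list and the stop-word set are hypothetical stand-ins for the output of the earlier tokenizing step; isalpha() is the built-in string method discussed above.

```python
# isalpha() is True only when every character in the string is a letter.
print("p".isalpha())  # True
print("?".isalpha())  # False

# Hypothetical tokens and stop words, standing in for the earlier steps.
words = ["nobody", "expects", "the", "spanish", "inquisition", "!", "?", "."]
stop_words = {"the"}

filtered_words = []
for word in words:
    # Keep a token only if it is not a stop word and is purely alphabetic.
    if word not in stop_words and word.isalpha():
        filtered_words.append(word)

print(filtered_words)
```

One thing to keep in mind is that isalpha() also returns False for tokens that mix letters with digits or apostrophes, so those get dropped as well.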
80

81
00:09:07,290 --> 00:09:11,530
Surprise, surprise, I've planned out this tutorial and it does work.
81

82
00:09:11,700 --> 00:09:13,860
So there you go.
82

83
00:09:13,860 --> 00:09:19,950
This is how you can use a built-in method of Python strings to check for punctuation and exclude it
83

84
00:09:20,160 --> 00:09:23,210
if necessary. In the next lesson,
84

85
00:09:23,280 --> 00:09:27,610
I'm going to show you how to tackle the HTML tags in the emails.
85

86
00:09:27,660 --> 00:09:28,500
I'll see you there.
86

87
00:09:28,500 --> 00:09:29,100
Stay tuned.