Welcome back. In the next couple of lessons I've got a really exciting topic for you: we're going to be talking about NLP, natural language processing. Natural language processing is a huge field. It used to be a subfield of artificial intelligence, actually, but it has moved more and more into the domain of machine learning, and NLP is big business too. All sorts of things fall under natural language processing. For example, it covers things like search, sentiment analysis of tweets or reviews, Google AdWords, automatic translation, spellcheck, autocorrect, Siri, Alexa, you name it. As you can guess, NLP is what most of Google's earnings actually depend on.

Now, how are we going to use NLP for our naive Bayes classifier? Well, we're going to use it to prepare a piece of text for our learning algorithm. We have to convert our email bodies into a form that the algorithm can understand, and this means pre-processing our text. Now what kind of things do I mean by pre-processing? Here's the high-level overview. First off, we're going to start by converting all our text to lowercase. Second, we're going to tokenize our text, meaning we're going to split up the individual words in a sentence. Third, we're going to remove the stop words.
By stop words I mean very common English words, like the word "the", which are there to convey grammar rather than meaning. Next, we're also going to strip out the HTML tags that are in the emails. A lot of the emails are not written in plain text but contain a lot of HTML formatting, which we're not going to feed into our algorithm. Next, we're going to do some word stemming, and that means converting each individual word to its word stem. So, for example, if you have the words "going", "goes" and "go", then all of these words actually share the same word stem; it's really only the grammar that changes their spelling. By stemming the words we're able to treat them all as the same word. And lastly, we're also going to remove the punctuation, and that is because, as you can tell, our naive Bayes classifier will ignore the grammar.

Now, without further ado, let's get started. All right. So I'm going to add a few markdown cells once again in Jupyter, so that we can find this section really easily when we're scrolling through it. I'll call the first heading "Natural Language Processing", with two s's, not three, and then I'll add a subheading that reads "Text Pre-Processing". Now the first step is normalizing the casing of the letters. Very often the case of the words should not matter.
If I search for "what is the airspeed velocity of an unladen swallow?", then even if I were to type "wHaT iS thE AirSPEed VeLocITy of An UnLaDen SWaLloW?", horrible as that is to read, the answer to this vitally important question should not depend on the upper or lower casing of my letters. And you can verify at home that when you type in a search query, Google completely ignores the casing of your letters; the casing doesn't affect the search results. Similarly, for our spam classifier we will treat words like "loan" or "Viagra" the same way regardless of whether they're spelled with uppercase or lowercase letters.

So, coming back to our Python code, suppose we have a message, some sort of string that reads "All work and no play makes Jack a dull boy.". How can we convert all of these letters to lowercase? How can we ignore the casing of the words in this string? Well, Python strings have a handy little method called "lower()", so "msg.lower()" will convert all the letters in the string to lowercase. You can see that "Jack" becomes lowercase and the word "All" also becomes lowercase. So converting to lowercase is one kind of text pre-processing that you can do. Now, for a lot of the other pre-processing that we're going to do, we're going to use a Python package called the Natural Language Toolkit, or NLTK.
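The lowercasing step just described can be sketched as a quick, runnable example (the message string is the one from the lesson):

```python
# Normalize casing with Python's built-in str.lower()
msg = "All work and no play makes Jack a dull boy."

print(msg.lower())  # all work and no play makes jack a dull boy.
```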
The website for this package is nltk.org, and this is actually a package that almost every professional in the NLP field will use at some point for their natural language processing needs. The NLTK package can do a huge number of things, and we're going to start with some of the fundamentals, namely pre-processing our text so that our machine learning algorithm can use it.

Now, since we're going to be using the NLTK resources, I'm going to add a very quick section heading here, "Download the NLTK Resources". Those resources include something called a tokenizer and a list of stop words, amongst other things. But before I do that, I'm going to import the package itself along with a couple of the tools. So I'm going to come up here to my notebook imports and say "import nltk"; then from nltk.stem we're going to import the "PorterStemmer", from nltk.corpus we're going to import "stopwords", and from nltk.tokenize we're going to import "word_tokenize". I think this will do for now. We're importing the package as a whole and then three additional pieces of functionality: the PorterStemmer, stopwords, and a word tokenizer. So I'm going to hit Shift+Enter on this cell and then scroll down to the section where I'm going to show you how to download the NLTK resources. And this is what we're going to do in the next lesson, to tokenize our words. I'll see you there.