1
00:00:00,166 --> 00:00:02,533
Hello and welcome to this R tutorial.

2
00:00:02,533 --> 00:00:05,900
So, so far we've been doing a great deal
of simplifications

3
00:00:05,900 --> 00:00:09,400
for our corpus and therefore
for a future sparse matrix of features.

4
00:00:09,766 --> 00:00:14,566
But we can still do better, and doing
better is what we'll do in this tutorial.

5
00:00:14,800 --> 00:00:18,533
And it's a new step of the
cleaning process that is called stemming.

6
00:00:18,866 --> 00:00:20,366
So what is stemming?

7
00:00:20,366 --> 00:00:23,766
Well, stemming
is about getting the root of each word.

8
00:00:24,333 --> 00:00:28,800
For example, if we look at the first
review, we have the word "loved" here.

9
00:00:29,100 --> 00:00:32,100
And the root of this word would be love.

10
00:00:32,433 --> 00:00:35,433
So what is the purpose
of getting the root of the word?

11
00:00:35,566 --> 00:00:39,033
Well, it's of course still related
to our goal to reduce the total

12
00:00:39,033 --> 00:00:42,600
number of words that will be in our future
sparse matrix of features.

13
00:00:43,000 --> 00:00:45,866
And we can do this
by taking the root of the words.

14
00:00:45,866 --> 00:00:50,433
Because whether we have loved or love
or will love or loving,

15
00:00:50,766 --> 00:00:53,900
well, this actually means the same thing
for our algorithm.

16
00:00:54,333 --> 00:00:57,966
And not only does it mean the same thing,
but it also gives the same hint

17
00:00:58,166 --> 00:01:00,600
whether the review is positive
or negative.

18
00:01:00,600 --> 00:01:04,500
So we don't really need to have
different tenses of the same verb.

19
00:01:04,733 --> 00:01:07,266
And we don't really need to have
derivative words.

20
00:01:07,266 --> 00:01:09,200
We just need the root of the words.
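
As a minimal sketch of this idea, the R snippet below stems three of the forms mentioned above and counts how many distinct roots remain. It assumes the SnowballC package is installed; calling `wordStem` directly is an illustration and not something the tutorial itself does.

```r
# Illustrative sketch only: wordStem() comes from the SnowballC package,
# which is an assumption here; the tutorial does not call it directly.
library(SnowballC)

words <- c("loved", "love", "loving")

# Reduce each word to its stem with the Porter stemmer.
stems <- wordStem(words, language = "porter")

print(stems)
# Number of distinct columns these words would need after stemming.
print(length(unique(stems)))
```

Before stemming, these three forms would each take their own column in the sparse matrix; when they share a single stem, one column is enough.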

21
00:01:09,200 --> 00:01:12,400
And that will be perfectly enough
for our machine learning classification

22
00:01:12,400 --> 00:01:15,600
model to train on the future
sparse matrix of features

23
00:01:15,866 --> 00:01:18,866
that therefore will contain
only the roots of the words.

24
00:01:18,900 --> 00:01:21,833
And you can imagine
how we will considerably reduce

25
00:01:21,833 --> 00:01:23,500
the final total number of words.

26
00:01:23,500 --> 00:01:27,133
That is, the final total number of columns
in the sparse matrix of features.

27
00:01:27,333 --> 00:01:29,566
Because of course,
by only keeping the roots

28
00:01:29,566 --> 00:01:31,500
of the different versions
of the same word.

29
00:01:31,500 --> 00:01:34,700
Well, of course
that simplifies things a great deal and therefore

30
00:01:34,733 --> 00:01:37,733
considerably reduces
the final total number of words.

31
00:01:38,166 --> 00:01:39,366
So that's stemming.

32
00:01:39,366 --> 00:01:42,500
That's also a very important step
in natural language processing.

33
00:01:42,800 --> 00:01:45,633
You will most of the time apply
stemming to your text

34
00:01:45,633 --> 00:01:50,633
whether you are working with reviews
or articles or books or HTML pages.

35
00:01:50,833 --> 00:01:53,666
Well, for any kind of text,
it will really help your machine learning

36
00:01:53,666 --> 00:01:57,466
algorithm to do an even better job
for your classification problem.

37
00:01:57,866 --> 00:01:59,766
So let's do it for our reviews.

38
00:01:59,766 --> 00:02:02,400
And it is still going to be very simple.

39
00:02:02,400 --> 00:02:04,300
We will do another copy paste here.

40
00:02:04,300 --> 00:02:08,400
So I will actually copy this line
because we only need two parameters

41
00:02:08,633 --> 00:02:12,633
the corpus and a function
that will perform the stemming.

42
00:02:12,633 --> 00:02:16,066
So paste here, and I will replace

43
00:02:16,066 --> 00:02:20,466
removePunctuation with the appropriate
function to perform the stemming,

44
00:02:20,700 --> 00:02:24,900
which is stemDocument, with a capital D.

45
00:02:24,900 --> 00:02:25,733
Here it is.

46
00:02:25,733 --> 00:02:30,366
That's the function we use to perform
stemming on all the reviews of our corpus.
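
The step described above can be sketched roughly as follows. This is a minimal sketch, assuming the tm and SnowballC packages are installed; the one-review corpus is a hypothetical stand-in for the course corpus, which has already been through the earlier cleaning steps.

```r
# Minimal sketch, assuming the tm and SnowballC packages are installed.
library(tm)
library(SnowballC)  # tm's stemDocument relies on SnowballC for stemming

# Hypothetical stand-in for the cleaned corpus built in earlier tutorials.
corpus <- VCorpus(VectorSource(c("wow loved place")))

# Same pattern as the previous cleaning steps: the corpus plus a function.
corpus <- tm_map(corpus, stemDocument)

# Inspect the first review after stemming.
first_review <- as.character(corpus[[1]])
print(first_review)
```

The call keeps the same two-argument shape as the earlier cleaning lines; only the function passed to `tm_map` changes.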

47
00:02:30,766 --> 00:02:32,366
So let's check it out.

48
00:02:32,366 --> 00:02:35,166
Let's select this line right now.

49
00:02:35,166 --> 00:02:38,166
Our first review is "wow loved place".

50
00:02:38,366 --> 00:02:42,033
And you'll see that
after stemming, "loved" becomes "love".

51
00:02:42,600 --> 00:02:44,533
All right. So let's execute now.

52
00:02:44,533 --> 00:02:45,600
Press Command or Control plus

53
00:02:45,600 --> 00:02:46,933
Enter to execute.

54
00:02:46,933 --> 00:02:49,600
Here we go. New corpus updated.

55
00:02:49,600 --> 00:02:53,233
And now let's have a look
at the first review of this new corpus.

56
00:02:53,666 --> 00:02:57,433
So I'm pressing the up arrow here
to get this line of code.

57
00:02:57,700 --> 00:02:59,533
And now pressing enter.

58
00:02:59,533 --> 00:03:00,733
And here we go.

59
00:03:00,733 --> 00:03:03,166
"wow love place".

60
00:03:03,166 --> 00:03:05,566
So loved was replaced by love.

61
00:03:05,566 --> 00:03:08,133
Because the root of loved is love.

62
00:03:08,133 --> 00:03:09,500
All right. So.

63
00:03:09,500 --> 00:03:11,400
And that's the same for all the reviews.

64
00:03:11,400 --> 00:03:15,066
In all the other reviews, too,
the words were replaced by their roots.

65
00:03:15,666 --> 00:03:18,100
So that's done for this new step.

66
00:03:18,100 --> 00:03:20,900
And actually we are almost done
with the cleaning process.

67
00:03:20,900 --> 00:03:24,866
We have one final step and we will do this
final step in the next tutorial.

68
00:03:25,166 --> 00:03:26,733
Until then, enjoy machine learning.