1
00:00:00,233 --> 00:00:02,533
Hello and welcome to this R tutorial.

2
00:00:02,533 --> 00:00:05,933
So in the previous tutorials
we imported the data set and we started

3
00:00:05,933 --> 00:00:09,000
with the big first step
of natural language processing,

4
00:00:09,000 --> 00:00:12,000
which is about cleaning the texts
we are working with.

5
00:00:12,333 --> 00:00:18,066
So this first step consisted of creating
a corpus that basically is a new data set,

6
00:00:18,066 --> 00:00:21,666
but this time only containing the reviews,
the text of the reviews.

7
00:00:21,900 --> 00:00:24,733
And basically it is in this corpus
that we are going to clean

8
00:00:24,733 --> 00:00:28,133
all the 1000 reviews, and
we are going to clean them step by step.

9
00:00:28,333 --> 00:00:31,333
And in this tutorial
we're going to do the first cleaning step.

10
00:00:31,700 --> 00:00:33,166
All right so let's do it.

11
00:00:33,166 --> 00:00:37,866
This first cleaning step will consist
of putting all the reviews in lower cases.

12
00:00:38,100 --> 00:00:39,966
And what is the purpose of doing that?

13
00:00:39,966 --> 00:00:42,966
Well we are doing this
so that in the final sparse matrix

14
00:00:42,966 --> 00:00:46,800
containing all the words
of our 1000 reviews, we don't get twice

15
00:00:46,800 --> 00:00:49,966
the same word, you know, with one word
starting with the capital letter

16
00:00:49,966 --> 00:00:53,000
and the same word,
but not starting with the capital letter.

17
00:00:53,200 --> 00:00:56,300
And of course, we only want
to have one version of this same word,

18
00:00:56,600 --> 00:00:59,400
and therefore
we'll keep the one with lowercase.
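
To make the duplicate-word problem concrete, here is a minimal sketch in base R. The two review strings are made-up stand-ins, not taken from the course data set, and splitting on spaces is only a rough approximation of how the sparse matrix columns will later be built.

```r
# Illustrative stand-in text (an assumption, not the real reviews).
reviews <- c("Loved this place", "I loved the food")

# Split the reviews into individual words.
words <- unlist(strsplit(reviews, " "))

# Without lowercasing, "Loved" and "loved" count as two distinct words.
distinct_raw <- unique(words)

# After lowercasing, both collapse into the single word "loved".
distinct_lower <- unique(tolower(words))

print(length(distinct_raw))
print(length(distinct_lower))
```

Each distinct word would become one column of the sparse matrix, so fewer distinct words means fewer columns.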

19
00:00:59,400 --> 00:01:02,400
So that's why right now, in this
first step of the cleaning process,

20
00:01:02,600 --> 00:01:07,133
we will put all the words of our 1000
reviews in lower cases.

21
00:01:07,333 --> 00:01:08,266
So let's do it.

22
00:01:08,266 --> 00:01:12,766
To do this we are going to update
the corpus this way.

23
00:01:13,333 --> 00:01:15,700
Then equals because we are updating it.

24
00:01:15,700 --> 00:01:17,566
That is it will contain new reviews

25
00:01:17,566 --> 00:01:21,000
which are going to be the same reviews
but with lower cases.

26
00:01:21,466 --> 00:01:25,366
And to lowercase all the words of the reviews
in this corpus, we will use the

27
00:01:25,600 --> 00:01:29,966
tm_map function
that will do the job for us.

28
00:01:30,266 --> 00:01:31,433
So function.

29
00:01:31,433 --> 00:01:33,266
So we need to add some parenthesis.

30
00:01:33,266 --> 00:01:35,300
And now we need to add two parameters.

31
00:01:35,300 --> 00:01:39,100
The first parameter
is actually the corpus itself.

32
00:01:39,100 --> 00:01:43,733
But you know the old version of the corpus
that is the corpus that we have here

33
00:01:44,000 --> 00:01:46,500
which contains
the original versions of the reviews,

34
00:01:46,500 --> 00:01:49,500
that is the 1000 reviews
we have in our data set.

35
00:01:49,700 --> 00:01:54,433
And this corpus here will be the new
updated version of the corpus,

36
00:01:54,700 --> 00:01:57,900
that is the corpus containing
all the reviews in lower cases.

37
00:01:58,500 --> 00:02:01,733
So that's the first parameter,
the old version of the corpus.

38
00:02:01,733 --> 00:02:07,366
And the second parameter is a function
that is some kind of a transform function.

39
00:02:07,800 --> 00:02:10,433
And that will simply transform
each word of the corpus

40
00:02:10,433 --> 00:02:13,433
by replacing the capital letters
in lowercase.

41
00:02:13,433 --> 00:02:16,766
So this function is
content_transformer.

42
00:02:16,766 --> 00:02:18,900
Here it is. Let's press enter here.

43
00:02:18,900 --> 00:02:22,900
And actually this content_transformer
function can perform

44
00:02:22,900 --> 00:02:24,100
several transformations.

45
00:02:24,100 --> 00:02:28,700
As we can see in the yellow rectangle here
the parameter of this function is FUN.

46
00:02:29,000 --> 00:02:32,000
And so we need to add the function
that we want

47
00:02:32,166 --> 00:02:35,233
to put all the words of the reviews
in lower cases.

48
00:02:35,500 --> 00:02:38,500
And this function is called tolower.

49
00:02:39,100 --> 00:02:39,500
All right.

50
00:02:39,500 --> 00:02:42,600
So that's the function parameter
that we need to input

51
00:02:42,600 --> 00:02:46,766
in this content_transformer
which is like a transform function.

52
00:02:46,766 --> 00:02:50,366
Having several transformation
possibilities as input here.

53
00:02:50,366 --> 00:02:54,066
And the possibility that we choose
is this tolower function

54
00:02:54,066 --> 00:02:56,633
which will put all the words
in lower cases.

55
00:02:56,633 --> 00:03:01,200
And basically this tm_map function is used
so that we can apply this

56
00:03:01,200 --> 00:03:05,766
content_transformer(tolower) function to all the
words of the 1000 reviews of the corpus.
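
The line described above can be sketched as follows. This assumes the tm package is installed. The two-review corpus is a made-up stand-in for the 1000-review corpus built in the previous tutorial, and building it with VCorpus and VectorSource is an assumption about how that was done.

```r
library(tm)

# Stand-in corpus (assumption): two made-up reviews instead of the real 1000.
corpus <- VCorpus(VectorSource(c("Great Food And Service", "NOT Good At All")))

# Update the corpus: tm_map applies the transformation to every document,
# and content_transformer wraps base R's tolower so it works on corpus documents.
corpus <- tm_map(corpus, content_transformer(tolower))

# Pull the text back out to check the result.
lowered <- sapply(corpus, as.character)
print(lowered)
```

The same name, corpus, appears on both sides of the assignment because the old version is the input and the lowercased version replaces it.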

57
00:03:06,266 --> 00:03:08,600
Great. So that's actually done.

58
00:03:08,600 --> 00:03:09,400
That's actually all

59
00:03:09,400 --> 00:03:13,033
we need to put all the words of the 1000
reviews in lower cases.

60
00:03:13,400 --> 00:03:16,133
So I'm going to show you now what it does.

61
00:03:16,133 --> 00:03:19,900
So what we'll do before selecting this
and executing this.

62
00:03:20,200 --> 00:03:24,166
What we'll do is have a look at,
you know, one review of the corpus.

63
00:03:24,166 --> 00:03:27,533
Let's take the first review
and then we'll run this line of code

64
00:03:27,733 --> 00:03:30,900
and you'll see what it does to this
first review okay.

65
00:03:30,900 --> 00:03:32,433
So let's access the first review.

66
00:03:32,433 --> 00:03:36,266
And to do this
we need to use the as.character function.

67
00:03:36,933 --> 00:03:39,933
And then in parentheses
we input the corpus.

68
00:03:40,166 --> 00:03:41,233
But then since we want to look

69
00:03:41,233 --> 00:03:45,433
at the first review of this corpus
well we need to add some double brackets.

70
00:03:45,433 --> 00:03:46,166
Actually.

71
00:03:46,166 --> 00:03:50,200
And one because this is the index
of the first review because indexes in R

72
00:03:50,200 --> 00:03:51,400
start at one.

73
00:03:51,400 --> 00:03:53,666
So this way
we will have a look at the first review.

74
00:03:53,666 --> 00:03:56,466
And you know, since the corpus
is kind of a complicated object,

75
00:03:56,466 --> 00:04:00,566
we need to use these double brackets here
to access the written review.

76
00:04:00,833 --> 00:04:03,666
And besides we need to use this
as.character

77
00:04:03,666 --> 00:04:06,666
function
to have this written review displayed.
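
As a rough sketch of this inspection step, again assuming the tm package and a made-up stand-in corpus (the real one comes from the previous tutorial):

```r
library(tm)

# Stand-in corpus (assumption): made-up text, not the course data set.
corpus <- VCorpus(VectorSource(c("An Example Review", "Another one")))

# Double brackets select the first document itself (R indexes start at 1),
# and as.character extracts its text so it can be displayed.
first_review <- as.character(corpus[[1]])
print(first_review)
```

With single brackets, corpus[1] would return a smaller corpus holding one document rather than the document itself, which is why the double brackets are needed here.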

78
00:04:07,033 --> 00:04:07,700
All right.

79
00:04:07,700 --> 00:04:10,200
So I'm going to press enter here.

80
00:04:10,200 --> 00:04:13,366
And as I just told you
we get the written review.

81
00:04:13,400 --> 00:04:15,200
"Wow... Loved this place."

82
00:04:15,200 --> 00:04:16,266
Which is of course

83
00:04:16,266 --> 00:04:20,333
the first review as we can see in our data
set: "Wow... Loved this place."

84
00:04:20,933 --> 00:04:22,333
All right. So that's the first review.

85
00:04:22,333 --> 00:04:24,766
That's the original version
of the first review.

86
00:04:24,766 --> 00:04:28,500
And now we are going to apply
the first step of the cleaning process,

87
00:04:28,800 --> 00:04:30,933
which is to put all the reviews
in lower cases.

88
00:04:30,933 --> 00:04:31,800
So let's do it.

89
00:04:31,800 --> 00:04:34,800
Let's select this line and execute.

90
00:04:35,200 --> 00:04:35,966
All right.

91
00:04:35,966 --> 00:04:38,066
So as you can see it was very fast.

92
00:04:38,066 --> 00:04:41,433
All the 1000 reviews were just transformed
in lower cases.

93
00:04:41,433 --> 00:04:42,800
So let's check it out.

94
00:04:42,800 --> 00:04:44,633
Let's check it out for the first review.

95
00:04:44,633 --> 00:04:48,466
So we just need to,
you know, press the up arrow

96
00:04:48,633 --> 00:04:51,766
to get the previous command
which is this one.

97
00:04:52,200 --> 00:04:52,966
And you know, since

98
00:04:52,966 --> 00:04:56,833
our new corpus is also called corpus,
we just updated the corpus.

99
00:04:57,133 --> 00:05:00,033
Well, we can run this
and hopefully we'll get

100
00:05:00,033 --> 00:05:03,033
the same review written in lower cases.

101
00:05:03,300 --> 00:05:04,233
So let's check it out.

102
00:05:04,233 --> 00:05:06,400
Let's press enter here.

103
00:05:06,400 --> 00:05:07,800
And here we go.

104
00:05:07,800 --> 00:05:11,400
As you can see
the capital W became this little W.

105
00:05:11,700 --> 00:05:14,800
And this capital L became this lower L.

106
00:05:15,100 --> 00:05:18,066
Perfect. So first simplification.

107
00:05:18,066 --> 00:05:21,100
Now in the final big table
the final sparse matrix,

108
00:05:21,300 --> 00:05:25,133
we won't get two versions of the same word,
one in capital letter

109
00:05:25,133 --> 00:05:26,500
and one in lowercase.

110
00:05:26,500 --> 00:05:28,466
We'll get one unique version of the word.

111
00:05:28,466 --> 00:05:31,133
And therefore
we did the first simplification

112
00:05:31,133 --> 00:05:33,233
of our future sparse matrix.

113
00:05:33,233 --> 00:05:34,800
So that's the first good thing done.

114
00:05:34,800 --> 00:05:37,833
And now we will proceed to the next step
of the cleaning process,

115
00:05:38,066 --> 00:05:42,933
which will be to remove all the numbers
of the reviews, because indeed the numbers

116
00:05:42,933 --> 00:05:46,833
are not very relevant in telling
if a review is positive or negative.

117
00:05:46,933 --> 00:05:50,266
Well, we need to be cautious actually,
because maybe some reviews are,

118
00:05:50,266 --> 00:05:53,300
you know, on a scale of 1 to 10,
I give it ten.

119
00:05:53,600 --> 00:05:57,266
Well, that's definitely a number
that is fully correlated to the outcome,

120
00:05:57,266 --> 00:05:59,266
whether the review
is positive or negative.

121
00:05:59,266 --> 00:06:01,133
So we should pay attention to that.

122
00:06:01,133 --> 00:06:04,033
But we could have other numbers
that are totally irrelevant,

123
00:06:04,033 --> 00:06:07,600
like, you know, some addresses
that contain numbers or phone numbers.

124
00:06:07,600 --> 00:06:10,800
Well, that would be a little weird
in a review, but we never know.

125
00:06:11,100 --> 00:06:14,700
Well, in general, when we are dealing
with text, when we're working with text,

126
00:06:14,933 --> 00:06:18,133
we want to get rid of the numbers
because these are most of the time

127
00:06:18,133 --> 00:06:19,233
not very relevant.

128
00:06:19,233 --> 00:06:21,366
And you know,
this could add a lot more columns.

129
00:06:21,366 --> 00:06:24,100
So in general
it's better to remove the numbers.

130
00:06:24,100 --> 00:06:26,100
So that's what
we'll do in the next tutorial.

131
00:06:26,100 --> 00:06:27,766
And until then enjoy machine learning.