1
00:00:00,200 --> 00:00:02,466
Hello and welcome to this R tutorial.

2
00:00:02,466 --> 00:00:04,933
So in the previous tutorial,
we imported the data set

3
00:00:04,933 --> 00:00:08,733
which contains
1000 reviews of a restaurant.

4
00:00:09,066 --> 00:00:10,300
And for each review,

5
00:00:10,300 --> 00:00:13,500
we have the information whether the review
is positive or negative.

6
00:00:13,800 --> 00:00:17,666
So one when the review is positive
and zero when the review is negative

7
00:00:18,066 --> 00:00:20,933
and we are trying to build a machine
learning model

8
00:00:20,933 --> 00:00:23,933
that will be able to classify
each new review

9
00:00:24,000 --> 00:00:27,233
and tell if this new review
is positive or negative.

10
00:00:27,700 --> 00:00:31,166
So basically what we are about to do
is something we already did.

11
00:00:31,166 --> 00:00:33,600
That is classification, in Part 3.

12
00:00:33,600 --> 00:00:37,533
But this time we are working on text
and therefore we need to figure out a way

13
00:00:37,533 --> 00:00:41,266
to create a model where we can have
some independent variables

14
00:00:41,500 --> 00:00:44,000
to train a machine
learning classification model,

15
00:00:44,000 --> 00:00:47,000
to learn some correlations
between the independent variables

16
00:00:47,066 --> 00:00:50,400
and the dependent variable,
which is of course the liked column.

17
00:00:50,966 --> 00:00:55,133
So now the goal is simply
to create some independent variables.

18
00:00:55,433 --> 00:00:57,633
And what could those
independent variables be?

19
00:00:57,633 --> 00:01:00,666
Well, the idea in natural
language processing,

20
00:01:00,666 --> 00:01:05,800
and that's the bag of words model,
is to create a model which will basically

21
00:01:05,800 --> 00:01:10,566
be a huge table, a table where the rows
are nothing else than the reviews.

22
00:01:10,600 --> 00:01:14,200
So this table will have 1000 rows
because we have 1000 reviews,

23
00:01:14,333 --> 00:01:18,700
one row for each review
and the columns will simply be

24
00:01:19,100 --> 00:01:22,033
all the words
that we can find in the reviews.

25
00:01:22,033 --> 00:01:25,000
You know, we take the 1000 reviews,
we look at all the words

26
00:01:25,000 --> 00:01:29,466
in those 1000 reviews, and we are going
to create one column for each word.

27
00:01:30,033 --> 00:01:33,733
And then each cell in the table
will correspond to one row that is one

28
00:01:33,733 --> 00:01:38,933
review and one column that is one word
in all these words in these 1000 reviews.

29
00:01:39,233 --> 00:01:39,966
And then in this cell,

30
00:01:39,966 --> 00:01:43,533
there is going to be the number of times
the word appears in the review.

31
00:01:43,833 --> 00:01:45,633
So for example, here is the first word.

32
00:01:45,633 --> 00:01:49,866
Wow. Well, there's going to be a column
for this wow word here.

33
00:01:50,166 --> 00:01:53,166
And so for the first line
that corresponds to the first review.

34
00:01:53,366 --> 00:01:56,400
Well, this wow column will get a one
because the word

35
00:01:56,400 --> 00:01:59,400
wow appears once in the first review.

36
00:01:59,500 --> 00:02:02,400
But then for all the other reviews,
that is all the other rows,

37
00:02:02,400 --> 00:02:03,766
we don't see any wow here.

38
00:02:03,766 --> 00:02:07,866
So all the cells in this first column
for the other rows will get a zero.
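[Editor's sketch] The table described above can be illustrated in base R. The three toy reviews and the names reviews, tokens, vocab, and bow are assumptions made up for this example, not the tutorial's data:
```r
# Three made-up reviews, already lowercased and stripped of punctuation.
reviews <- c("wow loved this place", "crust is not good", "loved the food")
# Split each review into words on single spaces.
tokens <- strsplit(reviews, " ", fixed = TRUE)
# The columns of the table: every distinct word across all reviews.
vocab <- sort(unique(unlist(tokens)))
# Count, for each review, how often each vocabulary word appears.
bow <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocab))),
  integer(length(vocab))
))
colnames(bow) <- vocab
rownames(bow) <- paste0("review_", seq_along(reviews))
print(bow)  # one row per review, one column per word
```
Here the "wow" column holds a one for the first review and a zero for the others, as in the narration.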

39
00:02:08,333 --> 00:02:11,400
And that's
what will be our model in the end.

40
00:02:11,766 --> 00:02:16,066
It's going to be this huge table with 1000
rows that are going to be the reviews.

41
00:02:16,300 --> 00:02:19,300
And a lot of columns
that are going to be all the words

42
00:02:19,300 --> 00:02:22,300
that we can find in these 1000
reviews here.

43
00:02:22,600 --> 00:02:26,433
So now you might start to figure out
why we would need to clean the reviews.

44
00:02:27,000 --> 00:02:27,733
It's because

45
00:02:27,733 --> 00:02:32,066
since we are going to take all the words
in these reviews here, well, we are going

46
00:02:32,066 --> 00:02:35,766
to get a lot of columns because we create
one column for each word.

47
00:02:36,233 --> 00:02:40,033
But we don't want to get too many columns,
because the more columns we get,

48
00:02:40,233 --> 00:02:40,933
the harder

49
00:02:40,933 --> 00:02:45,133
it will be for our machine learning model
to run properly, to execute efficiently.

50
00:02:45,366 --> 00:02:48,866
Not only will our machine learning model
have more trouble executing properly,

51
00:02:49,166 --> 00:02:51,900
but also it will have more trouble
understanding the correlations

52
00:02:51,900 --> 00:02:55,333
between the presence of the words
in the reviews and the information

53
00:02:55,333 --> 00:02:57,733
whether the review is positive
or negative.

54
00:02:57,733 --> 00:03:01,500
Because of course, if we keep
all the words in these reviews, well,

55
00:03:01,500 --> 00:03:05,566
we will get some irrelevant words,
some words that will not help the machine

56
00:03:05,566 --> 00:03:09,400
learning algorithm to predict
if a review is positive or negative.

57
00:03:09,733 --> 00:03:12,700
Because you know the words
that we can find in the reviews here.

58
00:03:12,700 --> 00:03:13,566
Well, some words

59
00:03:13,566 --> 00:03:17,366
give a much better hint in telling
if the review is positive or negative.

60
00:03:17,600 --> 00:03:19,266
Let me give you a simple example.

61
00:03:19,266 --> 00:03:23,100
We have this loved word here
in this review.

62
00:03:23,100 --> 00:03:26,833
This is the word that basically tells us
that the review is positive

63
00:03:27,000 --> 00:03:28,833
because there is this loved word.

64
00:03:28,833 --> 00:03:32,866
But then if we look at the word "this"
or even "place",

65
00:03:33,266 --> 00:03:36,433
well, these two words don't
give the machine learning algorithm a hint

66
00:03:36,600 --> 00:03:38,866
whether the review is positive
or negative.

67
00:03:38,866 --> 00:03:41,300
It's of course, this word loved

68
00:03:41,300 --> 00:03:45,033
by which the machine learning algorithm
will understand some correlations

69
00:03:45,033 --> 00:03:48,800
between this loved word
and the fact that the review is positive.

70
00:03:49,366 --> 00:03:54,300
So that's the whole reason why right now
we are going to clean the text.

71
00:03:54,600 --> 00:03:58,733
It's not only to reduce the big table
we're going to get in the end,

72
00:03:59,100 --> 00:04:02,600
because we want our algorithm
to run properly and not be saturated.

73
00:04:03,333 --> 00:04:07,433
And the other reason is that
we want to get the most relevant words

74
00:04:07,600 --> 00:04:11,200
to find the best correlations between
the presence of the words and the outcome,

75
00:04:11,200 --> 00:04:13,700
whether the review is positive
or negative.

76
00:04:13,700 --> 00:04:14,100
All right.

77
00:04:14,100 --> 00:04:15,633
So now we get the point.

78
00:04:15,633 --> 00:04:17,766
So let's start cleaning the reviews.

79
00:04:17,766 --> 00:04:20,800
And an important thing to understand
is that what we'll do here

80
00:04:20,800 --> 00:04:21,966
to clean the reviews.

81
00:04:21,966 --> 00:04:25,600
Well it's the same technique
to clean any other kind of text.

82
00:04:26,100 --> 00:04:27,933
Well I will give you the main tools.

83
00:04:27,933 --> 00:04:31,566
You will be able to use these tools
to clean any text you're working with.

84
00:04:31,833 --> 00:04:32,366
And of course,

85
00:04:32,366 --> 00:04:36,533
if your text is a little more complicated,
like, for example, an HTML page

86
00:04:36,533 --> 00:04:40,600
that contains HTML tags, well,
you would need to add a few more tools,

87
00:04:40,800 --> 00:04:43,500
but you would still use the tools
that we are about to use.

88
00:04:43,500 --> 00:04:44,566
And the good news is that

89
00:04:44,566 --> 00:04:48,466
if you want to use more tools
to clean more sophisticated text, well,

90
00:04:48,466 --> 00:04:50,700
you just need to ask me
some questions in the Q&A.

91
00:04:50,700 --> 00:04:54,100
And I'll help you add
these tools to your problem and your text.

92
00:04:54,733 --> 00:04:58,433
But what we'll do here,
you will definitely do it to perform

93
00:04:58,433 --> 00:05:01,600
natural language processing
on your text files for your problems.

94
00:05:02,233 --> 00:05:02,566
All right.

95
00:05:02,566 --> 00:05:05,000
So let's do it. Let's clean the text.

96
00:05:05,000 --> 00:05:07,700
So that's the next step
in natural language processing.

97
00:05:07,700 --> 00:05:09,300
We will clean all the text.

98
00:05:09,300 --> 00:05:13,600
And then we will create our bag of words
model which is this huge table

99
00:05:13,600 --> 00:05:17,633
which is actually called a sparse matrix
because we'll get a lot of zeros

100
00:05:17,633 --> 00:05:18,933
in the sparse matrix.
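[Editor's sketch] "Sparse" here means that most cells are zero. Assuming a small made-up count matrix (the values below are hypothetical, not from the data set), the share of zeros can be measured like this:
```r
# Hypothetical 3 x 6 count matrix: 3 reviews (rows), 6 words (columns).
counts <- matrix(
  c(1, 1, 0, 0, 0, 0,
    0, 0, 1, 1, 0, 0,
    0, 1, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE
)
# Fraction of cells equal to zero; closer to 1 means a sparser matrix.
sparsity <- mean(counts == 0)
print(sparsity)
```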

101
00:05:18,933 --> 00:05:22,733
And then that means that we'll get a model
where we have some independent variables

102
00:05:22,733 --> 00:05:24,333
and one dependent variable.

103
00:05:24,333 --> 00:05:27,133
And that's
when we'll be able to use our machine

104
00:05:27,133 --> 00:05:31,066
learning classification models
that we built in part three to predict

105
00:05:31,066 --> 00:05:34,833
the class of a new review
that the model will not have seen yet.

106
00:05:35,333 --> 00:05:36,666
So let's do it.

107
00:05:36,666 --> 00:05:39,866
Let's start with the first step
of cleaning the text.

108
00:05:40,366 --> 00:05:45,100
And this first step is going to be
about initializing a corpus, because,

109
00:05:45,433 --> 00:05:48,900
you know we will not clean the reviews
directly in the data set.

110
00:05:49,166 --> 00:05:53,366
We will instead create a corpus
which will contain all the reviews,

111
00:05:53,600 --> 00:05:56,666
and that will be in this corpus
that we will clean all the reviews.

112
00:05:57,033 --> 00:05:59,400
So let's start by creating this corpus.

113
00:05:59,400 --> 00:06:02,766
And in order to create this corpus
we need to import a package.

114
00:06:03,100 --> 00:06:04,366
And you might need to install it

115
00:06:04,366 --> 00:06:07,133
if it's the first time you're doing
natural language processing.

116
00:06:07,133 --> 00:06:10,900
So I'm going to type here
the command to install this package.

117
00:06:11,133 --> 00:06:13,033
This package is called the tm package.

118
00:06:13,033 --> 00:06:16,700
It's a very famous package
in R for natural language processing.

119
00:06:16,800 --> 00:06:21,566
So let's install this package
by typing install.packages.

120
00:06:21,566 --> 00:06:22,500
Here it is.

121
00:06:22,500 --> 00:06:28,800
And so the name of the package has to be
input in quotes, which is the tm package.

122
00:06:29,166 --> 00:06:30,066
All right.

123
00:06:30,066 --> 00:06:35,133
So if I go to my packages I will find

124
00:06:36,700 --> 00:06:37,500
the tm package.

125
00:06:37,500 --> 00:06:38,166
Here it is.

126
00:06:38,166 --> 00:06:41,166
So I already have it installed
so I don't need to install it again.

127
00:06:41,400 --> 00:06:44,400
So I will comment that out.

128
00:06:44,566 --> 00:06:45,433
Here we go.

129
00:06:45,433 --> 00:06:49,900
And of course if you don't have the tm
package here, you will need to install it

130
00:06:49,900 --> 00:06:53,266
by executing this line
and everything will run properly.

131
00:06:53,700 --> 00:06:54,666
All right.

132
00:06:54,666 --> 00:06:58,900
So of course after we install the package
we need to import the package.

133
00:06:59,200 --> 00:07:02,400
Well mine is already imported
but we need to automate all this.

134
00:07:02,400 --> 00:07:05,533
So as usual
we are going to take the library command.

135
00:07:06,133 --> 00:07:09,500
And in parentheses
we input the name of the package.

136
00:07:09,500 --> 00:07:12,500
So tm again, but not in quotes. All right.
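[Editor's sketch] One way to automate the install-then-import steps, assuming a CRAN mirror is configured. The requireNamespace check is an addition that replaces commenting the install line out by hand; it is not what the video types:
```r
# Install tm only when it is missing (package name in quotes).
if (!requireNamespace("tm", quietly = TRUE)) {
  install.packages("tm")
}
# Attach the package (name not in quotes).
library(tm)
```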

137
00:07:12,833 --> 00:07:16,000
And now we are ready to build the corpus.

138
00:07:16,666 --> 00:07:18,600
And so what are we going to call
our corpus?

139
00:07:18,600 --> 00:07:20,800
We're going to call it corpus.

140
00:07:20,800 --> 00:07:22,966
All right. Corpus equals.

141
00:07:22,966 --> 00:07:25,900
And now we're going to use the VCorpus
function, spelled

142
00:07:25,900 --> 00:07:28,900
this way:
VCorpus, with capital V and capital C.

143
00:07:28,933 --> 00:07:33,266
And then in parentheses
we need to input VectorSource.

144
00:07:33,800 --> 00:07:36,800
This way. And again new parentheses.

145
00:07:36,900 --> 00:07:39,866
And it's in these parentheses
that we input

146
00:07:39,866 --> 00:07:42,900
the column that contains the text
that we want to clean.

147
00:07:43,166 --> 00:07:44,500
In this corpus.

148
00:07:44,500 --> 00:07:48,100
So this column is of course
the review column of our data set.

149
00:07:48,500 --> 00:07:51,833
So here we will input this column
which can be taken the following way

150
00:07:52,133 --> 00:07:55,200
by typing dataset, then a dollar sign.

151
00:07:55,200 --> 00:07:59,100
And then as you can see we have the two
variables here that we can pick.

152
00:07:59,433 --> 00:08:01,933
And the one we want to pick is review.

153
00:08:01,933 --> 00:08:02,433
All right.

154
00:08:02,433 --> 00:08:06,200
And that takes the whole review
column of our data set.

155
00:08:06,200 --> 00:08:09,633
And that's exactly what we want
because this column contains the text.

156
00:08:09,833 --> 00:08:12,800
And we want to clean the text
in the corpus.

157
00:08:12,800 --> 00:08:13,200
All right.

158
00:08:13,200 --> 00:08:15,666
So that's the first step
of cleaning the text.

159
00:08:15,666 --> 00:08:17,266
So let's select this and press

160
00:08:17,266 --> 00:08:20,900
Command or Control plus enter to execute
and create the corpus.
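[Editor's sketch] The corpus creation described above, made self-contained with a two-row stand-in for the real data set. The column names Review and Liked and the sample texts are assumptions; the video assigns with "=", which behaves the same as "<-" here:
```r
library(tm)
# Hypothetical stand-in for the 1000-review data set (column names assumed).
dataset <- data.frame(
  Review = c("Wow... Loved this place.", "Crust is not good."),
  Liked = c(1, 0),
  stringsAsFactors = FALSE
)
# Wrap the text column in a VectorSource and build the corpus from it.
corpus <- VCorpus(VectorSource(dataset$Review))
length(corpus)             # one document per review
as.character(corpus[[1]])  # text of the first document
```
Cleaning then happens on corpus, so the reviews in dataset stay untouched.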

161
00:08:21,133 --> 00:08:22,233
Here we go.

162
00:08:22,233 --> 00:08:25,666
As you can see it
already says that it's a large corpus.

163
00:08:25,866 --> 00:08:30,600
And we can even see here
that our corpus takes 3.7 MB of space.

164
00:08:31,100 --> 00:08:35,000
We will simplify this corpus
by cleaning, step by step, all the reviews.

165
00:08:35,000 --> 00:08:37,566
And that's
what we'll do in the next tutorials.

166
00:08:37,566 --> 00:08:39,166
Until then, enjoy machine learning.