1
00:00:00,133 --> 00:00:02,466
Hello and welcome to this R tutorial.

2
00:00:02,466 --> 00:00:03,300
So that's it.

3
00:00:03,300 --> 00:00:04,300
We did the first step

4
00:00:04,300 --> 00:00:07,500
of natural language processing,
which consisted of cleaning the text

5
00:00:07,500 --> 00:00:08,566
we're working with.

6
00:00:08,566 --> 00:00:11,666
And now it's time to create the sparse
matrix of features

7
00:00:11,900 --> 00:00:13,766
containing all the different reviews

8
00:00:13,766 --> 00:00:17,300
in the rows and all the different words
of the reviews in the columns.

9
00:00:17,633 --> 00:00:21,600
So as a reminder,
what we're about to build is a huge table

10
00:00:21,800 --> 00:00:24,333
in which the rows are the 1000 reviews.

11
00:00:24,333 --> 00:00:26,800
So we're going to have one row
for each review.

12
00:00:26,800 --> 00:00:28,700
So we're going to have 1000 rows.

13
00:00:28,700 --> 00:00:31,133
And then the columns are going to contain
all the words

14
00:00:31,133 --> 00:00:33,900
we can find
in the 1000 reviews in this corpus.

15
00:00:33,900 --> 00:00:36,700
That is the 1000 cleaned reviews.

16
00:00:36,700 --> 00:00:39,633
And so basically what this means
is that we are going to take

17
00:00:39,633 --> 00:00:43,133
all the different words in the 1000
cleaned reviews in the corpus,

18
00:00:43,333 --> 00:00:46,433
and we're going to create one column
for each word.

19
00:00:46,733 --> 00:00:51,700
So suppose we count in total 1500
words in this corpus of reviews.

20
00:00:51,933 --> 00:00:56,700
Well, that means that our huge table
is going to contain 1500 columns.

21
00:00:57,066 --> 00:01:01,233
And then for each cell in this huge table,
well, then each cell will correspond

22
00:01:01,233 --> 00:01:05,633
to one review corresponding to the row
and one word corresponding to the column.

23
00:01:05,966 --> 00:01:07,700
And so the value that will contain

24
00:01:07,700 --> 00:01:11,233
the cell is the number of times
the word appears in the review.

25
00:01:11,666 --> 00:01:15,000
So, as we explained earlier,
since most of the words don't appear

26
00:01:15,000 --> 00:01:18,600
in the reviews, well,
most of the cells will contain a zero.

27
00:01:19,166 --> 00:01:22,633
And then of course we'll get a few ones
because each review is composed

28
00:01:22,633 --> 00:01:24,233
of between 5 and 10 words.

29
00:01:24,233 --> 00:01:28,766
So in each row we're going to have 5
or 10 cells having a one,

30
00:01:28,766 --> 00:01:31,800
and all the other cells will have zero,
and the cells that have

31
00:01:31,800 --> 00:01:35,266
a one will be in the columns corresponding
to the words that are in the review.

32
00:01:35,733 --> 00:01:39,566
And maybe sometimes,
but very rarely, we'll get a 2 or 3.

33
00:01:39,800 --> 00:01:43,633
That happens when the word appears twice
or three times in a review.

34
00:01:43,966 --> 00:01:45,400
I can give you a simple example.

35
00:01:45,400 --> 00:01:46,766
Let's imagine that we have

36
00:01:46,766 --> 00:01:50,933
a super positive review saying,
I love this restaurant very, very much.

37
00:01:51,166 --> 00:01:54,366
Well, in this review
the word very appears twice.

38
00:01:54,600 --> 00:01:58,200
So for this particular review
in the table, that is, let's say it's

39
00:01:58,200 --> 00:01:59,400
the row 100.

40
00:01:59,400 --> 00:02:01,566
Well, in this cell
that belongs to this row

41
00:02:01,566 --> 00:02:04,266
and that belongs to the column
corresponding to the word

42
00:02:04,266 --> 00:02:08,466
"very", well, we'll get a two
because very appears twice in this review.

43
00:02:09,033 --> 00:02:11,700
So that could happen. But it's very rare.

44
00:02:11,700 --> 00:02:15,700
And what's most important to understand
is that in this huge table

45
00:02:15,700 --> 00:02:20,633
we'll get mostly zeros, a few ones,
and very few twos or threes.

46
00:02:20,966 --> 00:02:25,366
And we'll get so many zeros
that we call this table a sparse matrix.

47
00:02:25,633 --> 00:02:28,833
A sparse matrix is a table
that contains a lot of zeros.

48
00:02:29,066 --> 00:02:31,433
It contains very few non-zero values.
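
To make the idea of the table concrete, here is a minimal base R sketch. The three reviews are made-up examples (not taken from the course data set), and the names `reviews`, `vocab`, `counts` and `sparsity` are only illustrative. It builds one row per review and one column per word, then measures the share of zero cells.

```r
# Toy cleaned reviews (hypothetical examples, not from the course data set)
reviews <- c("love restaurant very very much",
             "food bad",
             "service very slow")

# Split each review into words and collect the vocabulary (one column per word)
tokens <- strsplit(reviews, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count how many times each vocabulary word appears in each review
counts <- t(vapply(tokens,
                   function(words) as.integer(table(factor(words, levels = vocab))),
                   integer(length(vocab))))
dimnames(counts) <- list(paste0("review", seq_along(reviews)), vocab)
print(counts)

# Sparsity: the proportion of cells that are equal to zero
sparsity <- mean(counts == 0)
print(sparsity)
```

In this toy table the cell for the first review and the word "very" holds a 2, and most other cells hold a 0, which is the pattern described above.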

49
00:02:31,433 --> 00:02:32,333
And that's exactly

50
00:02:32,333 --> 00:02:35,333
what we're about to obtain
because of what we've just explained.

51
00:02:36,000 --> 00:02:39,900
And from the name of the table
we build comes the word sparsity.

52
00:02:40,200 --> 00:02:43,733
And sparsity refers to that situation
where we have a lot of zeros.

53
00:02:44,266 --> 00:02:48,066
And speaking of sparsity,
let's also keep in mind that if we clean

54
00:02:48,100 --> 00:02:51,900
all the reviews here in this first step
of natural language processing, it's

55
00:02:51,900 --> 00:02:55,766
in order to reduce as much as possible
this future sparsity

56
00:02:55,766 --> 00:02:58,900
that will occur in this huge table
that we're about to build.

57
00:02:59,333 --> 00:03:02,666
So that's the whole point behind this
first step here.

58
00:03:02,666 --> 00:03:03,833
Cleaning the text.

59
00:03:03,833 --> 00:03:05,766
It's to avoid having too much sparsity.

60
00:03:05,766 --> 00:03:09,900
That is, it's to avoid having a table
too big with too many columns.

61
00:03:09,900 --> 00:03:13,800
Because remember, one column
is created for each word in the corpus.

62
00:03:14,100 --> 00:03:17,700
So by doing all these steps here,
we removed a lot of words

63
00:03:17,700 --> 00:03:21,166
and a lot of characters,
punctuation, numbers, etc.

64
00:03:21,166 --> 00:03:24,200
so that in this final huge table
we get the minimum

65
00:03:24,200 --> 00:03:27,200
number of words and therefore the minimum
number of columns.

66
00:03:27,500 --> 00:03:30,833
And one last quick reminder
we are creating this table

67
00:03:30,833 --> 00:03:33,966
in order to have the framework
of classification models.

68
00:03:34,200 --> 00:03:38,133
That is, you know, having several
independent variables and one dependent

69
00:03:38,133 --> 00:03:38,733
variable.

70
00:03:38,733 --> 00:03:41,266
We haven't created the dependent
variable yet.

71
00:03:41,266 --> 00:03:44,266
It's actually in this data set here
we will just take

72
00:03:44,466 --> 00:03:47,900
the second column of this data set
because this contains the outcome.

73
00:03:47,900 --> 00:03:51,000
Whether the review is positive or negative
we can see that here.

74
00:03:51,000 --> 00:03:53,166
It's the second column, Liked: one

75
00:03:53,166 --> 00:03:56,166
if the review is positive
and zero if the review is negative.

76
00:03:56,166 --> 00:03:58,333
So that's the dependent variable column.

77
00:03:58,333 --> 00:04:03,566
And the independent variables are going
to be nothing else than these columns

78
00:04:03,566 --> 00:04:07,200
corresponding to each one of the words
in the cleaned reviews of the corpus.

79
00:04:07,433 --> 00:04:10,800
Because for each review,
that is, each observation, we can link

80
00:04:10,800 --> 00:04:13,833
the review to each of the columns,
because for each of the reviews,

81
00:04:13,833 --> 00:04:18,000
we can associate a value
for each of the columns, and this value is

82
00:04:18,000 --> 00:04:21,966
the number of times the word corresponding
to the column appears in the review.

83
00:04:22,200 --> 00:04:24,833
So that's how we create
our independent variables.

84
00:04:24,833 --> 00:04:27,333
And then we'll create
our dependent variable.

85
00:04:27,333 --> 00:04:31,200
And therefore we'll get the classification
model as we used to work with.

86
00:04:31,200 --> 00:04:34,200
And eventually we win
because we will have everything.

87
00:04:34,333 --> 00:04:38,300
We will have our independent variables,
we will have our dependent variable.

88
00:04:38,566 --> 00:04:41,400
And we already have
all our classification models.

89
00:04:41,400 --> 00:04:43,633
That's the models we made in part three.

90
00:04:43,633 --> 00:04:44,500
So we will just need

91
00:04:44,500 --> 00:04:48,433
to apply these models on our new data
set that we are about to create

92
00:04:48,600 --> 00:04:52,633
that contains the independent variables
as the words and the dependent variable

93
00:04:52,800 --> 00:04:55,800
as the Liked column
in our original data set.
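
As a rough sketch of that framework in base R: the word counts and the outcome values below are invented for illustration, and the column name `Liked` is an assumption about how the outcome column is named in the original data set.

```r
# Hypothetical word counts for three cleaned reviews (rows) and four words (columns)
counts <- matrix(c(0, 0, 1, 1,
                   1, 1, 0, 0,
                   0, 1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("bad", "food", "love", "place")))

# Hypothetical outcome column: 1 = positive review, 0 = negative review
liked <- c(1, 0, 1)

# Independent variables: one column of counts per word
dataset <- as.data.frame(counts)
# Dependent variable: the outcome, appended as the last column
dataset$Liked <- liked
str(dataset)
```

Each row is one observation, the word columns are the independent variables, and the last column is the dependent variable a classifier would learn to predict.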

94
00:04:55,800 --> 00:04:56,166
All right.

95
00:04:56,166 --> 00:04:58,200
So let's do it. Let's create this table.

96
00:04:58,200 --> 00:05:01,533
And in R we can do it
very efficiently using a function

97
00:05:01,866 --> 00:05:04,666
a function that is called
DocumentTermMatrix.

98
00:05:04,666 --> 00:05:08,400
And it's super easy because this function
will only take one argument.

99
00:05:08,600 --> 00:05:11,533
And as you might have guessed, it's
going to be the corpus.

100
00:05:11,533 --> 00:05:12,400
And that's it.

101
00:05:12,400 --> 00:05:17,333
This will create this huge sparse matrix
with all the 1000 reviews in the rows,

102
00:05:17,333 --> 00:05:20,333
and with all the words of the reviews
in the columns.

103
00:05:20,466 --> 00:05:21,466
So let's do it.

104
00:05:21,466 --> 00:05:25,733
Let's call
this sparse matrix of features DTM,

105
00:05:25,733 --> 00:05:29,333
because the function we're about to use
is DocumentTermMatrix.

106
00:05:29,333 --> 00:05:31,700
So so far we'll call it DTM.

107
00:05:31,700 --> 00:05:32,733
So equals.

108
00:05:32,733 --> 00:05:35,400
And then we use
this super function document

109
00:05:36,533 --> 00:05:37,366
term matrix.

110
00:05:37,366 --> 00:05:39,633
Here it is. I just need to press enter.

111
00:05:39,633 --> 00:05:44,966
And as I just said we just need
to input one argument which is our corpus.

112
00:05:45,400 --> 00:05:45,966
All right.

113
00:05:45,966 --> 00:05:47,966
And that's done. Here it is corpus.

114
00:05:47,966 --> 00:05:50,966
This will create our sparse
matrix of features.

115
00:05:51,033 --> 00:05:54,033
So I'm going to select this line
and execute

116
00:05:54,166 --> 00:05:57,166
and done
the sparse matrix of features is created.
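
A minimal sketch of the same call on a tiny stand-in corpus, assuming the `tm` package is installed. The three reviews are hypothetical; in the tutorial `corpus` holds the 1000 cleaned reviews. Note that `DocumentTermMatrix` also accepts an optional `control` argument, which is simply left at its default here.

```r
library(tm)  # assumes the tm package is installed

# Tiny stand-in corpus (hypothetical reviews, not the course data)
corpus <- VCorpus(VectorSource(c("love restaurant very very much",
                                 "food bad",
                                 "service very slow")))

# One row per review, one column per term, each cell a count
dtm <- DocumentTermMatrix(corpus)

print(dim(dtm))  # number of rows (reviews) and number of columns (terms)
print(dtm)       # summary, including the sparsity percentage
inspect(dtm)     # shows the counts themselves
```

Printing `dtm` in the console gives the same kind of summary that is used later to read off the sparsity.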

117
00:05:57,233 --> 00:05:59,166
It appears right here DTM.

118
00:05:59,166 --> 00:06:02,166
We can click on this button here
to have some info.

119
00:06:02,233 --> 00:06:04,366
And actually what's interesting to see
now is the total

120
00:06:04,366 --> 00:06:08,166
number of words counted in the corpus
to create all the columns.

121
00:06:08,166 --> 00:06:12,000
And we can see this total count here 1577.

122
00:06:12,433 --> 00:06:15,800
So that means that the number of columns
indicated by ncol

123
00:06:15,800 --> 00:06:18,900
here in our document-term matrix

124
00:06:18,900 --> 00:06:22,766
or sparse matrix is 1577.

125
00:06:23,133 --> 00:06:26,600
So that means that this huge table
has 1000 rows.

126
00:06:26,600 --> 00:06:29,600
So we expected this
because of course we have 1000 reviews,

127
00:06:29,833 --> 00:06:33,566
but we didn't expect the number of columns
in total because simply

128
00:06:33,566 --> 00:06:35,633
that was the total number of words
in the reviews.

129
00:06:35,633 --> 00:06:36,966
So we couldn't count them.

130
00:06:36,966 --> 00:06:40,100
But we can see this number here 1577.

131
00:06:40,500 --> 00:06:41,933
So that's already a big table.

132
00:06:41,933 --> 00:06:44,966
But be prepared
if you're working with more complicated

133
00:06:44,966 --> 00:06:47,966
text or longer
text like articles or books.

134
00:06:48,266 --> 00:06:50,933
Well,
you might get a lot more columns here

135
00:06:50,933 --> 00:06:53,066
because you will get a lot more words.

136
00:06:53,066 --> 00:06:54,300
So what you'll have to do

137
00:06:54,300 --> 00:06:58,200
and you can ask me any questions
about that in the Q&A, is reduce

138
00:06:58,200 --> 00:07:02,666
even more the sparsity
by filtering the words in your text.

139
00:07:03,033 --> 00:07:06,200
And speaking of filtering,
that's what we're going to do right now.

140
00:07:06,600 --> 00:07:10,900
We are going to apply a filter to clean
even more the reviews

141
00:07:11,233 --> 00:07:14,233
by only considering
the most frequent words.

142
00:07:14,266 --> 00:07:18,233
That means that it's like
we're going to add a step in this text

143
00:07:18,233 --> 00:07:19,433
cleaning process,

144
00:07:19,433 --> 00:07:23,900
which will consist of only taking
the words that are the most frequent.

145
00:07:24,200 --> 00:07:27,200
For example,
the words that appear in only one review,

146
00:07:27,300 --> 00:07:30,200
well, they might be removed
because they're not frequent,

147
00:07:30,200 --> 00:07:31,533
they only appear once.

148
00:07:31,533 --> 00:07:35,600
Only one cell in the matrix contains one,
because these words only appear

149
00:07:35,600 --> 00:07:36,666
in one review.

150
00:07:36,666 --> 00:07:38,700
And these words, of course,
are not very relevant

151
00:07:38,700 --> 00:07:42,600
because since they only appear
in one review, well, our machine learning

152
00:07:42,600 --> 00:07:45,900
classification model will not be able
to establish any correlation

153
00:07:45,900 --> 00:07:49,766
between this word and the outcome, whether
the review is positive or negative,

154
00:07:49,900 --> 00:07:52,400
because indeed, to understand
such correlations,

155
00:07:52,400 --> 00:07:55,400
the word would need to appear
in at least two reviews.

156
00:07:55,600 --> 00:07:58,133
So that's the kind of words
we're going to remove.

157
00:07:58,133 --> 00:08:01,100
And again
this is in order to reduce sparsity.

158
00:08:01,100 --> 00:08:03,900
And speaking of sparsity I will show you
something very interesting.

159
00:08:03,900 --> 00:08:08,100
Right now
if we go to the console and type here DTM,

160
00:08:08,433 --> 00:08:13,000
then we'll get other information
about this sparse matrix of features.

161
00:08:13,333 --> 00:08:14,700
And the information that I want to

162
00:08:14,700 --> 00:08:18,333
highlight here is of course
this sparsity information.

163
00:08:18,733 --> 00:08:22,666
And as you can see
the sparsity is 100% right now.

164
00:08:22,866 --> 00:08:25,333
And that's because there are
a lot of zeros in the matrix.

165
00:08:25,333 --> 00:08:29,033
And also because we haven't filtered
any non frequent word yet.

166
00:08:29,233 --> 00:08:30,533
So that's what we'll do right now.

167
00:08:30,533 --> 00:08:33,600
We will filter all the words that appear
only once.

168
00:08:33,800 --> 00:08:36,900
We will filter all the words
that are not frequent in the reviews.

169
00:08:37,400 --> 00:08:37,666
All right.

170
00:08:37,666 --> 00:08:38,666
So let's do it.

171
00:08:38,666 --> 00:08:42,900
To do this we are going to update
our document term matrix.

172
00:08:43,066 --> 00:08:45,900
So we're taking again DTM here.

173
00:08:45,900 --> 00:08:46,366
All right.

174
00:08:46,366 --> 00:08:50,400
Because we're updating our sparse matrix
and equals.

175
00:08:50,833 --> 00:08:53,833
And now we're going to use a function
a very practical function

176
00:08:53,833 --> 00:08:57,500
that will filter the non frequent words
of our sparse matrix

177
00:08:57,766 --> 00:09:00,766
which so far is nothing else than DTM.

178
00:09:01,000 --> 00:09:03,233
So DTM will be one of the inputs.

179
00:09:03,233 --> 00:09:07,200
And we will filter all the non
frequent words by specifying a proportion

180
00:09:07,200 --> 00:09:10,633
of non frequent words that we want
to remove from the sparse matrix.

181
00:09:10,966 --> 00:09:13,600
And this proportion of non
frequent words will be obtained

182
00:09:13,600 --> 00:09:15,766
thanks to the second
input of this function.

183
00:09:15,766 --> 00:09:17,033
Because the second input is

184
00:09:17,033 --> 00:09:20,600
the percentage of the most frequent words
we want to keep in the reviews.

185
00:09:20,900 --> 00:09:23,800
So let's say we want to keep 99%

186
00:09:23,800 --> 00:09:26,800
of the words in the review
that are the most frequent words.

187
00:09:26,866 --> 00:09:30,000
Well, this second input
will take the value of 99%.

188
00:09:30,633 --> 00:09:31,933
So let's use this function.

189
00:09:31,933 --> 00:09:35,566
This function is removeSparseTerms.

190
00:09:35,766 --> 00:09:37,800
Here it is, removeSparseTerms.

191
00:09:37,800 --> 00:09:41,666
So pressing enter and ready
to input the two arguments.

192
00:09:41,666 --> 00:09:43,100
So the first argument

193
00:09:43,100 --> 00:09:47,066
is of course the sparse matrix
on which we want to apply this filtering.

194
00:09:47,066 --> 00:09:49,700
So of course it's DTM.

195
00:09:49,700 --> 00:09:50,666
All right.

196
00:09:50,666 --> 00:09:54,366
And the second input is the proportion
of words that are the most frequent words.

197
00:09:54,533 --> 00:09:57,066
And that will be kept
in this sparse matrix.

198
00:09:57,066 --> 00:10:00,833
So let's say we want to keep 99%
of the most frequent words.

199
00:10:01,100 --> 00:10:04,233
Well we would need to input here
0.99.

200
00:10:04,533 --> 00:10:07,333
And therefore
we will build the same sparse matrix,

201
00:10:07,333 --> 00:10:10,333
but this time containing 99% of the words

202
00:10:10,366 --> 00:10:13,600
that are the most frequent
in this sparse matrix of features.

203
00:10:13,900 --> 00:10:17,133
And therefore, you know, we're not looking
at the corpus containing all the words

204
00:10:17,233 --> 00:10:20,866
and counting the most frequent words
of this corpus. What removeSparseTerms

205
00:10:20,866 --> 00:10:25,033
will do is look at all
the columns of the sparse matrix here,

206
00:10:25,200 --> 00:10:30,000
and then keep 99% of the columns
that have the most ones in the columns.

207
00:10:30,000 --> 00:10:33,800
Because each column corresponds to a word,
and therefore, when there are very

208
00:10:33,800 --> 00:10:38,133
few ones in the columns, that means that
this word appears in very few reviews,

209
00:10:38,533 --> 00:10:39,833
and therefore these are the words

210
00:10:39,833 --> 00:10:43,533
that are non frequent in the reviews
and accordingly not relevant.

211
00:10:43,800 --> 00:10:46,200
And that's why we can remove them.
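
A small sketch of the filter on a made-up corpus, assuming the `tm` package is installed. As I read the `tm` documentation, the second argument of `removeSparseTerms` is a maximum sparsity per term: a term is kept only if the share of documents in which it does not appear is below that value. Under that reading, a value of 0.999 with 1000 reviews keeps the terms that appear in more than 1000 * 0.001 = 1 review, that is, in at least two reviews.

```r
library(tm)  # assumes the tm package is installed

# Hypothetical four-document corpus: "good" appears in 3 documents,
# "service" in 2, and "food", "place", "slow" in 1 each
corpus <- VCorpus(VectorSource(c("good food",
                                 "good service",
                                 "slow service",
                                 "good place")))
dtm <- DocumentTermMatrix(corpus)

# Keep only terms whose share of zero cells is below 0.6,
# i.e. terms present in more than 40% of the 4 documents (at least 2 documents)
dtm_filtered <- removeSparseTerms(dtm, 0.6)

print(Terms(dtm))           # all five terms
print(Terms(dtm_filtered))  # only the terms that survive the filter
```

The terms that occur in a single document are dropped, which mirrors the removal of words that appear in only one review.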

212
00:10:46,200 --> 00:10:46,500
All right.

213
00:10:46,500 --> 00:10:47,166
So let's do it.

214
00:10:47,166 --> 00:10:49,533
Let's apply the filter. To be cautious,

215
00:10:49,533 --> 00:10:53,500
let's maybe take a higher proportion
of frequent words we keep.

216
00:10:53,766 --> 00:10:56,700
Because actually with 99%
we might remove a lot of words.

217
00:10:56,700 --> 00:10:59,333
You can try it in your RStudio to see.

218
00:10:59,333 --> 00:11:02,766
But here since we don't have many reviews,
you know, we have 1000 reviews.

219
00:11:02,766 --> 00:11:05,433
That is not much compared to other texts

220
00:11:05,433 --> 00:11:08,000
we can work with in natural language
processing.

221
00:11:08,000 --> 00:11:12,966
Let's be careful here and apply a 99.9%
proportion of frequent words.

222
00:11:13,166 --> 00:11:15,233
So here I'm going to add a nine

223
00:11:15,233 --> 00:11:18,666
and you'll see that it will already remove
quite a lot of words.

224
00:11:18,800 --> 00:11:22,566
So let's try
it I'm going to select this and execute.

225
00:11:23,133 --> 00:11:25,866
And indeed as you can see we now have

226
00:11:25,866 --> 00:11:29,033
691 columns in the sparse matrix.

227
00:11:29,266 --> 00:11:32,366
That is we only kept 691 words.

228
00:11:32,700 --> 00:11:34,200
So clearly we can see that

229
00:11:34,200 --> 00:11:37,866
by keeping 99.9% of the words
that are the most frequent.

230
00:11:38,133 --> 00:11:41,133
Well,
that already removes almost 1000 words,

231
00:11:41,366 --> 00:11:45,133
because originally we had remember
more than 1500 words.

232
00:11:45,466 --> 00:11:46,800
So be careful with this.

233
00:11:46,800 --> 00:11:49,800
Be careful not to apply a too low

234
00:11:49,800 --> 00:11:53,033
proportion of frequent words
you want to keep. And to choose that,

235
00:11:53,033 --> 00:11:55,466
remember
to look at the total number of words

236
00:11:55,466 --> 00:11:58,200
that is counted
when you build this first sparse matrix.

237
00:11:58,200 --> 00:12:01,000
And of course, you can also
choose this number by considering

238
00:12:01,000 --> 00:12:04,000
the total number of reviews
you have in your original data set.

239
00:12:04,300 --> 00:12:06,966
And you know,
since we only had 1000 reviews,

240
00:12:06,966 --> 00:12:09,966
well, that's
why we take such a high proportion here.

241
00:12:10,200 --> 00:12:12,500
And let's see by
how much we reduce the sparsity.

242
00:12:12,500 --> 00:12:17,133
So we have to type DTM again
because our document-term matrix,

243
00:12:17,133 --> 00:12:21,166
that is our sparse matrix was just updated
with all these words removed.

244
00:12:21,366 --> 00:12:26,000
So pressing enter here and the sparsity
now became 99%.

245
00:12:26,400 --> 00:12:27,133
So better.

246
00:12:27,133 --> 00:12:30,133
But anyway that was fine
because we didn't have too many columns.

247
00:12:30,533 --> 00:12:32,566
You will see that
if you work with larger text

248
00:12:32,566 --> 00:12:35,766
you will get a lot more words
and therefore a lot more columns.

249
00:12:36,366 --> 00:12:38,633
All right.
So that will be all for this tutorial.

250
00:12:38,633 --> 00:12:40,800
We built our Bag of Words model.

251
00:12:40,800 --> 00:12:42,333
Congratulations for that.

252
00:12:42,333 --> 00:12:45,233
And now it's time
to make the classification model.

253
00:12:45,233 --> 00:12:48,466
So that's what we'll do in the next
and final tutorial of this section.

254
00:12:48,700 --> 00:12:50,400
And until then enjoy machine learning.