1
00:00:00,233 --> 00:00:01,100
Hello my friends.

2
00:00:01,100 --> 00:00:05,133
Are you ready for the most essential step
of this implementation,

3
00:00:05,133 --> 00:00:07,800
which is at the heart of sentiment
analysis,

4
00:00:07,800 --> 00:00:12,100
that is, creating the Bag of Words model,
which we are ready to do now

5
00:00:12,100 --> 00:00:15,333
because all our reviews
are properly cleaned.

6
00:00:15,333 --> 00:00:19,366
So we're going to get them
into the bag of words model to create,

7
00:00:19,366 --> 00:00:23,100
you know, the sparse matrix
which will contain, in the rows,

8
00:00:23,100 --> 00:00:27,100
well, the different reviews, you know,
the same reviews as the ones in our corpus

9
00:00:27,333 --> 00:00:30,300
and in the columns,
all the different words

10
00:00:30,300 --> 00:00:33,300
taken from all the different reviews,
you know, all of them.

11
00:00:33,466 --> 00:00:36,766
And each cell will either get a 0 or 1.

12
00:00:36,966 --> 00:00:38,400
It will get a zero

13
00:00:38,400 --> 00:00:43,866
if the word of the column is not in the
review of the row, and it will get a one

14
00:00:44,033 --> 00:00:48,000
if the word of the column is indeed
part of the words

15
00:00:48,000 --> 00:00:52,200
in the review of the row. All right,
so that's what the sparse matrix is about.

16
00:00:52,200 --> 00:00:55,866
And the process of creating
all these columns corresponding

17
00:00:55,900 --> 00:01:00,700
to each of the words taken from
all the reviews is called tokenization.

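[Note] The rows-by-words idea described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the three cleaned reviews are stand-ins for the real corpus, and the names `vocabulary` and `matrix` are hypothetical, not taken from the notebook.

```python
# Minimal sketch of the 0/1 bag-of-words idea (plain Python, no libraries).
# The three cleaned reviews below are illustrative stand-ins for the corpus.
corpus = ["wow love place", "crust not good", "not tasti textur nasti"]

# Tokenization: one column per distinct word found across all reviews.
vocabulary = sorted({word for review in corpus for word in review.split()})

# One row per review; a cell is 1 if the column's word is in the review, else 0.
matrix = [[1 if word in review.split() else 0 for word in vocabulary]
          for review in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are 0, because each review only uses a few of the words, which is why the result is called a sparse matrix.
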
18
00:01:00,700 --> 00:01:03,100
So that's exactly what we'll do
in this new cell.

19
00:01:03,100 --> 00:01:05,933
But first let me actually show you
what we created.

20
00:01:05,933 --> 00:01:08,300
You know,
I just want to show you the corpus.

21
00:01:08,300 --> 00:01:12,033
So actually right here
we're going to create a new code cell.

22
00:01:12,333 --> 00:01:16,766
And I'm
just going to do a print of the corpus

23
00:01:17,133 --> 00:01:20,000
so that I can show
you indeed what we created.

24
00:01:20,000 --> 00:01:21,500
So let's press play here.

25
00:01:21,500 --> 00:01:24,233
And this will show the corpus.

26
00:01:24,233 --> 00:01:24,900
All right.

27
00:01:24,900 --> 00:01:27,766
So this is the first review
after the cleaning.

28
00:01:27,766 --> 00:01:30,800
You know, after all this cleaning process
in different steps.

29
00:01:31,000 --> 00:01:34,000
Remember I can actually show you
the original review here.

30
00:01:34,200 --> 00:01:37,933
The original review was "Wow", with,
you know, capital letters,

31
00:01:37,933 --> 00:01:42,033
followed here by the three
little dots, and then "Loved this place."

32
00:01:42,300 --> 00:01:47,233
And after the cleaning process
it became "wow love place", indeed.

33
00:01:47,233 --> 00:01:51,300
We removed all the stopwords,
such as, you know, "this".

34
00:01:51,333 --> 00:01:54,600
You know, that's an article
that doesn't give any hint on

35
00:01:54,600 --> 00:01:56,700
whether the review is positive
or negative.

36
00:01:56,700 --> 00:01:57,600
However, of course

37
00:01:57,600 --> 00:02:01,800
we kept loved because loved means
of course that the review is positive.

38
00:02:01,966 --> 00:02:05,166
However, we transformed loved into love.

39
00:02:05,400 --> 00:02:07,133
That's the process of stemming.

40
00:02:07,133 --> 00:02:10,333
So we can simplify all the words
to their roots.

41
00:02:10,633 --> 00:02:14,133
And then of course, we kept place
because that's of course not a stopword.

42
00:02:14,333 --> 00:02:17,366
All right
then let's have a look at the second one.

43
00:02:17,433 --> 00:02:19,800
Crust is not good.

44
00:02:19,800 --> 00:02:20,100
All right.

45
00:02:20,100 --> 00:02:22,966
So let's try to guess
actually how it was transformed.

46
00:02:22,966 --> 00:02:27,333
So "Crust" was just transformed
into "crust" with a lowercase c.

47
00:02:27,733 --> 00:02:29,866
Then "is" was probably removed.

48
00:02:29,866 --> 00:02:30,233
Right.

49
00:02:30,233 --> 00:02:33,133
Because it doesn't give any hint on
whether the review was positive

50
00:02:33,133 --> 00:02:33,900
or negative.

51
00:02:33,900 --> 00:02:37,700
"Not" was definitely kept
because that's a negative statement.

52
00:02:37,966 --> 00:02:41,266
And "good" was of course kept. Okay.

53
00:02:41,266 --> 00:02:44,666
So after the transformation, you know,
after all the cleaning,

54
00:02:44,666 --> 00:02:48,433
this review must become "crust"
with a lowercase c.

55
00:02:48,700 --> 00:02:50,166
"not good".

56
00:02:50,166 --> 00:02:52,000
Let's check that it is the case.

57
00:02:52,000 --> 00:02:54,600
And oh okay.

58
00:02:54,600 --> 00:02:58,433
So actually they removed the "not",
which is a bit strange

59
00:02:58,433 --> 00:03:02,166
actually because you know,
"not" clearly indicates a negative thing.

60
00:03:02,166 --> 00:03:03,333
You know, a negative review.

61
00:03:03,333 --> 00:03:07,733
We clearly have a difference between
"crust is good" and "crust is not good".

62
00:03:08,133 --> 00:03:13,300
So I think we need to do some extra work
here in order to not include

63
00:03:13,566 --> 00:03:17,033
the word "not" in the stopwords.

64
00:03:17,033 --> 00:03:19,133
And I'm going to show you how you can do
this.

65
00:03:19,133 --> 00:03:20,633
It's very easy.

66
00:03:20,633 --> 00:03:22,733
So we're going to work again on this code.

67
00:03:22,733 --> 00:03:25,733
I'm actually going to take this here,
you know,

68
00:03:26,033 --> 00:03:28,666
stopwords.words('english').

69
00:03:28,666 --> 00:03:31,366
I'm going to cut that then

70
00:03:31,366 --> 00:03:35,300
right here in a new line of code,
I'm going to paste that.

71
00:03:35,700 --> 00:03:38,933
Then I'm going to create
actually a new variable

72
00:03:38,966 --> 00:03:42,866
which I'm going to call
all_stopwords.

73
00:03:43,333 --> 00:03:44,133
Right.

74
00:03:44,133 --> 00:03:47,133
And which will be equal to exactly this.

75
00:03:47,166 --> 00:03:50,166
But then what I'm going to do just below

76
00:03:50,300 --> 00:03:53,100
is to take this again, which was now

77
00:03:53,100 --> 00:03:56,400
created as, you know, this whole
ensemble of all the stopwords.

78
00:03:56,400 --> 00:03:59,800
But we don't want to include
"not" in the stopwords, because that's

79
00:03:59,800 --> 00:04:04,033
clearly a negative term
therefore indicating a negative review.

80
00:04:04,266 --> 00:04:06,066
So I'm going to paste that here.

81
00:04:06,066 --> 00:04:09,066
And I'm
just going to add here a .remove().

82
00:04:09,600 --> 00:04:14,366
And in the parentheses
I'm simply going to include, in quotes, 'not'.

83
00:04:14,700 --> 00:04:15,000
All right.

84
00:04:15,000 --> 00:04:19,233
So that will remove the word "not"
from the stopwords.

85
00:04:19,433 --> 00:04:20,500
And therefore here

86
00:04:21,700 --> 00:04:22,733
instead of

87
00:04:22,733 --> 00:04:26,233
taking the set of the original
ensemble of stopwords,

88
00:04:26,366 --> 00:04:29,933
well, we're now going to take the original
ensemble of stopwords,

89
00:04:29,933 --> 00:04:33,166
excluding this time the word "not".

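[Note] A minimal sketch of the stopwords change just described. To keep it runnable without any download, the short list below is a hand-written stand-in (an assumption) for what the notebook gets from stopwords.words('english'); only the variable name all_stopwords and the .remove('not') call come from the video.

```python
# Stand-in for stopwords.words('english') from NLTK (assumption: a tiny
# hand-written list so the sketch runs without downloading anything).
all_stopwords = ["i", "is", "the", "this", "not", "and", "was"]

# Take "not" out of the stopwords so negations survive the cleaning.
all_stopwords.remove("not")
stopword_set = set(all_stopwords)

review = "crust is not good"
cleaned = [word for word in review.split() if word not in stopword_set]
print(" ".join(cleaned))
```

With "not" left in the list, the same review would be cleaned down to "crust good", which reads as positive.
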
90
00:04:33,166 --> 00:04:34,333
Let's see if it works.

91
00:04:34,333 --> 00:04:38,500
I'm kind of improvising things here,
but it might work.

92
00:04:38,500 --> 00:04:41,833
So we actually have to restart
the runtime.

93
00:04:41,833 --> 00:04:42,700
So let's do this.

94
00:04:42,700 --> 00:04:44,100
Restart runtime.

95
00:04:44,100 --> 00:04:47,100
Yes we still have our data set. All good.

96
00:04:47,300 --> 00:04:48,533
And now let's see if this works.

97
00:04:48,533 --> 00:04:50,633
So we're going to re-execute the cells.

98
00:04:50,633 --> 00:04:51,500
I cannot do a Run all

99
00:04:51,500 --> 00:04:54,433
here
because the implementation is not over.

100
00:04:54,433 --> 00:04:56,333
But let's import the libraries first.

101
00:04:56,333 --> 00:04:57,966
Now the data set.

102
00:04:57,966 --> 00:05:00,966
And now let's clean the text.

103
00:05:01,000 --> 00:05:03,033
I hope this will work.

104
00:05:03,033 --> 00:05:04,500
Let's play.

105
00:05:04,500 --> 00:05:07,300
All right. This seems to be good. Good.

106
00:05:07,300 --> 00:05:09,000
Now let's remove this output.

107
00:05:09,000 --> 00:05:11,900
This was the previous output right.

108
00:05:11,900 --> 00:05:13,300
And now let's print the corpus.

109
00:05:13,300 --> 00:05:18,166
And let's hope that the second review
is now no longer, you know, "crust good",

110
00:05:18,166 --> 00:05:21,200
but indeed "crust not good". Okay.

111
00:05:21,666 --> 00:05:22,666
So let's press play.

112
00:05:22,666 --> 00:05:26,200
And perfect. Okay, good, I'm relieved.

113
00:05:26,233 --> 00:05:29,633
You know, it was really bad
to remove the "not" because it's

114
00:05:29,633 --> 00:05:33,066
clearly a negative term
indicating a negative review.

115
00:05:33,600 --> 00:05:34,733
All right. So much better now.

116
00:05:34,733 --> 00:05:36,766
And actually you know
same for the next one.

117
00:05:36,766 --> 00:05:39,133
"Not tasty, texture nasty."

118
00:05:39,133 --> 00:05:41,433
That definitely means a negative review.

119
00:05:41,433 --> 00:05:44,166
Let's actually check that right.

120
00:05:44,166 --> 00:05:45,600
Yes. Not tasty.

121
00:05:45,600 --> 00:05:48,233
And whatever, zero, negative review.

122
00:05:48,233 --> 00:05:49,666
And same for this one.

123
00:05:49,666 --> 00:05:50,400
All right. So good.

124
00:05:50,400 --> 00:05:52,466
We have actually a much better model now.

125
00:05:52,466 --> 00:05:53,700
So we can continue.

126
00:05:53,700 --> 00:05:57,000
And we can now create the Bag
of Words model.

127
00:05:57,666 --> 00:05:58,800
All right. So let's do this.

128
00:05:58,800 --> 00:06:03,700
Let's actually scroll down a bit
and there we go new code cell.

129
00:06:03,700 --> 00:06:07,100
And now let's proceed
with this tokenization to create

130
00:06:07,100 --> 00:06:11,500
a sparse matrix containing all the reviews
in different rows and all the words

131
00:06:11,500 --> 00:06:13,433
from all the reviews
in the different columns,

132
00:06:13,433 --> 00:06:18,166
where the cells will get a one if the word
is in the review, and a zero otherwise.

133
00:06:18,533 --> 00:06:18,966
All right.

134
00:06:18,966 --> 00:06:22,500
So we're going to do this
actually with scikit-learn.

135
00:06:22,500 --> 00:06:26,866
You know, the tokenization process
will be done thanks to a class from

136
00:06:26,866 --> 00:06:27,300
scikit-learn.

137
00:06:27,300 --> 00:06:31,433
More specifically, from a module of scikit-learn
called feature_extraction.

138
00:06:31,700 --> 00:06:34,700
And that class is called CountVectorizer.

139
00:06:35,033 --> 00:06:35,333
All right.

140
00:06:35,333 --> 00:06:35,933
So let's do this.

141
00:06:35,933 --> 00:06:38,733
Let's start with from scikit-learn.

142
00:06:38,733 --> 00:06:41,833
You know this library very well, as sklearn,

143
00:06:42,200 --> 00:06:45,400
from which
we're going to call that feature...

144
00:06:45,400 --> 00:06:46,200
There we go.

145
00:06:46,200 --> 00:06:49,966
...extraction module, from which,
actually, you know, it's not over,

146
00:06:50,000 --> 00:06:54,633
we're going to get access
to the submodule called text,

147
00:06:54,900 --> 00:06:58,333
from which we're going to import
that

148
00:06:58,966 --> 00:07:00,666
CountVectorizer class. Perfect.

149
00:07:00,666 --> 00:07:04,000
I really love Google Colab
when it assists me like this.

150
00:07:04,000 --> 00:07:05,066
Well okay.

151
00:07:05,066 --> 00:07:06,233
So we have the class.

152
00:07:06,233 --> 00:07:08,433
Now, you know,
what is the next natural step?

153
00:07:08,433 --> 00:07:11,333
It is to create an instance of this class.

154
00:07:11,333 --> 00:07:15,933
And we're going to call that cv,
as in CountVectorizer, which will be created

155
00:07:15,933 --> 00:07:19,433
as, you know, an instance of this

156
00:07:19,833 --> 00:07:24,600
CountVectorizer class, perfect,
which has to take as input

157
00:07:24,700 --> 00:07:27,033
only one important parameter.

158
00:07:27,033 --> 00:07:29,300
Can you actually guess what it is?

159
00:07:29,300 --> 00:07:33,133
Well, it is actually the maximum size
of the sparse matrix.

160
00:07:33,133 --> 00:07:35,000
You know, the maximum number of columns.

161
00:07:35,000 --> 00:07:37,133
Therefore the maximum number of words

162
00:07:37,133 --> 00:07:40,133
you want to include
in the columns of the sparse matrix.

163
00:07:40,600 --> 00:07:44,433
And why is this important? That's because,
you know, in our corpus of reviews

164
00:07:44,433 --> 00:07:48,800
now with all the simplifications, well,
we actually still have some words

165
00:07:48,800 --> 00:07:50,766
that are not relevant or,

166
00:07:50,766 --> 00:07:54,000
you know, not helpful to predict
if a review is positive or negative,

167
00:07:54,200 --> 00:07:55,933
even if they were not part
of the stopwords.

168
00:07:55,933 --> 00:07:58,766
And these include, for example,
you know, texture,

169
00:07:58,766 --> 00:08:02,466
you know, texture doesn't really help
to predict if a review is positive

170
00:08:02,466 --> 00:08:05,900
or negative or,
you know, bank, you know, or holiday

171
00:08:06,133 --> 00:08:09,366
or Rick and even Steve,
you know, Steve doesn't help at all.

172
00:08:09,600 --> 00:08:13,333
So we still have these words which,
even if they're not part of the stopwords,

173
00:08:13,466 --> 00:08:16,633
don't help at all to predict
if a review is positive or negative.

174
00:08:17,100 --> 00:08:21,100
And the way to get rid of them is by,
you know, entering this parameter

175
00:08:21,100 --> 00:08:25,300
that we're about to enter. The way
to get rid of them is just to take

176
00:08:25,300 --> 00:08:29,800
actually the most frequent words,
you know, the words that appear

177
00:08:29,800 --> 00:08:34,200
most frequently in the reviews, because
probably here Steve only appears once.

178
00:08:34,200 --> 00:08:37,466
So if we only take
the most frequent words, we won't include

179
00:08:37,466 --> 00:08:41,700
Steve in this sparse matrix,
you know, in the tokenization process.

180
00:08:42,066 --> 00:08:43,866
So that's the trick.

181
00:08:43,866 --> 00:08:48,300
And so now we need to just choose
a maximum size of the sparse matrix.

182
00:08:48,433 --> 00:08:51,500
However, we can't really know now
how many words

183
00:08:51,500 --> 00:08:54,533
there are in total, you know
before we take the most frequent ones.

184
00:08:54,733 --> 00:08:57,600
So what we'll do in fact is
we will leave this for now.

185
00:08:57,600 --> 00:08:59,400
You know, we won't enter this parameter.

186
00:08:59,400 --> 00:09:03,633
Now we will run this cell
once we create the sparse matrix,

187
00:09:03,633 --> 00:09:07,100
which is actually going
to be the matrix of features when training

188
00:09:07,100 --> 00:09:11,100
our Naive Bayes model on the training set,
it's going to be the matrix of features.

189
00:09:11,233 --> 00:09:15,500
And therefore we will do a print in order
to know the total number of columns.

190
00:09:15,733 --> 00:09:17,966
And we will get therefore,
the total number of words.

191
00:09:17,966 --> 00:09:22,533
And then we can reduce that
total number of words to a lower number

192
00:09:22,533 --> 00:09:23,933
of the most frequent words

193
00:09:23,933 --> 00:09:28,200
in the sparse matrix, so that we can
simplify even more the bag of words model.

194
00:09:28,266 --> 00:09:29,900
Okay, so that's what we'll do.

195
00:09:29,900 --> 00:09:32,200
Therefore, for now, let's not enter anything.

196
00:09:32,200 --> 00:09:35,566
Let's just continue
to create that bag of words model.

197
00:09:36,700 --> 00:09:37,200
All right.

198
00:09:37,200 --> 00:09:41,733
And actually speaking of the matrix
of features that's exactly our next step.

199
00:09:41,733 --> 00:09:45,000
Here we are ready
thanks to this CountVectorizer class

200
00:09:45,233 --> 00:09:49,566
to create the matrix of features
which is indeed that sparse matrix.

201
00:09:49,800 --> 00:09:53,233
So we're going to call it X as usual,
like every

202
00:09:53,233 --> 00:09:55,200
one of our previous matrices of features.

203
00:09:55,200 --> 00:09:56,566
So X equals...

204
00:09:56,566 --> 00:09:59,266
And now, according to you,
what is the next step here?

205
00:09:59,266 --> 00:10:01,033
Well, you guessed it, we're going

206
00:10:01,033 --> 00:10:05,066
to create this sparse matrix
thanks to our cv object.

207
00:10:05,200 --> 00:10:09,333
So there we go, I'm calling cv first,
from which I'm going to call now

208
00:10:09,333 --> 00:10:13,800
a method which you know very well,
which we already called many times.

209
00:10:14,133 --> 00:10:19,000
And that method is the fit_transform
method.

210
00:10:19,333 --> 00:10:20,033
All right.

211
00:10:20,033 --> 00:10:23,633
The fit_transform
method, which will indeed fit, well,

212
00:10:23,633 --> 00:10:26,233
you know, the input of this fit_transform
method, which will be,

213
00:10:26,233 --> 00:10:30,166
you know, I'll tell you now, the corpus.
It will fit the corpus to X.

214
00:10:30,433 --> 00:10:31,500
And what does it mean?

215
00:10:31,500 --> 00:10:33,066
It means exactly that

216
00:10:33,066 --> 00:10:36,766
it will take all the words
from all the reviews in the corpus.

217
00:10:36,966 --> 00:10:40,566
And then using this transform
part of the method, it will put all these

218
00:10:40,566 --> 00:10:43,500
words in different columns.
So you see that's very simple.

219
00:10:43,500 --> 00:10:45,300
The fit method will just take all the

220
00:10:45,300 --> 00:10:49,033
words, and the transform method
will put all these words into the columns.

221
00:10:49,033 --> 00:10:49,800
That's it.

222
00:10:49,800 --> 00:10:51,200
Nothing more. Okay.

223
00:10:51,200 --> 00:10:55,533
So of course, inside this fit_transform
method we have to input our corpus

224
00:10:55,533 --> 00:10:58,533
of reviews, of fully cleaned reviews.

225
00:10:58,666 --> 00:11:01,900
And then we just need to add here
a toarray.

226
00:11:02,133 --> 00:11:06,600
Because actually you know, remember that
the matrix of features must be a 2D array.

227
00:11:06,600 --> 00:11:08,366
It has to be a 2D array.

228
00:11:08,366 --> 00:11:11,900
Because then, you know, we will train
the Naive Bayes model on the training set.

229
00:11:12,300 --> 00:11:16,033
And this expects of course, an array
as the format of its input,

230
00:11:16,033 --> 00:11:17,433
you know, the matrix of features.

231
00:11:17,433 --> 00:11:19,866
So you know X will be an array here.

232
00:11:19,866 --> 00:11:21,966
Then it will be
split into the training set

233
00:11:21,966 --> 00:11:25,100
and test set, you know, with X_train,
y_train, X_test and y_test.

234
00:11:25,333 --> 00:11:26,433
And then there we go.

235
00:11:26,433 --> 00:11:29,666
We'll have the right array format
to train the Naive Bayes model

236
00:11:29,666 --> 00:11:32,700
on the training set
composed of X_train and y_train.

237
00:11:33,000 --> 00:11:33,933
So toarray.

238
00:11:33,933 --> 00:11:36,000
Let's not forget the parentheses.

239
00:11:36,000 --> 00:11:38,800
And now there we go. We're almost done.

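[Note] The steps just described (import the class, create cv, call fit_transform on the corpus, then toarray) could look like the sketch below. The three-review corpus is an illustrative stand-in for the real cleaned corpus; the class and method names are the ones named in the video.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-in for the cleaned corpus built earlier.
corpus = ["wow love place", "crust not good", "not tasti textur nasti"]

# No max_features yet: keep every word so the total can be counted first.
cv = CountVectorizer()

# fit learns the vocabulary, transform fills one column per word;
# toarray converts the sparse result into a regular 2D NumPy array.
X = cv.fit_transform(corpus).toarray()

print(X.shape)  # (number of reviews, number of distinct words)
print(X[1])     # row for "crust not good"
```

One detail worth knowing: CountVectorizer stores word counts, so a cell can exceed 1 if a word repeats inside a review. Passing binary=True would give strict 0/1 cells.
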
240
00:11:38,800 --> 00:11:42,666
Our final step here is to create
the dependent variable vector y.

241
00:11:43,000 --> 00:11:47,166
And actually I will let you do this now
because you know exactly how to do it

242
00:11:47,466 --> 00:11:48,000
right.

243
00:11:48,000 --> 00:11:51,800
We simply need to take that
second column here because that's exactly

244
00:11:51,800 --> 00:11:53,766
the dependent variable vector.

245
00:11:53,766 --> 00:11:55,533
And we don't have anything to do here

246
00:11:55,533 --> 00:11:58,566
because it's already ready
with the binary outcome, zero or one.

247
00:11:58,866 --> 00:11:59,500
And so.

248
00:11:59,500 --> 00:12:03,033
Well, the way to get
this is actually very simple.

249
00:12:03,033 --> 00:12:06,033
And I'm actually thinking right now
of an even simpler way,

250
00:12:06,100 --> 00:12:09,100
which is to go to our data
preprocessing template,

251
00:12:09,166 --> 00:12:12,633
then take this line of code,
because, you know, I'm very lazy.

252
00:12:12,633 --> 00:12:16,066
And so I'm copying this and pasting it.

253
00:12:16,366 --> 00:12:17,000
Right.

254
00:12:17,000 --> 00:12:19,800
You know, deleting this and right here.

255
00:12:19,800 --> 00:12:22,033
And that's exactly our dependent variable.

256
00:12:22,033 --> 00:12:22,533
Right.

257
00:12:22,533 --> 00:12:24,166
It is just taking

258
00:12:24,166 --> 00:12:27,833
the last column of our data set,
which is the same as the second column.

259
00:12:27,833 --> 00:12:28,166
Right.

260
00:12:28,166 --> 00:12:31,133
You can either put a minus one here
or the index one.

261
00:12:31,133 --> 00:12:33,600
But we want to make this a code template
if we can.

262
00:12:33,600 --> 00:12:36,300
So let's just keep that. Okay.

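[Note] A small sketch of the "last column" line taken from the data preprocessing template. The two-row DataFrame and its column names are assumptions made up for illustration; in the notebook, dataset is the reviews file loaded earlier.

```python
import pandas as pd

# Stand-in for the two-column reviews dataset (column names are assumed).
dataset = pd.DataFrame({
    "Review": ["Wow... Loved this place.", "Crust is not good."],
    "Liked": [1, 0],
})

# -1 selects the last column, which keeps the line reusable as a template.
y = dataset.iloc[:, -1].values

# With only two columns, index 1 points at the same column.
y_alt = dataset.iloc[:, 1].values

print(y)
```
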
263
00:12:36,300 --> 00:12:37,233
Wow. So good.

264
00:12:37,233 --> 00:12:39,833
We are done
with actually the bag of words model.

265
00:12:39,833 --> 00:12:43,033
So now as we said we're going to run this
to figure out

266
00:12:43,033 --> 00:12:44,700
the number of columns in the matrix

267
00:12:44,700 --> 00:12:48,266
X meaning the total number of words
in that sparse matrix.

268
00:12:48,400 --> 00:12:52,533
So let's play this cell in order
to first create x and y.

269
00:12:52,533 --> 00:12:57,933
And then we'll do what's necessary to indeed
get that total number of columns in x.

270
00:12:57,933 --> 00:12:59,933
And that's exactly what we're ready to do.

271
00:12:59,933 --> 00:13:02,933
Now, you saw that
this cell executed properly.

272
00:13:02,933 --> 00:13:07,033
And now the trick
to get that number of columns in X,

273
00:13:07,033 --> 00:13:10,900
or you know, that number of words
resulting from the tokenization

274
00:13:11,233 --> 00:13:16,166
is just to call the len function here,
which is going to take as input

275
00:13:16,366 --> 00:13:21,500
this matrix of features x,
and then only the first row.

276
00:13:21,533 --> 00:13:21,900
Right.

277
00:13:21,900 --> 00:13:23,500
Remember that the first index here

278
00:13:23,500 --> 00:13:27,233
in the pair of square brackets
corresponds to the index of the row.

279
00:13:27,600 --> 00:13:27,966
All right.

280
00:13:27,966 --> 00:13:31,266
So this will give us exactly
the number of elements

281
00:13:31,266 --> 00:13:35,466
basically in the first row
therefore the number of columns of x.

282
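[Note] A tiny sketch of the len trick, using a hand-written nested list as a stand-in for the matrix of features X (the values are illustrative only).

```python
# Stand-in for X: 3 reviews (rows) by 9 words (columns).
X = [
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0],
]

# X[0] is the first row; its length is the number of columns, i.e. words.
n_words = len(X[0])
print(n_words)
```

When X is a NumPy array, as it is after toarray, X.shape[1] gives the same number.
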
00:13:35,666 --> 00:13:38,233
So let's see let's press play.

283
00:13:38,233 --> 00:13:40,200
And we're going to get now that indeed.

284
00:13:40,200 --> 00:13:47,200
Well, okay, there are 1566 words
resulting from the tokenization.

285
00:13:47,500 --> 00:13:52,400
Basically we have 1566 words
that were taken from all the reviews.

286
00:13:52,633 --> 00:13:55,866
And for each of the reviews,
we have either one in the columns

287
00:13:55,866 --> 00:13:59,133
corresponding to the words
that are in the review and zero

288
00:13:59,133 --> 00:14:03,300
to all the other columns corresponding
to the words that are not in the review.

289
00:14:03,766 --> 00:14:04,033
All right.

290
00:14:04,033 --> 00:14:07,733
So basically we have 1566 words.

291
00:14:07,733 --> 00:14:10,766
And we can simplify this even more
by for example

292
00:14:10,766 --> 00:14:16,200
taking the 1500 most frequent words
so that we can, you know, get rid of words

293
00:14:16,200 --> 00:14:20,333
such as Rick, Steve
and maybe, you know, holiday or,

294
00:14:20,700 --> 00:14:23,366
or let's say, you know, faux.

295
00:14:23,366 --> 00:14:24,466
I don't know what that means.

296
00:14:25,433 --> 00:14:26,233
Rubber.

297
00:14:26,233 --> 00:14:29,033
You know, this probably appears only once.

298
00:14:29,033 --> 00:14:33,000
And, you know, words like that, words
that don't help at all

299
00:14:33,000 --> 00:14:36,000
to predict
if the review is positive or negative.

300
00:14:36,000 --> 00:14:38,133
Okay. So that's the idea.

301
00:14:38,133 --> 00:14:43,133
So let's just take,
you know, the 1500 most frequent words.

302
00:14:43,400 --> 00:14:46,666
And therefore, to do this,
we just need to enter the

303
00:14:46,833 --> 00:14:49,833
max_features parameter.

304
00:14:49,833 --> 00:14:50,833
There we go.

305
00:14:50,833 --> 00:14:54,633
And in order to get
only the 1500 most frequent words,

306
00:14:54,633 --> 00:14:57,900
we just need to enter here 1500.

307
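[Note] A sketch of what max_features does, on a made-up five-review corpus with max_features=3 instead of 1500 so the effect is visible. The corpus is an assumption for illustration; the parameter name is the one entered in the video.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned reviews: "love" and "good" appear 3 times, "not" twice,
# and every other word only once.
corpus = [
    "love place",
    "crust not good",
    "not tasti",
    "love good food",
    "good servic love",
]

# Keep only the 3 most frequent words across the whole corpus.
cv = CountVectorizer(max_features=3)
X = cv.fit_transform(corpus).toarray()

kept_words = sorted(cv.vocabulary_)
print(kept_words)
print(len(X[0]))
```

The words that appear only once (place, crust, tasti, food, servic) are dropped, which is the same effect the 1500 limit has on names like Rick or Steve.
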
00:14:57,900 --> 00:14:58,900
And feel free to

308
00:14:58,900 --> 00:15:02,633
try with other values, like for example,
the 1000 most frequent words.

309
00:15:02,633 --> 00:15:05,300
But be careful
not to remove too many words.

310
00:15:05,300 --> 00:15:08,133
Okay? All right. So good.

311
00:15:08,133 --> 00:15:11,833
Therefore now we're going to
you know rerun that cell.

312
00:15:12,133 --> 00:15:13,200
So let's do this.

313
00:15:13,200 --> 00:15:16,200
Let's press play okay. Good.

314
00:15:16,200 --> 00:15:19,800
And now if we rerun that cell
we should get 1500 here.

315
00:15:19,800 --> 00:15:21,033
Perfect.

316
00:15:21,033 --> 00:15:24,966
So now we have a nice bag of words model
with only relevant words

317
00:15:24,966 --> 00:15:27,666
you know, that appear
at least a certain number of times

318
00:15:27,666 --> 00:15:30,900
and without all the non-relevant words
that appear once

319
00:15:30,900 --> 00:15:35,766
like Rick, Steve or that weird "faux" word
we saw in one of the reviews.

320
00:15:35,766 --> 00:15:37,500
Okay, good.

321
00:15:37,500 --> 00:15:40,933
And so now, well,
we basically did the most difficult part.

322
00:15:41,166 --> 00:15:43,266
We created the Bag of Words model.

323
00:15:43,266 --> 00:15:46,100
And so now
I actually have an exercise for you

324
00:15:46,100 --> 00:15:49,266
which you're going to do by yourself
first before we do it together.

325
00:15:49,400 --> 00:15:53,166
It is of course, to do all the rest
of the different steps here.

326
00:15:53,166 --> 00:15:56,833
And you know how to do them
because you basically have everything.

327
00:15:57,066 --> 00:16:00,233
You have the matrix of features
and the dependent variable vector Y,

328
00:16:00,433 --> 00:16:04,033
which you can therefore split
into a training set and test set,

329
00:16:04,033 --> 00:16:09,166
you know, composed respectively of X_train
and y_train, and X_test and y_test.

330
00:16:09,500 --> 00:16:13,333
Then you're going to use the training set
composed of X_train and y_train

331
00:16:13,333 --> 00:16:16,500
to train the Naive Bayes
model on the training set.

332
00:16:16,800 --> 00:16:20,566
Then you're going to predict a test result
using the test set containing

333
00:16:20,566 --> 00:16:24,766
therefore reviews and their outcomes
on which the model wasn't trained.

334
00:16:24,966 --> 00:16:29,000
And finally, you're going to make the
confusion matrix and compute the accuracy.

335
00:16:29,400 --> 00:16:32,300
Of course
you're going to do this using your machine

336
00:16:32,300 --> 00:16:36,200
learning toolkit containing
all the code templates we built so far.

337
00:16:36,300 --> 00:16:38,033
So you totally have the right to do that.

338
00:16:38,033 --> 00:16:40,400
And I actually hope
that you're going to do this

339
00:16:40,400 --> 00:16:44,166
because I want you to be
as efficient as possible.

340
00:16:44,366 --> 00:16:47,066
And therefore
that's exactly what we will do.

341
00:16:47,066 --> 00:16:51,233
In the next and final tutorial
of this section, I will show you how

342
00:16:51,233 --> 00:16:56,666
to juggle with our diverse toolkit, and
especially the classification toolkit, to,

343
00:16:56,666 --> 00:17:00,300
in a flash, split the dataset
into the training set and test set.

344
00:17:00,433 --> 00:17:03,033
Then train the Naive Bayes model
on the training set

345
00:17:03,033 --> 00:17:06,500
and predict the test results
and make the confusion matrix.

346
00:17:06,500 --> 00:17:11,200
I will show you that I will do this
with only copy paste, nothing else.

347
00:17:11,200 --> 00:17:15,266
We won't type any code, since now
we have everything in our diverse toolkit,

348
00:17:15,600 --> 00:17:16,933
but please try it first.

349
00:17:16,933 --> 00:17:18,533
Please do it on your own first

350
00:17:18,533 --> 00:17:21,900
and we will implement the solution
together in the next tutorial.

351
00:17:22,200 --> 00:17:24,166
Until then, enjoy machine learning.