1
00:00:00,166 --> 00:00:02,400
Hello and welcome to this R tutorial.

2
00:00:02,400 --> 00:00:03,933
So we did the main steps.

3
00:00:03,933 --> 00:00:06,233
We cleaned all the texts, all the reviews.

4
00:00:06,233 --> 00:00:10,200
We created our Bag of Words model
and now we have to do one more thing,

5
00:00:10,200 --> 00:00:13,466
which is of course to build our machine
learning classification model.

6
00:00:13,733 --> 00:00:16,766
And we can do that because we have
all our independent variables

7
00:00:16,766 --> 00:00:21,900
in this sparse matrix DTM here, built
thanks to the DocumentTermMatrix function.

8
00:00:22,300 --> 00:00:26,633
And besides, we applied a filter here
to remove the infrequent words.

9
00:00:26,666 --> 00:00:27,900
Well a few of them.

10
00:00:27,900 --> 00:00:32,000
But still this considerably reduced
the number of words in the matrix.

11
00:00:32,033 --> 00:00:35,033
So that's always good for our model
to run faster.

12
00:00:35,400 --> 00:00:37,233
All right. So now let's build the model.

13
00:00:37,233 --> 00:00:43,066
So what we'll do is go to our files to
get back to Part 3, Classification.

14
00:00:43,500 --> 00:00:47,100
Because what we'll do
of course is take a classification model

15
00:00:47,100 --> 00:00:48,400
that we already built.

16
00:00:48,400 --> 00:00:50,633
And we will apply it on our text here.

17
00:00:50,633 --> 00:00:54,333
Because out of this text,
we managed to create a matrix of features

18
00:00:54,333 --> 00:00:56,133
containing the independent variables.

19
00:00:56,133 --> 00:00:57,100
And of course we have

20
00:00:57,100 --> 00:01:00,700
one dependent variable
which is the second column of our data

21
00:01:00,700 --> 00:01:03,766
set, the liked column, which tells
whether or not

22
00:01:03,900 --> 00:01:05,100
the review is positive.

23
00:01:06,300 --> 00:01:06,900
So we have

24
00:01:06,900 --> 00:01:11,033
everything and therefore
we only need to take our model now.

25
00:01:11,033 --> 00:01:13,766
And therefore we go
to Part 3, Classification.

26
00:01:13,766 --> 00:01:16,966
And here
we can find all our classification models.

27
00:01:17,233 --> 00:01:19,100
All right so which one to pick.

28
00:01:19,100 --> 00:01:21,933
Which one to choose
for natural language processing.

29
00:01:21,933 --> 00:01:27,333
Well in general based on experience,
the most common classification models used

30
00:01:27,333 --> 00:01:30,400
for natural language processing are Naive

31
00:01:30,400 --> 00:01:33,400
Bayes, decision tree, or random forest.

32
00:01:33,433 --> 00:01:37,400
You also have the CART model, which is
another type of decision tree model,

33
00:01:37,666 --> 00:01:41,700
and you also have the maximum entropy
model, which is based on entropy as well.

34
00:01:41,700 --> 00:01:43,800
Like for decision trees.

35
00:01:43,800 --> 00:01:46,933
So these models work
very well for natural language processing.

36
00:01:46,933 --> 00:01:50,933
And therefore we will pick
one that is related to entropy.

37
00:01:51,233 --> 00:01:54,133
And that's the case for decision tree
classification model

38
00:01:54,133 --> 00:01:56,833
as well as our random forest
classification model.

39
00:01:56,833 --> 00:01:57,666
Because of course

40
00:01:57,666 --> 00:02:01,933
a random forest is a combination of trees
making their predictions together.

41
00:02:02,400 --> 00:02:05,033
And keep in mind
that you can also use Naive Bayes,

42
00:02:05,033 --> 00:02:08,033
which is commonly used as well
for natural language processing.

43
00:02:08,666 --> 00:02:12,433
But here in this tutorial we will choose
random Forest classification.

44
00:02:12,866 --> 00:02:16,433
So let's go into this section
and here are all the files.

45
00:02:16,433 --> 00:02:21,300
You know the data set the classification
templates and our model in Python and R.

46
00:02:21,600 --> 00:02:23,100
So let's take the one in R.

47
00:02:23,100 --> 00:02:25,866
So I'm just clicking on the file. Here
we go.

48
00:02:25,866 --> 00:02:27,766
The model opens here.

49
00:02:27,766 --> 00:02:29,133
So what do we need here.

50
00:02:29,133 --> 00:02:33,100
Well first of all let's notice
that you know when we are using the random

51
00:02:33,100 --> 00:02:36,366
forest classification model
we are starting with a data set

52
00:02:36,366 --> 00:02:38,100
which is a data frame.

53
00:02:38,100 --> 00:02:42,466
And that contains both the independent
variables and the dependent variable.

54
00:02:42,800 --> 00:02:46,466
So what we have to do right now
is go back to our natural language

55
00:02:46,466 --> 00:02:50,033
processing file and create
exactly the same.

56
00:02:50,033 --> 00:02:51,366
That is, create a data

57
00:02:51,366 --> 00:02:54,900
set containing independent variables
and one dependent variable.

58
00:02:55,300 --> 00:02:58,333
And that will be
the input of this model here.

59
00:02:58,333 --> 00:03:00,400
Because you know
we have our data set here.

60
00:03:00,400 --> 00:03:03,466
And then we use our data
set in each of the code sections here.

61
00:03:03,700 --> 00:03:07,033
And then you know, we split the data
sets into a training set and a test set.

62
00:03:07,366 --> 00:03:11,400
And we train our machine learning
classification model on the training

63
00:03:11,400 --> 00:03:12,400
set here.

64
00:03:12,400 --> 00:03:15,300
So what we only have to do
is just to create this data

65
00:03:15,300 --> 00:03:18,533
set containing the independent variables
and the dependent variable.

66
00:03:18,700 --> 00:03:19,866
So that's very simple.

67
00:03:19,866 --> 00:03:22,833
We already have our independent variables.

68
00:03:22,833 --> 00:03:27,000
But the problem is that our independent
variables right now are in a matrix.

69
00:03:27,266 --> 00:03:31,633
Because you know this document term
matrix function returns a matrix.

70
00:03:31,633 --> 00:03:33,300
So DTM is a matrix right now.

71
00:03:33,300 --> 00:03:37,100
And as you remember in our classification

72
00:03:37,100 --> 00:03:40,133
models in R,
this data set is a data frame.

73
00:03:40,500 --> 00:03:41,833
It's not a matrix.

74
00:03:41,833 --> 00:03:45,466
So we need to make sure that here
for the inputs of the model

75
00:03:45,466 --> 00:03:49,466
that we are going to apply on this
bag of words model that we just created

76
00:03:49,466 --> 00:03:50,866
in the previous tutorials.

77
00:03:50,866 --> 00:03:53,633
Well we need to make sure
that we have a dataframe,

78
00:03:53,633 --> 00:03:55,133
but that's actually very simple.

79
00:03:55,133 --> 00:03:58,633
We just need to take our matrix
and use the function

80
00:03:58,633 --> 00:04:01,633
as.data.frame.

81
00:04:01,700 --> 00:04:04,566
And we will input our sparse matrix
DTM inside.

82
00:04:04,566 --> 00:04:09,500
And that will transform our DTM
sparse matrix into a data frame.

83
00:04:09,833 --> 00:04:10,966
So let's do this.

84
00:04:10,966 --> 00:04:14,500
And since you know
we are just going to copy paste

85
00:04:14,500 --> 00:04:15,966
our random forest classification

86
00:04:15,966 --> 00:04:19,900
here, well, since the input of this model
is basically this data set,

87
00:04:20,300 --> 00:04:24,133
well here we will use the same name
to create this data frame.

88
00:04:24,133 --> 00:04:26,966
And so we will call it data set.

89
00:04:26,966 --> 00:04:28,433
Data set equals.

90
00:04:28,433 --> 00:04:33,700
And then that's when we use the
as.data.frame function.

91
00:04:33,700 --> 00:04:35,633
Here it is. That's the first one.

92
00:04:35,633 --> 00:04:36,566
Here we go.

93
00:04:36,566 --> 00:04:38,566
And so now we need to input the matrix

94
00:04:38,566 --> 00:04:41,000
we want to transform into a data frame.

95
00:04:41,000 --> 00:04:43,600
And that's of course DTM.

96
00:04:43,600 --> 00:04:46,400
And just to make sure
we have the matrix type

97
00:04:46,400 --> 00:04:49,333
expected by this as.data.frame
function.

98
00:04:49,333 --> 00:04:54,600
Well we need to use here
the function as.matrix

99
00:04:54,600 --> 00:04:58,100
and put DTM
as input of this as.matrix function.

100
00:04:58,533 --> 00:05:03,300
Because you know this sparse
matrix DTM here is definitely a matrix,

101
00:05:03,433 --> 00:05:07,300
but it doesn't have the type expected
by this as.data.frame function.

102
00:05:07,633 --> 00:05:09,966
And to make sure
we have the right matrix type,

103
00:05:09,966 --> 00:05:12,900
well we need to use this
as.matrix function.

104
00:05:12,900 --> 00:05:13,366
All right.

105
00:05:13,366 --> 00:05:14,366
And now let's be careful.

106
00:05:14,366 --> 00:05:16,500
We lost one parenthesis.

107
00:05:16,500 --> 00:05:19,366
So I'm just adding it.
All right. Now we're good.

108
00:05:19,366 --> 00:05:24,566
We are ready to transform our sparse
matrix of features into a data frame.
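A minimal sketch of that line, assuming the sparse matrix is named dtm as in the earlier tutorials. A tiny base R matrix stands in for it here (the toy words and counts are made up), so the snippet runs without the tm package:
```r
# Toy stand-in for the document-term matrix: 3 reviews (rows) x 3 words (columns).
dtm = matrix(c(1, 0, 0,
               0, 1, 0,
               0, 0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(NULL, c("also", "good", "wait")))
# as.matrix() gives a regular dense matrix; as.data.frame() then makes one column per word.
dataset = as.data.frame(as.matrix(dtm))
print(dim(dataset))
print(names(dataset))
```
With the real dtm from the tm package, the same call would be dataset = as.data.frame(as.matrix(dtm)).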

109
00:05:24,933 --> 00:05:28,366
So let's do it
I'm going to select this line and execute.

110
00:05:28,700 --> 00:05:29,600
All right.

111
00:05:29,600 --> 00:05:33,000
And now what's interesting to see
is that we have the real data set.

112
00:05:33,300 --> 00:05:37,333
You know with all the reviews in the rows
and all the words that we took

113
00:05:37,333 --> 00:05:38,100
from the corpus.

114
00:05:38,100 --> 00:05:42,066
and then filtered thanks to this
removeSparseTerms function.

115
00:05:42,400 --> 00:05:46,166
Well, we can see the full data
set here with these 1000 rows

116
00:05:46,300 --> 00:05:49,400
and all these 691 columns,

117
00:05:49,766 --> 00:05:53,700
each one corresponding to a word
that comes from the reviews in the corpus.

118
00:05:53,700 --> 00:05:57,233
And that was not filtered
by the removeSparseTerms function.

119
00:05:57,700 --> 00:05:58,433
All right.

120
00:05:58,433 --> 00:06:01,333
So here
you can have a look at this huge table.

121
00:06:01,333 --> 00:06:02,933
And we can clearly see here

122
00:06:02,933 --> 00:06:07,033
that this is a sparse matrix
because basically we can only see zeros.

123
00:06:07,200 --> 00:06:08,766
Well we have very few ones.

124
00:06:08,766 --> 00:06:10,333
We have one here one here.

125
00:06:10,333 --> 00:06:12,200
But all the rest is zeros.

126
00:06:12,200 --> 00:06:14,666
And so for example
if I take this one here,

127
00:06:14,666 --> 00:06:20,400
well this one belongs to the "also" column
and to the 23rd row.

128
00:06:20,400 --> 00:06:22,333
That is the 23rd review.

129
00:06:22,333 --> 00:06:25,166
And so this one here means that the word

130
00:06:25,166 --> 00:06:28,166
"also" appears in review 23.

131
00:06:28,533 --> 00:06:30,266
All right. So that's the sparse matrix.

132
00:06:30,266 --> 00:06:33,066
And now you can really see what it is
with your own eyes.

133
00:06:33,066 --> 00:06:33,466
All right.

134
00:06:33,466 --> 00:06:36,466
So let's go back to our natural language
processing file.

135
00:06:36,600 --> 00:06:40,366
So we have our data set
which is now a data

136
00:06:40,366 --> 00:06:43,600
frame as we wanted but still incomplete.

137
00:06:43,900 --> 00:06:44,466
You know why.

138
00:06:44,466 --> 00:06:49,500
It's because the data set we start with in
this random forest classification model,

139
00:06:49,633 --> 00:06:53,066
and in general in all classification
models, is a data frame.

140
00:06:53,066 --> 00:06:54,300
So we're good on that.

141
00:06:54,300 --> 00:06:57,866
But a data frame containing
both the independent variables

142
00:06:58,066 --> 00:06:59,933
and the dependent variable.

143
00:06:59,933 --> 00:07:01,366
So what we need to do right now

144
00:07:01,366 --> 00:07:06,066
is add the dependent variable
to this data frame data set.

145
00:07:06,266 --> 00:07:09,266
Because right now it only contains
the independent variables.

146
00:07:09,600 --> 00:07:09,966
All right.

147
00:07:09,966 --> 00:07:14,100
So you might remember how to add the
dependent variable column to a data set.

148
00:07:14,100 --> 00:07:15,366
That is a data frame.

149
00:07:15,366 --> 00:07:17,800
Remember we need to take our data set.

150
00:07:17,800 --> 00:07:19,566
Then add a dollar sign here.

151
00:07:19,566 --> 00:07:22,500
And then after this dollar sign
we can either take one of

152
00:07:22,500 --> 00:07:26,200
the existing columns here
if we want to update the column,

153
00:07:26,533 --> 00:07:30,566
or create a new column
to add to this data set.

154
00:07:30,900 --> 00:07:32,500
And that's exactly what we want to do.

155
00:07:32,500 --> 00:07:34,966
We want to add a new column
to this data set.

156
00:07:34,966 --> 00:07:36,266
Well, it's an existing column.

157
00:07:36,266 --> 00:07:38,033
That's the liked column.

158
00:07:38,033 --> 00:07:41,066
But we create it for this data set
because here it is a new column.

159
00:07:41,400 --> 00:07:44,833
And so we'll give to this column
the same name as the real dependent

160
00:07:44,833 --> 00:07:47,100
variable column. That is liked.

161
00:07:48,200 --> 00:07:48,566
All right.

162
00:07:48,566 --> 00:07:52,200
So by doing this we are adding
this new column that we call liked

163
00:07:52,833 --> 00:07:54,400
and then equals.

164
00:07:54,400 --> 00:07:55,600
And then after this equal

165
00:07:55,600 --> 00:07:59,133
we need to specify what we want to add
in this new column.

166
00:07:59,400 --> 00:08:02,700
And what we want to add is nothing else
than the existing

167
00:08:02,900 --> 00:08:05,733
liked column of our data set.

168
00:08:05,733 --> 00:08:09,866
But be careful, because our data set
was just updated to this new data frame,

169
00:08:10,133 --> 00:08:14,100
and therefore we no longer have the data
set that we imported originally.

170
00:08:14,366 --> 00:08:15,900
So what we'll do is very simple.

171
00:08:15,900 --> 00:08:20,566
We'll just rename this data set
by adding an underscore and then original.

172
00:08:21,300 --> 00:08:22,000
Here we go.

173
00:08:22,000 --> 00:08:25,433
And we will select this line again
and execute.

174
00:08:25,800 --> 00:08:28,266
All right.
So now we have our original data set.

175
00:08:28,266 --> 00:08:31,500
And therefore we can have access
to the liked column of this

176
00:08:31,500 --> 00:08:34,866
original data set which is going to be
our dependent variable.

177
00:08:35,333 --> 00:08:38,700
So let's add this dependent variable
right now to our data set.

178
00:08:39,133 --> 00:08:44,866
And so to take this dependent variable
we need to take our data set original.

179
00:08:44,866 --> 00:08:45,300
Here it is

180
00:08:45,300 --> 00:08:49,433
because that's the original data set
containing the dependent variable liked.

181
00:08:49,766 --> 00:08:52,766
And so to take this dependent
variable vector

182
00:08:52,800 --> 00:08:55,766
we need to add a dollar sign here, same as before.

183
00:08:55,766 --> 00:08:59,033
And then take the column we want
which is the liked column.
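A small sketch of these two steps on toy data. The column name Liked and the three made-up reviews are assumptions used only for illustration:
```r
# Hypothetical stand-in for the imported reviews, kept under the new name.
dataset_original = data.frame(Review = c("Loved it", "Not good", "Great place"),
                              Liked = c(1, 0, 1),
                              stringsAsFactors = FALSE)
# Stand-in for the word columns obtained from the document-term matrix.
dataset = data.frame(love = c(1, 0, 0), good = c(0, 1, 0), great = c(0, 0, 1))
# The dollar sign creates the column if it does not exist yet.
dataset$Liked = dataset_original$Liked
print(names(dataset))
```
The new column is appended at the end, which is why the dependent variable ends up in the last position of the data frame.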

184
00:08:59,533 --> 00:09:00,900
All right. So that's good.

185
00:09:00,900 --> 00:09:04,066
By selecting this line and executing it

186
00:09:04,466 --> 00:09:07,833
we add the liked dependent variable
vector column

187
00:09:07,966 --> 00:09:11,866
to our data set already
containing the independent variables

188
00:09:12,100 --> 00:09:15,500
that are all the filtered words
of our cleaned reviews in the corpus.

189
00:09:16,200 --> 00:09:18,066
All right.
So now we have everything we need.

190
00:09:18,066 --> 00:09:21,566
And we are ready to take our machine
learning classification model

191
00:09:21,800 --> 00:09:25,500
because we have our data
set that not only is a data frame,

192
00:09:25,500 --> 00:09:28,866
but also contains both the independent
variables and the dependent variable.

193
00:09:29,100 --> 00:09:30,066
So we have everything

194
00:09:30,066 --> 00:09:33,200
that a random forest
classification model is expecting here.

195
00:09:33,466 --> 00:09:38,200
So what we only need to do here is take
everything from here and not from here.

196
00:09:38,200 --> 00:09:40,733
You know, because this section
is to import the data set.

197
00:09:40,733 --> 00:09:44,700
But we already have our data set
that is ready for classification model.

198
00:09:44,900 --> 00:09:48,433
So we just need to take everything
from here because this is where

199
00:09:48,600 --> 00:09:50,900
the data set starts to be processed.

200
00:09:50,900 --> 00:09:53,066
And so we take everything from here to

201
00:09:54,233 --> 00:09:55,300
here.

202
00:09:55,300 --> 00:09:59,233
And we do not take this
because this is to plot the results in 2D.

203
00:09:59,233 --> 00:10:00,766
That is two independent variables.

204
00:10:00,766 --> 00:10:04,166
And here since of course we have
a lot more than two independent variables.

205
00:10:04,266 --> 00:10:06,633
Well,
we cannot use this to plot the results,

206
00:10:06,633 --> 00:10:09,466
but we will definitely have a look
at the confusion matrix

207
00:10:09,466 --> 00:10:11,300
to see the number of correct predictions,

208
00:10:11,300 --> 00:10:13,766
as well as the number
of incorrect predictions,

209
00:10:13,766 --> 00:10:16,466
so that we can evaluate
the model performance.

210
00:10:16,466 --> 00:10:16,800
All right.

211
00:10:16,800 --> 00:10:19,800
So let's get back to our natural language
processing file.

212
00:10:20,166 --> 00:10:24,600
And we will paste our random forest
classification model right here.

213
00:10:25,166 --> 00:10:26,033
All right.

214
00:10:26,033 --> 00:10:27,866
So now we just need to modify

215
00:10:27,866 --> 00:10:30,866
a very few things
because everything is basically ready.

216
00:10:30,900 --> 00:10:33,333
But let's see what we can modify.

217
00:10:33,333 --> 00:10:37,033
Well first here in the section
that encodes the target feature as factor.

218
00:10:37,266 --> 00:10:41,066
Well of course we need to replace this
purchased dependent variable

219
00:10:41,200 --> 00:10:43,433
which was the dependent variable
in part three.

220
00:10:43,433 --> 00:10:47,533
Well we need to replace it with our new
dependent variable which is liked.

221
00:10:48,266 --> 00:10:49,200
All right.

222
00:10:49,200 --> 00:10:53,000
And same here, we replace purchased by liked.

223
00:10:53,966 --> 00:10:56,000
All right, good for this section.

224
00:10:56,000 --> 00:10:59,533
Then in the next section we split
the data sets into the training set

225
00:10:59,533 --> 00:11:00,600
and the test set.

226
00:11:00,600 --> 00:11:02,166
Well that's very important to do this.

227
00:11:02,166 --> 00:11:04,533
Unless you want to create a new review.

228
00:11:04,533 --> 00:11:07,600
But you know we will train our random
forest classification

229
00:11:07,600 --> 00:11:10,600
model on, say, 800 reviews.

230
00:11:10,766 --> 00:11:15,066
And we will test the predictive power
of random forests on 200

231
00:11:15,100 --> 00:11:19,366
new reviews on which our random forest
classification model was not trained.

232
00:11:19,500 --> 00:11:21,100
And therefore these 200 reviews

233
00:11:21,100 --> 00:11:24,933
in the test set will be new reviews
for our random forest classification model.

234
00:11:25,333 --> 00:11:28,500
And so we will see how it manages
to predict

235
00:11:28,500 --> 00:11:31,900
whether each of these 200 reviews
is positive or negative.

236
00:11:32,233 --> 00:11:34,966
And then it's in the confusion matrix
that we will see

237
00:11:34,966 --> 00:11:37,866
the number of correct predictions
and the number of incorrect

238
00:11:37,866 --> 00:11:40,866
predictions in these 200 new reviews.

239
00:11:40,866 --> 00:11:42,866
All right.
So that's what is done in this section.

240
00:11:42,866 --> 00:11:46,833
And since I just gave as an example
800 reviews to train the model

241
00:11:46,833 --> 00:11:50,700
and 200 reviews to test it,
well let's go with this choice of numbers.

242
00:11:51,000 --> 00:11:57,233
And so we need to change the split
ratio here to 0.8 because that's 80%.

243
00:11:57,233 --> 00:11:58,866
And we have 1000 reviews.

244
00:11:58,866 --> 00:12:03,266
So 80% of 1000 reviews is 800 reviews
to go to the training set,

245
00:12:03,533 --> 00:12:06,533
and therefore 200 reviews
to go to the test set.
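A rough sketch of an 80/20 split in base R. The transcript does not show the template's own splitting call, so sample() is swapped in here; it is a plain random split that does not try to keep the share of positive reviews equal in both sets. The stand-in data and the seed are assumptions:
```r
set.seed(123)  # arbitrary seed, only so the split is reproducible
# Stand-in data set with 1000 rows; the real one has the word columns plus liked.
dataset = data.frame(word = rep(c(0, 1), 500), Liked = rep(c(0, 1), each = 500))
# Draw 80% of the row numbers at random for the training set.
train_rows = sample(seq_len(nrow(dataset)), size = round(0.8 * nrow(dataset)))
training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]
print(c(nrow(training_set), nrow(test_set)))
```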

246
00:12:06,533 --> 00:12:07,333
All right. So that's good.

247
00:12:07,333 --> 00:12:09,533
And of course, let's not forget to replace

248
00:12:09,533 --> 00:12:13,100
the purchased variable here
by our new dependent variable.

249
00:12:13,100 --> 00:12:15,233
That is liked.

250
00:12:15,233 --> 00:12:15,600
All right.

251
00:12:15,600 --> 00:12:17,566
So I think we're good with this section.

252
00:12:17,566 --> 00:12:19,466
So now let's move on to the next one.

253
00:12:19,466 --> 00:12:21,533
The next one is about feature scaling.

254
00:12:21,533 --> 00:12:24,166
And so here
do we need to apply feature scaling.

255
00:12:24,166 --> 00:12:24,966
Well not really

256
00:12:24,966 --> 00:12:29,366
because we only have zeros and ones
in the sparse matrix of features.

257
00:12:29,700 --> 00:12:30,966
And therefore we don't have one

258
00:12:30,966 --> 00:12:34,133
independent variable
dominating another independent variable.

259
00:12:34,300 --> 00:12:36,133
So we don't need to apply feature scaling.

260
00:12:36,133 --> 00:12:39,133
So we will remove this section.

261
00:12:39,266 --> 00:12:40,033
All right.

262
00:12:40,033 --> 00:12:41,233
And so what about this one.

263
00:12:41,233 --> 00:12:45,400
Yes of course we keep this one
because this is the section where we build

264
00:12:45,500 --> 00:12:49,200
a random forest classification model
that will classify the reviews.

265
00:12:49,500 --> 00:12:50,000
And that's

266
00:12:50,000 --> 00:12:53,733
where we train the random forest
classification model on the training set.

267
00:12:53,900 --> 00:12:56,400
And therefore here
we need to change two things.

268
00:12:56,400 --> 00:13:01,100
First, the index here that you know
is the index of the dependent variable

269
00:13:01,100 --> 00:13:04,900
that we need to remove from x
because x is supposed

270
00:13:04,900 --> 00:13:08,200
to be the training set
without the dependent variable.

271
00:13:08,666 --> 00:13:11,866
So we need to remove it with the index
of our new dependent variable

272
00:13:11,866 --> 00:13:16,100
liked, which is not three but 692.

273
00:13:16,433 --> 00:13:18,166
We can see that here very easily.

274
00:13:18,166 --> 00:13:21,600
So let's replace three by 692.
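To see what that index does, here is a toy sketch where the dependent variable sits in column 3; with the real data set the same idea would use 692. The column names are made up for illustration:
```r
# Toy training set: two word columns plus the dependent variable in column 3.
training_set = data.frame(good = c(1, 0), bad = c(0, 1), Liked = factor(c(1, 0)))
# A negative index drops that column, leaving only the independent variables.
x = training_set[-3]
y = training_set$Liked
print(names(x))
```
Writing training_set[-ncol(training_set)] would be one way to avoid hard-coding the position, assuming the dependent variable is always the last column.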

275
00:13:22,166 --> 00:13:23,766
All right. Good.

276
00:13:23,766 --> 00:13:28,033
And now the second thing that we need to
change is of course this purchased here

277
00:13:28,033 --> 00:13:31,166
that we still need to replace by liked

278
00:13:32,500 --> 00:13:33,600
this way.

279
00:13:33,600 --> 00:13:37,300
And then
if we want we can train our random forest

280
00:13:37,300 --> 00:13:39,100
classification with more trees.

281
00:13:39,100 --> 00:13:40,933
Right now we have ten trees.

282
00:13:40,933 --> 00:13:42,266
So we will keep ten trees.

283
00:13:42,266 --> 00:13:44,966
That might be enough for our 1000 reviews,

284
00:13:44,966 --> 00:13:49,600
which is quite a small number of reviews,
and especially our 691

285
00:13:49,800 --> 00:13:53,366
word columns that we have
in our sparse matrix of features.

286
00:13:53,700 --> 00:13:56,400
Ten trees might be enough,
but of course, you're welcome

287
00:13:56,400 --> 00:13:59,933
to try more random forest
classification models with more trees.

288
00:14:00,500 --> 00:14:01,933
So we're good with this section.

289
00:14:01,933 --> 00:14:04,100
And now let's move on to the next one.

290
00:14:04,100 --> 00:14:06,933
The next one is about
predicting the test results.

291
00:14:06,933 --> 00:14:10,566
So making the predictions
on 200 new reviews

292
00:14:10,766 --> 00:14:13,466
that our model won't know anything about.

293
00:14:13,466 --> 00:14:16,700
And therefore for these new reviews,
our model is going to try to predict

294
00:14:17,000 --> 00:14:21,166
if those reviews are positive or negative
and therefore it will be very interesting

295
00:14:21,166 --> 00:14:24,233
to see if it's making some correct
predictions on new reviews.

296
00:14:24,800 --> 00:14:26,000
So right now it's the same.

297
00:14:26,000 --> 00:14:29,233
We have to replace this index here
that corresponds to the index

298
00:14:29,233 --> 00:14:30,700
of the dependent variable.

299
00:14:30,700 --> 00:14:34,700
And so we need to replace three by
of course 692.

300
00:14:34,700 --> 00:14:37,800
That's exactly the same as we did
for the training set here.

301
00:14:38,333 --> 00:14:40,533
And so now we're good for this section.

302
00:14:40,533 --> 00:14:44,000
We're finally getting to the last section
that is making the confusion matrix.

303
00:14:44,300 --> 00:14:47,233
That's the interesting section
that will tell us the number of correct

304
00:14:47,233 --> 00:14:51,300
predictions and the number of incorrect
predictions for these 200 new reviews.

305
00:14:51,633 --> 00:14:52,966
So we will see that right now.

306
00:14:52,966 --> 00:14:56,600
But of course we need to replace this
three index that corresponds

307
00:14:56,600 --> 00:14:59,600
to the index of the dependent variable
still the same.

308
00:14:59,633 --> 00:15:02,166
And replace it by 692.

309
00:15:03,300 --> 00:15:03,666
All right.

310
00:15:03,666 --> 00:15:05,000
So now everything is good.

311
00:15:05,000 --> 00:15:08,400
We are ready to train our random forest
classification model

312
00:15:08,700 --> 00:15:11,233
on our 800 reviews of the training set.

313
00:15:11,233 --> 00:15:13,966
And then evaluate
the predictive power of our model

314
00:15:13,966 --> 00:15:16,833
on our 200 new reviews in the test set.

315
00:15:16,833 --> 00:15:18,133
So let's do it.

316
00:15:18,133 --> 00:15:22,733
Since we already executed everything
up to here, what we need to do now

317
00:15:22,733 --> 00:15:26,400
is just select everything
from here to the bottom.

318
00:15:26,866 --> 00:15:27,800
And now we're good.

319
00:15:27,800 --> 00:15:32,066
We just need to press command or control
plus enter to execute to train the model

320
00:15:32,066 --> 00:15:34,033
and test it on the test set,

321
00:15:34,033 --> 00:15:37,200
and eventually have a look at the number
of correct predictions and the number

322
00:15:37,200 --> 00:15:40,200
of incorrect predictions
on 200 new reviews.

323
00:15:40,366 --> 00:15:41,466
So let's do it.

324
00:15:41,466 --> 00:15:43,366
I'm going to press
Command Plus Enter to execute.

325
00:15:45,200 --> 00:15:46,166
And here we go.

326
00:15:46,166 --> 00:15:48,200
Everything worked properly. Great.

327
00:15:48,200 --> 00:15:49,633
So let's have a look.

328
00:15:49,633 --> 00:15:52,200
We will have a look at the confusion
matrix.

329
00:15:52,200 --> 00:15:56,100
Of course by typing
here cm in the console.

330
00:15:56,433 --> 00:15:59,100
Here we go. So let's see what we have.

331
00:15:59,100 --> 00:16:04,400
We have 79 correct predictions
of negative reviews, 70 correct

332
00:16:04,400 --> 00:16:10,100
predictions of positive reviews, 21
incorrect predictions of negative reviews,

333
00:16:10,500 --> 00:16:13,800
and 30 incorrect
predictions of positive reviews.

334
00:16:14,233 --> 00:16:16,133
All right, so that's actually not too bad.

335
00:16:16,133 --> 00:16:19,366
You know, because we only had 800 reviews
to train the model.

336
00:16:19,500 --> 00:16:21,733
That's not much
when you're working with text.

337
00:16:21,733 --> 00:16:24,666
And 30 plus 21 equals 51

338
00:16:24,666 --> 00:16:28,700
incorrect predictions, which
is not bad out of 200 new reviews

339
00:16:29,033 --> 00:16:33,600
when you know that you trained your
classification model on only 800 reviews.

340
00:16:33,900 --> 00:16:36,200
And actually,
let's have a look at the accuracy.

341
00:16:36,200 --> 00:16:41,066
The accuracy is the number
of correct predictions that is 79

342
00:16:41,300 --> 00:16:45,266
plus 70 divided by the total number

343
00:16:45,266 --> 00:16:48,600
of observations in the test set,
and that is 200.

344
00:16:49,200 --> 00:16:51,133
So let's have a look at the accuracy.

345
00:16:51,133 --> 00:16:52,900
Pressing enter here.

346
00:16:52,900 --> 00:16:55,900
And the accuracy is 74.5%.
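The same arithmetic as a short sketch, rebuilding the confusion matrix by hand from the four counts read out above. Which off-diagonal cell holds 21 and which holds 30 is an assumption, and it does not change the accuracy:
```r
# Counts from the video; rows are actual classes, columns are predicted classes.
cm = matrix(c(79, 30,
              21, 70),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
# Correct predictions sit on the diagonal.
accuracy = sum(diag(cm)) / sum(cm)
print(accuracy)
```
That is (79 + 70) / 200 = 0.745.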

347
00:16:56,333 --> 00:16:57,033
So again

348
00:16:57,033 --> 00:17:01,500
that's not bad considering the fact that
we trained our model on only 800 reviews.

349
00:17:01,733 --> 00:17:04,866
And you'll clearly see that
if you had a lot more reviews to train

350
00:17:05,066 --> 00:17:08,800
your classification model,
you would likely get a much better accuracy.

351
00:17:09,566 --> 00:17:12,033
All right, so that's the end of natural
language processing

352
00:17:12,033 --> 00:17:15,433
in R. Congratulations
for having completed all this:

353
00:17:15,433 --> 00:17:18,733
creating the Bag of Words
model, training a classification model

354
00:17:18,733 --> 00:17:20,066
on this data set.

355
00:17:20,066 --> 00:17:23,133
But that's not the end of your natural
language processing journey.

356
00:17:23,133 --> 00:17:26,933
Because right after this video
you'll get a little challenge.

357
00:17:27,233 --> 00:17:29,133
So we'll let you find out about that.

358
00:17:29,133 --> 00:17:31,100
And until then, enjoy machine learning.