﻿1
00:00:00,233 --> 00:00:02,200
Hello and welcome to this R tutorial.

2
00:00:02,200 --> 00:00:02,966
And mostly,

3
00:00:02,966 --> 00:00:06,033
welcome to Part 7,
Natural Language Processing.

4
00:00:06,366 --> 00:00:08,833
So what is natural language
processing all about?

5
00:00:08,833 --> 00:00:11,166
Well, it's about analyzing texts.

6
00:00:11,166 --> 00:00:15,300
These texts can be books, reviews,
some HTML

7
00:00:15,300 --> 00:00:19,200
web pages that you extract from web
scraping, all sorts of text

8
00:00:19,266 --> 00:00:23,066
and therefore natural language
processing is a branch of machine learning

9
00:00:23,300 --> 00:00:27,033
where we do
some predictive analysis on text mostly.

10
00:00:27,566 --> 00:00:30,566
So in this part
these texts are going to be reviews,

11
00:00:30,966 --> 00:00:33,533
you know, written reviews of restaurants.

12
00:00:33,533 --> 00:00:35,500
And so we will make some machine
learning models

13
00:00:35,500 --> 00:00:38,700
that will predict
if the review is positive or negative.

14
00:00:39,333 --> 00:00:42,900
So this is a simple example
of an application

15
00:00:42,900 --> 00:00:44,533
of natural language processing.

16
00:00:44,533 --> 00:00:45,933
But the algorithm we'll make in

17
00:00:45,933 --> 00:00:49,500
this part will be very well applicable
to other kinds of texts.

18
00:00:49,733 --> 00:00:50,533
You know, you'll be able

19
00:00:50,533 --> 00:00:54,300
to apply this on books, for example,
to predict the genre of a book.

20
00:00:54,300 --> 00:00:57,933
You know, whether a book is a thriller,
a comedy or a romance.

21
00:00:58,500 --> 00:01:01,666
You'll be also able to use it on HTML
web pages

22
00:01:01,800 --> 00:01:05,000
to do whatever kind of analysis
you want to do on these pages.

23
00:01:05,400 --> 00:01:09,533
You can also apply it on newspapers,
you know, to predict in which category

24
00:01:09,533 --> 00:01:11,066
an article belongs.

25
00:01:11,066 --> 00:01:13,466
Well, you'll see that
we'll make a general model

26
00:01:13,466 --> 00:01:16,500
that you'll be able to apply
to most texts.

27
00:01:16,933 --> 00:01:19,933
And of course, if you need to apply this
on a more complicated text,

28
00:01:20,033 --> 00:01:22,900
you can ask me some questions in the Q&A
and I'll tell you what

29
00:01:22,900 --> 00:01:25,966
to add to make this code work properly
for your problem.

30
00:01:26,100 --> 00:01:27,233
Your text.

31
00:01:27,233 --> 00:01:29,133
Okay, so let's start this.

32
00:01:29,133 --> 00:01:31,466
Let's start our algorithm.

33
00:01:31,466 --> 00:01:34,733
And before starting with the first code
section, let's do the basic step.

34
00:01:34,966 --> 00:01:37,466
Let's set the right
folder as working directory.

35
00:01:37,466 --> 00:01:39,833
So we'll go to our Machine Learning
A-Z folder.

36
00:01:39,833 --> 00:01:42,300
Then Part 7,
Natural Language Processing.

37
00:01:42,300 --> 00:01:45,000
So congratulations on reaching this part.

38
00:01:45,000 --> 00:01:48,900
You are now entering a very useful
and exciting branch of machine learning.

39
00:01:49,100 --> 00:01:50,333
So let's go in there.

40
00:01:50,333 --> 00:01:53,233
And now we have the section
Natural Language Processing.

41
00:01:53,233 --> 00:01:54,266
So here we go.

42
00:01:54,266 --> 00:01:56,933
That's the folder
we want to set as a working directory.

43
00:01:56,933 --> 00:02:01,700
So now we'll click on this More button
here and then Set As Working Directory.

44
00:02:02,166 --> 00:02:03,133
All good.

45
00:02:03,133 --> 00:02:08,100
And now, as you can see, in this folder
that we just set as working directory,

46
00:02:08,233 --> 00:02:10,433
We have two data files.

47
00:02:10,433 --> 00:02:13,600
We have a Restaurant_Reviews.csv file

48
00:02:13,600 --> 00:02:16,700
and a Restaurant_Reviews.tsv file.

49
00:02:16,900 --> 00:02:20,866
So I have these two data files on purpose
because there is an important thing

50
00:02:20,866 --> 00:02:24,900
to understand when we prepare text
data sets that I would like to highlight

51
00:02:24,900 --> 00:02:26,166
and show you right now.

52
00:02:26,166 --> 00:02:29,166
So I'm going to go to my folder
on my computer

53
00:02:29,166 --> 00:02:31,466
and we'll have a look at these two files
from there.

54
00:02:31,466 --> 00:02:34,033
So right now I'm going to my folder.

55
00:02:34,033 --> 00:02:34,866
Here it is.

56
00:02:34,866 --> 00:02:37,966
And let's open
Restaurant_Reviews.csv.

57
00:02:38,200 --> 00:02:41,066
So open with TextEdit.

58
00:02:41,066 --> 00:02:41,733
Here we go.

59
00:02:41,733 --> 00:02:44,733
And Restaurant_Reviews.tsv.

60
00:02:45,133 --> 00:02:48,666
So the same, open with TextEdit.

61
00:02:49,500 --> 00:02:52,500
So that's the CSV and that's the TSV.

62
00:02:52,633 --> 00:02:54,333
Let's first have a look at the CSV.

63
00:02:54,333 --> 00:02:57,666
As you can see, the first line
here is the title

64
00:02:57,666 --> 00:03:00,666
of the future columns
we're going to have in RStudio.

65
00:03:00,966 --> 00:03:05,000
The first column is review
and the second column is liked.

66
00:03:05,000 --> 00:03:09,933
And I can see that because these two terms
here are separated by a comma.

67
00:03:10,266 --> 00:03:15,266
And this is a CSV file, meaning that
all the columns are separated by a comma.

68
00:03:15,900 --> 00:03:18,900
So that's the first line
containing the titles of the columns.

69
00:03:19,166 --> 00:03:22,500
And then we have all our observations.

70
00:03:22,533 --> 00:03:25,533
So one line
corresponds to one observation.

71
00:03:25,900 --> 00:03:30,233
And as you can see in each of these lines
we first have the review.

72
00:03:30,266 --> 00:03:33,033
So this is
of course a review of a restaurant.

73
00:03:33,033 --> 00:03:35,100
So, "Wow, love this place."

74
00:03:35,100 --> 00:03:37,300
Meaning
of course that the review is positive

75
00:03:37,300 --> 00:03:40,766
and therefore in the second column
the liked column,

76
00:03:40,966 --> 00:03:45,166
we have a one here, meaning
that the review is indeed positive.

77
00:03:45,600 --> 00:03:50,766
So the variable for this liked
column can take two values, 1 or 0.

78
00:03:51,000 --> 00:03:53,100
And one means that it's a positive review

79
00:03:53,100 --> 00:03:55,700
and zero means
that it's a negative review.

80
00:03:55,700 --> 00:03:59,566
So indeed, as you can see in the second
review, "Crust is not good."

81
00:03:59,866 --> 00:04:01,400
Well, of course that's a negative review.

82
00:04:01,400 --> 00:04:03,966
And therefore there is a zero here.

83
00:04:03,966 --> 00:04:04,266
All right.

84
00:04:04,266 --> 00:04:07,833
So that's actually the kind of file
we are used to.

85
00:04:07,833 --> 00:04:12,366
Because since the beginning of this course
we've only been using CSV files

86
00:04:12,366 --> 00:04:15,466
where the columns
are separated by a comma.

87
00:04:16,066 --> 00:04:19,066
But here we have something different.

88
00:04:19,133 --> 00:04:21,400
We can see that we have the same columns.

89
00:04:21,400 --> 00:04:25,600
First column is review, second column
is liked and we have the same reviews.

90
00:04:25,600 --> 00:04:29,133
So these are exactly the same data sets
with the same data.

91
00:04:29,433 --> 00:04:31,400
But there is one major difference.

92
00:04:31,400 --> 00:04:34,833
And as you might have guessed
this difference is the delimiter.

93
00:04:35,266 --> 00:04:37,633
In this file here,
the delimiter is a comma.

94
00:04:37,633 --> 00:04:40,133
So that's the delimiter
separating the two columns.

95
00:04:40,133 --> 00:04:43,033
And in this file the delimiter is a tab.

96
00:04:43,033 --> 00:04:45,700
And that's why we call it TSV,

97
00:04:45,700 --> 00:04:46,933
tab-separated values.

98
00:04:46,933 --> 00:04:50,333
This is CSV, comma-separated values.

99
00:04:51,000 --> 00:04:53,666
And so now according to you, which one

100
00:04:53,666 --> 00:04:57,300
should we choose for our future algorithm?

101
00:04:57,500 --> 00:05:01,500
You know, we'll have a machine learning
algorithm analyzing all the reviews here.

102
00:05:01,866 --> 00:05:04,166
And then the goal of this algorithm
will be to predict

103
00:05:04,166 --> 00:05:07,166
whether the review is positive
or negative.

104
00:05:07,200 --> 00:05:07,700
All right.

105
00:05:07,700 --> 00:05:09,133
But now the question is

106
00:05:09,133 --> 00:05:13,600
do we need a data set where the columns
are separated by a comma or by a tab.

107
00:05:13,900 --> 00:05:17,466
Well, the answer is,
as you might have guessed by a tab.

108
00:05:17,800 --> 00:05:18,900
And why is that?

109
00:05:18,900 --> 00:05:22,533
It's because we already have some commas
in the reviews themselves.

110
00:05:23,000 --> 00:05:27,466
Well, for example, this one,
this review is "The food", comma,

111
00:05:27,466 --> 00:05:29,100
"amazing".

112
00:05:29,100 --> 00:05:33,200
So if we use our CSV file
where the delimiter is a comma,

113
00:05:33,333 --> 00:05:37,233
well we'll have a problem
for this review here because for this

114
00:05:37,233 --> 00:05:41,233
particular observation the first column
will contain "The food" here.

115
00:05:41,400 --> 00:05:43,566
So R will think the review is

116
00:05:43,566 --> 00:05:46,966
"The food", and the second column
will not be 1 here

117
00:05:47,066 --> 00:05:51,300
but "amazing", because there is this comma
here that is taken for the delimiter.

118
00:05:51,366 --> 00:05:53,500
And therefore it will separate "The food"

119
00:05:53,500 --> 00:05:54,766
and "amazing".

120
00:05:54,766 --> 00:05:58,833
And therefore what will happen to the 1?
It will go to the next observation,

121
00:05:59,100 --> 00:06:02,100
and therefore the 1 will be taken
for a new review.

122
00:06:02,300 --> 00:06:04,466
So that will not make any sense.

123
00:06:04,466 --> 00:06:07,300
And this will mess up
the whole algorithm.

124
00:06:07,300 --> 00:06:10,600
And that's why it's way
better to take tabs here, because,

125
00:06:10,833 --> 00:06:15,066
you know, when people write reviews,
they do not put tabs in the review.

126
00:06:15,166 --> 00:06:16,700
Well that would be very rare.

127
00:06:16,700 --> 00:06:21,300
They would put commas very easily,
as we can see for this particular review.

128
00:06:21,566 --> 00:06:23,833
And we can find other reviews with commas.

129
00:06:23,833 --> 00:06:26,600
I'm sure of it. Yes, indeed.
We have another one here.

130
00:06:26,600 --> 00:06:30,500
"This place is not worth your time", comma,
"let alone Vegas."

131
00:06:31,033 --> 00:06:34,000
So it's very natural
to put some commas in the reviews,

132
00:06:34,000 --> 00:06:37,400
but not very natural
to put some tabs in the reviews.

133
00:06:37,400 --> 00:06:39,933
And besides, if you press Tab

134
00:06:39,933 --> 00:06:42,933
when you're writing a review,
well this will go to the next

135
00:06:43,400 --> 00:06:47,533
you know button to like submit
a review or something else.

136
00:06:47,533 --> 00:06:50,533
But by pressing the Tab key
when you are writing your review,

137
00:06:50,600 --> 00:06:54,366
you would get out of the review and you
would not be able to continue to write it.

138
00:06:54,366 --> 00:06:57,833
So we will never find a tab in the review.

139
00:06:57,833 --> 00:07:01,566
And that's why we will
never have this problem of getting these

140
00:07:01,900 --> 00:07:06,533
anomalies due to duplicate delimiters
in one specific review.

141
00:07:07,233 --> 00:07:09,866
So I really recommend preparing

142
00:07:09,866 --> 00:07:13,066
your text data
sets this way with a tab separator,

143
00:07:13,400 --> 00:07:15,700
because you will never have
that kind of problem.

144
00:07:15,700 --> 00:07:18,500
One other solution,
if you really want to use a CSV,

145
00:07:18,500 --> 00:07:21,566
would be to include
some double quotes here

146
00:07:21,666 --> 00:07:24,266
one on the left of the review
and one on the right,

147
00:07:24,266 --> 00:07:28,333
but you would still get some problems
in case you have some double quotes

148
00:07:28,333 --> 00:07:31,333
in the reviews themselves.
I'm sure we can find one.

149
00:07:31,400 --> 00:07:31,933
Let's have a look.

150
00:07:31,933 --> 00:07:36,233
Let's press Command-F
to find a double quote. Here we are.

151
00:07:36,233 --> 00:07:39,000
And that's exactly
what I'm talking about.

152
00:07:39,000 --> 00:07:41,766
For example,
let's have a look at this review here.

153
00:07:41,766 --> 00:07:44,566
The description said "yum yum sauce".

154
00:07:44,566 --> 00:07:49,000
Well that's because this person here
is quoting a description found somewhere.

155
00:07:49,233 --> 00:07:53,400
And so since they're quoting, they're using
some double quotes here: "yum yum sauce".

156
00:07:53,700 --> 00:07:54,966
And another one here.

157
00:07:54,966 --> 00:08:00,000
So even if you put some double quotes
to separate your reviews from the result,

158
00:08:00,300 --> 00:08:01,500
that is 1 or 0.

159
00:08:01,500 --> 00:08:03,600
Well you would still have
this kind of problem.

160
00:08:03,600 --> 00:08:08,200
Whereas if you separate your review
and the liked variable by a tab,

161
00:08:08,400 --> 00:08:10,033
you will never get this kind of problem

162
00:08:10,033 --> 00:08:13,033
because no one will press Tab
while writing a review.

163
00:08:13,366 --> 00:08:16,066
So definitely that's the one we'll go for.

164
00:08:16,066 --> 00:08:20,900
Restaurant_Reviews.tsv,
tab-separated values.

165
00:08:21,166 --> 00:08:24,466
And by the way this is a data
set taken from the paper "From Group

166
00:08:24,466 --> 00:08:28,200
to Individual Labels
Using Deep Features" by Kotzias et al.

167
00:08:28,733 --> 00:08:30,133
So we will use this data set.

168
00:08:30,133 --> 00:08:33,266
And this contains 1000 reviews.

169
00:08:33,533 --> 00:08:37,000
And for each of the reviews
we have the real result 0 or 1.

170
00:08:37,633 --> 00:08:38,566
So let's do it.

171
00:08:38,566 --> 00:08:41,400
Let's start implementing our algorithm.

172
00:08:41,400 --> 00:08:46,066
And we will start with the first step
that is to import this data set.

173
00:08:46,066 --> 00:08:49,200
Restaurant_Reviews.tsv, into R.

174
00:08:49,533 --> 00:08:50,400
So let's do it.

175
00:08:50,400 --> 00:08:54,666
Let's close
this and let's go back to RStudio.

176
00:08:55,500 --> 00:08:58,133
All right.
So now let's import the data set.

177
00:08:58,133 --> 00:09:02,966
So as usual we're going to call our data
set dataset, this way.

178
00:09:03,266 --> 00:09:04,400
And then equals.

179
00:09:04,400 --> 00:09:08,900
And then that's where we use a function
to import the data set.

180
00:09:09,200 --> 00:09:13,200
However so far
we've been using the read.csv function

181
00:09:13,200 --> 00:09:18,133
to import our data sets because simply
our data sets were CSV files.

182
00:09:18,366 --> 00:09:22,600
But as we just understood this time
we're not dealing with a CSV file.

183
00:09:22,700 --> 00:09:24,666
We are dealing with a TSV file.

184
00:09:24,666 --> 00:09:27,766
So of course
things might be different now, but

185
00:09:27,766 --> 00:09:30,766
we will still type

186
00:09:30,800 --> 00:09:33,800
read.csv here and then some parentheses.

187
00:09:34,166 --> 00:09:38,033
And then let's press F1 here
to get some info about this.

188
00:09:38,033 --> 00:09:39,666
read.csv function.

189
00:09:39,666 --> 00:09:40,800
So what do we see?

190
00:09:40,800 --> 00:09:44,433
First we see that
we don't only have one import function.

191
00:09:45,300 --> 00:09:47,700
Indeed,
we can see here that we have the

192
00:09:47,700 --> 00:09:50,700
read.table function,
which we haven't used yet.

193
00:09:51,166 --> 00:09:52,733
The read.csv function,

194
00:09:52,733 --> 00:09:56,166
which is the function we've been using
since the beginning of this course.

195
00:09:56,600 --> 00:10:01,300
Then we also have the
read.csv2 function, which is the same

196
00:10:01,300 --> 00:10:05,000
as this one, with the only difference
that the default separator,

197
00:10:05,000 --> 00:10:09,233
you know, the delimiter that is separating
your columns is a semicolon

198
00:10:09,633 --> 00:10:13,966
instead of a comma as the default
parameter for the read.csv function.

199
00:10:14,133 --> 00:10:16,866
So that's the main difference
between the two.

200
00:10:16,866 --> 00:10:19,533
But that's not what we are interested
in right now,

201
00:10:19,533 --> 00:10:22,600
because we would like to use a function
where the default parameter

202
00:10:22,600 --> 00:10:26,666
for the separator is a tab
and not a semicolon.

203
00:10:26,666 --> 00:10:31,566
We could still use this read.csv function
and change the sep parameter.

204
00:10:31,833 --> 00:10:34,566
But you know, let's use another function
for once

205
00:10:34,566 --> 00:10:38,100
to import the data
set with the right default parameter.

206
00:10:38,466 --> 00:10:42,600
And speaking of the default parameter,
which should be the tab separator.

207
00:10:43,000 --> 00:10:45,933
Well,
that's actually the next import function,

208
00:10:45,933 --> 00:10:49,366
which is the read.delim function.

209
00:10:49,800 --> 00:10:55,133
Indeed, you can see here that the default
parameter for the separator is a tab.

210
00:10:55,333 --> 00:10:57,433
"\t" here means tab, actually.

211
00:10:57,433 --> 00:11:00,433
So that's exactly the function we want.

212
00:11:00,500 --> 00:11:03,600
That's the best function to use for our data
set right now,

213
00:11:03,766 --> 00:11:07,633
because our data set contains
columns separated by a tab.

214
00:11:08,066 --> 00:11:10,366
So that's the function we'll use.

215
00:11:10,366 --> 00:11:14,033
So I'm going to remove read.csv here

216
00:11:14,333 --> 00:11:17,333
and replace it with read.delim.

217
00:11:17,733 --> 00:11:18,433
Here we go.

218
00:11:18,433 --> 00:11:21,000
And now we input the parameters.

219
00:11:21,000 --> 00:11:24,000
So it's the same principle as for read.csv.

220
00:11:24,100 --> 00:11:28,866
We of course need to input first
the data set in quotes.

221
00:11:29,100 --> 00:11:31,433
And the data set is called

222
00:11:33,700 --> 00:11:36,700
Restaurant_Reviews.tsv.

223
00:11:36,933 --> 00:11:38,766
So we need to specify this because indeed

224
00:11:38,766 --> 00:11:42,766
in our working directory
we have the two files, CSV and TSV.

225
00:11:42,766 --> 00:11:44,966
So we need to specify .tsv here.

226
00:11:44,966 --> 00:11:46,333
That's the first parameter.

227
00:11:46,333 --> 00:11:50,300
And then we have some other parameters
like this header parameter here

228
00:11:50,333 --> 00:11:54,133
which is by default equal to TRUE,
meaning that it considers

229
00:11:54,133 --> 00:11:55,833
the first line of our data set

230
00:11:55,833 --> 00:11:59,633
as the titles of the columns,
which is the case for our data set.

231
00:11:59,633 --> 00:12:05,233
Because remember the first line is review
tab liked and review is the title

232
00:12:05,233 --> 00:12:08,500
of the first column, and liked
is the title of the second column.

233
00:12:08,500 --> 00:12:10,400
So we're good with this header parameter.

234
00:12:10,400 --> 00:12:12,566
So we don't need to input this.

235
00:12:12,566 --> 00:12:15,066
And same for this next parameter, sep,

236
00:12:15,066 --> 00:12:18,600
because by default the parameter
for the separator is a tab.

237
00:12:18,866 --> 00:12:20,700
And that's exactly what we need right now.

238
00:12:20,700 --> 00:12:23,166
And then we have this parameter, quote.

239
00:12:23,166 --> 00:12:27,600
And that's a very useful parameter
to input for natural language processing.

240
00:12:27,900 --> 00:12:30,900
Because most of the time
you'll find some quotes,

241
00:12:31,166 --> 00:12:34,066
most of the time
double quotes in your text.

242
00:12:34,066 --> 00:12:37,066
We checked that
we had some in our reviews,

243
00:12:37,066 --> 00:12:40,666
and so we need to ignore these quotes
because we don't want to have some kind

244
00:12:40,666 --> 00:12:44,800
of misinterpretation when our reading
function reads all the reviews.

245
00:12:45,033 --> 00:12:48,133
So in general, in natural language
processing, it's better to ignore

246
00:12:48,133 --> 00:12:49,366
any kind of quotes.

247
00:12:49,366 --> 00:12:52,266
We did exactly the same in Python
and everything went well,

248
00:12:52,266 --> 00:12:53,700
so we'll do the same here.

249
00:12:53,700 --> 00:12:57,900
And to do this we add this quote parameter

250
00:12:58,233 --> 00:13:02,700
and we set it equal
to, actually, nothing in quotes.

251
00:13:03,133 --> 00:13:06,133
You know
by putting nothing in these quotes here,

252
00:13:06,133 --> 00:13:09,233
that means that it's ignoring
any kind of quotes in the text.

253
00:13:09,400 --> 00:13:10,400
So that's good.

254
00:13:10,400 --> 00:13:15,666
And now we'll add a last parameter
that is not specified here

255
00:13:15,933 --> 00:13:18,933
and which is the
stringsAsFactors parameter.

256
00:13:19,200 --> 00:13:20,900
And what is this parameter used for?

257
00:13:20,900 --> 00:13:25,733
Well you know the first column of
our data set contains the written reviews.

258
00:13:26,133 --> 00:13:29,433
And you know, in R when we're making
some classification models

259
00:13:29,700 --> 00:13:33,033
which will be what we'll be doing here
in natural language processing,

260
00:13:33,033 --> 00:13:34,166
because basically

261
00:13:34,166 --> 00:13:37,866
we'll be classifying the reviews and telling
whether they are positive or negative.

262
00:13:38,266 --> 00:13:40,100
So that's classification.

263
00:13:40,100 --> 00:13:40,600
And you know,

264
00:13:40,600 --> 00:13:41,166
when we're doing

265
00:13:41,166 --> 00:13:44,633
some classification models and working
with some categorical variables,

266
00:13:44,933 --> 00:13:47,000
well remember we use this factor function

267
00:13:47,000 --> 00:13:50,566
to specify the categorical variables
as factors.

268
00:13:51,033 --> 00:13:53,466
And you know right now
we have some reviews.

269
00:13:53,466 --> 00:13:56,433
And since in some way
it's not a numeric variable,

270
00:13:56,433 --> 00:14:00,266
you know, taking some continuous
real values, well in some way it can

271
00:14:00,266 --> 00:14:03,866
be considered as a categorical variable
having some different factors.

272
00:14:04,066 --> 00:14:07,033
But in natural language processing
we must not identify

273
00:14:07,033 --> 00:14:10,866
the reviews as factors in R
and that's because we will analyze

274
00:14:10,866 --> 00:14:13,666
the inside of the reviews,
because we'll be analyzing

275
00:14:13,666 --> 00:14:16,100
the different words of the review
to understand the correlations

276
00:14:16,100 --> 00:14:18,133
between the presence of the words

277
00:14:18,133 --> 00:14:21,000
and the result, whether the review
is positive or negative.

278
00:14:21,000 --> 00:14:25,933
So since we'll drill into the review
and analyze its content, well,

279
00:14:25,933 --> 00:14:30,100
we must not specify the review as factors
as if it was a single entity,

280
00:14:30,366 --> 00:14:34,733
because that's what a factor would be,
a single entity having a single meaning,

281
00:14:34,733 --> 00:14:38,033
regardless of the different meanings
of the different words of the review.

282
00:14:38,533 --> 00:14:42,100
And so to prevent R from identifying
those reviews as factors,

283
00:14:42,366 --> 00:14:43,366
well, what we need to do

284
00:14:43,366 --> 00:14:47,900
is add this other parameter,
which is the stringsAsFactors parameter.

285
00:14:47,900 --> 00:14:50,700
Here it is. I just need to press enter.

286
00:14:50,700 --> 00:14:54,233
And now we just need to input FALSE
like this.

287
00:14:54,233 --> 00:14:55,233
Not in quotes.

288
00:14:55,233 --> 00:14:58,400
And that will not identify
the reviews as factors.

289
00:14:58,833 --> 00:14:59,900
And that's all.

290
00:14:59,900 --> 00:15:03,333
That's
how we should import this text file.

291
00:15:03,733 --> 00:15:08,600
You know, using the read.delim
function to import a TSV file by default,

292
00:15:08,933 --> 00:15:12,266
And then add this quote parameter
to ignore the quotes and then add

293
00:15:12,266 --> 00:15:17,100
the stringsAsFactors parameter to prevent
the reviews from being identified as factors.

294
00:15:17,433 --> 00:15:18,900
All right. So let's do it.

295
00:15:18,900 --> 00:15:21,966
Let's select
this line of code and execute.

296
00:15:22,433 --> 00:15:24,533
All right, we've got our data set well

297
00:15:24,533 --> 00:15:25,266
imported.

298
00:15:25,266 --> 00:15:27,633
As you can see it has 1000 observations.

299
00:15:27,633 --> 00:15:31,100
That means that the cut
between the review column

300
00:15:31,100 --> 00:15:34,633
and the liked column was done
very properly without any issue.

301
00:15:34,833 --> 00:15:37,833
So now let's open
our data set and let's have a look.

302
00:15:38,633 --> 00:15:41,700
As you can see,
all the reviews are very well separated

303
00:15:41,700 --> 00:15:46,033
from their verdict, whether it's a positive
or negative review.

304
00:15:46,400 --> 00:15:48,566
And so everything here looks great.

305
00:15:48,566 --> 00:15:51,600
And we need to make sure
that we have our 1000 reviews.

306
00:15:51,600 --> 00:15:53,866
Well we can see that here very easily.

307
00:15:53,866 --> 00:15:56,000
But you know we have our 1000 reviews.

308
00:15:56,000 --> 00:15:59,100
And when I scroll up we can see that
all the reviews are

309
00:15:59,100 --> 00:16:03,600
well in the review column
and all the liked results,

310
00:16:03,600 --> 00:16:07,633
0 or 1, are well in the liked column here.

311
00:16:07,933 --> 00:16:11,333
You see, if I scroll up, we don't have
any review in the liked column

312
00:16:11,566 --> 00:16:14,800
or a 1 or 0 in the review column.

313
00:16:14,966 --> 00:16:16,666
So everything looks great.

314
00:16:16,666 --> 00:16:20,266
We are ready to move on to the next step,
which will be to clean

315
00:16:20,266 --> 00:16:21,600
the different reviews.

316
00:16:21,600 --> 00:16:22,800
That is a compulsory step

317
00:16:22,800 --> 00:16:26,166
in natural language processing,
which consists of cleaning the text

318
00:16:26,600 --> 00:16:30,900
to make it ready for our future machine
learning algorithms.

319
00:16:31,366 --> 00:16:33,200
So that's what
we'll do in the next tutorial.

320
00:16:33,200 --> 00:16:35,066
And until then, enjoy machine learning.
