1
00:00:00,133 --> 00:00:01,133
All right, let's do this.

2
00:00:01,133 --> 00:00:05,266
Let's begin our implementation
of the natural language processing,

3
00:00:05,266 --> 00:00:07,266
well, you know,
branch of machine learning, but more

4
00:00:07,266 --> 00:00:08,333
specifically,

5
00:00:08,333 --> 00:00:11,966
of an NLP model made
for sentiment analysis.

6
00:00:12,600 --> 00:00:12,900
All right.

7
00:00:12,900 --> 00:00:16,900
So as usual, we're going to start
as efficiently as we can.

8
00:00:16,966 --> 00:00:19,800
We're going to use our data
preprocessing template,

9
00:00:19,800 --> 00:00:21,500
which I've of course prepared

10
00:00:21,500 --> 00:00:22,900
for this implementation,

11
00:00:22,900 --> 00:00:27,133
which contains, you know, the code
to import the libraries and import

12
00:00:27,133 --> 00:00:28,033
the dataset.

13
00:00:28,033 --> 00:00:29,766
So let's quickly start with the libraries.

14
00:00:29,766 --> 00:00:31,333
Here I'm going to take them.

15
00:00:31,333 --> 00:00:33,033
I'm going to paste that right

16
00:00:33,033 --> 00:00:35,100
here in a new code cell

17
00:00:35,100 --> 00:00:37,333
to indeed import the essential libraries.

18
00:00:37,333 --> 00:00:38,700
You know, just in case we need them.

19
00:00:38,700 --> 00:00:40,566
It doesn't mean that we will necessarily

20
00:00:40,566 --> 00:00:43,600
use all of them, but at least
we have them in case we need them.

21
00:00:43,633 --> 00:00:44,500
Okay.

22
00:00:44,500 --> 00:00:45,600
Then, importing the data

23
00:00:45,600 --> 00:00:47,866
set: let's create a new code cell.

24
00:00:47,866 --> 00:00:50,600
And now, according to you,
do I have to take

25
00:00:50,600 --> 00:00:53,700
all the lines of code here
or just this one to get

26
00:00:53,700 --> 00:00:55,000
the data set?

27
00:00:55,000 --> 00:00:58,200
Well, as you might guess,
now we're going to do a different kind

28
00:00:58,200 --> 00:00:59,533
of data preprocessing,

29
00:00:59,533 --> 00:01:00,166
and therefore

30
00:01:00,166 --> 00:01:01,033
we'll just take

31
00:01:01,033 --> 00:01:04,600
this line of code to indeed import
the reviews inside,

32
00:01:04,600 --> 00:01:06,566
still, a dataset variable.

33
00:01:06,566 --> 00:01:07,566
But then you will see

34
00:01:07,566 --> 00:01:09,966
that there will be some work needed

35
00:01:09,966 --> 00:01:11,733
before creating these two features.

36
00:01:11,733 --> 00:01:14,200
We will indeed create these two features
at some point,

37
00:01:14,200 --> 00:01:17,700
you know, the matrix of features
and the dependent variable, but not now.

38
00:01:17,700 --> 00:01:18,700
This is too early.

39
00:01:18,700 --> 00:01:22,433
We will have to clean the text first
and prepare the bag of words model.

40
00:01:22,433 --> 00:01:23,133
And in fact,

41
00:01:23,133 --> 00:01:24,766
we will create these two

42
00:01:24,766 --> 00:01:28,366
entities, a matrix of features
and a dependent variable vector, in the

43
00:01:28,366 --> 00:01:30,300
cell where we create the bag of

44
00:01:30,300 --> 00:01:31,066
words model.

45
00:01:31,066 --> 00:01:32,300
Okay. So let's just

46
00:01:32,300 --> 00:01:34,300
take this for now, the data set,

47
00:01:34,300 --> 00:01:37,833
and back into our NLP implementation.

48
00:01:38,100 --> 00:01:40,566
Let's paste that right here. And now,

49
00:01:40,566 --> 00:01:42,966
indeed, we have to adapt this a little,

50
00:01:42,966 --> 00:01:44,766
because now we're not dealing with a

51
00:01:44,766 --> 00:01:46,000
CSV file.

52
00:01:46,000 --> 00:01:48,966
We're dealing with a TSV file,
where the feature,

53
00:01:48,966 --> 00:01:52,100
meaning the text,
and the binary variable, 0 or 1,

54
00:01:52,300 --> 00:01:55,300
are separated by a tab
instead of a comma.

55
00:01:55,533 --> 00:01:56,600
So first things

56
00:01:56,600 --> 00:02:00,000
first, let's replace this data set by the

57
00:02:00,000 --> 00:02:00,833
right name.

58
00:02:00,833 --> 00:02:04,466
You notice that I even included the
extension because we'll have to change it.

59
00:02:04,766 --> 00:02:06,800
So the name of our data set,

60
00:02:06,800 --> 00:02:07,766
let's have a look

61
00:02:07,766 --> 00:02:08,500
again,

62
00:02:08,500 --> 00:02:12,533
is Restaurant Reviews dot TSV.

63
00:02:12,900 --> 00:02:13,566
All right.

64
00:02:13,566 --> 00:02:15,500
So that's exactly what we'll replace here.

65
00:02:15,500 --> 00:02:18,733
Restaurant underscore Reviews

66
00:02:20,200 --> 00:02:22,700
dot tsv. Okay.

67
00:02:22,700 --> 00:02:24,300
And now, since it is a

68
00:02:24,300 --> 00:02:26,333
TSV, we'll have to add

69
00:02:26,333 --> 00:02:28,966
some extra parameters to specify

70
00:02:28,966 --> 00:02:32,333
that indeed we're dealing with a TSV
file instead of a comma

71
00:02:32,433 --> 00:02:34,533
separated values file, CSV.

72
00:02:34,533 --> 00:02:34,900
All right.

73
00:02:34,900 --> 00:02:36,500
And the way to do this is just to add

74
00:02:36,500 --> 00:02:39,500
one parameter here, which is delimiter.

75
00:02:39,700 --> 00:02:40,533
All right.

76
00:02:40,533 --> 00:02:43,266
For which the default value is actually

77
00:02:43,266 --> 00:02:44,133
the comma, meaning

78
00:02:44,133 --> 00:02:46,966
that the default data set
that we can import

79
00:02:46,966 --> 00:02:50,366
with this read underscore
csv is indeed a CSV.

80
00:02:50,700 --> 00:02:51,900
But you know, we can also

81
00:02:51,900 --> 00:02:52,900
use this read

82
00:02:52,900 --> 00:02:55,333
underscore csv function to import

83
00:02:55,333 --> 00:02:57,466
a TSV file. And that's exactly

84
00:02:57,466 --> 00:02:59,000
what we're about to do now.

85
00:02:59,000 --> 00:03:01,200
But the way to specify
that we're dealing with a

86
00:03:01,200 --> 00:03:02,800
TSV file is to

87
00:03:02,800 --> 00:03:05,000
enter
the following value for this delimiter

88
00:03:05,000 --> 00:03:07,666
parameter, which is, in quotes,

89
00:03:07,666 --> 00:03:10,766
this slash here, backslash, and t.

90
00:03:11,166 --> 00:03:13,033
All right.
That's the value of the delimiter

91
00:03:13,033 --> 00:03:13,833
you should enter to

92
00:03:13,833 --> 00:03:16,833
specify that your data set is a TSV file.

93
00:03:17,133 --> 00:03:18,200
But then that's not all.

94
00:03:18,200 --> 00:03:20,300
We need to add one final parameter.

95
00:03:20,300 --> 00:03:23,433
A very important one
when you're working with text.

96
00:03:23,766 --> 00:03:26,833
I'm going to show you something
now, not in this

97
00:03:26,833 --> 00:03:27,566
data set,

98
00:03:27,566 --> 00:03:29,833
because we couldn't
see the whole reviews,

99
00:03:29,833 --> 00:03:33,666
but I'm going to show you the whole data
set inside the folder machine learning

100
00:03:33,700 --> 00:03:36,700
data set, which you could download
once again in the article

101
00:03:36,700 --> 00:03:39,600
right before this tutorial.
So let's open it.

102
00:03:39,600 --> 00:03:43,766
Let's go into Part 7, NLP,
then NLP again, and Python.

103
00:03:43,766 --> 00:03:45,666
And that's the whole data set.

104
00:03:45,666 --> 00:03:47,800
So I'm on Mac here,
so I'm going to open it

105
00:03:47,800 --> 00:03:50,633
with a classic text editor like TextEdit.

106
00:03:50,633 --> 00:03:51,433
Perfect.

107
00:03:51,433 --> 00:03:53,866
We just need to have a look
at the text quickly.

108
00:03:53,866 --> 00:03:55,000
So there we go.

109
00:03:55,000 --> 00:03:58,900
And now I'm just going to do a Command
or Control F to find something,

110
00:03:59,366 --> 00:04:02,933
which is a double quote,
just like that. Okay.

111
00:04:03,633 --> 00:04:06,300
And as we see,
we have many

112
00:04:06,300 --> 00:04:09,300
double quotes within the text.
All right.

113
00:04:09,600 --> 00:04:10,733
And in order to

114
00:04:10,733 --> 00:04:12,066
process this the right

115
00:04:12,066 --> 00:04:15,233
way, you know,
when our machine learning models learn how to

116
00:04:15,233 --> 00:04:17,700
read text, well, we'll have to tell

117
00:04:17,700 --> 00:04:20,433
our model to ignore the double quotes.

118
00:04:20,433 --> 00:04:24,133
Otherwise, you know, if you don't do it,
this can cause some processing

119
00:04:24,133 --> 00:04:25,366
or splicing errors

120
00:04:25,366 --> 00:04:29,033
which you want to avoid, you know, because
this can lead to an execution error.
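
[Editor's sketch] To illustrate the kind of error described here, the following minimal Python example parses a small made-up sample twice: once with the default quote handling and once with quotes ignored. The sample text, the column names Review and Liked, and the helper count_rows are assumptions for illustration only, not taken from the course files. Strictly speaking, it is the file parser (read_csv), rather than the model, that is told to ignore the quotes.

```python
import csv
import io

import pandas as pd

# Made-up sample: the first review opens a double quote that only closes on the next line.
sample_tsv = (
    "Review\tLiked\n"
    '"Loved it\t1\n'
    'So "good\t1\n'
)


def count_rows(**kwargs):
    """Return the number of parsed rows, or None if the parser raises an error."""
    try:
        return len(pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", **kwargs))
    except pd.errors.ParserError:
        return None


# Default: double quotes are interpreted as field wrappers, so rows can merge or fail.
rows_default = count_rows()
# quoting=csv.QUOTE_NONE (the integer 3): double quotes are kept as plain characters.
rows_ignored = count_rows(quoting=csv.QUOTE_NONE)

print("default quoting:", rows_default)
print("quoting=3:", rows_ignored)
```

With quotes ignored, each line of the sample stays a separate row; with the default handling, the stray quote changes how the rows are split.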

121
00:04:29,233 --> 00:04:30,866
So I always recommend

122
00:04:30,866 --> 00:04:33,166
adding this quoting parameter and setting its

123
00:04:33,166 --> 00:04:34,366
value to three,

124
00:04:34,366 --> 00:04:37,800
which actually means no quotes, or,
you know, ignore the quotes,

125
00:04:38,066 --> 00:04:40,933
so that indeed
you can be free from processing errors.

126
00:04:40,933 --> 00:04:43,200
You can see there are many quotes, right?

127
00:04:43,200 --> 00:04:43,633
So we're

128
00:04:43,633 --> 00:04:45,933
just going to ignore all of them as if,
you know,

129
00:04:45,933 --> 00:04:48,066
they're just some different characters
in the

130
00:04:48,066 --> 00:04:48,933
text.

131
00:04:48,933 --> 00:04:49,400
All right.

132
00:04:49,400 --> 00:04:51,466
So that's all I wanted to show you.

133
00:04:51,466 --> 00:04:53,366
So now let's close this.

134
00:04:53,366 --> 00:04:56,200
And let's go back to our implementation.

135
00:04:56,200 --> 00:05:01,466
And to add this final parameter
we need to add here quoting equals.

136
00:05:01,633 --> 00:05:06,600
And the value of this quoting parameter
to ignore all the double quotes is three.

137
00:05:06,900 --> 00:05:07,600
All right.

138
00:05:07,600 --> 00:05:08,733
And now perfect.

139
00:05:08,733 --> 00:05:09,400
That's how

140
00:05:09,400 --> 00:05:09,900
you import

141
00:05:09,900 --> 00:05:12,400
correctly a TSV file, which should

142
00:05:12,400 --> 00:05:13,600
be the format of,

143
00:05:13,600 --> 00:05:18,166
you know, a data set separating text
and a binary outcome like zero or one.

144
00:05:18,300 --> 00:05:20,400
That's the classic way to proceed

145
00:05:20,400 --> 00:05:22,333
with sentiment analysis.

146
00:05:22,333 --> 00:05:23,133
So there we go.
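
[Editor's sketch] The import described in this part of the tutorial can be sketched in Python as follows, under stated assumptions. To keep the example self-contained, it reads a small in-memory stand-in instead of the real file; the two sample reviews and the column names Review and Liked are made up for illustration. In the notebook, the first argument would instead be the uploaded file name.

```python
import csv
import io

import pandas as pd

# Made-up stand-in for the TSV file: a header row, then one review and one 0/1 label per line.
sample_tsv = (
    "Review\tLiked\n"
    'The "special" was great.\t1\n'
    '"Never again, the service was slow.\t0\n'
)

# delimiter="\t" tells read_csv that the fields are separated by tabs, not commas.
# quoting=csv.QUOTE_NONE is the named constant for the integer 3 used in the tutorial:
# double quotes are treated as ordinary characters of the text.
dataset = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)

print(dataset.shape)
print(dataset)
```

Writing quoting=3 and quoting=csv.QUOTE_NONE is equivalent; the named constant just makes the intent easier to read.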

147
00:05:23,133 --> 00:05:25,833
Well, actually, let's import the data
set to make sure

148
00:05:25,833 --> 00:05:27,000
everything's all right.

149
00:05:27,000 --> 00:05:29,633
So we're going to click this folder here.

150
00:05:29,633 --> 00:05:31,800
Then it's going to take a little time,

151
00:05:31,800 --> 00:05:34,800
you know, a few seconds,
to connect this notebook

152
00:05:35,100 --> 00:05:38,100
to a runtime to enable file browsing.

153
00:05:38,366 --> 00:05:39,466
But in a second

154
00:05:39,466 --> 00:05:41,466
we should see that upload

155
00:05:41,466 --> 00:05:43,466
button here to indeed upload,

156
00:05:43,466 --> 00:05:45,366
there we go, that data set.

157
00:05:45,366 --> 00:05:46,800
So let's click it.

158
00:05:46,800 --> 00:05:50,900
And now please find your Machine
Learning A to Z folder on your machine,

159
00:05:50,900 --> 00:05:51,833
which you had to download

160
00:05:51,833 --> 00:05:55,366
either in the previous tutorial
or at the beginning of each section.

161
00:05:55,600 --> 00:05:56,833
So now let's go inside.

162
00:05:56,833 --> 00:06:00,133
Let's go once again into Part 7,
Natural Language Processing,

163
00:06:00,366 --> 00:06:03,766
then this section, then Python,
and then Restaurant

164
00:06:03,766 --> 00:06:05,666
Reviews dot TSV.

165
00:06:05,666 --> 00:06:07,966
Let's click open. Let's click okay.

166
00:06:07,966 --> 00:06:08,733
And now we're going to

167
00:06:08,733 --> 00:06:10,333
have the data

168
00:06:10,333 --> 00:06:12,500
set inside the notebook.

169
00:06:12,500 --> 00:06:13,200
All right. Perfect.

170
00:06:13,200 --> 00:06:14,866
So now let's run the cells.

171
00:06:14,866 --> 00:06:16,266
First, this cell where

172
00:06:16,266 --> 00:06:17,866
we import the libraries.

173
00:06:17,866 --> 00:06:19,600
So, a simple one.

174
00:06:19,600 --> 00:06:21,966
And now this cell where we import

175
00:06:21,966 --> 00:06:23,000
the data set.

176
00:06:23,000 --> 00:06:25,000
Let's do this.
Let's make sure everything goes well.

177
00:06:25,000 --> 00:06:26,833
And there we go.

178
00:06:26,833 --> 00:06:28,966
Now we have the data set ready.

179
00:06:28,966 --> 00:06:31,466
So that means we're ready
for the next step.

180
00:06:31,466 --> 00:06:32,600
Cleaning the text.

181
00:06:32,600 --> 00:06:35,866
That's an essential step
in natural language processing.

182
00:06:36,100 --> 00:06:40,900
I will show you all the techniques
to make your text as clean as possible.

183
00:06:40,900 --> 00:06:43,666
And we will do all this
in the next tutorial.

184
00:06:43,666 --> 00:06:45,533
Until then, enjoy machine learning.