1
00:00:00,300 --> 00:00:02,900
Hello
my friends, and welcome to this new part

2
00:00:02,900 --> 00:00:05,700
on Natural Language Processing.

3
00:00:05,700 --> 00:00:09,233
I'm super excited to start this part
because this is the branch of machine

4
00:00:09,233 --> 00:00:12,933
learning with which you can build
chatbots and machine translation.

5
00:00:13,200 --> 00:00:14,666
So of course, this is not

6
00:00:14,666 --> 00:00:18,000
what we're going to do in this part
because this is really advanced NLP.

7
00:00:18,166 --> 00:00:22,800
So we'll just cover the basics
with sentiment analysis, which consists

8
00:00:22,833 --> 00:00:26,100
of training a machine
to understand some text

9
00:00:26,233 --> 00:00:29,233
and predict
a certain outcome for this text.

10
00:00:29,366 --> 00:00:33,700
So in our case study here, these texts
will be reviews of a restaurant.

11
00:00:33,900 --> 00:00:35,266
And we'll have to train a machine

12
00:00:35,266 --> 00:00:39,133
to understand
if each review is positive or negative.

13
00:00:39,366 --> 00:00:41,400
So very simple, very classic.

14
00:00:41,400 --> 00:00:44,400
But the best way to be introduced to NLP.

15
00:00:44,700 --> 00:00:45,000
All right.

16
00:00:45,000 --> 00:00:48,233
So before we start let's make sure
everyone here is on the same page.

17
00:00:48,266 --> 00:00:51,666
This is the folder containing
all the code and data sets

18
00:00:51,800 --> 00:00:55,566
for which I gave you the link
right before this tutorial in the article.

19
00:00:55,633 --> 00:00:57,600
So make sure to connect to that link.

20
00:00:57,600 --> 00:00:58,600
And now there we go.

21
00:00:58,600 --> 00:01:02,366
We can enter Part 7,
Natural Language Processing.

22
00:01:02,900 --> 00:01:05,633
So in this part
you will only find one section.

23
00:01:05,633 --> 00:01:10,066
That's because we only do one case study
of NLP about sentiment analysis.

24
00:01:10,333 --> 00:01:12,300
However, you will see that you can try

25
00:01:12,300 --> 00:01:15,766
diverse machine
learning models to tackle the problem.

26
00:01:16,066 --> 00:01:17,533
Indeed, the essential

27
00:01:17,533 --> 00:01:21,433
part of our implementation
will be to build the bag of words model.

28
00:01:21,600 --> 00:01:26,100
But then once it is built,
we can try several classification models.

29
00:01:26,233 --> 00:01:27,633
Why classification models?

30
00:01:27,633 --> 00:01:31,966
That's because we'll have to predict,
you know, a binary outcome, 1 or 0: one,

31
00:01:31,966 --> 00:01:36,400
meaning the review is positive and zero,
meaning the review is negative.

32
00:01:36,600 --> 00:01:40,866
So you'll have actually the flexibility
to try several machine learning models.

33
00:01:40,866 --> 00:01:45,166
And this will actually be
your final exercise of this section.

34
00:01:45,533 --> 00:01:46,866
So there we go. Let's do this.

35
00:01:46,866 --> 00:01:49,000
Let's enter Section 36,

36
00:01:49,000 --> 00:01:51,300
Natural Language Processing.

37
00:01:51,300 --> 00:01:55,733
And as usual we're going to start with
Python in which you will find two files.

38
00:01:55,933 --> 00:01:59,300
The implementation, natural language
processing.ipynb,

39
00:01:59,633 --> 00:02:03,366
which you can open with either
Google Colaboratory or Jupyter Notebook.

40
00:02:03,666 --> 00:02:07,400
And our data set, restaurant
reviews dot, this time,

41
00:02:07,400 --> 00:02:11,633
not CSV, but TSV,
and this will be a good opportunity

42
00:02:11,633 --> 00:02:16,100
for me to train
you on how to import a TSV data set.

43
00:02:16,400 --> 00:02:22,200
TSV means tab-separated values instead of
comma-separated values like in a CSV.

44
00:02:22,200 --> 00:02:23,666
So basically the only difference

45
00:02:23,666 --> 00:02:26,733
is that in the previous data sets
we worked with, well,

46
00:02:26,733 --> 00:02:30,466
you know, the features and the dependent
variable were separated by commas.

47
00:02:30,633 --> 00:02:34,166
And in this one, well,
instead of being separated by commas,

48
00:02:34,200 --> 00:02:37,766
the reviews and the dependent variable
will be separated by a tab.

49
00:02:37,900 --> 00:02:39,000
And this makes sense, right?

50
00:02:39,000 --> 00:02:41,466
Because in the reviews
we already have commas

51
00:02:41,466 --> 00:02:43,733
and therefore they would create nonsense
features.

52
00:02:43,733 --> 00:02:46,500
But let me show you what this data set
looks like.

53
00:02:46,500 --> 00:02:47,166
So as you can see,

54
00:02:47,166 --> 00:02:51,033
there are only two columns,
the first one containing all the reviews.

55
00:02:51,033 --> 00:02:52,600
So for example, this is the first one.

56
00:02:52,600 --> 00:02:54,233
Wow, loved this place.

57
00:02:54,233 --> 00:02:57,166
A second one: crust is not good.

58
00:02:57,166 --> 00:02:59,666
Another one: a great touch, etc.

59
00:02:59,666 --> 00:03:03,000
So you have in total let's see, 1000
reviews.

60
00:03:03,366 --> 00:03:03,666
Right.

61
00:03:03,666 --> 00:03:05,200
So we're going to train our machine
learning model

62
00:03:05,200 --> 00:03:09,466
to actually understand text and predict
if the texts are positive or negative.

63
00:03:09,466 --> 00:03:11,666
With 1000 texts.

64
00:03:11,666 --> 00:03:12,200
All right.

65
00:03:12,200 --> 00:03:17,033
And then the second column is of course
if the review is positive or negative.

66
00:03:17,033 --> 00:03:20,033
So one means that it is positive
meaning the customer liked it.

67
00:03:20,300 --> 00:03:23,300
And zero means
that the review is negative.

68
00:03:23,466 --> 00:03:26,933
And of course we have the real outcomes
in order to train our machine

69
00:03:26,933 --> 00:03:31,800
learning model to understand if each of
these texts is positive or negative.

70
00:03:32,066 --> 00:03:35,466
So that's purely in the end,
you know, a classification problem.

71
00:03:35,700 --> 00:03:40,566
But the essential part of it is
that we'll train the machine to understand

72
00:03:40,566 --> 00:03:44,966
these texts first and then to predict
if they are positive or negative.

73
00:03:45,500 --> 00:03:45,933
All right.

74
00:03:45,933 --> 00:03:48,600
So very simple case study
very simple data set.

75
00:03:48,600 --> 00:03:51,900
That means we are ready
to start the implementation

76
00:03:51,900 --> 00:03:53,966
of natural language processing.

77
00:03:53,966 --> 00:03:55,300
So as you prefer

78
00:03:55,300 --> 00:03:59,100
feel free to open it with either
Google Colaboratory or Jupyter Notebook.

79
00:03:59,400 --> 00:04:00,200
I'm opening it

80
00:04:00,200 --> 00:04:03,900
with Google Colab as usual,
so feel free to do the same if you'd like.

81
00:04:04,000 --> 00:04:06,800
And now the Notebook is loading.

82
00:04:06,800 --> 00:04:10,766
In a second it will be laid out.
All right, loading, laying out.

83
00:04:10,766 --> 00:04:13,766
Perfect. And this is the implementation.

84
00:04:13,833 --> 00:04:16,333
And as usual this is in read only mode.

85
00:04:16,333 --> 00:04:19,333
And we want to re-implement
this from scratch.

86
00:04:19,366 --> 00:04:20,066
Therefore we're going

87
00:04:20,066 --> 00:04:24,200
to create a copy right away
so that we can modify the code inside.

88
00:04:24,466 --> 00:04:27,200
So we're going to click
"Save a copy in Drive" here.

89
00:04:27,200 --> 00:04:30,633
This will create a copy,
after which we will be able

90
00:04:30,633 --> 00:04:33,633
to modify the code and re-implement
this from scratch.

91
00:04:34,233 --> 00:04:37,200
And speaking of re-implementing it
from scratch.

92
00:04:37,200 --> 00:04:41,466
Well, let's delete all the code cells
because we will re-implement them.

93
00:04:41,666 --> 00:04:45,600
So let's click this trash button here
on each of the code cells,

94
00:04:45,600 --> 00:04:49,866
but not the text, so that we can keep that
well highlighted structure

95
00:04:50,100 --> 00:04:54,000
and see where we're going
at each step of the implementation.

96
00:04:54,366 --> 00:04:57,366
All right. So almost done.

97
00:04:57,400 --> 00:05:00,533
It's actually an implementation
in about ten steps.

98
00:05:00,833 --> 00:05:04,500
But you will recognize
some of the steps as steps we did before.

99
00:05:04,966 --> 00:05:07,233
You'll see I'm going to show you
in a second.

100
00:05:07,233 --> 00:05:07,700
All right.

101
00:05:07,700 --> 00:05:10,466
So let's have a look at the structure
of this implementation.

102
00:05:10,466 --> 00:05:13,633
We will start first
by importing the libraries as usual.

103
00:05:13,633 --> 00:05:17,000
Because indeed we will need
several libraries to preprocess

104
00:05:17,000 --> 00:05:20,000
our texts and train our future machine
learning model.

105
00:05:20,333 --> 00:05:21,833
Then we will import the data set.

106
00:05:21,833 --> 00:05:23,966
So that's actually the data
preprocessing phase.

107
00:05:23,966 --> 00:05:26,333
But not only that, the data preprocessing

108
00:05:26,333 --> 00:05:30,200
phase will also contain the next two cells
cleaning the text.

109
00:05:30,366 --> 00:05:33,900
Indeed, we will have to simplify the text
as much as we can

110
00:05:34,166 --> 00:05:38,100
in order to ease the learning process
of the machine learning model.

111
00:05:38,100 --> 00:05:40,200
You know, we'll have to remove
all the punctuation.

112
00:05:40,200 --> 00:05:42,500
We'll have to put
all the letters in lowercase.

113
00:05:42,500 --> 00:05:44,066
Then we'll have to apply stemming.

114
00:05:44,066 --> 00:05:47,366
You know, we'll have to make
very clean text to alleviate

115
00:05:47,466 --> 00:05:50,600
the learning process of the future
classification model we'll build.

116
00:05:50,800 --> 00:05:52,866
So that's a compulsory process.

117
00:05:52,866 --> 00:05:56,333
When doing NLP
you have to preprocess the text basically.

118
00:05:56,733 --> 00:05:59,400
Then we'll create the bag of words model

119
00:05:59,400 --> 00:06:02,400
which is at the heart
of sentiment analysis.

120
00:06:02,533 --> 00:06:03,933
And then there you go.

121
00:06:03,933 --> 00:06:06,200
That's
where you will recognize everything.

122
00:06:06,200 --> 00:06:07,633
Once we have the bag of words

123
00:06:07,633 --> 00:06:11,400
model, we basically have a data
set ready to be trained, right?

124
00:06:11,400 --> 00:06:14,800
We have a dataset ready
to be trained by a machine learning model.

125
00:06:14,800 --> 00:06:19,466
And that's why then we will just apply
the classic process of training a model.

126
00:06:19,666 --> 00:06:23,366
First, we will split the data
set into the training set and test set

127
00:06:23,566 --> 00:06:26,400
so that we can indeed
have a set where we train the model

128
00:06:26,400 --> 00:06:30,133
to understand text and predict
if the texts are positive or negative,

129
00:06:30,366 --> 00:06:33,366
and the test set
so that we can evaluate the performance

130
00:06:33,500 --> 00:06:36,500
on new texts
on which the model wasn't trained.

131
00:06:36,600 --> 00:06:38,433
And then there we go, we train.

132
00:06:38,433 --> 00:06:41,366
So I chose a Naive Bayes
model on the training set.

133
00:06:41,366 --> 00:06:42,633
But you will see that your

134
00:06:42,633 --> 00:06:46,933
final exercise at the end will be to try
the other classification models

135
00:06:46,933 --> 00:06:51,200
and see if you can beat the accuracy
I will get in this implementation.

136
00:06:51,600 --> 00:06:54,466
Then we will predict the test results
and finally

137
00:06:54,466 --> 00:06:58,133
we will make the confusion matrix
and get the final accuracy.

138
00:06:58,566 --> 00:06:59,566
So that's our structure.

139
00:06:59,566 --> 00:07:01,433
That's our NLP journey.

140
00:07:01,433 --> 00:07:05,133
So now as soon as you're ready
let's start in the next tutorial

141
00:07:05,133 --> 00:07:08,133
with the simple data preprocessing phase.

142
00:07:08,433 --> 00:07:09,600
I can't wait to start.

143
00:07:09,600 --> 00:07:12,800
See you in the next tutorial
and until then, enjoy machine learning!