1
00:00:00,200 --> 00:00:02,166
Hello and welcome to this R tutorial.

2
00:00:02,166 --> 00:00:05,700
So in this tutorial we are going to apply
PCA, and I have already prepared

3
00:00:05,700 --> 00:00:08,966
for you the required packages to apply this
first

4
00:00:08,966 --> 00:00:12,400
dimensionality reduction technique,
Principal Component Analysis.

5
00:00:12,700 --> 00:00:16,900
So the first of these packages is caret,
which I think we already installed.

6
00:00:16,900 --> 00:00:20,400
But if that's not the case
then you can check it here in packages.

7
00:00:20,566 --> 00:00:24,000
You can see if you have caret available
in this list of packages.

8
00:00:24,266 --> 00:00:27,700
If you don't see it here, you can execute
this line without the comment

9
00:00:28,000 --> 00:00:30,400
and this will install caret.

10
00:00:30,400 --> 00:00:31,933
So that's the first package.

11
00:00:31,933 --> 00:00:35,900
Then this is to actually import
the caret package.

12
00:00:35,900 --> 00:00:37,733
So we'll execute that as well.

13
00:00:37,733 --> 00:00:40,833
And we also need this other package
that we installed

14
00:00:40,833 --> 00:00:44,533
in Part 3,
Classification: the e1071 package.

15
00:00:44,700 --> 00:00:46,666
So normally you should have it installed.

16
00:00:46,666 --> 00:00:47,700
But if that's not the case,

17
00:00:47,700 --> 00:00:50,500
you can select this line
and install the package.

18
00:00:50,500 --> 00:00:53,933
And don't forget to execute this line
as well to load it.

19
00:00:54,266 --> 00:00:57,266
And now we are ready to start
applying PCA.
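
The setup described so far can be sketched roughly as follows. This is a minimal sketch, assuming the install lines stay commented out as in the video and are only uncommented when a package is missing from the Packages list.

```r
# install.packages('caret')   # uncomment and run once if caret is not installed
library(caret)                # provides the preProcess() function used for PCA
# install.packages('e1071')   # uncomment and run once if e1071 is not installed
library(e1071)                # installed earlier in Part 3, Classification
```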

20
00:00:57,366 --> 00:01:00,300
So the first thing that we're going to do
is create a new variable

21
00:01:00,300 --> 00:01:03,866
that we're going to call pca
that we will use afterwards

22
00:01:03,866 --> 00:01:08,466
to transform our original data
set composed of our 13 features,

23
00:01:08,733 --> 00:01:12,300
into this new data
set with the new extracted features.

24
00:01:12,666 --> 00:01:15,933
So now to create this object
we are going to use a function.

25
00:01:16,066 --> 00:01:18,933
This is the preProcess function.

26
00:01:18,933 --> 00:01:21,066
Here it is, from the caret package.

27
00:01:21,066 --> 00:01:23,866
And let's now press F1 here

28
00:01:23,866 --> 00:01:26,833
to see all the info on this preProcess
function.

29
00:01:26,833 --> 00:01:29,633
Because you're going to see
that you have some very useful parameters

30
00:01:29,633 --> 00:01:32,866
that allow you to apply PCA
according to your goals.

31
00:01:32,866 --> 00:01:37,366
For example, you can specify the minimum
ratio of explained variance

32
00:01:37,366 --> 00:01:38,266
you want to get.

33
00:01:38,266 --> 00:01:39,266
That means, for example,

34
00:01:39,266 --> 00:01:42,366
if you want to reduce the dimensionality
of your data set down

35
00:01:42,366 --> 00:01:46,533
to a number of features, that will explain
at least 60% of the variance,

36
00:01:46,700 --> 00:01:50,766
well, you can specify this with one of the
parameters of this preProcess function.

37
00:01:51,000 --> 00:01:52,666
So let's have a look at the info.

38
00:01:52,666 --> 00:01:53,566
That's the info.

39
00:01:53,566 --> 00:01:56,033
And let's jump to the arguments.

40
00:01:56,033 --> 00:01:59,500
So the first argument is x,
a matrix or a data frame.

41
00:01:59,700 --> 00:02:02,966
This is actually the data of which
we want to reduce the dimensionality.

42
00:02:03,166 --> 00:02:05,633
So this is going to be our training set.

43
00:02:05,633 --> 00:02:07,500
So x will be the training set.

44
00:02:07,500 --> 00:02:09,533
Then the next argument is method.

45
00:02:09,533 --> 00:02:12,600
So method is your dimensionality
reduction technique.

46
00:02:12,733 --> 00:02:16,200
So as you can see you have
several techniques of dimensionality

47
00:02:16,200 --> 00:02:18,433
reduction: PCA, ICA.

48
00:02:18,433 --> 00:02:20,000
So these are all the methods.

49
00:02:20,000 --> 00:02:24,266
But of course the one that we want to use
is PCA, principal component analysis.

50
00:02:24,600 --> 00:02:27,000
So we will use method = 'pca' here.

51
00:02:27,000 --> 00:02:27,966
Then thresh.

52
00:02:27,966 --> 00:02:30,000
Thresh is a very important parameter.

53
00:02:30,000 --> 00:02:31,466
That's what I've just told you.

54
00:02:31,466 --> 00:02:35,666
If you want to reduce the dimensionality
of your data set while keeping at least

55
00:02:35,666 --> 00:02:36,566
a minimum amount

56
00:02:36,566 --> 00:02:40,333
of explained variance, well, you can do it
by using the thresh parameter.

57
00:02:40,633 --> 00:02:44,733
And as you can see, it's a cutoff for
the cumulative percent of variance

58
00:02:44,966 --> 00:02:46,666
to be retained by PCA.

59
00:02:46,666 --> 00:02:47,466
So for example,

60
00:02:47,466 --> 00:02:51,600
if you want your new extracted features
to explain at least 60% of the variance,

61
00:02:51,766 --> 00:02:56,400
well you need to specify here
thresh = 0.6, that is, 60%.
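
As a side illustration of the thresh argument just described, here is a minimal sketch. The data frame named features is a hypothetical stand-in with 13 random numeric columns, not the data set from the video, and the variable names pca_thresh and reduced are made up for this example.

```r
library(caret)

# Hypothetical stand-in data: 100 rows, 13 numeric features (V1..V13)
set.seed(42)
features = as.data.frame(matrix(rnorm(100 * 13), ncol = 13))

# Keep as many principal components as needed to reach 60% cumulative variance
pca_thresh = preProcess(x = features, method = 'pca', thresh = 0.6)
reduced = predict(pca_thresh, features)

# Number of components that were kept to reach the cutoff
ncol(reduced)
```

With thresh, the number of components is decided by the data rather than fixed in advance, which is why it is not used when exactly two components are wanted.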

62
00:02:57,200 --> 00:02:59,500
But we're not going to use this thresh

63
00:02:59,500 --> 00:03:02,500
parameter here
because we already know what we want.

64
00:03:02,500 --> 00:03:05,866
What we want is two independent variables,
because we want to be able

65
00:03:05,866 --> 00:03:08,866
to visualize the training set results
and the test set results,

66
00:03:09,000 --> 00:03:13,300
and that we will be able to get
with the next parameter, pcaComp.

67
00:03:13,633 --> 00:03:17,000
That is the specific number
of PCA components to keep.

68
00:03:17,000 --> 00:03:21,433
So that's exactly the number of extracted
features you want to obtain in the end.

69
00:03:21,900 --> 00:03:27,233
So here we will input pcaComp = 2,
so that our training set,

70
00:03:27,233 --> 00:03:31,766
our original training set will go
from having 13 independent variables.

71
00:03:31,800 --> 00:03:36,833
The 13 original independent variables
that we had in our data set to having

72
00:03:36,866 --> 00:03:41,066
two new extracted features
that explain the most variance.

73
00:03:41,533 --> 00:03:47,033
And as you can see, if we specify
this parameter, it overrides thresh.

74
00:03:47,233 --> 00:03:50,800
So that's why we don't need to specify
the thresh parameter to specify

75
00:03:50,800 --> 00:03:53,800
a minimum
cumulative percent of explained variance.

76
00:03:54,333 --> 00:03:54,733
All right.

77
00:03:54,733 --> 00:03:57,700
And then you have other parameters
but we won't use them.

78
00:03:57,700 --> 00:04:02,100
We actually only need our x
to specify the data we want to transform

79
00:04:02,100 --> 00:04:05,100
to extract the new features,
the method PCA

80
00:04:05,200 --> 00:04:07,800
and the number of extracted features
we want to get.

81
00:04:07,800 --> 00:04:10,800
In the end,
that is two new extracted features.

82
00:04:11,200 --> 00:04:12,666
So let's input the arguments.

83
00:04:12,666 --> 00:04:15,666
Let's start with the first one x equals.

84
00:04:16,100 --> 00:04:18,166
So that's training set.

85
00:04:18,166 --> 00:04:19,033
Here we go.

86
00:04:19,033 --> 00:04:22,500
And actually
we need to specify the features.

87
00:04:22,800 --> 00:04:24,700
And actually
that's not the whole training set.

88
00:04:24,700 --> 00:04:29,333
Because remember PCA is an unsupervised
dimensionality reduction technique.

89
00:04:29,333 --> 00:04:32,366
That means that we don't consider
the dependent variable

90
00:04:32,500 --> 00:04:34,300
to extract the new features.

91
00:04:34,300 --> 00:04:37,300
So we actually need to remove here
the dependent variable.

92
00:04:37,466 --> 00:04:39,633
And remember this has index 14.

93
00:04:39,633 --> 00:04:42,800
So the way we can do that
is the same as we did for feature scaling.

94
00:04:43,033 --> 00:04:46,033
That means we just add here -14.

95
00:04:46,366 --> 00:04:48,666
All right. So now PCA will be applied

96
00:04:48,666 --> 00:04:52,266
on all the features
the 13 features of our training set.

97
00:04:53,233 --> 00:04:53,833
Perfect.

98
00:04:53,833 --> 00:04:56,866
Now the next argument is method.

99
00:04:56,866 --> 00:05:00,966
And as we said,
method equals, in quotes, pca.

100
00:05:01,600 --> 00:05:05,466
All right then
comma next argument and last argument.

101
00:05:05,700 --> 00:05:07,966
And as we said, that's pcaComp.

102
00:05:07,966 --> 00:05:10,533
So pcaComp,

103
00:05:10,533 --> 00:05:13,966
and what we want
is two new extracted features.

104
00:05:14,666 --> 00:05:15,000
All right.

105
00:05:15,000 --> 00:05:18,000
So that creates the PCA object

106
00:05:18,033 --> 00:05:21,033
that we will then use on our training set

107
00:05:21,066 --> 00:05:25,533
to transform our original training set
composed of our 13 independent variables

108
00:05:25,833 --> 00:05:29,200
to this new training
set of reduced dimensionality.

109
00:05:29,433 --> 00:05:30,433
And that will contain

110
00:05:30,433 --> 00:05:33,666
the two new extracted features
that will explain the most variance.
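
The call built up over the last few steps can be sketched as follows. The training set here is a hypothetical stand-in (13 random numeric features plus a Customer_Segment column in position 14), since the real data set is not part of this transcript; only the preProcess call itself follows the narration.

```r
library(caret)

# Hypothetical stand-in for the training set:
# 13 numeric features (V1..V13) plus the dependent variable in column 14
set.seed(42)
training_set = as.data.frame(matrix(rnorm(100 * 13), ncol = 13))
training_set$Customer_Segment = sample(1:3, 100, replace = TRUE)

# Fit PCA on the 13 features only; [-14] drops the dependent variable
pca = preProcess(x = training_set[-14], method = 'pca', pcaComp = 2)

# The rotation matrix has one row per original feature
# and one column per principal component that was kept
dim(pca$rotation)
```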

111
00:05:34,033 --> 00:05:35,133
So let's do it.

112
00:05:35,133 --> 00:05:38,833
Let's take our training set
because we are going to call this

113
00:05:38,833 --> 00:05:40,400
new training set training set as well.

114
00:05:40,400 --> 00:05:42,233
Because you know then we have

115
00:05:42,233 --> 00:05:45,833
all our templates and we use this training
set variable name.

116
00:05:46,033 --> 00:05:48,533
So we want to keep this training set name.

117
00:05:48,533 --> 00:05:52,100
But of course if you want to keep
your original training set and test set,

118
00:05:52,266 --> 00:05:56,500
you can use other names
like training_set_pca.

119
00:05:56,766 --> 00:05:59,600
But then if you do that, don't
forget to replace training_set

120
00:05:59,600 --> 00:06:03,866
here with training_set_pca, and here with
test_set_pca as well.

121
00:06:04,066 --> 00:06:06,266
And the same for the confusion
matrix section.

122
00:06:06,266 --> 00:06:09,233
And especially here
visualizing the training set results.

123
00:06:09,233 --> 00:06:12,500
You will need to replace training_set
here with training_set_pca.

124
00:06:12,900 --> 00:06:14,966
All right.
So that's why we are keeping the name.

125
00:06:14,966 --> 00:06:17,733
It's in order
not to have to change everything.

126
00:06:17,733 --> 00:06:20,733
So let's go back to training set equals.

127
00:06:21,100 --> 00:06:25,000
And now let's transform
this original training set

128
00:06:25,000 --> 00:06:28,700
into our new training set
composed of our new extracted features.

129
00:06:29,033 --> 00:06:30,800
And to do this it's very simple.

130
00:06:30,800 --> 00:06:33,066
We use the predict function.

131
00:06:33,066 --> 00:06:36,633
And inside
we take our pca object, comma,

132
00:06:37,033 --> 00:06:40,533
and we apply this PCA transformation
object

133
00:06:40,866 --> 00:06:45,800
on the original training set
that is named training set as well.

134
00:06:47,033 --> 00:06:47,900
And so by doing

135
00:06:47,900 --> 00:06:51,133
this, this original training
set will become this

136
00:06:51,133 --> 00:06:54,800
new training set composed
of the two new extracted features.
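
The transformation step just described can be sketched like this. Again the training set is a hypothetical random stand-in with the same shape as the one in the video; the predict call is the part that follows the narration.

```r
library(caret)

# Hypothetical stand-in for the training set (13 features + Customer_Segment)
set.seed(42)
training_set = as.data.frame(matrix(rnorm(100 * 13), ncol = 13))
training_set$Customer_Segment = sample(1:3, 100, replace = TRUE)

# Fit the PCA transformation on the features only
pca = preProcess(x = training_set[-14], method = 'pca', pcaComp = 2)

# Apply it to the whole training set: the 13 features are replaced
# by the principal components, Customer_Segment is left untouched
training_set = predict(pca, training_set)

# Inspect the resulting columns and their order
names(training_set)
```

The pca object is passed the full data frame here, including the dependent variable; only the columns it was fitted on are transformed.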

137
00:06:55,000 --> 00:06:56,100
So let's do it.

138
00:06:56,100 --> 00:06:59,033
Let's start by creating this object.

139
00:06:59,033 --> 00:07:01,533
And then we will
transform our training set.

140
00:07:01,533 --> 00:07:05,766
So I'm going to select this line
and execute. Perfect.

141
00:07:06,300 --> 00:07:09,066
The PCA object is ready to be used

142
00:07:09,066 --> 00:07:12,600
on the original training set
to transform it

143
00:07:12,600 --> 00:07:16,666
into our new training set, composed
of the two new extracted features.

144
00:07:17,033 --> 00:07:19,966
So let's execute this as well. Here we go.

145
00:07:19,966 --> 00:07:22,000
Our new training set is now created.

146
00:07:22,000 --> 00:07:23,066
We can have a look.

147
00:07:23,066 --> 00:07:27,000
As you can see when I'm clicking on this,
well I have a new training

148
00:07:27,000 --> 00:07:30,300
set composed of two new extracted
features.

149
00:07:30,300 --> 00:07:33,000
And remember these two new extracted
features are called

150
00:07:33,000 --> 00:07:34,333
the principal components.

151
00:07:34,333 --> 00:07:37,300
So that's why you have PC1 and PC2.

152
00:07:37,300 --> 00:07:40,300
And of course we still have
our dependent variable vector,

153
00:07:40,300 --> 00:07:43,033
the customer segment
dependent variable with its

154
00:07:43,033 --> 00:07:46,033
three labels one, two and three.

155
00:07:46,933 --> 00:07:47,666
All right perfect.

156
00:07:47,666 --> 00:07:51,300
But now as you can clearly notice,
the dependent variable vector

157
00:07:51,466 --> 00:07:53,533
just went in the first position.

158
00:07:53,533 --> 00:07:56,466
And since, after this,
we're going to use a template on data sets,

159
00:07:56,466 --> 00:07:57,433
I mean the training set

160
00:07:57,433 --> 00:08:00,733
and the test set where the dependent
variable is in last position.

161
00:08:01,033 --> 00:08:04,066
we will need to put this dependent
variable in last position here.

162
00:08:04,066 --> 00:08:06,300
And that's actually very easy.

163
00:08:06,300 --> 00:08:09,400
All we need to do
is play with the indexes

164
00:08:09,733 --> 00:08:13,533
to put this customer segment
dependent variable in last position.

165
00:08:14,033 --> 00:08:15,333
So the method is really easy.

166
00:08:15,333 --> 00:08:17,766
We're going to take our training
set again.

167
00:08:17,766 --> 00:08:18,400
Here we go.

168
00:08:19,400 --> 00:08:20,700
And then equals.

169
00:08:20,700 --> 00:08:24,400
And then we take again
our training set then brackets.

170
00:08:24,633 --> 00:08:28,166
And then inside these brackets
we're going to take the indexes

171
00:08:28,166 --> 00:08:31,900
of the columns of our training set
in the correct order we want to get.

172
00:08:32,166 --> 00:08:35,200
So you're going to understand that
now. We're going to take a vector.

173
00:08:35,433 --> 00:08:39,366
So remember, in R a vector
is created with c and then parentheses.

174
00:08:39,866 --> 00:08:40,300
All right.

175
00:08:40,300 --> 00:08:44,100
And inside these parentheses
we put the correct order of the indexes

176
00:08:44,100 --> 00:08:45,266
we want to get.

177
00:08:45,266 --> 00:08:47,766
So let's go back to our training set.

178
00:08:47,766 --> 00:08:50,766
The first column we want to get is PC1.

179
00:08:50,833 --> 00:08:53,500
That should be the first column
of our new training set.

180
00:08:53,500 --> 00:08:55,133
And this has index two.

181
00:08:55,133 --> 00:08:58,700
So here we input the first index
which is two.

182
00:08:59,333 --> 00:08:59,966
Then comma.

183
00:08:59,966 --> 00:09:03,500
And then we input the second index
we want to get.

184
00:09:03,500 --> 00:09:05,033
That is the second column.

185
00:09:05,033 --> 00:09:07,133
And the second column is PC2.

186
00:09:07,133 --> 00:09:08,466
And this has index three.

187
00:09:08,466 --> 00:09:14,400
So here we input three, and then here
you input the index of the last column

188
00:09:14,400 --> 00:09:16,500
you want to have in your training set.

189
00:09:16,500 --> 00:09:17,400
And the last column

190
00:09:17,400 --> 00:09:20,833
you want to have in your training set
is this customer segment column.

191
00:09:20,833 --> 00:09:22,700
Because that's the dependent variable.

192
00:09:22,700 --> 00:09:26,100
And so far
this customer segment has index one.

193
00:09:26,500 --> 00:09:29,700
So you need to specify here
this index that is one.

194
00:09:30,266 --> 00:09:34,200
And by doing this our new training
set here will be the same training set

195
00:09:34,200 --> 00:09:37,200
that we have here
but with a new order of the columns.

196
00:09:37,200 --> 00:09:38,866
And that is given by this order here.

197
00:09:38,866 --> 00:09:41,866
First, the first independent variable
that has index two,

198
00:09:42,000 --> 00:09:44,566
then the second independent variable
that has index three,

199
00:09:44,566 --> 00:09:47,566
and eventually the dependent
variable column that has index one.
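
The column reordering can be tried on its own with a tiny example. The three rows of values below are made up for illustration; only the column layout (Customer_Segment first, then PC1 and PC2) mirrors what the video shows after the transformation.

```r
# Tiny made-up data frame with the column order left by the PCA step
training_set = data.frame(
  Customer_Segment = c(1, 2, 3),
  PC1 = c(-3.1, 0.2, 2.9),
  PC2 = c(1.4, -2.0, 0.5)
)

# Pick the columns by index in the order we want:
# PC1 (index 2), PC2 (index 3), then Customer_Segment (index 1)
training_set = training_set[c(2, 3, 1)]

names(training_set)
# "PC1" "PC2" "Customer_Segment"
```

Indexing a data frame with a single vector of column indexes returns the same rows with the columns in the order given.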

200
00:09:47,933 --> 00:09:51,300
You're going to see if I select this line
now and execute.

201
00:09:51,533 --> 00:09:53,266
And if I go back to training set.

202
00:09:53,266 --> 00:09:57,600
Now I have my first two columns
as the new extracted features

203
00:09:57,900 --> 00:10:02,666
PC1 and PC2, and the last column, customer
segment, in last position,

204
00:10:02,866 --> 00:10:06,100
as the code template that we're going to
use expects.

205
00:10:06,533 --> 00:10:07,766
So that's perfect.

206
00:10:07,766 --> 00:10:12,266
We can go back to PCA and now
we need to do the same for the test set.

207
00:10:12,666 --> 00:10:16,233
So what we're going to do
is select these two lines, copy them

208
00:10:16,233 --> 00:10:19,500
and just replace training_set
here with test_set.

209
00:10:20,033 --> 00:10:23,433
Same here test set and same here as well.

210
00:10:23,433 --> 00:10:26,733
Test set and eventually test set.

211
00:10:26,866 --> 00:10:28,033
All right.

212
00:10:28,033 --> 00:10:31,566
And that's of course the same indexes
for the order you want to have.

213
00:10:31,666 --> 00:10:34,033
We can check it out.
I'm going to select this line.

214
00:10:34,033 --> 00:10:39,066
As you can see so far
the test set has its 13 original features.

215
00:10:39,500 --> 00:10:43,700
Then if I execute this line
it now has two new

216
00:10:43,700 --> 00:10:46,700
extracted features
the principal components one and two.

217
00:10:46,833 --> 00:10:49,333
But the customer segment
is in first position.

218
00:10:49,333 --> 00:10:51,800
We want to put it in the last position.

219
00:10:51,800 --> 00:10:54,766
And so to do this we execute this line.

220
00:10:54,766 --> 00:10:56,233
And that will do it.
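
Putting the test set steps together, a minimal sketch could look like this. Both data frames are hypothetical random stand-ins with the same shape as in the video. Note that the pca object is fitted on the training set only and then reused on the test set, which matches the copy-and-replace approach described above.

```r
library(caret)

# Hypothetical stand-ins for the training and test sets
# (13 numeric features + Customer_Segment in column 14)
set.seed(42)
make_set = function(n) {
  d = as.data.frame(matrix(rnorm(n * 13), ncol = 13))
  d$Customer_Segment = sample(1:3, n, replace = TRUE)
  d
}
training_set = make_set(100)
test_set = make_set(40)

# Fit PCA on the training features only
pca = preProcess(x = training_set[-14], method = 'pca', pcaComp = 2)

# Transform and reorder the training set
training_set = predict(pca, training_set)
training_set = training_set[c(2, 3, 1)]

# Same two lines for the test set, reusing the pca object fitted above
test_set = predict(pca, test_set)
test_set = test_set[c(2, 3, 1)]

names(test_set)
```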

221
00:10:56,233 --> 00:11:00,066
If I go back to test set now,
the customer segment is in last position,

222
00:11:00,466 --> 00:11:03,766
and we are ready to use
the following parts of the template.

223
00:11:04,000 --> 00:11:07,000
predicting the test set results,
making the confusion matrix.

224
00:11:07,200 --> 00:11:09,800
And finally,
that's the most exciting part:

225
00:11:09,800 --> 00:11:12,633
We will now be able to visualize
the training set results,

226
00:11:12,633 --> 00:11:16,800
because we now have two dimensions
in our training set and test set.

227
00:11:17,266 --> 00:11:20,266
So I look forward to visualizing
these results in the next tutorial.

228
00:11:20,266 --> 00:11:22,066
And until then, enjoy machine learning.