1
00:00:00,133 --> 00:00:02,233
Hello my friends, and welcome to this

2
00:00:02,233 --> 00:00:04,000
new practical activity,

3
00:00:04,000 --> 00:00:08,900
this time on dimensionality reduction,
which is not a branch of machine learning

4
00:00:08,900 --> 00:00:11,400
per se, but a very important technique

5
00:00:11,400 --> 00:00:13,066
to know how to handle

6
00:00:13,066 --> 00:00:16,800
when you work with big datasets,
you know, huge datasets with

7
00:00:17,000 --> 00:00:17,400
many

8
00:00:17,400 --> 00:00:21,700
features, and for which
you would like to reduce the complexity by

9
00:00:21,700 --> 00:00:23,400
reducing the dimensionality.

10
00:00:23,400 --> 00:00:25,000
And this is exactly what

11
00:00:25,000 --> 00:00:27,266
dimensionality reduction is about.

12
00:00:27,266 --> 00:00:30,900
So in this new part,
part nine, dimensionality reduction, we

13
00:00:30,900 --> 00:00:31,800
will build

14
00:00:31,800 --> 00:00:34,366
three different models that can perform
such a

15
00:00:34,366 --> 00:00:35,000
task.

16
00:00:35,000 --> 00:00:36,266
These are, first,

17
00:00:36,266 --> 00:00:38,000
principal component analysis,

18
00:00:38,000 --> 00:00:41,100
the most famous one,
then linear discriminant

19
00:00:41,100 --> 00:00:44,100
analysis, and finally kernel PCA.

20
00:00:44,100 --> 00:00:45,000
So we will build these

21
00:00:45,000 --> 00:00:48,000
three models, one for each section.
And now

22
00:00:48,000 --> 00:00:49,500
we're about to start with the first

23
00:00:49,500 --> 00:00:52,500
one, principal component analysis.

24
00:00:52,700 --> 00:00:53,766
But before we start, let's

25
00:00:53,766 --> 00:00:56,400
just make sure everyone here
is on the same page.

26
00:00:56,400 --> 00:00:58,600
I gave you the link to this folder
right before this

27
00:00:58,600 --> 00:01:01,500
tutorial, in the article,
so make sure to connect to it.

28
00:01:01,500 --> 00:01:03,633
And now we
should all be on the same page.

29
00:01:03,633 --> 00:01:07,200
So we're going to go into part
nine, dimensionality reduction.

30
00:01:08,000 --> 00:01:08,466
All right.

31
00:01:08,466 --> 00:01:10,500
And as I told you,
you have the three sections

32
00:01:10,500 --> 00:01:12,533
corresponding to each of the models.
And we're going

33
00:01:12,533 --> 00:01:13,200
to start with

34
00:01:13,200 --> 00:01:16,200
principal component analysis, PCA.

35
00:01:16,766 --> 00:01:19,000
And as usual,
we're going to start with Python.

36
00:01:19,000 --> 00:01:22,300
And inside the Python folder
you will find two files, as usual.

37
00:01:22,466 --> 00:01:23,400
First, the

38
00:01:23,400 --> 00:01:26,066
implementation in .ipynb format,

39
00:01:26,066 --> 00:01:28,800
which, once again,
we will be able to run on

40
00:01:28,800 --> 00:01:32,500
Google Colaboratory, because we will work
with a classic dataset.

41
00:01:32,666 --> 00:01:33,300
And speaking

42
00:01:33,300 --> 00:01:36,566
of which, here is the dataset,
Wine.csv.

43
00:01:36,800 --> 00:01:39,066
So let's open it and let me

44
00:01:39,066 --> 00:01:41,233
explain what this is about.

45
00:01:41,233 --> 00:01:41,500
All right.

46
00:01:41,500 --> 00:01:42,766
So actually, first

47
00:01:42,766 --> 00:01:45,566
you notice
indeed that we have many features.

48
00:01:45,566 --> 00:01:48,100
I did not take a dataset
with hundreds of features

49
00:01:48,100 --> 00:01:50,700
because then we would,
you know, get lost in the dataset.

50
00:01:50,700 --> 00:01:51,700
So I just took a

51
00:01:51,700 --> 00:01:53,533
dataset with more than ten features.

52
00:01:53,533 --> 00:01:55,966
And of course, these are all the features,
from here,

53
00:01:55,966 --> 00:01:58,800
alcohol, to this one, proline.

54
00:01:58,800 --> 00:01:59,633
And as you can

55
00:01:59,633 --> 00:02:01,866
guess, each feature gives a

56
00:02:01,866 --> 00:02:03,300
certain piece of information

57
00:02:03,300 --> 00:02:05,200
about a certain wine, right?

58
00:02:05,200 --> 00:02:06,066
Each row

59
00:02:06,066 --> 00:02:08,200
corresponds to a wine.

60
00:02:08,200 --> 00:02:10,133
And for each wine we have diverse

61
00:02:10,133 --> 00:02:11,366
information, diverse

62
00:02:11,366 --> 00:02:15,433
features, you know, characteristics
of the wine: the alcohol level,

63
00:02:15,600 --> 00:02:16,800
the malic acid.

64
00:02:16,800 --> 00:02:19,166
I'm not an expert on wines, but

65
00:02:19,166 --> 00:02:20,266
these are some wine

66
00:02:20,266 --> 00:02:21,600
characteristics:

67
00:02:21,600 --> 00:02:24,233
ash, ash alkalinity, magnesium,

68
00:02:24,233 --> 00:02:25,500
total phenols,

69
00:02:25,500 --> 00:02:26,500
flavonoids.

70
00:02:26,500 --> 00:02:27,200
Anyway, so

71
00:02:27,200 --> 00:02:29,633
you see you have many wine features.

72
00:02:29,633 --> 00:02:31,166
And for each of these wines,

73
00:02:31,166 --> 00:02:32,166
well, there we go,

74
00:02:32,166 --> 00:02:34,500
I'm about to explain
the dependent variable.

75
00:02:34,500 --> 00:02:35,966
For each of these wines,

76
00:02:35,966 --> 00:02:38,833
we have the customer segment,

77
00:02:38,833 --> 00:02:39,700
you know, that's the

78
00:02:39,700 --> 00:02:40,600
last column,

79
00:02:40,600 --> 00:02:43,500
to which the wines belong. Okay.

80
00:02:43,500 --> 00:02:46,633
So let me explain
what happens in terms of business.

81
00:02:46,633 --> 00:02:48,733
First of all, this is a dataset I took

82
00:02:48,733 --> 00:02:50,700
from the UCI ML Repository.

83
00:02:50,700 --> 00:02:52,700
So all the credits go, of course, to

84
00:02:52,700 --> 00:02:55,033
this amazing platform of datasets.

85
00:02:55,033 --> 00:02:56,300
However, in this

86
00:02:56,300 --> 00:02:58,200
dataset, I just changed the last

87
00:02:58,200 --> 00:03:01,400
column, customer segment,
to make it more business-oriented,

88
00:03:01,400 --> 00:03:04,166
you know, to make this
more of a business case study,

89
00:03:04,166 --> 00:03:06,133
because the scenario is the following.

90
00:03:06,133 --> 00:03:07,166
Let's imagine that

91
00:03:07,166 --> 00:03:09,166
this dataset belongs to a

92
00:03:09,166 --> 00:03:13,133
wine merchant with many different bottles
of wine to sell, and

93
00:03:13,133 --> 00:03:15,366
therefore a large base of customers.

94
00:03:15,366 --> 00:03:16,566
And this wine shop

95
00:03:16,566 --> 00:03:18,800
owner actually hired you as a

96
00:03:18,800 --> 00:03:20,066
data scientist to

97
00:03:20,066 --> 00:03:20,833
first do a

98
00:03:20,833 --> 00:03:23,400
preliminary work of clustering,

99
00:03:23,400 --> 00:03:24,333
meaning that

100
00:03:24,333 --> 00:03:27,066
at first we had all these features

101
00:03:27,066 --> 00:03:30,033
without this last
column, customer segment.

102
00:03:30,033 --> 00:03:32,233
We had all these features,
from alcohol to

103
00:03:32,233 --> 00:03:34,733
proline, and this wine shop owner

104
00:03:34,733 --> 00:03:36,566
actually asked you to perform some

105
00:03:36,566 --> 00:03:38,966
clustering to identify diverse

106
00:03:38,966 --> 00:03:40,833
segments of customers,

107
00:03:40,833 --> 00:03:45,466
grouped by similarities,
which correspond to the wines they prefer.

108
00:03:45,466 --> 00:03:45,866
All right.

109
00:03:45,866 --> 00:03:46,666
So each

110
00:03:46,666 --> 00:03:47,533
customer segment

111
00:03:47,533 --> 00:03:50,100
here, and by the way,
there are three of them, right,

112
00:03:50,100 --> 00:03:51,633
if we scroll down we can

113
00:03:51,633 --> 00:03:53,200
see that we have three

114
00:03:53,200 --> 00:03:55,866
different categories
or, you know, clusters.

115
00:03:55,866 --> 00:03:57,800
And each of these segments

116
00:03:57,800 --> 00:04:00,566
will correspond
to a certain group of customers

117
00:04:00,566 --> 00:04:02,666
that have similar preferences

118
00:04:02,666 --> 00:04:04,133
for similar wines.

119
00:04:04,133 --> 00:04:06,433
And that's exactly
what these segments are about.

120
00:04:06,433 --> 00:04:07,633
But that was the first work.

121
00:04:07,633 --> 00:04:09,633
And if you want, you can have fun and

122
00:04:09,633 --> 00:04:11,500
do this first work yourself.

123
00:04:11,500 --> 00:04:14,266
But here we want to work on
dimensionality reduction.

124
00:04:14,266 --> 00:04:15,266
So here goes the

125
00:04:15,266 --> 00:04:18,066
second mission that this wine shop owner

126
00:04:18,066 --> 00:04:19,333
asks you to do.

127
00:04:19,333 --> 00:04:22,300
This wine shop owner was actually
satisfied with your first work,

128
00:04:22,300 --> 00:04:24,300
you know,
identifying these three segments.

129
00:04:24,300 --> 00:04:26,433
But now the owner would like to,

130
00:04:26,433 --> 00:04:26,933
you know,

131
00:04:26,933 --> 00:04:29,100
reduce the complexity of this dataset

132
00:04:29,100 --> 00:04:31,800
by ending up with a smaller
number of features.

133
00:04:31,800 --> 00:04:33,400
And at the same time, this

134
00:04:33,400 --> 00:04:36,000
owner would like you to build
a predictive model

135
00:04:36,000 --> 00:04:36,900
that will be

136
00:04:36,900 --> 00:04:39,000
trained on this data, you know, including

137
00:04:39,000 --> 00:04:40,500
the features up to here

138
00:04:40,500 --> 00:04:42,466
and the dependent variable,

139
00:04:42,466 --> 00:04:44,866
so that for each new wine

140
00:04:44,866 --> 00:04:47,066
that this owner has in the shop,

141
00:04:47,066 --> 00:04:50,400
well,
we can deploy this predictive model,

142
00:04:50,400 --> 00:04:51,900
applied to a reduced

143
00:04:51,900 --> 00:04:52,800
dimensionality

144
00:04:52,800 --> 00:04:56,100
dataset, to predict which customer

145
00:04:56,100 --> 00:04:59,133
segment this new wine belongs to. Right?

146
00:04:59,300 --> 00:05:01,800
And therefore, once we manage to predict

147
00:05:01,800 --> 00:05:06,733
which customer segment this wine belongs
to, then we can recommend this wine

148
00:05:06,733 --> 00:05:09,300
to the right customers. And that's

149
00:05:09,300 --> 00:05:11,700
exactly why
what we're about to do is like a

150
00:05:11,700 --> 00:05:13,066
recommender system,

151
00:05:13,066 --> 00:05:15,433
because for each new wine
that will be in the

152
00:05:15,433 --> 00:05:17,333
shop, well, our predictive

153
00:05:17,333 --> 00:05:19,233
model will tell us for which

154
00:05:19,233 --> 00:05:21,066
customer segment it will

155
00:05:21,066 --> 00:05:22,700
be the most appropriate,

156
00:05:22,700 --> 00:05:25,466
you know, where it will be the most appreciated.

157
00:05:25,466 --> 00:05:27,266
All right. So that's the business case.

158
00:05:27,266 --> 00:05:28,466
And therefore, you know, our

159
00:05:28,466 --> 00:05:30,366
predictive model will add tons of

160
00:05:30,366 --> 00:05:34,133
value for this owner, because
if this owner manages to build

161
00:05:34,133 --> 00:05:36,533
a good recommender system, of course
it will

162
00:05:36,533 --> 00:05:38,500
optimize the sales and therefore the

163
00:05:38,500 --> 00:05:41,100
profit of the business.

164
00:05:41,100 --> 00:05:43,533
Okay.
So that's what the case study is about.

165
00:05:43,533 --> 00:05:46,933
Now we're going to move on
to the implementation of course.

166
00:05:47,233 --> 00:05:51,066
Therefore I'm opening this file,
principal component analysis,

167
00:05:51,366 --> 00:05:54,366
which you have the choice to open with
either Google Colaboratory

168
00:05:54,366 --> 00:05:55,566
or Jupyter Notebook,

169
00:05:55,566 --> 00:05:58,566
as we
did in the previous section on CNNs.

170
00:05:58,600 --> 00:05:59,533
But there we go.

171
00:05:59,533 --> 00:06:02,100
Let's open it with Google Colaboratory

172
00:06:02,100 --> 00:06:05,266
and enjoy
a brand new implementation on it.

173
00:06:06,366 --> 00:06:06,766
All right.

174
00:06:06,766 --> 00:06:07,533
So here is the

175
00:06:07,533 --> 00:06:10,266
implementation, principal component
analysis.

176
00:06:10,266 --> 00:06:11,700
This is in read-only mode.

177
00:06:11,700 --> 00:06:14,200
So as usual,
we will create a copy by clicking

178
00:06:14,200 --> 00:06:17,333
File here, and then Save a copy in Drive.

179
00:06:17,600 --> 00:06:22,200
This will create a copy inside
which we will be able to re-implement.

180
00:06:22,466 --> 00:06:23,633
Not the whole

181
00:06:23,633 --> 00:06:27,066
implementation this time,
because, as I will explain, most of

182
00:06:27,066 --> 00:06:29,633
the cells are cells we already did before,

183
00:06:29,633 --> 00:06:32,133
you know, many times
in the classification part

184
00:06:32,133 --> 00:06:34,366
and also in the first section
of part eight.

185
00:06:34,366 --> 00:06:36,866
So we won't have to re-implement
everything.

186
00:06:36,866 --> 00:06:39,000
This would
be a waste of time, and mostly

187
00:06:39,000 --> 00:06:42,266
we would rather focus on
dimensionality reduction.

188
00:06:42,600 --> 00:06:43,800
And therefore.

189
00:06:43,800 --> 00:06:45,100
here is what we're going to do.

190
00:06:45,100 --> 00:06:46,700
I'm going to show you the implementation,

191
00:06:46,700 --> 00:06:49,666
of course, but the only cell that we will

192
00:06:49,666 --> 00:06:51,500
re-implement will be

193
00:06:51,500 --> 00:06:54,433
this one, applying PCA. So let's

194
00:06:54,433 --> 00:06:55,833
remove it right away.

195
00:06:55,833 --> 00:06:57,666
Not the text, only this one.

196
00:06:57,666 --> 00:06:59,966
And now I'm going to show you
that indeed, you

197
00:06:59,966 --> 00:07:03,966
know, all the cells are super familiar
to us, right?

198
00:07:04,233 --> 00:07:05,600
Because indeed we start

199
00:07:05,600 --> 00:07:06,400
by importing the

200
00:07:06,400 --> 00:07:09,300
libraries, which we did 100 times, right?

201
00:07:09,300 --> 00:07:11,600
So we have the three essential libraries
here.

202
00:07:11,600 --> 00:07:14,300
Then we import the dataset
with the exact

203
00:07:14,300 --> 00:07:16,800
same code as the one you have in your

204
00:07:16,800 --> 00:07:18,633
data preprocessing template.

205
00:07:18,633 --> 00:07:20,600
So of course here I just put

206
00:07:20,600 --> 00:07:22,300
the right name of the dataset, which is

207
00:07:22,300 --> 00:07:25,000
Wine.csv.

208
00:07:25,000 --> 00:07:25,800
Okay.

209
00:07:25,800 --> 00:07:28,700
Then you will recognize the next step

210
00:07:28,700 --> 00:07:30,266
of the data preprocessing template,

211
00:07:30,266 --> 00:07:32,133
which is to split the dataset

212
00:07:32,133 --> 00:07:33,466
into the training set and the

213
00:07:33,466 --> 00:07:35,700
test set, with exactly the same code.

214
00:07:35,700 --> 00:07:38,100
Then we apply feature scaling, as it

215
00:07:38,100 --> 00:07:40,066
is, you know,
most of the time recommended.

216
00:07:40,066 --> 00:07:42,666
So we apply it, of course, separately on

217
00:07:42,666 --> 00:07:44,800
the training set and the test set.

218
00:07:44,800 --> 00:07:48,000
And that closes the data
preprocessing phase.
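
The preprocessing steps just described (import, split, scale) can be sketched roughly as follows. This is only an illustration under stated assumptions: scikit-learn's bundled `load_wine` data stands in for Wine.csv, and the 80/20 split and `random_state=0` are values chosen here, not values stated in the video.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine (scikit-learn's bundled copy of the UCI wine data)
# stands in for Wine.csv; its target plays the role of the customer segment.
X, y = load_wine(return_X_y=True)

# Assumption: an 80/20 split with a fixed seed; the video does not state these values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training set only, then reuse it on the test set.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, X_test.shape)
```

Both sets keep all 13 original features at this point; nothing has been reduced yet.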

219
00:07:48,266 --> 00:07:49,633
Then we apply PCA,

220
00:07:49,633 --> 00:07:50,400
and that's of

221
00:07:50,400 --> 00:07:51,600
course the cell we will

222
00:07:51,600 --> 00:07:53,533
re-implement together.

223
00:07:53,533 --> 00:07:54,933
Then let me just remove

224
00:07:54,933 --> 00:07:57,233
all the outputs here
so that you don't see them.

225
00:07:57,233 --> 00:07:59,800
I hope you closed your eyes
when I removed them.

226
00:07:59,800 --> 00:08:00,433
But there,

227
00:08:00,433 --> 00:08:00,900
now close

228
00:08:00,900 --> 00:08:01,633
your eyes a little bit.

229
00:08:01,633 --> 00:08:02,700
I'm going to

230
00:08:02,700 --> 00:08:04,800
remove that output as well, because

231
00:08:04,800 --> 00:08:06,900
actually the dimensionality
reduction technique

232
00:08:06,900 --> 00:08:09,566
that we'll use will manage
to get us great results

233
00:08:09,566 --> 00:08:11,866
with only two extracted features.

234
00:08:11,866 --> 00:08:12,300
Right.

235
00:08:12,300 --> 00:08:15,433
We're not reducing the number
of existing features.

236
00:08:15,633 --> 00:08:20,733
We are creating new extracted features
based on these existing features.

237
00:08:20,733 --> 00:08:22,300
So we will get totally different

238
00:08:22,300 --> 00:08:24,233
new features at the end, which we

239
00:08:24,233 --> 00:08:26,900
call, you know, principal components.
So we'll have

240
00:08:26,900 --> 00:08:30,600
Principal component one
and principal component two at the end.
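
To make the point about extracted features concrete, here is a small sketch showing that each principal component is a weighted combination of all the original features rather than a subset of them. It assumes `load_wine` as a stand-in for Wine.csv and, for brevity, scales the whole dataset instead of a training split.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine stands in for Wine.csv (13 features per wine).
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# One row of weights per principal component, one column per original feature.
print(pca.components_.shape)

# Each new feature is a weighted sum of all (centered) original features.
manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(X_pca, manual))
```

So no original column survives as is: principal component one and principal component two are both built from all 13 wine features.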

241
00:08:31,066 --> 00:08:33,733
But there we go. So back to.

242
00:08:33,733 --> 00:08:35,433
Our implementation.

243
00:08:35,433 --> 00:08:38,233
After applying PCA
which we will redo together.

244
00:08:38,233 --> 00:08:39,733
well, we train the

245
00:08:39,733 --> 00:08:42,033
logistic regression
model on the training set.

246
00:08:42,033 --> 00:08:43,833
I chose the logistic regression model

247
00:08:43,833 --> 00:08:47,700
as the first model
of our classification toolkit, but I

248
00:08:47,700 --> 00:08:49,366
could have chosen any other one.

249
00:08:49,366 --> 00:08:50,200
You will see that

250
00:08:50,200 --> 00:08:53,300
we will get great results with this one,
but feel free to choose

251
00:08:53,300 --> 00:08:56,200
another classification model,
and it will work.

252
00:08:56,200 --> 00:08:57,266
But notice that

253
00:08:57,266 --> 00:09:00,300
it is important to apply PCA before

254
00:09:00,333 --> 00:09:01,200
training your

255
00:09:01,200 --> 00:09:01,833
classification

256
00:09:01,833 --> 00:09:05,000
model on the training set, right?
You want to reduce the

257
00:09:05,000 --> 00:09:07,966
dimensionality of your dataset
before, of course,

258
00:09:07,966 --> 00:09:10,433
training your model on your training

259
00:09:10,433 --> 00:09:12,966
set, right?
The training set basically is the

260
00:09:12,966 --> 00:09:15,333
final version of your data after you

261
00:09:15,333 --> 00:09:16,900
performed all the data preprocessing

262
00:09:16,900 --> 00:09:19,866
phase, and dimensionality reduction
if you want.

263
00:09:19,866 --> 00:09:20,633
Okay.

264
00:09:20,633 --> 00:09:24,700
So the training happens after applying
your dimensionality reduction technique.

265
00:09:25,066 --> 00:09:25,500
And then,

266
00:09:25,500 --> 00:09:28,266
of course,
well, we will make the confusion matrix.

267
00:09:28,266 --> 00:09:30,600
You know how to do that.
We did it many times.
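
The order of operations described here (scale, apply PCA, then train the classifier and build the confusion matrix) can be sketched end to end as below. It is a hedged illustration, not the notebook itself: `load_wine` stands in for Wine.csv, and the split ratio and `random_state` values are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumptions: load_wine replaces Wine.csv; split ratio and seed are illustrative.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Data preprocessing: feature scaling.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Dimensionality reduction happens BEFORE training the classifier.
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Train logistic regression on the two extracted features.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Evaluate on the test set.
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```

Any other classifier from the classification toolkit could replace `LogisticRegression` here without changing the preceding steps.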

268
00:09:30,600 --> 00:09:33,500
And then, since our dimensionality
reduction technique

269
00:09:33,500 --> 00:09:36,000
will get us great results with only two

270
00:09:36,000 --> 00:09:37,333
extracted features,

271
00:09:37,333 --> 00:09:40,233
principal component
one and principal component two,

272
00:09:40,233 --> 00:09:43,233
well, that will allow us to visualize
the training set results

273
00:09:43,266 --> 00:09:44,700
in two dimensions, right?

274
00:09:44,700 --> 00:09:45,433
Because remember

275
00:09:45,433 --> 00:09:46,133
that each

276
00:09:46,133 --> 00:09:48,933
dimension corresponds to one feature.

277
00:09:48,933 --> 00:09:50,400
And we do this for the training

278
00:09:50,400 --> 00:09:51,600
set, right here,

279
00:09:51,600 --> 00:09:54,233
and the test set. Okay.

280
00:09:54,233 --> 00:09:56,700
So as you can see, what I did with this

281
00:09:56,700 --> 00:10:00,733
implementation is something you can do
in less than five minutes right now,

282
00:10:00,733 --> 00:10:02,700
thanks to your toolkit, right?

283
00:10:02,700 --> 00:10:04,100
Because you just need to take

284
00:10:04,100 --> 00:10:07,500
the data preprocessing toolkit
to make these four cells.

285
00:10:07,800 --> 00:10:08,533
Then you just need to

286
00:10:08,533 --> 00:10:10,700
grab the feature scaling tool in your

287
00:10:10,700 --> 00:10:12,466
data preprocessing toolkit.

288
00:10:12,466 --> 00:10:13,133
Then you

289
00:10:13,133 --> 00:10:15,300
just need to grab your logistic
regression

290
00:10:15,300 --> 00:10:17,733
implementation to implement this cell.

291
00:10:17,733 --> 00:10:20,266
And same for the other ones,
you know, the confusion matrix.

292
00:10:20,266 --> 00:10:21,633
And same for these last two,

293
00:10:21,633 --> 00:10:23,233
visualizing the training set results

294
00:10:23,233 --> 00:10:24,966
and visualizing the test set results.

295
00:10:24,966 --> 00:10:29,666
These are all cells that you have in
your logistic regression implementation.

296
00:10:29,833 --> 00:10:31,133
So absolutely no need

297
00:10:31,133 --> 00:10:33,166
to do it together again. And therefore

298
00:10:33,166 --> 00:10:35,033
we can now focus directly

299
00:10:35,033 --> 00:10:36,533
on this cell,

300
00:10:36,533 --> 00:10:38,633
applying PCA.

301
00:10:38,633 --> 00:10:39,433
So there we go.

302
00:10:39,433 --> 00:10:41,233
We're going to create a new code cell.

303
00:10:41,233 --> 00:10:43,133
And now let's implement

304
00:10:43,133 --> 00:10:46,133
PCA, principal component analysis.

305
00:10:46,733 --> 00:10:47,100
All right.

306
00:10:47,100 --> 00:10:48,433
So you could almost

307
00:10:48,433 --> 00:10:49,733
press pause on the video

308
00:10:49,733 --> 00:10:52,266
now and get the right tool
from the scikit-

309
00:10:52,266 --> 00:10:55,266
learn API to see how to implement this.

310
00:10:55,433 --> 00:10:56,966
That would be a good exercise.

311
00:10:56,966 --> 00:10:58,866
But if you don't want to do it,
that's fine.

312
00:10:58,866 --> 00:11:00,566
Let's implement this right now.

313
00:11:00,566 --> 00:11:03,000
And as you guessed from what I've just said,
well,

314
00:11:03,000 --> 00:11:04,800
we're going to implement PCA

315
00:11:04,800 --> 00:11:07,400
using the scikit-learn library.

316
00:11:07,400 --> 00:11:10,800
So the first thing we'll do
is start from scikit-

317
00:11:10,800 --> 00:11:13,800
learn,
from which we're going to get access to a

318
00:11:13,800 --> 00:11:17,400
certain module, which we'll find
in the scikit-learn API and which

319
00:11:17,400 --> 00:11:20,066
is called decomposition.

320
00:11:20,066 --> 00:11:20,700
Just like that,

321
00:11:20,700 --> 00:11:23,900
decomposition,
from which we're going to import,

322
00:11:23,900 --> 00:11:27,600
of course, a class
that will allow us to build this object,

323
00:11:27,600 --> 00:11:28,800
which will be nothing else

324
00:11:28,800 --> 00:11:31,500
than this PCA tool that will

325
00:11:31,500 --> 00:11:34,566
apply
dimensionality reduction on our dataset.

326
00:11:34,800 --> 00:11:39,133
And that class is called, very simply,
PCA.

327
00:11:39,133 --> 00:11:42,133
So you can't miss it in the API: PCA.

328
00:11:42,300 --> 00:11:43,800
And now the next natural

329
00:11:43,800 --> 00:11:44,666
step is of course to

330
00:11:44,666 --> 00:11:46,066
create an object,

331
00:11:46,066 --> 00:11:48,300
or, you know, an instance of this class.

332
00:11:48,300 --> 00:11:50,466
And guess
how we're going to call that object.

333
00:11:50,466 --> 00:11:53,800
Well, very simply,
we're going to call that object pca.

334
00:11:53,800 --> 00:11:56,466
Right. So this is super intuitive.

335
00:11:56,466 --> 00:11:58,866
And now you know the next step. The next step

336
00:11:58,866 --> 00:12:03,600
is to call the PCA class,
which needs to take

337
00:12:03,600 --> 00:12:05,233
one essential argument.

338
00:12:05,233 --> 00:12:07,433
You know,
we only have to input one argument here.

339
00:12:07,433 --> 00:12:09,166
And you can totally guess what

340
00:12:09,166 --> 00:12:11,333
this argument will be, right?

341
00:12:11,333 --> 00:12:11,866
It is

342
00:12:11,866 --> 00:12:14,866
the final number of extracted features

343
00:12:15,066 --> 00:12:17,100
you want to end up with in your new

344
00:12:17,100 --> 00:12:17,966
dataset.

345
00:12:17,966 --> 00:12:20,466
And that argument
to choose that number is

346
00:12:20,466 --> 00:12:21,133
called

347
00:12:21,133 --> 00:12:25,266
n underscore components, n_components.

348
00:12:26,033 --> 00:12:26,400
All right.

349
00:12:26,400 --> 00:12:27,000
So now the

350
00:12:27,000 --> 00:12:29,500
question is, of course,
which number should we

351
00:12:29,500 --> 00:12:30,300
choose, right?

352
00:12:30,300 --> 00:12:33,800
How do we know down to which
number of features, right,

353
00:12:33,800 --> 00:12:34,633
extracted features,

354
00:12:34,633 --> 00:12:36,166
we want to reduce the dimensionality

355
00:12:36,166 --> 00:12:37,800
of our dataset?

356
00:12:37,800 --> 00:12:38,900
Well, I have a very

357
00:12:38,900 --> 00:12:39,500
simple answer

358
00:12:39,500 --> 00:12:40,466
to that question.

359
00:12:40,466 --> 00:12:42,566
What I usually do is start with two,

360
00:12:42,566 --> 00:12:43,333
you know, two

361
00:12:43,333 --> 00:12:44,466
principal components,

362
00:12:44,466 --> 00:12:46,500
therefore two extracted features,

363
00:12:46,500 --> 00:12:48,266
and see the results I get in the end.

364
00:12:48,266 --> 00:12:50,833
And thanks to our code,
you know, our code template, we can

365
00:12:50,833 --> 00:12:53,100
check that very quickly and easily.

366
00:12:53,100 --> 00:12:53,766
And besides,

367
00:12:53,766 --> 00:12:56,100
we do want to try with two,
because then if we

368
00:12:56,100 --> 00:12:57,600
get good results with two,

369
00:12:57,600 --> 00:13:00,300
well, we will be able to visualize
the training set results

370
00:13:00,300 --> 00:13:03,000
and the test set results in two dimensions,

371
00:13:03,000 --> 00:13:05,833
you know, in this nice plot
that we had in part

372
00:13:05,833 --> 00:13:07,200
three, classification.

373
00:13:07,200 --> 00:13:09,000
So we definitely want to start with two.

374
00:13:09,000 --> 00:13:10,533
And if, you know, we get really

375
00:13:10,533 --> 00:13:12,500
poor results and we see on the

376
00:13:12,500 --> 00:13:13,300
graphs here

377
00:13:13,300 --> 00:13:14,333
that we can't

378
00:13:14,333 --> 00:13:16,500
separate the three classes properly,

379
00:13:16,500 --> 00:13:17,200
you know, remember,

380
00:13:17,200 --> 00:13:20,233
with those different prediction regions
and the prediction boundary,

381
00:13:20,600 --> 00:13:21,833
well, if we see that we have poor

382
00:13:21,833 --> 00:13:24,366
results on the visualizations,
then we can try

383
00:13:24,366 --> 00:13:24,833
with

384
00:13:24,833 --> 00:13:28,333
higher numbers of principal components,
meaning three, then four.

385
00:13:28,500 --> 00:13:31,033
And at some point we'll get,
you know, some extracted

386
00:13:31,033 --> 00:13:33,966
features
that explain the variance well enough,

387
00:13:33,966 --> 00:13:37,966
which is exactly what PCA is about,
right? It is about extracting

388
00:13:37,966 --> 00:13:40,933
some features that explain
the variance well enough.

389
00:13:40,933 --> 00:13:42,300
And once you find them,

390
00:13:42,300 --> 00:13:44,333
well, you will get good results

391
00:13:44,333 --> 00:13:46,400
even with lower dimensionality.
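
As a complement to the try-two-then-increase approach described above, one way to see how much variance each number of components explains is scikit-learn's `explained_variance_ratio_` attribute. The sketch below is not part of the video's notebook; it assumes `load_wine` in place of Wine.csv and an illustrative 80/20 split.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumptions: load_wine replaces Wine.csv; split ratio and seed are illustrative.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train = StandardScaler().fit_transform(X_train)

# n_components=None keeps every component so we can inspect all of them.
pca_full = PCA(n_components=None)
pca_full.fit(X_train)

# Running total of the share of variance explained by the first k components.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} components: {share:.2f} of the variance")
```

Reading this list alongside the classifier's results can help decide whether two components are enough or whether three or four are worth trying.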

392
00:13:46,400 --> 00:13:47,800
Okay. So let's try with

393
00:13:47,800 --> 00:13:49,233
two and let's see what we'll get.

394
00:13:49,233 --> 00:13:51,066
But I already told you that we'll get

395
00:13:51,066 --> 00:13:52,866
amazing results.

396
00:13:52,866 --> 00:13:53,866
Therefore, there you go:

397
00:13:53,866 --> 00:13:56,833
n_components equals two,

398
00:13:56,833 --> 00:13:57,966
two principal components,

399
00:13:57,966 --> 00:13:59,000
or in other words,

400
00:13:59,000 --> 00:14:01,600
two extracted features. Okay.
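
The object creation described so far comes down to two lines. The variable name `pca` follows what is said in the video.

```python
from sklearn.decomposition import PCA

# Two principal components, i.e. two extracted features.
pca = PCA(n_components=2)
print(pca)
```

At this stage the object is only configured; it has not seen any data yet.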

401
00:14:01,600 --> 00:14:02,800
So that's for our object.

402
00:14:02,800 --> 00:14:03,900
And now the next

403
00:14:03,900 --> 00:14:07,200
step, of course, is to apply this object
to our

404
00:14:07,266 --> 00:14:10,233
training set to reduce the
dimensionality

405
00:14:10,233 --> 00:14:12,800
of our training set in order to ease

406
00:14:12,800 --> 00:14:13,833
the learning process

407
00:14:13,833 --> 00:14:15,700
of the logistic regression model.

408
00:14:15,700 --> 00:14:18,300
But we will also have to apply it on the

409
00:14:18,300 --> 00:14:19,133
test set,

410
00:14:19,133 --> 00:14:20,200
because remember

411
00:14:20,200 --> 00:14:21,533
that the predict

412
00:14:21,533 --> 00:14:22,266
method that

413
00:14:22,266 --> 00:14:24,566
we will call here has to be called

414
00:14:24,566 --> 00:14:27,500
on the exact same format of data

415
00:14:27,500 --> 00:14:29,066
as the one that was used

416
00:14:29,066 --> 00:14:30,333
for the training set.

417
00:14:30,333 --> 00:14:31,666
So whenever you apply

418
00:14:31,666 --> 00:14:34,400
some transformations,
like data preprocessing

419
00:14:34,400 --> 00:14:36,533
or dimensionality reduction,
on your training

420
00:14:36,533 --> 00:14:38,000
set, well, you have to do the

421
00:14:38,000 --> 00:14:39,900
same on your test set.

422
00:14:39,900 --> 00:14:43,500
However,
be careful: exactly as with feature scaling,

423
00:14:43,566 --> 00:14:45,466
we will have to apply the fit

424
00:14:45,466 --> 00:14:48,466
transform
method on the training set, but

425
00:14:48,566 --> 00:14:51,566
only the transform method on the
test set.

426
00:14:51,566 --> 00:14:53,433
And that's always for the same reason.

427
00:14:53,433 --> 00:14:57,233
That's because we want to avoid
information leakage from the test set.

428
00:14:57,433 --> 00:14:57,800
Right?

429
00:14:57,800 --> 00:15:00,666
The test set is supposed to be
new observations, like

430
00:15:00,666 --> 00:15:03,733
data on which we deploy
our model in production.

431
00:15:03,966 --> 00:15:05,666
And therefore we're not supposed

432
00:15:05,666 --> 00:15:07,500
to fit our scaler

433
00:15:07,500 --> 00:15:09,500
or, you know, feature extractor object

434
00:15:09,500 --> 00:15:10,966
on the test set.

435
00:15:10,966 --> 00:15:13,766
We can apply them to
transform it, right, because they were

436
00:15:13,766 --> 00:15:15,100
fitted on the training set.

437
00:15:15,100 --> 00:15:17,033
But we can't fit them again to

438
00:15:17,033 --> 00:15:18,600
the test set, because that would be like

439
00:15:18,600 --> 00:15:21,700
trying to get some hints of information
from the test set

440
00:15:21,900 --> 00:15:24,066
that we're not supposed to have.
That's exactly

441
00:15:24,066 --> 00:15:26,100
what information leakage is about.
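
The fit-on-train, transform-on-test rule can be checked with a small sketch using the scaler. It assumes `load_wine` in place of Wine.csv and an illustrative split; the variable names `uses_train_means` and `test_is_centered` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumptions: load_wine replaces Wine.csv; split ratio and seed are illustrative.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sc = StandardScaler()
sc.fit(X_train)  # learns means and standard deviations from the training set only

# The stored means are the training means, not the test means.
uses_train_means = np.allclose(sc.mean_, X_train.mean(axis=0))

# transform() reuses those training statistics, so the scaled test set
# is generally not centered exactly at zero; that is expected.
X_test_scaled = sc.transform(X_test)
test_is_centered = np.allclose(X_test_scaled.mean(axis=0), 0.0)

print(uses_train_means, test_is_centered)
```

The same reasoning applies to the PCA object: its components are learned from the training set, and the test set is only projected onto them.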

442
00:15:26,100 --> 00:15:28,200
So there you go.
I said everything. Now you can

443
00:15:28,200 --> 00:15:29,466
press pause on this

444
00:15:29,466 --> 00:15:30,466
video to

445
00:15:30,466 --> 00:15:33,466
finish this implementation of PCA.

446
00:15:33,466 --> 00:15:37,200
And in two seconds I'm going to implement
with you the solution.

447
00:15:40,066 --> 00:15:40,900
All right.

448
00:15:40,900 --> 00:15:43,300
I hope you did well.
Now let's do it together.

449
00:15:43,300 --> 00:15:46,400
So as we said, we want to apply this
PCA object separately

450
00:15:46,400 --> 00:15:47,100
on the training set

451
00:15:47,100 --> 00:15:47,733
and the test set.

452
00:15:47,733 --> 00:15:50,733
So first I'm going to take X_train,

453
00:15:51,166 --> 00:15:52,866
all right, which I'm going to

454
00:15:52,866 --> 00:15:57,300
update by applying this pca object, from

455
00:15:57,300 --> 00:15:58,200
which I'm going to

456
00:15:58,200 --> 00:15:59,433
call the fit

457
00:15:59,433 --> 00:16:02,300
transform method

458
00:16:02,300 --> 00:16:03,900
on this old

459
00:16:03,900 --> 00:16:08,066
version of X_train,
meaning before the transformation by PCA.

460
00:16:08,300 --> 00:16:09,100
And so here,

461
00:16:09,100 --> 00:16:11,800
what happens
technically is that the fit part of this

462
00:16:11,800 --> 00:16:13,366
fit_transform method will get

463
00:16:13,366 --> 00:16:15,766
all the information it needs from X_train

464
00:16:15,766 --> 00:16:16,700
to apply

465
00:16:16,700 --> 00:16:18,600
principal component analysis.

466
00:16:18,600 --> 00:16:20,733
And then, of course, the transform

467
00:16:20,733 --> 00:16:21,300
part of this

468
00:16:21,300 --> 00:16:23,366
fit_transform method will apply

469
00:16:23,366 --> 00:16:25,900
the transformation itself to extract the

470
00:16:25,900 --> 00:16:28,100
principal component features. Okay.

471
00:16:28,100 --> 00:16:30,900
So that's what it means technically, and now,

472
00:16:30,900 --> 00:16:32,000
well, let's do the same,

473
00:16:32,000 --> 00:16:36,000
actually, for X_test. I'm copying
this, pasting it here,

474
00:16:36,300 --> 00:16:39,600
and replacing here X_train by X_test,

475
00:16:40,100 --> 00:16:43,100
then X_train here again by X_test,

476
00:16:43,133 --> 00:16:47,700
and only applying the transform method.

477
00:16:48,200 --> 00:16:49,800
And there we go my friends.

478
00:16:49,800 --> 00:16:52,633
This implementation is already over.
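
Put together, the cell dictated in this video would look roughly like the sketch below. The preprocessing lines are included only to make the example runnable on its own, with `load_wine` assumed as a stand-in for Wine.csv and an illustrative split; the last three statements before the print are the PCA cell itself.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumptions: load_wine replaces Wine.csv; split ratio and seed are illustrative.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying PCA: fit and transform on the training set, transform only on the test set.
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

print(X_train.shape, X_test.shape)
```

After this cell, both sets have two columns, principal component one and principal component two, instead of the original 13 features.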