1
00:00:00,333 --> 00:00:02,700
Hello and welcome to this R tutorial.

2
00:00:02,700 --> 00:00:05,833
So in the previous tutorials,
we took care of the preprocessing phase.

3
00:00:05,833 --> 00:00:09,833
And then we applied PCA on our data
set to reduce its

4
00:00:09,833 --> 00:00:12,900
dimensionality
down to two new extracted features.

5
00:00:13,200 --> 00:00:16,200
And now we are ready
to build a classification model.

6
00:00:16,400 --> 00:00:20,166
So speaking of classification model,
we started with the logistic regression

7
00:00:20,166 --> 00:00:20,833
model.

8
00:00:20,833 --> 00:00:24,966
But actually from this point
we can build any classification model

9
00:00:24,966 --> 00:00:27,966
among all the classification models
we made in part three.

10
00:00:28,166 --> 00:00:32,000
If I go back to part three classification,
this folder here

11
00:00:32,300 --> 00:00:35,133
you have here
all the models we made in this part three.

12
00:00:35,133 --> 00:00:38,300
And basically from this point
you can build any model

13
00:00:38,300 --> 00:00:41,766
you want by
just selecting your classification model.

14
00:00:41,766 --> 00:00:45,866
For example, let's take support
vector machine classification model.

15
00:00:46,266 --> 00:00:49,266
Then you can just open the SVM R file.

16
00:00:49,433 --> 00:00:53,600
And then basically all you need to do
is take everything after the data

17
00:00:53,600 --> 00:00:54,600
preprocessing phase.

18
00:00:54,600 --> 00:00:57,900
That is from where you start to build
your SVM model.

19
00:00:58,566 --> 00:01:01,833
And select everything
down to the bottom and copy.

20
00:01:02,300 --> 00:01:06,900
And then in your PCA file
you can include your classification model.

21
00:01:07,200 --> 00:01:10,200
Right after applying PCA on your data set.

22
00:01:10,200 --> 00:01:14,866
And so here I just replaced the logistic
regression model by the SVM model.

23
00:01:15,100 --> 00:01:18,533
And you can do this for any classification
model you want.

24
00:01:18,866 --> 00:01:21,866
Among the classification models
we made in part three.

25
00:01:21,933 --> 00:01:24,900
So that's very easy to replace
your different models by this simple

26
00:01:24,900 --> 00:01:28,700
copy paste so that you can try different
classification models very efficiently.

27
00:01:29,266 --> 00:01:31,333
So let's see what we get
with this SVM model, for example.

28
00:01:31,333 --> 00:01:34,900
We just need to change here
the name of the dependent variable.

29
00:01:34,900 --> 00:01:38,066
This is not purchased
but customer segment.

30
00:01:40,033 --> 00:01:41,266
Here we go.

31
00:01:41,266 --> 00:01:43,100
And that's
the only thing we need to change

32
00:01:43,100 --> 00:01:45,866
because the data here
has the training set as input.

33
00:01:45,866 --> 00:01:48,133
And this is the training set
which is transformed.

34
00:01:48,133 --> 00:01:51,133
The new training set
composed of the new extracted features.

35
00:01:51,400 --> 00:01:55,100
And so basically
we are ready to select this section

36
00:01:55,100 --> 00:01:59,600
and execute it
to build our SVM classification model.

37
00:02:00,066 --> 00:02:01,666
And now that the model is built,

38
00:02:01,666 --> 00:02:05,033
we are ready to predict
the new observations of the test set.

39
00:02:05,366 --> 00:02:06,966
And so this line is ready.

40
00:02:06,966 --> 00:02:09,000
And actually
we don't need to change anything

41
00:02:09,000 --> 00:02:12,900
because this index three here
is the index of the dependent variable.

42
00:02:13,033 --> 00:02:16,533
And since we reduced the dimensionality
of our data set down to two,

43
00:02:16,666 --> 00:02:19,433
that means that we have two features
and one dependent variable.

44
00:02:19,433 --> 00:02:22,433
And therefore the index of the dependent
variable is still three.

45
00:02:22,700 --> 00:02:25,700
And so we are ready
to select this line and execute.

46
00:02:25,966 --> 00:02:29,033
And now we have the predictions
of the test set results.

47
00:02:29,500 --> 00:02:32,833
So we can have a look
at y pred in the console.

48
00:02:32,833 --> 00:02:34,100
Press enter.

49
00:02:34,100 --> 00:02:38,466
And for each observation of the test set
we have its prediction by the model

50
00:02:38,466 --> 00:02:39,633
the SVM model.

51
00:02:39,633 --> 00:02:44,133
So for example, the fourth wine of the data
set that belongs to the test set

52
00:02:44,500 --> 00:02:47,166
is predicted to belong
to customer segment number one.

53
00:02:47,166 --> 00:02:52,266
And wine number 132 is predicted to
belong to customer segment number three.

54
00:02:52,966 --> 00:02:53,833
So very easy.

55
00:02:53,833 --> 00:02:56,700
And then we can make the confusion matrix.

56
00:02:56,700 --> 00:02:58,733
And here we don't need
to change anything here

57
00:02:58,733 --> 00:03:02,200
because this corresponds
to the index of the dependent variable.

58
00:03:02,200 --> 00:03:04,866
So we are ready to execute this as well.

59
00:03:04,866 --> 00:03:07,866
Execute. The confusion matrix is ready.

60
00:03:07,900 --> 00:03:08,700
Let's have a look.

61
00:03:10,033 --> 00:03:11,066
And wow,

62
00:03:11,066 --> 00:03:14,066
perfect results:
we only get correct predictions.

63
00:03:14,333 --> 00:03:18,000
As you can see, 12 wines were correctly
predicted to belong to customer

64
00:03:18,000 --> 00:03:22,200
segment number one, 14 wines were correctly
predicted to belong to customer segment

65
00:03:22,200 --> 00:03:26,033
number two, and 10 wines were correctly
predicted to belong to customer segment

66
00:03:26,033 --> 00:03:27,000
number three.

67
00:03:27,000 --> 00:03:29,966
And then we have zero incorrect
predictions.

68
00:03:29,966 --> 00:03:31,600
So these are excellent results.

69
00:03:31,600 --> 00:03:34,600
And of course we get 100% accuracy.

70
00:03:34,833 --> 00:03:38,700
So now when moving on to the next part
to visualize the training set results,

71
00:03:38,900 --> 00:03:40,500
we should get amazingly

72
00:03:40,500 --> 00:03:44,200
well-separated prediction regions
and a very clear prediction boundary.

73
00:03:44,300 --> 00:03:45,600
So let's check it out.

74
00:03:45,600 --> 00:03:48,366
But now we have something to change.

75
00:03:48,366 --> 00:03:53,100
And this is not a tiny change as we used
to do, because now we have three classes.

76
00:03:53,400 --> 00:03:57,300
And as you can notice in this code
when we plot the prediction regions

77
00:03:57,300 --> 00:03:58,633
thanks to this line,

78
00:03:58,633 --> 00:04:02,800
well this code template allows us to do it
when we only have two classes

79
00:04:03,033 --> 00:04:06,033
because as you can see,
we have this if else condition.

80
00:04:06,066 --> 00:04:11,066
If y grid equals one, then the color is
green, and else, if y grid equals

81
00:04:11,066 --> 00:04:15,066
zero, then the color is tomato
and same when we plot the observations.

82
00:04:15,200 --> 00:04:18,466
If an observation of the set,
that is the training

83
00:04:18,466 --> 00:04:21,466
set, belongs to class one, then it's green.

84
00:04:21,566 --> 00:04:22,700
And if it belongs to class

85
00:04:22,700 --> 00:04:26,066
zero, that is in the else condition,
the points will be red.

86
00:04:26,433 --> 00:04:28,933
But now the problem is
that we have three classes.

87
00:04:28,933 --> 00:04:33,766
So we need to improve this code here
to distinguish the three conditions:

88
00:04:33,766 --> 00:04:38,400
If y equals zero, if y equals one,
and if y equals two.

89
00:04:39,000 --> 00:04:39,833
So let's do it.

90
00:04:39,833 --> 00:04:41,666
That will be good coding practice.

91
00:04:41,666 --> 00:04:44,766
And speaking of coding practice,
what would be very good is that you

92
00:04:44,766 --> 00:04:47,766
try to do it
before I do it in this tutorial.

93
00:04:47,900 --> 00:04:50,266
So you can press pause and try.

94
00:04:50,266 --> 00:04:52,000
And now I'm going to do it.

95
00:04:52,000 --> 00:04:54,600
So basically we need to add
one more condition.

96
00:04:54,600 --> 00:04:57,566
The condition where y equals two.

97
00:04:57,566 --> 00:04:58,733
So let's do it.

98
00:04:58,733 --> 00:05:01,200
Let's add this new condition here.

99
00:05:01,200 --> 00:05:02,833
If y grid

100
00:05:04,300 --> 00:05:06,800
equals equals two, then comma.

101
00:05:06,800 --> 00:05:11,666
And then after this condition y grid
is equal to two, we will put what we want.

102
00:05:11,933 --> 00:05:15,533
And what we want is a new color
because there is one color

103
00:05:15,533 --> 00:05:18,533
associated to each value of y grid.

104
00:05:18,600 --> 00:05:22,966
So we will keep spring green three
for the case where y grid equals one.

105
00:05:23,266 --> 00:05:26,600
And we will keep tomato for the case
where y grid equals zero.

106
00:05:27,133 --> 00:05:30,400
But for y equals two
we need to introduce a new color.

107
00:05:30,666 --> 00:05:33,900
And since we have here green and red
let's put blue.

108
00:05:34,400 --> 00:05:38,733
So a good color is actually deep sky blue.

109
00:05:40,200 --> 00:05:43,200
Then comma, to get the next conditions.

110
00:05:43,433 --> 00:05:46,466
So, so far what we see
is that if y grid equals

111
00:05:46,466 --> 00:05:49,466
equals two,
then the color will be deep sky blue.

112
00:05:49,733 --> 00:05:52,666
Then if y grid equals one
then the color will be green.

113
00:05:52,666 --> 00:05:55,566
And if y grid equals zero
then the color will be red.

114
00:05:55,566 --> 00:05:57,466
But this is not how it works.

115
00:05:57,466 --> 00:06:01,600
It's not as simple as that because
this is actually not a correct syntax.

116
00:06:01,800 --> 00:06:05,700
Because this ifelse function expects
three arguments.

117
00:06:05,866 --> 00:06:09,366
The first argument is the condition
y grid equals one.

118
00:06:09,866 --> 00:06:14,033
Then the second argument is the result
when this condition is true,

119
00:06:14,400 --> 00:06:18,333
and the third argument is the result
when this condition is not true.

120
00:06:18,733 --> 00:06:21,833
So here we have a lot more than three
arguments.

121
00:06:21,900 --> 00:06:23,066
That's not right.

122
00:06:23,066 --> 00:06:27,500
And so the trick to solve
this is to put all this,

123
00:06:27,733 --> 00:06:29,800
that is the y grid equals one condition,

124
00:06:29,800 --> 00:06:31,900
and then the result spring green three,

125
00:06:31,900 --> 00:06:37,433
and then the result if y grid equals zero,
into the third argument of this

126
00:06:37,433 --> 00:06:38,766
ifelse function.

127
00:06:38,766 --> 00:06:42,066
So that means that we'll get
the first argument y grid equals two.

128
00:06:42,066 --> 00:06:43,333
That's the condition.

129
00:06:43,333 --> 00:06:45,600
Then the second argument deep sky blue.

130
00:06:45,600 --> 00:06:47,833
That is the result when y grid equals two.

131
00:06:47,833 --> 00:06:51,366
And the third argument
all this in one same argument.

132
00:06:51,500 --> 00:06:54,800
And so how can we include all this
in one same argument?

133
00:06:55,200 --> 00:07:00,300
Well we need to use another ifelse here,
which will contain the other

134
00:07:00,300 --> 00:07:04,366
two conditions where y grid equals
one and y grid equals zero.

135
00:07:05,000 --> 00:07:07,466
And so we need to be careful
with the parentheses

136
00:07:07,466 --> 00:07:09,300
because we added a new function,

137
00:07:09,300 --> 00:07:12,300
this new ifelse function. And here it is.

138
00:07:12,733 --> 00:07:14,600
The new parenthesis is added.

139
00:07:14,600 --> 00:07:16,166
And now it should be fine.

140
00:07:16,166 --> 00:07:17,533
So let's recap.

141
00:07:17,533 --> 00:07:19,633
We start with this first ifelse here.

142
00:07:19,633 --> 00:07:23,666
So if y grid equals two
then the color will be deep sky blue.

143
00:07:24,033 --> 00:07:28,500
And then if y grid is not equal to two
then we go into this new if else

144
00:07:29,066 --> 00:07:32,566
and this new if else contains the two last
remaining conditions.

145
00:07:32,800 --> 00:07:36,600
That is, if y equals one,
then the color will be spring green

146
00:07:36,600 --> 00:07:37,500
three.

147
00:07:37,500 --> 00:07:41,700
And if y equals zero,
then the color will be tomato, like red.

148
00:07:42,200 --> 00:07:45,766
And therefore we get our three conditions
in the correct syntax.

149
00:07:46,266 --> 00:07:47,166
So that's a trick.

150
00:07:47,166 --> 00:07:49,400
It's actually quite common to do it
in coding.

151
00:07:49,400 --> 00:07:51,033
So it's good to know how to do it.

152
00:07:52,433 --> 00:07:55,566
And that's the same to plot the colors
of our observation points.

153
00:07:55,566 --> 00:08:00,000
So we need to take this
and paste it here again.

154
00:08:00,600 --> 00:08:03,600
And then replace this one here by two.

155
00:08:04,033 --> 00:08:06,266
So that is the new first condition.

156
00:08:06,266 --> 00:08:11,333
If our observation point belongs to class
two then we want to give it a new color

157
00:08:11,633 --> 00:08:15,866
which will be a blue color but
a different blue than this deep sky blue.

158
00:08:16,000 --> 00:08:17,733
And so, you know,
we need to get a good contrast

159
00:08:17,733 --> 00:08:21,766
so that we don't confuse the color
of the point and the color of the region.

160
00:08:22,033 --> 00:08:26,600
So actually a good color to use here
is blue three, blue three.

161
00:08:26,600 --> 00:08:29,100
You'll see that
it will give us a good contrast.

162
00:08:29,100 --> 00:08:32,133
And so that's the first result
of the first condition.

163
00:08:32,533 --> 00:08:33,266
And then same.

164
00:08:33,266 --> 00:08:36,366
We need to include the two
remaining conditions here

165
00:08:36,366 --> 00:08:39,700
into one argument
that is inside a new if else.

166
00:08:40,000 --> 00:08:43,200
So ifelse here, in parentheses.

167
00:08:43,866 --> 00:08:47,766
And we don't forget to add
the closing parenthesis here.

168
00:08:48,300 --> 00:08:50,200
And here we go. This is ready.

169
00:08:50,200 --> 00:08:53,866
So recap again
if our observation point belongs

170
00:08:53,866 --> 00:08:56,866
to class two
then it will have the color blue three.

171
00:08:57,033 --> 00:09:00,200
Then if it doesn't belong to class two
then we go here.

172
00:09:00,366 --> 00:09:03,266
And here we have
two new separate conditions.

173
00:09:03,266 --> 00:09:06,266
If our observation point
belongs to class one

174
00:09:06,433 --> 00:09:08,366
then it will have the color green four.

175
00:09:08,366 --> 00:09:12,400
And if it doesn't belong to class one
then it will have the color red three.

176
00:09:13,200 --> 00:09:14,700
So that should be ready.

177
00:09:14,700 --> 00:09:17,700
And then we have two tiny changes to add.

178
00:09:17,833 --> 00:09:22,166
So remember in this line here line
49 with the column names

179
00:09:22,266 --> 00:09:26,333
we need to input the real column
names of the columns of the training set.

180
00:09:26,700 --> 00:09:29,633
And these column names are not age
and estimated salary.

181
00:09:29,633 --> 00:09:32,333
That was for the previous
classification problem.

182
00:09:32,333 --> 00:09:34,300
Now the column names are of course

183
00:09:35,400 --> 00:09:37,166
PC1 and PC2.

184
00:09:37,166 --> 00:09:40,800
So here we just need to replace age by PC1

185
00:09:41,566 --> 00:09:45,533
and estimated salary by PC2.

186
00:09:45,900 --> 00:09:46,966
So that's compulsory.

187
00:09:46,966 --> 00:09:48,766
That's actually what you need to input.

188
00:09:48,766 --> 00:09:51,766
Otherwise you will get an error
when you execute your code.

189
00:09:52,133 --> 00:09:56,266
And then here it's not compulsory
but it's better for the visualization.

190
00:09:56,300 --> 00:09:59,300
You can replace age by PC1

191
00:09:59,733 --> 00:10:03,466
and estimated salary by PC2,

192
00:10:03,733 --> 00:10:04,633
but if you don't do it,

193
00:10:04,633 --> 00:10:08,666
you will not get an error because this is
just for the visualization.

194
00:10:08,666 --> 00:10:12,200
This is just for the labels
that you will see on the graph.

195
00:10:13,233 --> 00:10:13,633
All right.

196
00:10:13,633 --> 00:10:16,900
And then I think we're good.
I think this is ready to be executed.

197
00:10:16,900 --> 00:10:19,900
Let's hope that I didn't make any mistake.

198
00:10:20,000 --> 00:10:23,866
So we're going to try to execute this
and let's see what we get.

199
00:10:25,033 --> 00:10:27,966
So I'm going to select everything
in this section.

200
00:10:27,966 --> 00:10:31,800
So from here up to the top here

201
00:10:32,400 --> 00:10:34,800
and let's execute.

202
00:10:34,800 --> 00:10:36,600
All right. Good start.

203
00:10:36,600 --> 00:10:38,700
It's running.

204
00:10:38,700 --> 00:10:39,533
Let's see what we get.

205
00:10:39,533 --> 00:10:42,533
Let's go into this plot tab.

206
00:10:42,633 --> 00:10:45,500
It is still running.

207
00:10:45,500 --> 00:10:46,333
And here we go.

208
00:10:46,333 --> 00:10:48,400
We get our beautiful results.

209
00:10:48,400 --> 00:10:50,800
So I hope
you like my choice of the colors.

210
00:10:50,800 --> 00:10:52,600
This is the deep sky blue.

211
00:10:52,600 --> 00:10:53,833
And this is the blue three.

212
00:10:53,833 --> 00:10:55,533
So that we get the contrast

213
00:10:55,533 --> 00:10:58,533
between the observation points
and the prediction regions.

214
00:10:59,300 --> 00:11:02,300
So we can actually enlarge this
if you want.

215
00:11:04,100 --> 00:11:07,600
So as a quick reminder the points are
the real observation points.

216
00:11:07,600 --> 00:11:10,900
That is these are the ones that we have
in our training set.

217
00:11:11,366 --> 00:11:14,100
And the regions are
where our model predicts

218
00:11:14,100 --> 00:11:16,233
that the wines belong
to the customer segments.

219
00:11:16,233 --> 00:11:17,466
So for example,

220
00:11:17,466 --> 00:11:17,966
the green

221
00:11:17,966 --> 00:11:21,700
points are the wines of the training set
belonging to customer segment number two.

222
00:11:22,066 --> 00:11:23,133
And this green region

223
00:11:23,133 --> 00:11:27,033
here is where the model predicts
that the wines belong to customer

224
00:11:27,033 --> 00:11:28,333
segment number two.

225
00:11:28,333 --> 00:11:31,133
And same for the blue and red parts here.

226
00:11:31,133 --> 00:11:31,466
All right.

227
00:11:31,466 --> 00:11:33,766
So now we can quickly do the same
for the test set.

228
00:11:33,766 --> 00:11:37,266
So we actually need to do the same changes
as we did for the training set.

229
00:11:37,633 --> 00:11:41,966
That is, let's start with the simplest one:
we need to replace here age by PC1

230
00:11:41,966 --> 00:11:46,033
and estimated salary by PC2.

231
00:11:46,033 --> 00:11:48,433
So these are compulsory changes.

232
00:11:48,433 --> 00:11:52,333
And we can also change the labels
even if those are not compulsory changes.

233
00:11:52,666 --> 00:11:55,100
Replace age by PC1.

234
00:11:55,100 --> 00:11:58,566
Replace estimated salary by PC2.

235
00:11:59,400 --> 00:12:01,000
And here we go. We are almost ready.

236
00:12:01,000 --> 00:12:04,500
We need to make this big change here
to add the third condition.

237
00:12:04,500 --> 00:12:09,833
To add the third color and
we can actually take these two lines here.

238
00:12:10,700 --> 00:12:12,333
Copy them and

239
00:12:14,033 --> 00:12:16,633
select this and paste.

240
00:12:16,633 --> 00:12:17,066
All right.

241
00:12:17,066 --> 00:12:21,833
We can do this because these are the same
variable names as for the training set.

242
00:12:22,200 --> 00:12:25,200
Because we use this set variable name here

243
00:12:25,233 --> 00:12:28,733
for both the training set
and the test set.

244
00:12:29,666 --> 00:12:31,200
And so basically that's ready.

245
00:12:31,200 --> 00:12:34,200
We can now select this whole section here

246
00:12:34,566 --> 00:12:37,966
and execute
to visualize the test set results.

247
00:12:38,366 --> 00:12:39,833
So let's do it.

248
00:12:39,833 --> 00:12:42,000
Here we go, it's processing.

249
00:12:42,000 --> 00:12:43,633
The test set results are coming.

250
00:12:43,633 --> 00:12:48,133
And we should get a perfect plot
with no incorrect predictions.

251
00:12:48,133 --> 00:12:51,633
That means that we should get
all the green points in the green region.

252
00:12:52,000 --> 00:12:53,666
All the red points, well, here we go.

253
00:12:53,666 --> 00:12:56,533
All the red points,
as you can see, are in the red region

254
00:12:56,533 --> 00:12:59,533
and all the blue points
in the blue region.

255
00:12:59,666 --> 00:13:00,500
So that's perfect.

256
00:13:00,500 --> 00:13:04,066
That's a perfect representation
of 100% accuracy.

257
00:13:04,400 --> 00:13:08,166
And so in conclusion,
we were able to transform a data set

258
00:13:08,166 --> 00:13:13,866
composed of 13 independent variables into
this new data set of reduced dimension.

259
00:13:14,166 --> 00:13:16,933
We were able to reduce the dimension
down to two,

260
00:13:16,933 --> 00:13:20,466
thanks to which we could visualize
the results in two dimensions.

261
00:13:21,066 --> 00:13:21,966
Okay, perfect.

262
00:13:21,966 --> 00:13:24,566
We are done with this
first section about PCA.

263
00:13:24,566 --> 00:13:26,700
And now the interesting thing
that we want to see

264
00:13:26,700 --> 00:13:30,033
is how our next dimensionality
reduction technique

265
00:13:30,266 --> 00:13:33,333
that we are going to implement
is going to do on this data set.

266
00:13:33,500 --> 00:13:37,566
This next dimensionality reduction
technique is LDA, Linear Discriminant

267
00:13:37,566 --> 00:13:38,466
Analysis.

268
00:13:38,466 --> 00:13:40,666
So we'll find out about that
in the next section.

269
00:13:40,666 --> 00:13:42,466
And until then enjoy machine learning.