1 00:00:00,333 --> 00:00:02,700 Hello and welcome to this R tutorial. 2 00:00:02,700 --> 00:00:05,833 So in the previous tutorials, we took care of the pre-processing phase. 3 00:00:05,833 --> 00:00:09,833 And then we applied PCA on our data set to reduce its 4 00:00:09,833 --> 00:00:12,900 dimensionality down to two new extracted features. 5 00:00:13,200 --> 00:00:16,200 And now we are ready to build a classification model. 6 00:00:16,400 --> 00:00:20,166 So speaking of classification models, we started with the logistic regression 7 00:00:20,166 --> 00:00:20,833 model. 8 00:00:20,833 --> 00:00:24,966 But actually from this point we can build any classification model 9 00:00:24,966 --> 00:00:27,966 among all the classification models we made in part three. 10 00:00:28,166 --> 00:00:32,000 If I go back to part three, classification, this folder here, 11 00:00:32,300 --> 00:00:35,133 you have here all the models we made in this part three. 12 00:00:35,133 --> 00:00:38,300 And basically from this point you can build any model 13 00:00:38,300 --> 00:00:41,766 you want by just selecting your classification model. 14 00:00:41,766 --> 00:00:45,866 For example, let's take the support vector machine classification model. 15 00:00:46,266 --> 00:00:49,266 Then you can just open the SVM R file. 16 00:00:49,433 --> 00:00:53,600 And then basically all you need to do is take everything after the data 17 00:00:53,600 --> 00:00:54,600 preprocessing phase, 18 00:00:54,600 --> 00:00:57,900 that is from where you start to build your SVM model, 19 00:00:58,566 --> 00:01:01,833 and select everything down to the bottom and copy. 20 00:01:02,300 --> 00:01:06,900 And then in your PCA file you can include your classification model 21 00:01:07,200 --> 00:01:10,200 right after applying PCA on your data set. 22 00:01:10,200 --> 00:01:14,866 And so here I just replaced the logistic regression model with the SVM model.
23 00:01:15,100 --> 00:01:18,533 And you can do this for any classification model you want 24 00:01:18,866 --> 00:01:21,866 among the classification models we made in part three. 25 00:01:21,933 --> 00:01:24,900 So it's very easy to swap your different models with this simple 26 00:01:24,900 --> 00:01:28,700 copy-paste, so that you can try different classification models very efficiently. 27 00:01:29,266 --> 00:01:31,333 So let's see what we get with this SVM model, for example. 28 00:01:31,333 --> 00:01:34,900 We just need to change here the name of the dependent variable. 29 00:01:34,900 --> 00:01:38,066 This is not Purchased but Customer Segment. 30 00:01:40,033 --> 00:01:41,266 Here we go. 31 00:01:41,266 --> 00:01:43,100 And that's the only thing we need to change 32 00:01:43,100 --> 00:01:45,866 because the data input here is the training set. 33 00:01:45,866 --> 00:01:48,133 And this is the transformed training set, 34 00:01:48,133 --> 00:01:51,133 the new training set composed of the new extracted features. 35 00:01:51,400 --> 00:01:55,100 And so basically we are ready to select this section 36 00:01:55,100 --> 00:01:59,600 and execute it to build our SVM classification model. 37 00:02:00,066 --> 00:02:01,666 And now that the model is built, 38 00:02:01,666 --> 00:02:05,033 we are ready to predict the new observations of the test set. 39 00:02:05,366 --> 00:02:06,966 And so this line is ready. 40 00:02:06,966 --> 00:02:09,000 And actually we don't need to change anything 41 00:02:09,000 --> 00:02:12,900 because this index three here is the index of the dependent variable. 42 00:02:13,033 --> 00:02:16,533 And since we reduced the dimensionality of our data set down to two, 43 00:02:16,666 --> 00:02:19,433 that means that we have two features and one dependent variable. 44 00:02:19,433 --> 00:02:22,433 And therefore the index of the dependent variable is still three. 45 00:02:22,700 --> 00:02:25,700 And so we are ready to select this line and execute.
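A minimal sketch of the two steps just narrated (fit the SVM on the transformed training set, then predict the test set). It assumes the `e1071` package supplies the SVM, which the narration does not state, and it uses a small synthetic stand-in for the PCA output. The column names `PC1`, `PC2` and `Customer_Segment`, the linear kernel, and the set sizes are illustrative assumptions.

```r
# Assumption: e1071 provides svm(); the narration does not name the package.
library(e1071)

set.seed(123)
# Synthetic stand-in for the PCA output: three clusters in two features.
centers = data.frame(PC1 = c(-3, 0, 3), PC2 = c(2, -2, 2))
make_set = function(n) {
  data.frame(
    PC1 = rnorm(3 * n, mean = rep(centers$PC1, each = n)),
    PC2 = rnorm(3 * n, mean = rep(centers$PC2, each = n)),
    Customer_Segment = rep(1:3, each = n)
  )
}
training_set = make_set(10)
test_set = make_set(5)

# Only the dependent variable name in the formula changes from the template.
classifier = svm(formula = Customer_Segment ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')

# Column 3 is the dependent variable, so [-3] keeps only PC1 and PC2.
y_pred = predict(classifier, newdata = test_set[-3])
print(head(y_pred))
```

Because two features plus one dependent variable still put the label in column 3, the `[-3]` index carries over unchanged from a two-feature template.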
46 00:02:25,966 --> 00:02:29,033 And now we have the predictions of the test set results. 47 00:02:29,500 --> 00:02:32,833 So we can have a look at y pred in the console. 48 00:02:32,833 --> 00:02:34,100 Press enter. 49 00:02:34,100 --> 00:02:38,466 And for each observation of the test set we have its prediction by the model, 50 00:02:38,466 --> 00:02:39,633 the SVM model. 51 00:02:39,633 --> 00:02:44,133 So for example, the fourth wine of the data set, which belongs to the test set, 52 00:02:44,500 --> 00:02:47,166 is predicted to belong to customer segment number one. 53 00:02:47,166 --> 00:02:52,266 And wine number 132 is predicted to belong to customer segment number three. 54 00:02:52,966 --> 00:02:53,833 So very easy. 55 00:02:53,833 --> 00:02:56,700 And then we can make the confusion matrix. 56 00:02:56,700 --> 00:02:58,733 And we don't need to change anything here 57 00:02:58,733 --> 00:03:02,200 because this corresponds to the index of the dependent variable. 58 00:03:02,200 --> 00:03:04,866 So we are ready to execute this as well. 59 00:03:04,866 --> 00:03:07,866 Execute. The confusion matrix is ready. 60 00:03:07,900 --> 00:03:08,700 Let's have a look. 61 00:03:10,033 --> 00:03:11,066 And wow, 62 00:03:11,066 --> 00:03:14,066 perfect results: we only get correct predictions. 63 00:03:14,333 --> 00:03:18,000 As you can see, 12 wines were correctly predicted to belong to customer 64 00:03:18,000 --> 00:03:22,200 segment number one, 14 wines were correctly predicted to belong to customer segment 65 00:03:22,200 --> 00:03:26,033 number two, and ten wines were correctly predicted to belong to customer segment 66 00:03:26,033 --> 00:03:27,000 number three. 67 00:03:27,000 --> 00:03:29,966 And then we have zero incorrect predictions. 68 00:03:29,966 --> 00:03:31,600 So these are excellent results. 69 00:03:31,600 --> 00:03:34,600 And of course we get 100% accuracy.
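The confusion matrix read out above can be rebuilt in a few lines. This is only a sketch: the label vectors are reconstructed from the quoted counts (12, 14 and 10 correct predictions, no errors) rather than taken from the real test set, and computing accuracy as the share of the diagonal is the usual definition, not a line from the narrated script.

```r
# Labels reconstructed from the counts quoted in the narration (ordering is hypothetical).
y_true = rep(1:3, times = c(12, 14, 10))
y_pred = y_true  # every prediction matches, as reported

# Rows are actual segments, columns are predicted segments.
cm = table(y_true, y_pred)

# Accuracy = correct predictions (the diagonal) divided by all predictions.
accuracy = sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```

Any off-diagonal count would be a misclassified test observation, so an all-zero off-diagonal is what 100% accuracy looks like in this table.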
70 00:03:34,833 --> 00:03:38,700 So now, when moving on to the next part to visualize the training set results, 71 00:03:38,900 --> 00:03:40,500 we should get amazingly 72 00:03:40,500 --> 00:03:44,200 well-separated prediction regions and a very clear prediction boundary. 73 00:03:44,300 --> 00:03:45,600 So let's check it out. 74 00:03:45,600 --> 00:03:48,366 But now we have something to change. 75 00:03:48,366 --> 00:03:53,100 And this is not a tiny change as we used to do, because now we have three classes. 76 00:03:53,400 --> 00:03:57,300 And as you can notice in this code, when we plot the prediction regions 77 00:03:57,300 --> 00:03:58,633 thanks to this line, 78 00:03:58,633 --> 00:04:02,800 well, this code template only allows us to do it when we have two classes 79 00:04:03,033 --> 00:04:06,033 because as you can see, we have this ifelse condition. 80 00:04:06,066 --> 00:04:11,066 If y grid equals one, then the color is green, and else, if y grid equals 81 00:04:11,066 --> 00:04:15,066 zero, then the color is tomato, and same when we plot the observations. 82 00:04:15,200 --> 00:04:18,466 If an observation of the set, that is the training 83 00:04:18,466 --> 00:04:21,466 set, belongs to class one, then it's green. 84 00:04:21,566 --> 00:04:22,700 And if it belongs to class 85 00:04:22,700 --> 00:04:26,066 zero, that is in the else condition, the point will be red. 86 00:04:26,433 --> 00:04:28,933 But now the problem is that we have three classes. 87 00:04:28,933 --> 00:04:33,766 So we need to improve this code here to distinguish the three conditions: 88 00:04:33,766 --> 00:04:38,400 if y equals zero, if y equals one, and if y equals two. 89 00:04:39,000 --> 00:04:39,833 So let's do it. 90 00:04:39,833 --> 00:04:41,666 That will be good coding practice. 91 00:04:41,666 --> 00:04:44,766 And speaking of coding practice, what would be very good is that you 92 00:04:44,766 --> 00:04:47,766 try to do it before I do it in this tutorial.
93 00:04:47,900 --> 00:04:50,266 So you can press pause and try. 94 00:04:50,266 --> 00:04:52,000 And now I'm going to do it. 95 00:04:52,000 --> 00:04:54,600 So basically we need to add one more condition. 96 00:04:54,600 --> 00:04:57,566 The condition where y equals two. 97 00:04:57,566 --> 00:04:58,733 So let's do it. 98 00:04:58,733 --> 00:05:01,200 Let's add this new condition here. 99 00:05:01,200 --> 00:05:02,833 If y grid 100 00:05:04,300 --> 00:05:06,800 equals equals two, then comma. 101 00:05:06,800 --> 00:05:11,666 And then after this condition, y grid is equal to two, we will put what we want. 102 00:05:11,933 --> 00:05:15,533 And what we want is a new color because there is one color 103 00:05:15,533 --> 00:05:18,533 associated with each value of y grid. 104 00:05:18,600 --> 00:05:22,966 So we will keep spring green three for the case where y grid equals one. 105 00:05:23,266 --> 00:05:26,600 And we will keep tomato for the case where y grid equals zero. 106 00:05:27,133 --> 00:05:30,400 But for y equals two we need to introduce a new color. 107 00:05:30,666 --> 00:05:33,900 And since we have here green and red, let's put blue. 108 00:05:34,400 --> 00:05:38,733 So a good color is actually deep sky blue. 109 00:05:40,200 --> 00:05:43,200 Then comma to get the next conditions. 110 00:05:43,433 --> 00:05:46,466 So so far what we see is that if y grid equals 111 00:05:46,466 --> 00:05:49,466 equals two, then the color will be deep sky blue. 112 00:05:49,733 --> 00:05:52,666 Then if y grid equals one, then the color will be green. 113 00:05:52,666 --> 00:05:55,566 And if y grid equals zero, then the color will be red. 114 00:05:55,566 --> 00:05:57,466 But this is not how it works. 115 00:05:57,466 --> 00:06:01,600 It's not as simple as that because this is actually not a correct syntax. 116 00:06:01,800 --> 00:06:05,700 Because this ifelse function expects three arguments. 117 00:06:05,866 --> 00:06:09,366 The first argument is the condition y grid equals one.
118 00:06:09,866 --> 00:06:14,033 Then the second argument is the result when this condition is true, 119 00:06:14,400 --> 00:06:18,333 and the third argument is the result when this condition is not true. 120 00:06:18,733 --> 00:06:21,833 So here we have a lot more than three arguments. 121 00:06:21,900 --> 00:06:23,066 That's not right. 122 00:06:23,066 --> 00:06:27,500 And so the trick to solve this is to put all this, 123 00:06:27,733 --> 00:06:29,800 that is the y grid equals one condition, 124 00:06:29,800 --> 00:06:31,900 and then the result spring green three, 125 00:06:31,900 --> 00:06:37,433 and then the result if y grid equals zero, into the third argument of this 127 00:06:37,433 --> 00:06:38,766 ifelse function. 128 00:06:38,766 --> 00:06:42,066 So that means that we'll get the first argument, y grid equals two. 128 00:06:42,066 --> 00:06:43,333 That's the condition. 129 00:06:43,333 --> 00:06:45,600 Then the second argument, deep sky blue. 130 00:06:45,600 --> 00:06:47,833 That is the result when y grid equals two. 131 00:06:47,833 --> 00:06:51,366 And the third argument, all this in one same argument. 132 00:06:51,500 --> 00:06:54,800 And so how can we include all this in one same argument? 133 00:06:55,200 --> 00:07:00,300 Well, we need to use another ifelse here, which will contain the other 134 00:07:00,300 --> 00:07:04,366 two conditions where y grid equals one and y grid equals zero. 135 00:07:05,000 --> 00:07:07,466 And so we need to be careful with the parentheses 136 00:07:07,466 --> 00:07:09,300 because we added a new function. 137 00:07:09,300 --> 00:07:12,300 This new function ifelse. And here it is. 138 00:07:12,733 --> 00:07:14,600 The new parenthesis is added. 139 00:07:14,600 --> 00:07:16,166 And now it should be fine. 140 00:07:16,166 --> 00:07:17,533 So let's recap. 141 00:07:17,533 --> 00:07:19,633 We start with this first ifelse here. 142 00:07:19,633 --> 00:07:23,666 So if y grid equals two then the color will be deep sky blue.
143 00:07:24,033 --> 00:07:28,500 And then if y grid is not equal to two, then we go into this new ifelse, 144 00:07:29,066 --> 00:07:32,566 and this new ifelse contains the two last remaining conditions. 145 00:07:32,800 --> 00:07:36,600 That is, if y equals one, then the color will be spring green 146 00:07:36,600 --> 00:07:37,500 three. 147 00:07:37,500 --> 00:07:41,700 And if y equals zero, then the color will be tomato, like red. 148 00:07:42,200 --> 00:07:45,766 And therefore we get our three conditions in the correct syntax. 149 00:07:46,266 --> 00:07:47,166 So that's a trick. 150 00:07:47,166 --> 00:07:49,400 It's actually quite common to do it in coding. 151 00:07:49,400 --> 00:07:51,033 So it's good to know how to do it. 152 00:07:52,433 --> 00:07:55,566 And that's the same to plot the colors of our observation points. 153 00:07:55,566 --> 00:08:00,000 So we need to take this and paste it here again. 154 00:08:00,600 --> 00:08:03,600 And then replace this one here by two. 155 00:08:04,033 --> 00:08:06,266 So that is the new first condition. 156 00:08:06,266 --> 00:08:11,333 If our observation point belongs to class two, then we want to give it a new color, 157 00:08:11,633 --> 00:08:15,866 which will be a blue color but a different blue than this deep sky blue. 158 00:08:16,000 --> 00:08:17,733 And so, you know, we need to get a good contrast 159 00:08:17,733 --> 00:08:21,766 so that we don't confuse the color of the point and the color of the region. 160 00:08:22,033 --> 00:08:26,600 So actually a good color to use here is blue three, blue three. 161 00:08:26,600 --> 00:08:29,100 You'll see that it will give us a good contrast. 162 00:08:29,100 --> 00:08:32,133 And so that's the first result of the first condition. 163 00:08:32,533 --> 00:08:33,266 And then same. 164 00:08:33,266 --> 00:08:36,366 We need to include the two remaining conditions here 165 00:08:36,366 --> 00:08:39,700 into one argument, that is inside a new ifelse.
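The nested `ifelse` recapped above for the prediction regions can be tried on its own. A small sketch, assuming `y_grid` is a plain vector of predicted labels (the sample values are made up, and the colour strings are the R colour names that match the ones spoken in the narration):

```r
# Made-up predictions for a handful of grid points.
y_grid = c(1, 2, 3, 2, 1)

# The outer ifelse handles label 2; the inner ifelse fills its third ("no") argument.
region_colors = ifelse(y_grid == 2, 'deepskyblue',
                       ifelse(y_grid == 1, 'springgreen3', 'tomato'))
print(region_colors)
```

The innermost "no" value acts as a catch-all: any label that is neither 2 nor 1 receives `tomato`. If the segments happen to be labelled 1, 2 and 3 rather than 0, 1 and 2, the third segment simply lands in that final branch.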
166 00:08:40,000 --> 00:08:43,200 So ifelse here, in parentheses. 167 00:08:43,866 --> 00:08:47,766 And we don't forget to add the closing parenthesis here. 168 00:08:48,300 --> 00:08:50,200 And here we go. This is ready. 169 00:08:50,200 --> 00:08:53,866 So recap again: if our observation point belongs 170 00:08:53,866 --> 00:08:56,866 to class two, then it will have the color blue three. 171 00:08:57,033 --> 00:09:00,200 Then if it doesn't belong to class two, then we go here. 172 00:09:00,366 --> 00:09:03,266 And here we have two new separate conditions. 173 00:09:03,266 --> 00:09:06,266 If our observation point belongs to class one, 174 00:09:06,433 --> 00:09:08,366 then it will have the color green four. 175 00:09:08,366 --> 00:09:12,400 And if it doesn't belong to class one, then it will have the color red three. 176 00:09:13,200 --> 00:09:14,700 So that should be ready. 177 00:09:14,700 --> 00:09:17,700 And then we have two tiny changes to add. 178 00:09:17,833 --> 00:09:22,166 So remember, in this line here, line 49, with the column names, 179 00:09:22,266 --> 00:09:26,333 we need to input the real column names of the columns of the training set. 180 00:09:26,700 --> 00:09:29,633 And these column names are not Age and Estimated Salary. 181 00:09:29,633 --> 00:09:32,333 That was for the previous classification problem. 182 00:09:32,333 --> 00:09:34,300 Now the column names are of course 183 00:09:35,400 --> 00:09:37,166 PC1 and PC2. 184 00:09:37,166 --> 00:09:40,800 So here we just need to replace Age by PC1 185 00:09:41,566 --> 00:09:45,533 and Estimated Salary by PC2. 186 00:09:45,900 --> 00:09:46,966 So that's compulsory. 187 00:09:46,966 --> 00:09:48,766 That's actually what you need to input. 188 00:09:48,766 --> 00:09:51,766 Otherwise you will get an error when you execute your code. 189 00:09:52,133 --> 00:09:56,266 And then here it's not compulsory, but it's better for the visualization.
190 00:09:56,300 --> 00:09:59,300 You can replace age by PC1 191 00:09:59,733 --> 00:10:03,466 and estimated salary by PC2, 192 00:10:03,733 --> 00:10:04,633 but if you don't do it, 193 00:10:04,633 --> 00:10:08,666 you will not get an error because this is just for the visualization. 194 00:10:08,666 --> 00:10:12,200 This is just for the labels that you will see on the graph. 195 00:10:13,233 --> 00:10:13,633 All right. 196 00:10:13,633 --> 00:10:16,900 And then I think we're good I think this is ready to be executed. 197 00:10:16,900 --> 00:10:19,900 Let's hope that I didn't make any mistake. 198 00:10:20,000 --> 00:10:23,866 So we're going to try to execute this and let's see what we get. 199 00:10:25,033 --> 00:10:27,966 So I'm going to select everything in this section. 200 00:10:27,966 --> 00:10:31,800 So from here up to the top here 201 00:10:32,400 --> 00:10:34,800 and let's execute. 202 00:10:34,800 --> 00:10:36,600 All right. Good start. 203 00:10:36,600 --> 00:10:38,700 It's running. 204 00:10:38,700 --> 00:10:39,533 Let's see what we get. 205 00:10:39,533 --> 00:10:42,533 Let's go into this plot tab. 206 00:10:42,633 --> 00:10:45,500 It is still running. 207 00:10:45,500 --> 00:10:46,333 And here we go. 208 00:10:46,333 --> 00:10:48,400 We get our beautiful results. 209 00:10:48,400 --> 00:10:50,800 So I hope you like my choice of the colors. 210 00:10:50,800 --> 00:10:52,600 This is the deep sky blue. 211 00:10:52,600 --> 00:10:53,833 And this is the blue three. 212 00:10:53,833 --> 00:10:55,533 So that we get the contrast 213 00:10:55,533 --> 00:10:58,533 between the observation points and the prediction regions. 214 00:10:59,300 --> 00:11:02,300 So we can actually enlarge this if you want. 215 00:11:04,100 --> 00:11:07,600 So as a quick reminder the points are the real observation points. 216 00:11:07,600 --> 00:11:10,900 That is these are the ones that we have in our training set. 
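The two edits to the plotting section described above (renaming the grid columns and colouring the observation points) can be sketched without a fitted model. The grid bounds, the step size, and the three sample observations are invented for illustration, and `Customer_Segment` is an assumed column name. The renaming presumably matters because the classifier looks features up by column name when predicting on the grid.

```r
# Invented ranges for the two extracted features.
X1 = seq(-1, 1, by = 0.5)
X2 = seq(-1, 1, by = 0.5)

# expand.grid names its columns Var1 and Var2 by default,
# so they are renamed to match the training set's feature columns.
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('PC1', 'PC2')

# Invented observations: PC1, PC2, then the class in column 3.
set = data.frame(PC1 = c(-0.5, 0.2, 0.8),
                 PC2 = c(0.3, -0.7, 0.1),
                 Customer_Segment = c(1, 2, 3))

# Darker shades for the points so they contrast with the region colours.
point_colors = ifelse(set[, 3] == 2, 'blue3',
                      ifelse(set[, 3] == 1, 'green4', 'red3'))
print(colnames(grid_set))
print(point_colors)
```

Because the same `set` variable name is used for both the training and the test visualization, the colour lines can be pasted into the test-set section unchanged.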
217 00:11:11,366 --> 00:11:14,100 And the regions are where our model predicts 218 00:11:14,100 --> 00:11:16,233 that the wines belong to the customer segments. 219 00:11:16,233 --> 00:11:17,466 So for example, 220 00:11:17,466 --> 00:11:17,966 the green 221 00:11:17,966 --> 00:11:21,700 points are the wines of the training set belonging to customer segment number two. 222 00:11:22,066 --> 00:11:23,133 And this green region 223 00:11:23,133 --> 00:11:27,033 here is where the model predicts that the wines belong to customer 224 00:11:27,033 --> 00:11:28,333 segment number two. 225 00:11:28,333 --> 00:11:31,133 And same for the blue and red parts here. 226 00:11:31,133 --> 00:11:31,466 All right. 227 00:11:31,466 --> 00:11:33,766 So now we can quickly do the same for the test set. 228 00:11:33,766 --> 00:11:37,266 So we actually need to make the same changes as we did for the training set. 229 00:11:37,633 --> 00:11:41,966 That is, let's start with the simplest one: we need to replace here Age by PC1 230 00:11:41,966 --> 00:11:46,033 and Estimated Salary by PC2. 231 00:11:46,033 --> 00:11:48,433 So these are compulsory changes. 232 00:11:48,433 --> 00:11:52,333 And we can also change the labels, even if those are not compulsory changes. 233 00:11:52,666 --> 00:11:55,100 Replace Age by PC1. 234 00:11:55,100 --> 00:11:58,566 Replace Estimated Salary by PC2. 235 00:11:59,400 --> 00:12:01,000 And here we go. We are almost ready. 236 00:12:01,000 --> 00:12:04,500 We need to make this big change here to add the third condition, 237 00:12:04,500 --> 00:12:09,833 to add the third color, and we can actually take these two lines here. 238 00:12:10,700 --> 00:12:12,333 Copy them and 239 00:12:14,033 --> 00:12:16,633 select this and paste. 240 00:12:16,633 --> 00:12:17,066 All right. 241 00:12:17,066 --> 00:12:21,833 We can do this because these are the same variable names as for the training set.
242 00:12:22,200 --> 00:12:25,200 Because we use this set variable name here 243 00:12:25,233 --> 00:12:28,733 for both the training set and the test set. 244 00:12:29,666 --> 00:12:31,200 And so basically that's ready. 245 00:12:31,200 --> 00:12:34,200 We can now select this whole section here 246 00:12:34,566 --> 00:12:37,966 and execute to visualize the test set results. 247 00:12:38,366 --> 00:12:39,833 So let's do it. 248 00:12:39,833 --> 00:12:42,000 Here we go, it's processing. 249 00:12:42,000 --> 00:12:43,633 The test set results are coming. 250 00:12:43,633 --> 00:12:48,133 And we should get a perfect plot with no incorrect predictions. 251 00:12:48,133 --> 00:12:51,633 That means that we should get all the green points in the green region, 252 00:12:52,000 --> 00:12:53,666 all the red points... well, here we go. 253 00:12:53,666 --> 00:12:56,533 All the red points, as you can see, are in the red region 254 00:12:56,533 --> 00:12:59,533 and all the blue points in the blue region. 255 00:12:59,666 --> 00:13:00,500 So that's perfect. 256 00:13:00,500 --> 00:13:04,066 That's a perfect representation of 100% accuracy. 257 00:13:04,400 --> 00:13:08,166 And so in conclusion, we were able to transform a data set 258 00:13:08,166 --> 00:13:13,866 composed of 13 independent variables into this new data set of reduced dimension. 259 00:13:14,166 --> 00:13:16,933 We were able to reduce the dimension down to two, 260 00:13:16,933 --> 00:13:20,466 thanks to which we could visualize the results in two dimensions. 261 00:13:21,066 --> 00:13:21,966 Okay, perfect. 262 00:13:21,966 --> 00:13:24,566 We are done with this first section about PCA. 263 00:13:24,566 --> 00:13:26,700 And now the interesting thing that we want to see 264 00:13:26,700 --> 00:13:30,033 is how our next dimensionality reduction technique 265 00:13:30,266 --> 00:13:33,333 that we are going to implement is going to do on this data set.
266 00:13:33,500 --> 00:13:37,566 This next dimensionality reduction technique is LDA, Linear Discriminant 267 00:13:37,566 --> 00:13:38,466 Analysis. 268 00:13:38,466 --> 00:13:40,666 So we'll find out about that in the next section. 269 00:13:40,666 --> 00:13:42,466 And until then, enjoy machine learning.