1 00:00:00,200 --> 00:00:02,166 Hello and welcome to this R tutorial. 2 00:00:02,166 --> 00:00:05,700 So in this tutorial we are going to apply PCA, and actually I have prepared 3 00:00:05,700 --> 00:00:08,966 for you the required packages to apply this first 4 00:00:08,966 --> 00:00:12,400 dimensionality reduction technique, Principal Component Analysis. 5 00:00:12,700 --> 00:00:16,900 So these packages are caret, which I think we already installed. 6 00:00:16,900 --> 00:00:20,400 But if that's not the case then you can check it here in Packages. 7 00:00:20,566 --> 00:00:24,000 You can see if you have caret available in this list of packages. 8 00:00:24,266 --> 00:00:27,700 If you don't see it here, you can execute this line without the comment 9 00:00:28,000 --> 00:00:30,400 and this will install caret. 10 00:00:30,400 --> 00:00:31,933 So that's the first package. 11 00:00:31,933 --> 00:00:35,900 Then this is to actually import this caret package. 12 00:00:35,900 --> 00:00:37,733 So we'll execute that as well. 13 00:00:37,733 --> 00:00:40,833 And we also need this other package that we installed 14 00:00:40,833 --> 00:00:44,533 in Part 3, Classification: the e1071 package. 15 00:00:44,700 --> 00:00:46,666 So normally you should have it installed. 16 00:00:46,666 --> 00:00:47,700 But if that's not the case, 17 00:00:47,700 --> 00:00:50,500 you can select this line and install the package. 18 00:00:50,500 --> 00:00:53,933 And don't forget to execute this line as well to select it. 19 00:00:54,266 --> 00:00:57,266 And now we are ready to start applying PCA. 20 00:00:57,366 --> 00:01:00,300 So the first thing that we're going to do is create a new variable 21 00:01:00,300 --> 00:01:03,866 that we're going to call pca, that we will use afterwards 22 00:01:03,866 --> 00:01:08,466 to transform our original data set, composed of our 13 features, 23 00:01:08,733 --> 00:01:12,300 into this new data set with the new extracted features. 
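The install-and-load step described above can be sketched as follows. This is only a minimal sketch, not the tutorial's exact script: the loop and the names `pkgs`, `installed` and `missing_pkgs` are assumptions, added so that the snippet still runs when a package is absent.

```r
# Packages named in the walkthrough: caret (provides preProcess) and e1071.
pkgs <- c("caret", "e1071")

# Check which ones are installed (missing_pkgs is a hypothetical name).
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Uncomment to install anything that is missing, like the commented-out
# install line in the tutorial script:
# install.packages(missing_pkgs)

# Load the packages that are available.
for (pkg in setdiff(pkgs, missing_pkgs)) {
  library(pkg, character.only = TRUE)
}
```

If `missing_pkgs` is empty, both packages are already installed and loaded.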
24 00:01:12,666 --> 00:01:15,933 So now to create this object we are going to use a function. 25 00:01:16,066 --> 00:01:18,933 This is the preProcess function. 26 00:01:18,933 --> 00:01:21,066 Here it is, from the caret package. 27 00:01:21,066 --> 00:01:23,866 And let's now press F1 here 28 00:01:23,866 --> 00:01:26,833 to see all the info of this preProcess function. 29 00:01:26,833 --> 00:01:29,633 Because you're going to see that you have some very useful parameters 30 00:01:29,633 --> 00:01:32,866 that allow you to apply PCA according to your goals. 31 00:01:32,866 --> 00:01:37,366 For example, you can specify the minimum ratio of explained variance 32 00:01:37,366 --> 00:01:38,266 you want to get. 33 00:01:38,266 --> 00:01:39,266 That means, for example, 34 00:01:39,266 --> 00:01:42,366 if you want to reduce the dimensionality of your data set down 35 00:01:42,366 --> 00:01:46,533 to a number of features that will explain at least 60% of the variance, 36 00:01:46,700 --> 00:01:50,766 well, you can specify this with one of the parameters of this preProcess function. 37 00:01:51,000 --> 00:01:52,666 So let's have a look at the info. 38 00:01:52,666 --> 00:01:53,566 That's the info. 39 00:01:53,566 --> 00:01:56,033 And let's jump to the arguments. 40 00:01:56,033 --> 00:01:59,500 So, right, the first argument is x, a matrix or a data frame. 41 00:01:59,700 --> 00:02:02,966 This is actually the data of which we want to reduce the dimensionality. 42 00:02:03,166 --> 00:02:05,633 So this is going to be our training set. 43 00:02:05,633 --> 00:02:07,500 So x will be the training set. 44 00:02:07,500 --> 00:02:09,533 Then the next argument is method. 45 00:02:09,533 --> 00:02:12,600 So method is your dimensionality reduction technique. 46 00:02:12,733 --> 00:02:16,200 So as you can see you have several techniques of dimensionality 47 00:02:16,200 --> 00:02:18,433 reduction: PCA, ICA. 48 00:02:18,433 --> 00:02:20,000 So these are all the methods. 
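To make the "at least 60% of the variance" idea concrete, here is a minimal sketch that uses base R's `prcomp` rather than caret's `preProcess`, on synthetic data. The data and all variable names are assumptions, not the tutorial's data set. It counts how many principal components are needed before the cumulative share of explained variance reaches a chosen cutoff.

```r
set.seed(42)

# Synthetic stand-in: 100 rows, 13 numeric features (not the tutorial's data).
x <- as.data.frame(matrix(rnorm(100 * 13), ncol = 13))

# PCA on centered and scaled features.
fit <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of variance explained by each component, then the running total.
var_ratio <- fit$sdev^2 / sum(fit$sdev^2)
cum_ratio <- cumsum(var_ratio)

# Smallest number of components whose cumulative share reaches 60%.
cutoff <- 0.6
n_comp <- which(cum_ratio >= cutoff)[1]
print(n_comp)
```

A variance cutoff lets the data decide how many components to keep, whereas asking for a fixed number of components fixes the output dimensionality regardless of how much variance is retained.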
49 00:02:20,000 --> 00:02:24,266 But of course the one that we want to use is PCA, principal component analysis. 50 00:02:24,600 --> 00:02:27,000 So we will use method equals pca here. 51 00:02:27,000 --> 00:02:27,966 Then thresh. 52 00:02:27,966 --> 00:02:30,000 thresh is a very important parameter. 53 00:02:30,000 --> 00:02:31,466 That's what I've just told you. 54 00:02:31,466 --> 00:02:35,666 If you want to reduce the dimensionality of your data set with at least 55 00:02:35,666 --> 00:02:36,566 a minimum amount 56 00:02:36,566 --> 00:02:40,333 of explained variance, well, you can do it by using the thresh parameter. 57 00:02:40,633 --> 00:02:44,733 And as you can see, it's a cutoff for the cumulative percent of variance 58 00:02:44,966 --> 00:02:46,666 to be retained by PCA. 59 00:02:46,666 --> 00:02:47,466 So for example, 60 00:02:47,466 --> 00:02:51,600 if you want your new extracted features to explain at least 60% of the variance, 61 00:02:51,766 --> 00:02:56,400 well, you need to specify here thresh equals 0.6, that is, 60%. 62 00:02:57,200 --> 00:02:59,500 But we're not going to use this thresh 63 00:02:59,500 --> 00:03:02,500 parameter here because we already know what we want. 64 00:03:02,500 --> 00:03:05,866 What we want is two independent variables, because we want to be able 65 00:03:05,866 --> 00:03:08,866 to visualize the training set results and the test set results, 66 00:03:09,000 --> 00:03:13,300 and that we will be able to get with the next parameter, pcaComp. 67 00:03:13,633 --> 00:03:17,000 That is the specific number of PCA components to keep. 68 00:03:17,000 --> 00:03:21,433 So that's exactly the number of extracted features you want to obtain in the end. 69 00:03:21,900 --> 00:03:27,233 So here we will input pcaComp equals 2, so that our training set, 70 00:03:27,233 --> 00:03:31,766 our original training set will go from having 13 independent variables. 
71 00:03:31,800 --> 00:03:36,833 The 13 original independent variables that we had in our data set, to having 72 00:03:36,866 --> 00:03:41,066 two new extracted features that will explain the most variance. 73 00:03:41,533 --> 00:03:47,033 And as you can see, if we specify this parameter, it overrides thresh. 74 00:03:47,233 --> 00:03:50,800 So that's why we don't need to specify the thresh parameter to specify 75 00:03:50,800 --> 00:03:53,800 a minimum cumulative percent of explained variance. 76 00:03:54,333 --> 00:03:54,733 All right. 77 00:03:54,733 --> 00:03:57,700 And then you have other parameters but we won't use them. 78 00:03:57,700 --> 00:04:02,100 We actually only need our x, to specify the data we want to transform 79 00:04:02,100 --> 00:04:05,100 to extract the new features, the method, pca, 80 00:04:05,200 --> 00:04:07,800 and the number of extracted features we want to get 81 00:04:07,800 --> 00:04:10,800 eventually, that is, two new extracted features. 82 00:04:11,200 --> 00:04:12,666 So let's input the arguments. 83 00:04:12,666 --> 00:04:15,666 Let's start with the first one, x equals. 84 00:04:16,100 --> 00:04:18,166 So that's training_set. 85 00:04:18,166 --> 00:04:19,033 Here we go. 86 00:04:19,033 --> 00:04:22,500 And actually we need to specify the features. 87 00:04:22,800 --> 00:04:24,700 And actually that's not the whole training set. 88 00:04:24,700 --> 00:04:29,333 Because remember, PCA is an unsupervised dimensionality reduction technique. 89 00:04:29,333 --> 00:04:32,366 That means that we don't consider the dependent variable 90 00:04:32,500 --> 00:04:34,300 to extract the new features. 91 00:04:34,300 --> 00:04:37,300 So we actually need to remove here the dependent variable. 92 00:04:37,466 --> 00:04:39,633 And remember this has index 14. 93 00:04:39,633 --> 00:04:42,800 So the way we can do that is the same as we did for feature scaling. 94 00:04:43,033 --> 00:04:46,033 That means we just add here -14. 
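The three arguments discussed above (`x`, `method`, `pcaComp`) might be put together as in the sketch below. It assumes the caret package is installed and uses a synthetic stand-in for the training set: the random data and the column name `Customer_Segment` in position 14 are assumptions made only so the snippet is self-contained.

```r
library(caret)

set.seed(1)
# Synthetic stand-in for the training set: 13 numeric features in columns
# 1 to 13 and the dependent variable in column 14 (assumed layout).
training_set <- as.data.frame(matrix(rnorm(60 * 13), ncol = 13))
training_set$Customer_Segment <- sample(1:3, 60, replace = TRUE)

# Fit the PCA transformation on the features only: [-14] drops the
# dependent variable, since PCA is unsupervised.
pca <- preProcess(x = training_set[-14], method = "pca", pcaComp = 2)

# Number of principal components the fitted object will produce.
print(pca$numComp)
```

Because `pcaComp` is given, no `thresh` value is needed here; leaving `pcaComp` out and passing, say, `thresh = 0.6` would instead let the variance cutoff determine the number of components.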
95 00:04:46,366 --> 00:04:48,666 All right. So now PCA will be applied 96 00:04:48,666 --> 00:04:52,266 on all the features, the 13 features of our training set. 97 00:04:53,233 --> 00:04:53,833 Perfect. 98 00:04:53,833 --> 00:04:56,866 Now, next argument. The next argument is method. 99 00:04:56,866 --> 00:05:00,966 And as we said, method equals, in quotes, pca. 100 00:05:01,600 --> 00:05:05,466 All right, then comma, next argument and last argument. 101 00:05:05,700 --> 00:05:07,966 And as we said, that's pcaComp. 102 00:05:07,966 --> 00:05:10,533 So pcaComp, 103 00:05:10,533 --> 00:05:13,966 and what we want is two new extracted features. 104 00:05:14,666 --> 00:05:15,000 All right. 105 00:05:15,000 --> 00:05:18,000 So that creates the pca object 106 00:05:18,033 --> 00:05:21,033 that we will then use on our training set 107 00:05:21,066 --> 00:05:25,533 to transform our original training set, composed of our 13 independent variables, 108 00:05:25,833 --> 00:05:29,200 into this new training set of reduced dimensionality. 109 00:05:29,433 --> 00:05:30,433 And that will contain 110 00:05:30,433 --> 00:05:33,666 the two new extracted features that will explain the most variance. 111 00:05:34,033 --> 00:05:35,133 So let's do it. 112 00:05:35,133 --> 00:05:38,833 Let's take our training set, because we are going to call this 113 00:05:38,833 --> 00:05:40,400 new training set training_set as well. 114 00:05:40,400 --> 00:05:42,233 Because, you know, then we have 115 00:05:42,233 --> 00:05:45,833 all our templates and we use this training_set variable name. 116 00:05:46,033 --> 00:05:48,533 So we want to keep this training_set name. 117 00:05:48,533 --> 00:05:52,100 But of course if you want to keep your original training set and test set, 118 00:05:52,266 --> 00:05:56,500 you can use other names like training_set_pca. 
119 00:05:56,766 --> 00:05:59,600 But then if you do that, don't forget to replace training_set 120 00:05:59,600 --> 00:06:03,866 here by training_set_pca, and here by test_set_pca as well. 121 00:06:04,066 --> 00:06:06,266 And the same for the confusion matrix section. 122 00:06:06,266 --> 00:06:09,233 And especially here, visualizing the training set results. 123 00:06:09,233 --> 00:06:12,500 You will need to replace training_set here by training_set_pca. 124 00:06:12,900 --> 00:06:14,966 All right. So that's why we are keeping the name. 125 00:06:14,966 --> 00:06:17,733 It's in order not to have to change everything. 126 00:06:17,733 --> 00:06:20,733 So let's go back to training_set equals. 127 00:06:21,100 --> 00:06:25,000 And now let's transform this original training set 128 00:06:25,000 --> 00:06:28,700 into our new training set composed of our new extracted features. 129 00:06:29,033 --> 00:06:30,800 And to do this it's very simple. 130 00:06:30,800 --> 00:06:33,066 We use the predict function. 131 00:06:33,066 --> 00:06:36,633 And inside we take our pca object, comma, 132 00:06:37,033 --> 00:06:40,533 and we apply this PCA transformation object 133 00:06:40,866 --> 00:06:45,800 on the original training set, that is named training_set as well. 134 00:06:47,033 --> 00:06:47,900 And so by doing 135 00:06:47,900 --> 00:06:51,133 this, this original training set will become this 136 00:06:51,133 --> 00:06:54,800 new training set composed of the two new extracted features. 137 00:06:55,000 --> 00:06:56,100 So let's do it. 138 00:06:56,100 --> 00:06:59,033 Let's start by creating this object. 139 00:06:59,033 --> 00:07:01,533 And then we will transform our training set. 140 00:07:01,533 --> 00:07:05,766 So I'm going to select this line and execute. Perfect. 
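The transformation step described above, `predict` applied to the fitted object and the training set, could look like the following sketch. It assumes caret is installed; the synthetic data and the `Customer_Segment` column name are assumptions standing in for the tutorial's data set.

```r
library(caret)

set.seed(1)
# Synthetic stand-in: 13 numeric features plus the dependent variable
# in column 14 (assumed layout, not the tutorial's data).
training_set <- as.data.frame(matrix(rnorm(60 * 13), ncol = 13))
training_set$Customer_Segment <- sample(1:3, 60, replace = TRUE)

# Fit PCA on the 13 features only, keeping two components.
pca <- preProcess(x = training_set[-14], method = "pca", pcaComp = 2)

# Apply the fitted transformation. The 13 features are replaced by PC1 and
# PC2; the dependent variable was not part of the fit and is carried through.
training_set <- predict(pca, training_set)
print(names(training_set))
```

Reusing the name `training_set` overwrites the original data, which is the trade-off mentioned in the walkthrough; assigning to a new name such as `training_set_pca` keeps both versions at the cost of updating later code.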
141 00:07:06,300 --> 00:07:09,066 The pca object is ready to be used 142 00:07:09,066 --> 00:07:12,600 on the original training set to transform it 143 00:07:12,600 --> 00:07:16,666 into our new training set, composed of the two new extracted features. 144 00:07:17,033 --> 00:07:19,966 So let's execute this as well. Here we go. 145 00:07:19,966 --> 00:07:22,000 Our new training set is now created. 146 00:07:22,000 --> 00:07:23,066 We can have a look. 147 00:07:23,066 --> 00:07:27,000 As you can see when I'm clicking on this, well, I have a new training 148 00:07:27,000 --> 00:07:30,300 set composed of two new extracted features. 149 00:07:30,300 --> 00:07:33,000 And remember, these two new extracted features are called 150 00:07:33,000 --> 00:07:34,333 the principal components. 151 00:07:34,333 --> 00:07:37,300 So that's why you have PC1 and PC2. 152 00:07:37,300 --> 00:07:40,300 And of course we still have our dependent variable vector, 153 00:07:40,300 --> 00:07:43,033 the customer segment dependent variable with its 154 00:07:43,033 --> 00:07:46,033 three labels: one, two and three. 155 00:07:46,933 --> 00:07:47,666 All right, perfect. 156 00:07:47,666 --> 00:07:51,300 But now, as you can clearly notice, the dependent variable vector 157 00:07:51,466 --> 00:07:53,533 just went into the first position. 158 00:07:53,533 --> 00:07:56,466 And since we're then going to use a template on data sets, 159 00:07:56,466 --> 00:07:57,433 I mean the training set 160 00:07:57,433 --> 00:08:00,733 and the test set, where the dependent variable is in last position, 161 00:08:01,033 --> 00:08:04,066 we will need to put this dependent variable in last position 162 00:08:04,066 --> 00:08:06,300 here. And that's actually very easy. 163 00:08:06,300 --> 00:08:09,400 All we need to do is play with the indexes 164 00:08:09,733 --> 00:08:13,533 to put this customer segment dependent variable in last position. 165 00:08:14,033 --> 00:08:15,333 So the method is really easy. 
166 00:08:15,333 --> 00:08:17,766 We're going to take our training set again. 167 00:08:17,766 --> 00:08:18,400 Here we go. 168 00:08:19,400 --> 00:08:20,700 And then equals. 169 00:08:20,700 --> 00:08:24,400 And then we take again our training set, then brackets. 170 00:08:24,633 --> 00:08:28,166 And then inside these brackets we're going to take the indexes 171 00:08:28,166 --> 00:08:31,900 of the columns of our training set in the correct order we want to get. 172 00:08:32,166 --> 00:08:35,200 So you're going to understand that now. We're going to take a vector. 173 00:08:35,433 --> 00:08:39,366 So remember, in R a vector is created with c and then parentheses. 174 00:08:39,866 --> 00:08:40,300 All right. 175 00:08:40,300 --> 00:08:44,100 And inside these parentheses we put the correct order of the indexes 176 00:08:44,100 --> 00:08:45,266 we want to get. 177 00:08:45,266 --> 00:08:47,766 So let's go back to our training set. 178 00:08:47,766 --> 00:08:50,766 The first column we want to get is PC1. 179 00:08:50,833 --> 00:08:53,500 That should be the first column of our new training set. 180 00:08:53,500 --> 00:08:55,133 And this has index two. 181 00:08:55,133 --> 00:08:58,700 So here we input the first index, which is two. 182 00:08:59,333 --> 00:08:59,966 Then comma. 183 00:08:59,966 --> 00:09:03,500 And then we input the second index we want to get. 184 00:09:03,500 --> 00:09:05,033 That is the second column. 185 00:09:05,033 --> 00:09:07,133 And the second column is PC2. 186 00:09:07,133 --> 00:09:08,466 And this has index three. 187 00:09:08,466 --> 00:09:14,400 So here we input three, and then here you input the index of the last column 188 00:09:14,400 --> 00:09:16,500 you want to have in your training set. 189 00:09:16,500 --> 00:09:17,400 And the last column 190 00:09:17,400 --> 00:09:20,833 you want to have in your training set is this customer segment column. 191 00:09:20,833 --> 00:09:22,700 Because that's the dependent variable. 
192 00:09:22,700 --> 00:09:26,100 And so far this customer segment has index one. 193 00:09:26,500 --> 00:09:29,700 So you need to specify here this index, that is one. 194 00:09:30,266 --> 00:09:34,200 And by doing this, our new training set here will be the same training set 195 00:09:34,200 --> 00:09:37,200 that we have here but with a new order of the columns. 196 00:09:37,200 --> 00:09:38,866 And that is given by this order here. 197 00:09:38,866 --> 00:09:41,866 First, the first independent variable, that has index two, 198 00:09:42,000 --> 00:09:44,566 then the second independent variable, that has index three, 199 00:09:44,566 --> 00:09:47,566 and eventually the dependent variable column, that has index one. 200 00:09:47,933 --> 00:09:51,300 You're going to see: if I select this line now and execute, 201 00:09:51,533 --> 00:09:53,266 and if I go back to the training set, 202 00:09:53,266 --> 00:09:57,600 now I have my first two columns as the new extracted features 203 00:09:57,900 --> 00:10:02,666 PC1 and PC2, and the last column, customer segment, in last position, 204 00:10:02,866 --> 00:10:06,100 as the code template that we're going to use expects it. 205 00:10:06,533 --> 00:10:07,766 So that's perfect. 206 00:10:07,766 --> 00:10:12,266 We can go back to PCA, and now we need to do the same for the test set. 207 00:10:12,666 --> 00:10:16,233 So what we're going to do is select these two lines, copy them 208 00:10:16,233 --> 00:10:19,500 and just replace training_set here by test_set. 209 00:10:20,033 --> 00:10:23,433 Same here, test_set, and same here as well. 210 00:10:23,433 --> 00:10:26,733 test_set, and eventually test_set. 211 00:10:26,866 --> 00:10:28,033 All right. 212 00:10:28,033 --> 00:10:31,566 And that's of course the same indexes for the order you want to have. 213 00:10:31,666 --> 00:10:34,033 We can check it out. I'm going to select this line. 214 00:10:34,033 --> 00:10:39,066 As you can see, so far the test set has its 13 original features. 
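The index-based reordering walked through above can be tried on a toy data frame in base R. The values below are made up purely for illustration; only the column order (dependent variable first, then the two principal components) follows the walkthrough.

```r
# Toy data frame in the column order described above: dependent variable
# first, then the two principal components (values are made up).
training_set <- data.frame(
  Customer_Segment = c(1, 2, 3),
  PC1 = c(-2.1, 0.3, 1.8),
  PC2 = c(0.5, -1.2, 0.7)
)

# Select columns by index in the desired order:
# PC1 (index 2), PC2 (index 3), then the dependent variable (index 1).
training_set <- training_set[c(2, 3, 1)]
print(names(training_set))
```

Indexing a data frame with a single vector of positions selects and reorders whole columns, so the rows and their values are untouched.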
215 00:10:39,500 --> 00:10:43,700 Then if I execute this line, it now has two new 216 00:10:43,700 --> 00:10:46,700 extracted features, the principal components one and two. 217 00:10:46,833 --> 00:10:49,333 But the customer segment is in first position. 218 00:10:49,333 --> 00:10:51,800 We want to put it in the last position. 219 00:10:51,800 --> 00:10:54,766 And so to do this we execute this line. 220 00:10:54,766 --> 00:10:56,233 And that will do it. 221 00:10:56,233 --> 00:11:00,066 If I go back to the test set, now the customer segment is in last position 222 00:11:00,466 --> 00:11:03,766 and we are ready to use the following parts of the template: 223 00:11:04,000 --> 00:11:07,000 predict the test set results, make the confusion matrix. 224 00:11:07,200 --> 00:11:09,800 And eventually, that's the most exciting part, 225 00:11:09,800 --> 00:11:12,633 we will now be able to visualize the training set results, 226 00:11:12,633 --> 00:11:16,800 because we now have two dimensions in our training set and test set. 227 00:11:17,266 --> 00:11:20,266 So I look forward to visualizing these results in the next tutorial. 228 00:11:20,266 --> 00:11:22,066 And until then, enjoy machine learning.
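Putting the steps of this tutorial together, a minimal end-to-end sketch might look like this. It assumes caret is installed and uses synthetic data; `make_set` is a hypothetical helper, and `Customer_Segment` in column 14 is an assumed layout. It also reorders columns by name rather than with the `c(2, 3, 1)` indexes used in the walkthrough, so that the result does not depend on where the dependent variable lands after `predict`.

```r
library(caret)

set.seed(1)
# Hypothetical helper: n rows of 13 numeric features plus the dependent
# variable in column 14 (synthetic stand-in, not the tutorial's data).
make_set <- function(n) {
  d <- as.data.frame(matrix(rnorm(n * 13), ncol = 13))
  d$Customer_Segment <- sample(1:3, n, replace = TRUE)
  d
}
training_set <- make_set(60)
test_set <- make_set(20)

# Fit PCA once, on the training features only.
pca <- preProcess(x = training_set[-14], method = "pca", pcaComp = 2)

# Apply the same fitted object to both sets.
training_set <- predict(pca, training_set)
test_set <- predict(pca, test_set)

# Put the dependent variable in last position (by name here).
col_order <- c("PC1", "PC2", "Customer_Segment")
training_set <- training_set[col_order]
test_set <- test_set[col_order]
print(names(test_set))
```

The test set is transformed with the object fitted on the training set, not refitted, so both sets are projected onto the same two components.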