1 00:00:00,266 --> 00:00:01,200 Hello and welcome to 2 00:00:01,200 --> 00:00:04,666 this R tutorial and welcome to part nine, Dimensionality Reduction. 3 00:00:05,000 --> 00:00:08,366 So we are starting this part with our first technique 4 00:00:08,366 --> 00:00:12,900 of dimensionality reduction, which is PCA, principal component analysis. 5 00:00:13,233 --> 00:00:15,166 And as you know, in dimensionality reduction 6 00:00:15,166 --> 00:00:18,500 there are two techniques: feature selection and feature extraction. 7 00:00:18,766 --> 00:00:22,266 We did feature selection in part two when we implemented the backward 8 00:00:22,266 --> 00:00:25,866 elimination model to select the most relevant features 9 00:00:26,066 --> 00:00:27,233 of our matrix of features. 10 00:00:27,233 --> 00:00:30,700 That is, the features that best explained the dependent variable. 11 00:00:30,933 --> 00:00:33,766 And now we are starting this new technique of dimensionality 12 00:00:33,766 --> 00:00:36,900 reduction, which is feature extraction, and PCA, 13 00:00:36,900 --> 00:00:40,766 principal component analysis, is one feature extraction technique. 14 00:00:41,100 --> 00:00:45,566 So as a reminder, let's say your matrix of features has m independent variables. 15 00:00:45,833 --> 00:00:47,800 Well, what PCA will do is that it 16 00:00:47,800 --> 00:00:51,600 will extract a smaller number of independent variables. 17 00:00:51,800 --> 00:00:55,433 But these are going to be new independent variables, like new dimensions. 18 00:00:55,800 --> 00:00:58,766 And these new independent variables extracted are going to be 19 00:00:58,766 --> 00:01:01,466 the ones that explain most of 20 00:01:01,466 --> 00:01:03,466 the variance of your data set. 21 00:01:03,466 --> 00:01:06,433 And that is regardless of your dependent variable.
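To make the idea of "new variables that explain most of the variance" concrete, here is a minimal sketch in base R. It is not the tutorial's code: the data are synthetic, and `prcomp()` is used only as one standard way to run PCA in R (the function the tutorial itself uses is introduced later and may differ).

```r
# Synthetic stand-in for a matrix of features (not the tutorial's data set).
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # almost a copy of x1, so largely redundant
x3 <- rnorm(n)                  # unrelated noise
features <- data.frame(x1, x2, x3)

# prcomp() only receives the features: no dependent variable is involved,
# which is why PCA is called unsupervised.
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained, 3))

# Keep the first two components as the new independent variables.
reduced <- pca$x[, 1:2]
print(dim(reduced))
```

Because `x2` is nearly redundant with `x1`, the first component should carry well over half of the variance, which is the situation where dropping the last components loses little information.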
22 00:01:06,433 --> 00:01:10,500 And that makes PCA an unsupervised model, in the sense 23 00:01:10,500 --> 00:01:13,700 that we don't consider the dependent variable in the model. 24 00:01:14,200 --> 00:01:15,300 So that's PCA. 25 00:01:15,300 --> 00:01:16,866 And remember, in part 26 00:01:16,866 --> 00:01:19,866 two and part three we worked with one or two independent variables. 27 00:01:19,966 --> 00:01:22,266 Well, that was for two specific reasons. 28 00:01:22,266 --> 00:01:26,433 The first reason is that we needed a graphic visualization of our results. 29 00:01:26,733 --> 00:01:30,966 And since each independent variable corresponded to one dimension in the plot, 30 00:01:31,133 --> 00:01:35,233 well, we could visualize our results with at most two independent variables. 31 00:01:35,566 --> 00:01:39,233 And the second reason is that, thanks to this PCA dimensionality 32 00:01:39,233 --> 00:01:40,400 reduction technique, 33 00:01:40,400 --> 00:01:43,733 even if we have a lot of independent variables at the beginning, 34 00:01:44,033 --> 00:01:47,300 well, we can end up with far fewer independent variables, 35 00:01:47,633 --> 00:01:51,266 but ones that are going to be relevant independent variables, because 36 00:01:51,466 --> 00:01:55,800 these independent variables will explain most of the variance of your data set. 37 00:01:56,200 --> 00:01:59,633 And therefore, since we can reduce this number of independent variables, 38 00:01:59,933 --> 00:02:03,000 well, we can end up with two or three independent variables 39 00:02:03,000 --> 00:02:06,300 and therefore visualize the results as we did in part three. 40 00:02:06,600 --> 00:02:09,866 And this is exactly what we're going to do in this tutorial, in the following 41 00:02:09,866 --> 00:02:12,500 tutorials of this section and in the following sections, 42 00:02:12,500 --> 00:02:14,033 when we cover other dimensionality 43 00:02:14,033 --> 00:02:17,666 reduction techniques like LDA and also kernel PCA.
44 00:02:18,066 --> 00:02:20,533 Well, we will have many features at the beginning 45 00:02:20,533 --> 00:02:23,400 and therefore it will be impossible to visualize the results. 46 00:02:23,400 --> 00:02:28,033 But then when we apply PCA or LDA, we will reduce the number of features 47 00:02:28,033 --> 00:02:31,400 down to two and therefore we will be able to visualize the results. 48 00:02:31,800 --> 00:02:35,200 So let's start right now, and let's start by setting the right folder 49 00:02:35,200 --> 00:02:38,133 as working directory. So as usual we go to our Machine Learning 50 00:02:38,133 --> 00:02:39,266 A-Z folder. 51 00:02:39,266 --> 00:02:41,900 Then part nine, Dimensionality Reduction. 52 00:02:41,900 --> 00:02:44,600 And here we are in the first section of this part nine, 53 00:02:44,600 --> 00:02:46,433 Principal Component Analysis. 54 00:02:46,433 --> 00:02:48,800 That's our first technique. Let's click on it. 55 00:02:48,800 --> 00:02:51,633 And that's the folder we want to set as working directory. 56 00:02:51,633 --> 00:02:53,800 Make sure that you have the Wine.csv file. 57 00:02:53,800 --> 00:02:56,600 And if that's the case you're ready to click on this More button here 58 00:02:56,600 --> 00:02:59,400 to set this folder as working directory. 59 00:02:59,400 --> 00:03:00,300 Perfect. 60 00:03:00,300 --> 00:03:03,933 And now we're going to open another file that we made in part 61 00:03:03,933 --> 00:03:08,133 three, Classification, which is our logistic regression file. 62 00:03:08,400 --> 00:03:12,566 Because what we're going to do is take this logistic regression code. 63 00:03:12,866 --> 00:03:14,866 Then we are going to change the name of the data set, 64 00:03:14,866 --> 00:03:19,766 because we will be working on a new data set, which will be the Wine.csv file. 65 00:03:20,100 --> 00:03:23,100 And we will apply PCA on this data set. 66 00:03:23,333 --> 00:03:26,366 And of course I will quickly explain the business problem behind it.
67 00:03:26,700 --> 00:03:30,966 So I'm going to take everything from here down to the bottom. 68 00:03:31,300 --> 00:03:33,166 Here we go. Copy. 69 00:03:33,166 --> 00:03:37,466 And I'm going to paste that in my PCA file this way. 70 00:03:38,100 --> 00:03:41,566 So let's go up and let's change the name of the data set. 71 00:03:41,566 --> 00:03:44,400 This is not Social_Network_Ads.csv. 72 00:03:44,400 --> 00:03:47,433 This is now Wine.csv. 73 00:03:47,866 --> 00:03:48,900 That's perfect. 74 00:03:48,900 --> 00:03:52,200 So now what we're going to do is first import this data set 75 00:03:52,466 --> 00:03:54,233 and then apply data preprocessing. 76 00:03:54,233 --> 00:03:55,500 Maybe we'll need to change 77 00:03:55,500 --> 00:03:59,433 some things like the indexes here, but this will be very quick. 78 00:03:59,600 --> 00:04:02,033 So first let's import the data set. 79 00:04:02,033 --> 00:04:05,033 So I'm going to select this line and execute. 80 00:04:05,066 --> 00:04:07,366 Here we go. The data set is well imported. 81 00:04:07,366 --> 00:04:08,233 Here it is. 82 00:04:08,233 --> 00:04:11,300 And now let's explain the business problem behind it. Okay. 83 00:04:11,300 --> 00:04:13,566 So first of all, this is a very famous data set, 84 00:04:13,566 --> 00:04:15,766 well known in the machine learning literature, 85 00:04:15,766 --> 00:04:20,333 and that you can find on the UCI Machine Learning Repository, as you can see here. 86 00:04:20,333 --> 00:04:22,366 And you can find this page at this link. 87 00:04:23,366 --> 00:04:24,100 So basically, 88 00:04:24,100 --> 00:04:27,633 first, what are the independent variables and what is the dependent variable? 89 00:04:28,066 --> 00:04:31,800 Well, the independent variables are all the variables from this one, 90 00:04:31,800 --> 00:04:34,800 Alcohol, up to this one, Proline. 91 00:04:34,833 --> 00:04:39,066 And this last variable, Customer_Segment, is the dependent variable.
92 00:04:39,366 --> 00:04:43,800 So in the original data set this dependent variable is not called customer segment. 93 00:04:43,800 --> 00:04:46,166 This is actually the origin of the wine. 94 00:04:46,166 --> 00:04:49,800 But let's imagine that we as data scientists are working 95 00:04:49,800 --> 00:04:51,266 for a wine business owner. 96 00:04:51,266 --> 00:04:54,866 And this wine business owner gathered all this information in this data set. 97 00:04:55,200 --> 00:04:59,300 And so first, what this business owner did is that he gathered all the information 98 00:04:59,300 --> 00:05:01,000 in these independent variables here, 99 00:05:01,000 --> 00:05:04,533 which are chemical measurements of several wines. 100 00:05:04,966 --> 00:05:08,200 And this business owner applied some clustering technique 101 00:05:08,400 --> 00:05:10,633 to find some segments of customers 102 00:05:10,633 --> 00:05:14,333 that like a specific wine, depending on the characteristics of the wine. 103 00:05:14,666 --> 00:05:17,500 And by applying these clustering techniques, 104 00:05:17,500 --> 00:05:20,800 this business owner identified three segments of customers. 105 00:05:20,800 --> 00:05:22,033 That's the first one here. 106 00:05:22,033 --> 00:05:25,066 Then we have the second one and finally the third one. 107 00:05:26,133 --> 00:05:27,033 So based on this 108 00:05:27,033 --> 00:05:30,066 information, and thanks to these clustering techniques, 109 00:05:30,233 --> 00:05:34,066 well, this wine business owner managed to find some segments of customers, 110 00:05:34,333 --> 00:05:38,100 each segment having a specific preference for a specific wine. 111 00:05:38,466 --> 00:05:41,500 So basically this business owner found three types of wines, 112 00:05:41,800 --> 00:05:43,166 each type of wine corresponding 113 00:05:43,166 --> 00:05:47,000 to one segment of customers, and therefore three segments of customers.
114 00:05:47,400 --> 00:05:49,966 And why does it create added value for his business? 115 00:05:49,966 --> 00:05:51,033 Well, that's because now 116 00:05:51,033 --> 00:05:54,900 what this business owner can do is take all this information about the wines, 117 00:05:55,100 --> 00:05:58,100 as well as the information about the customer segments, 118 00:05:58,300 --> 00:06:01,400 and make a classification model like logistic regression, 119 00:06:01,533 --> 00:06:04,500 in which the independent variables are all these variables 120 00:06:04,500 --> 00:06:07,500 and the dependent variable is the customer segment. 121 00:06:07,500 --> 00:06:11,866 And therefore, for each new wine, he can predict to which customer segment 122 00:06:12,266 --> 00:06:14,033 he should recommend this wine. 123 00:06:14,033 --> 00:06:16,633 So that adds a lot of value for this business owner. 124 00:06:16,633 --> 00:06:19,500 But then if this business owner wants to have a clear visual 125 00:06:19,500 --> 00:06:22,500 look at the prediction regions and the prediction boundary 126 00:06:22,600 --> 00:06:25,566 of the classification model that we're going to build, to be able to 127 00:06:25,566 --> 00:06:29,300 see if the predictions are in the right spot of the customer segments, 128 00:06:29,300 --> 00:06:33,500 well, it cannot be done with all these independent variables because, of course, 129 00:06:33,500 --> 00:06:37,333 we cannot represent this many independent variables in one plot. 130 00:06:37,333 --> 00:06:38,400 That's impossible. 131 00:06:38,400 --> 00:06:42,066 So what we need to do is apply some dimensionality reduction technique 132 00:06:42,266 --> 00:06:43,233 to extract 133 00:06:43,233 --> 00:06:46,333 two independent variables that explain most of the variance, 134 00:06:46,500 --> 00:06:49,366 and then we'll be able to see the prediction regions 135 00:06:49,366 --> 00:06:50,766 and the prediction boundary.
136 00:06:50,766 --> 00:06:54,600 And therefore we will clearly be able to see where the customer segments are 137 00:06:54,733 --> 00:06:56,000 and where these predictions 138 00:06:56,000 --> 00:06:59,733 of the customer segments are, according to the extracted features 139 00:06:59,733 --> 00:07:02,733 of all the information in our independent variables. 140 00:07:02,800 --> 00:07:06,633 And remember, these extracted features are called the principal components. 141 00:07:07,233 --> 00:07:07,600 All right. 142 00:07:07,600 --> 00:07:11,733 So now that we understand the challenge and the business problem, let's apply PCA 143 00:07:11,900 --> 00:07:15,100 to see how we can reduce the dimensionality of this data set. 144 00:07:15,100 --> 00:07:19,033 Because indeed it contains 13 dimensions, since it contains 13 145 00:07:19,033 --> 00:07:20,400 independent variables. 146 00:07:20,400 --> 00:07:24,066 And we'll see how we can use PCA to reduce this number 147 00:07:24,066 --> 00:07:27,200 of independent variables down to two independent variables. 148 00:07:27,200 --> 00:07:28,266 But be careful. 149 00:07:28,266 --> 00:07:31,800 It's important to understand that the two new independent variables 150 00:07:31,800 --> 00:07:35,166 that we'll have in the end are going to be new ones, as opposed to 151 00:07:35,166 --> 00:07:39,066 feature selection, where, you know, we end up with two independent variables 152 00:07:39,066 --> 00:07:42,866 that are among these original 13 independent variables. 153 00:07:43,100 --> 00:07:45,733 Here with PCA, we'll get new extracted ones. 154 00:07:45,733 --> 00:07:47,466 And that's the important distinction 155 00:07:47,466 --> 00:07:50,466 to make between feature selection and feature extraction. 156 00:07:51,100 --> 00:07:51,400 All right. 157 00:07:51,400 --> 00:07:55,800 So before we apply PCA, as usual, we need to preprocess the data.
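The distinction between selection and extraction can be sketched in a few lines of base R. This is an illustration on a small synthetic matrix with hypothetical column names `a` to `d`, not the wine data, and it again assumes `prcomp()` as the PCA routine.

```r
# Small synthetic feature matrix with four made-up columns.
set.seed(1)
X <- matrix(rnorm(50 * 4), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))
X_scaled <- scale(X)

# The data are already centered and scaled, so prcomp() is told not to redo it.
pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

# Feature selection: two of the original columns, kept unchanged.
selected <- X_scaled[, c("a", "b")]

# Feature extraction: each new column is a weighted mix of all four
# original columns, with the weights given by the loadings.
loadings <- pca$rotation[, 1:2]
extracted <- X_scaled %*% loadings

print(loadings)
# The hand-computed mix matches the component scores returned by prcomp().
print(all.equal(unname(extracted), unname(pca$x[, 1:2])))
```

Printing `loadings` shows a weight for every original column in each component, which is why a principal component cannot be read as "one of the original variables".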
158 00:07:56,133 --> 00:08:00,133 And this is actually going to be very quick because our template is ready. 159 00:08:00,133 --> 00:08:02,533 We will just need to change a few things. 160 00:08:02,533 --> 00:08:05,333 So first, dataset = dataset[3:5]. 161 00:08:05,333 --> 00:08:09,666 That's just to select the independent variables that matter for our problem. 162 00:08:09,900 --> 00:08:11,366 But here everything matters. 163 00:08:11,366 --> 00:08:14,366 We just want to reduce the dimensionality of this data set. 164 00:08:14,500 --> 00:08:17,300 So we will keep all our independent variables here. 165 00:08:17,300 --> 00:08:19,300 And therefore we don't need this line here. 166 00:08:19,300 --> 00:08:22,600 So I will just remove it. Okay. 167 00:08:22,600 --> 00:08:24,766 So the first section, importing the data set, is ready 168 00:08:24,766 --> 00:08:25,666 and well executed. 169 00:08:25,666 --> 00:08:27,666 Now let's move on to the next section. 170 00:08:27,666 --> 00:08:31,000 So the next section is about splitting the data set into the training set 171 00:08:31,000 --> 00:08:32,400 and the test set. 172 00:08:32,400 --> 00:08:33,900 And here be careful. 173 00:08:33,900 --> 00:08:37,600 We just need to change this name of the dependent variable. 174 00:08:37,900 --> 00:08:41,633 Because in logistic regression we were dealing with the Social_Network_Ads.csv 175 00:08:41,700 --> 00:08:42,900 file, 176 00:08:42,900 --> 00:08:44,700 and the dependent variable was Purchased. 177 00:08:44,700 --> 00:08:47,800 But now, for our new business problem, the dependent variable 178 00:08:47,800 --> 00:08:49,033 is not called Purchased. 179 00:08:49,033 --> 00:08:51,200 It is called Customer_Segment. 180 00:08:51,200 --> 00:08:56,000 So we just need to replace Purchased here by Customer_Segment. 181 00:08:56,100 --> 00:08:56,666 Here we go. 182 00:08:57,633 --> 00:08:58,000 All right. 183 00:08:58,000 --> 00:09:00,966 Do we keep a split ratio of 75%?
184 00:09:00,966 --> 00:09:04,333 Well, let's rather take 80%. 185 00:09:04,333 --> 00:09:05,400 But that's up to you. 186 00:09:05,400 --> 00:09:08,933 It's just that 80% is a good split ratio to take. 187 00:09:08,933 --> 00:09:10,266 So we will go with that. 188 00:09:10,266 --> 00:09:13,566 And then here for training set and test set we don't need to change anything. 189 00:09:13,566 --> 00:09:17,900 So we are ready to split our data set into the training set and the test set. 190 00:09:17,900 --> 00:09:21,666 So let's do it. I'm going to select all this section here 191 00:09:21,900 --> 00:09:24,966 and press Command + Enter to execute. 192 00:09:25,366 --> 00:09:26,433 Here we go. 193 00:09:26,433 --> 00:09:30,300 The training set is now created, as well as the test set. 194 00:09:31,233 --> 00:09:31,966 Great. 195 00:09:31,966 --> 00:09:34,300 So we're ready to move on to the next section. 196 00:09:34,300 --> 00:09:36,600 The next section is about feature scaling. 197 00:09:36,600 --> 00:09:40,000 And for PCA it is way better to apply feature scaling. 198 00:09:40,000 --> 00:09:43,466 You can actually apply it by playing with the parameters 199 00:09:43,666 --> 00:09:46,666 of the PCA function that we're going to use afterwards. 200 00:09:46,833 --> 00:09:49,800 But let's take this feature scaling part of our code 201 00:09:49,800 --> 00:09:52,800 template to put our features on the same scale. 202 00:09:52,833 --> 00:09:55,866 So here we just need to change the indexes. 203 00:09:56,000 --> 00:10:00,033 We actually need to specify the indexes of the features we want to scale. 204 00:10:00,266 --> 00:10:04,100 So basically the features we want to scale are all the features 205 00:10:04,100 --> 00:10:06,200 from Alcohol to Proline. 206 00:10:06,200 --> 00:10:09,633 And so what we can do is specify that we want to scale all the variables 207 00:10:09,633 --> 00:10:13,700 except the last one, Customer_Segment, which has index 14.
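The 80% split described above can be sketched as follows. The template's own splitting code is not shown in this transcript, so this sketch uses base R's `sample()` as a stand-in, and the data frame is synthetic: only the column names `Alcohol`, `Proline` and `Customer_Segment` are taken from the narration, and their exact spelling is an assumption.

```r
# Synthetic stand-in for the wine data set (three columns only, made-up values).
set.seed(123)
dataset <- data.frame(
  Alcohol = rnorm(100, mean = 13, sd = 0.8),
  Proline = rnorm(100, mean = 750, sd = 300),
  Customer_Segment = sample(1:3, 100, replace = TRUE)
)

# Draw 80% of the row indexes at random for the training set.
split_ratio <- 0.8
train_idx <- sample(seq_len(nrow(dataset)),
                    size = floor(split_ratio * nrow(dataset)))

training_set <- dataset[train_idx, ]
test_set <- dataset[-train_idx, ]   # every row not drawn for training

print(c(nrow(training_set), nrow(test_set)))
```

Note that a plain random draw like this does not keep the proportions of the three segments equal across the two sets; a splitting helper that stratifies on the dependent variable would.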
208 00:10:13,966 --> 00:10:18,033 So therefore here, instead of putting the indexes of the features, 209 00:10:18,233 --> 00:10:21,233 we can replace them by -14. 210 00:10:21,333 --> 00:10:22,733 We can remove that. 211 00:10:22,733 --> 00:10:26,666 Let's copy this because we will do the same for the others. 212 00:10:26,666 --> 00:10:29,666 So let's replace that here by -14. 213 00:10:30,100 --> 00:10:32,466 And -14 here as well. 214 00:10:32,466 --> 00:10:35,266 And finally -14. 215 00:10:35,266 --> 00:10:35,566 All right. 216 00:10:35,566 --> 00:10:37,866 So now the feature scaling part is ready. 217 00:10:37,866 --> 00:10:40,533 So we are ready to select the section 218 00:10:40,533 --> 00:10:44,000 and press Command or Control + Enter to execute. 219 00:10:44,200 --> 00:10:46,800 And now all our variables are scaled. 220 00:10:46,800 --> 00:10:50,566 As you can see, all our features are on the same scale. 221 00:10:50,933 --> 00:10:55,200 And of course the customer segment kept its labels one, two and three. 222 00:10:55,800 --> 00:10:57,900 And same for the test set. Let's make sure of that. 223 00:10:57,900 --> 00:10:59,366 All right, perfect. 224 00:10:59,366 --> 00:11:01,166 So feature scaling is done. 225 00:11:01,166 --> 00:11:04,633 And actually the preprocessing phase is completed. 226 00:11:04,800 --> 00:11:07,033 So we did that quite efficiently. 227 00:11:07,033 --> 00:11:09,600 But that's good because now we're getting to the exciting part: 228 00:11:09,600 --> 00:11:12,100 applying PCA to our data. 229 00:11:12,100 --> 00:11:14,900 So actually we will do that right here. 230 00:11:14,900 --> 00:11:17,900 You apply PCA right after the data preprocessing phase
231 00:11:17,966 --> 00:11:21,766 and just before you fit your logistic regression model to the training set, 232 00:11:21,766 --> 00:11:22,733 because of course, 233 00:11:22,733 --> 00:11:27,266 you want to train your model on your new data set with the new extracted features, 234 00:11:27,533 --> 00:11:29,533 that is, with the two new extracted features 235 00:11:29,533 --> 00:11:31,300 that will explain the most variance. 236 00:11:31,300 --> 00:11:35,300 And after you have trained your classifier, you're ready to predict the test results 237 00:11:35,300 --> 00:11:36,700 and make the confusion matrix. 238 00:11:36,700 --> 00:11:39,166 Then you can also visualize the training set results. 239 00:11:39,166 --> 00:11:42,833 Remember, this section is applied on a data set that contains two features, 240 00:11:43,133 --> 00:11:46,233 and so we will see what we get by extracting these two new features. 241 00:11:46,933 --> 00:11:50,400 All right, so to finish this tutorial I'm just going to introduce 242 00:11:50,400 --> 00:11:53,800 this new section here that I'm going to call Applying 243 00:11:55,266 --> 00:11:56,866 PCA. All right. 244 00:11:56,866 --> 00:11:59,600 And in the next tutorial we are going to apply PCA. 245 00:11:59,600 --> 00:12:04,100 And then finally we will build our model on our new reduced data set. 246 00:12:04,433 --> 00:12:06,600 So I look forward to doing that in the next tutorial. 247 00:12:06,600 --> 00:12:09,600 And until then, enjoy machine learning.
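As a recap of the order of operations described in this tutorial (scale everything except column 14, then apply PCA before fitting the classifier), here is a minimal sketch. It runs on synthetic data with 13 made-up feature columns plus `Customer_Segment`, built by a hypothetical helper `make_set()`, and it assumes base R's `prcomp()` for the PCA step, which may not be the function used in the next tutorial.

```r
# Hypothetical helper: 13 synthetic features plus a label in column 14.
set.seed(7)
make_set <- function(n) {
  feats <- as.data.frame(matrix(rnorm(n * 13, mean = 50, sd = 20), ncol = 13))
  feats$Customer_Segment <- sample(1:3, n, replace = TRUE)
  feats
}
training_set <- make_set(80)
test_set <- make_set(20)

# Feature scaling: the negative index -14 means "all columns except the 14th",
# so the dependent variable keeps its labels 1, 2 and 3.
training_set[-14] <- scale(training_set[-14])
test_set[-14] <- scale(test_set[-14])

# Applying PCA: fitted on the training features only, keeping two components.
pca <- prcomp(training_set[-14], center = FALSE, scale. = FALSE)
training_pca <- data.frame(pca$x[, 1:2],
                           Customer_Segment = training_set$Customer_Segment)

# The test set is projected onto the components learned from the training set.
test_pca <- data.frame(predict(pca, test_set[-14])[, 1:2],
                       Customer_Segment = test_set$Customer_Segment)

print(head(training_pca, 3))
```

The classifier would then be fitted on `training_pca`, whose two columns `PC1` and `PC2` take the place of the 13 original features.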