1 00:00:00,133 --> 00:00:02,233 Hello my friends, and welcome to this 2 00:00:02,233 --> 00:00:04,000 new practical activity, 3 00:00:04,000 --> 00:00:08,900 this time on dimensionality reduction, which is not a branch of machine learning 4 00:00:08,900 --> 00:00:11,400 per se, but a very important technique 5 00:00:11,400 --> 00:00:13,066 to know how to handle 6 00:00:13,066 --> 00:00:16,800 when you work with big data sets, you know, huge data sets with 7 00:00:17,000 --> 00:00:17,400 many 8 00:00:17,400 --> 00:00:21,700 features, for which you would like to reduce the complexity by 9 00:00:21,700 --> 00:00:23,400 reducing the dimensionality. 10 00:00:23,400 --> 00:00:25,000 And this is exactly what 11 00:00:25,000 --> 00:00:27,266 dimensionality reduction is about. 12 00:00:27,266 --> 00:00:30,900 So in this new part, part nine, dimensionality reduction, we 13 00:00:30,900 --> 00:00:31,800 will build 14 00:00:31,800 --> 00:00:34,366 three different models that can perform such a 15 00:00:34,366 --> 00:00:35,000 task. 16 00:00:35,000 --> 00:00:36,266 These are, first, 17 00:00:36,266 --> 00:00:38,000 principal component analysis, 18 00:00:38,000 --> 00:00:41,100 the most famous one, then linear discriminant 19 00:00:41,100 --> 00:00:44,100 analysis, and finally kernel PCA. 20 00:00:44,100 --> 00:00:45,000 So we will build these 21 00:00:45,000 --> 00:00:48,000 three models, one for each section. And now 22 00:00:48,000 --> 00:00:49,500 we're about to start with the first 23 00:00:49,500 --> 00:00:52,500 one, principal component analysis. 24 00:00:52,700 --> 00:00:53,766 But before we start, let's 25 00:00:53,766 --> 00:00:56,400 just make sure everyone here is on the same page. 26 00:00:56,400 --> 00:00:58,600 I gave you the link to this folder right before this 27 00:00:58,600 --> 00:01:01,500 tutorial in the article, so make sure to connect to it. 28 00:01:01,500 --> 00:01:03,633 And now we should all be on the same page. 
29 00:01:03,633 --> 00:01:07,200 So we're going to go into part nine, dimensionality reduction. 30 00:01:08,000 --> 00:01:08,466 All right. 31 00:01:08,466 --> 00:01:10,500 And as I told you, you have the three sections 32 00:01:10,500 --> 00:01:12,533 corresponding to each of the models. And we're going 33 00:01:12,533 --> 00:01:13,200 to start with 34 00:01:13,200 --> 00:01:16,200 principal component analysis, PCA. 35 00:01:16,766 --> 00:01:19,000 And as usual we're going to start with Python. 36 00:01:19,000 --> 00:01:22,300 And then in this Python folder you will find two files as usual. 37 00:01:22,466 --> 00:01:23,400 First, the 38 00:01:23,400 --> 00:01:26,066 implementation in ipynb format, 39 00:01:26,066 --> 00:01:28,800 which now once again we will be able to run on 40 00:01:28,800 --> 00:01:32,500 Google Colaboratory because we will work with a classic data set. 41 00:01:32,666 --> 00:01:33,300 And speaking 42 00:01:33,300 --> 00:01:36,566 of which, here is the data set, Wine.csv. 43 00:01:36,800 --> 00:01:39,066 So let's open it and let me 44 00:01:39,066 --> 00:01:41,233 explain what this is about. 45 00:01:41,233 --> 00:01:41,500 All right. 46 00:01:41,500 --> 00:01:42,766 So actually, first, 47 00:01:42,766 --> 00:01:45,566 you notice indeed that we have many features. 48 00:01:45,566 --> 00:01:48,100 I did not take a dataset with hundreds of features 49 00:01:48,100 --> 00:01:50,700 because then we would, you know, get lost in the data set. 50 00:01:50,700 --> 00:01:51,700 So I just took a 51 00:01:51,700 --> 00:01:53,533 dataset with more than ten features. 52 00:01:53,533 --> 00:01:55,966 And of course these are all the features, from here, 53 00:01:55,966 --> 00:01:58,800 alcohol, to this one, proline. 54 00:01:58,800 --> 00:01:59,633 And as you can 55 00:01:59,633 --> 00:02:01,866 guess, each feature gives a 56 00:02:01,866 --> 00:02:03,300 certain piece of information 57 00:02:03,300 --> 00:02:05,200 about a certain wine, right? 
58 00:02:05,200 --> 00:02:06,066 Each row 59 00:02:06,066 --> 00:02:08,200 corresponds to a wine. 60 00:02:08,200 --> 00:02:10,133 And for each wine we have diverse 61 00:02:10,133 --> 00:02:11,366 information, diverse 62 00:02:11,366 --> 00:02:15,433 features, you know, characteristics of the wine: the alcohol level, 63 00:02:15,600 --> 00:02:16,800 the malic acid. 64 00:02:16,800 --> 00:02:19,166 I'm not an expert on wines, but 65 00:02:19,166 --> 00:02:20,266 these are some wine 66 00:02:20,266 --> 00:02:21,600 characteristics: 67 00:02:21,600 --> 00:02:24,233 ash, ash alcalinity, magnesium, 68 00:02:24,233 --> 00:02:25,500 total phenols, 69 00:02:25,500 --> 00:02:26,500 flavonoids. 70 00:02:26,500 --> 00:02:27,200 Anyway, so 71 00:02:27,200 --> 00:02:29,633 you see you have many wine features. 72 00:02:29,633 --> 00:02:31,166 And for each of these wines, 73 00:02:31,166 --> 00:02:32,166 well, there we go, 74 00:02:32,166 --> 00:02:34,500 I'm about to explain the dependent variable. 75 00:02:34,500 --> 00:02:35,966 For each of these wines 76 00:02:35,966 --> 00:02:38,833 we have the customer segment, 77 00:02:38,833 --> 00:02:39,700 you know, that's the 78 00:02:39,700 --> 00:02:40,600 last column, 79 00:02:40,600 --> 00:02:43,500 to which the wines belong. Okay. 80 00:02:43,500 --> 00:02:46,633 So let me explain what happens in terms of business. 81 00:02:46,633 --> 00:02:48,733 First of all, this is a data set I took 82 00:02:48,733 --> 00:02:50,700 from the UCI ML Repository. 83 00:02:50,700 --> 00:02:52,700 So all the credits go of course to 84 00:02:52,700 --> 00:02:55,033 this amazing platform of data sets. 85 00:02:55,033 --> 00:02:56,300 However, in this 86 00:02:56,300 --> 00:02:58,200 data set, I just changed the last 87 00:02:58,200 --> 00:03:01,400 column, customer segment, to make it more business-wise, 88 00:03:01,400 --> 00:03:04,166 you know, to make this case more of a business case study. 
89 00:03:04,166 --> 00:03:06,133 Because the scenario is the following. 90 00:03:06,133 --> 00:03:07,166 Let's imagine that 91 00:03:07,166 --> 00:03:09,166 this data set belongs to a 92 00:03:09,166 --> 00:03:13,133 wine merchant with many different bottles of wine to sell, and 93 00:03:13,133 --> 00:03:15,366 therefore a large base of customers. 94 00:03:15,366 --> 00:03:16,566 And this wine shop 95 00:03:16,566 --> 00:03:18,800 owner actually hired you as a 96 00:03:18,800 --> 00:03:20,066 data scientist to 97 00:03:20,066 --> 00:03:20,833 first do a 98 00:03:20,833 --> 00:03:23,400 preliminary work of clustering, 99 00:03:23,400 --> 00:03:24,333 meaning that 100 00:03:24,333 --> 00:03:27,066 at first we had all these features 101 00:03:27,066 --> 00:03:30,033 without this last column, customer segment. 102 00:03:30,033 --> 00:03:32,233 We had all these features from alcohol to 103 00:03:32,233 --> 00:03:34,733 proline, and this wine shop owner 104 00:03:34,733 --> 00:03:36,566 actually asked you to perform some 105 00:03:36,566 --> 00:03:38,966 clustering to identify diverse 106 00:03:38,966 --> 00:03:40,833 segments of customers, 107 00:03:40,833 --> 00:03:45,466 grouped by similarities which correspond to the wines they prefer. 108 00:03:45,466 --> 00:03:45,866 All right. 109 00:03:45,866 --> 00:03:46,666 So each 110 00:03:46,666 --> 00:03:47,533 customer segment 111 00:03:47,533 --> 00:03:50,100 here, and by the way, there are three of them, right? 112 00:03:50,100 --> 00:03:51,633 If we scroll down we can 113 00:03:51,633 --> 00:03:53,200 see that we have three 114 00:03:53,200 --> 00:03:55,866 different categories or, you know, clusters. 115 00:03:55,866 --> 00:03:57,800 And each of these segments 116 00:03:57,800 --> 00:04:00,566 will correspond to a certain group of customers 117 00:04:00,566 --> 00:04:02,666 that have similar preferences 118 00:04:02,666 --> 00:04:04,133 for similar wines. 
119 00:04:04,133 --> 00:04:06,433 And that's exactly what these segments are about. 120 00:04:06,433 --> 00:04:07,633 But that was the first work. 121 00:04:07,633 --> 00:04:09,633 And if you want, you can have fun and 122 00:04:09,633 --> 00:04:11,500 do this first work yourself. 123 00:04:11,500 --> 00:04:14,266 But here we want to work on dimensionality reduction. 124 00:04:14,266 --> 00:04:15,266 So there goes the 125 00:04:15,266 --> 00:04:18,066 second mission that this wine shop owner 126 00:04:18,066 --> 00:04:19,333 asks you to do. 127 00:04:19,333 --> 00:04:22,300 This wine shop owner was actually satisfied with your first work, 128 00:04:22,300 --> 00:04:24,300 you know, identifying these three segments. 129 00:04:24,300 --> 00:04:26,433 But now the owner would like to, 130 00:04:26,433 --> 00:04:26,933 you know, 131 00:04:26,933 --> 00:04:29,100 reduce the complexity of this dataset 132 00:04:29,100 --> 00:04:31,800 by ending up with a smaller number of features. 133 00:04:31,800 --> 00:04:33,400 And at the same time, this 134 00:04:33,400 --> 00:04:36,000 owner would like you to build a predictive model 135 00:04:36,000 --> 00:04:36,900 that will be 136 00:04:36,900 --> 00:04:39,000 trained on this data, you know, including 137 00:04:39,000 --> 00:04:40,500 the features up to here 138 00:04:40,500 --> 00:04:42,466 and the dependent variable, 139 00:04:42,466 --> 00:04:44,866 so that for each new wine 140 00:04:44,866 --> 00:04:47,066 that this owner has in the shop, 141 00:04:47,066 --> 00:04:50,400 well, we can deploy this predictive model, 142 00:04:50,400 --> 00:04:51,900 applied to a reduced 143 00:04:51,900 --> 00:04:52,800 dimensionality 144 00:04:52,800 --> 00:04:56,100 data set, to predict which customer 145 00:04:56,100 --> 00:04:59,133 segment this new wine belongs to. Right? 146 00:04:59,300 --> 00:05:01,800 And therefore once we manage to predict 
147 00:05:01,800 --> 00:05:06,733 which customer segment this wine belongs to, then we can recommend this wine 148 00:05:06,733 --> 00:05:09,300 to the right customers. And that's 149 00:05:09,300 --> 00:05:11,700 exactly why what we're about to do is like a 150 00:05:11,700 --> 00:05:13,066 recommender system, 151 00:05:13,066 --> 00:05:15,433 because for each new wine that will be in the 152 00:05:15,433 --> 00:05:17,333 shop, well, our predictive 153 00:05:17,333 --> 00:05:19,233 model will tell us for which 154 00:05:19,233 --> 00:05:21,066 customer segment it will 155 00:05:21,066 --> 00:05:22,700 be the most appropriate, 156 00:05:22,700 --> 00:05:25,466 you know, where it will be the most appreciated. 157 00:05:25,466 --> 00:05:27,266 All right. So that's the business case. 158 00:05:27,266 --> 00:05:28,466 And therefore, you know, our 159 00:05:28,466 --> 00:05:30,366 predictive model will add tons of 160 00:05:30,366 --> 00:05:34,133 value for this owner, because if this owner manages to build 161 00:05:34,133 --> 00:05:36,533 a good recommender system, of course it will 162 00:05:36,533 --> 00:05:38,500 optimize the sales and therefore the 163 00:05:38,500 --> 00:05:41,100 profit of the business. 164 00:05:41,100 --> 00:05:43,533 Okay. So that's what the case study is about. 165 00:05:43,533 --> 00:05:46,933 Now we're going to move on to the implementation, of course. 166 00:05:47,233 --> 00:05:51,066 Therefore I'm opening this file, principal component analysis, 167 00:05:51,366 --> 00:05:54,366 which you have the choice to open with either Google Colaboratory 168 00:05:54,366 --> 00:05:55,566 or Jupyter Notebook, 169 00:05:55,566 --> 00:05:58,566 as we did in the previous section on CNNs. 170 00:05:58,600 --> 00:05:59,533 But there we go. 171 00:05:59,533 --> 00:06:02,100 Let's open it with Google Colaboratory 172 00:06:02,100 --> 00:06:05,266 and enjoy a brand new implementation on it. 173 00:06:06,366 --> 00:06:06,766 All right. 
174 00:06:06,766 --> 00:06:07,533 So here is the 175 00:06:07,533 --> 00:06:10,266 implementation, principal component analysis. 176 00:06:10,266 --> 00:06:11,700 This is in read-only mode. 177 00:06:11,700 --> 00:06:14,200 So as usual we will create a copy by clicking 178 00:06:14,200 --> 00:06:17,333 File here, and then Save a copy in Drive. 179 00:06:17,600 --> 00:06:22,200 This will create a copy inside which we will be able to re-implement, 180 00:06:22,466 --> 00:06:23,633 not the whole 181 00:06:23,633 --> 00:06:27,066 implementation this time, because I will explain that most of 182 00:06:27,066 --> 00:06:29,633 the cells are cells we already did before, 183 00:06:29,633 --> 00:06:32,133 you know, many times in the classification part 184 00:06:32,133 --> 00:06:34,366 and also in the first section of part eight. 185 00:06:34,366 --> 00:06:36,866 So we won't have to re-implement everything. 186 00:06:36,866 --> 00:06:39,000 This would be a waste of time. And mostly 187 00:06:39,000 --> 00:06:42,266 we would rather focus on dimensionality reduction. 188 00:06:42,600 --> 00:06:43,800 And therefore 189 00:06:43,800 --> 00:06:45,100 here is what we're going to do. 190 00:06:45,100 --> 00:06:46,700 I'm going to show you the implementation, 191 00:06:46,700 --> 00:06:49,666 of course, but the only cell that we will 192 00:06:49,666 --> 00:06:51,500 re-implement will be 193 00:06:51,500 --> 00:06:54,433 this one, applying PCA. So let's 194 00:06:54,433 --> 00:06:55,833 remove it right away. 195 00:06:55,833 --> 00:06:57,666 Not the text, only this one. 196 00:06:57,666 --> 00:06:59,966 And now I'm going to show you that indeed, you 197 00:06:59,966 --> 00:07:03,966 know, all the cells are super familiar to us, right? 198 00:07:04,233 --> 00:07:05,600 Because indeed we start 199 00:07:05,600 --> 00:07:06,400 by importing the 200 00:07:06,400 --> 00:07:09,300 libraries, which we did 100 times, right? 
201 00:07:09,300 --> 00:07:11,600 So we have the three essential libraries here. 202 00:07:11,600 --> 00:07:14,300 Then we import the data set with the exact 203 00:07:14,300 --> 00:07:16,800 same code as the one you have in your 204 00:07:16,800 --> 00:07:18,633 data preprocessing template. 205 00:07:18,633 --> 00:07:20,600 So of course here I just put 206 00:07:20,600 --> 00:07:22,300 the right name of the data set, which is 207 00:07:22,300 --> 00:07:25,000 Wine.csv. 208 00:07:25,000 --> 00:07:25,800 Okay. 209 00:07:25,800 --> 00:07:28,700 Then you will recognize the next step 210 00:07:28,700 --> 00:07:30,266 of the data preprocessing template, 211 00:07:30,266 --> 00:07:32,133 which is to split the data set 212 00:07:32,133 --> 00:07:33,466 into the training set and the 213 00:07:33,466 --> 00:07:35,700 test set, with exactly the same code. 214 00:07:35,700 --> 00:07:38,100 Then we apply feature scaling, as it 215 00:07:38,100 --> 00:07:40,066 is, you know, most of the time recommended. 216 00:07:40,066 --> 00:07:42,666 So we apply it of course separately on 217 00:07:42,666 --> 00:07:44,800 the training set and the test set. 218 00:07:44,800 --> 00:07:48,000 And that closes the data preprocessing phase. 219 00:07:48,266 --> 00:07:49,633 Then we apply PCA. 220 00:07:49,633 --> 00:07:50,400 And that's of 221 00:07:50,400 --> 00:07:51,600 course the cell we will 222 00:07:51,600 --> 00:07:53,533 re-implement together. 223 00:07:53,533 --> 00:07:54,933 Then let me just remove 224 00:07:54,933 --> 00:07:57,233 all the outputs here so that you don't see them. 225 00:07:57,233 --> 00:07:59,800 I hope you closed your eyes when I just removed them. 226 00:07:59,800 --> 00:08:00,433 But there you go, 227 00:08:00,433 --> 00:08:00,900 now close 228 00:08:00,900 --> 00:08:01,633 your eyes a little bit. 229 00:08:01,633 --> 00:08:02,700 I'm going to 230 00:08:02,700 --> 00:08:04,800 remove that output as well, because 
231 00:08:04,800 --> 00:08:06,900 actually the dimensionality reduction technique 232 00:08:06,900 --> 00:08:09,566 that we'll use will manage to get us great results 233 00:08:09,566 --> 00:08:11,866 with only two extracted features. 234 00:08:11,866 --> 00:08:12,300 Right. 235 00:08:12,300 --> 00:08:15,433 We're not reducing the number of existing features. 236 00:08:15,633 --> 00:08:20,733 We are creating new extracted features based on these existing features. 237 00:08:20,733 --> 00:08:22,300 So we will get totally different 238 00:08:22,300 --> 00:08:24,233 new features at the end, which we 239 00:08:24,233 --> 00:08:26,900 call, you know, principal components. So we'll have 240 00:08:26,900 --> 00:08:30,600 principal component one and principal component two at the end. 241 00:08:31,066 --> 00:08:33,733 But there we go. So back to 242 00:08:33,733 --> 00:08:35,433 our implementation. 243 00:08:35,433 --> 00:08:38,233 After applying PCA, which we will redo together, 244 00:08:38,233 --> 00:08:39,733 well, we train the 245 00:08:39,733 --> 00:08:42,033 logistic regression model on the training set. 246 00:08:42,033 --> 00:08:43,833 I chose the logistic regression model 247 00:08:43,833 --> 00:08:47,700 as the first model of our classification toolkit, but I 248 00:08:47,700 --> 00:08:49,366 could have chosen any other one. 249 00:08:49,366 --> 00:08:50,200 But you will see that 250 00:08:50,200 --> 00:08:53,300 we will get great results with this one, but feel free to choose 251 00:08:53,300 --> 00:08:56,200 another classification model and it will work. 252 00:08:56,200 --> 00:08:57,266 But notice that 253 00:08:57,266 --> 00:09:00,300 it is important to apply PCA before 254 00:09:00,333 --> 00:09:01,200 training your 255 00:09:01,200 --> 00:09:01,833 classification 256 00:09:01,833 --> 00:09:05,000 model on the training set, right? You want to reduce the 257 00:09:05,000 --> 00:09:07,966 dimensionality of your 
data set before, of course, 258 00:09:07,966 --> 00:09:10,433 training it on your training 259 00:09:10,433 --> 00:09:12,966 set, right? The training set basically is the 260 00:09:12,966 --> 00:09:15,333 final version of your data after you 261 00:09:15,333 --> 00:09:16,900 performed all the data preprocessing 262 00:09:16,900 --> 00:09:19,866 phase, and dimensionality reduction if you want. 263 00:09:19,866 --> 00:09:20,633 Okay. 264 00:09:20,633 --> 00:09:24,700 So the training happens after applying your dimensionality reduction technique. 265 00:09:25,066 --> 00:09:25,500 And then, 266 00:09:25,500 --> 00:09:28,266 of course, well, we will make the confusion matrix. 267 00:09:28,266 --> 00:09:30,600 You know how to do that. We did it many times. 268 00:09:30,600 --> 00:09:33,500 And then, since our dimensionality reduction technique 269 00:09:33,500 --> 00:09:36,000 will get us great results with only two 270 00:09:36,000 --> 00:09:37,333 extracted features, 271 00:09:37,333 --> 00:09:40,233 principal component one and principal component two, 272 00:09:40,233 --> 00:09:43,233 well, that will allow us to visualize the training set results 273 00:09:43,266 --> 00:09:44,700 in two dimensions, right? 274 00:09:44,700 --> 00:09:45,433 Because remember 275 00:09:45,433 --> 00:09:46,133 that each 276 00:09:46,133 --> 00:09:48,933 dimension corresponds to one feature. 277 00:09:48,933 --> 00:09:50,400 And we do this for the training 278 00:09:50,400 --> 00:09:51,600 set right here 279 00:09:51,600 --> 00:09:54,233 and the test set. Okay. 280 00:09:54,233 --> 00:09:56,700 So as you can see, what I did with this 281 00:09:56,700 --> 00:10:00,733 implementation is something you can do in less than five minutes right now, 282 00:10:00,733 --> 00:10:02,700 thanks to your toolkit, right? 283 00:10:02,700 --> 00:10:04,100 Because you just need to take 284 00:10:04,100 --> 00:10:07,500 the data preprocessing toolkit to make these four cells. 
285 00:10:07,800 --> 00:10:08,533 Then you just need to 286 00:10:08,533 --> 00:10:10,700 grab the feature scaling tool in your 287 00:10:10,700 --> 00:10:12,466 data preprocessing toolkit. 288 00:10:12,466 --> 00:10:13,133 Then you 289 00:10:13,133 --> 00:10:15,300 just need to grab your logistic regression 290 00:10:15,300 --> 00:10:17,733 implementation to implement this cell. 291 00:10:17,733 --> 00:10:20,266 And same for the other ones, you know, the confusion matrix. 292 00:10:20,266 --> 00:10:21,633 And same for these last two, 293 00:10:21,633 --> 00:10:23,233 visualizing the training set results 294 00:10:23,233 --> 00:10:24,966 and visualizing the test set results. 295 00:10:24,966 --> 00:10:29,666 These are all cells that you have in your logistic regression implementation. 296 00:10:29,833 --> 00:10:31,133 So absolutely no need 297 00:10:31,133 --> 00:10:33,166 to do it together again. And therefore 298 00:10:33,166 --> 00:10:35,033 we can now focus directly 299 00:10:35,033 --> 00:10:36,533 on this cell, 300 00:10:36,533 --> 00:10:38,633 applying PCA. 301 00:10:38,633 --> 00:10:39,433 So there we go. 302 00:10:39,433 --> 00:10:41,233 We're going to create a new code cell. 303 00:10:41,233 --> 00:10:43,133 And now let's implement 304 00:10:43,133 --> 00:10:46,133 PCA, principal component analysis. 305 00:10:46,733 --> 00:10:47,100 All right. 306 00:10:47,100 --> 00:10:48,433 So you could almost 307 00:10:48,433 --> 00:10:49,733 press pause on the video 308 00:10:49,733 --> 00:10:52,266 now and get the right tool from the scikit-learn 309 00:10:52,266 --> 00:10:55,266 API to see how to implement this. 310 00:10:55,433 --> 00:10:56,966 That would be a good exercise. 311 00:10:56,966 --> 00:10:58,866 But if you don't want to do it, that's fine. 312 00:10:58,866 --> 00:11:00,566 Let's implement this right now. 313 00:11:00,566 --> 00:11:03,000 And as you guessed by what I've just said, well, 
314 00:11:03,000 --> 00:11:04,800 we're going to implement PCA 315 00:11:04,800 --> 00:11:07,400 using the scikit-learn library. 316 00:11:07,400 --> 00:11:10,800 So the first thing we'll do is start from scikit-learn, 317 00:11:10,800 --> 00:11:13,800 from which we're going to get access to a 318 00:11:13,800 --> 00:11:17,400 certain module, which we'll find in the scikit-learn API and which 319 00:11:17,400 --> 00:11:20,066 is called decomposition. 320 00:11:20,066 --> 00:11:20,700 Just like that, 321 00:11:20,700 --> 00:11:23,900 decomposition, from which we're going to import, 322 00:11:23,900 --> 00:11:27,600 of course, a class that will allow us to build this object, 323 00:11:27,600 --> 00:11:28,800 which will be nothing else 324 00:11:28,800 --> 00:11:31,500 than this PCA tool that will 325 00:11:31,500 --> 00:11:34,566 apply dimensionality reduction on our data set. 326 00:11:34,800 --> 00:11:39,133 And that class is called, very simply, PCA. 327 00:11:39,133 --> 00:11:42,133 So you can't miss it in the API: PCA. 328 00:11:42,300 --> 00:11:43,800 And now the next natural 329 00:11:43,800 --> 00:11:44,666 step is of course to 330 00:11:44,666 --> 00:11:46,066 create an object, 331 00:11:46,066 --> 00:11:48,300 or you know, an instance of this class. 332 00:11:48,300 --> 00:11:50,466 And guess how we're going to call that object. 333 00:11:50,466 --> 00:11:53,800 Well, very simply we're going to call that object pca. 334 00:11:53,800 --> 00:11:56,466 Right. So this is super intuitive. 335 00:11:56,466 --> 00:11:58,866 And now you know the next step. The next step 336 00:11:58,866 --> 00:12:03,600 is to call the PCA class, which needs to take 337 00:12:03,600 --> 00:12:05,233 one essential argument. 338 00:12:05,233 --> 00:12:07,433 You know, we only have to input one argument here. 339 00:12:07,433 --> 00:12:09,166 And you can totally guess what 340 00:12:09,166 --> 00:12:11,333 this argument will be, right? 341 00:12:11,333 --> 00:12:11,866 It is 
342 00:12:11,866 --> 00:12:14,866 the final number of extracted features 343 00:12:15,066 --> 00:12:17,100 you want to end up with in your new 344 00:12:17,100 --> 00:12:17,966 data set. 345 00:12:17,966 --> 00:12:20,466 And that argument to choose that number is 346 00:12:20,466 --> 00:12:21,133 called 347 00:12:21,133 --> 00:12:25,266 n underscore components: n_components. 348 00:12:26,033 --> 00:12:26,400 All right. 349 00:12:26,400 --> 00:12:27,000 So now the 350 00:12:27,000 --> 00:12:29,500 question is of course which number should we 351 00:12:29,500 --> 00:12:30,300 choose, right? 352 00:12:30,300 --> 00:12:33,800 How do we know down to which number of features, right, 353 00:12:33,800 --> 00:12:34,633 extracted features, 354 00:12:34,633 --> 00:12:36,166 we want to reduce the dimensionality 355 00:12:36,166 --> 00:12:37,800 of our data set? 356 00:12:37,800 --> 00:12:38,900 Well, I have a very 357 00:12:38,900 --> 00:12:39,500 simple answer 358 00:12:39,500 --> 00:12:40,466 to that question. 359 00:12:40,466 --> 00:12:42,566 What I usually do is start with two, 360 00:12:42,566 --> 00:12:43,333 you know, two 361 00:12:43,333 --> 00:12:44,466 principal components, 362 00:12:44,466 --> 00:12:46,500 therefore two extracted features, 363 00:12:46,500 --> 00:12:48,266 and see the results I get in the end. 364 00:12:48,266 --> 00:12:50,833 And thanks to our code, you know, our code template, we can 365 00:12:50,833 --> 00:12:53,100 check that very quickly and easily. 366 00:12:53,100 --> 00:12:53,766 And besides, 367 00:12:53,766 --> 00:12:56,100 we do want to try with two, because then if we 368 00:12:56,100 --> 00:12:57,600 get good results with two, 369 00:12:57,600 --> 00:13:00,300 well, we will be able to visualize the training set results 370 00:13:00,300 --> 00:13:03,000 and the test set results in two dimensions, 371 00:13:03,000 --> 00:13:05,833 you know, in this nice plot that we had in part 
372 00:13:05,833 --> 00:13:07,200 three, classification. 373 00:13:07,200 --> 00:13:09,000 So we definitely want to start with two. 374 00:13:09,000 --> 00:13:10,533 And if, you know, we get really 375 00:13:10,533 --> 00:13:12,500 poor results and we see on the 376 00:13:12,500 --> 00:13:13,300 graphics here 377 00:13:13,300 --> 00:13:14,333 that we can't 378 00:13:14,333 --> 00:13:16,500 separate the three classes properly, 379 00:13:16,500 --> 00:13:17,200 you know, remember, 380 00:13:17,200 --> 00:13:20,233 with those different prediction regions and the prediction boundary, 381 00:13:20,600 --> 00:13:21,833 well, if we see that we have poor 382 00:13:21,833 --> 00:13:24,366 results on the visualizations, then we can try 383 00:13:24,366 --> 00:13:24,833 with 384 00:13:24,833 --> 00:13:28,333 higher numbers of principal components, meaning three, then four. 385 00:13:28,500 --> 00:13:31,033 And at some point we'll get, you know, some extracted 386 00:13:31,033 --> 00:13:33,966 features that explain the variance well enough, 387 00:13:33,966 --> 00:13:37,966 which is exactly what PCA is about, right? It is about extracting 388 00:13:37,966 --> 00:13:40,933 some features that explain the variance well enough. 389 00:13:40,933 --> 00:13:42,300 And once you find them, 390 00:13:42,300 --> 00:13:44,333 well, you will get good results 391 00:13:44,333 --> 00:13:46,400 even with lower dimensionality. 392 00:13:46,400 --> 00:13:47,800 Okay. So let's try with 393 00:13:47,800 --> 00:13:49,233 two and let's see what we'll get. 394 00:13:49,233 --> 00:13:51,066 But I already told you that we'll get 395 00:13:51,066 --> 00:13:52,866 amazing results. 396 00:13:52,866 --> 00:13:53,866 Therefore there you go, 397 00:13:53,866 --> 00:13:56,833 n_components equal to two, 398 00:13:56,833 --> 00:13:57,966 two principal components, 399 00:13:57,966 --> 00:13:59,000 or in other words, 400 00:13:59,000 --> 00:14:01,600 two extracted features. Okay. 
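A quick way to check how well a given number of components explains the variance, before looking at any plot, is to read the explained variance ratio of a fitted PCA. The following is only a minimal sketch under a few assumptions: scikit-learn's built-in load_wine data (which comes from the same UCI source) stands in for Wine.csv, and the split settings test_size=0.2 and random_state=0 as well as the names X_train_scaled, pca_full and cumulative are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine stands in for the course's Wine.csv (13 features, 3 classes).
X, y = load_wine(return_X_y=True)

# Assumption: split settings are illustrative, not confirmed by the transcript.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale on the training set only, as in the preprocessing phase.
X_train_scaled = StandardScaler().fit_transform(X_train)

# Keep every component so we can see how the explained variance accumulates.
pca_full = PCA(n_components=None)
pca_full.fit(X_train_scaled)

# cumulative[k - 1] is the share of variance explained by the first k components.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k:2d} components: {share:.3f} of the variance")
```

If the share for two components looks too low for your data, that is a hint to try three, then four, as described above.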
401 00:14:01,600 --> 00:14:02,800 So that's for our object. 402 00:14:02,800 --> 00:14:03,900 And now the next 403 00:14:03,900 --> 00:14:07,200 step of course is to apply this object to our 404 00:14:07,266 --> 00:14:10,233 training set to reduce the dimensionality 405 00:14:10,233 --> 00:14:12,800 of our training set in order to ease 406 00:14:12,800 --> 00:14:13,833 the learning process 407 00:14:13,833 --> 00:14:15,700 of the logistic regression model. 408 00:14:15,700 --> 00:14:18,300 But also we will have to apply it on the 409 00:14:18,300 --> 00:14:19,133 test set, 410 00:14:19,133 --> 00:14:20,200 because remember 411 00:14:20,200 --> 00:14:21,533 that the predict 412 00:14:21,533 --> 00:14:22,266 method that 413 00:14:22,266 --> 00:14:24,566 we will call here has to be called 414 00:14:24,566 --> 00:14:27,500 on the exact same format of data 415 00:14:27,500 --> 00:14:29,066 as the one that was used 416 00:14:29,066 --> 00:14:30,333 for the training set. 417 00:14:30,333 --> 00:14:31,666 So as long as you apply 418 00:14:31,666 --> 00:14:34,400 some transformations like data preprocessing 419 00:14:34,400 --> 00:14:36,533 or dimensionality reduction on your training 420 00:14:36,533 --> 00:14:38,000 set, well, you have to do the 421 00:14:38,000 --> 00:14:39,900 same on your test set. 422 00:14:39,900 --> 00:14:43,500 However, be careful: exactly as with feature scaling, 423 00:14:43,566 --> 00:14:45,466 we will have to apply the 424 00:14:45,466 --> 00:14:48,466 fit_transform method on the training set, but 425 00:14:48,566 --> 00:14:51,566 only the transform method on the test set. 426 00:14:51,566 --> 00:14:53,433 And that's always for the same reason. 427 00:14:53,433 --> 00:14:57,233 That's because we want to avoid information leakage on the test set. 428 00:14:57,433 --> 00:14:57,800 Right? 429 00:14:57,800 --> 00:15:00,666 The test set is supposed to be new observations, like 
430 00:15:00,666 --> 00:15:03,733 data on which we deploy our model in production. 431 00:15:03,966 --> 00:15:05,666 And therefore we're not supposed 432 00:15:05,666 --> 00:15:07,500 to fit our scaler 433 00:15:07,500 --> 00:15:09,500 or, you know, feature extractor object 434 00:15:09,500 --> 00:15:10,966 on the test set. 435 00:15:10,966 --> 00:15:13,766 We can apply them to transform it, right, because they were 436 00:15:13,766 --> 00:15:15,100 fitted on the training set. 437 00:15:15,100 --> 00:15:17,033 But we can't fit them again to 438 00:15:17,033 --> 00:15:18,600 the test set, because that would be like 439 00:15:18,600 --> 00:15:21,700 trying to get some hints of information from the test set 440 00:15:21,900 --> 00:15:24,066 that we're not supposed to have. That's exactly 441 00:15:24,066 --> 00:15:26,100 what information leakage is about. 442 00:15:26,100 --> 00:15:28,200 So there you go. I said everything. Now you can 443 00:15:28,200 --> 00:15:29,466 press pause on this 444 00:15:29,466 --> 00:15:30,466 video to 445 00:15:30,466 --> 00:15:33,466 finish this implementation of PCA. 446 00:15:33,466 --> 00:15:37,200 And in two seconds I'm going to implement the solution with you. 447 00:15:40,066 --> 00:15:40,900 All right. 448 00:15:40,900 --> 00:15:43,300 I hope you did well. Now let's do it together. 449 00:15:43,300 --> 00:15:46,400 So as we said, we want to apply this PCA object separately 450 00:15:46,400 --> 00:15:47,100 on the training set 451 00:15:47,100 --> 00:15:47,733 and test set. 452 00:15:47,733 --> 00:15:50,733 So first I'm going to take X_train, 453 00:15:51,166 --> 00:15:52,866 all right, which I'm going to 454 00:15:52,866 --> 00:15:57,300 update by applying this pca object, from 455 00:15:57,300 --> 00:15:58,200 which I'm going to 456 00:15:58,200 --> 00:15:59,433 call the 457 00:15:59,433 --> 00:16:02,300 fit_transform method 458 00:16:02,300 --> 00:16:03,900 on this old 
459 00:16:03,900 --> 00:16:08,066 version of X_train, meaning before the transformation of PCA. 460 00:16:08,300 --> 00:16:09,100 And so here 461 00:16:09,100 --> 00:16:11,800 what happens technically is that the fit part of this 462 00:16:11,800 --> 00:16:13,366 fit_transform method will get 463 00:16:13,366 --> 00:16:15,766 all the information it needs from X_train 464 00:16:15,766 --> 00:16:16,700 to apply 465 00:16:16,700 --> 00:16:18,600 principal component analysis. 466 00:16:18,600 --> 00:16:20,733 And then of course the transform 467 00:16:20,733 --> 00:16:21,300 part of this 468 00:16:21,300 --> 00:16:23,366 fit_transform method will apply 469 00:16:23,366 --> 00:16:25,900 the transformation itself to extract the 470 00:16:25,900 --> 00:16:28,100 principal component features. Okay. 471 00:16:28,100 --> 00:16:30,900 So that's what it means technically, and now, 472 00:16:30,900 --> 00:16:32,000 well, let's do the same 473 00:16:32,000 --> 00:16:36,000 actually for X_test. I'm copying this, pasting it here, 474 00:16:36,300 --> 00:16:39,600 and replacing here X_train by X_test, 475 00:16:40,100 --> 00:16:43,100 then X_train here again by X_test, 476 00:16:43,133 --> 00:16:47,700 and only applying the transform method. 477 00:16:48,200 --> 00:16:49,800 And there we go, my friends. 478 00:16:49,800 --> 00:16:52,633 This implementation is already over.
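Put together, the cell described above, plus the cells around it, could look roughly like the sketch below. Treat it as an approximation rather than the notebook itself: it assumes scikit-learn's built-in load_wine data in place of Wine.csv, and the split settings (test_size=0.2, random_state=0) and the variable names sc, classifier and cm are illustrative assumptions. What does come from the walkthrough is the order of the steps: scale, then fit_transform PCA on the training set, only transform on the test set, and train the logistic regression afterwards.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine stands in for Wine.csv (features alcohol ... proline,
# with three classes playing the role of the customer segments).
X, y = load_wine(return_X_y=True)

# Assumption: illustrative split settings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature scaling: fit on the training set, only transform the test set.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying PCA: same rule, fit_transform on train and transform on test,
# so no information from the test set leaks into the fitted components.
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Training happens only after the dimensionality reduction.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Confusion matrix and accuracy on the reduced test set.
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
```

Because X_train and X_test now have only two columns each, the same two-dimensional visualization code from the logistic regression implementation can be reused on them.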