1 00:00:00,133 --> 00:00:00,966 Hello, my friends. 2 00:00:00,966 --> 00:00:02,966 All right. Are you ready to start this 3 00:00:02,966 --> 00:00:06,700 big implementation of your very first artificial brain? 4 00:00:06,866 --> 00:00:08,700 Well, I'm definitely ready. 5 00:00:08,700 --> 00:00:11,633 So let's do this. Let's smash this together. 6 00:00:11,633 --> 00:00:14,633 All right, so we're going to start with the data preprocessing phase, 7 00:00:14,633 --> 00:00:18,600 which we will tackle in this same tutorial because we quickly 8 00:00:18,600 --> 00:00:20,400 want to get to the interesting stuff. 9 00:00:20,400 --> 00:00:23,933 So let's do this efficiently thanks to our data 10 00:00:23,933 --> 00:00:27,400 preprocessing template but also our data preprocessing toolkit. 11 00:00:27,733 --> 00:00:32,700 And therefore the first thing we're going to do here is to import the libraries. 12 00:00:32,700 --> 00:00:35,500 So we're going to create a new code cell here. 13 00:00:35,500 --> 00:00:39,866 We're going to go into our data preprocessing template to steal the cell 14 00:00:39,866 --> 00:00:44,966 we want meaning this one and get it back into our implementation. 15 00:00:44,966 --> 00:00:46,033 The first cell. 16 00:00:46,033 --> 00:00:46,366 All right. 17 00:00:46,366 --> 00:00:47,166 So that's the first thing. 18 00:00:47,166 --> 00:00:50,000 However I just want to show you something extra. 19 00:00:50,000 --> 00:00:53,000 It is about the beauty of Google Colab. 20 00:00:53,033 --> 00:00:54,833 I want to show you that indeed 21 00:00:54,833 --> 00:00:59,233 TensorFlow 2.0 is already pre-installed in Google Colab. 22 00:00:59,233 --> 00:01:03,033 You know, in any Google Colab notebook you will ever open. 
23 00:01:03,033 --> 00:01:08,133 So the way for me to show you this is to first import TensorFlow, 24 00:01:08,133 --> 00:01:11,400 because okay, it is already pre-installed as a library 25 00:01:11,400 --> 00:01:14,400 inside a notebook, but we still need to import it. 26 00:01:14,600 --> 00:01:18,866 And in fact here, since we actually won't use matplotlib, 27 00:01:19,066 --> 00:01:21,933 we're just going to delete the import of this library 28 00:01:21,933 --> 00:01:24,866 and then just add as a third library here, 29 00:01:24,866 --> 00:01:26,300 well, TensorFlow. 30 00:01:26,300 --> 00:01:29,433 And the way to import TensorFlow: we start with indeed import, 31 00:01:29,566 --> 00:01:32,700 then the name of the library, which is tensorflow of course. 32 00:01:33,066 --> 00:01:36,066 And then we add a shortcut, a simple one, 33 00:01:36,066 --> 00:01:39,100 like the classic one, tf. Okay. 34 00:01:39,100 --> 00:01:41,933 And now I'm going to create a new code cell. 35 00:01:41,933 --> 00:01:46,733 And inside I'm going to enter the following: tf dot 36 00:01:47,066 --> 00:01:50,766 double underscore version double underscore again. 37 00:01:51,000 --> 00:01:55,433 And this will simply print the version of TensorFlow we're using, 38 00:01:55,433 --> 00:01:56,766 which I just want to show you 39 00:01:56,766 --> 00:01:59,933 is indeed TensorFlow 2, right, the brand new TensorFlow. 40 00:02:00,300 --> 00:02:01,600 So let's do this. 41 00:02:01,600 --> 00:02:04,166 First we need to execute this cell and this one. 42 00:02:04,166 --> 00:02:07,700 But remember, if we execute this cell now, this will take some time 43 00:02:07,700 --> 00:02:10,566 because actually the notebook is not running yet. 44 00:02:10,566 --> 00:02:13,933 And the way to run this is just to click this folder here. 45 00:02:14,000 --> 00:02:14,500 Right. 
46 00:02:14,500 --> 00:02:18,600 And it is now that it will connect to a runtime to enable file browsing, 47 00:02:18,600 --> 00:02:21,600 but mostly to, you know, start running the notebook. 48 00:02:21,600 --> 00:02:22,200 Okay. 49 00:02:22,200 --> 00:02:26,333 And at the same time let's take that opportunity to upload the data set. 50 00:02:26,466 --> 00:02:30,200 So please go to your Machine Learning A-Z Codes and Datasets folder, 51 00:02:30,200 --> 00:02:31,233 which you had to download 52 00:02:31,233 --> 00:02:35,400 either in the previous tutorial or at the beginning of each practical activity. 53 00:02:35,566 --> 00:02:39,566 So there we go inside Part 8, Deep Learning, almost close to the end. 54 00:02:39,566 --> 00:02:43,166 By the way, you must be very excited, almost seeing the end of the tunnel. 55 00:02:43,500 --> 00:02:46,633 Then let's go to this section on Artificial Neural Networks. 56 00:02:46,866 --> 00:02:49,666 Let's go to Python and let's select this dataset, 57 00:02:49,666 --> 00:02:53,100 Churn Modelling dot CSV. Open. Okay. 58 00:02:53,300 --> 00:02:54,500 And there we go. 59 00:02:54,500 --> 00:02:56,366 Now we have everything. We have the data set. 60 00:02:56,366 --> 00:02:59,233 And besides, our notebook is already running. 61 00:02:59,233 --> 00:03:02,033 As you can notice, it takes a little bit more time than usual 62 00:03:02,033 --> 00:03:05,800 because, you know, the data set is this time more real world and mostly bigger. 63 00:03:05,800 --> 00:03:06,900 Okay. All right. 64 00:03:06,900 --> 00:03:07,500 So let's do this. 65 00:03:07,500 --> 00:03:10,733 Let's run this first cell here to import the libraries. 66 00:03:10,733 --> 00:03:13,500 This time NumPy, pandas and TensorFlow. 67 00:03:13,500 --> 00:03:18,466 And now let's play this cell to indeed reassure ourselves 68 00:03:18,466 --> 00:03:23,700 that the TensorFlow version we're going to be working with is 2.2.0. 
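The two cells described above could look roughly like this sketch. The aliases `np`, `pd` and `tf` are the conventional ones and are assumed here. The `try`/`except` guard and the name `tf_version` are additions for running outside Colab, where TensorFlow may not be installed; in Colab the plain `import tensorflow as tf` followed by `tf.__version__` is enough.

```python
import numpy as np
import pandas as pd

# In Google Colab TensorFlow is pre-installed; the guard below is only
# an assumption-level convenience for environments where it is missing.
try:
    import tensorflow as tf
    tf_version = tf.__version__  # a string such as "2.x.y"
except ImportError:
    tf_version = None

print(tf_version)
```

The version string printed depends on the runtime, so the value shown in the video will not necessarily match what a notebook opened today reports.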
69 00:03:23,700 --> 00:03:28,400 Basically TensorFlow 2, which is so much better than TensorFlow 1. 70 00:03:28,400 --> 00:03:30,666 I'm so happy about their new version. 71 00:03:30,666 --> 00:03:31,466 Okay great. 72 00:03:31,466 --> 00:03:33,200 So it's good to have the confirmation. 73 00:03:33,200 --> 00:03:36,600 And now let's tackle this Part 1, Data Preprocessing. 74 00:03:36,600 --> 00:03:37,333 We're going to do this 75 00:03:37,333 --> 00:03:41,066 efficiently thanks to our data preprocessing template and toolkit. 76 00:03:41,233 --> 00:03:42,266 So let's create first 77 00:03:42,266 --> 00:03:46,066 a new code cell to import the data set, which we already have in the notebook. 78 00:03:46,366 --> 00:03:47,166 Perfect. 79 00:03:47,166 --> 00:03:49,600 So let's go to our data preprocessing template. 80 00:03:49,600 --> 00:03:54,533 And let's now steal this second cell to import the data set. 81 00:03:54,800 --> 00:03:55,466 All right. 82 00:03:55,466 --> 00:03:58,233 Let's go back here. And let's paste that inside. 83 00:03:58,233 --> 00:04:01,800 And now of course the question is what do we need to replace. 84 00:04:01,966 --> 00:04:06,000 Well, the obvious first change we need to make is the name of the data set, 85 00:04:06,000 --> 00:04:10,466 which this time is not Data dot CSV but Churn 86 00:04:10,900 --> 00:04:14,666 underscore Modelling dot CSV. Great. 87 00:04:15,000 --> 00:04:17,366 Then let's look at the lines one by one. 88 00:04:17,366 --> 00:04:19,200 This one is okay. 89 00:04:19,200 --> 00:04:21,000 Now what about this one? 90 00:04:21,000 --> 00:04:24,833 This line of code creates the matrix of features X, and the way it does 91 00:04:24,833 --> 00:04:28,066 this is it takes all the columns except the last one. 92 00:04:28,466 --> 00:04:32,200 But let's actually have a look again at our data set. 93 00:04:32,466 --> 00:04:33,300 Right. 
94 00:04:33,300 --> 00:04:38,100 We noticed when I described the data set to you that the first columns 95 00:04:38,100 --> 00:04:41,433 are actually irrelevant in the sense that they will not help 96 00:04:41,600 --> 00:04:44,466 to predict the outcome of the dependent variable. 97 00:04:44,466 --> 00:04:49,733 And these columns, you know, the non-helpful columns, are obviously this one. 98 00:04:49,733 --> 00:04:52,133 This just gives the row number of this data set. 99 00:04:52,133 --> 00:04:56,233 So we clearly don't want to include it. Then customer ID as well. 100 00:04:56,233 --> 00:04:56,833 Right. 101 00:04:56,833 --> 00:05:00,700 The customer ID is just a key identifier of each customer. 102 00:05:00,700 --> 00:05:04,966 Because, you know, each row corresponds to a different customer. 103 00:05:04,966 --> 00:05:06,233 So of course the customer 104 00:05:06,233 --> 00:05:10,300 ID has absolutely no impact on the dependent variable Exited. 105 00:05:10,466 --> 00:05:13,633 So we will also exclude that column. 106 00:05:13,633 --> 00:05:14,200 We don't have to. 107 00:05:14,200 --> 00:05:16,933 You know, the neural network would just figure it out. 108 00:05:16,933 --> 00:05:20,700 But let's just ease the learning process of our future neural network. 109 00:05:20,700 --> 00:05:21,133 Right. 110 00:05:21,133 --> 00:05:23,100 We're all in the same boat here. 111 00:05:23,100 --> 00:05:25,066 Okay. Then what about the surname? 112 00:05:25,066 --> 00:05:27,133 Does the surname have an impact on 113 00:05:27,133 --> 00:05:30,900 whether the customer will stay in or leave the bank? 114 00:05:30,900 --> 00:05:32,400 Well, absolutely not. 115 00:05:32,400 --> 00:05:32,900 Right. 116 00:05:32,900 --> 00:05:36,766 Surname of course has no impact on the decision of a customer 117 00:05:36,766 --> 00:05:38,600 to stay in or leave the bank. 
118 00:05:38,600 --> 00:05:41,333 So we will also exclude this column, 119 00:05:41,333 --> 00:05:45,166 and then all the rest, you know, all the other features here look fine. 120 00:05:45,166 --> 00:05:46,800 They might have an impact 121 00:05:46,800 --> 00:05:50,400 on the dependent variable, meaning they might help to predict 122 00:05:50,400 --> 00:05:54,866 if each customer will stay in the bank or leave the bank. 123 00:05:54,966 --> 00:05:59,566 Okay, so we will definitely keep all the other ones, meaning all the features 124 00:05:59,700 --> 00:06:00,833 starting from this one, 125 00:06:00,833 --> 00:06:05,233 the credit score. And so here in our implementation, 126 00:06:05,466 --> 00:06:10,066 instead of taking all the columns except the last one, well, we will take 127 00:06:10,066 --> 00:06:14,300 all the columns starting from this one except the last one, 128 00:06:14,566 --> 00:06:18,333 meaning all the columns from credit score up to estimated salary. 129 00:06:18,633 --> 00:06:23,400 And the way to do this is still to keep that upper bound of the range, 130 00:06:23,400 --> 00:06:26,633 you know, finishing at the one before last column. 131 00:06:26,833 --> 00:06:29,233 Right? You know, that's exactly the upper bound. 132 00:06:29,233 --> 00:06:30,466 That's the range. 133 00:06:30,466 --> 00:06:33,966 But at the left of this range we won't specify 134 00:06:33,966 --> 00:06:36,966 nothing, which would mean the first column, the first index. 135 00:06:37,200 --> 00:06:41,700 Instead we will specify the index of the column 136 00:06:41,800 --> 00:06:44,700 we want to start with, which is the credit score. 137 00:06:44,700 --> 00:06:46,366 Right. We know we want to start from here. 138 00:06:46,366 --> 00:06:49,833 And therefore now the question is what is the index of that column. 139 00:06:50,033 --> 00:06:52,966 Well, let's see. Indexes in Python start from zero. 140 00:06:52,966 --> 00:06:54,900 So this has index zero. 
141 00:06:54,900 --> 00:06:56,433 Then this has index one. 142 00:06:56,433 --> 00:06:59,200 This has index two and this has index three. 143 00:06:59,200 --> 00:07:03,433 And therefore here, instead of specifying nothing as a lower 144 00:07:03,433 --> 00:07:06,866 bound of the range, well, we will specify the index three 145 00:07:07,066 --> 00:07:11,400 so that we can take all the columns starting from the column of index three 146 00:07:11,633 --> 00:07:16,266 up to the one before last, and taking all the rows, all the values of the data set. 147 00:07:16,533 --> 00:07:19,766 And this will create a relevant matrix of features. 148 00:07:20,266 --> 00:07:22,666 Perfect. So this line of code is done. 149 00:07:22,666 --> 00:07:24,200 Now what about the next one. 150 00:07:24,200 --> 00:07:26,466 Well, obviously the next one is fine. 151 00:07:26,466 --> 00:07:30,000 It will just take the last column of this data set, which is exactly 152 00:07:30,000 --> 00:07:33,766 what we want for our dependent variable, Exited. 153 00:07:34,066 --> 00:07:36,300 So all good here. Nothing to change. 154 00:07:36,300 --> 00:07:40,166 We can just play the cell and we will have our data set, 155 00:07:40,366 --> 00:07:43,366 our matrix of features and our dependent variable vector. 156 00:07:43,466 --> 00:07:44,200 Let's check it out. 157 00:07:44,200 --> 00:07:47,466 Actually let's create two new code cells, right? 158 00:07:47,666 --> 00:07:51,500 One where we will print the matrix of features X, 159 00:07:51,700 --> 00:07:56,400 and one where we will print the dependent variable vector y. 160 00:07:56,733 --> 00:07:57,800 Perfect. All right. 161 00:07:57,800 --> 00:07:58,600 So let's do this now. 162 00:07:58,600 --> 00:08:03,466 Actually let's play first this cell to print the matrix of features X. 163 00:08:03,466 --> 00:08:04,466 And there we go. 164 00:08:04,466 --> 00:08:08,066 We have indeed all the features starting from the credit score. 
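The import cell described above might look like the following sketch. It assumes the template reads the file with pandas `read_csv` and slices with `iloc`, since the video only describes the ranges. The three rows, the column names and the file-name spelling in the comment are illustrative stand-ins that follow the column order described in the video, so the snippet runs without the real file.

```python
import io
import pandas as pd

# Made-up stand-in for the churn data set: three identifier columns first,
# then the features, then the dependent variable Exited in the last column.
csv_text = """RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,1001,Smith,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,1002,Jones,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,1003,Brown,502,Germany,Male,39,8,159660.8,3,1,0,113931.57,1
"""

# In the notebook this would read the uploaded file instead,
# e.g. pd.read_csv('Churn_Modelling.csv') (file name spelling assumed).
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, 3:-1].values  # all rows, columns from index 3 up to (excluding) the last
y = dataset.iloc[:, -1].values    # all rows, last column only

print(X)
print(y)
```

The lower bound `3` skips the row number, customer ID and surname columns (indexes 0, 1 and 2), and the upper bound `-1` leaves out the dependent variable.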
165 00:08:08,100 --> 00:08:09,500 This is the credit score. 166 00:08:09,500 --> 00:08:13,833 Then, you know, the country of residence and then the gender and all the other ones, 167 00:08:13,833 --> 00:08:16,500 you know, has credit card, yes or no, is active member. 168 00:08:16,500 --> 00:08:19,666 And the last one is the estimated salary. 169 00:08:19,800 --> 00:08:22,333 Okay. So we have all these features. Perfect. 170 00:08:22,333 --> 00:08:23,333 And of course we don't have 171 00:08:23,333 --> 00:08:27,466 the dependent variable values because they are right here in y. 172 00:08:27,833 --> 00:08:29,100 And there we go. 173 00:08:29,100 --> 00:08:33,400 These are all the decisions of the customers to stay in or leave the bank. 174 00:08:33,400 --> 00:08:36,600 So of course this one here corresponds to this customer here, 175 00:08:36,900 --> 00:08:40,500 who obviously has decided to leave the bank. 176 00:08:40,500 --> 00:08:40,800 Right. 177 00:08:40,800 --> 00:08:44,166 This is actually this same one here, Exited equals one. 178 00:08:44,633 --> 00:08:48,033 And then, well, this second customer 179 00:08:48,033 --> 00:08:51,733 has decided to stay in the bank and corresponds to this one. 180 00:08:51,933 --> 00:08:52,366 Right. 181 00:08:52,366 --> 00:08:55,133 Which is exactly this one as well. Okay. 182 00:08:55,133 --> 00:08:56,666 This customer. 183 00:08:56,666 --> 00:08:56,966 All right. 184 00:08:56,966 --> 00:08:58,133 So all good so far. 185 00:08:58,133 --> 00:09:02,300 First step of the data preprocessing phase done successfully. 186 00:09:02,433 --> 00:09:07,000 And now let's move on to the more advanced steps of our data preprocessing phase, 187 00:09:07,000 --> 00:09:10,000 which are about encoding the categorical data. 188 00:09:10,100 --> 00:09:10,666 Right. 189 00:09:10,666 --> 00:09:14,500 Of course we noticed that there are two categorical variables. 
190 00:09:14,500 --> 00:09:18,700 This first one giving the country of residence of the customers, 191 00:09:18,700 --> 00:09:21,700 and the second one giving the gender of the customers. 192 00:09:21,866 --> 00:09:26,833 So we'll have to do some encoding work here to encode these categorical data 193 00:09:26,833 --> 00:09:30,633 into either simple labels, you know, zero and one for the gender, 194 00:09:30,866 --> 00:09:34,933 or some one hot encoding for this categorical variable 195 00:09:34,933 --> 00:09:39,566 in which indeed there is no order relationship between these values, you know, 196 00:09:39,600 --> 00:09:42,300 between these categories France, Spain and Germany. 197 00:09:42,300 --> 00:09:43,800 Okay. So let's do this. 198 00:09:43,800 --> 00:09:47,700 Let's start first with the label encoding of the gender column. 199 00:09:47,700 --> 00:09:49,500 So let's create a new code cell. 200 00:09:49,500 --> 00:09:53,400 And now of course to do it efficiently we're going to go into our data 201 00:09:53,400 --> 00:09:54,700 preprocessing toolkit. 202 00:09:54,700 --> 00:09:57,266 We're going to scroll down to find it. 203 00:09:57,266 --> 00:09:59,700 By the way, there is no missing data in the data set. 204 00:09:59,700 --> 00:10:03,133 I checked, and in reality you would also have to check. 205 00:10:03,300 --> 00:10:04,000 But all good. 206 00:10:04,000 --> 00:10:05,600 We don't have to take care of any 207 00:10:05,600 --> 00:10:09,633 missing data, so we can directly move to encoding categorical data. 208 00:10:09,633 --> 00:10:15,233 And now since we're taking care of label encoding the gender column, 209 00:10:15,366 --> 00:10:16,866 well, we're going to take this. 210 00:10:16,866 --> 00:10:20,633 That's exactly the tool we need to perform label encoding. 211 00:10:20,633 --> 00:10:22,133 So I'm stealing this code cell. 212 00:10:22,133 --> 00:10:25,466 Now I'm adding it inside our notebook here. 
213 00:10:25,466 --> 00:10:26,766 Our implementation. 214 00:10:26,766 --> 00:10:30,300 But remember that in our data preprocessing toolkit we did this 215 00:10:30,300 --> 00:10:32,300 on the dependent variable vector. 216 00:10:32,300 --> 00:10:37,800 But now we want to do it on this specific column of the matrix of features X. 217 00:10:37,800 --> 00:10:43,000 And therefore the only thing we need to replace here is this y by that specific 218 00:10:43,000 --> 00:10:47,400 column of the matrix of features X to which we want to apply label encoding. 219 00:10:47,700 --> 00:10:48,200 And so, 220 00:10:48,200 --> 00:10:51,200 well, now the question is how can we get this column. 221 00:10:51,200 --> 00:10:55,366 Well, we just need to get the index and then call X with that index. 222 00:10:55,366 --> 00:10:56,900 And so, well, there we go. 223 00:10:56,900 --> 00:10:59,700 That's the first column of X. It has index zero. 224 00:10:59,700 --> 00:11:02,266 That's the second column of X. It has index one. 225 00:11:02,266 --> 00:11:05,266 And that's the third column of X, which has index two. 226 00:11:05,600 --> 00:11:08,600 And therefore here we simply need to replace y 227 00:11:08,833 --> 00:11:13,766 by our matrix of features X, of which we're going to take all the rows. 228 00:11:13,766 --> 00:11:18,000 And I'm taking them with this colon, you know, which means a range in Python. 229 00:11:18,200 --> 00:11:21,700 And then to take the column we want, meaning the gender column, which has 230 00:11:21,700 --> 00:11:22,866 index two, 231 00:11:22,866 --> 00:11:27,166 well, I just need to add here after the comma the index two 232 00:11:27,266 --> 00:11:30,900 so that it will take all the rows but only the column of index two. 233 00:11:31,200 --> 00:11:33,500 And now of course we need to take this 
234 00:11:33,500 --> 00:11:39,200 and paste that inside the fit transform method called from our object, 235 00:11:39,200 --> 00:11:42,200 which is an instance of the label encoder class. 236 00:11:42,366 --> 00:11:43,333 And done. 237 00:11:43,333 --> 00:11:47,766 We just successfully performed label encoding on the gender column 238 00:11:47,766 --> 00:11:49,266 of our matrix of features X. 239 00:11:49,266 --> 00:11:53,033 Let's make sure it's the case by creating a new code cell here. 240 00:11:53,033 --> 00:11:57,400 And do a new print of the matrix of features X. 241 00:11:57,766 --> 00:12:00,933 Let's run the cell, you know, first. 242 00:12:01,366 --> 00:12:03,133 All right. Good. 243 00:12:03,133 --> 00:12:04,500 And now let's print X. 244 00:12:04,500 --> 00:12:06,533 And let's just make sure that we no longer see 245 00:12:06,533 --> 00:12:09,000 female, female, female, female, male, female, 246 00:12:09,000 --> 00:12:12,833 but whatever encoding there was, which probably will be one 247 00:12:12,833 --> 00:12:16,933 for female and zero for male, or zero for female and one for male. 248 00:12:16,933 --> 00:12:18,166 Let's see what they did. 249 00:12:18,166 --> 00:12:20,000 All right. And there we go. Right. 250 00:12:20,000 --> 00:12:24,300 That's the new column after the label encoding, and so female 251 00:12:24,300 --> 00:12:28,266 was encoded into zero and male was encoded into one. 252 00:12:28,266 --> 00:12:31,766 That's just the encoder's choice of these 253 00:12:31,766 --> 00:12:32,966 associated integers, assigned in alphabetical order. 254 00:12:32,966 --> 00:12:34,333 And so all good. 255 00:12:34,333 --> 00:12:37,200 Now this column is well label encoded. 256 00:12:37,200 --> 00:12:42,700 And now we're going to proceed to the one hot encoding of the geography column. 
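A minimal sketch of the label encoding step described above, assuming the toolkit cell uses scikit-learn's `LabelEncoder` with an object named `le`. The small array is a made-up stand-in for the first columns of `X` (credit score, geography, gender, age), with gender at index 2 as in the video.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up rows: credit score, geography, gender, age.
X = np.array([[619, 'France', 'Female', 42],
              [608, 'Spain', 'Female', 41],
              [502, 'Germany', 'Male', 39]], dtype=object)

le = LabelEncoder()
# All rows (the colon), only the column of index 2 (gender).
X[:, 2] = le.fit_transform(X[:, 2])

print(X)
print(le.classes_)
```

`LabelEncoder` numbers the labels in sorted order, which is why Female becomes 0 and Male becomes 1; `le.classes_` shows the mapping.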
257 00:12:42,900 --> 00:12:45,400 And this time we have indeed to perform one hot encoding 258 00:12:45,400 --> 00:12:49,200 because there is no order relationship between France, Spain and Germany. 259 00:12:49,200 --> 00:12:52,066 So we couldn't, you know, encode France into zero, 260 00:12:52,066 --> 00:12:54,200 then Spain into one and Germany into two. 261 00:12:54,200 --> 00:12:56,833 We have to perform one hot encoding instead. 262 00:12:56,833 --> 00:12:58,300 And so let's do this. 263 00:12:58,300 --> 00:13:01,433 Let's go back to our data preprocessing toolkit. 264 00:13:01,766 --> 00:13:05,233 Let's take that cell this time, which is exactly the 265 00:13:05,233 --> 00:13:08,533 cell that performs one hot encoding. 266 00:13:08,866 --> 00:13:12,800 And let's paste it inside a new code cell 267 00:13:13,166 --> 00:13:16,166 to one hot encode the geography column. 268 00:13:16,666 --> 00:13:17,333 All right. 269 00:13:17,333 --> 00:13:21,666 Now the question is of course what do we have to replace or change 270 00:13:21,666 --> 00:13:26,833 in that cell to indeed perform one hot encoding on the geography column? 271 00:13:27,100 --> 00:13:29,400 Well, remember, the only thing that you have to 272 00:13:29,400 --> 00:13:33,933 change inside this code is that index of the column 273 00:13:33,933 --> 00:13:36,933 you want to apply one hot encoding on, right? 274 00:13:37,133 --> 00:13:42,233 And remember that in our Data dot CSV file of our Part 1, Data Preprocessing, 275 00:13:42,266 --> 00:13:44,000 well, the categorical variable 276 00:13:44,000 --> 00:13:46,400 with the three different states was in the first column. 277 00:13:46,400 --> 00:13:48,600 That's why we had index zero here. 278 00:13:48,600 --> 00:13:52,300 But this time this column is actually the second column. 279 00:13:52,300 --> 00:13:53,833 Therefore it has index one. 280 00:13:53,833 --> 00:13:59,966 And therefore very simply we just need to replace zero here by one. Okay. 
281 00:14:00,033 --> 00:14:00,933 And that's it. 282 00:14:00,933 --> 00:14:03,166 All the rest will be done automatically. 283 00:14:03,166 --> 00:14:06,166 Let me show you this. Let's play that cell. 284 00:14:06,266 --> 00:14:10,533 And now let's create a new code cell to print X again. 285 00:14:11,033 --> 00:14:12,266 All right. Good. 286 00:14:12,266 --> 00:14:16,733 And now let's play that cell and see what X has become. 287 00:14:17,033 --> 00:14:20,600 And indeed, well, remember, when we perform one hot encoding, 288 00:14:20,600 --> 00:14:21,800 well, the dummy variables 289 00:14:21,800 --> 00:14:25,433 are actually moved to the first columns of the matrix of features. 290 00:14:25,433 --> 00:14:28,966 We have them exactly here, you know, in the three first columns. 291 00:14:29,200 --> 00:14:30,333 So let's see. 292 00:14:30,333 --> 00:14:32,666 Let's see how the one hot encoding was done. 293 00:14:32,666 --> 00:14:35,400 This is the first combination 294 00:14:35,400 --> 00:14:38,533 of dummy variables, which corresponds to France. 295 00:14:38,533 --> 00:14:40,633 You know, these are the same rows here. 296 00:14:40,633 --> 00:14:45,100 And therefore France was encoded into 1 0 0. 297 00:14:45,533 --> 00:14:50,100 Now Spain was encoded into 0 0 1. 298 00:14:50,400 --> 00:14:54,166 And finally Germany was encoded into, 299 00:14:54,433 --> 00:14:57,833 well, this 0 1 0. Okay. 300 00:14:58,066 --> 00:15:00,066 So that's our one hot encoding. 301 00:15:00,066 --> 00:15:02,400 Then we no longer see the gender column. 302 00:15:02,400 --> 00:15:03,933 But no worries, it is still here. 303 00:15:03,933 --> 00:15:04,933 And so perfect. 304 00:15:04,933 --> 00:15:09,800 One hot encoding was not only done successfully, but also very efficiently 305 00:15:09,833 --> 00:15:13,033 thanks to our data preprocessing toolkit and template. 306 00:15:13,633 --> 00:15:14,100 Good. 
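The one hot encoding cell could be sketched as follows. This assumes the toolkit wraps scikit-learn's `OneHotEncoder` in a `ColumnTransformer` with `remainder='passthrough'`; the names `ct` and `'encoder'` are illustrative, and the array is a made-up stand-in for `X` after label encoding (credit score, geography, gender, age).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: credit score, geography, encoded gender, age.
X = np.array([[619, 'France', 0, 42],
              [608, 'Spain', 0, 41],
              [502, 'Germany', 1, 39]], dtype=object)

# One hot encode only the column of index 1 (geography);
# every other column is passed through unchanged.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1])],
    remainder='passthrough',
)
X = np.array(ct.fit_transform(X))

print(X)
```

The transformed columns come first in the output, which is why the dummy variables appear in the first three columns. The encoder orders categories alphabetically (France, Germany, Spain), giving 1 0 0 for France, 0 1 0 for Germany and 0 0 1 for Spain.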
307 00:15:14,100 --> 00:15:16,500 Now let's move on to the next step, which is to split 308 00:15:16,500 --> 00:15:19,166 the data set into the training set and the test set. 309 00:15:19,166 --> 00:15:22,166 And once again we're going to do that so efficiently 310 00:15:22,200 --> 00:15:25,200 thanks to this time our data preprocessing template. 311 00:15:25,400 --> 00:15:29,133 Indeed we have to steal now this cell that splits the data 312 00:15:29,133 --> 00:15:31,700 set into the training set and the test set. 313 00:15:31,700 --> 00:15:35,133 So let's paste that back into our implementation 314 00:15:35,133 --> 00:15:38,333 in a new code cell right here. 315 00:15:38,666 --> 00:15:40,766 And now we can just trust this 100%. 316 00:15:40,766 --> 00:15:42,233 We will just play that cell. 317 00:15:42,233 --> 00:15:45,300 And we don't have to do a print of these four entities. 318 00:15:45,300 --> 00:15:48,733 We perfectly understand how they work, but feel free to do it if you want. 319 00:15:48,966 --> 00:15:52,833 You're free to do any modification in this copy, of course, of the notebook. 320 00:15:53,466 --> 00:15:56,000 And finally, we have the final step 321 00:15:56,000 --> 00:15:59,000 of our data preprocessing phase, which is feature scaling. 322 00:15:59,100 --> 00:16:02,100 And now I want to say something very, very important. 323 00:16:02,100 --> 00:16:06,466 Feature scaling is absolutely compulsory for deep learning. 324 00:16:06,466 --> 00:16:11,200 Whenever you build an artificial neural network you have to apply feature scaling. 325 00:16:11,200 --> 00:16:13,133 That's absolutely fundamental. 326 00:16:13,133 --> 00:16:17,200 And it is so fundamental that we will actually apply feature 327 00:16:17,200 --> 00:16:21,033 scaling to all our features, you know, regardless of whether they already 328 00:16:21,033 --> 00:16:22,700 have some values of zero and one. 329 00:16:22,700 --> 00:16:25,133 You know, like the dummy variables. And same for these ones. 
330 00:16:25,133 --> 00:16:28,900 We will just scale everything because it is so important to do it 331 00:16:29,000 --> 00:16:30,366 for deep learning. 332 00:16:30,366 --> 00:16:33,833 So the feature scaling step here will be very simple. 333 00:16:34,000 --> 00:16:37,000 We will just take our data preprocessing toolkit. 334 00:16:37,100 --> 00:16:40,700 We will go right to the end because I think this is our last tool. 335 00:16:40,700 --> 00:16:41,833 Yes, there we go. 336 00:16:41,833 --> 00:16:46,800 We will take that full cell and we will paste it right back 337 00:16:46,800 --> 00:16:51,233 in a new code cell just below Feature Scaling. We'll paste it here. 338 00:16:51,233 --> 00:16:53,266 And now instead of selecting 339 00:16:53,266 --> 00:16:56,500 some specific indexes here, we'll just take everything. 340 00:16:56,500 --> 00:17:01,266 So I'm just removing all our index selections here, right, 341 00:17:01,500 --> 00:17:03,566 so that we can just scale everything. 342 00:17:03,566 --> 00:17:06,800 And that's the way it should be for a neural network, 343 00:17:06,800 --> 00:17:09,800 you know, for building and training a neural network. 344 00:17:10,133 --> 00:17:11,400 All right. So perfect. 345 00:17:11,400 --> 00:17:14,466 This will just apply feature scaling to all the features 346 00:17:14,466 --> 00:17:18,066 of both the training set and the test set. 347 00:17:18,100 --> 00:17:22,566 But of course our scaler object is only fitted to the training set. 348 00:17:22,700 --> 00:17:23,100 Right. 349 00:17:23,100 --> 00:17:26,633 Remember, it's to avoid information leakage. That doesn't change. 350 00:17:26,633 --> 00:17:27,733 But there you go. 351 00:17:27,733 --> 00:17:31,666 Now we have the code to perform feature scaling ready. 352 00:17:31,666 --> 00:17:32,600 So let's do this. 353 00:17:32,600 --> 00:17:39,233 Let's run this final cell and then the data preprocessing phase will be over. 
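The last two steps, splitting and then scaling every feature, might look like this sketch. It assumes scikit-learn's `train_test_split` and `StandardScaler`; the `test_size=0.2` and `random_state=0` values and the random stand-in data are assumptions, since the video does not state the template's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in for the fully encoded matrix of features and the labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(20, 4))
y = rng.integers(0, 2, size=20)

# Split first, so the scaler never sees the test set while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on the training set only, all columns
X_test = sc.transform(X_test)        # reuse the training mean and standard deviation

print(X_train.mean(axis=0))
print(X_test.shape)
```

Calling `fit_transform` on the training set and only `transform` on the test set is what avoids the information leakage mentioned in the video.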
354 00:17:39,733 --> 00:17:43,833 So congratulations, I hope we did it efficiently enough for you. 355 00:17:43,866 --> 00:17:45,200 That's the way it should be. 356 00:17:45,200 --> 00:17:48,300 I'd like to remind you, by the way, that, you know, the data preprocessing 357 00:17:48,300 --> 00:17:52,800 phase counts for 70% of the work of a data scientist. 358 00:17:53,000 --> 00:17:56,833 So that's why it was really important for me to give you a very efficient 359 00:17:56,866 --> 00:18:00,766 data preprocessing template and toolkit so that, as you can see, we can do it 360 00:18:00,766 --> 00:18:04,500 efficiently in less than 20 minutes, you know, in less than 20 minutes 361 00:18:04,500 --> 00:18:06,966 with my explanation, but without the explanation, 362 00:18:06,966 --> 00:18:08,666 even in less than ten minutes. 363 00:18:08,666 --> 00:18:11,200 So I hope you understand and appreciate the importance. 364 00:18:11,200 --> 00:18:12,333 And now, my friends, 365 00:18:12,333 --> 00:18:16,000 it is time for the exciting step, the exciting part of this implementation. 366 00:18:16,000 --> 00:18:19,500 I'm talking of course about Part 2, Building the ANN. 367 00:18:19,500 --> 00:18:20,666 So there we go. 368 00:18:20,666 --> 00:18:23,966 Recharge yourself with good energy, and as soon as you're ready, 369 00:18:23,966 --> 00:18:27,066 let's tackle together Part 2, where we're going to build 370 00:18:27,066 --> 00:18:32,100 for the first time an artificial brain leveraging TensorFlow 2.0. 371 00:18:32,500 --> 00:18:34,633 I can't wait to see you in the next tutorial. 372 00:18:34,633 --> 00:18:36,533 And until then, enjoy machine learning.