1 00:00:00,166 --> 00:00:02,300 Hello and welcome to this R tutorial. 2 00:00:02,300 --> 00:00:04,600 I'm super excited to be in the deep learning part. 3 00:00:04,600 --> 00:00:08,000 This is one of the most fascinating and exciting branches of machine learning, 4 00:00:08,366 --> 00:00:11,000 and besides, it's one of the most powerful. 5 00:00:11,000 --> 00:00:13,666 In the following tutorials, we're going to solve the business problem 6 00:00:13,666 --> 00:00:16,500 described by Kirill at the beginning of this section, 7 00:00:16,500 --> 00:00:19,133 and you will see that we are going to get strong results 8 00:00:19,133 --> 00:00:23,466 thanks to this artificial neural network that we are about to build with R. 9 00:00:23,866 --> 00:00:24,600 So as usual, 10 00:00:24,600 --> 00:00:27,900 we are going to make this artificial neural network model very efficiently. 11 00:00:28,100 --> 00:00:29,633 And we're going to use the best package 12 00:00:29,633 --> 00:00:33,166 for that, which I will let you find out about in the next tutorials. 13 00:00:33,733 --> 00:00:34,766 So let's start. 14 00:00:34,766 --> 00:00:39,533 And the first step of our journey is the boring step: data preprocessing. 15 00:00:39,800 --> 00:00:43,200 But we will do it very efficiently because we have our template, 16 00:00:43,200 --> 00:00:45,766 our classification template that I've prepared here. 17 00:00:45,766 --> 00:00:48,433 And why can we use this classification template? 18 00:00:48,433 --> 00:00:52,200 Well, it's by nature of the business problem. You saw in the business 19 00:00:52,200 --> 00:00:55,200 problem description that you have some independent variables. 20 00:00:55,200 --> 00:00:59,033 And with these independent variables, you have to predict a dependent variable 21 00:00:59,033 --> 00:01:00,833 that has a binary outcome. 
22 00:01:00,833 --> 00:01:03,833 And since the outcome of the dependent variable is binary, 23 00:01:03,833 --> 00:01:06,033 that means it's a categorical variable. 24 00:01:06,033 --> 00:01:09,033 That means we have to predict classes zero and one, 25 00:01:09,100 --> 00:01:12,800 and therefore that makes our problem a classification problem. 26 00:01:13,033 --> 00:01:16,033 And so okay we're going to build a deep learning model. 27 00:01:16,133 --> 00:01:17,066 But this deep learning 28 00:01:17,066 --> 00:01:20,066 model is going to be nothing else than a classification model. 29 00:01:20,300 --> 00:01:24,133 And that's why we are going to use our classification template that we have here. 30 00:01:24,433 --> 00:01:28,733 And that will save us a lot of time to build our artificial neural network. 31 00:01:28,766 --> 00:01:32,400 And besides, we want to focus on the deep learning model itself. 32 00:01:32,400 --> 00:01:36,033 So we will get there very quickly thanks to this template. 33 00:01:36,500 --> 00:01:36,866 All right. 34 00:01:36,866 --> 00:01:42,333 So let's take everything from this template from the top down to here. 35 00:01:42,333 --> 00:01:46,666 Because we cannot use this section here because, you know this is the section 36 00:01:46,666 --> 00:01:48,533 to visualize the training set results. 37 00:01:48,533 --> 00:01:50,300 And the test set results as well. 38 00:01:50,300 --> 00:01:53,266 But only when we have two independent variables, 39 00:01:53,266 --> 00:01:56,566 because one independent variable corresponds to one dimension. 40 00:01:56,933 --> 00:01:58,200 And since now in the data 41 00:01:58,200 --> 00:02:02,200 set of the business problem, we have, I think 10 or 11 independent variables. 42 00:02:02,400 --> 00:02:06,266 Well, then it's a little bit hard to represent something in 11 dimensions. 43 00:02:06,533 --> 00:02:11,100 So we won't take this, but we will definitely take everything that's above. 
44 00:02:11,366 --> 00:02:15,866 So I'm going to copy that and let's go back to our ANN model 45 00:02:16,166 --> 00:02:19,666 and paste this classification template right here. 46 00:02:20,233 --> 00:02:20,966 All right. 47 00:02:20,966 --> 00:02:24,033 And now in this template we're going to change very few things. 48 00:02:24,033 --> 00:02:28,133 And of course we are going to build our artificial neural network 49 00:02:28,333 --> 00:02:31,666 right here in this section, "Create your classifier here". 50 00:02:31,833 --> 00:02:36,666 We can already replace "classifier" here by "ANN" to then build the model. 51 00:02:37,033 --> 00:02:37,866 All right. 52 00:02:37,866 --> 00:02:38,700 But of course 53 00:02:38,700 --> 00:02:42,366 we need to make sure everything's okay in all the data preprocessing steps. 54 00:02:42,666 --> 00:02:45,066 And that's what we're going to do right now. 55 00:02:45,066 --> 00:02:46,133 All right. So let's start. 56 00:02:46,133 --> 00:02:49,833 Let's start with the basic step: setting the right folder as a working directory. 57 00:02:50,133 --> 00:02:53,333 So right now I'm on my desktop. I'm going to my Machine Learning A-Z folder. 58 00:02:53,333 --> 00:02:56,100 Then we are in Part 8, Deep Learning. 59 00:02:56,100 --> 00:02:59,100 And now Section 40, Artificial Neural Networks. 60 00:02:59,133 --> 00:02:59,966 Here we go. 61 00:02:59,966 --> 00:03:02,900 Make sure that you have the Churn_Modelling.csv file. 62 00:03:02,900 --> 00:03:05,333 And if that's the case you can click on this More button here 63 00:03:05,333 --> 00:03:08,433 and then Set As Working Directory. Great. 64 00:03:08,433 --> 00:03:10,433 And now let's change a few things. 65 00:03:10,433 --> 00:03:13,800 So first of all let's start with this section, importing the data set. 66 00:03:14,100 --> 00:03:17,033 Well, the name of the data set is not Social_Network_Ads. 67 00:03:17,033 --> 00:03:20,800 It's now Churn_Modelling. 
68 00:03:21,500 --> 00:03:22,266 All right. 69 00:03:22,266 --> 00:03:24,300 We are now ready to import the data set. 70 00:03:24,300 --> 00:03:27,100 So let's do it right now. I'm going to select this line 71 00:03:27,100 --> 00:03:28,733 and execute. 72 00:03:28,733 --> 00:03:29,033 All right. 73 00:03:29,033 --> 00:03:30,366 Data set well imported. 74 00:03:30,366 --> 00:03:32,833 Well, we have actually 14 variables. 75 00:03:32,833 --> 00:03:37,200 But let's see if we include all these variables in the real data set, 76 00:03:37,500 --> 00:03:40,500 you know, the one on which we want to build our deep learning model. 77 00:03:40,733 --> 00:03:44,166 So let's see. I'm going to click on this data set here 78 00:03:44,500 --> 00:03:47,733 to see which independent variables we include in the model. 79 00:03:48,433 --> 00:03:50,133 All right, so just a quick reminder. 80 00:03:50,133 --> 00:03:53,766 This data set contains 10,000 observations containing 81 00:03:53,766 --> 00:03:57,833 some information about customers in a bank, like the surname, 82 00:03:57,833 --> 00:04:01,800 the credit score, geography, gender, age and all the other information here. 83 00:04:02,166 --> 00:04:05,966 And during six months the bank looked, for each customer, 84 00:04:05,966 --> 00:04:09,866 at whether the customer stayed or left the bank within the six-month period. 85 00:04:10,166 --> 00:04:11,033 And this result, 86 00:04:11,033 --> 00:04:15,433 whether the customer stayed or left, is given in this last column here, Exited. 87 00:04:15,633 --> 00:04:19,800 So one means that the customer left the bank during the six months, and zero 88 00:04:19,800 --> 00:04:22,833 means that the customer stayed in the bank during the six months. 89 00:04:23,533 --> 00:04:27,666 So what's important to understand now is that all these variables here, 90 00:04:27,666 --> 00:04:32,066 from row number to estimated salary, are the independent variables. 
91 00:04:32,233 --> 00:04:35,333 And the last column here, Exited, is the dependent variable. 92 00:04:35,833 --> 00:04:41,466 So right now our goal is to make a model where we can predict this result, 93 00:04:41,466 --> 00:04:44,633 Exited here, whether the customer left or stayed in the bank, 94 00:04:44,966 --> 00:04:48,400 from the information contained in all these independent variables here. 95 00:04:49,100 --> 00:04:52,466 But the thing is that in these independent variables, 96 00:04:52,466 --> 00:04:56,100 some definitely don't have an impact on this dependent variable, 97 00:04:56,100 --> 00:04:57,000 Exited. 98 00:04:57,000 --> 00:05:01,066 And so now what we have to do is only take the independent variables 99 00:05:01,266 --> 00:05:04,133 that could have an impact on, and correlations 100 00:05:04,133 --> 00:05:07,566 with, the decision of the customer to leave or stay in the bank. 101 00:05:08,200 --> 00:05:09,533 And so that's what we're going to do right now. 102 00:05:09,533 --> 00:05:12,800 So let's look at each of these independent variables one by one. 103 00:05:13,000 --> 00:05:16,000 And let's see which ones we keep in our model. 104 00:05:16,500 --> 00:05:19,200 All right, so let's start with the first one, row number. 105 00:05:19,200 --> 00:05:22,900 Well, row number has definitely no impact on the dependent variable Exited. 106 00:05:23,100 --> 00:05:26,400 So of course we will not include it. Then customer ID. 107 00:05:26,733 --> 00:05:28,366 Well, customer ID, that's the same. 108 00:05:28,366 --> 00:05:30,600 That's just an identification number. 109 00:05:30,600 --> 00:05:34,066 This definitely doesn't have any impact on the decision of the customer 110 00:05:34,066 --> 00:05:35,600 to stay in or leave the bank. 111 00:05:35,600 --> 00:05:37,900 So we will not include that either. 112 00:05:37,900 --> 00:05:38,866 Then the surname. 113 00:05:38,866 --> 00:05:40,100 Well, that's the same. 
114 00:05:40,100 --> 00:05:42,866 It's not because your name is Andrews that you have more chance 115 00:05:42,866 --> 00:05:46,166 to leave the bank than if your name is Romeo. 116 00:05:46,700 --> 00:05:49,700 All right, so we don't include surname either, 117 00:05:49,900 --> 00:05:53,700 but then we have credit score, and credit score might have an impact 118 00:05:53,700 --> 00:05:56,600 on the decision of the customer to stay in or leave the bank. 119 00:05:56,600 --> 00:05:59,766 Indeed, we can assume that customers with a low credit score 120 00:05:59,866 --> 00:06:03,366 are more likely to leave the bank than customers with a high credit score, 121 00:06:03,533 --> 00:06:06,633 so definitely we will include credit score in our model. 122 00:06:07,366 --> 00:06:09,300 All right, then we have geography. 123 00:06:09,300 --> 00:06:09,566 Well, 124 00:06:09,566 --> 00:06:13,766 maybe some customers are more likely to leave the bank in one specific country. 125 00:06:13,766 --> 00:06:17,200 And that can be due to external factors like the economy of the country 126 00:06:17,400 --> 00:06:18,566 or any other factors. 127 00:06:18,566 --> 00:06:19,800 But yes, definitely, 128 00:06:19,800 --> 00:06:22,266 there might be some correlations between the countries 129 00:06:22,266 --> 00:06:24,833 and the decision to stay in or leave the bank. 130 00:06:24,833 --> 00:06:27,000 So we will include that as well. Then 131 00:06:27,000 --> 00:06:29,000 gender. Well, that's the same. 132 00:06:29,000 --> 00:06:33,266 Maybe men or women are more likely to stay in the bank than the other. 133 00:06:33,266 --> 00:06:35,533 So we need to check it out. Then age. 134 00:06:35,533 --> 00:06:36,600 Well, that's the same. 135 00:06:36,600 --> 00:06:38,800 And that's even quite intuitive. 
136 00:06:38,800 --> 00:06:42,366 We might expect that younger people are more likely to leave the bank 137 00:06:42,666 --> 00:06:43,666 than older people, 138 00:06:43,666 --> 00:06:46,933 because older people have more balance and have more stability. 139 00:06:47,233 --> 00:06:49,566 So we include age as well. Then tenure. 140 00:06:49,566 --> 00:06:52,566 So tenure is for how long the customer has been in the bank. 141 00:06:52,900 --> 00:06:53,900 And so that's the same. 142 00:06:53,900 --> 00:06:57,000 We might expect that customers that have been in the bank for a long time 143 00:06:57,233 --> 00:07:00,300 are more likely to stay in the bank than recent customers. 144 00:07:00,600 --> 00:07:03,566 So yes, we'll take it. Then balance. 145 00:07:03,566 --> 00:07:07,666 Well, balance, of course. We might expect that this customer with this 146 00:07:07,666 --> 00:07:11,200 high balance has a lot more chance to stay in the bank 147 00:07:11,400 --> 00:07:16,133 than this customer with the zero balance. All right, then the number of products. 148 00:07:16,133 --> 00:07:19,200 So that's the number of banking products the customers have in the bank. 149 00:07:19,400 --> 00:07:23,066 And so of course, maybe the customers with many products in the bank 150 00:07:23,233 --> 00:07:26,200 are more likely to stay than customers with, for example, 151 00:07:26,200 --> 00:07:28,766 one product in the bank. So we'll need to check it out. 152 00:07:28,766 --> 00:07:29,833 Those are just assumptions. 153 00:07:29,833 --> 00:07:33,300 It's the model that will find out about these correlations more thoroughly. 154 00:07:33,566 --> 00:07:37,200 But, you know, definitely from our intuition we need to include 155 00:07:37,200 --> 00:07:39,100 number of products as well. 156 00:07:39,100 --> 00:07:40,500 Then has credit card. 157 00:07:40,500 --> 00:07:43,500 Well, that's a little bit of the same as this variable. 
158 00:07:43,533 --> 00:07:46,000 Customers that have a credit card might be more likely 159 00:07:46,000 --> 00:07:49,000 to stay in the bank than customers that don't have a credit card. 160 00:07:49,033 --> 00:07:52,033 So yes. Is active member: that's the same. 161 00:07:52,033 --> 00:07:53,400 If a customer is active, 162 00:07:53,400 --> 00:07:56,533 then this customer is more likely to stay in the bank than a customer 163 00:07:56,533 --> 00:07:57,633 that is not active. 164 00:07:57,633 --> 00:08:00,533 So yes, it might be a significant independent variable. 165 00:08:00,533 --> 00:08:02,400 Then estimated salary. 166 00:08:02,400 --> 00:08:05,500 Well, that's the salary of the customer estimated by the bank. 167 00:08:05,866 --> 00:08:09,733 And it would make sense that customers with a high estimated salary 168 00:08:09,966 --> 00:08:13,833 have more chance to leave the bank than customers with a low estimated salary. 169 00:08:14,133 --> 00:08:14,533 All right. 170 00:08:14,533 --> 00:08:17,733 So that was the last independent variable of this data set. 171 00:08:18,000 --> 00:08:21,900 So now we know which independent variables we include in our data set. 172 00:08:22,200 --> 00:08:26,166 And that's what we're going to specify right now by updating our data set, 173 00:08:26,333 --> 00:08:28,733 taking only the indexes of the independent variables 174 00:08:28,733 --> 00:08:30,500 we want to include in the model. 175 00:08:30,500 --> 00:08:33,200 So let's see what these indexes are. Okay. 176 00:08:33,200 --> 00:08:35,266 So indexes in R start at one. 177 00:08:35,266 --> 00:08:37,900 And so basically we're taking all the independent variables 178 00:08:37,900 --> 00:08:41,300 from credit score up to estimated salary. 179 00:08:41,666 --> 00:08:45,900 So let's see: index one, index two, index three, index four. 180 00:08:45,900 --> 00:08:50,366 So we are taking the indexes 4, 5, 6, 7, 8, 9, 181 00:08:50,366 --> 00:08:53,733 10, 11, 12 and 13. 
182 00:08:54,133 --> 00:08:54,566 All right. 183 00:08:54,566 --> 00:08:58,733 So we are taking the indexes from 4 to 14. 184 00:08:58,733 --> 00:09:01,800 Because, you know, in R it's not like in Python, where we separate 185 00:09:01,800 --> 00:09:05,166 a matrix of features and the dependent variable vector; 186 00:09:05,400 --> 00:09:07,800 we include all the variables in one data frame. 187 00:09:07,800 --> 00:09:10,300 And so we include the dependent variable. Great. 188 00:09:10,300 --> 00:09:12,533 So let's input these indexes. 189 00:09:12,533 --> 00:09:16,300 So we just said that we want to take the indexes from four, 190 00:09:16,500 --> 00:09:19,433 so that's the index of the first independent variable, 191 00:09:19,433 --> 00:09:23,600 up to the index 14, which is the index of the dependent variable. 192 00:09:24,366 --> 00:09:25,200 And that's great. 193 00:09:25,200 --> 00:09:29,666 Now we can update our data set by selecting this line and executing. 194 00:09:30,433 --> 00:09:30,933 Great. 195 00:09:30,933 --> 00:09:34,533 And now, as you can see if I go back to the data set here, 196 00:09:34,733 --> 00:09:39,900 we have all our potentially statistically significant independent variables 197 00:09:39,900 --> 00:09:43,933 that might have an impact on the dependent variable Exited. 198 00:09:44,133 --> 00:09:48,166 And so now the first step of data preprocessing is completed. 199 00:09:48,433 --> 00:09:49,566 We imported correctly 200 00:09:49,566 --> 00:09:53,300 the data set by choosing all the relevant independent variables. 201 00:09:54,066 --> 00:09:56,366 Okay. Now let's move on to the second step. 202 00:09:56,366 --> 00:09:59,566 The second step is encoding the target feature as factor. 203 00:10:00,000 --> 00:10:03,633 Well, we don't really need to do that because the dependent variable of our data 204 00:10:03,633 --> 00:10:07,800 set is a categorical variable with a binary outcome, 1 or 0. 
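The column selection walked through above (keeping indexes 4 to 14) might look roughly like the sketch below. The toy data frame only stands in for the imported file: every value is invented, and the exact column spellings are an assumption based on the names read out in the tutorial.

```r
# Toy stand-in for the imported data set: 14 columns in the order listed in
# the tutorial. All values are invented; the column spellings are assumed.
dataset <- data.frame(
  RowNumber       = 1:3,
  CustomerId      = c(101, 102, 103),
  Surname         = c("A", "B", "C"),
  CreditScore     = c(600, 700, 550),
  Geography       = c("France", "Spain", "Germany"),
  Gender          = c("Female", "Male", "Female"),
  Age             = c(40, 35, 52),
  Tenure          = c(2, 7, 1),
  Balance         = c(0, 85000, 120000),
  NumOfProducts   = c(1, 2, 1),
  HasCrCard       = c(1, 0, 1),
  IsActiveMember  = c(1, 1, 0),
  EstimatedSalary = c(50000, 90000, 30000),
  Exited          = c(1, 0, 1)
)

# R indexes start at 1, so 4:14 drops the first three identifier columns
# and keeps everything from CreditScore through Exited.
dataset <- dataset[4:14]
print(names(dataset))
```

Because the dependent variable stays in the same data frame, it ends up as the last of the 11 remaining columns.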
205 00:10:08,100 --> 00:10:10,633 And the thing to understand is that the package we're going to use 206 00:10:10,633 --> 00:10:15,066 is going to recognize it as a categorical variable with a binary outcome. 207 00:10:15,300 --> 00:10:21,000 So we actually don't need to encode this target feature Exited as a factor. 208 00:10:21,000 --> 00:10:23,866 So I'm going to remove this line. We don't need it. 209 00:10:23,866 --> 00:10:27,933 However, we do need to do something regarding some categorical variables. 210 00:10:28,200 --> 00:10:29,933 Of course I'm talking about 211 00:10:29,933 --> 00:10:33,700 the two categorical independent variables we have in our data set. 212 00:10:34,000 --> 00:10:37,866 And these two variables are of course geography and gender. 213 00:10:38,433 --> 00:10:40,166 So we have two problems here. 214 00:10:40,166 --> 00:10:42,566 So we need to do two things here for these variables. 215 00:10:42,566 --> 00:10:46,000 The first thing we need to do is to convert them to factors. 216 00:10:46,333 --> 00:10:48,166 And then we will need to do something 217 00:10:48,166 --> 00:10:51,733 more than we used to do when encoding our categorical variables: 218 00:10:52,000 --> 00:10:54,600 it's to set them as numeric. 219 00:10:54,600 --> 00:10:57,433 And to do this we'll use the as.numeric function. 220 00:10:57,433 --> 00:11:00,000 And why do we need to do this, especially here? 221 00:11:00,000 --> 00:11:00,600 Well, that is 222 00:11:00,600 --> 00:11:03,966 just because the deep learning package that we're going to use requires it. 223 00:11:04,233 --> 00:11:09,333 And that's the only reason: it expects factors, but set as numeric, 224 00:11:09,466 --> 00:11:10,666 that is, numeric vectors. 225 00:11:10,666 --> 00:11:13,800 So let's do this. I'm going back to my ANN model. 
226 00:11:14,133 --> 00:11:16,933 And so first we're going to change this to say 227 00:11:16,933 --> 00:11:19,800 that we're encoding the categorical 228 00:11:21,100 --> 00:11:23,533 variables as factors. 229 00:11:23,533 --> 00:11:24,000 All right. 230 00:11:24,000 --> 00:11:25,566 And now we're going to take this 231 00:11:25,566 --> 00:11:29,633 categorical data file that we made in Part 1, Data Preprocessing, 232 00:11:30,033 --> 00:11:34,766 because, you know, there is the code ready to encode any categorical data. 233 00:11:35,100 --> 00:11:40,733 So I'm going to select all of this and paste it here in this second 234 00:11:40,733 --> 00:11:45,000 step of data preprocessing to encode the categorical variables as factors. 235 00:11:45,700 --> 00:11:47,000 All right. So let's do this. 236 00:11:47,000 --> 00:11:50,766 We just need to replace the names of the variables and then add 237 00:11:50,900 --> 00:11:54,700 this as.numeric function to set the factors as numeric. 238 00:11:55,000 --> 00:11:57,600 So let's start by replacing all the names here. 239 00:11:57,600 --> 00:12:00,466 Well, the first categorical variable gives the countries, 240 00:12:00,466 --> 00:12:02,000 but it is not called Country. 241 00:12:02,000 --> 00:12:04,200 It is called Geography. 242 00:12:04,200 --> 00:12:08,400 So we will replace Country here by Geography. 243 00:12:09,233 --> 00:12:10,000 Same here. 244 00:12:12,900 --> 00:12:13,800 And the good 245 00:12:13,800 --> 00:12:17,866 news is that now we don't need to change the names of the categories here, 246 00:12:17,866 --> 00:12:21,333 France, Spain and Germany, because those are the same names. 247 00:12:21,333 --> 00:12:22,800 So that's great. 248 00:12:22,800 --> 00:12:25,433 And we will keep the labels 1, 2, 3. 249 00:12:25,433 --> 00:12:26,333 All right, that's good. 250 00:12:26,333 --> 00:12:31,466 And now we add this as.numeric function 251 00:12:31,800 --> 00:12:34,800 to set the factors as numeric. 
252 00:12:34,833 --> 00:12:37,000 So I'm putting this whole factor function 253 00:12:37,000 --> 00:12:40,666 here inside the parentheses of the as.numeric function. 254 00:12:41,100 --> 00:12:43,366 And now I just need to align everything. 255 00:12:43,366 --> 00:12:45,066 Well, here we go. 256 00:12:45,066 --> 00:12:48,066 And same for here. 257 00:12:48,133 --> 00:12:49,533 All right. Great. 258 00:12:49,533 --> 00:12:52,533 And now let's do the same for the second categorical variable. 259 00:12:52,633 --> 00:12:55,766 So we need to replace Purchased here by Gender. 260 00:12:56,533 --> 00:12:57,700 So let's do it. 261 00:12:57,700 --> 00:13:00,700 Purchased replaced by Gender. 262 00:13:01,433 --> 00:13:02,100 All right. 263 00:13:02,100 --> 00:13:05,000 Same here, Gender. 264 00:13:05,000 --> 00:13:08,400 And now we replace the two categories No and Yes by 265 00:13:08,700 --> 00:13:11,700 Female and Male. 266 00:13:12,266 --> 00:13:14,533 And here we can give the labels we want. 267 00:13:14,533 --> 00:13:19,366 So let's for example take labels one for female and two for male. 268 00:13:19,966 --> 00:13:20,566 All right. 269 00:13:20,566 --> 00:13:23,566 And let's not forget to add the 270 00:13:24,000 --> 00:13:27,066 as.numeric function, which I remind you we just 271 00:13:27,066 --> 00:13:30,066 do for the future deep learning package that we're going to use. 272 00:13:30,133 --> 00:13:33,366 So parentheses here, parentheses here. 273 00:13:33,366 --> 00:13:35,600 And now let's align everything. 274 00:13:35,600 --> 00:13:38,600 Here we go. All right. Great. 275 00:13:38,800 --> 00:13:40,066 So now everything is ready. 276 00:13:40,066 --> 00:13:44,133 This section is ready, and it encodes, as required 277 00:13:44,133 --> 00:13:47,800 by the deep learning package, our categorical independent variables. 278 00:13:48,233 --> 00:13:48,600 All right. 279 00:13:48,600 --> 00:13:51,066 So I'm going to select all this section here. 
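The encoding section described above could be sketched as follows. This is only an illustration on a made-up four-row data frame; the column names Geography and Gender and the category spellings follow what is said in the tutorial, and the exact code in the template may differ.

```r
# Made-up toy data with the two categorical independent variables.
dataset <- data.frame(
  Geography = c("France", "Spain", "Germany", "France"),
  Gender    = c("Female", "Male", "Female", "Male"),
  stringsAsFactors = FALSE
)

# Wrap factor() in as.numeric() so each column becomes a numeric vector.
dataset$Geography <- as.numeric(factor(dataset$Geography,
                                       levels = c("France", "Spain", "Germany"),
                                       labels = c(1, 2, 3)))
dataset$Gender <- as.numeric(factor(dataset$Gender,
                                    levels = c("Female", "Male"),
                                    labels = c(1, 2)))
print(dataset)
```

One thing worth knowing: as.numeric() on a factor returns the position of each level, not the label text. The result matches the labels here only because the labels 1, 2, 3 are given in the same order as the levels.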
280 00:13:51,066 --> 00:13:54,066 And let's execute. 281 00:13:54,166 --> 00:13:55,866 All right. Executed properly. 282 00:13:55,866 --> 00:13:59,600 Now let's have a look at the data set to see what the variables became. 283 00:13:59,766 --> 00:14:00,366 Perfect. 284 00:14:00,366 --> 00:14:02,300 Geography was encoded 285 00:14:02,300 --> 00:14:06,400 into one, two and three, categories that are numeric categories. 286 00:14:06,766 --> 00:14:10,333 And gender: one for female and two for male. 287 00:14:10,800 --> 00:14:13,066 Great, and again as numeric vectors. 288 00:14:14,066 --> 00:14:14,666 Perfect. 289 00:14:14,666 --> 00:14:16,933 So this section is now completed. 290 00:14:16,933 --> 00:14:19,166 And let's move on to the next one. 291 00:14:19,166 --> 00:14:21,900 We can see how we're getting very efficient at this. 292 00:14:21,900 --> 00:14:25,066 The next one is about splitting the data set into the training set 293 00:14:25,066 --> 00:14:26,100 and the test set. 294 00:14:26,100 --> 00:14:29,866 We need to do that because we will train our artificial neural network 295 00:14:30,100 --> 00:14:33,900 on the training set, and we will test its performance on the test set. 296 00:14:34,200 --> 00:14:35,133 So we'll do that. 297 00:14:35,133 --> 00:14:37,000 But let's not execute too fast. 298 00:14:37,000 --> 00:14:38,933 We need to replace Purchased here 299 00:14:38,933 --> 00:14:42,600 by the name of the dependent variable, which is Exited. 300 00:14:44,000 --> 00:14:47,000 And maybe we can change the split ratio as well. 301 00:14:47,033 --> 00:14:52,766 You know, put 80% for the training set so that we have 8000 observations to train 302 00:14:52,766 --> 00:14:56,800 our artificial neural network and 2000 observations to test 303 00:14:57,133 --> 00:14:59,833 its performance on new observations, 304 00:14:59,833 --> 00:15:02,333 that is, the new observations of the test set. 305 00:15:02,333 --> 00:15:03,400 So now that's ready. 
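To make the 80/20 split concrete, here is a minimal base-R sketch on simulated data. The transcript does not show the template's own splitting code, so this is not necessarily what the template does: the sample() approach, the seed and the simulated columns are all assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Simulated stand-in for the 10,000-row data set (values are random).
n <- 10000
dataset <- data.frame(
  CreditScore = rnorm(n, mean = 650, sd = 100),
  Exited      = rbinom(n, size = 1, prob = 0.2)
)

# Draw 80% of the row indexes at random for training; the rest is the test set.
train_idx    <- sample(seq_len(n), size = round(0.8 * n))
training_set <- dataset[train_idx, ]
test_set     <- dataset[-train_idx, ]

print(c(train = nrow(training_set), test = nrow(test_set)))
```

A plain random draw like this does not guarantee the same share of Exited = 1 in both sets. If the template uses a helper that splits on the dependent variable, that helper may preserve the class proportions, which this sketch does not attempt.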
306 00:15:03,400 --> 00:15:05,133 We don't have to do anything more here. 307 00:15:05,133 --> 00:15:09,133 The most important thing is not to forget to replace Purchased by Exited. 308 00:15:09,533 --> 00:15:15,200 And so now I'm going to select all this section and execute. Perfect. 309 00:15:15,333 --> 00:15:19,633 Now we have our training set and our test set. 310 00:15:20,633 --> 00:15:21,366 Great. 311 00:15:21,366 --> 00:15:22,933 So that is the whole data set. 312 00:15:22,933 --> 00:15:25,933 That is our training set with 8000 observations. 313 00:15:26,200 --> 00:15:29,400 And that is our test set with 2000 observations. 314 00:15:30,000 --> 00:15:30,600 Perfect. 315 00:15:30,600 --> 00:15:32,466 Now let's go back to our ANN, 316 00:15:32,466 --> 00:15:35,900 and we are finally getting to the last step of data preprocessing. 317 00:15:36,100 --> 00:15:38,100 And that is feature scaling. 318 00:15:38,100 --> 00:15:39,400 So now the question is: 319 00:15:39,400 --> 00:15:43,333 do we need to apply feature scaling to train an artificial neural network? 320 00:15:43,600 --> 00:15:45,600 And the answer is yes. 321 00:15:45,600 --> 00:15:46,600 Absolutely. 322 00:15:46,600 --> 00:15:49,300 That's 100% compulsory. 323 00:15:49,300 --> 00:15:50,900 And that is because training 324 00:15:50,900 --> 00:15:53,900 an artificial neural network is highly compute intensive. 325 00:15:54,066 --> 00:15:56,333 So there are going to be a lot of computations, 326 00:15:56,333 --> 00:15:58,566 and besides, parallel computations. 327 00:15:58,566 --> 00:16:00,900 So definitely we need to apply feature scaling. 328 00:16:00,900 --> 00:16:03,900 And besides, it is required by the package. 329 00:16:03,900 --> 00:16:05,333 So we will execute this. 330 00:16:05,333 --> 00:16:08,433 But before that, let's not forget to change the indexes. 
331 00:16:08,833 --> 00:16:11,933 These indexes three here were the index 332 00:16:11,933 --> 00:16:14,933 of the dependent variable in Part 1, Data Preprocessing. 333 00:16:15,000 --> 00:16:17,966 So right now we just need to replace this index three 334 00:16:17,966 --> 00:16:21,133 here by our new index of the dependent variable. 335 00:16:21,566 --> 00:16:22,700 And so what is this index? 336 00:16:22,700 --> 00:16:25,533 That is the index of the Exited column. 337 00:16:25,533 --> 00:16:27,600 Well, we can see that directly here. 338 00:16:27,600 --> 00:16:30,500 This data set has 11 variables. 339 00:16:30,500 --> 00:16:34,233 So that means that the Exited column here has index 11. 340 00:16:34,833 --> 00:16:40,433 So let's replace three here by 11, then here as well: 341 00:16:40,433 --> 00:16:44,300 11, 11 and 11. 342 00:16:44,866 --> 00:16:45,500 Great. 343 00:16:45,500 --> 00:16:47,800 And now the feature scaling section is ready. 344 00:16:47,800 --> 00:16:50,766 So let's select the whole section 345 00:16:50,766 --> 00:16:53,700 and execute. Great. 346 00:16:53,700 --> 00:16:56,700 And now if we have a look at our training set, 347 00:16:57,000 --> 00:17:00,000 well, yes, definitely everything is scaled. 348 00:17:00,066 --> 00:17:01,833 And our test set, same. 349 00:17:01,833 --> 00:17:03,600 Everything is definitely scaled. 350 00:17:03,600 --> 00:17:04,466 We are happy. 351 00:17:04,466 --> 00:17:09,000 We are ready to build our artificial neural network. 352 00:17:09,300 --> 00:17:11,533 And that's what we're going to do in the next tutorial. 353 00:17:11,533 --> 00:17:13,400 So I'm super excited to start. 354 00:17:13,400 --> 00:17:14,866 I look forward to seeing you there. 355 00:17:14,866 --> 00:17:16,733 And until then, enjoy machine learning.
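For reference, the feature scaling step with index 11 might look like the sketch below. It runs on simulated data with ten generic feature columns plus Exited; the helper make_toy_set is hypothetical, and scaling each set with its own mean and standard deviation is an assumption consistent with the four replaced indexes mentioned above, not a confirmed copy of the template.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: n rows, 10 random numeric features, Exited as column 11.
make_toy_set <- function(n) {
  toy <- as.data.frame(matrix(rnorm(n * 10, mean = 50, sd = 20), ncol = 10))
  toy$Exited <- rbinom(n, size = 1, prob = 0.2)
  toy
}

training_set <- make_toy_set(80)
test_set     <- make_toy_set(20)

# Scale every column except the dependent variable at index 11.
training_set[-11] <- scale(training_set[-11])
test_set[-11]     <- scale(test_set[-11])

print(round(colMeans(training_set[-11]), 10))
```

After scale(), each feature column has mean 0 and standard deviation 1, while Exited keeps its original 0/1 values. An alternative design, not described in the tutorial, would reuse the training set's centers and scales for the test set.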