1 00:00:00,233 --> 00:00:02,500 Hello and welcome to this R tutorial. 2 00:00:02,500 --> 00:00:05,933 So you learned a lot of stuff, but it would be a shame 3 00:00:05,933 --> 00:00:10,966 to leave this course without having some introduction to one of the most popular 4 00:00:11,166 --> 00:00:14,900 algorithms in machine learning, one that became popular quite recently. 5 00:00:14,900 --> 00:00:15,666 But still, 6 00:00:15,666 --> 00:00:20,133 it is definitely a very powerful model, especially if you work on large data sets. 7 00:00:20,433 --> 00:00:24,166 It will offer you very high performance while being fast to execute. 8 00:00:24,600 --> 00:00:27,600 And speaking of performance and execution speed, 9 00:00:27,900 --> 00:00:32,400 it is important to recall that XGBoost is one of the most powerful implementations 10 00:00:32,400 --> 00:00:37,066 of gradient boosting in terms of model performance and execution speed. 11 00:00:37,366 --> 00:00:40,900 Therefore, it's very important for you to have it in your toolkit. 12 00:00:41,266 --> 00:00:43,133 So let's implement XGBoost. 13 00:00:43,133 --> 00:00:47,066 This is only going to be an introduction, so we will make a simple 14 00:00:47,066 --> 00:00:48,900 implementation of XGBoost. 15 00:00:48,900 --> 00:00:52,500 But you will have the code template on your computer and you'll be able 16 00:00:52,500 --> 00:00:55,500 to try it on your own problems and data sets. 17 00:00:55,533 --> 00:00:58,533 And you'll see that even with this simple implementation, 18 00:00:58,633 --> 00:01:01,633 it will definitely give you some excellent performance. 19 00:01:01,833 --> 00:01:05,100 And so now what we're going to do is take one of the business problems 20 00:01:05,100 --> 00:01:06,733 we dealt with in this course. 21 00:01:06,733 --> 00:01:10,700 This is actually going to be the problem that we solved in the deep 22 00:01:10,700 --> 00:01:11,700 learning section.
23 00:01:11,700 --> 00:01:14,000 Remember, this is the churn modeling problem 24 00:01:14,000 --> 00:01:17,000 where we need to predict the customers of the bank 25 00:01:17,033 --> 00:01:18,400 that will leave the bank. 26 00:01:18,400 --> 00:01:21,933 So this was a classification problem where we classify the customers 27 00:01:21,933 --> 00:01:23,000 into two classes, 28 00:01:23,000 --> 00:01:26,500 those who will leave the bank and those who will not leave the bank. 29 00:01:26,733 --> 00:01:30,766 And so remember, for this problem we obtained an accuracy of 86%. 30 00:01:31,133 --> 00:01:35,600 But that took quite a while because we trained an artificial neural network 31 00:01:35,733 --> 00:01:39,533 with many epochs, and therefore it took quite some time to execute. 32 00:01:40,100 --> 00:01:42,600 And so now in this section we're going to do the same. 33 00:01:42,600 --> 00:01:46,200 We're going to apply XGBoost to this churn modeling problem. 34 00:01:46,333 --> 00:01:47,766 This data set contains, 35 00:01:47,766 --> 00:01:51,366 if I remember, 13 features, but that's not a large data set. 36 00:01:51,633 --> 00:01:55,733 And what is important to highlight is that even if this was a large data set, 37 00:01:55,733 --> 00:01:59,933 a very large data set, well, XGBoost would be one of the best models 38 00:01:59,933 --> 00:02:05,400 in terms of both performance, that is, getting a good accuracy, and execution speed. 39 00:02:05,700 --> 00:02:06,500 So for example, 40 00:02:06,500 --> 00:02:11,333 if you are working with a large data set, I strongly encourage you to test XGBoost. 41 00:02:12,400 --> 00:02:12,733 All right. 42 00:02:12,733 --> 00:02:16,966 So what we're going to do now is take the pre-processing phase. 43 00:02:16,966 --> 00:02:19,966 So we take only part one, because part two is to implement 44 00:02:20,033 --> 00:02:21,900 the artificial neural network.
45 00:02:21,900 --> 00:02:25,933 And so we just want to preprocess the data for this churn modeling problem 46 00:02:26,266 --> 00:02:29,266 associated with this churn modeling CSV file. 47 00:02:29,833 --> 00:02:34,400 But actually we're not going to take everything in this pre-processing phase. 48 00:02:34,700 --> 00:02:37,566 The reason is that for the artificial 49 00:02:37,566 --> 00:02:40,700 neural network, well, feature scaling was totally compulsory. 50 00:02:41,000 --> 00:02:42,166 No questions asked. 51 00:02:42,166 --> 00:02:45,266 Feature scaling must be applied for deep learning. 52 00:02:45,633 --> 00:02:50,600 But the good news is that for XGBoost, well, since XGBoost is a gradient 53 00:02:50,600 --> 00:02:51,966 boosting model with decision 54 00:02:51,966 --> 00:02:55,600 trees, well, accordingly feature scaling is totally unnecessary. 55 00:02:55,833 --> 00:02:58,566 And that's one of the very good things about XGBoost. 56 00:02:58,566 --> 00:03:02,133 Besides its high performance and its fast execution speed, 57 00:03:02,433 --> 00:03:05,933 it's that you can keep the interpretation of your problem, 58 00:03:06,200 --> 00:03:09,866 of your data set, and of the results you'll get after building the model. 59 00:03:10,600 --> 00:03:14,000 So we can understand now why XGBoost is so popular. 60 00:03:14,133 --> 00:03:18,433 It's because it has three qualities: first quality, high performance; 61 00:03:18,433 --> 00:03:22,166 second quality, fast execution speed; and third quality, 62 00:03:22,366 --> 00:03:25,733 you can keep all the interpretation of your problem and your model. 63 00:03:26,133 --> 00:03:29,133 So definitely a model to have in your toolkit. 64 00:03:29,200 --> 00:03:29,500 All right. 65 00:03:29,500 --> 00:03:32,100 So feature scaling is unnecessary here. 66 00:03:32,100 --> 00:03:37,233 And therefore we will take everything from here up to the top like this. 67 00:03:37,766 --> 00:03:38,666 Copy.
68 00:03:38,666 --> 00:03:41,700 And we'll paste that in our XGBoost file. 69 00:03:42,233 --> 00:03:42,733 Here we go. 70 00:03:42,733 --> 00:03:45,733 And now we can implement XGBoost. 71 00:03:45,733 --> 00:03:48,733 So first let's introduce a new section: fitting 72 00:03:49,500 --> 00:03:51,800 XGBoost to the training set. 73 00:03:54,700 --> 00:03:55,500 All right. 74 00:03:55,500 --> 00:03:58,500 And first let's install XGBoost. 75 00:03:58,666 --> 00:04:00,966 So as usual there's a package. 76 00:04:00,966 --> 00:04:05,466 It's the XGBoost package that will allow us to implement XGBoost very efficiently. 77 00:04:05,733 --> 00:04:06,700 So let's type here, 78 00:04:06,700 --> 00:04:10,200 as usual, install.packages 79 00:04:10,533 --> 00:04:13,200 and inside, the name of the XGBoost package, 80 00:04:13,200 --> 00:04:16,100 which is simply xgboost. 81 00:04:16,100 --> 00:04:17,200 Like this. 82 00:04:17,200 --> 00:04:19,733 So then you select this line and press Command 83 00:04:19,733 --> 00:04:22,733 or Control plus Enter to execute. 84 00:04:22,766 --> 00:04:25,766 And that's installing the XGBoost package. 85 00:04:26,033 --> 00:04:28,800 All right. We can see it's processing. 86 00:04:28,800 --> 00:04:30,633 And here again, the downloaded 87 00:04:30,633 --> 00:04:33,600 binary packages are in this package folder. 88 00:04:33,600 --> 00:04:36,266 All good, XGBoost is installed. 89 00:04:36,266 --> 00:04:39,266 So let's put that section in a comment 90 00:04:40,333 --> 00:04:41,100 there. 91 00:04:41,100 --> 00:04:45,900 And now let's import the XGBoost package because indeed we installed it. 92 00:04:46,200 --> 00:04:51,200 But if we go down to the bottom, XGBoost is installed but not imported. 93 00:04:51,533 --> 00:04:53,133 And we want to make it automatic. 94 00:04:53,133 --> 00:04:58,166 So as usual we use the library command and inside, xgboost, 95 00:04:58,833 --> 00:05:01,433 and that will import the package.
96 00:05:01,433 --> 00:05:02,133 All right. 97 00:05:02,133 --> 00:05:04,833 And now let's implement XGBoost. 98 00:05:04,833 --> 00:05:07,166 And actually this is going to take one line 99 00:05:07,166 --> 00:05:11,366 because we're just going to make the classifier, the XGBoost classifier itself. 100 00:05:11,700 --> 00:05:14,366 And so basically we just need to create a new variable 101 00:05:14,366 --> 00:05:18,533 that we call, as usual, classifier, and then equals. 102 00:05:18,533 --> 00:05:22,266 And then we use the xgboost function from this xgboost package. 103 00:05:22,533 --> 00:05:25,766 So xgboost and parentheses, 104 00:05:26,166 --> 00:05:29,066 and let's click here, 105 00:05:29,066 --> 00:05:32,966 press F1 and get some information about this xgboost function. 106 00:05:33,533 --> 00:05:36,533 So we are interested in the arguments, 107 00:05:36,600 --> 00:05:40,066 and which arguments do we need here? Okay. 108 00:05:40,066 --> 00:05:40,966 So first we see that 109 00:05:40,966 --> 00:05:45,400 we have this params parameter which is actually a list of parameters. 110 00:05:45,800 --> 00:05:48,966 And these parameters are all the parameters that you can see here. 111 00:05:49,200 --> 00:05:52,733 For example the eta parameter that controls the learning rate, 112 00:05:53,100 --> 00:05:56,100 the gamma parameter which is the minimum loss reduction. 113 00:05:56,266 --> 00:05:59,000 Well, you have a lot of these parameters, but 114 00:05:59,000 --> 00:06:02,700 this tutorial is just an introduction to XGBoost. 115 00:06:02,700 --> 00:06:06,533 So we will not do any tuning on our XGBoost model in this course.
116 00:06:06,800 --> 00:06:10,600 But I'm sure in some future courses I will make some more complex 117 00:06:10,600 --> 00:06:14,166 implementations of XGBoost on some more complex problems, 118 00:06:14,600 --> 00:06:18,433 but this course is just to end with a simple introduction to XGBoost 119 00:06:18,600 --> 00:06:21,266 so that you can at least have some knowledge about it 120 00:06:21,266 --> 00:06:23,266 and have it in your toolkit. 121 00:06:23,266 --> 00:06:26,566 So let's not focus on this now, and let's move on to the compulsory 122 00:06:26,566 --> 00:06:30,633 parameters. Of course, the first one is data. 123 00:06:31,100 --> 00:06:33,733 So data is of course your training set, 124 00:06:33,733 --> 00:06:36,700 the data set on which you want to train your XGBoost model. 125 00:06:36,700 --> 00:06:38,500 And so let's input that right now. 126 00:06:38,500 --> 00:06:43,533 So first argument: data equals training_set. 127 00:06:44,400 --> 00:06:44,933 Here we go. 128 00:06:44,933 --> 00:06:48,600 And actually here we only need the features in the training set. 129 00:06:48,600 --> 00:06:51,633 So we will remove the dependent variable from this training set, 130 00:06:51,633 --> 00:06:55,900 because this training set contains both the features and the dependent variable. 131 00:06:56,233 --> 00:07:00,133 But what this data parameter expects is only the features. 132 00:07:00,366 --> 00:07:04,666 So here we add some brackets and we remove the dependent variable. 133 00:07:05,033 --> 00:07:06,300 And what is its index? 134 00:07:06,300 --> 00:07:09,366 Well, to find it we need to import the data set. 135 00:07:09,366 --> 00:07:11,000 But first, before importing the data 136 00:07:11,000 --> 00:07:14,100 set, let's quickly set the right folder as working directory. 137 00:07:14,333 --> 00:07:18,133 So right now we're in the Section 49 XGBoost folder. 138 00:07:18,433 --> 00:07:19,533 That's the right folder.
139 00:07:19,533 --> 00:07:21,666 Make sure that you have the churn modeling CSV file. 140 00:07:21,666 --> 00:07:24,666 And then click on Set As Working Directory here. 141 00:07:24,900 --> 00:07:25,733 And here we go. 142 00:07:25,733 --> 00:07:27,766 Now we can import the data set. 143 00:07:27,766 --> 00:07:30,733 So let's import it. And that's the data set. 144 00:07:30,733 --> 00:07:33,900 But remember, in this data set we don't take all the independent 145 00:07:33,900 --> 00:07:38,266 variables because we're not interested in row number, customer ID and surname. 146 00:07:38,266 --> 00:07:42,666 We know that these three variables have no impact on the dependent variable. 147 00:07:42,900 --> 00:07:44,966 So we remove them. 148 00:07:44,966 --> 00:07:46,866 And that's what we do in this line: 149 00:07:46,866 --> 00:07:49,600 dataset equals dataset[4:14]. 150 00:07:49,600 --> 00:07:54,200 That means that we take all the variables from the fourth variable of the data set, 151 00:07:54,533 --> 00:07:58,866 that is credit score, up to the last variable, exited. 152 00:07:58,866 --> 00:08:00,500 That's the dependent variable. 153 00:08:00,500 --> 00:08:03,966 And so let's select this line and execute. 154 00:08:04,366 --> 00:08:09,833 And now if we look at our data set, well, this contains all the relevant features 155 00:08:10,133 --> 00:08:12,266 and the dependent variable exited. 156 00:08:12,266 --> 00:08:15,666 And so the challenge is, with all these independent variables here, 157 00:08:15,900 --> 00:08:19,700 we want to predict if the customer will leave or stay in the bank. 158 00:08:20,066 --> 00:08:21,566 And so that's the data set 159 00:08:21,566 --> 00:08:25,333 we consider to train the model and test its performance. 160 00:08:25,700 --> 00:08:29,300 And therefore we need the index of the dependent variable we have to remove.
161 00:08:29,300 --> 00:08:32,366 Now, in the xgboost function, for the data parameter, 162 00:08:32,366 --> 00:08:35,366 that is the last index here, the one of the exited column. 163 00:08:35,433 --> 00:08:40,000 And since we have 11 variables, well, that index is 11. 164 00:08:40,800 --> 00:08:44,966 So let's go back to XGBoost and let's go back to our function. 165 00:08:45,266 --> 00:08:49,000 And therefore here we have to input -11. 166 00:08:49,766 --> 00:08:50,100 All right. 167 00:08:50,100 --> 00:08:53,566 So we have our whole training set but without the dependent variable. 168 00:08:53,566 --> 00:08:55,766 So that's perfect. That's exactly what we want. 169 00:08:55,766 --> 00:08:57,133 So now let's go back to help 170 00:08:57,133 --> 00:09:00,233 to see if we need some more info about this first parameter. 171 00:09:00,733 --> 00:09:04,500 Well, indeed there is some very important information that we need to consider here. 172 00:09:04,833 --> 00:09:10,466 It's that this input data set needs to be an XGBoost data matrix, an xgb.DMatrix. 173 00:09:10,466 --> 00:09:13,133 So that's basically a type of matrix. 174 00:09:13,133 --> 00:09:16,633 But we can also see that, in addition, 175 00:09:16,633 --> 00:09:19,633 the data parameter also accepts matrices. 176 00:09:20,200 --> 00:09:22,300 But this is not a matrix. 177 00:09:22,300 --> 00:09:24,000 This is a data frame. 178 00:09:24,000 --> 00:09:27,000 So this won't work if we input the features this way. 179 00:09:27,166 --> 00:09:33,300 So we can either convert this into an XGBoost data matrix or a simple matrix. 180 00:09:33,500 --> 00:09:35,400 So let's take the simple solution. 181 00:09:35,400 --> 00:09:38,966 Let's convert this data frame of features into a matrix. 182 00:09:39,200 --> 00:09:40,600 And you know how to do this. 183 00:09:40,600 --> 00:09:43,600 We just need to use the as.matrix 184 00:09:43,600 --> 00:09:47,500 function and put inside some parentheses. 185 00:09:47,500 --> 00:09:48,766 Because it's a function.
186 00:09:48,766 --> 00:09:51,033 This data frame of features. 187 00:09:51,033 --> 00:09:51,666 Here we go. 188 00:09:51,666 --> 00:09:53,400 And now this becomes a matrix. 189 00:09:53,400 --> 00:09:55,533 And that's exactly what we need. 190 00:09:55,533 --> 00:09:58,200 All right, perfect. Then next argument. 191 00:09:58,200 --> 00:10:00,733 So here again you have a lot of other arguments. 192 00:10:00,733 --> 00:10:02,633 But these are not compulsory. 193 00:10:02,633 --> 00:10:04,400 So we won't focus on them now. 194 00:10:04,400 --> 00:10:08,266 But the next compulsory argument is this label argument. 195 00:10:08,633 --> 00:10:11,633 Because indeed here we input the matrix of features. 196 00:10:11,700 --> 00:10:14,566 But of course to train a classification model we need 197 00:10:14,566 --> 00:10:17,933 not only the matrix of features but also the dependent variable. 198 00:10:18,200 --> 00:10:21,533 And that's what we put in this label parameter. 199 00:10:21,733 --> 00:10:26,066 And so as you might expect, since we input the features as a matrix, 200 00:10:26,333 --> 00:10:30,266 well, we need to input this label parameter as a vector. 201 00:10:30,766 --> 00:10:33,600 And to get our dependent variable as a vector, 202 00:10:33,600 --> 00:10:37,366 we need to input label equals training_set, 203 00:10:38,200 --> 00:10:39,166 then dollar, 204 00:10:39,166 --> 00:10:42,200 and then we take the name of our dependent variable, which is Exited. 205 00:10:42,766 --> 00:10:45,000 And this will give us a vector. 206 00:10:45,000 --> 00:10:49,966 So training_set$Exited is the dependent variable, but given as a vector. 207 00:10:50,266 --> 00:10:51,400 So that's exactly what we need. 208 00:10:51,400 --> 00:10:55,400 Because indeed, as you can see, label is expected to be a vector, 209 00:10:55,800 --> 00:10:57,900 the vector of response values. 210 00:10:57,900 --> 00:11:01,500 The response values are of course the values of the dependent variable.
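To make the two inputs described above concrete (a matrix of features and a vector of labels), here is a minimal sketch in base R. The small `training_set` data frame is made up for the illustration; only the `Exited` column name comes from the tutorial, and the dependent variable sits at index 3 here instead of 11.

```r
# Hypothetical stand-in for the training set: two features plus the label.
training_set <- data.frame(
  CreditScore = c(619, 608, 502, 699),
  Age         = c(42, 41, 42, 39),
  Exited      = c(1, 0, 1, 0)
)

# Index of the dependent variable (11 in the tutorial's data set, 3 here).
label_index <- which(names(training_set) == "Exited")

# A negative index drops that column; as.matrix() then converts the
# data frame of features into a plain numeric matrix.
features <- as.matrix(training_set[-label_index])

# The $ operator returns the dependent variable as a plain vector.
labels <- training_set$Exited

print(dim(features))  # number of rows and columns of the feature matrix
print(labels)
```

Looking up the index by column name, as done here, is one way to avoid hard-coding a number such as 11 when the set of columns changes.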
211 00:11:02,333 --> 00:11:03,066 All right. 212 00:11:03,066 --> 00:11:05,266 Now next argument. 213 00:11:05,266 --> 00:11:06,566 What is the next argument? 214 00:11:06,566 --> 00:11:11,433 Well, there is a third compulsory argument that we need to input here. 215 00:11:11,433 --> 00:11:13,433 And that is actually above. 216 00:11:13,433 --> 00:11:17,133 But I wanted to put the label after the matrix of features; that made 217 00:11:17,400 --> 00:11:18,733 kind of sense. 218 00:11:18,733 --> 00:11:21,733 And now there is a third argument that we need to input, 219 00:11:21,966 --> 00:11:24,200 which is the nrounds argument. 220 00:11:24,200 --> 00:11:27,900 And the nrounds argument is the maximum number of iterations. 221 00:11:28,200 --> 00:11:31,166 So since we're not working on a too complex problem, 222 00:11:31,166 --> 00:11:34,600 well, a maximum number of ten iterations will be sufficient. 223 00:11:34,900 --> 00:11:36,166 So we will input here 224 00:11:36,166 --> 00:11:39,166 nrounds equals ten. 225 00:11:39,400 --> 00:11:42,733 And XGBoost will be trained in a maximum of ten iterations. 226 00:11:43,533 --> 00:11:44,333 Perfect. 227 00:11:44,333 --> 00:11:48,400 And now actually this line of code is ready 228 00:11:48,400 --> 00:11:51,933 to be executed to train the XGBoost classifier. 229 00:11:52,300 --> 00:11:55,533 So even if XGBoost is a very advanced 230 00:11:55,533 --> 00:11:58,633 machine learning model, well, thanks to this XGBoost package, 231 00:11:58,900 --> 00:12:03,900 you just need a single simple line of code to implement it very efficiently. 232 00:12:04,900 --> 00:12:07,300 All right, we're not going to execute this line now 233 00:12:07,300 --> 00:12:10,500 because first we need to run the data preprocessing phase. 234 00:12:10,500 --> 00:12:11,266 And then 235 00:12:11,266 --> 00:12:15,533 I would like to add some code sections to evaluate our XGBoost model performance.
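The single training line built up above can be sketched as follows. This assumes a classic (1.x) release of the xgboost R package, where `xgboost()` takes `data`, `label` and `nrounds`; later releases may expose different argument names, so check `?xgboost` for the installed version. The synthetic `training_set` is an assumption made so the sketch runs without the churn CSV, and its dependent variable is at index 3 rather than 11.

```r
# install.packages("xgboost")  # run once, then keep it commented out
library(xgboost)

# Synthetic stand-in for the preprocessed training set (illustration only).
set.seed(123)
n <- 200
training_set <- data.frame(
  CreditScore = round(runif(n, 350, 850)),
  Age         = round(runif(n, 18, 80)),
  Exited      = rbinom(n, 1, 0.2)
)

# Features as a matrix, labels as a vector, at most 10 boosting rounds.
# With the tutorial's data set the index would be -11 instead of -3.
classifier <- xgboost(
  data    = as.matrix(training_set[-3]),
  label   = training_set$Exited,
  nrounds = 10
)
```

No `params` list is passed, matching the tutorial's choice to leave tuning parameters such as `eta` and `gamma` at their defaults.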
236 00:12:15,666 --> 00:12:18,666 So we are going to execute the whole thing in the end. 237 00:12:18,666 --> 00:12:24,066 But for now let's add the last sections to evaluate the XGBoost performance. 238 00:12:24,066 --> 00:12:24,833 And of course 239 00:12:24,833 --> 00:12:29,333 we are going to take our k-fold cross-validation technique to evaluate it. 240 00:12:29,500 --> 00:12:32,500 And therefore here I'm going to take the k-fold 241 00:12:32,500 --> 00:12:35,500 cross-validation section, which is right here. 242 00:12:35,633 --> 00:12:39,033 And we are going to use it on our XGBoost model. 243 00:12:39,366 --> 00:12:41,733 So here I just need to copy this section, 244 00:12:41,733 --> 00:12:45,166 go back to my XGBoost model and paste it here. 245 00:12:45,666 --> 00:12:47,566 And be careful: inside of it, 246 00:12:47,566 --> 00:12:49,333 we need to change the classifier 247 00:12:49,333 --> 00:12:52,200 because right here that's the kernel SVM classifier. 248 00:12:52,200 --> 00:12:56,800 And so basically we just need to replace this kernel SVM classifier 249 00:12:57,233 --> 00:13:00,400 by our XGBoost classifier. 250 00:13:00,733 --> 00:13:02,533 So I'm just copying that here. 251 00:13:02,533 --> 00:13:05,600 And I go back to my k-fold cross-validation section 252 00:13:06,000 --> 00:13:11,466 and paste the code to train the XGBoost classifier on the training set right here. 253 00:13:12,200 --> 00:13:12,600 All right. 254 00:13:12,600 --> 00:13:15,600 And then we need to add another line of code inside this section. 255 00:13:15,900 --> 00:13:19,433 It's related to the fact that this XGBoost model 256 00:13:19,433 --> 00:13:22,433 will return the predictions as probabilities. 257 00:13:22,566 --> 00:13:25,566 You know, it will return the probability of class one. 258 00:13:25,666 --> 00:13:28,533 And therefore you know this trick to convert 259 00:13:28,533 --> 00:13:32,066 the probabilities into the real predictions 0 or 1.
260 00:13:32,400 --> 00:13:35,400 Well, we need to add this line of code: y_pred 261 00:13:36,666 --> 00:13:40,200 equals, and then in parentheses, y_pred 262 00:13:41,233 --> 00:13:44,266 larger than 0.5. 263 00:13:44,700 --> 00:13:49,533 So if the probability is larger than 0.5, then y_pred will be one. 264 00:13:49,833 --> 00:13:54,433 And if the probability is lower than 0.5, then y_pred will be zero. 265 00:13:54,733 --> 00:13:57,700 So that's how we'll get the binary outcome 0 or 1. 266 00:13:57,700 --> 00:14:01,433 And that's exactly what this k-fold cross-validation section expects. 267 00:14:01,866 --> 00:14:04,733 And eventually, before we execute the whole thing, 268 00:14:04,733 --> 00:14:07,066 there are two things that we still need to change. 269 00:14:07,066 --> 00:14:09,833 First, it's the fact that since the training set 270 00:14:09,833 --> 00:14:13,500 is expected to be a matrix, well, that's going to be the same for the test set. 271 00:14:13,766 --> 00:14:17,900 So here we also need to add as.matrix. 272 00:14:18,266 --> 00:14:21,333 And inside of the parentheses we put our test fold. 273 00:14:22,066 --> 00:14:23,400 So that's the first change. 274 00:14:23,400 --> 00:14:25,366 And now the second change is of course 275 00:14:25,366 --> 00:14:28,366 related to the index of the dependent variable. 276 00:14:28,366 --> 00:14:29,400 Because 3 277 00:14:29,400 --> 00:14:32,733 here was the index of the dependent variable in our previous problem 278 00:14:33,033 --> 00:14:35,700 where we implemented k-fold cross-validation. 279 00:14:35,700 --> 00:14:39,600 So we need to replace this index 3 by the index of the dependent 280 00:14:39,600 --> 00:14:43,266 variable in our new problem, which is not 3 but 11. 281 00:14:43,666 --> 00:14:48,400 And same right here in the confusion matrix: it is 11. 282 00:14:49,200 --> 00:14:49,800 All right. 283 00:14:49,800 --> 00:14:51,633 And now everything is ready.
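The thresholding trick and the confusion-matrix accuracy mentioned above can be illustrated with a few made-up numbers. This is a base R sketch; the probabilities and true labels are invented for the example and do not come from the churn data.

```r
# Hypothetical predicted probabilities of class 1 and the true labels.
y_prob <- c(0.91, 0.12, 0.55, 0.40, 0.73)
y_true <- c(1, 0, 0, 0, 1)

# The comparison yields TRUE/FALSE; as.numeric() turns that into 1/0.
y_pred <- as.numeric(y_prob > 0.5)

# Confusion matrix: rows are true classes, columns are predicted classes.
# Fixing the factor levels keeps the table 2 x 2 even if a class is absent.
cm <- table(
  factor(y_true, levels = c(0, 1)),
  factor(y_pred, levels = c(0, 1))
)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

print(y_pred)
print(accuracy)
```

Here four of the five predictions match the true labels, so the accuracy is 0.8.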
284 00:14:51,633 --> 00:14:53,766 We can execute the whole code. 285 00:14:53,766 --> 00:14:54,933 So let's do it. 286 00:14:54,933 --> 00:14:57,566 And let's see which accuracy we get. 287 00:14:57,566 --> 00:14:59,966 So let's go back to the top. 288 00:14:59,966 --> 00:15:01,966 We already imported the data set. 289 00:15:01,966 --> 00:15:06,500 So now let's encode the categorical variables as factors. 290 00:15:06,933 --> 00:15:08,700 Here we go. Done. 291 00:15:08,700 --> 00:15:11,100 Now let's split the data set into the training set 292 00:15:11,100 --> 00:15:14,200 and the test set. Here we go. Done as well. 293 00:15:14,733 --> 00:15:17,933 And now let's fit XGBoost to the training set. 294 00:15:18,300 --> 00:15:20,700 So the XGBoost package was already imported. 295 00:15:20,700 --> 00:15:25,866 So basically we just need to select this line and execute. 296 00:15:26,400 --> 00:15:27,166 Here we go. 297 00:15:27,166 --> 00:15:30,733 We get the information of the root mean squared error at each round. 298 00:15:30,966 --> 00:15:33,366 So basically the root mean squared error is a relevant 299 00:15:33,366 --> 00:15:34,833 computation of the error. 300 00:15:34,833 --> 00:15:36,566 You can picture this as the error. 301 00:15:36,566 --> 00:15:40,266 And of course the lower the error, the better your model. 302 00:15:40,566 --> 00:15:43,966 And we can see that from the first iteration to the last one, 303 00:15:43,966 --> 00:15:45,100 the 10th one, 304 00:15:45,100 --> 00:15:49,666 well, the error decreased from 0.41 down to 0.29. 305 00:15:49,800 --> 00:15:52,800 And besides, we can see that the maximum number 306 00:15:52,800 --> 00:15:56,566 of ten iterations was a good choice, because we can see that 307 00:15:56,566 --> 00:16:00,066 it is more or less converging around 0.30.
308 00:16:00,400 --> 00:16:03,400 Well, feel free to try with more iterations and try to see 309 00:16:03,400 --> 00:16:06,400 if it's converging to a number that is less than 0.30. 310 00:16:06,533 --> 00:16:10,566 If you get a number close to 0.30, then ten iterations was a good choice. 311 00:16:11,166 --> 00:16:15,200 So perfect, XGBoost is implemented and trained on the training set. 312 00:16:15,500 --> 00:16:18,200 And now let's apply k-fold cross-validation 313 00:16:18,200 --> 00:16:21,566 to evaluate its performance with the accuracy metric. 314 00:16:22,066 --> 00:16:25,066 And actually I'm noticing that there is still one thing to change. 315 00:16:25,066 --> 00:16:27,166 It's the name of the dependent variable here. 316 00:16:27,166 --> 00:16:30,666 Congratulations to those of you who noticed that we need to replace 317 00:16:30,666 --> 00:16:34,366 Purchased here by the real name of the dependent variable in our problem, 318 00:16:34,666 --> 00:16:37,666 which is not Purchased but Exited. 319 00:16:37,966 --> 00:16:41,400 So let's replace Purchased here by Exited. 320 00:16:42,366 --> 00:16:43,133 And here we go. 321 00:16:43,133 --> 00:16:45,166 Now everything should be fine. 322 00:16:45,166 --> 00:16:48,500 Let's do one last check: as.matrix for the training set, 323 00:16:48,500 --> 00:16:52,500 as.matrix for the test set, y_pred converted into a binary outcome 324 00:16:52,500 --> 00:16:55,933 0 or 1, indexes are correct for the dependent variable. 325 00:16:56,300 --> 00:16:57,533 Everything looks fine. 326 00:16:57,533 --> 00:17:00,566 Let's select this whole section here 327 00:17:00,900 --> 00:17:04,933 and get the ultimate accuracy of our XGBoost model. 328 00:17:05,266 --> 00:17:06,766 Here we go. 329 00:17:06,766 --> 00:17:09,033 All executed properly and very fast, 330 00:17:09,033 --> 00:17:13,133 and we get a final accuracy of 88%.
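The cross-validation run described above can be sketched roughly as follows. Several things here are assumptions rather than the tutorial's exact code: the data is synthetic, the folds are built with base R instead of a helper package, accuracy is computed as the share of matching predictions (equivalent to the diagonal of the confusion matrix over its total), and the `xgboost()` call uses the classic 1.x argument names, which may differ in later package releases.

```r
library(xgboost)

# Synthetic stand-in for the training set (illustration only); the
# dependent variable is at index 3 here, not 11 as in the tutorial.
set.seed(123)
n <- 300
age <- round(runif(n, 18, 80))
balance <- round(runif(n, 0, 200000))
exited <- rbinom(n, 1, plogis((age - 50) / 10))
training_set <- data.frame(Age = age, Balance = balance, Exited = exited)

# Assign every row to one of k folds at random.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

accuracies <- sapply(seq_len(k), function(i) {
  training_fold <- training_set[fold_id != i, ]
  test_fold <- training_set[fold_id == i, ]

  # Train on the k - 1 other folds, features as a matrix, label as a vector.
  classifier <- xgboost(
    data    = as.matrix(training_fold[-3]),
    label   = training_fold$Exited,
    nrounds = 10,
    verbose = 0
  )

  # The test fold must be a matrix too; predictions come back as scores
  # that are thresholded into a 0/1 outcome.
  y_pred <- predict(classifier, newdata = as.matrix(test_fold[-3]))
  y_pred <- as.numeric(y_pred > 0.5)

  mean(y_pred == test_fold$Exited)
})

# Final estimate: the mean accuracy over the k folds.
accuracy <- mean(accuracies)
print(accuracy)
```

On this synthetic data the printed accuracy says nothing about the churn problem; the sketch only shows the mechanics that produce the figure quoted in the tutorial.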
331 00:17:13,533 --> 00:17:18,066 So not only was that very efficient, but also we managed to beat the accuracy 332 00:17:18,066 --> 00:17:23,266 obtained with the ANN. And besides, this value is the relevant accuracy of XGBoost. 333 00:17:23,266 --> 00:17:26,366 So we can trust this value of 88%. 334 00:17:26,633 --> 00:17:27,700 So that's very good. 335 00:17:27,700 --> 00:17:32,100 Not only was XGBoost very fast, but it also gave us an amazing accuracy, 336 00:17:32,100 --> 00:17:35,600 probably the best of all the models we implemented in this course. 337 00:17:36,100 --> 00:17:38,500 So that was an amazing job. 338 00:17:38,500 --> 00:17:40,800 And now it is time to say goodbye, 339 00:17:40,800 --> 00:17:43,766 because this was actually the last tutorial of this course. 340 00:17:43,766 --> 00:17:46,766 So that's quite a feeling because this is the end of this machine 341 00:17:46,766 --> 00:17:50,400 learning journey that I introduced in my very first tutorial of this course. 342 00:17:50,600 --> 00:17:53,466 So yes, that's right, that's the end of the journey. 343 00:17:53,466 --> 00:17:57,000 However, I am sure this is not the last machine learning journey. 344 00:17:57,300 --> 00:17:59,400 This is your first machine learning journey. 345 00:17:59,400 --> 00:18:01,800 I was so happy to take this adventure with you. 346 00:18:01,800 --> 00:18:03,066 I really enjoyed that journey. 347 00:18:03,066 --> 00:18:04,700 I hope that's the case for you too, 348 00:18:04,700 --> 00:18:06,633 and I'll be very happy to make some new machine 349 00:18:06,633 --> 00:18:09,666 learning courses to start some new machine learning journeys. 350 00:18:09,933 --> 00:18:11,633 So I hope I'll see you very soon. 351 00:18:11,633 --> 00:18:13,466 And until then, enjoy machine learning.