In this video, we are going to learn how to create an XGBoost model.

This is going to be a little different from the models that we have been creating so far. Usually we just had one test dataset and one train dataset, and in the formula we mentioned what the dependent variable is and what the independent variables are. We told the function which data to use, and the model was trained.

But to train a model using XGBoost, we need to prepare the data in a particular format, and that format is called xgb.DMatrix. So we'll first learn how to get the data ready in the format that XGBoost will be able to run on and train the model.

The package that we are going to use for this is xgboost. If it is already installed, you can just load it with the library() command; if it is not, we will install the package first. Once the package is installed, we run the library() command.

Now, to prepare the data, first we have to separate the dependent and independent variables. When we separate the dependent variable, that is, the variable that we want to predict, that variable should be in the form of a Boolean array: it should contain the values TRUE or FALSE.

So, if you remember, our Start_Tech_Oscar variable contained the values
zero or one. I will create a new variable, trainY, and this trainY will contain Boolean values. It will contain TRUE if Start_Tech_Oscar contained 1, and it will contain FALSE if Start_Tech_Oscar contained 0. So I'm checking here whether the value is 1 or not: if it is 1, then trainY will have the value TRUE; if it is 0, then it will have the value FALSE. So I'll run this line.

And look at this trainY variable: it has the values FALSE, TRUE, TRUE and so on. You can type trainY and hit Enter to see all the values of trainY.

Next, for trainX, we need to create a model matrix. The point is that we cannot have any categorical variable containing text categories; we need to convert categorical variables to dummy variables, that is, variables that have only the values 0 and 1.

So suppose you have a categorical variable containing values such as Yes and No. For example, if we go and open the train dataset and look at the 3D_available variable, it has the values YES and NO.
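To recap the trainY step in code, here is a minimal sketch; the data frame below is a made-up stand-in for the course's train dataset, and only the names Start_Tech_Oscar and trainY follow the video:

```r
# Toy stand-in for the train data frame; only the dependent column matters here.
train <- data.frame(
  Start_Tech_Oscar = c(1, 0, 1, 1, 0),           # 0/1 target from the course data
  Collection       = c(48000, 43200, 69400, 66800, 72400)
)

# TRUE where the value is 1, FALSE where it is 0 -- the Boolean label vector.
trainY <- train$Start_Tech_Oscar == 1
print(trainY)   # TRUE FALSE TRUE TRUE FALSE
```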
And there is a Genre variable which has the values Drama, Comedy, Action and Thriller. These text labels are not allowed in XGBoost; we need to convert these variables to numeric-type variables.

One way of doing that is to create dummy variables for these variables. A dummy variable for 3D_available would have the value 1 if the value is YES, and it would have 0 if the value is NO. Similarly, for the Genre variable we will make N minus 1 dummy variables; that is, since there are four categories, we will create 4 minus 1, that is, three dummy variables. The first dummy variable will have the value 1 wherever the Genre variable has the value Drama. The second dummy variable of Genre will have the value 1 wherever the Genre variable has the value Comedy. The third will have the value 1 wherever Genre has the value Thriller. And when all three dummy variables are 0, the Genre variable has the value Action.

So in this way, we will create dummy variables for all the categorical variables. And a short way to do that is to create a model matrix. So we will create a new variable, trainX. It will have the values from the model.matrix() function, which takes this formula.
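As a standalone illustration of the dummy-coding rule just described, here is a manual version on a made-up Genre vector, with Action as the all-zeros baseline level:

```r
# Four genres -> 4 - 1 = 3 dummy variables; Action is the baseline,
# encoded as all three dummies being 0.
genre <- c("Drama", "Comedy", "Action", "Thriller", "Drama")

dummies <- data.frame(
  GenreDrama    = as.integer(genre == "Drama"),
  GenreComedy   = as.integer(genre == "Comedy"),
  GenreThriller = as.integer(genre == "Thriller")
)
print(dummies)   # row 3 (Action) is 0 0 0
```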
Any variable on the left of this tilde will not be converted to dummy variables; variables on the right of the tilde will be converted to dummy variables. We want all the other categorical variables to be converted to dummy variables, so there is a dot. Then we have a minus one; this removes the intercept column from the created model matrix. The data to be used here is train.

Let's run this, and we will look at trainX so that you get a better understanding of what we have created. So this is trainX. If we look at this matrix, it has all the categorical variables converted to dummy variables. You can see that the 3D_available variable now has two dummy variables, and Genre has three. We should delete this extra dummy variable in the 12th column; the rest of it is ready. So we'll just go and delete that extra column. To do that, we will write trainX gets trainX[, -12]. This minus 12 is because we want to delete the 12th column. So we'll run this command. You can go to the trainX data to confirm that the 12th column is gone.

So with this, our trainX data is ready in the model matrix format. Now we need to do the same thing with the test data.
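The model-matrix step can be sketched on a toy data frame. The column names mirror the course dataset, but the values, and the position of the redundant dummy, are illustrative (in the video it happens to be column 12):

```r
# Toy frame: 0/1 target plus one numeric and two categorical predictors.
train <- data.frame(
  Start_Tech_Oscar = c(1, 0, 1, 0),
  Budget           = c(36.5, 35.2, 48.1, 43.0),
  X3D_available    = factor(c("YES", "NO", "NO", "YES")),
  Genre            = factor(c("Drama", "Comedy", "Action", "Thriller"))
)

# "~ . - 1": everything right of the tilde is dummy-coded, and the intercept
# is dropped. Without an intercept the FIRST factor keeps ALL of its levels,
# so one dummy (here X3D_availableNO) is redundant and is dropped afterwards.
trainX <- model.matrix(Start_Tech_Oscar ~ . - 1, data = train)
print(colnames(trainX))

redundant <- which(colnames(trainX) == "X3D_availableNO")
trainX <- trainX[, -redundant]   # same idea as trainX[, -12] in the video
```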
For the test data also, we will separate the Y and X parts of it. We will run testY equal to this condition, so that we have testY in the Boolean format; that is, if Start_Tech_Oscar in the test dataset is 1, it will contain the value TRUE, and if it is 0, it will contain FALSE. We'll run this, and testY is created with TRUE/FALSE values.

For testX also, we'll run the same thing, and we will delete the extra column. Just to confirm, we'll open testX and check whether the extra dummy variable is again in the 12th column. It is. So we will copy the same line, change train to test, and run it. And testX now no longer has that extra column.

Now, as I told you, XGBoost takes its input as a DMatrix, so to create a DMatrix, this is the code that we run: xgb.DMatrix is the function, data is the X part, the trainX model matrix that we created, and label is the Y part, that is, the classification part that we want to predict. Similarly for the test set also, we will create this DMatrix. So we'll run these two commands to convert the data into the DMatrix format.

So by doing all this, we have prepared the data to be run in the xgboost() function.
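A minimal, hedged sketch of the DMatrix step: tiny stand-in matrices replace the real trainX/trainY, the object name Xmatrix is an assumption, and the call is guarded so the snippet is harmless where the xgboost package is not installed:

```r
# Stand-ins for the prepared model matrix and Boolean label vector.
trainX <- matrix(c(1, 0, 0, 1, 1, 0), nrow = 3)   # 3 rows, 2 columns
trainY <- c(TRUE, FALSE, TRUE)

if (requireNamespace("xgboost", quietly = TRUE)) {
  # Wrap the numeric matrix and labels into the format XGBoost trains on.
  Xmatrix <- xgboost::xgb.DMatrix(data = trainX, label = trainY)
  print(dim(Xmatrix))   # rows and columns of the wrapped matrix
}
```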
Now we will use this DMatrix to train the model. So we use the xgboost() function; data is equal to the X matrix, this is the data, and nrounds is the number of iterations that the boosting algorithm will do. For now, I have kept it at 50. You can change it to 20 or 200 to see the performance. The objective I have set to multi:softmax. So let us see what objective options we have. I have pressed F1 on xgboost to open the help on xgboost. In this, we want to look at the objective parameter.

If you scroll down, you can see these are the learning task parameters. In these, we specify the learning task. If you are going to do linear regression, you use reg:linear. If you want to do logistic regression, you write reg:logistic. If you want the model to classify, you can use multi:softmax, which sets XGBoost to do multiclass classification using the softmax objective.

So since our objective is classification, here we will use multi:softmax. If you have a regression objective, you can use the other two: reg:linear for linear regression, reg:logistic for logistic regression.

eta is equal to 0.3. eta is the learning parameter, which decides the learning rate; eta can have values between 0 and 1.
And if you look at the help section, it tells you that eta controls the learning rate. So if you put a very small value, it would mean that you need to run the model for a larger number of rounds so that it learns completely. If you have a large value of eta, learning is very fast and you can reduce the number of rounds, but in that scenario the model may not be able to completely fit the data. So it is preferable to keep a lower value for eta and a higher value of nrounds, so that your model fits the data completely.

num_class is a parameter which is specific to this objective. That is, since it is a multiclass classification objective, we need to specify the number of classes in the objective variable. Since our objective variable has only two classes, num_class will have the value 2.

max_depth controls the growth of the tree. So a max_depth value of 100 means that the final tree can have a maximum depth of 100 levels. By default, it has a maximum depth of 6.

So I'll run this command. Once the 50 rounds of iteration are complete, the values of the boosted model are stored in this XGBoosting variable. Now, using this XGBoosting variable, we will predict the values into the xgpred variable.
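Putting the pieces together, the training step might look like the sketch below. Everything here is illustrative: the data is synthetic, xgb.train is used as a stable equivalent of the xgboost() call from the video, and the parameter values are the ones discussed above. The snippet is guarded in case the xgboost package is absent.

```r
set.seed(42)
# Synthetic stand-ins for the prepared trainX matrix and trainY labels.
trainX <- matrix(runif(90), ncol = 3)            # 30 rows, 3 columns
trainY <- trainX[, 1] + trainX[, 2] > 1          # synthetic two-class target

if (requireNamespace("xgboost", quietly = TRUE)) {
  Xmatrix <- xgboost::xgb.DMatrix(data = trainX, label = trainY)
  model <- xgboost::xgb.train(
    params = list(
      objective = "multi:softmax",  # multiclass classification via softmax
      num_class = 2,                # our target has two classes (0 / 1)
      eta       = 0.3,              # learning rate, between 0 and 1
      max_depth = 100               # per-tree depth cap (default is 6)
    ),
    data    = Xmatrix,
    nrounds = 50                    # boosting iterations
  )
  xgpred <- predict(model, Xmatrix) # class predictions come back as 0/1
}
```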
So we will again use the predict function, giving it the model name and the DMatrix for the test set. So this is the DMatrix of the test set. If we run this line, the predicted values of the classes are stored in the xgpred variable.

Now, to see the prediction accuracy of this model, we will run this table command to create a confusion matrix. And you can see that our model is correctly classifying these two sets of observations: these 31 have the predicted value 0 and the actual value was also 0, or FALSE, and these 43 were predicted as 1 and the actual value was 1. So for 74 out of 113 observations, we are getting correct predictions on the test set. So 74 out of 113 is correct. So basically, we have a prediction accuracy of 65.5 percent.

We can change the parameter values to get a different prediction accuracy. So, for example, if I decrease the max depth and run this model again, and then predict the values and create the table again, now I am correctly predicting 71 cases instead of 74 cases. So reducing the depth in this scenario has reduced the accuracy of my model.
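The accuracy arithmetic from the confusion matrix can be checked directly; the counts 31 and 43 and the test-set total of 113 are the figures read off in the video:

```r
# Accuracy = correctly classified / total test observations.
correct  <- 31 + 43      # the two diagonal cells of the confusion matrix
total    <- 113          # size of the test set
accuracy <- correct / total
cat(round(accuracy * 100, 1), "percent\n")   # 65.5 percent

# Given a table `tab` built with table(testY, xgpred), the same quantity
# is sum(diag(tab)) / sum(tab).
```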
So you can check it for different values of nrounds, different values of eta, different values of max_depth, and see the prediction accuracy at different values of the different parameters. If you know how to do looping, that is, if you know how to run loops in R, you can also run a loop and find out the different prediction accuracies at different values of nrounds. So, for example, you can run a loop over nrounds, changing its value from 10 to 100, and for each scenario you find out the prediction accuracy; wherever you get the best prediction accuracy, you keep that value of nrounds. That is something we will not be covering in this course; however, that is also possible.

So this is how we do XGBoost. And you have seen that initially, when we created very simple decision trees, we could plot them, we could easily interpret them, and we could use those visuals in our presentations easily. They basically were very interpretable. But to increase the prediction accuracy, we traded off that interpretability and gained prediction accuracy. We talked about ensemble methods: bagging, random forest and boosting. In boosting, we further discussed gradient boosting, AdaBoost and XGBoost.
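Returning to the tuning loop mentioned above: a hedged sketch of trying nrounds from 10 to 100 and keeping the best value. The data is synthetic, the accuracy here is measured on the training DMatrix purely for brevity, and the block is guarded in case xgboost is not installed:

```r
set.seed(1)
# Synthetic stand-ins for the prepared data.
trainX <- matrix(runif(200), ncol = 4)           # 50 rows, 4 columns
trainY <- trainX[, 1] + trainX[, 2] > 1

grid <- seq(10, 100, by = 10)   # candidate nrounds values, 10 to 100

if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = trainX, label = trainY)
  acc <- sapply(grid, function(n) {
    m <- xgboost::xgb.train(
      params = list(objective = "multi:softmax", num_class = 2,
                    eta = 0.3, max_depth = 6),
      data    = dtrain,
      nrounds = n
    )
    mean(predict(m, dtrain) == trainY)   # accuracy at this nrounds value
  })
  best <- grid[which.max(acc)]           # keep the best-scoring nrounds
  cat("best nrounds:", best, "\n")
}
```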
All of these methods showed great improvement in prediction accuracy as compared to a simple decision tree or a pruned decision tree. So now you know both parts: how to create a simple decision tree, and how to create a very advanced gradient-boosted decision tree model.