In this video, we are going to learn how to create an XGBoost model.

This is going to be a little different from the models that we have been creating so far. Usually we just had one test dataset and one train dataset, and in the formula we mentioned what the dependent variable is and what the independent variables are. We told the function which data to use, and the model was trained.

But to train a model using XGBoost, we need to prepare the data in a particular format, and that format is called xgb.DMatrix. So we'll first learn how to get the data ready in the format that XGBoost will be able to run on and train the model.

The package that we are going to use for this is xgboost. If it is already installed, you can just load it with the library() command; if it is not, we will install the package first. Once the package is installed, we run the library() command.

Now, to prepare the data, first we have to separate the dependent and independent variables. When we separate the dependent variable, that is, the variable that we want to predict, that variable should be in the form of a Boolean array: it should contain the values TRUE or FALSE.

So, if you remember, our Start_Tech_Oscar variable contained the values
zero or one. I will create a new variable, trainY, and this trainY will contain Boolean values. It will contain TRUE if Start_Tech_Oscar contained 1, and it will contain FALSE if Start_Tech_Oscar contained 0. So I'm checking here whether the value is 1 or not: if it is 1, then trainY will have the value TRUE; if it is 0, then it will have the value FALSE. So I'll run this line.

And look at this trainY variable: it has the values FALSE, TRUE, TRUE and so on. You can type trainY and hit Enter to see all the values of trainY.

Next, for trainX, we need to create a model matrix. The point is that we cannot have any categorical variable containing text categories; we need to convert categorical variables to dummy variables, that is, variables that have only the values 0 and 1.

So suppose you have a categorical variable containing values such as Yes and No. For example, if we go and open the train dataset and look at the 3D_available variable, it has the values YES and NO.
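To recap the trainY step in code, here is a minimal sketch; the data frame below is a made-up stand-in for the course's train dataset, and only the names Start_Tech_Oscar and trainY follow the video:

```r
# Toy stand-in for the train data frame; only the dependent column matters here.
train <- data.frame(
  Start_Tech_Oscar = c(1, 0, 1, 1, 0),           # 0/1 target from the course data
  Collection       = c(48000, 43200, 69400, 66800, 72400)
)

# TRUE where the value is 1, FALSE where it is 0 -- the Boolean label vector.
trainY <- train$Start_Tech_Oscar == 1
print(trainY)   # TRUE FALSE TRUE TRUE FALSE
```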
And there is a Genre variable which has the values Drama, Comedy, Action and Thriller. These text labels are not allowed in XGBoost; we need to convert these variables to numeric-type variables.

One way of doing that is to create dummy variables for these variables. A dummy variable for 3D_available would have the value 1 if the value is YES, and it would have 0 if the value is NO. Similarly, for the Genre variable we will make N minus 1 dummy variables; that is, since there are four categories, we will create 4 minus 1, that is, three dummy variables. The first dummy variable will have the value 1 wherever the Genre variable has the value Drama. The second dummy variable of Genre will have the value 1 wherever the Genre variable has the value Comedy. The third will have the value 1 wherever Genre has the value Thriller. And when all three dummy variables are 0, the Genre variable has the value Action.

So in this way, we will create dummy variables for all the categorical variables. And a short way to do that is to create a model matrix. So we will create a new variable, trainX. It will have the values from the model.matrix() function, which takes this formula.
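As a standalone illustration of the dummy-coding rule just described, here is a manual version on a made-up Genre vector, with Action as the all-zeros baseline level:

```r
# Four genres -> 4 - 1 = 3 dummy variables; Action is the baseline,
# encoded as all three dummies being 0.
genre <- c("Drama", "Comedy", "Action", "Thriller", "Drama")

dummies <- data.frame(
  GenreDrama    = as.integer(genre == "Drama"),
  GenreComedy   = as.integer(genre == "Comedy"),
  GenreThriller = as.integer(genre == "Thriller")
)
print(dummies)   # row 3 (Action) is 0 0 0
```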
Any variable on the left of this tilde will not be converted to dummy variables; variables on the right of the tilde will be converted to dummy variables. We want all the other categorical variables to be converted to dummy variables, so there is a dot. Then we have a minus one; this removes the intercept column from the created model matrix. The data to be used here is train.

Let's run this, and we will look at trainX so that you get a better understanding of what we have created. So this is trainX. If we look at this matrix, it has all the categorical variables converted to dummy variables. You can see that the 3D_available variable now has two dummy variables, and Genre has three. We should delete this extra dummy variable in the 12th column; the rest of it is ready. So we'll just go and delete that extra column. To do that, we will write trainX gets trainX[, -12]. This minus 12 is because we want to delete the 12th column. So we'll run this command. You can go to the trainX data to confirm that the 12th column is gone.

So with this, our trainX data is ready in the model matrix format. Now we need to do the same thing with the test data.
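The model-matrix step can be sketched on a toy data frame. The column names mirror the course dataset, but the values, and the position of the redundant dummy, are illustrative (in the video it happens to be column 12):

```r
# Toy frame: 0/1 target plus one numeric and two categorical predictors.
train <- data.frame(
  Start_Tech_Oscar = c(1, 0, 1, 0),
  Budget           = c(36.5, 35.2, 48.1, 43.0),
  X3D_available    = factor(c("YES", "NO", "NO", "YES")),
  Genre            = factor(c("Drama", "Comedy", "Action", "Thriller"))
)

# "~ . - 1": everything right of the tilde is dummy-coded, and the intercept
# is dropped. Without an intercept the FIRST factor keeps ALL of its levels,
# so one dummy (here X3D_availableNO) is redundant and is dropped afterwards.
trainX <- model.matrix(Start_Tech_Oscar ~ . - 1, data = train)
print(colnames(trainX))

redundant <- which(colnames(trainX) == "X3D_availableNO")
trainX <- trainX[, -redundant]   # same idea as trainX[, -12] in the video
```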
For the test data also, we will separate the Y and X parts of it. We will run testY equal to this condition, so that we have testY in the Boolean format; that is, if Start_Tech_Oscar in the test dataset is 1, it will contain the value TRUE, and if it is 0, it will contain FALSE. We'll run this, and testY is created with TRUE/FALSE values.

For testX also, we'll run the same thing, and we will delete the extra column. Just to confirm, we'll open testX and check whether the extra dummy variable is again in the 12th column. It is. So we will copy the same line, change train to test, and run it. And testX now no longer has that extra column.

Now, as I told you, XGBoost takes its input as a DMatrix, so to create a DMatrix, this is the code that we run: xgb.DMatrix is the function, data is the X part, the trainX model matrix that we created, and label is the Y part, that is, the classification part that we want to predict. Similarly for the test set also, we will create this DMatrix. So we'll run these two commands to convert the data into the DMatrix format.

So by doing all this, we have prepared the data to be run in the xgboost() function.
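A minimal, hedged sketch of the DMatrix step: tiny stand-in matrices replace the real trainX/trainY, the object name Xmatrix is an assumption, and the call is guarded so the snippet is harmless where the xgboost package is not installed:

```r
# Stand-ins for the prepared model matrix and Boolean label vector.
trainX <- matrix(c(1, 0, 0, 1, 1, 0), nrow = 3)   # 3 rows, 2 columns
trainY <- c(TRUE, FALSE, TRUE)

if (requireNamespace("xgboost", quietly = TRUE)) {
  # Wrap the numeric matrix and labels into the format XGBoost trains on.
  Xmatrix <- xgboost::xgb.DMatrix(data = trainX, label = trainY)
  print(dim(Xmatrix))   # rows and columns of the wrapped matrix
}
```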
Now we will use this DMatrix to train the model. So we use the xgboost() function; data is equal to the X matrix, this is the data, and nrounds is the number of iterations that the boosting algorithm will do. For now, I have kept it at 50. You can change it to 20 or 200 to see the performance. The objective I have set to multi:softmax. So let us see what objective options we have. I have pressed F1 on xgboost to open the help on xgboost. In this, we want to look at the objective parameter.

If you scroll down, you can see these are the learning task parameters. In these, we specify the learning task. If you are going to do linear regression, you use reg:linear. If you want to do logistic regression, you write reg:logistic. If you want the model to classify, you can use multi:softmax, which sets XGBoost to do multiclass classification using the softmax objective.

So since our objective is classification, here we will use multi:softmax. If you have a regression objective, you can use the other two: reg:linear for linear regression, reg:logistic for logistic regression.

eta is equal to 0.3. eta is the learning parameter, which decides the learning rate; eta can have values between 0 and 1.
And if you look at the help section, it tells you that eta controls the learning rate. So if you put a very small value, it would mean that you need to run the model for a larger number of rounds so that it learns completely. If you have a large value of eta, learning is very fast and you can reduce the number of rounds, but in that scenario the model may not be able to completely fit the data. So it is preferable to keep a lower value for eta and a higher value of nrounds, so that your model fits the data completely.

num_class is a parameter which is specific to this objective. That is, since it is a multiclass classification objective, we need to specify the number of classes in the objective variable. Since our objective variable has only two classes, num_class will have the value 2.

max_depth controls the growth of the tree. So a max_depth value of 100 means that the final tree can have a maximum depth of 100 levels. By default, it has a maximum depth of 6.

So I'll run this command. Once the 50 rounds of iteration are complete, the values of the boosted model are stored in this XGBoosting variable. Now, using this XGBoosting variable, we will predict the values into the xgpred variable.
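Putting the pieces together, the training step might look like the sketch below. Everything here is illustrative: the data is synthetic, xgb.train is used as a stable equivalent of the xgboost() call from the video, and the parameter values are the ones discussed above. The snippet is guarded in case the xgboost package is absent.

```r
set.seed(42)
# Synthetic stand-ins for the prepared trainX matrix and trainY labels.
trainX <- matrix(runif(90), ncol = 3)            # 30 rows, 3 columns
trainY <- trainX[, 1] + trainX[, 2] > 1          # synthetic two-class target

if (requireNamespace("xgboost", quietly = TRUE)) {
  Xmatrix <- xgboost::xgb.DMatrix(data = trainX, label = trainY)
  model <- xgboost::xgb.train(
    params = list(
      objective = "multi:softmax",  # multiclass classification via softmax
      num_class = 2,                # our target has two classes (0 / 1)
      eta       = 0.3,              # learning rate, between 0 and 1
      max_depth = 100               # per-tree depth cap (default is 6)
    ),
    data    = Xmatrix,
    nrounds = 50                    # boosting iterations
  )
  xgpred <- predict(model, Xmatrix) # class predictions come back as 0/1
}
```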
So we will again use the predict function, giving it the model name and the DMatrix for the test set. So this is the DMatrix of the test set. If we run this line, the predicted values of the classes are stored in the xgpred variable.

Now, to see the prediction accuracy of this model, we will run this table command to create a confusion matrix. And you can see that our model is correctly classifying these two sets of observations: these 31 have the predicted value 0 and the actual value was also 0, or FALSE, and these 43 were predicted as 1 and the actual value was 1. So for 74 out of 113 observations, we are getting correct predictions on the test set. So 74 out of 113 is correct. So basically, we have a prediction accuracy of 65.5 percent.

We can change the parameter values to get a different prediction accuracy. So, for example, if I decrease the max depth and run this model again, and then predict the values and create the table again, now I am correctly predicting 71 cases instead of 74 cases. So reducing the depth in this scenario has reduced the accuracy of my model.
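The accuracy arithmetic from the confusion matrix can be checked directly; the counts 31 and 43 and the test-set total of 113 are the figures read off in the video:

```r
# Accuracy = correctly classified / total test observations.
correct  <- 31 + 43      # the two diagonal cells of the confusion matrix
total    <- 113          # size of the test set
accuracy <- correct / total
cat(round(accuracy * 100, 1), "percent\n")   # 65.5 percent

# Given a table `tab` built with table(testY, xgpred), the same quantity
# is sum(diag(tab)) / sum(tab).
```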
So you can check it for different values of nrounds, different values of eta, different values of max_depth, and see the prediction accuracy at different values of the different parameters. If you know how to do looping, that is, if you know how to run loops in R, you can also run a loop and find out the different prediction accuracies at different values of nrounds. So, for example, you can run a loop over nrounds, changing its value from 10 to 100, and for each scenario you find out the prediction accuracy; wherever you get the best prediction accuracy, you keep that value of nrounds. That is something we will not be covering in this course; however, that is also possible.

So this is how we do XGBoost. And you have seen that initially, when we created very simple decision trees, we could plot them, we could easily interpret them, and we could use those visuals in our presentations easily. They basically were very interpretable. But to increase the prediction accuracy, we traded off that interpretability and gained prediction accuracy. We talked about ensemble methods: bagging, random forest and boosting. In boosting, we further discussed gradient boosting, AdaBoost and XGBoost.
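Returning to the tuning loop mentioned above: a hedged sketch of trying nrounds from 10 to 100 and keeping the best value. The data is synthetic, the accuracy here is measured on the training DMatrix purely for brevity, and the block is guarded in case xgboost is not installed:

```r
set.seed(1)
# Synthetic stand-ins for the prepared data.
trainX <- matrix(runif(200), ncol = 4)           # 50 rows, 4 columns
trainY <- trainX[, 1] + trainX[, 2] > 1

grid <- seq(10, 100, by = 10)   # candidate nrounds values, 10 to 100

if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = trainX, label = trainY)
  acc <- sapply(grid, function(n) {
    m <- xgboost::xgb.train(
      params = list(objective = "multi:softmax", num_class = 2,
                    eta = 0.3, max_depth = 6),
      data    = dtrain,
      nrounds = n
    )
    mean(predict(m, dtrain) == trainY)   # accuracy at this nrounds value
  })
  best <- grid[which.max(acc)]           # keep the best-scoring nrounds
  cat("best nrounds:", best, "\n")
}
```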
All of these methods showed great improvement in prediction accuracy as compared to a simple decision tree or a pruned decision tree. So now you know both parts: how to create a simple decision tree, and how to create a very advanced gradient-boosted decision tree model.