1
00:00:01,710 --> 00:00:06,600
In this video, we are going to look at the code to build a regression tree.

2
00:00:07,200 --> 00:00:10,530
And we will build a regression tree on our movie data set.

3
00:00:12,020 --> 00:00:19,670
And using that tree, we will predict the values of box office collection for our test set.

4
00:00:20,970 --> 00:00:27,330
The first step that we need to take to create this tree is we need to install some packages.

5
00:00:28,490 --> 00:00:33,410
To create this tree, we will be using the rpart package.

6
00:00:34,990 --> 00:00:41,100
And once the tree is created, we will use the rpart.plot package to plot that decision tree.

7
00:00:42,000 --> 00:00:45,620
So we need two packages, rpart and rpart.plot.

8
00:00:47,010 --> 00:00:50,310
Again, if these are already installed, you do not need to install them.

9
00:00:51,640 --> 00:00:57,580
But if they are not, you need to run these two lines to install the rpart and rpart.plot

10
00:00:57,610 --> 00:00:58,120
packages.

11
00:00:58,420 --> 00:00:59,800
So I will run these two commands.
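
A minimal sketch of the install step, assuming the two packages come from CRAN under the names rpart and rpart.plot. The check for already-installed packages and the `interactive()` guard are illustrative additions, not something shown in the video.

```r
# Packages used in this lesson: rpart builds the tree, rpart.plot draws it.
pkgs <- c("rpart", "rpart.plot")

# Find which of them are not installed yet (TRUE means "missing").
is_missing <- !vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[is_missing]

# Install only what is missing. The interactive() guard is an assumption made
# here so that the snippet does nothing when run non-interactively.
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```

If both packages are already present, `missing_pkgs` is empty and nothing is installed.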

12
00:01:02,380 --> 00:01:03,480
And press Ctrl+Enter.

13
00:01:12,270 --> 00:01:14,780
Now both the packages are installed.

14
00:01:15,390 --> 00:01:21,090
But if you go to the Packages section, the tick will not be there.

15
00:01:25,650 --> 00:01:27,840
You can see we have these two packages.

16
00:01:27,990 --> 00:01:32,070
These are installed, but these are not active. To make them active,

17
00:01:32,390 --> 00:01:38,140
we either run library(rpart) and library(rpart.plot), or we just tick them here.

18
00:01:38,400 --> 00:01:40,710
So we'll run these two commands to make them active.
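
A small sketch of the activation step. The `requireNamespace()` guard around rpart.plot is an illustrative addition so that the snippet still runs on a machine where that package has not been installed yet.

```r
# Attach rpart so that rpart() and rpart.control() can be called directly.
library(rpart)

# Attach rpart.plot too, but only if it is installed (guard added for safety).
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  library(rpart.plot)
}

# Check which of the two packages are now attached to the session.
attached <- c("rpart", "rpart.plot") %in% (.packages())
attached
```

Ticking the boxes in the Packages section has the same effect as these `library()` calls.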

19
00:01:42,790 --> 00:01:50,200
Now we can use these two libraries. Training a regression tree is very simple; it is just one line of

20
00:01:50,200 --> 00:01:50,590
code.

21
00:01:51,130 --> 00:01:54,670
This is that line. We will use

22
00:01:54,820 --> 00:02:01,530
the rpart function. regtree is the name of the variable in which we will store all the information of the decision

23
00:02:01,600 --> 00:02:01,990
tree.

24
00:02:03,130 --> 00:02:07,180
And regtree is getting its information from the rpart function.

25
00:02:08,520 --> 00:02:09,570
The formula is

26
00:02:10,910 --> 00:02:14,900
collection tilde dot (Collection ~ .). The meaning of collection tilde dot is this.

27
00:02:16,620 --> 00:02:20,870
Anything on the left of the tilde, this wavy symbol...

28
00:02:22,110 --> 00:02:25,590
Anything to the left of this wavy symbol is the dependent variable.

29
00:02:26,310 --> 00:02:28,290
So my dependent variable is collection.

30
00:02:29,010 --> 00:02:36,900
And I want to identify the relationship of collection with all the other variables in my data. To represent

31
00:02:37,050 --> 00:02:37,980
all other variables,

32
00:02:38,040 --> 00:02:39,330
I have to write a dot.

33
00:02:40,390 --> 00:02:45,190
If I want to use a particular variable, I can write the name of that particular variable here instead.

34
00:02:46,030 --> 00:02:52,150
If I want to use two variables, I can write the names of both variables and use a plus symbol in

35
00:02:52,150 --> 00:02:52,720
between them.

36
00:02:53,470 --> 00:03:00,220
So if I want to find the relationship of collection with marketing expense and production expense,

37
00:03:00,640 --> 00:03:07,000
I'll just write collection tilde marketing expense plus production expense.

38
00:03:08,640 --> 00:03:12,660
So the formula will tell the rpart function

39
00:03:13,930 --> 00:03:19,140
which is the dependent variable and which are the independent variables. Here,

40
00:03:19,270 --> 00:03:26,970
Collection is the dependent variable, and the dot signifies that all other variables except collection are

41
00:03:26,970 --> 00:03:29,650
the independent or predictor variables.
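
The two ways of writing the formula can be sketched as follows. The data frame and its column names (`collection`, `marketing_expense`, `production_expense`, `budget`) are hypothetical stand-ins; the real movie dataset may name its columns differently.

```r
# Hypothetical column names and made-up values standing in for the movie data.
movies <- data.frame(
  collection         = c(48, 43, 69, 66, 59),
  marketing_expense  = c(20, 21, 23, 24, 22),
  production_expense = c(59, 69, 91, 97, 75),
  budget             = c(36, 35, 39, 38, 37)
)

# The dot means "every other column in the data".
f_all <- collection ~ .

# Naming predictors explicitly, joined by a plus sign.
f_two <- collection ~ marketing_expense + production_expense

# terms() expands the dot against the data, showing which predictors are used.
predictors_all <- attr(terms(f_all, data = movies), "term.labels")
predictors_two <- attr(terms(f_two), "term.labels")
```

Here `predictors_all` lists the three non-response columns, while `predictors_two` lists only the two that were named.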

42
00:03:31,000 --> 00:03:32,770
The next parameter is data.

43
00:03:33,850 --> 00:03:35,400
The name of the data is train,

44
00:03:35,980 --> 00:03:39,640
because we want to use only the training set to train that model.

45
00:03:41,660 --> 00:03:44,810
The third parameter we are going to use is control.

46
00:03:45,990 --> 00:03:50,340
This parameter will help us control the length of the decision tree.

47
00:03:51,580 --> 00:03:57,670
We do not want a very long decision tree; a small tree is more interpretable.

48
00:03:58,030 --> 00:04:00,880
And it has less chance of overfitting.

49
00:04:01,790 --> 00:04:05,360
So here, I'm just going to keep a max depth of three.

50
00:04:06,140 --> 00:04:08,450
That is, it will have at max

51
00:04:08,690 --> 00:04:13,250
three more layers from the initial node, which has the total population.

52
00:04:15,050 --> 00:04:23,030
If you want to know more about the rpart function, that is, what other arguments you can use

53
00:04:23,150 --> 00:04:23,980
with rpart,

54
00:04:24,680 --> 00:04:27,090
you just go to rpart and press F1.

55
00:04:28,940 --> 00:04:33,290
It will open the help for the rpart function on the right-hand side.

56
00:04:34,250 --> 00:04:37,790
You can see all the different arguments that can be given.

57
00:04:38,420 --> 00:04:41,120
The arguments that we are giving are mostly mandatory.

58
00:04:41,390 --> 00:04:43,850
This control argument is not mandatory.

59
00:04:44,150 --> 00:04:48,040
It has some default values, but to control the growth of this tree,

60
00:04:48,200 --> 00:04:50,300
we have put in this control argument too.

61
00:04:52,670 --> 00:04:54,380
So I will run this command now.
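
A minimal sketch of the one-line training call described above. The training data here is synthetic and its column names are hypothetical; the object names `regtree` and `train` follow what the narration appears to call them.

```r
library(rpart)

# Synthetic stand-in for the training set; column names are hypothetical.
set.seed(1)
n <- 200
train <- data.frame(
  budget            = runif(n, 20000, 50000),
  trailer_views     = runif(n, 200000, 500000),
  marketing_expense = runif(n, 20, 100)
)
train$collection <- 1.2 * train$budget + 0.05 * train$trailer_views +
  rnorm(n, sd = 3000)

# One call trains the tree: formula, data, and a control limiting depth to 3.
regtree <- rpart(
  formula = collection ~ .,
  data    = train,
  control = rpart.control(maxdepth = 3)
)

# rpart numbers the root as node 1 and its children as 2 and 3, and so on,
# so floor(log2(node number)) gives the layer each node sits on.
node_ids <- as.numeric(rownames(regtree$frame))
max_layer <- max(floor(log2(node_ids)))
max_layer
```

Because the response is numeric, rpart fits a regression (anova) tree here, and `max_layer` never exceeds the `maxdepth` value.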

62
00:04:58,510 --> 00:05:07,690
And you can see a new variable, regtree, is created, and it has the information of the decision tree

63
00:05:07,780 --> 00:05:09,700
model on this training set.

64
00:05:11,480 --> 00:05:16,670
Now, if you want to look at the decision tree created by this single line of code,

65
00:05:17,740 --> 00:05:19,420
you can run this second line.

66
00:05:20,890 --> 00:05:26,090
The second line uses the rpart.plot function of the rpart.plot package.

67
00:05:27,880 --> 00:05:35,560
The first argument in this is the variable which contains the decision tree information. For us, it

68
00:05:35,560 --> 00:05:36,260
was regtree.

69
00:05:39,190 --> 00:05:41,470
The second argument is not mandatory.

70
00:05:41,530 --> 00:05:46,280
It is just the color palette that we are going to use to plot this decision tree.

71
00:05:46,450 --> 00:05:48,530
You can use a different color palette.

72
00:05:49,240 --> 00:05:51,370
You can skip this argument also.

73
00:05:52,980 --> 00:05:56,450
If you do not write this, it will use some default colors of its own.

74
00:05:59,220 --> 00:06:04,480
The third argument is digits. digits tells how many significant digits

75
00:06:04,740 --> 00:06:06,240
that decision tree should have.

76
00:06:07,050 --> 00:06:13,500
Once we plot this decision tree, I will change the value of digits to show you how the value

77
00:06:13,980 --> 00:06:15,480
in the decision tree changes.

78
00:06:16,450 --> 00:06:19,190
So let us run this command and plot the decision tree.
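
A hedged sketch of the plotting call with the three arguments just described. The data is synthetic, the palette name "Blues" is an arbitrary choice (the video's palette is not recoverable from the narration), and the `requireNamespace()` guard is an illustrative addition.

```r
library(rpart)

# Small synthetic training set (hypothetical columns) so the snippet runs alone.
set.seed(1)
train <- data.frame(budget = runif(200, 20000, 50000))
train$collection <- 1.2 * train$budget + rnorm(200, sd = 3000)
regtree <- rpart(collection ~ ., data = train,
                 control = rpart.control(maxdepth = 3))

if (requireNamespace("rpart.plot", quietly = TRUE)) {
  # box.palette sets the node colours. A negative digits value asks for R's
  # standard number formatting with that many significant digits, which
  # avoids the exponent notation used for large values.
  rpart.plot::rpart.plot(regtree, box.palette = "Blues", digits = -3)
}
```

Leaving out `box.palette` falls back to the package's default colours.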

79
00:06:24,060 --> 00:06:27,930
You can see on the right we have this decision tree plot.

80
00:06:28,890 --> 00:06:30,840
We can click on the zoom button.

81
00:06:34,160 --> 00:06:37,400
And this is the zoomed version of this decision tree.

82
00:06:38,790 --> 00:06:41,790
Let us look at this decision tree to see what this is telling.

83
00:06:43,300 --> 00:06:49,480
So the top node, that is, the initial node, has all the observations.

84
00:06:49,570 --> 00:06:51,460
That is, a hundred percent of the observations.

85
00:06:52,270 --> 00:06:56,850
And the average collection of all the movies was nearly 45000.

86
00:06:58,520 --> 00:07:02,470
The first split is using the variable budget.

87
00:07:03,540 --> 00:07:08,400
at nearly thirty-eight thousand. So the left side

88
00:07:08,820 --> 00:07:16,890
is for movies which have a budget less than 38000, and the right side is for movies which have a budget more

89
00:07:16,890 --> 00:07:17,850
than 38000.

90
00:07:20,140 --> 00:07:26,940
On the right side, you can see this node contains nearly 16 percent of the observations,

91
00:07:27,750 --> 00:07:33,330
whereas this left node contains nearly 84 percent of the observations.

92
00:07:36,050 --> 00:07:41,190
In these 16 percent of the movies, which have a budget of more than 38000,

93
00:07:42,470 --> 00:07:47,210
the average collection is nearly seventy-two thousand, whereas

94
00:07:48,640 --> 00:07:54,730
in movies with a budget of less than 38000, the average collection is nearly 39000.

95
00:07:55,390 --> 00:08:02,000
So there is a huge difference between movies which have a budget of more than 38000

96
00:08:02,440 --> 00:08:04,450
and a budget of less than 38000.

97
00:08:05,410 --> 00:08:09,730
You can continue further to see the effect of different variables

98
00:08:10,880 --> 00:08:11,720
at each node.

99
00:08:11,930 --> 00:08:17,030
So in this node, the split is on the director rating.

100
00:08:18,410 --> 00:08:26,190
And in this node, the split is on trailer views, and so on. Since we mentioned that we need a maximum

101
00:08:26,190 --> 00:08:27,630
depth of three layers.

102
00:08:28,770 --> 00:08:29,970
This is layer zero.

103
00:08:30,900 --> 00:08:31,860
This is layer one.

104
00:08:32,610 --> 00:08:33,470
This is layer two.

105
00:08:33,810 --> 00:08:35,210
And this is layer three.

106
00:08:36,620 --> 00:08:41,570
So this tree has a maximum depth of three.

107
00:08:42,590 --> 00:08:49,430
If we increase the control parameter of maximum depth to five, we will have a bigger tree.

108
00:08:50,740 --> 00:08:53,510
Although that tree will be difficult to interpret.

109
00:08:55,440 --> 00:09:00,480
So here you can clearly see that the most important variable is budget. After budget,

110
00:09:00,610 --> 00:09:02,200
it is trailer views.

111
00:09:03,190 --> 00:09:06,130
And then we have director rating or marketing expense.

112
00:09:07,640 --> 00:09:12,760
Now, let us go back and see the effect of mentioning the digits

113
00:09:12,970 --> 00:09:13,510
parameter.

114
00:09:17,000 --> 00:09:22,850
So if I put digits as zero and run the command,

115
00:09:26,330 --> 00:09:28,490
you can see that now I have...

116
00:09:31,000 --> 00:09:41,080
Now I have this plot in scientific format, that is, it is giving me values in the 45 into ten to

117
00:09:41,130 --> 00:09:42,940
the power three format.

118
00:09:44,310 --> 00:09:51,120
Since this becomes a little bit difficult to read and understand, that is why I had written digits is equal

119
00:09:51,120 --> 00:09:51,990
to minus three.

120
00:09:52,230 --> 00:09:52,650
That is,

121
00:09:54,620 --> 00:09:59,300
I want those three significant figures also, which were going into the power.

122
00:09:59,960 --> 00:10:05,680
So if you want it to be in the power format, you increase the value of digits.

123
00:10:06,560 --> 00:10:12,320
And if you want more significant digits to be displayed like this, you reduce the value of digits.

124
00:10:14,040 --> 00:10:21,540
Till now, we have created, that is, we have trained, our regression tree on the training set, and we have used

125
00:10:22,020 --> 00:10:25,140
the rpart.plot function to plot the regression tree.

126
00:10:28,330 --> 00:10:33,030
You can use these regression trees in your presentations and in your documents.

127
00:10:33,790 --> 00:10:37,060
And these can help you make important business decisions.

128
00:10:38,900 --> 00:10:47,600
Now, the other part of using the decision tree is to be able to predict the values for other movies or for future

129
00:10:47,990 --> 00:10:49,820
observations that we are going to get.

130
00:10:51,390 --> 00:10:54,580
To predict the values, we use the predict function.

131
00:10:55,600 --> 00:11:00,780
I'm going to add a new column in the test dataset.

132
00:11:01,540 --> 00:11:04,930
This column will contain the predicted values

133
00:11:06,380 --> 00:11:08,660
using the decision tree that we have created.

134
00:11:11,700 --> 00:11:17,880
So to find out the predicted values, we use this predict function. In this predict function,

135
00:11:18,720 --> 00:11:26,570
the first argument is the variable which contains the information of the decision tree, which was regtree

136
00:11:26,580 --> 00:11:27,090
for us.

137
00:11:28,580 --> 00:11:33,350
The second argument is the data on which you want to predict.

138
00:11:33,800 --> 00:11:37,160
So that data is contained in the test dataset.

139
00:11:38,670 --> 00:11:44,580
And the third parameter is telling the type of response that you want to predict.

140
00:11:45,360 --> 00:11:49,590
Since we want continuous values, we'll use type is equal to vector.

141
00:11:50,980 --> 00:11:57,550
If this was a classification decision tree, that is, we wanted to predict the classes, then we would

142
00:11:57,550 --> 00:11:59,230
have used a different type.

143
00:11:59,590 --> 00:12:01,270
That is, type is equal to class.

144
00:12:02,110 --> 00:12:03,940
Here, since it is a regression tree,

145
00:12:04,060 --> 00:12:05,580
we want type equal to vector.

146
00:12:06,400 --> 00:12:13,160
So I will run this and we can go and look at the test dataset.
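
The prediction step can be sketched as below. The data, the 80/20 split, and the column names are assumptions made so the snippet runs on its own; `pred` is the assumed name of the new column, and `type = "vector"` is the option the narration describes for a regression tree.

```r
library(rpart)

# Synthetic movie-like data with hypothetical column names.
set.seed(1)
n <- 300
movies <- data.frame(budget = runif(n, 20000, 50000))
movies$collection <- 1.2 * movies$budget + rnorm(n, sd = 3000)

# Hypothetical 80/20 split into training and test sets.
train_idx <- sample(n, size = 0.8 * n)
train <- movies[train_idx, ]
test  <- movies[-train_idx, ]

regtree <- rpart(collection ~ ., data = train,
                 control = rpart.control(maxdepth = 3))

# New column holding one numeric prediction per test row.
test$pred <- predict(regtree, test, type = "vector")

# Actual and predicted values side by side.
head(test[, c("collection", "pred")])
```

For a classification tree, the same call would use `type = "class"` to return predicted class labels instead.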

147
00:12:14,000 --> 00:12:14,740
Just click on it.

148
00:12:18,180 --> 00:12:24,450
And in the end, you can see this column, the collection column has the actual values.

149
00:12:25,410 --> 00:12:31,970
And this last column, the pred column, has the predicted values for our

150
00:12:32,260 --> 00:12:33,120
test dataset.

151
00:12:34,690 --> 00:12:41,410
So using the information in regtree, it has predicted these values for the test set.

152
00:12:43,120 --> 00:12:50,290
Now, one measure of performance of our model is calculating the mean squared error.

153
00:12:52,510 --> 00:13:00,100
And as discussed earlier, mean squared error is the mean of the squared differences between

154
00:13:00,220 --> 00:13:01,360
actual and predicted.

155
00:13:02,150 --> 00:13:08,470
So basically, we find out the difference between predicted and actual, we square this difference,

156
00:13:09,770 --> 00:13:12,620
and then we find the mean of all those values.
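
The three steps (difference, square, mean) can be sketched on a few made-up numbers. The values are purely illustrative, and `MSE2` is the assumed name of the result variable.

```r
# Toy actual and predicted values (made up for illustration).
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 320, 400)

# Difference between predicted and actual, squared, then averaged.
MSE2 <- mean((predicted - actual)^2)
MSE2  # 150
```

The differences are 10, -10, 20 and 0, so the squares are 100, 100, 400 and 0, and their mean is 150. On a test set, the same expression would take the predicted column and the actual collection column as its two vectors.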

157
00:13:15,120 --> 00:13:18,560
If I run this, a new variable, MSE2, is created.

158
00:13:19,340 --> 00:13:20,390
It has this value,

159
00:13:21,950 --> 00:13:24,560
which is nearly one hundred thirteen million.

160
00:13:26,570 --> 00:13:32,480
And this is the mean squared difference between actual and predicted values.

161
00:13:34,030 --> 00:13:41,620
Now, when we study the advanced techniques in the future, we will calculate MSE values for all those techniques

162
00:13:42,250 --> 00:13:47,500
and we will see whether using those techniques reduces this MSE.

163
00:13:49,450 --> 00:13:56,620
So basically, the accuracy will be judged using MSE. If the MSE value is low,

164
00:13:57,190 --> 00:13:59,170
the model is more accurate.

165
00:14:03,640 --> 00:14:07,500
So this is the code to create a decision tree,

166
00:14:07,950 --> 00:14:15,940
then to plot a decision tree, and then to use that decision tree to predict values for another dataset.