1
00:00:00,966 --> 00:00:03,600
Hello and welcome back to the course
on Machine Learning.

2
00:00:03,600 --> 00:00:04,300
In today's tutorial,

3
00:00:04,300 --> 00:00:07,600
we're talking about decision trees
and the intuition behind them.

4
00:00:08,066 --> 00:00:11,666
All right, so you may have heard the term
CART, which stands for classification

5
00:00:11,666 --> 00:00:13,066
and regression trees.

6
00:00:13,066 --> 00:00:17,700
And this is an umbrella term that
encompasses two types of decision trees.

7
00:00:17,700 --> 00:00:19,033
And as you've correctly guessed,

8
00:00:19,033 --> 00:00:22,433
they are the classification trees
and regression trees.

9
00:00:22,900 --> 00:00:25,633
And in this course
we're going to talk about both types.

10
00:00:25,633 --> 00:00:30,333
But specifically in this section
we're focusing on the regression trees.

11
00:00:30,900 --> 00:00:34,300
And I wanted to mention right away
that regression trees are a bit

12
00:00:34,300 --> 00:00:36,200
more complex than classification trees.

13
00:00:36,200 --> 00:00:38,100
And that's why this tutorial
is going to be a bit longer

14
00:00:38,100 --> 00:00:40,800
and is going to require
some additional attention.

15
00:00:40,800 --> 00:00:44,800
But nevertheless, we're still going
to break this kind of somewhat complex

16
00:00:44,800 --> 00:00:50,100
topic into very simple,
bite-sized elements of information.

17
00:00:50,100 --> 00:00:51,866
So it will all make sense.

18
00:00:51,866 --> 00:00:54,733
And towards the end of it, you'll be quite
comfortable with regression trees.

19
00:00:54,733 --> 00:00:56,666
So let's get straight into it.

20
00:00:56,666 --> 00:00:57,166
All right.

21
00:00:57,166 --> 00:01:00,733
So here we've got a scatterplot
which represents our data set.

22
00:01:00,733 --> 00:01:02,666
So the data set that has been given to us.

23
00:01:02,666 --> 00:01:06,133
And the interesting thing about
the scatterplot is that we've got two

24
00:01:06,133 --> 00:01:08,133
independent variables x1 and x2.

25
00:01:08,133 --> 00:01:13,300
And what we're predicting is a third
variable a dependent variable which is y.

26
00:01:13,600 --> 00:01:16,200
And you cannot actually
see y on this chart.

27
00:01:16,200 --> 00:01:19,500
And that is because this is simply
a two-dimensional chart.

28
00:01:19,533 --> 00:01:22,600
It only fits the two variables;
y is the third dimension.

29
00:01:22,600 --> 00:01:25,733
And if you think about it,
it's like sticking out of your screen.

30
00:01:25,733 --> 00:01:27,333
That's where that dimension is.

31
00:01:27,333 --> 00:01:31,033
And this is just a projection
of all the points on the x1, x2 plane.

32
00:01:31,400 --> 00:01:34,766
And so if I add a third dimension,
it would look something like that.

33
00:01:35,100 --> 00:01:37,733
But once again we can't see y right now.

34
00:01:37,733 --> 00:01:39,433
And the interesting thing is

35
00:01:39,433 --> 00:01:43,200
that we don't actually need to see y
because we need to work with this

36
00:01:43,200 --> 00:01:47,066
scatterplot first for a little bit
to build our decision tree.

37
00:01:47,200 --> 00:01:49,666
And then once we've built
it, we'll return to y.

38
00:01:49,666 --> 00:01:52,666
Now a quick point
I wanted to make here is that

39
00:01:53,066 --> 00:01:57,300
I've seen decision trees explained
with just one independent variable.

40
00:01:57,300 --> 00:01:59,433
So x1 or just x and y.

41
00:01:59,433 --> 00:02:03,933
And in that case, yes,
you can just put x1 over here

42
00:02:03,933 --> 00:02:06,533
and then y would go over here
and you would have

43
00:02:06,533 --> 00:02:10,400
a bit of a different kind of diagram,
and you'd be able to explain it that way.

44
00:02:10,400 --> 00:02:13,966
But at the same time, I think
it might not really drive the point home.

45
00:02:14,200 --> 00:02:17,533
And it can be a bit confusing
when it's explained like that,

46
00:02:17,866 --> 00:02:20,600
although sometimes it is done.

47
00:02:20,600 --> 00:02:24,333
Nevertheless, I thought we'd go
the full way, we'd do the full Monty,

48
00:02:24,333 --> 00:02:28,533
and we'd look at this problem
with two independent variables,

49
00:02:28,533 --> 00:02:30,600
because it'll be
a more robust explanation.

50
00:02:30,600 --> 00:02:34,333
So it will make it a bit more complex,
but it's definitely worth it in the long

51
00:02:34,333 --> 00:02:38,666
run, because that way we'll understand
decision tree regression a bit,

52
00:02:38,966 --> 00:02:42,000
or actually,
I would say quite a bit better.

53
00:02:42,233 --> 00:02:42,566
All right.

54
00:02:42,566 --> 00:02:44,133
So let's continue.

55
00:02:44,133 --> 00:02:45,700
We've got x1 and x2.

56
00:02:45,700 --> 00:02:47,333
These are independent variables.

57
00:02:47,333 --> 00:02:48,433
The dependent variable.

58
00:02:48,433 --> 00:02:51,100
We cannot see it.
And it's the third dimension.

59
00:02:51,100 --> 00:02:54,466
And we're actually going
to forget about it for a little while.

60
00:02:54,466 --> 00:02:54,666
Right.

61
00:02:54,666 --> 00:02:57,700
So we're going to just forget about it
because we need to work with this

62
00:02:57,800 --> 00:03:01,200
scatterplot to see how our decision tree
is going to be created.

63
00:03:01,633 --> 00:03:05,633
So once you run the regression tree
or decision

64
00:03:05,866 --> 00:03:10,500
tree algorithm in the regression
sense of it, what will happen is

65
00:03:10,500 --> 00:03:15,433
your scatterplot
will be split up into segments.

66
00:03:15,433 --> 00:03:18,900
And let's have a look at how an algorithm
could go about doing that.

67
00:03:18,900 --> 00:03:24,066
So an algorithm would create a split over
here for example at somewhere around 20.

68
00:03:24,600 --> 00:03:29,000
So it would basically split your diagram
or your scatterplot into two parts.

69
00:03:29,000 --> 00:03:30,133
Everything that's less than 20.

70
00:03:30,133 --> 00:03:32,933
Everything that's greater
than 20 for the X1 variable.

71
00:03:32,933 --> 00:03:34,700
Then another split would happen here.

72
00:03:34,700 --> 00:03:37,700
So for all of the elements on this side

73
00:03:37,700 --> 00:03:40,766
they would be compared
to 170: greater or less.

74
00:03:41,066 --> 00:03:42,900
And then there
would be another split here

75
00:03:42,900 --> 00:03:44,766
and then maybe another split over here.

76
00:03:44,766 --> 00:03:48,566
Now how and where these splits are
conducted

77
00:03:48,933 --> 00:03:51,600
is determined by the algorithm.

78
00:03:51,600 --> 00:03:54,600
And it actually involves

79
00:03:54,600 --> 00:03:58,033
looking at something
called the information entropy.

80
00:03:58,266 --> 00:04:00,866
And it is a mathematical concept.

81
00:04:00,866 --> 00:04:02,866
It is quite complex.

82
00:04:02,866 --> 00:04:06,366
So it basically asks,
when I perform this split,

83
00:04:06,700 --> 00:04:09,700
is this split increasing

84
00:04:09,700 --> 00:04:12,833
the amount of information
that we have about our points?

85
00:04:12,833 --> 00:04:18,700
Is it actually adding some value to
our way that we want to group our points?

86
00:04:18,966 --> 00:04:22,566
And the way the algorithm knows when to stop is
when there's

87
00:04:22,566 --> 00:04:26,700
a certain minimum for the information
that needs to be added.

88
00:04:27,033 --> 00:04:30,066
And once it cannot add

89
00:04:30,066 --> 00:04:33,866
any more information
to our setup by splitting, it stops.

90
00:04:34,000 --> 00:04:35,966
These segments are called leaves.

91
00:04:35,966 --> 00:04:38,166
So each one of these segments
is called a leaf.

92
00:04:38,166 --> 00:04:41,833
Once splitting these leaves
stops adding more information, then

93
00:04:41,833 --> 00:04:46,866
it stops. Or the algorithm could,
let's say, stop when you have less than 5%.

94
00:04:47,666 --> 00:04:50,333
If conducting a split meant
you'd have less than 5%

95
00:04:50,333 --> 00:04:54,066
of your total points in that leaf,
then that leaf wouldn't be created.

96
00:04:54,166 --> 00:04:58,200
So there are different variations
or different options for that to happen.
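
A minimal sketch of the 5% stopping rule just described, written in Python. The function name `split_allowed` and its `min_fraction` parameter are made up for this illustration and do not come from any library; real implementations may phrase the rule differently.

```python
def split_allowed(n_left, n_right, n_total, min_fraction=0.05):
    """Return True only if both new leaves keep at least min_fraction of all points."""
    min_points = min_fraction * n_total
    return n_left >= min_points and n_right >= min_points


# Hypothetical counts for a data set of 1000 points (5% of it is 50 points).
print(split_allowed(300, 200, 1000))  # both leaves are large enough
print(split_allowed(40, 460, 1000))   # one leaf would hold fewer than 50 points
```

With a rule like this, a split that would leave too few points in one leaf is simply not made, so that leaf is never created.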

97
00:04:58,533 --> 00:05:02,100
But the most important thing is,
of course, where the splits are happening.

98
00:05:02,400 --> 00:05:03,700
And if you'd like to learn

99
00:05:03,700 --> 00:05:07,500
more about that, you'd need to study
a bit more about information entropy.

100
00:05:07,833 --> 00:05:10,533
We're not going to go
into that mathematical depth right now.

101
00:05:10,533 --> 00:05:14,400
For us, it's sufficient to know
that the algorithm can handle this,

102
00:05:14,400 --> 00:05:19,533
and that it is finding the optimal splits
of our data set into these leaves.

103
00:05:19,533 --> 00:05:22,000
And the final leaves
are called terminal leaves.
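
The tutorial leaves the split-finding maths aside, so here is only a rough sketch of the idea of searching for a good split on one variable. As an assumption, it scores each candidate split by the sum of squared errors around the mean of y on each side, a criterion commonly used for regression trees, rather than the information entropy mentioned above. The helper names `sse` and `best_split` and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared differences from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Try the midpoint between each pair of neighbouring x values.

    Returns (threshold, total_error) for the split with the lowest combined
    error, or (None, error_without_split) if no split improves on it.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < threshold]
        right = [y for x, y in pairs if x >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data: y is low when x1 is below roughly 20 and high above it.
x1 = [5, 10, 15, 25, 30, 35]
y = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
threshold, error = best_split(x1, y)
print(threshold)  # 20.0
```

A full tree builder would repeat this search over every variable and then again inside each resulting segment, until a stopping rule kicks in.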

104
00:05:22,000 --> 00:05:25,700
And then we're going to focus
on the practical application

105
00:05:25,700 --> 00:05:29,666
of this algorithm, how
and why we're using these

106
00:05:29,666 --> 00:05:33,066
decision trees
and how this regression is going to work.

107
00:05:33,566 --> 00:05:35,800
All right.
So hopefully we're on the same page.

108
00:05:35,800 --> 00:05:36,433
Let's continue.

109
00:05:36,433 --> 00:05:39,533
So we're going to rewind all of this
a little bit.

110
00:05:39,833 --> 00:05:42,900
And we're going to create these splits
one by one.

111
00:05:42,900 --> 00:05:46,233
And alongside we're going to actually
start drawing our decision tree.

112
00:05:46,766 --> 00:05:49,333
So there's our diagram
brand new and fresh.

113
00:05:49,333 --> 00:05:51,366
And there goes our first split.

114
00:05:51,366 --> 00:05:54,366
So now we're going to start
creating our decision tree.

115
00:05:54,500 --> 00:05:55,800
The split happened at 20.

116
00:05:55,800 --> 00:05:57,633
So let's start drawing.

117
00:05:57,633 --> 00:05:59,666
There is our first decision.

118
00:05:59,666 --> 00:06:02,666
And we have two options yes and no.

119
00:06:03,100 --> 00:06:03,433
All right.

120
00:06:03,433 --> 00:06:05,266
So let's see what happens next.

121
00:06:05,266 --> 00:06:07,233
Next comes split two.

122
00:06:07,233 --> 00:06:09,066
Split two happens at 170.

123
00:06:09,066 --> 00:06:12,066
And it only happens for the points
where x1 is greater than 20.

124
00:06:12,266 --> 00:06:15,833
So that means you would check this
condition, x1 is less than 20,

125
00:06:15,866 --> 00:06:18,333
meaning you check it and
the answer is no.

126
00:06:18,333 --> 00:06:23,333
And then you check if x2 is
less than 170. So, x2 is less than 170.

127
00:06:23,966 --> 00:06:25,200
Then split three happens

128
00:06:25,200 --> 00:06:28,633
on the other side and it checks
if x2 is less than 200.

129
00:06:29,166 --> 00:06:31,766
Let's add that here: x2 less than 200.

130
00:06:31,766 --> 00:06:34,933
And then split four happens at 40.

131
00:06:35,066 --> 00:06:38,666
And it checks
if x1 is greater or less than 40.

132
00:06:38,866 --> 00:06:42,400
And split four only happens
for the points that, at split one,

133
00:06:42,400 --> 00:06:45,033
answered
no, it's not less than 20,

134
00:06:45,033 --> 00:06:49,933
and at split two answered yes,
it's actually less than 170.

135
00:06:50,400 --> 00:06:52,433
So no, it's not less than 20.

136
00:06:52,433 --> 00:06:53,833
Yes, it's less than 170.

137
00:06:53,833 --> 00:06:56,500
And then
this is where split four happens.

138
00:06:56,500 --> 00:06:59,400
x1 is less than 40: yes or no.

139
00:06:59,400 --> 00:06:59,700
All right.

140
00:06:59,700 --> 00:07:01,033
So that's our decision tree.

141
00:07:01,033 --> 00:07:02,866
It's done. It's drawn.
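
The four splits just drawn can be written out as nested checks. This is only a sketch: the thresholds (20, 170, 200, 40) are the ones from the walkthrough, while the function name `find_leaf` and the leaf labels "A" to "E" are made up here for illustration.

```python
def find_leaf(x1, x2):
    """Walk the four splits from the walkthrough and return a leaf label."""
    if x1 < 20:            # split 1: yes branch
        if x2 < 200:       # split 3
            return "A"
        return "B"
    if x2 < 170:           # split 2: reached only when x1 is not less than 20
        if x1 < 40:        # split 4
            return "C"
        return "D"
    return "E"


print(find_leaf(10, 250))  # B
print(find_leaf(50, 100))  # D
```

Four splits give five terminal leaves, one for each way of answering the yes/no questions from top to bottom.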

142
00:07:02,866 --> 00:07:04,366
And so what happens next?

143
00:07:04,366 --> 00:07:07,366
What do we actually populate
into those boxes?

144
00:07:07,700 --> 00:07:11,233
Well this is where we need to remember
about our dependent variable.

145
00:07:11,233 --> 00:07:13,166
The third dimension.

146
00:07:13,166 --> 00:07:16,500
And what we need to check here is

147
00:07:16,866 --> 00:07:21,966
how are we going to predict the value of y

148
00:07:22,100 --> 00:07:28,066
for a new observation that gets added
to our scatterplot or to our dataset.

149
00:07:28,066 --> 00:07:35,033
So let's say we add an observation which
has x1 equal to 30 and x2 equal to 50.

150
00:07:35,300 --> 00:07:39,266
30 would fall somewhere over here
and 50 is somewhere over here.

151
00:07:39,266 --> 00:07:40,600
It would fall somewhere over here.

152
00:07:40,600 --> 00:07:44,566
So obviously it falls
into this terminal leaf.

153
00:07:44,900 --> 00:07:47,766
And how does that information...

154
00:07:47,766 --> 00:07:49,400
So as you can see, by adding

155
00:07:49,400 --> 00:07:52,933
these splits,
we've added information into our system.

156
00:07:53,233 --> 00:07:57,333
So how does that information,
now that we know that it falls into this

157
00:07:57,333 --> 00:07:58,400
terminal leaf,

158
00:07:58,400 --> 00:08:01,833
how does that information
help us in terms of predicting

159
00:08:01,833 --> 00:08:04,833
the value of y for that new element
that we're going to add?

160
00:08:05,100 --> 00:08:08,433
Well, the way it works
is it's actually pretty straightforward.

161
00:08:08,466 --> 00:08:14,466
The way it works is you just take the
average of y in each of your terminal leaves.

162
00:08:14,800 --> 00:08:18,200
So you take the average of Y
for all of these points.

163
00:08:18,200 --> 00:08:19,533
And that will be the value

164
00:08:19,533 --> 00:08:24,066
that will be assigned to any new point
that falls in this terminal leaf.

165
00:08:24,333 --> 00:08:25,666
Same for this terminal leaf.

166
00:08:25,666 --> 00:08:28,533
Same for this terminal leaf, same
for this one and the same for this one.

167
00:08:28,533 --> 00:08:29,166
So let's have a look.

168
00:08:29,166 --> 00:08:32,433
Let's say the average for y here is 65.7.

169
00:08:32,433 --> 00:08:36,366
The average for y here is 300.5,
here 1023,

170
00:08:36,366 --> 00:08:39,600
here -64.1, and 0.7 here.

171
00:08:39,900 --> 00:08:42,800
So for that point that we just

172
00:08:42,800 --> 00:08:45,800
discussed with x1 equals 30 and x2

173
00:08:46,066 --> 00:08:50,100
equals 50, the value of y
that the regression tree

174
00:08:50,100 --> 00:08:53,933
algorithm would predict is -64.1.

175
00:08:54,400 --> 00:08:56,500
If it were to fall
in any other terminal leaf,

176
00:08:56,500 --> 00:08:59,466
then the value there
is what it would predict.

177
00:08:59,466 --> 00:09:01,866
So as you can see,
it's actually pretty straightforward.

178
00:09:01,866 --> 00:09:03,933
It's very simple.

179
00:09:03,933 --> 00:09:05,700
It's just taking averages.
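
To make the prediction step concrete, here is a small sketch that stores one average per terminal leaf and returns it for a new point. It reads the five averages quoted above as 65.7, 300.5, 1023, -64.1 and 0.7. Only the -64.1 leaf is pinned down by the example point (x1 = 30, x2 = 50); which leaf each of the other four averages belongs to is an assumption made for illustration, as is the function name `predict`.

```python
def predict(x1, x2):
    """Return the stored leaf average for a new point (leaf placement partly assumed)."""
    if x1 < 20:
        # Assumed placement of these two averages.
        return 300.5 if x2 < 200 else 65.7
    if x2 < 170:
        # -64.1 is the leaf the walkthrough's example point lands in;
        # the placement of 0.7 is assumed.
        return -64.1 if x1 < 40 else 0.7
    return 1023.0  # assumed placement


print(predict(30, 50))  # -64.1
```

Every new point that lands in the same leaf gets the same predicted value, which is why the prediction surface of a regression tree looks like a set of flat steps.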

180
00:09:05,700 --> 00:09:10,200
But you do need to remember
what we are working towards.

181
00:09:10,233 --> 00:09:13,500
The whole point of this exercise is to add

182
00:09:13,500 --> 00:09:18,000
more information into our chart,
into our system, to better predict

183
00:09:18,000 --> 00:09:21,666
y, because if you think about it,
what was our other option?

184
00:09:21,900 --> 00:09:23,533
What is our default option?

185
00:09:23,533 --> 00:09:26,766
The default option,
without running any machine

186
00:09:26,766 --> 00:09:30,600
learning on this
data set, is to just take all of the points

187
00:09:30,600 --> 00:09:34,533
and take the average across
all of the points, and whatever that is,

188
00:09:34,666 --> 00:09:37,766
wherever our new point,
the new element of data

189
00:09:37,766 --> 00:09:41,200
that is added to our data set,
wherever it falls, we just always assign

190
00:09:41,200 --> 00:09:45,300
it that average
for all of the points that we had existing

191
00:09:45,300 --> 00:09:45,800
previously.

192
00:09:45,800 --> 00:09:49,800
What we did now is we've split
our diagram up into these terminal leaves.

193
00:09:49,800 --> 00:09:53,800
The machine learning algorithm has added
information to our entire system.

194
00:09:53,966 --> 00:09:57,766
And so now we can more accurately predict
the value

195
00:09:57,766 --> 00:10:02,100
or assign the value of y
to a new incoming element.

196
00:10:02,400 --> 00:10:05,600
And as you can see, now the average is
not just across all of them.

197
00:10:05,600 --> 00:10:11,333
The average is taken in specific
parts or segments of our scatterplot.

198
00:10:11,333 --> 00:10:14,733
And therefore it is,
or it's supposed to be, more accurate.

199
00:10:14,766 --> 00:10:17,766
That's the whole point
of the regression tree.
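
The comparison with the default option can be checked on a tiny example. This sketch uses made-up toy data and a single assumed split at x1 < 20, then compares the squared error of predicting one overall average against predicting a separate average for each segment.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical (x1, y) points: y is low on the left of 20 and high on the right.
points = [(5, 1.0), (10, 1.2), (15, 0.8), (25, 9.0), (30, 9.5), (35, 8.5)]
ys = [y for _, y in points]

# Default option: one average for every point.
global_mean = mean(ys)
baseline_error = sum((y - global_mean) ** 2 for y in ys)

# Tree option: one average per segment after splitting at x1 < 20.
left = [y for x, y in points if x < 20]
right = [y for x, y in points if x >= 20]
left_mean, right_mean = mean(left), mean(right)
tree_error = (sum((y - left_mean) ** 2 for y in left)
              + sum((y - right_mean) ** 2 for y in right))

print(tree_error < baseline_error)  # True
```

On data like this the per-segment averages fit far better than the single overall average, which is the gain the splits are meant to deliver. On new, unseen points the improvement is not guaranteed, which matches the "supposed to be more accurate" caveat above.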

200
00:10:17,833 --> 00:10:19,600
And now the last thing we have left to do

201
00:10:19,600 --> 00:10:23,800
is to add the values into our
decision tree.

202
00:10:23,800 --> 00:10:26,033
So basically
we just add those values in here.

203
00:10:26,033 --> 00:10:29,433
And now whenever we have a new value,

204
00:10:29,433 --> 00:10:34,033
what would happen
is the algorithm would go through

205
00:10:34,600 --> 00:10:37,800
these checks, and it would check
where it falls and assign the value.

206
00:10:38,266 --> 00:10:39,266
And that's pretty much it.

207
00:10:39,266 --> 00:10:43,900
So the scatterplot is more for
visualization and conceptual purposes.

208
00:10:43,900 --> 00:10:46,300
So you can maybe derive some insights
from there.

209
00:10:46,300 --> 00:10:50,100
But the core of the decision
tree is actually held here.

210
00:10:50,300 --> 00:10:53,300
That's why the algorithm
is called a regression tree.

211
00:10:53,766 --> 00:10:55,266
I hope you enjoyed today's tutorial.

212
00:10:55,266 --> 00:10:58,333
And hopefully we did break down
this quite complex

213
00:10:58,333 --> 00:11:01,566
topic
into some simple and actionable steps,

214
00:11:01,866 --> 00:11:03,933
and I look forward
to seeing you next time.

215
00:11:03,933 --> 00:11:05,866
Until then, enjoy machine learning.