1
00:00:00,100 --> 00:00:03,133
And now we actually need
to input a third argument.

2
00:00:03,666 --> 00:00:05,066
Can you guess what that is?

3
00:00:05,066 --> 00:00:05,700
For those of you

4
00:00:05,700 --> 00:00:09,600
who followed the Python tutorial, well,
you will guess what it's going to be.

5
00:00:09,966 --> 00:00:11,066
It's actually going to be

6
00:00:11,066 --> 00:00:13,566
ntree,
the number of trees in the forest.

7
00:00:13,566 --> 00:00:17,733
Well, of course we're building a random
forest, so it's actually a lot better

8
00:00:17,733 --> 00:00:21,200
if we can choose the number of trees
that we build in our forest.

9
00:00:21,500 --> 00:00:24,433
And it's even better considering the fact
that we're going to play around

10
00:00:24,433 --> 00:00:26,100
with different numbers of trees.

11
00:00:26,100 --> 00:00:29,566
That is, we're going to start
with ten trees, with a forest of ten trees.

12
00:00:29,900 --> 00:00:33,666
And then, you know, we'll try with a lot
more than ten trees, like 100 trees

13
00:00:33,666 --> 00:00:36,666
or 300 trees or 500 trees.

14
00:00:36,766 --> 00:00:40,066
So that's where we're going
to input the third argument, ntree.

15
00:00:40,466 --> 00:00:43,033
And we're going to start with ten trees.

16
00:00:43,033 --> 00:00:44,766
All right. So let's start with this.

17
00:00:44,766 --> 00:00:48,233
And that's all the arguments
we need to build a random forest.

18
00:00:48,300 --> 00:00:52,333
We only need the independent variables,
the dependent variable and the number of

19
00:00:52,333 --> 00:00:52,966
trees.

20
00:00:52,966 --> 00:00:56,866
And that will already make a robust
random forest regression model.

21
00:00:56,866 --> 00:01:01,466
And then we will make it even more robust
by adding more trees in the forest.

22
00:01:02,133 --> 00:01:05,200
But before we continue,
let's set the random factors

23
00:01:05,200 --> 00:01:08,066
to something fixed
so that we all get the same results.

24
00:01:08,066 --> 00:01:11,266
So, you know, in Python we used a random_state
parameter equal to zero.

25
00:01:11,433 --> 00:01:14,433
Here we can do the same in R by using
the

26
00:01:15,066 --> 00:01:17,400
set.seed function.

27
00:01:17,400 --> 00:01:19,866
And then in this function
we actually input a seed.

28
00:01:19,866 --> 00:01:22,333
And you know we can use whatever seed
we want.

29
00:01:22,333 --> 00:01:26,833
In Python we usually take 0 or 42,
and in R what we like to do

30
00:01:26,833 --> 00:01:30,500
is, you know, to take either
123 or 1234.

31
00:01:30,633 --> 00:01:33,166
So let's use
the seed to all get the same result.

32
00:01:33,166 --> 00:01:35,666
And that's
what makes this tutorial easier to follow

33
00:01:35,666 --> 00:01:37,866
if you're coding at the same time.
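
For anyone coding along in Python rather than R, here is a minimal sketch of what fixing a seed does, using only the standard library's `random` module. The helper name `draw_numbers` and the seeds 1234 and 42 are illustrative choices, not the tutorial's own code.

```python
import random

def draw_numbers(seed, count=5):
    """Return `count` pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)  # a private generator, so the global state is untouched
    return [rng.random() for _ in range(count)]

first = draw_numbers(1234)
second = draw_numbers(1234)  # same seed: the sequence repeats exactly
other = draw_numbers(42)     # different seed: a different sequence

print(first == second)
print(first == other)
```

The same idea is why everyone following along with one fixed seed should end up with the same forest.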

34
00:01:37,866 --> 00:01:39,333
So now we're all good.

35
00:01:39,333 --> 00:01:42,133
We're actually all good
with the whole code.

36
00:01:42,133 --> 00:01:44,166
We don't have anything to replace.

37
00:01:44,166 --> 00:01:46,666
The only thing that we'll do now is to,
you know,

38
00:01:46,666 --> 00:01:50,400
try several random forests
with different numbers of trees.

39
00:01:50,733 --> 00:01:54,000
And look at the visualization results
and look at the prediction

40
00:01:54,166 --> 00:01:58,633
to see
if we're getting close to the supposed

41
00:01:58,633 --> 00:02:01,866
160K previous salary of our new employee
that is about to be hired.

42
00:02:02,700 --> 00:02:03,866
So let's do it.

43
00:02:03,866 --> 00:02:06,366
Let's execute the sections one by one.

44
00:02:06,366 --> 00:02:08,033
So let's import the data set first.

45
00:02:09,533 --> 00:02:10,200
Here we go.

46
00:02:10,200 --> 00:02:11,766
Dataset well imported.

47
00:02:11,766 --> 00:02:15,433
We make sure we have our two columns
the independent variable level

48
00:02:15,666 --> 00:02:18,666
and the dependent variable salary.
Perfect.

49
00:02:18,866 --> 00:02:21,933
Now no need to split the data set
into the training set and the test set.

50
00:02:22,200 --> 00:02:24,233
No need to apply feature scaling.

51
00:02:24,233 --> 00:02:28,600
And now time
to create our first random forest.

52
00:02:28,733 --> 00:02:30,200
So let's do this.

53
00:02:30,200 --> 00:02:33,200
Let's execute this code section here.

54
00:02:33,700 --> 00:02:35,433
And here it is. Random forest

55
00:02:35,433 --> 00:02:37,400
well created. Perfect.

56
00:02:37,400 --> 00:02:39,066
So now it's time to have fun.

57
00:02:39,066 --> 00:02:40,833
Would you like to visualize the results

58
00:02:40,833 --> 00:02:43,200
first or get the prediction?

59
00:02:43,200 --> 00:02:45,833
Well,
first let's maybe visualize the results

60
00:02:45,833 --> 00:02:48,300
because we want to make sure
we have the right model

61
00:02:48,300 --> 00:02:51,600
and we want to validate it because we will
try several numbers of trees.

62
00:02:51,866 --> 00:02:53,333
Here we are starting with ten trees.

63
00:02:53,333 --> 00:02:55,800
So we want to see
if it looks like a correct model.

64
00:02:55,800 --> 00:02:58,300
So I'm going to execute this section.

65
00:02:58,300 --> 00:03:00,666
Here we go. And let's see what we'll get.

66
00:03:02,433 --> 00:03:02,766
Okay.

67
00:03:02,766 --> 00:03:04,266
So first of all this looks fine.

68
00:03:04,266 --> 00:03:06,900
We don't seem to have any problem here.

69
00:03:06,900 --> 00:03:09,900
The only thing that we can improve
very quickly is actually, you know,

70
00:03:10,000 --> 00:03:11,600
those straight lines here.

71
00:03:11,600 --> 00:03:13,066
They are supposed to be vertical.

72
00:03:13,066 --> 00:03:15,500
And to get a better
representation of this,

73
00:03:15,500 --> 00:03:20,566
we just need to increase the resolution
as we did for decision tree regression.

74
00:03:20,566 --> 00:03:23,566
So let's add 0.01.
That will be sufficient.

75
00:03:23,766 --> 00:03:26,766
And let's re-execute this.

76
00:03:27,366 --> 00:03:28,366
And now much better.
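
To see why a finer resolution helps, here is a small Python sketch. A line plot joins consecutive grid points, so a jump in the staircase is drawn as a slanted segment that is one grid step wide. The staircase `step_model`, its salary values and the helper `make_grid` are hypothetical, chosen only for illustration.

```python
def make_grid(start, stop, step):
    """Evenly spaced grid from start to stop (inclusive)."""
    n = round((stop - start) / step)
    return [round(start + k * step, 10) for k in range(n + 1)]

def step_model(x):
    """Hypothetical staircase with a single jump at level 6.5."""
    return 150_000 if x < 6.5 else 200_000

widths = {}
for step in (0.1, 0.01):
    grid = make_grid(1, 10, step)
    # Width of the grid interval across which the prediction jumps:
    # this is how wide the "almost vertical" segment looks on the plot.
    widths[step] = max(b - a for a, b in zip(grid, grid[1:])
                       if step_model(a) != step_model(b))
    print(step, len(grid), widths[step])
```

With the smaller step the slanted segment shrinks to a tenth of its width, which is why the lines look vertical.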

77
00:03:28,366 --> 00:03:29,500
It almost looks like it's

78
00:03:29,500 --> 00:03:33,466
some vertical straight lines,
which better represent the discontinuities.

79
00:03:33,866 --> 00:03:34,966
And so now what can we say?

80
00:03:34,966 --> 00:03:38,300
Let's zoom on this plot
to have a better look.

81
00:03:38,733 --> 00:03:40,700
And now listen to it.

82
00:03:40,700 --> 00:03:44,800
Okay, so the answer to the enigma
that I asked you in the previous section

83
00:03:44,800 --> 00:03:47,100
and that I was asking you again
in this tutorial,

84
00:03:47,100 --> 00:03:50,900
is that we simply get more steps
in the stairs

85
00:03:51,266 --> 00:03:55,066
by having several decision trees
instead of one decision tree.

86
00:03:55,400 --> 00:04:00,200
We have a lot more steps in the stairs
than what we had with one decision tree,

87
00:04:00,700 --> 00:04:04,533
and therefore we have a lot more splits
of the whole range of levels,

88
00:04:04,533 --> 00:04:07,433
and therefore a lot more intervals
of the different levels.

89
00:04:07,433 --> 00:04:12,000
So each straight horizontal line here
separated by these vertical lines

90
00:04:12,166 --> 00:04:15,166
is one interval, that is, one split.

91
00:04:15,233 --> 00:04:18,700
And the fact that we get more steps in
the stairs is actually quite intuitive

92
00:04:18,700 --> 00:04:23,700
because, you know, if we get, for example,
this prediction here for the 6.5 level,

93
00:04:23,800 --> 00:04:27,866
well, what happened for this prediction
is that we had ten trees voting

94
00:04:27,866 --> 00:04:32,200
on which step the salary of the 6.5
level position would be.

95
00:04:32,833 --> 00:04:36,266
And then the random forest took the
average of all the different predictions

96
00:04:36,266 --> 00:04:40,400
of the salary of the 6.5 level made
by all the different trees in the forest.

97
00:04:40,733 --> 00:04:44,600
And for example, if we take the fourth
position level, ten votes were made.

98
00:04:44,900 --> 00:04:47,833
Each of these ten votes
corresponds to one prediction

99
00:04:47,833 --> 00:04:51,033
of the level 4 salary
made by each one of those ten trees.

100
00:04:51,333 --> 00:04:54,666
And then the random forest
took the average of these ten predictions.

101
00:04:54,800 --> 00:04:58,300
And this average is nothing else
than the prediction of the level

102
00:04:58,300 --> 00:05:01,300
4 salary
made by the random forest itself.
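
The voting-then-averaging idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the R package's implementation. The level/salary pairs in `DATA` are assumed for the example, and the names `build_tree` and `build_forest` are hypothetical. Each tree is grown on a bootstrap sample until every leaf holds a single level.

```python
import random

# Illustrative level/salary pairs; assumed for this sketch, not from the transcript.
DATA = [(1, 45_000), (2, 50_000), (3, 60_000), (4, 80_000), (5, 110_000),
        (6, 150_000), (7, 200_000), (8, 300_000), (9, 500_000), (10, 1_000_000)]

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def sse(points):
    """Sum of squared errors around the mean salary of the points."""
    m = avg(y for _, y in points)
    return sum((y - m) ** 2 for _, y in points)

def build_tree(points):
    """Grow a small 1D regression tree and return it as a predict(x) function."""
    levels = sorted({x for x, _ in points})
    if len(levels) == 1:
        leaf = avg(y for _, y in points)
        return lambda x: leaf
    # Candidate thresholds are midpoints between neighbouring levels.
    candidates = [(a + b) / 2 for a, b in zip(levels, levels[1:])]

    def cost(t):
        left = [p for p in points if p[0] < t]
        right = [p for p in points if p[0] >= t]
        return sse(left) + sse(right)

    t = min(candidates, key=cost)  # split that leaves the least squared error
    left = build_tree([p for p in points if p[0] < t])
    right = build_tree([p for p in points if p[0] >= t])
    return lambda x: left(x) if x < t else right(x)

def build_forest(points, n_trees, seed):
    rng = random.Random(seed)  # fixed seed so reruns give the same forest
    # Each tree sees its own bootstrap sample (drawn with replacement).
    return [build_tree([rng.choice(points) for _ in points])
            for _ in range(n_trees)]

forest = build_forest(DATA, n_trees=10, seed=1234)
votes = [tree(6.5) for tree in forest]  # one vote per tree for level 6.5
prediction = avg(votes)                 # the forest's prediction is their average
print(votes)
print(prediction)
```

Printing `votes` next to `prediction` makes it visible that the forest's answer is just the mean of what the ten trees said.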

103
00:05:01,800 --> 00:05:05,233
And so we get more steps,
because simply the whole range of levels

104
00:05:05,233 --> 00:05:07,100
is split into more intervals.

105
00:05:07,100 --> 00:05:10,600
And that is because the random forest
is calculating many different

106
00:05:10,600 --> 00:05:14,333
averages of its decision trees'
predictions in each of these intervals.

107
00:05:14,900 --> 00:05:17,200
So that's what happened.
It's quite intuitive.

108
00:05:17,200 --> 00:05:19,500
However, there is something important
to point out here,

109
00:05:19,500 --> 00:05:23,233
which is that if we add a lot more trees
in our random forest,

110
00:05:23,433 --> 00:05:27,600
well, it doesn't mean
we'll get a lot more steps in the stairs,

111
00:05:27,866 --> 00:05:30,100
because the more trees you add,
the more

112
00:05:30,100 --> 00:05:33,833
the average of the different predictions
made by the trees is converging

113
00:05:33,866 --> 00:05:35,200
to the same average.

114
00:05:35,200 --> 00:05:39,033
You know, this is based on the same
technique, entropy and information gain.

115
00:05:39,166 --> 00:05:42,266
So the more you add trees, the more
the average of these votes

116
00:05:42,266 --> 00:05:45,266
will converge to the same ultimate average

117
00:05:45,333 --> 00:05:49,266
and therefore it will converge
to a certain shape of stairs here.
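
Here is a rough Python sketch of why the number of steps levels off. It relies on a simplification that is an assumption of this example: each tree is treated as a fully grown one-dimensional step function over its bootstrap sample, which does not reproduce the R package's defaults. The `DATA` values and all function names are illustrative. In this setup every split is a midpoint between two of the ten levels, so there are at most 17 possible split points and at most 18 steps, however many trees are added.

```python
import bisect
import random

# Illustrative level/salary pairs; assumed for this sketch, not from the transcript.
DATA = [(1, 45_000), (2, 50_000), (3, 60_000), (4, 80_000), (5, 110_000),
        (6, 150_000), (7, 200_000), (8, 300_000), (9, 500_000), (10, 1_000_000)]

def fit_step_tree(sample):
    """Simplified fully grown 1D tree: one step per distinct level in the sample."""
    points = sorted(set(sample))
    thresholds = [(a[0] + b[0]) / 2 for a, b in zip(points, points[1:])]
    salaries = [salary for _, salary in points]
    # bisect_right counts the thresholds <= x, which is the index of x's step.
    return lambda x: salaries[bisect.bisect_right(thresholds, x)]

def build_forest(n_trees, seed):
    rng = random.Random(seed)
    return [fit_step_tree([rng.choice(DATA) for _ in DATA])
            for _ in range(n_trees)]

def predict(forest, x):
    return sum(tree(x) for tree in forest) / len(forest)

def count_steps(forest, grid):
    """Number of flat segments in the forest's prediction along the grid."""
    preds = [predict(forest, x) for x in grid]
    return 1 + sum(1 for a, b in zip(preds, preds[1:]) if a != b)

GRID = [1 + k * 0.01 for k in range(901)]  # levels 1 to 10 in steps of 0.01
steps = {}
preds_65 = {}
for n_trees in (1, 10, 100, 500):
    forest = build_forest(n_trees, seed=1234)
    steps[n_trees] = count_steps(forest, GRID)
    preds_65[n_trees] = predict(forest, 6.5)
    print(n_trees, steps[n_trees], round(preds_65[n_trees]))
```

Comparing the printed rows for 100 and 500 trees is a quick way to check how much the staircase and the level 6.5 prediction still move once the forest is already large.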

118
00:05:49,366 --> 00:05:51,733
So it's important
to visualize this as well.

119
00:05:51,733 --> 00:05:55,300
And now since we have our intuition
of the visualization of the random forest

120
00:05:55,300 --> 00:05:58,500
regression in 1D, let's
see what happens with the prediction.