1
00:00:00,066 --> 00:00:01,800
Okay, so let's input the arguments.

2
00:00:01,800 --> 00:00:03,200
As you remember, the first argument

3
00:00:03,200 --> 00:00:06,600
was the matrix of features
the matrix of independent variables.

4
00:00:07,033 --> 00:00:13,500
And that is the training set excluding
the last column, which has index three.

5
00:00:13,666 --> 00:00:15,200
Because as you remember
in the training set

6
00:00:15,200 --> 00:00:18,600
we have the first two columns
which are the independent variables

7
00:00:19,033 --> 00:00:22,733
age and estimated salary, which therefore
have indexes one and two.

8
00:00:23,000 --> 00:00:25,200
And we have the third
column indexed by three

9
00:00:25,200 --> 00:00:28,200
which is our dependent variable
vector purchased.

10
00:00:28,333 --> 00:00:29,966
So here minus three.

11
00:00:29,966 --> 00:00:31,166
Then what was the next argument?

12
00:00:31,166 --> 00:00:34,200
The next argument was y,
the dependent variable vector.

13
00:00:34,533 --> 00:00:37,533
And then here we'll take the training set.

14
00:00:37,633 --> 00:00:40,833
And let's pick it this way to specify
the name of the dependent variable:

15
00:00:40,833 --> 00:00:43,833
a dollar sign here, and Purchased.

16
00:00:43,866 --> 00:00:46,866
Purchased is the name of our dependent
variable column.
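As a minimal sketch of the two selections just described, here is a toy data frame. The column names Age and EstimatedSalary are assumptions based on the variables named in the video, and the values are invented for illustration.

```r
# Toy stand-in for the training set; the values are invented for illustration.
training_set <- data.frame(
  Age = c(25, 40, 33),
  EstimatedSalary = c(30000, 85000, 52000),
  Purchased = c(0, 1, 0)
)

# Negative index: every column except the third one (the features).
x <- training_set[-3]

# Dollar sign: a single column selected by name (the dependent variable).
y <- training_set$Purchased

print(names(x))
print(y)
```

The negative index drops a column by position, while the dollar sign picks one column by name, so the two arguments never overlap.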

17
00:00:47,233 --> 00:00:49,600
All right.
So we almost have everything we need.

18
00:00:49,600 --> 00:00:52,500
The last thing we need now is of course
the number of trees.

19
00:00:52,500 --> 00:00:56,433
And that is ntree equals ten.
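Put together, the three arguments could look like the sketch below. It assumes the randomForest package is installed, and it uses synthetic data with a made-up labelling rule instead of the course data set, so only the shape of the call matters here.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(123)  # arbitrary seed so the sketch is repeatable

# Synthetic stand-in for the scaled training set; not the course data.
n <- 300
training_set <- data.frame(
  Age = rnorm(n),
  EstimatedSalary = rnorm(n)
)
# Hypothetical rule to generate labels; a factor makes randomForest classify.
training_set$Purchased <- factor(
  as.integer(training_set$Age + training_set$EstimatedSalary > 0),
  levels = c(0, 1)
)

classifier <- randomForest(
  x = training_set[-3],        # features: all columns except the third
  y = training_set$Purchased,  # dependent variable vector
  ntree = 10                   # number of trees in the forest
)

print(classifier$ntree)
```

Passing y as a factor is what tells randomForest to build a classifier; with a numeric y it would fit a regression forest instead.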

20
00:00:57,000 --> 00:00:58,800
You can play around
with the ntree argument.

21
00:00:58,800 --> 00:01:01,566
You can choose
many more trees in the forest.

22
00:01:01,566 --> 00:01:03,300
You'll observe some interesting results.

23
00:01:03,300 --> 00:01:07,066
It's interesting to see
what different teams of trees can do

24
00:01:07,066 --> 00:01:10,900
to predict the response of your users
in the social network,

25
00:01:10,900 --> 00:01:13,333
whether they buy the SUV, yes or no.

26
00:01:13,333 --> 00:01:17,100
But if you do this,
make sure to pay attention to overfitting,

27
00:01:17,100 --> 00:01:18,900
which you want to avoid.

28
00:01:18,900 --> 00:01:22,666
You don't want to overfit the random
forest classifier to the training set,

29
00:01:22,866 --> 00:01:27,066
because if you do this, then it might make
some poor predictions on a new set.

30
00:01:27,466 --> 00:01:30,266
You can actually check it out
with the test set, but here we'll

31
00:01:30,266 --> 00:01:33,266
choose ten trees
and we'll see what happens.

32
00:01:33,633 --> 00:01:34,000
All right.

33
00:01:34,000 --> 00:01:36,566
So actually we're done with the template.

34
00:01:36,566 --> 00:01:38,966
We changed everything we had to change.

35
00:01:38,966 --> 00:01:40,300
And now we can just,

36
00:01:40,300 --> 00:01:44,433
you know select everything
and execute to make everything ready.

37
00:01:44,500 --> 00:01:48,433
You can actually take some coffee or tea
and you can just select

38
00:01:48,433 --> 00:01:51,433
everything
and execute to watch the results.

39
00:01:51,433 --> 00:01:53,366
But let's rather do it step by step.

40
00:01:53,366 --> 00:01:57,866
We'll just do the first pre-processing
step all at once here.

41
00:01:57,866 --> 00:02:00,300
So I just selected
the pre-processing phase.

42
00:02:00,300 --> 00:02:01,800
And now I'll press Command or Control

43
00:02:01,800 --> 00:02:03,600
plus Enter to execute.

44
00:02:03,600 --> 00:02:04,466
All right, all good.

45
00:02:04,466 --> 00:02:08,666
We have our data set, our training set
and our test set.

46
00:02:09,000 --> 00:02:10,166
So everything looks fine.

47
00:02:10,166 --> 00:02:14,500
We have 400 observations
in total, 300 observations

48
00:02:14,500 --> 00:02:19,100
that went into the training set and 100
observations that went into the test set.
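For reference, a 300/100 split of 400 rows can be sketched in base R as follows. The data are synthetic and the use of sample() is an assumption for illustration; the course template may use a different splitting function.

```r
set.seed(123)  # arbitrary seed so the split is repeatable

# Synthetic stand-in for the 400-row data set; not the course data.
dataset <- data.frame(
  Age = rnorm(400),
  EstimatedSalary = rnorm(400),
  Purchased = rbinom(400, size = 1, prob = 0.4)
)

# Draw 300 row indices at random for training; the rest form the test set.
train_rows <- sample(nrow(dataset), size = 300)
training_set <- dataset[train_rows, ]
test_set <- dataset[-train_rows, ]

print(c(nrow(training_set), nrow(test_set)))
```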

49
00:02:19,633 --> 00:02:23,033
As you can see, the training set
and the test set are scaled

50
00:02:23,366 --> 00:02:27,766
because in the end, we are plotting some
graphic results with a resolution of 0.01.

51
00:02:27,900 --> 00:02:33,533
So in order for our code to execute faster
and not actually break our code,

52
00:02:33,600 --> 00:02:36,766
we need to apply feature scaling
to our training set and our test set.
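A minimal sketch of feature scaling with base R's scale(), assuming a toy training set with invented values. Only the two feature columns are standardised; the label column is left untouched.

```r
# Toy stand-in for the training set; the values are invented for illustration.
training_set <- data.frame(
  Age = c(19, 35, 26, 47, 58),
  EstimatedSalary = c(19000, 20000, 43000, 25000, 144000),
  Purchased = c(0, 0, 0, 1, 1)
)

# Standardise the two feature columns; leave the third (label) column alone.
training_set[-3] <- scale(training_set[-3])

print(round(colMeans(training_set[-3]), 10))  # means are now about 0
print(apply(training_set[-3], 2, sd))         # standard deviations are now 1
```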

53
00:02:37,300 --> 00:02:39,966
Otherwise, we wouldn't need to do that
because the random

54
00:02:39,966 --> 00:02:43,400
forest classification
is not based on Euclidean distances,

55
00:02:43,900 --> 00:02:47,333
but it's based on, you know, conditions
on the independent variables.

56
00:02:47,700 --> 00:02:51,300
But because of this code here
that is compute intensive,

57
00:02:51,533 --> 00:02:54,766
we need to apply feature scaling
so that everything is well executed.
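To see why the plotting code is so compute intensive without scaling, the sketch below counts how many grid points a 0.01 resolution produces. The feature ranges are hypothetical and chosen only to show the order of magnitude.

```r
# Number of grid points along one axis at the given resolution.
axis_points <- function(lo, hi, step = 0.01) {
  round((hi - lo) / step) + 1
}

# Hypothetical ranges: scaled features versus raw age and salary values.
scaled_cells <- axis_points(-3, 3) * axis_points(-3, 3)
raw_cells <- axis_points(18, 60) * axis_points(15000, 150000)

print(scaled_cells)
print(raw_cells)
```

With scaled features the grid stays in the hundreds of thousands of points, whereas a raw salary axis alone would need millions of steps, and the classifier has to predict every grid point.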

58
00:02:55,466 --> 00:02:56,800
All right. So let's do this.

59
00:02:56,800 --> 00:02:58,333
Let's watch the results.

60
00:02:58,333 --> 00:03:02,466
We just need to create our classifier here
by executing this section.

61
00:03:02,766 --> 00:03:05,066
So here I'll just do this.

62
00:03:05,066 --> 00:03:06,300
All right okay.

63
00:03:06,300 --> 00:03:08,633
Now let's predict the test set results.

64
00:03:08,633 --> 00:03:12,300
Then we have the confusion matrix
which will tell us in a flash

65
00:03:12,300 --> 00:03:14,366
how many incorrect predictions we have.

66
00:03:14,366 --> 00:03:16,433
So let's actually do it directly.

67
00:03:16,433 --> 00:03:19,866
It will be faster to see how well our random
forest classifier did

68
00:03:19,866 --> 00:03:21,400
on the predictions.

69
00:03:21,400 --> 00:03:23,600
So let's execute this.

70
00:03:23,600 --> 00:03:26,066
And now let's enter

71
00:03:26,066 --> 00:03:29,233
cm here in the console and press Enter.

72
00:03:29,833 --> 00:03:32,233
And here we have our confusion matrix

73
00:03:32,233 --> 00:03:36,300
Okay, we have seven
plus ten equals 17 incorrect predictions.
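The way incorrect predictions are read off a confusion matrix can be sketched with base R's table(). The label vectors below are invented for illustration and are not the course's test set.

```r
# Invented labels for illustration; not the course's test set.
actual <- c(0, 0, 1, 1, 0, 1, 0, 1)
y_pred <- c(0, 1, 1, 0, 0, 1, 0, 1)

# Rows are the actual classes, columns the predicted classes.
cm <- table(actual, y_pred)
print(cm)

# Correct predictions sit on the diagonal; everything else is an error.
incorrect <- sum(cm) - sum(diag(cm))
print(incorrect)
```

The two off-diagonal cells play the same role as the seven and the ten in the video: one counts users wrongly predicted to buy, the other users wrongly predicted not to buy.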

74
00:03:36,600 --> 00:03:37,866
Well that's not too bad.

75
00:03:37,866 --> 00:03:41,300
Just for fun, let's just pick
another number of trees.

76
00:03:41,300 --> 00:03:44,300
Like, for example, let's pick 500 trees.

77
00:03:44,533 --> 00:03:45,766
500 trees is a lot.

78
00:03:45,766 --> 00:03:49,500
That's really a big army of trees
to make some predictions.

79
00:03:49,900 --> 00:03:52,466
And now, just for fun, let's take this.

80
00:03:52,466 --> 00:03:55,166
I don't need to include this
because my library was

81
00:03:55,166 --> 00:03:58,300
already selected from the previous
execution of this code section here.

82
00:03:58,300 --> 00:04:02,566
So let's rebuild a new classifier
with 500 trees.

83
00:04:02,566 --> 00:04:05,600
And now let's
look at the confusion matrix.

84
00:04:05,866 --> 00:04:08,400
But before that, let's build
our vector of predictions.

85
00:04:08,400 --> 00:04:09,333
Because right now

86
00:04:09,333 --> 00:04:13,300
the y_pred vector of predictions is the one
given by the random forest with ten trees.

87
00:04:13,933 --> 00:04:15,900
So let's re-execute this. All right.

88
00:04:15,900 --> 00:04:19,966
Now we have y_pred
as the vector of predictions

89
00:04:19,966 --> 00:04:22,966
predicted by the random forest
with 500 trees.
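The point about re-executing the prediction step can be sketched as follows: each time the classifier is rebuilt, the predictions and the confusion matrix have to be recomputed from it. This assumes the randomForest package is installed and uses synthetic data, so the error counts it prints will not match the ones in the video.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(123)  # arbitrary seed so the sketch is repeatable

# Synthetic stand-in data with a made-up labelling rule; not the course data.
make_set <- function(n) {
  d <- data.frame(Age = rnorm(n), EstimatedSalary = rnorm(n))
  noisy <- d$Age + d$EstimatedSalary + rnorm(n, sd = 0.5)
  d$Purchased <- factor(as.integer(noisy > 0), levels = c(0, 1))
  d
}
training_set <- make_set(300)
test_set <- make_set(100)

count_errors <- function(ntree) {
  classifier <- randomForest(x = training_set[-3],
                             y = training_set$Purchased,
                             ntree = ntree)
  # Predictions must be recomputed from the freshly built classifier.
  y_pred <- predict(classifier, newdata = test_set[-3])
  cm <- table(test_set[, 3], y_pred)
  sum(cm) - sum(diag(cm))
}

errors <- c(trees_10 = count_errors(10), trees_500 = count_errors(500))
print(errors)
```

Wrapping the three steps in one function is a design choice for this sketch: it makes it impossible to pair a new classifier with a stale y_pred.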

90
00:04:23,100 --> 00:04:25,133
And now let's look at the
confusion matrix.

91
00:04:25,133 --> 00:04:28,800
Remember, with ten trees
we had 17 incorrect predictions.

92
00:04:28,800 --> 00:04:29,966
And now let's see.

93
00:04:29,966 --> 00:04:33,233
Select, execute, cm, Enter.

94
00:04:33,600 --> 00:04:36,600
And now we have 15 incorrect predictions.

95
00:04:36,933 --> 00:04:37,500
Great.

96
00:04:37,500 --> 00:04:41,200
We invested 490 more trees to win
two correct predictions.

97
00:04:41,566 --> 00:04:45,633
So that definitely means that there are
a lot of useless trees in the team.
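The numbers quoted in the video can be turned into accuracies with a few lines of arithmetic. The test set size of 100 and the error counts of 17 and 15 come from the video; the variable names are made up for this sketch.

```r
test_size <- 100                               # test set size from the video
incorrect <- c(trees_10 = 17, trees_500 = 15)  # errors reported in the video

# Share of correct predictions for each forest size.
accuracy <- 1 - incorrect / test_size
print(accuracy)

# Extra correct predictions gained per additional tree.
gain_per_tree <- (incorrect[["trees_10"]] - incorrect[["trees_500"]]) / (500 - 10)
print(gain_per_tree)
```

Going from 83% to 85% accuracy costs 490 extra trees, which is a gain of roughly 0.004 correct predictions per added tree.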

98
00:04:46,066 --> 00:04:46,366
Okay.

99
00:04:46,366 --> 00:04:50,100
So, if you want,
let's maybe go back to ten trees here

100
00:04:50,100 --> 00:04:54,333
because obviously 500 trees
is not very useful.

101
00:04:55,200 --> 00:04:55,500
All right.

102
00:04:55,500 --> 00:04:58,466
So I'll just take that again.

103
00:04:58,466 --> 00:05:01,600
That as well and that as well.

104
00:05:01,866 --> 00:05:02,700
All right.

105
00:05:02,700 --> 00:05:05,066
And now let's look at the training
set results.

106
00:05:05,066 --> 00:05:06,800
So everything is fine here.

107
00:05:06,800 --> 00:05:10,033
We changed the title here
with random forest classification.

108
00:05:10,033 --> 00:05:11,233
So it's all good.

109
00:05:11,233 --> 00:05:14,166
We are ready to look
at the graphic result.

110
00:05:14,166 --> 00:05:17,533
And by the way, you can pause the video
now and try to guess what

111
00:05:17,533 --> 00:05:18,966
you're about to see.

112
00:05:18,966 --> 00:05:23,066
Try to guess the shape of the prediction
regions and of the prediction boundary.

113
00:05:23,300 --> 00:05:25,333
If you understood
decision trees correctly,

114
00:05:25,333 --> 00:05:28,466
then you would have no problem
guessing what's about to happen.

115
00:05:28,933 --> 00:05:29,266
All right.

116
00:05:29,266 --> 00:05:32,433
So I'm going to execute right now
Command or Control

117
00:05:32,433 --> 00:05:35,433
plus Enter to execute, and showtime.

118
00:05:35,766 --> 00:05:38,266
All right.
Wow. That's quite something here.

119
00:05:38,266 --> 00:05:41,700
So the points here
are the real observation points.

120
00:05:41,700 --> 00:05:44,300
These are the real users
of the social network.

121
00:05:44,300 --> 00:05:47,066
And then we have the regions here
the red region and the green region

122
00:05:47,066 --> 00:05:48,600
which are the prediction regions.

123
00:05:48,600 --> 00:05:51,700
The red region is the region
where our random forest classifier

124
00:05:51,700 --> 00:05:55,200
predicts that the user doesn't buy
the SUV, and the green region

125
00:05:55,200 --> 00:05:58,466
where our random forest classifier
predicts that the user buys the SUV.