1
00:00:00,233 --> 00:00:00,566
All right.

2
00:00:00,566 --> 00:00:03,766
Then, as
we said, we want to get the same criterion

3
00:00:03,766 --> 00:00:07,333
as in the intuition lectures,
meaning entropy with the information gain.

4
00:00:07,566 --> 00:00:08,533
So there we go.

5
00:00:08,533 --> 00:00:13,166
Let's add criterion equals,
in quotes, entropy.

6
00:00:13,766 --> 00:00:14,300
Great.

7
00:00:14,300 --> 00:00:18,100
And then finally that final parameter
random underscore state

8
00:00:18,566 --> 00:00:21,566
to which we set the value zero.

9
00:00:21,766 --> 00:00:22,600
Perfect.

10
00:00:22,600 --> 00:00:26,300
And now, the final step,
you know it by heart: classifier.

11
00:00:26,566 --> 00:00:29,333
Then from this classifier we call the fit

12
00:00:29,333 --> 00:00:33,166
method
which will train the classifier we've only built

13
00:00:33,166 --> 00:00:37,866
so far on the training set,
composed of the two arguments.

14
00:00:37,866 --> 00:00:42,566
We have two inputs here, which are X_train
for the matrix of features

15
00:00:42,566 --> 00:00:45,933
of the training set, and then y_train

16
00:00:45,933 --> 00:00:49,400
for the dependent variable vector
of the same training set.
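
The step just described can be sketched roughly as follows. This is only a minimal sketch, not the notebook's exact code: the data below is a synthetic stand-in generated with make_classification (an assumption, since the course CSV is not part of this transcript), and n_estimators=10 is taken from the later remark about putting the value back to ten.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course data set (assumption): 2 features, binary target.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Entropy criterion (information gain) and a fixed seed, as in the lecture.
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy",
                                    random_state=0)

# fit takes the matrix of features and the dependent variable vector
# of the training set.
classifier.fit(X_train, y_train)
```

With real data, X_train and y_train would come from your own train/test split instead of the synthetic one.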

17
00:00:49,800 --> 00:00:50,933
And that's it my friends.

18
00:00:50,933 --> 00:00:56,866
Now we're about to find out if we can beat
that record accuracy of 93%.

19
00:00:57,166 --> 00:00:58,933
I actually have a good feeling about this.

20
00:00:58,933 --> 00:01:01,433
We might beat it,
but let's not talk too fast.

21
00:01:01,433 --> 00:01:03,066
We never know what's going to happen.

22
00:01:03,066 --> 00:01:06,733
So first let's upload
the data set by clicking this folder here.

23
00:01:06,733 --> 00:01:08,866
You know let's upload it in the notebook.

24
00:01:08,866 --> 00:01:11,633
So right now as usual you know same story.

25
00:01:11,633 --> 00:01:12,966
The Colab notebook

26
00:01:12,966 --> 00:01:16,466
is connecting to a runtime to enable file
browsing on your machine.

27
00:01:16,800 --> 00:01:19,500
And we will get the upload button
in a second.

28
00:01:19,500 --> 00:01:20,533
There we go.

29
00:01:20,533 --> 00:01:25,400
So let's click it and let's
go to where we have our Machine Learning

30
00:01:25,400 --> 00:01:26,666
A-Z folder.

31
00:01:26,666 --> 00:01:28,666
There it is. Mine is on my machine.

32
00:01:28,666 --> 00:01:31,600
So we're going to go inside, then part
three, classification.

33
00:01:31,600 --> 00:01:34,433
Then section 20 random forest
classification.

34
00:01:34,433 --> 00:01:37,666
The last classification
model of this part. Congratulations again

35
00:01:37,666 --> 00:01:40,666
for making such huge progress
with this course.

36
00:01:40,766 --> 00:01:45,866
There we go inside, then Python,
and then Social_Network_Ads.csv.

37
00:01:46,533 --> 00:01:47,433
Let's open it.

38
00:01:47,433 --> 00:01:50,233
And now
we're very close to the final result,

39
00:01:50,233 --> 00:01:54,466
you know to the final discovery
of whether, yes or no, we're going to beat

40
00:01:54,466 --> 00:01:57,766
the record accuracy of 93%.
So there we go.

41
00:01:57,766 --> 00:01:59,600
Let's click runtime here.

42
00:01:59,600 --> 00:02:04,500
And then let's click run
all to build and train again

43
00:02:04,500 --> 00:02:06,366
the random forest classification. Here
we go.

44
00:02:06,366 --> 00:02:09,133
We have it now and our future prediction.

45
00:02:09,133 --> 00:02:13,066
So let's see, let's see, let's
see what we get. First, that prediction

46
00:02:13,066 --> 00:02:15,466
of the purchase decision
of that single customer of age

47
00:02:15,466 --> 00:02:19,600
30 and $87,000
estimated salary is correct, right?

48
00:02:19,600 --> 00:02:22,633
Because in reality,
this customer didn't buy the SUV.

49
00:02:23,100 --> 00:02:24,900
And now with the test result,

50
00:02:24,900 --> 00:02:28,166
let's scroll back up here
and let's see a bit what we have.

51
00:02:28,500 --> 00:02:30,700
So all this is correct here. Correct.

52
00:02:30,700 --> 00:02:32,633
One incorrect prediction here.

53
00:02:32,633 --> 00:02:34,466
Two other incorrect predictions here.

54
00:02:34,466 --> 00:02:36,466
Oh maybe we won't beat it.

55
00:02:36,466 --> 00:02:40,800
You know, let's see directly if we beat it
and well actually no.

56
00:02:40,800 --> 00:02:42,733
Wow. Okay. I'm very surprised.

57
00:02:42,733 --> 00:02:44,900
I thought we had a chance to beat it.

58
00:02:44,900 --> 00:02:46,533
I hope you're not too disappointed.

59
00:02:46,533 --> 00:02:50,833
But indeed, we didn't
beat that record accuracy of 93%.

60
00:02:51,000 --> 00:02:54,600
Because indeed, with the random forest,
we get 91%.

61
00:02:54,600 --> 00:02:55,766
Let's try to tune.

62
00:02:55,766 --> 00:02:57,600
You know, this is not our final word.

63
00:02:57,600 --> 00:03:00,733
Let's try to tune a bit
the number of estimators.

64
00:03:00,733 --> 00:03:02,133
Maybe we can get a better one.

65
00:03:02,133 --> 00:03:05,066
Let's try, for example,
the default value of 100.

66
00:03:05,066 --> 00:03:05,500
But you know

67
00:03:05,500 --> 00:03:09,900
I don't think we will even improve that
because we might get overfitting anyway.

68
00:03:09,900 --> 00:03:13,400
And this will not help of course
for the predictions of new observations

69
00:03:13,600 --> 00:03:14,333
in the test set.

70
00:03:14,333 --> 00:03:15,600
But anyway let's try.

71
00:03:15,600 --> 00:03:17,666
Let's run all again.

72
00:03:17,666 --> 00:03:19,800
So this will rebuild and retrain

73
00:03:19,800 --> 00:03:22,800
your random forest classification
with 100 trees.

74
00:03:23,233 --> 00:03:23,800
All right.

75
00:03:23,800 --> 00:03:25,633
We're about to get a new one. There we go.

76
00:03:25,633 --> 00:03:29,333
So now we have indeed
100 trees in the random forest.

77
00:03:29,733 --> 00:03:30,166
All right.

78
00:03:30,166 --> 00:03:32,900
The new result
prediction is still correct as a result.

79
00:03:32,900 --> 00:03:35,266
Okay.
And now let's see the confusion matrix.

80
00:03:35,266 --> 00:03:36,366
That's what I was telling you.

81
00:03:36,366 --> 00:03:37,700
Still 91%.

82
00:03:37,700 --> 00:03:40,600
So it was perhaps
better trained on the training set.

83
00:03:40,600 --> 00:03:43,433
But what we get on
the test set is just the same.
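
The comparison just made, 10 trees versus 100 trees, can be sketched like this. It is a hedged sketch on synthetic stand-in data (an assumption, not the course data set), so the accuracies it prints will not match the 91% seen in the video.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 2 features, binary target.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for n_trees in (10, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, criterion="entropy",
                                 random_state=0)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Rows are the true classes, columns are the predicted classes.
    cm = confusion_matrix(y_test, y_pred)
    results[n_trees] = accuracy_score(y_test, y_pred)
    print(n_trees, "trees")
    print(cm)
    print("test accuracy:", results[n_trees])
```

More trees usually fit the training set at least as well, but as seen here the test set accuracy does not have to improve.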

84
00:03:43,433 --> 00:03:47,033
So anyway, you know, clearly
the best models for our data set here,

85
00:03:47,033 --> 00:03:52,133
you know, for classification, are kernel
SVM and K-nearest neighbors.

86
00:03:52,133 --> 00:03:55,133
So I'm going to put that back
to ten, right?

87
00:03:55,200 --> 00:03:57,466
Press save, run everything again.

88
00:03:57,466 --> 00:04:01,233
And I'm going to show you,
you know the final visualization results

89
00:04:01,233 --> 00:04:03,566
for Random Forest
because it's always good to see it.

90
00:04:03,566 --> 00:04:07,100
You know even if we didn't
beat the accuracy let's observe them.

91
00:04:07,233 --> 00:04:09,433
Let's actually observe them
in the original file

92
00:04:09,433 --> 00:04:11,933
because it is right now running.

93
00:04:11,933 --> 00:04:14,400
All right.
So we'll find them at the bottom.

94
00:04:14,400 --> 00:04:15,200
And so there you go.

95
00:04:15,200 --> 00:04:16,800
That's the result of the training set.

96
00:04:16,800 --> 00:04:19,500
And below
you have the results on the test set.

97
00:04:19,500 --> 00:04:23,566
And indeed we see that even if it could be
very well trained on the training set,

98
00:04:23,800 --> 00:04:28,000
well, we still have some wrong predictions
here of green customers

99
00:04:28,000 --> 00:04:30,000
who fall in the wrong red region.

100
00:04:30,000 --> 00:04:31,700
And there's not much we can do,

101
00:04:31,700 --> 00:04:33,000
you know, when tuning the random forest

102
00:04:33,000 --> 00:04:36,933
classification
to correctly catch these customers here.

103
00:04:37,300 --> 00:04:39,600
But I want to say something now.

104
00:04:39,600 --> 00:04:39,933
You know,

105
00:04:39,933 --> 00:04:43,733
maybe you want to play with the diverse
classification models we implemented.

106
00:04:43,733 --> 00:04:46,733
And by playing with them I mean playing
with the parameters, you know,

107
00:04:46,733 --> 00:04:49,733
trying different values of the parameters.

108
00:04:49,766 --> 00:04:53,900
And please let me know, you know, in
either private message or in the Q&A,

109
00:04:54,200 --> 00:04:58,200
if you managed to beat 93%,
you know, as a final accuracy

110
00:04:58,200 --> 00:04:59,600
on the test set, of course.

111
00:04:59,600 --> 00:05:00,866
But let me know if you succeed.

112
00:05:00,866 --> 00:05:03,633
I'll be very interested
to see how you did it.

113
00:05:03,633 --> 00:05:06,933
All right, so here we are
at the end of the section.

114
00:05:06,933 --> 00:05:08,933
Congratulations for completing it.

115
00:05:08,933 --> 00:05:12,100
Now you have some great tools
in your classification toolkit.

116
00:05:12,366 --> 00:05:15,300
Please understand
that the best models we got here

117
00:05:15,300 --> 00:05:18,566
are just for this data set.
For your future data set,

118
00:05:18,566 --> 00:05:20,100
the best model might be another one.

119
00:05:20,100 --> 00:05:23,200
It might be random forest,
or it might be Naive Bayes.

120
00:05:23,400 --> 00:05:25,000
So you have to try all of them.
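
Trying all of them can be sketched as a simple loop over the models named in this lecture. This is only an illustrative sketch: the data is a synthetic stand-in (an assumption), the models use mostly default settings, and the feature scaling step used elsewhere in the course is left out for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 2 features, binary target.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "kernel SVM": SVC(kernel="rbf", random_state=0),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0),
}

# Train each model on the training set and score it on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])

best = max(scores, key=scores.get)
print("best on this data set:", best)
```

Which model comes out on top depends on the data set, which is exactly the point made here.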

121
00:05:25,000 --> 00:05:28,500
And speaking of which,
this is exactly what we'll do next.

122
00:05:28,500 --> 00:05:32,700
In this part three, I'm going to take all
these code templates here that we made.

123
00:05:32,700 --> 00:05:33,766
I'm going to simplify them.

124
00:05:33,766 --> 00:05:36,800
You know, I'm going to remove
all the prints and everything

125
00:05:36,900 --> 00:05:40,033
so that it can be very clear
and well-structured, and mostly

126
00:05:40,166 --> 00:05:44,633
so that you can get very efficient code
templates that you can try and deploy

127
00:05:44,666 --> 00:05:47,400
very quickly
and efficiently on your data sets,

128
00:05:47,400 --> 00:05:51,033
so that you can quickly figure out
what is the best model. And that's

129
00:05:51,033 --> 00:05:53,266
what we'll do in this last section of part
three.

130
00:05:53,266 --> 00:05:54,466
Can't wait to meet you there.

131
00:05:54,466 --> 00:05:56,200
And until then, enjoy machine learning.