1
00:00:00,533 --> 00:00:01,266
And now let's see.

2
00:00:01,266 --> 00:00:05,000
Okay, so the random forest classifier
definitely catches

3
00:00:05,000 --> 00:00:08,666
most of the users that didn't buy
the SUV in the right category.

4
00:00:08,666 --> 00:00:09,533
That is the red region.

5
00:00:09,533 --> 00:00:13,600
So that means that it correctly classified
most of the users who didn't buy the SUV.

6
00:00:13,600 --> 00:00:17,366
And then for the green users
who are the users who bought the SUV in reality,

7
00:00:17,600 --> 00:00:21,500
because as we can see,
most of them are in the right green region

8
00:00:22,266 --> 00:00:26,066
and it's desperately
trying to catch some outliers.

9
00:00:26,466 --> 00:00:27,600
We can call them this way.

10
00:00:27,600 --> 00:00:29,300
For example, this guy here is

11
00:00:29,300 --> 00:00:32,900
a user that didn't buy the SUV in reality
because this is a red point

12
00:00:33,266 --> 00:00:36,100
and it is way into the green region
here, as we can see.

13
00:00:36,100 --> 00:00:39,100
But the random forest classifier
managed to make this

14
00:00:39,333 --> 00:00:42,500
little rectangle part of the region in red

15
00:00:42,633 --> 00:00:46,600
to catch this user that didn't
buy the SUV and classify it well.

16
00:00:47,000 --> 00:00:48,566
But is it the smart way of doing it?

17
00:00:48,566 --> 00:00:52,033
Because what will tell you that
for some new observations, we will have,

18
00:00:52,033 --> 00:00:57,266
you know, some users who didn't buy
the SUV in this red rectangle here.

19
00:00:57,300 --> 00:01:00,800
So that looks like overfitting,
because it made this red rectangle here

20
00:01:00,800 --> 00:01:03,800
because we had this user indeed
who didn't buy the SUV.

21
00:01:03,900 --> 00:01:06,800
But nothing tells us that
for some new observations,

22
00:01:06,800 --> 00:01:10,166
we will have some users
who didn't buy the SUV in this

23
00:01:10,166 --> 00:01:13,500
red rectangle here,
so we should be careful with that.

24
00:01:13,800 --> 00:01:15,366
And same for this user here.

25
00:01:15,366 --> 00:01:19,366
As you can see, this user is in
some sort of an irregular red region here,

26
00:01:19,533 --> 00:01:23,066
but fortunately our random forest
classifier was not too obsessed

27
00:01:23,166 --> 00:01:25,500
with making all the predictions correct.

28
00:01:25,500 --> 00:01:28,700
Because as we can see,
this red user here is in the green region.

29
00:01:28,933 --> 00:01:32,000
So that means that it still paid
attention to overfitting,

30
00:01:32,000 --> 00:01:34,400
but not too much.
And we should be careful with that.

31
00:01:34,400 --> 00:01:37,333
So speaking of overfitting,
let's check that right now.

32
00:01:37,333 --> 00:01:42,300
Let's look at the test results right now
to see how this region is here.

33
00:01:42,300 --> 00:01:43,833
Because, you know, the regions won't change.

34
00:01:43,833 --> 00:01:46,133
These are the regions built by our model.

35
00:01:46,133 --> 00:01:50,100
So when we look at the test results
we will have the same red region

36
00:01:50,100 --> 00:01:53,100
here with this rectangle here
and green region here.

37
00:01:53,400 --> 00:01:56,500
But what will change
will be the test set observation points.

38
00:01:56,500 --> 00:01:58,233
That is all the red points
and the green points.

39
00:01:58,233 --> 00:01:59,166
This will change

40
00:01:59,166 --> 00:02:03,100
and we will see if we have some red points
here in this rectangle here.

41
00:02:03,100 --> 00:02:06,600
And actually probably not
because this looks like overfitting

42
00:02:06,600 --> 00:02:10,566
that occurred because our classifier
was fitted too much to the training set.

43
00:02:10,566 --> 00:02:12,700
So let's find out about that right now.

44
00:02:12,700 --> 00:02:18,000
Let's select this section dedicated
to visualizing the test set results.

45
00:02:18,566 --> 00:02:23,066
So I'll just select everything and press
Command or Control plus Enter to execute.

46
00:02:24,000 --> 00:02:24,433
All right.

47
00:02:24,433 --> 00:02:26,933
So what is the first thing you see here?

48
00:02:26,933 --> 00:02:28,366
Well yes indeed.

49
00:02:28,366 --> 00:02:33,233
This red rectangle here is totally
useless for new observations.

50
00:02:33,533 --> 00:02:38,466
So that was clearly a red rectangle region
to catch some users of the training set,

51
00:02:38,733 --> 00:02:41,733
because our classifier
was fitted too closely to the training set.

52
00:02:41,866 --> 00:02:45,466
And this red rectangle
actually doesn't make any sense here

53
00:02:45,466 --> 00:02:49,733
because indeed we don't have any red user
in this rectangle region here.

54
00:02:49,733 --> 00:02:50,500
Well, it's

55
00:02:50,500 --> 00:02:53,500
not that it doesn't make any sense,
but it's totally useless here.

56
00:02:54,300 --> 00:02:57,900
And besides, you know, we have
this green point here and this green point

57
00:02:57,900 --> 00:03:01,800
could have been in this region here,
which would have made an incorrect prediction.

58
00:03:01,800 --> 00:03:02,933
We were lucky on this one,

59
00:03:02,933 --> 00:03:06,300
but this could have happened
because these are new observations.

60
00:03:06,466 --> 00:03:08,600
And our random forest
classification machine learning

61
00:03:08,600 --> 00:03:12,366
model didn't learn anything
from these new observation points.

62
00:03:12,500 --> 00:03:15,500
So this guy could totally
have ended up here.

63
00:03:16,033 --> 00:03:17,700
So lucky on this one.

64
00:03:17,700 --> 00:03:21,566
And by the way, same for this region here:
we don't have any red user,

65
00:03:21,566 --> 00:03:24,233
that is, a user who didn't buy
the SUV, in this red region.

66
00:03:24,233 --> 00:03:27,233
So this red region
is totally useless as well.

67
00:03:27,700 --> 00:03:28,800
Okay so that's the idea.

68
00:03:28,800 --> 00:03:31,666
But overall it did a pretty good job
because of course it got

69
00:03:31,666 --> 00:03:34,933
most of the red users here with a low age
and low

70
00:03:34,933 --> 00:03:38,033
estimated salary, and therefore users
who didn't buy the SUV.

71
00:03:38,466 --> 00:03:41,533
And most of the green users
who are quite old with a higher

72
00:03:41,533 --> 00:03:45,766
estimated salary,
who bought this awesome, cheap luxury SUV.

73
00:03:46,633 --> 00:03:48,600
Okay, and now
what is the conclusion of all this?

74
00:03:48,600 --> 00:03:49,600
Because we reached

75
00:03:49,600 --> 00:03:53,800
the end of our classification adventure,
we built all our classifiers.

76
00:03:53,800 --> 00:03:54,866
So according to you,

77
00:03:54,866 --> 00:03:58,800
what is the best classifier
for this particular business problem?

78
00:03:58,800 --> 00:04:00,366
What is the best one?

79
00:04:00,366 --> 00:04:04,166
It should be a classifier that correctly
classified the users who didn't buy

80
00:04:04,166 --> 00:04:08,100
the SUV, and the users who bought the SUV,
and at the same time

81
00:04:08,300 --> 00:04:11,500
prevented overfitting on the training
set, to be able

82
00:04:11,500 --> 00:04:14,666
to make good predictions
on new observations.

83
00:04:15,200 --> 00:04:18,700
So in my opinion,
the best classifier would be the kernel

84
00:04:18,700 --> 00:04:22,000
SVM in terms of the balance
between the percentage

85
00:04:22,000 --> 00:04:25,500
of incorrect predictions and the fact
that we want to prevent overfitting.

86
00:04:25,866 --> 00:04:28,566
Well,
if we look at them again, in my opinion

87
00:04:28,566 --> 00:04:31,400
the kernel SVM classifier
would be the best one.

88
00:04:31,400 --> 00:04:33,533
All right.
So that's the end of this tutorial.

89
00:04:33,533 --> 00:04:36,200
And now I have to say congratulations,

90
00:04:36,200 --> 00:04:40,400
because you built a great deal
of classifiers from simple classifiers

91
00:04:40,400 --> 00:04:44,800
with logistic regression
to more sophisticated and more complex

92
00:04:45,000 --> 00:04:49,033
classifiers like kernel
SVM or random forest classifiers.

93
00:04:49,433 --> 00:04:51,000
But that's not the end of the journey.

94
00:04:51,000 --> 00:04:54,333
In the next section,
we will be talking about how to evaluate

95
00:04:54,566 --> 00:04:58,500
the performance of our models
and how we can improve them.

96
00:04:58,700 --> 00:05:02,400
And then eventually we will have
a homework on a real life data set,

97
00:05:02,700 --> 00:05:06,433
where we will combine what we learned here
about how to build some classifiers,

98
00:05:06,566 --> 00:05:08,600
and all the next concepts
that we will learn

99
00:05:08,600 --> 00:05:12,300
to evaluate the model performance
in order to find the best model

100
00:05:12,566 --> 00:05:16,266
for this real life business problem data
set that you will be given and we will

101
00:05:16,266 --> 00:05:20,866
do the job as a data scientist or machine
learning scientist would do in reality.

102
00:05:21,133 --> 00:05:22,666
So congratulations again.

103
00:05:22,666 --> 00:05:24,633
I look forward to seeing you
in the next section.

104
00:05:24,633 --> 00:05:26,433
And until then, enjoy machine learning.