1
00:00:00,066 --> 00:00:01,500
Now, the other very

2
00:00:01,500 --> 00:00:05,366
important thing to understand
is that these are two prediction regions

3
00:00:05,633 --> 00:00:08,866
separated by a straight line,
which is this straight line

4
00:00:08,866 --> 00:00:10,066
here.

5
00:00:10,066 --> 00:00:13,066
And the straight line is called
the prediction boundary,

6
00:00:13,200 --> 00:00:16,200
because it's the boundary
between the two prediction regions.

7
00:00:16,866 --> 00:00:20,100
And the fact that it's a straight
line is not random.

8
00:00:20,633 --> 00:00:22,033
It is for a particular reason.

9
00:00:22,033 --> 00:00:24,466
And that's a very important
thing to understand,

10
00:00:24,466 --> 00:00:27,766
because that's the essence
of logistic regression.

11
00:00:28,500 --> 00:00:31,066
If the prediction boundary
is a straight line here,

12
00:00:31,066 --> 00:00:35,966
that's because our logistic regression
classifier is a linear classifier.

13
00:00:36,566 --> 00:00:39,266
That means that here,
since we are in two dimensions, you know,

14
00:00:39,266 --> 00:00:42,366
because we have two independent variables,
the age and the estimated salary.

15
00:00:42,366 --> 00:00:43,866
So we are in two dimensions.

16
00:00:43,866 --> 00:00:47,800
Then since the logistic regression
classifier is a linear classifier,

17
00:00:48,300 --> 00:00:52,900
then the prediction boundary separator
here can only be a straight line.

18
00:00:53,200 --> 00:00:55,966
If we were in three dimensions
then it would be

19
00:00:55,966 --> 00:00:58,966
a plane separating two spaces.

20
00:00:59,000 --> 00:01:01,233
But here in two dimensions
it's a straight line

21
00:01:01,233 --> 00:01:03,000
and it will always be a straight line

22
00:01:03,000 --> 00:01:06,933
if your classifier is a linear classifier.
But you will see later

23
00:01:06,933 --> 00:01:10,166
that when we build nonlinear classifiers,

24
00:01:10,300 --> 00:01:14,500
then the prediction boundary separator
won't be a straight line anymore.

25
00:01:14,766 --> 00:01:17,933
I won't tell you more right now
and I will let you wait for the surprise.

26
00:01:18,533 --> 00:01:22,966
So here we can clearly see that our 
logistic regression classifier manages

27
00:01:22,966 --> 00:01:27,866
to catch most of the users who didn't buy
the SUV in the red region here,

28
00:01:28,300 --> 00:01:32,300
and most of the users who bought the SUV
in the green region here.

29
00:01:32,566 --> 00:01:34,866
So it actually did a pretty good job.

30
00:01:34,866 --> 00:01:39,033
However, it seems to have trouble
catching some green users here

31
00:01:39,033 --> 00:01:43,066
who, in spite of their low salary,
bought the luxury SUV,

32
00:01:43,533 --> 00:01:47,700
as well as those other green users here
who also bought the luxury SUV

33
00:01:48,266 --> 00:01:49,266
because as you can see,

34
00:01:49,266 --> 00:01:53,200
these green points here
and those here are in the red region,

35
00:01:53,500 --> 00:01:56,400
which is the region
where our classifier predicts

36
00:01:56,400 --> 00:01:59,400
that the users don't buy the SUV.

37
00:01:59,433 --> 00:02:02,666
And those incorrect predictions
are due specifically

38
00:02:02,666 --> 00:02:06,300
to the fact that our classifier
is a linear classifier

39
00:02:06,300 --> 00:02:09,900
and because our users
are not linearly separable.

40
00:02:10,266 --> 00:02:13,966
If they were linearly separable,
then we would have all the green points

41
00:02:13,966 --> 00:02:17,266
here in this space
and all the red points here in this space.

42
00:02:17,466 --> 00:02:19,800
And then the linear classifier
with a straight line could

43
00:02:19,800 --> 00:02:23,400
perfectly separate all the red points
here, and all the green points here.

44
00:02:23,833 --> 00:02:28,266
But here we have some rebellious points
that are not in the right regions.

45
00:02:28,533 --> 00:02:32,200
And because our classifier has a
straight-line separator,

46
00:02:32,200 --> 00:02:36,333
that's why it has trouble
catching those users here and those here.

47
00:02:36,566 --> 00:02:40,366
You can clearly see that
even if you try to rotate this straight

48
00:02:40,366 --> 00:02:44,833
line here, well, you will always have
some green points in the wrong category.

49
00:02:45,166 --> 00:02:50,166
For example, if we try to rotate here
this way, like putting it down, well

50
00:02:50,166 --> 00:02:53,733
okay, we will catch these green points
here in the right green region.

51
00:02:54,033 --> 00:02:59,233
But since we rotated it down,
we will lose more green users here

52
00:02:59,233 --> 00:03:04,200
because this will go up and more green
users here will be in the red region.

53
00:03:04,600 --> 00:03:07,033
So that's the best separator

54
00:03:07,033 --> 00:03:09,366
the logistic regression
classifier could find.

55
00:03:09,366 --> 00:03:10,766
And it couldn't do better

56
00:03:10,766 --> 00:03:14,366
because it can only be a straight line
separating these two regions.

57
00:03:14,966 --> 00:03:18,000
Because to catch those users,
the green users here and the green users

58
00:03:18,000 --> 00:03:21,133
here in the right category,
that is, the green region,

59
00:03:21,133 --> 00:03:25,433
we need to make some kind of a curve here
to, you know, classify

60
00:03:25,433 --> 00:03:29,433
correctly those green users here and here
and place them in the green region.

61
00:03:29,600 --> 00:03:33,866
And that would prevent our classifier from
making these incorrect predictions here,

62
00:03:34,000 --> 00:03:36,333
which happen because it is a straight line.
With a curve

63
00:03:36,333 --> 00:03:39,833
here, we would probably catch all the red users
in the red region

64
00:03:40,033 --> 00:03:42,566
and all the green users
in the green region.

65
00:03:42,566 --> 00:03:45,100
So that would make an awesome classifier.

66
00:03:45,100 --> 00:03:45,766
And you will see

67
00:03:45,766 --> 00:03:49,633
how our nonlinear classifiers
will do a terrific job of this.

68
00:03:49,866 --> 00:03:51,066
I can't wait to show you this.

69
00:03:52,066 --> 00:03:52,500
Okay.

70
00:03:52,500 --> 00:03:55,700
And now, finally, the last very important
thing to understand is that

71
00:03:56,266 --> 00:03:58,333
this is the training set.

72
00:03:58,333 --> 00:03:59,300
This is a training set.

73
00:03:59,300 --> 00:04:00,333
So that means that

74
00:04:00,333 --> 00:04:04,600
our classifier learns how to classify
based on this information here.

75
00:04:04,833 --> 00:04:08,100
So I will hold my breath
for a few more seconds until I find out

76
00:04:08,100 --> 00:04:12,100
if our logistic regression classifier
can manage to make good predictions

77
00:04:12,100 --> 00:04:16,366
of new observations, that is, to classify
new users into the right regions,

78
00:04:16,633 --> 00:04:20,600
which, by the way, are fixed regions here,
because these are the regions

79
00:04:20,600 --> 00:04:24,566
generated by the learning experience
of our logistic regression classifier,

80
00:04:24,900 --> 00:04:28,200
and therefore won't change
if we look at some new observations,

81
00:04:28,200 --> 00:04:31,033
that is, new social network users.

82
00:04:31,033 --> 00:04:34,033
And that's what we are about to find out
on the test set.

83
00:04:34,166 --> 00:04:35,500
So hold on.

84
00:04:35,500 --> 00:04:37,500
So it's very simple.

85
00:04:37,500 --> 00:04:40,933
We're just going to copy
all this code section here.

86
00:04:42,900 --> 00:04:44,100
Paste it here.

87
00:04:44,100 --> 00:04:46,900
And I'm just going to replace the training
set here

88
00:04:46,900 --> 00:04:49,466
with the test set.

89
00:04:49,466 --> 00:04:52,833
Same here:
I replace the training set with the test set.

90
00:04:52,833 --> 00:04:54,233
And that's all.

91
00:04:54,233 --> 00:04:56,566
That's all because I structured the code
in such a way

92
00:04:56,566 --> 00:04:59,900
that we only need to change
the training set into the test set here

93
00:05:00,200 --> 00:05:03,200
to plot this graph on a specific set.

94
00:05:03,366 --> 00:05:08,200
However, let's change the title here
because we want to specify

95
00:05:08,200 --> 00:05:11,200
that it's the test set. And it's ready.

96
00:05:11,266 --> 00:05:14,333
So let's select this and execute.

97
00:05:16,800 --> 00:05:19,800
Let's see what happens.

98
00:05:19,966 --> 00:05:22,966
And here are the results of the test set.

99
00:05:23,400 --> 00:05:24,433
So that's not too bad.

100
00:05:24,433 --> 00:05:26,666
That's not too bad, because as we can see,

101
00:05:26,666 --> 00:05:30,033
the majority of red points
are in the right region,

102
00:05:30,300 --> 00:05:34,466
that is, the region predicted to be zero,
and the majority of green points

103
00:05:34,466 --> 00:05:35,966
are in the right region.

104
00:05:37,166 --> 00:05:37,666
As for the

105
00:05:37,666 --> 00:05:41,400
training set, there are some observations
that were incorrectly predicted.

106
00:05:41,633 --> 00:05:42,366
That's normal.

107
00:05:42,366 --> 00:05:46,166
That's because it's a linear classifier
and it cannot make a curve here

108
00:05:46,200 --> 00:05:49,200
catching all the right guys.

109
00:05:49,333 --> 00:05:52,266
All right, so that's it
for the interpretation of the graph.

110
00:05:52,266 --> 00:05:55,566
I can't wait to show you
how we can make more powerful classifiers.

111
00:05:55,766 --> 00:05:58,866
And of course these
are going to be nonlinear classifiers.