1
00:00:00,700 --> 00:00:02,200
All right, so that's done.

2
00:00:02,200 --> 00:00:04,766
That's
what we had to input for train here.

3
00:00:04,766 --> 00:00:10,266
So train is your training set, but
without the dependent variable. Then a comma.

4
00:00:10,266 --> 00:00:12,233
And then let's add the second argument.

5
00:00:12,233 --> 00:00:14,300
So the second argument is test.

6
00:00:14,300 --> 00:00:15,700
So you can guess what it is.

7
00:00:15,700 --> 00:00:19,466
It's going to be the same: test
equals test set, of course.

8
00:00:20,333 --> 00:00:20,633
All right.

9
00:00:20,633 --> 00:00:22,966
So test set, and then, same as for the training set:

10
00:00:22,966 --> 00:00:24,133
We're going to remove

11
00:00:24,133 --> 00:00:28,500
the dependent variable, because anyway
we are not supposed to know the results.

12
00:00:28,500 --> 00:00:31,066
We want to predict the observations
of the test set.

13
00:00:31,066 --> 00:00:32,800
So anyway we need to remove it.

14
00:00:32,800 --> 00:00:35,766
So a comma to take all the lines, and minus

15
00:00:35,766 --> 00:00:38,766
three to remove the last column

16
00:00:39,233 --> 00:00:39,566
okay.

17
00:00:39,566 --> 00:00:42,066
So we have our training set
and our test set.

18
00:00:42,066 --> 00:00:44,066
And now, what is the next parameter? Okay.

19
00:00:44,066 --> 00:00:48,333
The next parameter is cl: factor of true
classifications of training set.

20
00:00:48,933 --> 00:00:51,266
So can you guess what it's going to be?

21
00:00:52,600 --> 00:00:52,866
okay.

22
00:00:52,866 --> 00:00:57,033
Let's see: cl equals... In your opinion,
what is it going to be?

23
00:00:57,733 --> 00:01:00,900
Well, you know, to train a classifier,

24
00:01:00,900 --> 00:01:04,166
the classifier needs to have, okay,
the independent variables.

25
00:01:04,466 --> 00:01:05,900
But it also needs to have

26
00:01:05,900 --> 00:01:10,000
the dependent variable
because it needs to have the results to,

27
00:01:10,366 --> 00:01:15,233
you know, find the correlations between
the information in the independent

28
00:01:15,233 --> 00:01:18,233
variables and the information contained
in the dependent variable.

29
00:01:18,333 --> 00:01:22,100
So here, since we only have the info
about the independent variables,

30
00:01:22,200 --> 00:01:25,833
we also need to include somewhere
the info of the dependent variable.

31
00:01:26,133 --> 00:01:27,500
And that's what we add here.

32
00:01:27,500 --> 00:01:31,300
That's cl, so the factor of true
classifications of the training set.

33
00:01:31,300 --> 00:01:33,933
That is the categorical
dependent variable.

34
00:01:33,933 --> 00:01:34,800
So let's do this.

35
00:01:34,800 --> 00:01:37,333
So to take this vector actually.

36
00:01:37,333 --> 00:01:40,466
So as you can see that's
the last column of the training set.

37
00:01:40,766 --> 00:01:44,933
So it's going to be training set
taking all the lines of the observations.

38
00:01:45,166 --> 00:01:47,066
And then the one, two, three.

39
00:01:47,066 --> 00:01:50,233
So the third index, that of the Purchased column.

40
00:01:50,566 --> 00:01:55,900
So let's take that: training
set, brackets, comma.

41
00:01:56,133 --> 00:02:01,066
And then three because the column we want
is indexed by three.

42
00:02:01,766 --> 00:02:03,466
All right.
So that's for the third argument.

43
00:02:03,466 --> 00:02:07,400
And then we have one more argument
which is the number of neighbors.

44
00:02:07,766 --> 00:02:09,333
So let's add this one.

45
00:02:09,333 --> 00:02:12,666
So remember in Python
we took five neighbors.

46
00:02:12,933 --> 00:02:14,533
That's actually the default parameter.

47
00:02:14,533 --> 00:02:15,666
So here we're going to take the same.

48
00:02:15,666 --> 00:02:20,333
That will allow us to compare the results
we obtained in Python and in R.

49
00:02:20,700 --> 00:02:22,300
So it will be interesting.

50
00:02:22,300 --> 00:02:25,300
So let's take k equals five neighbors.

51
00:02:26,100 --> 00:02:26,566
All right.

52
00:02:26,566 --> 00:02:28,200
And now we have everything we need.

53
00:02:28,200 --> 00:02:30,366
We can select this.

54
00:02:30,366 --> 00:02:31,733
And here it is, y_pred.

55
00:02:31,733 --> 00:02:32,700
All good.

56
00:02:32,700 --> 00:02:35,700
So now let's have a look at y_pred.

57
00:02:35,900 --> 00:02:36,800
We can have a look here.

58
00:02:36,800 --> 00:02:40,600
y_pred: type y_pred
in the console and press Enter to

59
00:02:40,666 --> 00:02:41,700
have a look at it.

60
00:02:41,700 --> 00:02:43,700
And here are all the predictions
for the test set.

61
00:02:43,700 --> 00:02:45,000
So remember the test set.

62
00:02:45,000 --> 00:02:46,633
It contains 100 observations.

63
00:02:46,633 --> 00:02:49,500
So here we have 100 predictions.

64
00:02:49,500 --> 00:02:53,700
They correspond to the same observations
as these guys here.

65
00:02:54,100 --> 00:02:58,566
So for example, let's take
the first observations, so the first users.

66
00:02:58,933 --> 00:03:02,633
So let's take the one, two, three, four, five.

67
00:03:02,633 --> 00:03:07,600
So the first five users:
these first five users didn't buy the SUV

68
00:03:07,600 --> 00:03:11,500
in reality, because the Purchased
variable equals zero here.

69
00:03:11,500 --> 00:03:12,400
And that's the truth.

70
00:03:12,400 --> 00:03:14,900
That's what actually happens in reality.

71
00:03:14,900 --> 00:03:17,333
And what does our prediction say?

72
00:03:17,333 --> 00:03:20,066
One, two, three, four, five: five zeros here.

73
00:03:20,066 --> 00:03:23,900
So correct predictions
for the first five users. Okay, perfect.

74
00:03:24,433 --> 00:03:27,433
Then we have four ones.

75
00:03:27,600 --> 00:03:30,300
The sixth, seventh, eighth

76
00:03:30,300 --> 00:03:33,600
and ninth users actually bought the SUV.

77
00:03:33,900 --> 00:03:35,633
So great for the sixth one.

78
00:03:35,633 --> 00:03:38,466
The seventh one, great, correct prediction.
The eighth one,

79
00:03:38,466 --> 00:03:41,200
correct prediction as well.
But the classifier

80
00:03:41,200 --> 00:03:42,900
made a little mistake here.

81
00:03:42,900 --> 00:03:43,533
But that's fine.

82
00:03:43,533 --> 00:03:46,666
It looks like it's making
correct predictions most of the time.

83
00:03:46,666 --> 00:03:49,500
And then we are going to check that
with the confusion matrix.

84
00:03:49,500 --> 00:03:50,533
That will be faster.

85
00:03:51,566 --> 00:03:52,400
That was just to,

86
00:03:52,400 --> 00:03:55,400
you know, understand,
to explain what y_pred was.

87
00:03:55,400 --> 00:03:56,900
But I think you get it.

88
00:03:56,900 --> 00:03:59,900
So here
we just need to select this and execute.

89
00:04:00,066 --> 00:04:03,000
Here is cm;
we can have a look at it in the console.

90
00:04:03,000 --> 00:04:07,600
cm, and those are the predictions. Okay.

91
00:04:07,600 --> 00:04:12,333
So we have six plus five incorrect
predictions: 11 incorrect predictions.

92
00:04:12,333 --> 00:04:13,966
So that's not too bad.

93
00:04:13,966 --> 00:04:18,433
And now what we are most interested in seeing
is the prediction regions,

94
00:04:18,633 --> 00:04:21,500
how they behave
and especially the prediction boundary to

95
00:04:21,500 --> 00:04:24,966
see if it's going to be a straight line
or something else.

96
00:04:25,333 --> 00:04:29,366
And actually, you're going to see that
K-NN is a nonlinear classifier.

97
00:04:29,366 --> 00:04:31,566
So we will get something different

98
00:04:31,566 --> 00:04:34,566
from what we got with logistic regression.