1
00:00:00,200 --> 00:00:02,866
Hello and welcome to this R tutorial.

2
00:00:02,866 --> 00:00:05,800
So we'll quickly set our folder as working
directory.

3
00:00:05,800 --> 00:00:09,200
Part 3, Classification: Decision
Tree Classification.

4
00:00:09,200 --> 00:00:10,100
And here is the folder.

5
00:00:10,100 --> 00:00:13,200
Make sure that you have the
Social_Network_Ads.csv file.

6
00:00:13,700 --> 00:00:17,233
Then you click on this more button here
to set the folder as working directory.

7
00:00:17,666 --> 00:00:19,633
Then let's quickly take our template.

8
00:00:19,633 --> 00:00:22,633
Select everything from here to the bottom.

9
00:00:23,166 --> 00:00:25,666
Copy and

10
00:00:25,666 --> 00:00:27,033
paste it here.

11
00:00:27,033 --> 00:00:27,600
All right.

12
00:00:27,600 --> 00:00:30,533
And now let's change a few things.

13
00:00:30,533 --> 00:00:33,833
So, just so we don't forget,
let's change the titles in the plots.

14
00:00:34,200 --> 00:00:37,200
So we will replace classifier
by decision tree.

15
00:00:39,033 --> 00:00:42,033
And here as well.

16
00:00:43,733 --> 00:00:44,200
Okay.

17
00:00:44,200 --> 00:00:46,500
And now let's create our classifier.

18
00:00:46,500 --> 00:00:51,600
So in order to create a decision tree
classifier we will use again

19
00:00:51,766 --> 00:00:55,466
the most popular library for that
which is the rpart library.

20
00:00:56,133 --> 00:01:00,466
So now just check to see if you have
the rpart library in your packages.

21
00:01:00,766 --> 00:01:03,700
So for example mine is right here.

22
00:01:03,700 --> 00:01:06,700
It might not be the case for you
if you're starting R for the first time.

23
00:01:06,833 --> 00:01:10,966
So I'll just write this line of code
for those of you who need to install it.

24
00:01:11,400 --> 00:01:13,466
And so as usual it's install

25
00:01:14,833 --> 00:01:15,866
packages.

26
00:01:15,866 --> 00:01:19,966
And then in the parentheses, in quotes,
you input the name of the package,

27
00:01:20,300 --> 00:01:22,866
which is rpart.

28
00:01:22,866 --> 00:01:23,533
All right.

29
00:01:23,533 --> 00:01:26,900
And then to install the package
you need to select this line and execute.

30
00:01:27,266 --> 00:01:30,033
I won't do it right now
because my package is already installed.

31
00:01:30,033 --> 00:01:32,366
So I'll just put that as a comment.

32
00:01:32,366 --> 00:01:37,400
However, we are going to include this
line of code here: library,

33
00:01:37,800 --> 00:01:42,733
and then in parentheses rpart,
to automatically select this library.

34
00:01:42,733 --> 00:01:46,466
Because once this is executed
this will be selected.

35
00:01:46,666 --> 00:01:50,833
As you can see right now it's not selected
but it will be once this is executed

36
00:01:51,833 --> 00:01:52,666
okay.
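
A minimal sketch of the install-and-load step just described. The video keeps the install line as a comment; the requireNamespace check used here is a swapped-in alternative that installs only when the package is missing, so treat it as one possible way to write it.

```r
# Install rpart only if it is not already available.
# (Assumption: an internet connection is available if the install is needed.)
if (!requireNamespace("rpart", quietly = TRUE)) {
  install.packages("rpart")
}

# Attach the package. In RStudio this is what ticks
# the rpart box in the Packages pane.
library(rpart)
```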

37
00:01:52,666 --> 00:01:55,900
And now we are ready
to create our classifier.

38
00:01:55,933 --> 00:01:58,933
So let's do this classifier as usual.

39
00:02:00,266 --> 00:02:03,000
And then we are going to use
actually a function

40
00:02:03,000 --> 00:02:05,700
which has the same name as the library, rpart.

41
00:02:05,700 --> 00:02:07,700
So rpart.

42
00:02:07,700 --> 00:02:10,700
And then in this function
we will input the right parameters.

43
00:02:10,800 --> 00:02:14,966
So well as you can see right now
we can see what those parameters are.

44
00:02:15,066 --> 00:02:19,766
But if you want more info
we can click here and press F1.

45
00:02:19,900 --> 00:02:24,600
And here we just need to click here
to get some info about rpart.

46
00:02:25,133 --> 00:02:25,466
Okay.

47
00:02:25,466 --> 00:02:28,566
So as we can see
the first argument is formula.

48
00:02:28,766 --> 00:02:30,466
And as usual
we're going to write the formula

49
00:02:30,466 --> 00:02:34,066
equals dependent variable tilde dot.

50
00:02:34,233 --> 00:02:35,700
So that's the same as usual.

51
00:02:35,700 --> 00:02:38,700
And then we have the data argument here

52
00:02:38,766 --> 00:02:43,066
which is of course the data on which
you want to train your classifier.

53
00:02:43,200 --> 00:02:45,600
So this data will be the training set.

54
00:02:45,600 --> 00:02:45,933
All right.

55
00:02:45,933 --> 00:02:47,566
So let's input the arguments.

56
00:02:47,566 --> 00:02:50,700
So remember
the first argument was formula

57
00:02:52,066 --> 00:02:53,700
equals

58
00:02:53,700 --> 00:02:54,600
purchased.

59
00:02:54,600 --> 00:02:57,333
That's the dependent variable tilde.

60
00:02:57,333 --> 00:03:00,666
I just press alt n and then a dot

61
00:03:00,900 --> 00:03:03,900
to include all the independent variables.

62
00:03:03,966 --> 00:03:05,200
Then comma.

63
00:03:05,200 --> 00:03:08,533
And then we put the second argument
which, remember, was data.

64
00:03:09,233 --> 00:03:12,133
And we pick our training set.

65
00:03:12,133 --> 00:03:12,900
Perfect.
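
The call built up above can be sketched as follows. The data frame here is a synthetic stand-in invented for illustration (random values with the column names Age, EstimatedSalary and Purchased), not the course data set, so the resulting tree will differ from the one in the video.

```r
library(rpart)

# Hypothetical stand-in for the scaled training set (NOT the course data).
set.seed(123)
n <- 300
training_set <- data.frame(
  Age = rnorm(n),
  EstimatedSalary = rnorm(n)
)
# The dependent variable is a factor so that rpart fits a classification tree.
training_set$Purchased <- factor(
  ifelse(training_set$Age + training_set$EstimatedSalary > 0.5, 1, 0),
  levels = c(0, 1)
)

# formula: dependent variable ~ . (the dot means all other columns)
# data:    the set the classifier is trained on
classifier <- rpart(formula = Purchased ~ ., data = training_set)
print(classifier)
```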

66
00:03:12,900 --> 00:03:15,900
And now let's execute the whole code.

67
00:03:16,366 --> 00:03:20,266
So first we execute this
pre-processing part here as usual.

68
00:03:21,000 --> 00:03:22,933
Done. Perfect.

69
00:03:22,933 --> 00:03:24,866
So we can have a look at the data set.

70
00:03:24,866 --> 00:03:26,666
Data set,
all fine, with our two independent

71
00:03:26,666 --> 00:03:28,600
variables age and estimated salary.

72
00:03:28,600 --> 00:03:31,700
And our dependent variable purchased.
Training set,

73
00:03:31,933 --> 00:03:34,666
all good, and test set, all good.

74
00:03:34,666 --> 00:03:35,500
Okay.

75
00:03:35,500 --> 00:03:38,200
So the training set
and the test set are scaled

76
00:03:38,200 --> 00:03:43,266
because we will plot the prediction
regions with a high resolution.

77
00:03:43,266 --> 00:03:44,166
So we need to scale.

78
00:03:44,166 --> 00:03:47,600
Actually you can try to not scale
the independent variables here.

79
00:03:47,866 --> 00:03:51,500
You know because for decision tree
you don't need to scale your independent

80
00:03:51,500 --> 00:03:55,200
variables because the decision tree
model is not based on Euclidean distance.

81
00:03:55,366 --> 00:03:58,433
But since we want to plot the prediction
regions with a high resolution,

82
00:03:58,566 --> 00:04:02,900
you will see that your code will execute
much, much faster

83
00:04:02,900 --> 00:04:04,233
than if you don't scale it.

84
00:04:04,233 --> 00:04:07,666
Actually, I think that if you don't scale
it, your code might break.

85
00:04:08,233 --> 00:04:11,033
You can try that, but be careful.
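
A rough back-of-the-envelope sketch of why the unscaled plot is so much slower. The grid step of 0.01 and the feature ranges below are assumptions chosen for illustration, not values taken from this video; the point is only how the number of grid points to predict grows with the range of each axis.

```r
# Assumed values, for illustration only.
step <- 0.01

# Number of points in a 2D grid covering the two given ranges with this step.
grid_points <- function(range1, range2, step) {
  (round(range1 / step) + 1) * (round(range2 / step) + 1)
}

# Scaled features: each axis spans roughly 4 units.
scaled <- grid_points(4, 4, step)

# Unscaled features: e.g. age spanning 40 years, salary spanning 100,000.
unscaled <- grid_points(40, 100000, step)

scaled             # about 1.6e5 points
unscaled           # about 4e10 points
unscaled / scaled  # how many times more predictions are needed
```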

86
00:04:11,033 --> 00:04:11,833
So we will do it.

87
00:04:11,833 --> 00:04:17,033
But then we will execute the code again
without the scaling to plot the tree.

88
00:04:17,366 --> 00:04:18,866
So we will clear everything.

89
00:04:18,866 --> 00:04:22,900
And then in the preprocessing part we select
everything except the feature scaling.

90
00:04:23,100 --> 00:04:26,100
And then we will plot our tree
in a very simple way.

91
00:04:26,400 --> 00:04:28,600
But right now
we want to plot the prediction regions.

92
00:04:28,600 --> 00:04:31,200
So we scale our independent
variables. Okay.

93
00:04:31,200 --> 00:04:32,566
So perfect.

94
00:04:32,566 --> 00:04:34,033
Now the classifier is ready.

95
00:04:34,033 --> 00:04:35,933
So let's execute it.

96
00:04:37,433 --> 00:04:38,000
All right.

97
00:04:38,000 --> 00:04:38,833
All good.

98
00:04:38,833 --> 00:04:42,566
Now we can execute this line
to predict the test set results.

99
00:04:42,600 --> 00:04:47,600
And actually what's funny is that y_pred
is not the same as what we were used to.

100
00:04:47,866 --> 00:04:50,033
First, for example,
we can see y_pred here.

101
00:04:50,033 --> 00:04:51,000
Remember, before, y_pred

102
00:04:51,000 --> 00:04:52,233
wasn't in the data here.

103
00:04:52,233 --> 00:04:55,233
We had to type it in the console
to have a look at it.

104
00:04:55,266 --> 00:04:56,300
And here it's here.

105
00:04:56,300 --> 00:04:59,300
So let's click on it
to find out what it is.

106
00:04:59,500 --> 00:04:59,800
All right.

107
00:04:59,800 --> 00:05:05,400
This is y_pred, and this is actually
a matrix of two columns and 100 lines.

108
00:05:05,833 --> 00:05:07,933
So what is this new y_pred?

109
00:05:07,933 --> 00:05:08,833
What is it exactly?

110
00:05:08,833 --> 00:05:13,300
Well,
as you can see, the sum of the two cells

111
00:05:13,300 --> 00:05:16,300
here in each line is equal to one.

112
00:05:16,400 --> 00:05:18,166
So can you guess what it is?

113
00:05:19,133 --> 00:05:20,833
Well these are probabilities.

114
00:05:20,833 --> 00:05:22,766
The first column gives the probability

115
00:05:22,766 --> 00:05:25,800
that the observation,
the user, belongs to class zero,

116
00:05:25,833 --> 00:05:28,300
that is, the user doesn't buy the SUV.

117
00:05:28,300 --> 00:05:30,933
And this probability in the second column

118
00:05:30,933 --> 00:05:34,166
is the probability
that the user buys the SUV.

119
00:05:34,666 --> 00:05:36,333
So here
if we look at the first observation

120
00:05:36,333 --> 00:05:40,333
we can see that there is a very high probability
that the user doesn't buy the SUV.

121
00:05:40,666 --> 00:05:41,700
And so that means here

122
00:05:41,700 --> 00:05:45,200
that the prediction here
is that the user doesn't buy the SUV.

123
00:05:45,666 --> 00:05:47,233
And if we look at the test set here

124
00:05:47,233 --> 00:05:50,700
and look at the index zero,
we can see that indeed in reality

125
00:05:51,100 --> 00:05:53,866
the user didn't buy the SUV and therefore

126
00:05:53,866 --> 00:05:56,866
the prediction is correct.
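
A minimal sketch of the probability matrix described above, again on a synthetic stand-in for the data (invented here, not the course data set). It also shows the type = "class" option of predict for rpart objects, which returns the predicted class directly; whether the video goes on to use that option is not stated in this part.

```r
library(rpart)

# Hypothetical stand-in data: 300 training rows and 100 test rows.
set.seed(123)
n <- 400
dataset <- data.frame(Age = rnorm(n), EstimatedSalary = rnorm(n))
dataset$Purchased <- factor(
  ifelse(dataset$Age + dataset$EstimatedSalary > 0.5, 1, 0),
  levels = c(0, 1)
)
training_set <- dataset[1:300, ]
test_set <- dataset[301:400, ]

classifier <- rpart(formula = Purchased ~ ., data = training_set)

# Default prediction for a classification tree:
# a matrix with one probability column per class (here "0" and "1").
y_prob <- predict(classifier, newdata = test_set[-3])
dim(y_prob)           # 100 rows, 2 columns
head(y_prob)
rowSums(y_prob)[1:5]  # each row sums to 1

# Ask for the predicted class instead of the probabilities.
y_pred <- predict(classifier, newdata = test_set[-3], type = "class")

# Compare predictions with the real outcomes.
table(test_set$Purchased, y_pred)
```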