1 00:00:00,200 --> 00:00:02,866 Hello and welcome to this R tutorial. 2 00:00:02,866 --> 00:00:05,800 So we'll quickly set our folder as working directory. 3 00:00:05,800 --> 00:00:09,200 Part 3, Classification, Decision Tree Classification. 4 00:00:09,200 --> 00:00:10,100 And here is the folder. 5 00:00:10,100 --> 00:00:13,200 Make sure that you have the Social_Network_Ads.csv file. 6 00:00:13,700 --> 00:00:17,233 Then you click on this More button here to set the folder as working directory. 7 00:00:17,666 --> 00:00:19,633 Then let's quickly take our template. 8 00:00:19,633 --> 00:00:22,633 Select everything from here to the bottom. 9 00:00:23,166 --> 00:00:25,666 Copy and 10 00:00:25,666 --> 00:00:27,033 paste it here. 11 00:00:27,033 --> 00:00:27,600 All right. 12 00:00:27,600 --> 00:00:30,533 And now let's change a few things. 13 00:00:30,533 --> 00:00:33,833 So, just not to forget this, change the titles in the plot. 14 00:00:34,200 --> 00:00:37,200 So we will replace "Classifier" by "Decision Tree". 15 00:00:39,033 --> 00:00:42,033 And here as well. 16 00:00:43,733 --> 00:00:44,200 Okay. 17 00:00:44,200 --> 00:00:46,500 And now let's create our classifier. 18 00:00:46,500 --> 00:00:51,600 So in order to create a decision tree classifier we will use again 19 00:00:51,766 --> 00:00:55,466 the most popular library for that, which is the rpart library. 20 00:00:56,133 --> 00:01:00,466 So now just check to see if you have the rpart library in your packages. 21 00:01:00,766 --> 00:01:03,700 So for example mine is right here. 22 00:01:03,700 --> 00:01:06,700 It might not be the case for you if you're starting R for the first time. 23 00:01:06,833 --> 00:01:10,966 So I'll just write this line of code for those of you who need to install it. 24 00:01:11,400 --> 00:01:13,466 And so as usual it's install 25 00:01:14,833 --> 00:01:15,866 packages. 
26 00:01:15,866 --> 00:01:19,966 And then in quotes in the parentheses you input the name of the package, 27 00:01:20,300 --> 00:01:22,866 which is then rpart. 28 00:01:22,866 --> 00:01:23,533 All right. 29 00:01:23,533 --> 00:01:26,900 And then to install the package you need to select this line and execute. 30 00:01:27,266 --> 00:01:30,033 I won't do it right now because my package is already installed. 31 00:01:30,033 --> 00:01:32,366 So I'll just put that as a comment. 32 00:01:32,366 --> 00:01:37,400 However, we are going to include this line of code here: library, 33 00:01:37,800 --> 00:01:42,733 and then in parentheses rpart, to automatically select this library. 34 00:01:42,733 --> 00:01:46,466 Because once this is executed this will be selected. 35 00:01:46,666 --> 00:01:50,833 As you can see right now it's not selected but it will be once this is executed. 36 00:01:51,833 --> 00:01:52,666 Okay. 37 00:01:52,666 --> 00:01:55,900 And now we are ready to create our classifier. 38 00:01:55,933 --> 00:01:58,933 So let's do this classifier as usual. 39 00:02:00,266 --> 00:02:03,000 And then we are going to use actually a function 40 00:02:03,000 --> 00:02:05,700 which has the same name as the library, rpart. 41 00:02:05,700 --> 00:02:07,700 So rpart. 42 00:02:07,700 --> 00:02:10,700 And then in this function we will input the right parameters. 43 00:02:10,800 --> 00:02:14,966 So well as you can see right now we can see what those parameters are. 44 00:02:15,066 --> 00:02:19,766 But if you want more info we can click here and press F1. 45 00:02:19,900 --> 00:02:24,600 And here we just need to click here to get some info about rpart. 46 00:02:25,133 --> 00:02:25,466 Okay. 47 00:02:25,466 --> 00:02:28,566 So as we can see the first argument is formula. 48 00:02:28,766 --> 00:02:30,466 And as usual we're going to write the formula 49 00:02:30,466 --> 00:02:34,066 equals dependent variable tilde dot. 50 00:02:34,233 --> 00:02:35,700 So that's the same as usual. 
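The install-and-load step described above could be sketched roughly as follows. This is only an illustration: the `requireNamespace` guard is an addition not shown in the video (the speaker simply comments out the `install.packages` line), and the column name `Purchased` in the formula is assumed from the tutorial's dataset.

```r
# Install rpart only if it is missing (assumption: a CRAN mirror is reachable).
# In the video this line is simply commented out because the package is
# already installed.
if (!requireNamespace("rpart", quietly = TRUE)) {
  install.packages("rpart")
}

# Load the package so that the rpart() function becomes available.
library(rpart)

# The formula "dependent variable ~ ." means: explain the dependent variable
# with every other column of the data. "Purchased" is an assumed column name.
f <- Purchased ~ .
print(f)
```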
51 00:02:35,700 --> 00:02:38,700 And then we have the data argument here 52 00:02:38,766 --> 00:02:43,066 which is of course the data on which you want to train your classifier. 53 00:02:43,200 --> 00:02:45,600 So this data will be the training set. 54 00:02:45,600 --> 00:02:45,933 All right. 55 00:02:45,933 --> 00:02:47,566 So let's input the arguments. 56 00:02:47,566 --> 00:02:50,700 So remember the first argument was formula 57 00:02:52,066 --> 00:02:53,700 equals 58 00:02:53,700 --> 00:02:54,600 Purchased. 59 00:02:54,600 --> 00:02:57,333 That's the dependent variable. Tilde, 60 00:02:57,333 --> 00:03:00,666 I just press Alt+N, and then a dot 61 00:03:00,900 --> 00:03:03,900 to include all the independent variables. 62 00:03:03,966 --> 00:03:05,200 Then comma. 63 00:03:05,200 --> 00:03:08,533 And then we put the second argument which, remember, was data. 64 00:03:09,233 --> 00:03:12,133 And we pick our training set. 65 00:03:12,133 --> 00:03:12,900 Perfect. 66 00:03:12,900 --> 00:03:15,900 And now let's execute the whole code. 67 00:03:16,366 --> 00:03:20,266 So first we execute this pre-processing part here as usual. 68 00:03:21,000 --> 00:03:22,933 Done. Perfect. 69 00:03:22,933 --> 00:03:24,866 So we can have a look at the dataset. 70 00:03:24,866 --> 00:03:26,666 Dataset, all fine, with our two independent 71 00:03:26,666 --> 00:03:28,600 variables, age and estimated salary, 72 00:03:28,600 --> 00:03:31,700 and our dependent variable, Purchased. Training set, 73 00:03:31,933 --> 00:03:34,666 all good, and test set, all good. 74 00:03:34,666 --> 00:03:35,500 Okay. 75 00:03:35,500 --> 00:03:38,200 So the training set and the test set are scaled 76 00:03:38,200 --> 00:03:43,266 because we will plot the prediction regions with a high resolution. 77 00:03:43,266 --> 00:03:44,166 So we need to scale. 78 00:03:44,166 --> 00:03:47,600 Actually you can try to not scale the independent variables here. 
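As a rough sketch of the classifier creation walked through above, the snippet below fits `rpart` with `formula = Purchased ~ .` on a scaled training set. The data here is a synthetic stand-in generated on the spot, not the tutorial's CSV: the column names `Age`, `EstimatedSalary` and `Purchased`, the value ranges, and the labeling rule are all assumptions made so the example runs on its own.

```r
library(rpart)

set.seed(123)
n <- 300

# Hypothetical stand-in for the training set (not the tutorial's real data).
training_set <- data.frame(
  Age = round(runif(n, 18, 60)),
  EstimatedSalary = round(runif(n, 15000, 150000))
)

# Made-up labeling rule, encoded as a factor so rpart builds a
# classification tree rather than a regression tree.
training_set$Purchased <- factor(
  as.integer(training_set$Age > 42 | training_set$EstimatedSalary > 90000),
  levels = c(0, 1)
)

# Feature scaling of the two independent variables (columns 1 and 2).
training_set[, 1:2] <- scale(training_set[, 1:2])

# formula = dependent variable ~ . (all other columns), data = training set.
classifier <- rpart(formula = Purchased ~ ., data = training_set)
print(classifier)
```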
79 00:03:47,866 --> 00:03:51,500 You know, because for decision trees you don't need to scale your independent 80 00:03:51,500 --> 00:03:55,200 variables, since the decision tree model is not based on Euclidean distances. 81 00:03:55,366 --> 00:03:58,433 But since we want to plot the prediction regions with a high resolution, 82 00:03:58,566 --> 00:04:02,900 you will see that your code will execute much faster 83 00:04:02,900 --> 00:04:04,233 than if you don't scale it. 84 00:04:04,233 --> 00:04:07,666 Actually, I think that if you don't scale it, your code might break. 85 00:04:08,233 --> 00:04:11,033 You can try that, but be careful. 86 00:04:11,033 --> 00:04:11,833 So we will do it. 87 00:04:11,833 --> 00:04:17,033 But then we will execute the code again without the scaling to plot the tree. 88 00:04:17,366 --> 00:04:18,866 So we will clear everything. 89 00:04:18,866 --> 00:04:22,900 And then in the preprocessing part, select everything except the feature scaling. 90 00:04:23,100 --> 00:04:26,100 And then we will plot our tree in a very simple way. 91 00:04:26,400 --> 00:04:28,600 But right now we want to plot the prediction regions. 92 00:04:28,600 --> 00:04:31,200 So we scale our independent variables. Okay. 93 00:04:31,200 --> 00:04:32,566 So perfect. 94 00:04:32,566 --> 00:04:34,033 Now the classifier is ready. 95 00:04:34,033 --> 00:04:35,933 So let's execute it. 96 00:04:37,433 --> 00:04:38,000 All right. 97 00:04:38,000 --> 00:04:38,833 All good. 98 00:04:38,833 --> 00:04:42,566 Now we can execute this line to predict the test set results. 99 00:04:42,600 --> 00:04:47,600 And actually what's funny is that y_pred is not the same as what we were used to. 100 00:04:47,866 --> 00:04:50,033 First, for example, we can see y_pred here. 101 00:04:50,033 --> 00:04:51,000 Remember, before, y_pred 102 00:04:51,000 --> 00:04:52,233 wasn't in the data here. 103 00:04:52,233 --> 00:04:55,233 We had to type it in the console to have a look at it. 
104 00:04:55,266 --> 00:04:56,300 And here it's here. 105 00:04:56,300 --> 00:04:59,300 So let's click on it to find out what it is. 106 00:04:59,500 --> 00:04:59,800 All right. 107 00:04:59,800 --> 00:05:05,400 This is y_pred, and this is actually a matrix of two columns and 100 lines. 108 00:05:05,833 --> 00:05:07,933 So what is this new y_pred? 109 00:05:07,933 --> 00:05:08,833 What is it exactly? 110 00:05:08,833 --> 00:05:13,300 Well, as you can see, the sum of the two cells 111 00:05:13,300 --> 00:05:16,300 here in each line is equal to one. 112 00:05:16,400 --> 00:05:18,166 So can you guess what it is? 113 00:05:19,133 --> 00:05:20,833 Well, these are probabilities. 114 00:05:20,833 --> 00:05:22,766 The first column gives the probability 115 00:05:22,766 --> 00:05:25,800 that the observation, the user, belongs to class zero, 116 00:05:25,833 --> 00:05:28,300 that is, doesn't buy the SUV. 117 00:05:28,300 --> 00:05:30,933 And this probability in the second column 118 00:05:30,933 --> 00:05:34,166 is the probability that the user buys the SUV. 119 00:05:34,666 --> 00:05:36,333 So here if we look at the first observation 120 00:05:36,333 --> 00:05:40,333 we can see that there is a very high probability that the user doesn't buy the SUV. 121 00:05:40,666 --> 00:05:41,700 And so that means here 122 00:05:41,700 --> 00:05:45,200 that the prediction here is that the user doesn't buy the SUV. 123 00:05:45,666 --> 00:05:47,233 And if we look at the test set here 124 00:05:47,233 --> 00:05:50,700 and look at the index zero, we can see that indeed in reality 125 00:05:51,100 --> 00:05:53,866 the user didn't buy the SUV and therefore 126 00:05:53,866 --> 00:05:56,866 the prediction is correct.
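The two-column probability matrix described above can be reproduced with a small, self-contained sketch. The data is again synthetic and hypothetical (the helper `make_data`, the column names and the labeling rule are assumptions, and feature scaling is left out for brevity). The last step uses `type = "class"`, an option of `predict` for rpart objects that returns class labels instead of probabilities; the excerpt above does not reach that step, so treat it as one possible follow-up rather than the video's exact code.

```r
library(rpart)

set.seed(123)

# Hypothetical helper that generates stand-in data (not the tutorial's CSV).
make_data <- function(n) {
  d <- data.frame(
    Age = round(runif(n, 18, 60)),
    EstimatedSalary = round(runif(n, 15000, 150000))
  )
  d$Purchased <- factor(
    as.integer(d$Age > 42 | d$EstimatedSalary > 90000),
    levels = c(0, 1)
  )
  d
}

training_set <- make_data(300)
test_set <- make_data(100)

classifier <- rpart(formula = Purchased ~ ., data = training_set)

# Default prediction for a classification tree: one row per observation,
# one column per class, each cell a class probability.
y_prob <- predict(classifier, newdata = test_set[-3])
print(dim(y_prob))            # 100 rows, 2 columns
print(head(rowSums(y_prob)))  # each row sums to 1

# Asking for class labels directly instead of probabilities.
y_pred <- predict(classifier, newdata = test_set[-3], type = "class")
print(table(actual = test_set$Purchased, predicted = y_pred))
```

The first column of `y_prob` corresponds to class 0 (the user does not buy) and the second to class 1 (the user buys), matching the reading of the matrix given in the transcript.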