1 00:00:00,066 --> 00:00:01,800 Okay, so let's input the arguments. 2 00:00:01,800 --> 00:00:03,200 As you remember, the first argument 3 00:00:03,200 --> 00:00:06,600 was the matrix of features, the matrix of independent variables. 4 00:00:07,033 --> 00:00:13,500 And that is training_set excluding the last column, which has index three. 5 00:00:13,666 --> 00:00:15,200 Because, as you remember, in the training set 6 00:00:15,200 --> 00:00:18,600 we have the first two columns, which are the independent variables 7 00:00:19,033 --> 00:00:22,733 age and estimated salary, with therefore indexes one and two. 8 00:00:23,000 --> 00:00:25,200 And we have the third column, indexed by three, 9 00:00:25,200 --> 00:00:28,200 which is our dependent variable vector, Purchased. 10 00:00:28,333 --> 00:00:29,966 So here, minus three. 11 00:00:29,966 --> 00:00:31,166 Then what was the next argument? 12 00:00:31,166 --> 00:00:34,200 The next argument was y, the dependent variable vector. 13 00:00:34,533 --> 00:00:37,533 And then here we'll take training_set. 14 00:00:37,633 --> 00:00:40,833 And let's pick it this way, to specify the name of the dependent variable: 15 00:00:40,833 --> 00:00:43,833 dollar sign here, and Purchased. 16 00:00:43,866 --> 00:00:46,866 Purchased is the name of our dependent variable column. 17 00:00:47,233 --> 00:00:49,600 All right. So we almost have everything we need. 18 00:00:49,600 --> 00:00:52,500 The last thing we need now is of course the number of trees. 19 00:00:52,500 --> 00:00:56,433 And that is ntree equals ten. 20 00:00:57,000 --> 00:00:58,800 You can play around with the ntree argument. 21 00:00:58,800 --> 00:01:01,566 You can choose many more trees in the forest. 22 00:01:01,566 --> 00:01:03,300 You'll observe some interesting results.
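For reference, the call assembled above might look like the following minimal sketch. It assumes the randomForest package is installed and that `training_set` has the three columns described (two features, then `Purchased` as a factor). The data here is a synthetic stand-in generated only so the snippet runs on its own; it is not the course dataset.

```r
# install.packages("randomForest")  # one-time install, if needed
library(randomForest)

# Synthetic stand-in for the scaled training set (assumption: the real
# data has the same three-column layout, with Purchased in column 3).
set.seed(123)
n <- 300
training_set <- data.frame(Age = rnorm(n), EstimatedSalary = rnorm(n))
training_set$Purchased <- factor(
  as.integer(training_set$Age + training_set$EstimatedSalary > 0),
  levels = c(0, 1)
)

# x: every column except the third; y: the Purchased column; ntree: 10 trees.
classifier <- randomForest(x = training_set[-3],
                           y = training_set$Purchased,
                           ntree = 10)
```

Passing `y` as a factor is what makes randomForest build a classifier rather than a regression forest.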
23 00:01:03,300 --> 00:01:07,066 It's interesting to see what different teams of trees can do 24 00:01:07,066 --> 00:01:10,900 to predict the response of your users in the social network, 25 00:01:10,900 --> 00:01:13,333 whether they buy, yes or no, the SUV. 26 00:01:13,333 --> 00:01:17,100 But if you do this, make sure to pay attention to overfitting, 27 00:01:17,100 --> 00:01:18,900 which you want to avoid. 28 00:01:18,900 --> 00:01:22,666 You don't want to overfit the random forest classifier to the training set, 29 00:01:22,866 --> 00:01:27,066 because if you do this, then it might make some poor predictions on a new set. 30 00:01:27,466 --> 00:01:30,266 You can actually check it out with the test set, but here we'll 31 00:01:30,266 --> 00:01:33,266 choose ten trees and we'll see what happens. 32 00:01:33,633 --> 00:01:34,000 All right. 33 00:01:34,000 --> 00:01:36,566 So actually we're done with the template. 34 00:01:36,566 --> 00:01:38,966 We changed everything we had to change. 35 00:01:38,966 --> 00:01:40,300 And now we can just, 36 00:01:40,300 --> 00:01:44,433 you know, select everything and execute to make everything ready. 37 00:01:44,500 --> 00:01:48,433 You can actually take some coffee or tea and you can just select 38 00:01:48,433 --> 00:01:51,433 everything and execute to watch the results. 39 00:01:51,433 --> 00:01:53,366 But let's rather do it step by step. 40 00:01:53,366 --> 00:01:57,866 We'll just do the first pre-processing step all at once here. 41 00:01:57,866 --> 00:02:00,300 So I just selected the pre-processing phase. 42 00:02:00,300 --> 00:02:01,800 And now I'll press Command or Control 43 00:02:01,800 --> 00:02:03,600 plus Enter to execute. 44 00:02:03,600 --> 00:02:04,466 All right, all good. 45 00:02:04,466 --> 00:02:08,666 We have our data set, our training set and our test set. 46 00:02:09,000 --> 00:02:10,166 So everything looks fine.
47 00:02:10,166 --> 00:02:14,500 We have 400 observations in total, 300 observations 48 00:02:14,500 --> 00:02:19,100 that went into the training set and 100 observations that went into the test set. 49 00:02:19,633 --> 00:02:23,033 As you can see, the training set and the test set are scaled 50 00:02:23,366 --> 00:02:27,766 because in the end, we are plotting some graphic results with a resolution of 0.01. 51 00:02:27,900 --> 00:02:33,533 So in order for our code to execute faster and not actually break, 52 00:02:33,600 --> 00:02:36,766 we need to apply feature scaling to our training set and our test set. 53 00:02:37,300 --> 00:02:39,966 Otherwise, we wouldn't need to do that, because random 54 00:02:39,966 --> 00:02:43,400 forest classification is not based on Euclidean distances, 55 00:02:43,900 --> 00:02:47,333 but it's based on, you know, conditions on the independent variables. 56 00:02:47,700 --> 00:02:51,300 But because of this code here that is compute intensive, 57 00:02:51,533 --> 00:02:54,766 we need to apply feature scaling so that everything is well executed. 58 00:02:55,466 --> 00:02:56,800 All right. So let's do this. 59 00:02:56,800 --> 00:02:58,333 Let's watch the results. 60 00:02:58,333 --> 00:03:02,466 We just need to create our classifier here by executing this section. 61 00:03:02,766 --> 00:03:05,066 So here I'll just do this. 62 00:03:05,066 --> 00:03:06,300 All right, okay. 63 00:03:06,300 --> 00:03:08,633 Now let's predict the test set results. 64 00:03:08,633 --> 00:03:12,300 Then we have the confusion matrix, which will tell us in a flash 65 00:03:12,300 --> 00:03:14,366 how many incorrect predictions we have. 66 00:03:14,366 --> 00:03:16,433 So let's actually do it directly. 67 00:03:16,433 --> 00:03:19,866 It will be faster to see how well our random forest classifier did 68 00:03:19,866 --> 00:03:21,400 on the predictions. 69 00:03:21,400 --> 00:03:23,600 So let's execute this.
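The predict-then-confusion-matrix step described above can be sketched as follows. The names `y_pred` and `cm` follow the transcript; the `make_set` helper and the synthetic data are assumptions added only so the snippet runs by itself, so the counts it produces are not the ones seen in the video.

```r
library(randomForest)

# Hypothetical helper: builds a synthetic stand-in with the same layout
# as the course data (two scaled features, Purchased as a factor).
set.seed(123)
make_set <- function(n) {
  d <- data.frame(Age = rnorm(n), EstimatedSalary = rnorm(n))
  d$Purchased <- factor(as.integer(d$Age + d$EstimatedSalary > 0),
                        levels = c(0, 1))
  d
}
training_set <- make_set(300)
test_set <- make_set(100)

classifier <- randomForest(x = training_set[-3],
                           y = training_set$Purchased,
                           ntree = 10)

# Predict on the test features only (drop the third column).
y_pred <- predict(classifier, newdata = test_set[-3])

# Rows: actual classes, columns: predicted classes.
cm <- table(test_set[, 3], y_pred)

# Incorrect predictions are the off-diagonal cells.
incorrect <- sum(cm) - sum(diag(cm))
print(cm)
print(incorrect)
```

Typing `cm` in the console prints the same 2 by 2 table; the two off-diagonal cells are the numbers that get added up to count the mistakes.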
70 00:03:23,600 --> 00:03:26,066 And now let's enter 71 00:03:26,066 --> 00:03:29,233 cm here in the console and press Enter. 72 00:03:29,833 --> 00:03:32,233 And here we have our confusion matrix. 73 00:03:32,233 --> 00:03:36,300 Okay, we have seven plus ten equals 17 incorrect predictions. 74 00:03:36,600 --> 00:03:37,866 Well, that's not too bad. 75 00:03:37,866 --> 00:03:41,300 Just for fun, let's just pick another number of trees. 76 00:03:41,300 --> 00:03:44,300 Like, for example, let's pick 500 trees. 77 00:03:44,533 --> 00:03:45,766 500 trees is a lot. 78 00:03:45,766 --> 00:03:49,500 That's really a big army of trees to make some predictions. 79 00:03:49,900 --> 00:03:52,466 And now, just for fun, let's take this. 80 00:03:52,466 --> 00:03:55,166 I don't need to include this because my library was 81 00:03:55,166 --> 00:03:58,300 already loaded from the previous execution of this code section here. 82 00:03:58,300 --> 00:04:02,566 So let's rebuild a new classifier with 500 trees. 83 00:04:02,566 --> 00:04:05,600 And now let's look at the confusion matrix. 84 00:04:05,866 --> 00:04:08,400 But before that, let's rebuild our vector of predictions. 85 00:04:08,400 --> 00:04:09,333 Because right now 86 00:04:09,333 --> 00:04:13,300 the y_pred vector of predictions is the one given by the random forest with ten trees. 87 00:04:13,933 --> 00:04:15,900 So let's re-execute this. All right. 88 00:04:15,900 --> 00:04:19,966 Now we have y_pred as the vector of predictions 89 00:04:19,966 --> 00:04:22,966 predicted by the random forest with 500 trees. 90 00:04:23,100 --> 00:04:25,133 And now let's look at the confusion matrix. 91 00:04:25,133 --> 00:04:28,800 Remember, with ten trees we had 17 incorrect predictions. 92 00:04:28,800 --> 00:04:29,966 And now let's see. 93 00:04:29,966 --> 00:04:33,233 Select, execute, cm, Enter. 94 00:04:33,600 --> 00:04:36,600 And now we have 15 incorrect predictions. 95 00:04:36,933 --> 00:04:37,500 Great.
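The ten-versus-500 comparison above can be sketched as a small function that refits the forest, rebuilds the predictions, and only then counts the mistakes, so the confusion matrix never reflects a stale `y_pred`. The helper names `make_set` and `count_incorrect` are hypothetical, and the data is synthetic, so the counts will differ from the 17 and 15 seen in the video.

```r
library(randomForest)

# Hypothetical helper: synthetic stand-in for the course data.
set.seed(123)
make_set <- function(n) {
  d <- data.frame(Age = rnorm(n), EstimatedSalary = rnorm(n))
  d$Purchased <- factor(as.integer(d$Age + d$EstimatedSalary > 0),
                        levels = c(0, 1))
  d
}
training_set <- make_set(300)
test_set <- make_set(100)

# Refit, re-predict, and count off-diagonal cells for a given forest size.
count_incorrect <- function(n_trees) {
  classifier <- randomForest(x = training_set[-3],
                             y = training_set$Purchased,
                             ntree = n_trees)
  y_pred <- predict(classifier, newdata = test_set[-3])
  cm <- table(test_set[, 3], y_pred)
  sum(cm) - sum(diag(cm))
}

errors <- sapply(c(10, 500), count_incorrect)
names(errors) <- c("ntree = 10", "ntree = 500")
print(errors)
```

Wrapping the three steps in one function is just a convenience here; running the sections by hand in the same order gives the same effect.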
96 00:04:37,500 --> 00:04:41,200 We invested 490 more trees to win two correct predictions. 97 00:04:41,566 --> 00:04:45,633 So definitely that means that there are a lot of not very useful trees in the team. 98 00:04:46,066 --> 00:04:46,366 Okay. 99 00:04:46,366 --> 00:04:50,100 So, as you wish, let's maybe go back to ten trees here, 100 00:04:50,100 --> 00:04:54,333 because obviously 500 trees is not very useful. 101 00:04:55,200 --> 00:04:55,500 All right. 102 00:04:55,500 --> 00:04:58,466 So I'll just take that again. 103 00:04:58,466 --> 00:05:01,600 That as well, and that as well. 104 00:05:01,866 --> 00:05:02,700 All right. 105 00:05:02,700 --> 00:05:05,066 And now let's look at the training set results. 106 00:05:05,066 --> 00:05:06,800 So everything is fine here. 107 00:05:06,800 --> 00:05:10,033 We changed the title here to Random Forest Classification. 108 00:05:10,033 --> 00:05:11,233 So it's all good. 109 00:05:11,233 --> 00:05:14,166 We are ready to look at the graphic result. 110 00:05:14,166 --> 00:05:17,533 And by the way, you can pause the video now and try to guess what 111 00:05:17,533 --> 00:05:18,966 you're about to see. 112 00:05:18,966 --> 00:05:23,066 Try to guess the shape of the prediction regions and of the prediction boundary. 113 00:05:23,300 --> 00:05:25,333 If you understood decision trees correctly, 114 00:05:25,333 --> 00:05:28,466 then you will have no problem guessing what's about to happen. 115 00:05:28,933 --> 00:05:29,266 All right. 116 00:05:29,266 --> 00:05:32,433 So I'm going to execute right now: Command or Control 117 00:05:32,433 --> 00:05:35,433 plus Enter to execute, and showtime. 118 00:05:35,766 --> 00:05:38,266 All right. Wow. That's quite something here. 119 00:05:38,266 --> 00:05:41,700 So the points here are the real observation points. 120 00:05:41,700 --> 00:05:44,300 These are the real users of the social network.
121 00:05:44,300 --> 00:05:47,066 And then we have the regions here, the red region and the green region, 122 00:05:47,066 --> 00:05:48,600 which are the prediction regions. 123 00:05:48,600 --> 00:05:51,700 The red region is the region where our random forest classifier 124 00:05:51,700 --> 00:05:55,200 predicts that the user doesn't buy the SUV, and the green region is 125 00:05:55,200 --> 00:05:58,466 where our random forest classifier predicts that the user buys the SUV.
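A plot of this kind can be produced roughly as sketched below: build a dense grid over the two scaled features at the 0.01 resolution mentioned earlier, predict a class for every grid point, color the grid by prediction, and draw the real observations on top. The grid variable names, the colors, and the synthetic data are assumptions for illustration, not necessarily what the course template uses.

```r
library(randomForest)

# Synthetic stand-in for the scaled training set (assumption).
set.seed(123)
n <- 300
training_set <- data.frame(Age = rnorm(n), EstimatedSalary = rnorm(n))
training_set$Purchased <- factor(
  as.integer(training_set$Age + training_set$EstimatedSalary > 0),
  levels = c(0, 1)
)
classifier <- randomForest(x = training_set[-3],
                           y = training_set$Purchased,
                           ntree = 10)

# Dense grid over both features; scaling keeps this grid small enough
# to predict on quickly at a step of 0.01.
set <- training_set
X1 <- seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 <- seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set <- expand.grid(X1, X2)
colnames(grid_set) <- c("Age", "EstimatedSalary")  # must match training names
y_grid <- predict(classifier, newdata = grid_set)

# Prediction regions first, then the real observations on top.
plot(set[, -3],
     main = "Random Forest Classification (Training set)",
     xlab = "Age", ylab = "Estimated Salary",
     xlim = range(X1), ylim = range(X2))
points(grid_set, pch = ".",
       col = ifelse(y_grid == 1, "springgreen3", "tomato"))
points(set[, -3], pch = 21,
       bg = ifelse(set[, 3] == 1, "green4", "red3"))
```

Because each tree splits on one feature at a time, the colored regions come out as unions of axis-aligned rectangles rather than a smooth curve.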