1 00:00:00,100 --> 00:00:03,133 And now we actually need to input a third argument. 2 00:00:03,666 --> 00:00:05,066 Can you guess what that is? 3 00:00:05,066 --> 00:00:05,700 For those of you 4 00:00:05,700 --> 00:00:09,600 who followed the Python tutorial, well, you will guess what it's going to be. 5 00:00:09,966 --> 00:00:11,066 It's actually going to be 6 00:00:11,066 --> 00:00:13,566 ntree, the number of trees in the forest. 7 00:00:13,566 --> 00:00:17,733 Well, of course we're building a random forest, so it's actually a lot better 8 00:00:17,733 --> 00:00:21,200 if we can choose the number of trees that we build in our forest. 9 00:00:21,500 --> 00:00:24,433 And it's even better considering the fact that we're going to play around 10 00:00:24,433 --> 00:00:26,100 with different numbers of trees. 11 00:00:26,100 --> 00:00:29,566 That is, we're going to start with ten trees, with a forest of ten trees. 12 00:00:29,900 --> 00:00:33,666 And then, you know, we'll try with a lot more than ten trees, like 100 trees 13 00:00:33,666 --> 00:00:36,666 or 300 trees or 500 trees. 14 00:00:36,766 --> 00:00:40,066 So that's where we're going to input the third argument, ntree. 15 00:00:40,466 --> 00:00:43,033 And we're going to start with ten trees. 16 00:00:43,033 --> 00:00:44,766 All right. So let's start with this. 17 00:00:44,766 --> 00:00:48,233 And that's all the arguments we need to build a random forest. 18 00:00:48,300 --> 00:00:52,333 We only need the independent variables, the dependent variable and the number of 19 00:00:52,333 --> 00:00:52,966 trees. 20 00:00:52,966 --> 00:00:56,866 And that will already make a robust random forest regression model. 21 00:00:56,866 --> 00:01:01,466 And then we will make it even more robust by adding more trees in the forest. 22 00:01:02,133 --> 00:01:05,200 But before we continue, let's set the random factors 23 00:01:05,200 --> 00:01:08,066 to something fixed so that we all get the same results.
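As a rough illustration of the three arguments just described, here is a minimal sketch using the `randomForest` package. The data frame `dataset`, its salary values, and the variable name `regressor` are stand-ins assumed for this example, not the tutorial's actual file.

```r
# Requires the randomForest package: install.packages("randomForest")
library(randomForest)

# Hypothetical stand-in data (assumed values, for illustration only):
# ten position levels and one salary per level.
dataset <- data.frame(
  Level  = 1:10,
  Salary = c(45, 50, 60, 80, 110, 150, 200, 300, 500, 1000) * 1000
)

# x     : data frame holding the independent variable(s)
# y     : vector holding the dependent variable
# ntree : number of trees to grow in the forest
regressor <- randomForest(x = dataset["Level"],
                          y = dataset$Salary,
                          ntree = 10)

print(regressor$ntree)  # the number of trees stored in the fitted object
```

Changing `ntree` to 100, 300 or 500 is the only edit needed to try the larger forests mentioned above.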
24 00:01:08,066 --> 00:01:11,266 So, you know, in Python we used a random_state parameter equal to zero. 25 00:01:11,433 --> 00:01:14,433 Here we can do the same in R by using the 26 00:01:15,066 --> 00:01:17,400 set.seed function. 27 00:01:17,400 --> 00:01:19,866 And then in this function we actually input a seed. 28 00:01:19,866 --> 00:01:22,333 And, you know, we can use whatever seed we want. 29 00:01:22,333 --> 00:01:26,833 In Python we usually take zero or 42, and in R what we like to do 30 00:01:26,833 --> 00:01:30,500 is, you know, to take either 123 or 1234. 31 00:01:30,633 --> 00:01:33,166 So let's use a seed so that we all get the same result. 32 00:01:33,166 --> 00:01:35,666 And that's what makes this tutorial easier to follow 33 00:01:35,666 --> 00:01:37,866 if you're coding at the same time. 34 00:01:37,866 --> 00:01:39,333 So now we're all good. 35 00:01:39,333 --> 00:01:42,133 We're actually all good with the whole code. 36 00:01:42,133 --> 00:01:44,166 We don't have anything to replace. 37 00:01:44,166 --> 00:01:46,666 The only thing that we'll do now is to, you know, 38 00:01:46,666 --> 00:01:50,400 try several random forests with several numbers of trees, 39 00:01:50,733 --> 00:01:54,000 and look at the visualization results and look at the prediction 40 00:01:54,166 --> 00:01:58,633 to see if we're getting close to the supposed 160K 41 00:01:58,633 --> 00:02:01,866 previous salary of our new employee that is about to be hired. 42 00:02:02,700 --> 00:02:03,866 So let's do it. 43 00:02:03,866 --> 00:02:06,366 Let's execute the sections one by one. 44 00:02:06,366 --> 00:02:08,033 So let's import the dataset first. 45 00:02:09,533 --> 00:02:10,200 Here we go. 46 00:02:10,200 --> 00:02:11,766 Dataset well imported. 47 00:02:11,766 --> 00:02:15,433 We make sure we have our two columns, the independent variable Level 48 00:02:15,666 --> 00:02:18,666 and the dependent variable Salary. Perfect.
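To see what fixing the seed does, here is a small base R sketch. It uses `runif` rather than a random forest, purely to show that resetting the same seed replays the same random numbers. The seed 1234 is one of the values mentioned above, and the variable names are made up for this example.

```r
# Fix the seed, then draw three random numbers.
set.seed(1234)
first_draw <- runif(3)

# Resetting the same seed replays exactly the same random sequence.
set.seed(1234)
second_draw <- runif(3)

print(identical(first_draw, second_draw))  # TRUE
```

Calling `set.seed` right before building the forest has the same effect on the random sampling that the forest relies on, which is why everyone following along can get the same model.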
49 00:02:18,866 --> 00:02:21,933 Now no need to split the dataset into the training set and the test set. 50 00:02:22,200 --> 00:02:24,233 No need to apply feature scaling. 51 00:02:24,233 --> 00:02:28,600 And now time to create our first random forest. 52 00:02:28,733 --> 00:02:30,200 So let's do this. 53 00:02:30,200 --> 00:02:33,200 Let's execute this code section here. 54 00:02:33,700 --> 00:02:35,433 And here it is. Random forest 55 00:02:35,433 --> 00:02:37,400 well created. Perfect. 56 00:02:37,400 --> 00:02:39,066 So now it's time to have fun. 57 00:02:39,066 --> 00:02:40,833 Would you like to visualize the results 58 00:02:40,833 --> 00:02:43,200 first or get the prediction? 59 00:02:43,200 --> 00:02:45,833 Well, first let's maybe visualize the results 60 00:02:45,833 --> 00:02:48,300 because we want to make sure we have the right model 61 00:02:48,300 --> 00:02:51,600 and we want to validate it because we will try several numbers of trees. 62 00:02:51,866 --> 00:02:53,333 Here we are starting with ten trees. 63 00:02:53,333 --> 00:02:55,800 So we want to see if it looks like a correct model. 64 00:02:55,800 --> 00:02:58,300 So I'm going to execute this section. 65 00:02:58,300 --> 00:03:00,666 Here we go. And let's see what we'll get. 66 00:03:02,433 --> 00:03:02,766 Okay. 67 00:03:02,766 --> 00:03:04,266 So first of all this looks fine. 68 00:03:04,266 --> 00:03:06,900 We don't seem to have any problem here. 69 00:03:06,900 --> 00:03:09,900 The only thing that we can improve very quickly is actually, you know, 70 00:03:10,000 --> 00:03:11,600 those straight lines here. 71 00:03:11,600 --> 00:03:13,066 They are supposed to be vertical. 72 00:03:13,066 --> 00:03:15,500 And to get a better representation of this, 73 00:03:15,500 --> 00:03:20,566 we just need to increase the resolution as we did for decision tree regression. 74 00:03:20,566 --> 00:03:23,566 So let's set the step to 0.01. That will be sufficient.
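One way to picture the resolution change is to predict on a fine grid of levels with a step of 0.01 and draw the result. The sketch below uses base R graphics and the `randomForest` package. The data values, the names `x_grid` and `y_grid`, and the choice of base `plot` are assumptions made for this example and may differ from the code shown on screen.

```r
library(randomForest)

# Hypothetical stand-in data (assumed values, for illustration only).
dataset <- data.frame(
  Level  = 1:10,
  Salary = c(45, 50, 60, 80, 110, 150, 200, 300, 500, 1000) * 1000
)

set.seed(1234)
regressor <- randomForest(x = dataset["Level"], y = dataset$Salary, ntree = 10)

# Fine grid of levels: a step of 0.01 makes the jumps between steps look vertical.
x_grid <- seq(min(dataset$Level), max(dataset$Level), by = 0.01)
y_grid <- predict(regressor, newdata = data.frame(Level = x_grid))

# Observed points in red, forest predictions as a blue staircase.
plot(dataset$Level, dataset$Salary, col = "red",
     xlab = "Level", ylab = "Salary",
     main = "Random forest regression (sketch)")
lines(x_grid, y_grid, col = "blue")
```

With a coarser step the line segments joining consecutive grid points are visibly slanted, which is the effect being corrected here.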
75 00:03:23,766 --> 00:03:26,766 And let's re-execute this. 76 00:03:27,366 --> 00:03:28,366 And now much better. 77 00:03:28,366 --> 00:03:29,500 It almost looks like 78 00:03:29,500 --> 00:03:33,466 vertical straight lines, which represent the discontinuity better. 79 00:03:33,866 --> 00:03:34,966 And so now what can we say? 80 00:03:34,966 --> 00:03:38,300 Let's zoom on this plot to have a better look. 81 00:03:38,733 --> 00:03:40,700 And now listen to this. 82 00:03:40,700 --> 00:03:44,800 Okay, so the answer to the enigma that I asked you in the previous section 83 00:03:44,800 --> 00:03:47,100 and that I was asking you again in this tutorial 84 00:03:47,100 --> 00:03:50,900 is that we simply get more steps in the stairs 85 00:03:51,266 --> 00:03:55,066 by having several decision trees instead of one decision tree. 86 00:03:55,400 --> 00:04:00,200 We have a lot more steps in the stairs than what we had with one decision tree, 87 00:04:00,700 --> 00:04:04,533 and therefore we have a lot more splits of the whole range of levels, 88 00:04:04,533 --> 00:04:07,433 and therefore a lot more intervals of the different levels. 89 00:04:07,433 --> 00:04:12,000 So each straight horizontal line here, separated by these vertical lines, 90 00:04:12,166 --> 00:04:15,166 is one interval, that is, one split. 91 00:04:15,233 --> 00:04:18,700 And the fact that we get more steps in the stairs is actually quite intuitive 92 00:04:18,700 --> 00:04:23,700 because, you know, if we get, for example, this prediction here for the 6.5 level, 93 00:04:23,800 --> 00:04:27,866 well, what happened for this prediction is that we had ten trees voting 94 00:04:27,866 --> 00:04:32,200 on which step the salary of the 6.5 level position would be. 95 00:04:32,833 --> 00:04:36,266 And then the random forest took the average of all the different predictions 96 00:04:36,266 --> 00:04:40,400 of the salary of the 6.5 level made by all the different trees in the forest.
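The voting-then-averaging idea can be checked directly. The sketch below relies on the `predict.all = TRUE` option of `predict` for `randomForest` objects, which returns each tree's prediction alongside the forest's aggregate. The data values and variable names are assumptions made for this example.

```r
library(randomForest)

# Hypothetical stand-in data (assumed values, for illustration only).
dataset <- data.frame(
  Level  = 1:10,
  Salary = c(45, 50, 60, 80, 110, 150, 200, 300, 500, 1000) * 1000
)

set.seed(1234)
regressor <- randomForest(x = dataset["Level"], y = dataset$Salary, ntree = 10)

# Ask every tree for its own prediction at levels 4 and 6.5.
votes <- predict(regressor,
                 newdata = data.frame(Level = c(4, 6.5)),
                 predict.all = TRUE)

# One row per queried level, one column per tree (2 x 10 here).
print(votes$individual)

# The forest's prediction is the average of the ten per-tree predictions.
print(rowMeans(votes$individual))
print(votes$aggregate)
```

The last two printed vectors should agree, since for regression the forest's output is the mean of its trees' outputs.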
97 00:04:40,733 --> 00:04:44,600 And for example, if we take the level 4 position, ten votes were made. 98 00:04:44,900 --> 00:04:47,833 Each of these ten votes corresponds to one prediction 99 00:04:47,833 --> 00:04:51,033 of the level 4 salary made by each one of those ten trees. 100 00:04:51,333 --> 00:04:54,666 And then the random forest took the average of these ten predictions. 101 00:04:54,800 --> 00:04:58,300 And this average is nothing else than the prediction of the level 102 00:04:58,300 --> 00:05:01,300 4 salary made by the random forest itself. 103 00:05:01,800 --> 00:05:05,233 And so we get more steps, simply because the whole range of levels 104 00:05:05,233 --> 00:05:07,100 is split into more intervals. 105 00:05:07,100 --> 00:05:10,600 And that is because the random forest is calculating many different 106 00:05:10,600 --> 00:05:14,333 averages of its decision trees' predictions in each of these intervals. 107 00:05:14,900 --> 00:05:17,200 So that's what happened. It's quite intuitive. 108 00:05:17,200 --> 00:05:19,500 However, something important to point out here 109 00:05:19,500 --> 00:05:23,233 is that if we add a lot more trees in our random forest, 110 00:05:23,433 --> 00:05:27,600 well, it doesn't mean we'll get a lot more steps in the stairs, 111 00:05:27,866 --> 00:05:30,100 because the more trees you add, the more 112 00:05:30,100 --> 00:05:33,833 the average of the different predictions made by the trees converges 113 00:05:33,866 --> 00:05:35,200 to the same average. 114 00:05:35,200 --> 00:05:39,033 You know, this is based on the same technique, entropy and information gain. 115 00:05:39,166 --> 00:05:42,266 So the more trees you add, the more the average of these votes 116 00:05:42,266 --> 00:05:45,266 will converge to the same ultimate average, 117 00:05:45,333 --> 00:05:49,266 and therefore it will converge to a certain shape of stairs here.
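A quick way to explore the convergence described above is to fit forests of several sizes and compare their predictions at one level. This is only a sketch: the data values, the tree counts chosen, and the names `tree_counts` and `preds` are assumptions made for this example, and the exact numbers printed will depend on the data and the seed.

```r
library(randomForest)

# Hypothetical stand-in data (assumed values, for illustration only).
dataset <- data.frame(
  Level  = 1:10,
  Salary = c(45, 50, 60, 80, 110, 150, 200, 300, 500, 1000) * 1000
)

# Forest sizes to compare.
tree_counts <- c(10, 100, 500)

# Fit one forest per size and predict the salary for level 6.5 each time.
preds <- vapply(tree_counts, function(n) {
  set.seed(1234)
  forest <- randomForest(x = dataset["Level"], y = dataset$Salary, ntree = n)
  as.numeric(predict(forest, newdata = data.frame(Level = 6.5)))
}, numeric(1))

print(data.frame(ntree = tree_counts, prediction = preds))
```

As more trees are averaged, the prediction tends to move less from one forest size to the next, which matches the idea of the staircase settling into a stable shape.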
118 00:05:49,366 --> 00:05:51,733 So it's important to visualize this as well. 119 00:05:51,733 --> 00:05:55,300 And now, since we have our intuition of the visualization of the random forest 120 00:05:55,300 --> 00:05:58,500 regression in 1D, let's see what happens with the prediction.