1 00:00:00,133 --> 00:00:02,533 Hello and welcome to this R tutorial. 2 00:00:02,533 --> 00:00:05,100 Here we are at the final round of regression 3 00:00:05,100 --> 00:00:08,433 with our final regression model: random forest regression. 4 00:00:09,000 --> 00:00:12,066 In the previous section we saw the decision tree regression model. 5 00:00:12,333 --> 00:00:15,533 So if decision tree regression now holds no secrets for you, 6 00:00:15,600 --> 00:00:18,600 then you will perfectly understand random forest regression, 7 00:00:18,766 --> 00:00:22,100 because a random forest is just a team of decision trees, 8 00:00:22,400 --> 00:00:25,733 each one making some prediction of your dependent variable, 9 00:00:25,933 --> 00:00:29,466 and the ultimate prediction of the random forest itself is simply 10 00:00:29,466 --> 00:00:33,266 the average of the different predictions of all the different trees in the forest. 11 00:00:33,466 --> 00:00:35,233 And actually, at the end of the previous section 12 00:00:35,233 --> 00:00:37,833 about decision trees, I gave you an enigma. 13 00:00:37,833 --> 00:00:43,000 The enigma was: knowing the results we got with one tree, what would be the result 14 00:00:43,000 --> 00:00:46,000 with ten trees or 100 trees or 500 trees, 15 00:00:46,200 --> 00:00:49,400 in terms of visualization and in terms of prediction? 16 00:00:49,833 --> 00:00:53,966 So I hope that after watching the intuition tutorial made by Kirill, 17 00:00:54,166 --> 00:00:57,600 you actually asked yourself this question and tried to predict 18 00:00:57,600 --> 00:01:00,600 what's going to happen here with random forest regression. 19 00:01:00,900 --> 00:01:02,366 So let's find out about that. 20 00:01:02,366 --> 00:01:06,333 We are going to build a random forest regression model and see what happens. 21 00:01:06,533 --> 00:01:07,700 So let's do it. 22 00:01:07,700 --> 00:01:11,200 We're going to start by selecting the right folder as a working directory.
23 00:01:11,200 --> 00:01:13,733 So it's in part two, regression. 24 00:01:13,733 --> 00:01:15,466 And here is the final regression model. 25 00:01:15,466 --> 00:01:17,400 We're building random forest regression. 26 00:01:17,400 --> 00:01:18,733 So let's go inside. 27 00:01:18,733 --> 00:01:20,266 And that's the right folder we want to set 28 00:01:20,266 --> 00:01:23,266 as working directory, with the position salaries CSV file. 29 00:01:23,400 --> 00:01:27,600 So let's click on this More button and set as working directory. 30 00:01:27,966 --> 00:01:28,933 All good. 31 00:01:28,933 --> 00:01:32,733 And now let's take our regression template to build this model efficiently. 32 00:01:33,000 --> 00:01:37,033 So we're actually going to take everything from here to the bottom. 33 00:01:37,300 --> 00:01:40,533 But we will only include this code section to visualize the regression 34 00:01:40,533 --> 00:01:41,533 model results, 35 00:01:41,533 --> 00:01:44,333 because you understood that the decision tree regression 36 00:01:44,333 --> 00:01:47,100 model is a non-continuous regression model. 37 00:01:47,100 --> 00:01:50,100 And since random forest is a combination of decision trees, 38 00:01:50,300 --> 00:01:53,133 then it's a combination of non-continuous regression models. 39 00:01:53,133 --> 00:01:56,766 And intuitively we can guess that the random forest 40 00:01:56,766 --> 00:01:59,766 regression model is not going to be continuous either. 41 00:01:59,900 --> 00:02:03,866 So since this code doesn't work for non-continuous regression models, 42 00:02:04,133 --> 00:02:05,666 we will actually use this one, 43 00:02:05,666 --> 00:02:07,533 which works perfectly for them. 44 00:02:07,533 --> 00:02:12,300 So I'm going to copy this, paste that here, and remove this section 45 00:02:12,300 --> 00:02:16,100 that is not appropriate for non-continuous regression models. 46 00:02:16,333 --> 00:02:17,133 Here we go.
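The guess that an average of non-continuous trees is itself non-continuous can be checked with a small base R sketch. The two "trees" below, their thresholds, and their predicted values are invented for illustration; they are not taken from the tutorial's dataset or from a fitted model.

```r
# Two hypothetical "trees": each predicts a constant value within its own intervals.
tree_a = function(level) ifelse(level < 5.5, 100, 300)
tree_b = function(level) ifelse(level < 7.5, 120, 500)

# The "forest" prediction is the average of the individual tree predictions.
forest = function(level) (tree_a(level) + tree_b(level)) / 2

# Evaluate on a fine grid of levels (grid bounds and step are assumptions).
x_grid = seq(1, 10, by = 0.01)

# The averaged curve takes only three distinct values (110, 210, 400),
# so it is still a staircase, with jumps at 5.5 and 7.5.
unique(forest(x_grid))
```

Averaging adds steps where the individual trees split, but it never smooths the jumps away, which is why a plot of such a model is usually drawn on a fine grid.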
47 00:02:17,133 --> 00:02:19,500 And now the template is ready. 48 00:02:19,500 --> 00:02:21,066 Let's change the basics. 49 00:02:21,066 --> 00:02:24,266 Let's replace here regression model by random forest 50 00:02:25,200 --> 00:02:28,066 regression. 51 00:02:28,066 --> 00:02:30,966 Visualizing the random forest regression results 52 00:02:30,966 --> 00:02:36,000 and fitting random forest regression to our dataset. Okay, great. 53 00:02:36,200 --> 00:02:39,833 So now let's build the model, which is in this section here. 54 00:02:39,833 --> 00:02:42,600 So let's remove this. 55 00:02:42,600 --> 00:02:46,666 And as usual we're going to import the right library for the job. 56 00:02:46,866 --> 00:02:51,500 And then use a function to build our random forest regressor. 57 00:02:51,966 --> 00:02:55,400 So the package we are going to import is called randomForest. 58 00:02:55,833 --> 00:02:58,833 So for those of you who don't have the package installed 59 00:02:59,100 --> 00:03:02,100 in your packages here, well, you can check it out. 60 00:03:02,100 --> 00:03:05,100 Mine is already installed because I used it before, 61 00:03:05,100 --> 00:03:09,933 but I'm going to write this line here for those of you who need to install it. 62 00:03:10,200 --> 00:03:13,200 So install.packages, 63 00:03:13,500 --> 00:03:16,900 parenthesis, and in quotes, random, 64 00:03:17,633 --> 00:03:20,400 so no capital R, but then capital F, 65 00:03:20,400 --> 00:03:23,133 Forest. All right, randomForest. 66 00:03:23,133 --> 00:03:25,766 And so I'm not going to install it because mine is already installed. 67 00:03:25,766 --> 00:03:27,266 So I'm going to put that in a comment. 68 00:03:27,266 --> 00:03:30,100 But if you want to install it you just need to select this line as 69 00:03:30,100 --> 00:03:33,133 I just did and press Command or Control plus Enter to execute it. 70 00:03:33,466 --> 00:03:35,800 And this will install the package properly.
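As a minimal sketch, the install-and-load step described here could look like the following in the script. The install line is left commented out on the assumption that the package is already present, as it is for the speaker.

```r
# Install once; uncomment and run this line if the package is not installed yet.
# install.packages("randomForest")

# Load the package so that randomForest() is available in the current session.
library(randomForest)
```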
71 00:03:35,800 --> 00:03:39,233 But here I'm going to put it in a comment by pressing Command plus Shift plus C. 72 00:03:39,433 --> 00:03:40,500 Here we go. 73 00:03:40,500 --> 00:03:43,933 And now what we have to do is to add this, you know, library 74 00:03:44,500 --> 00:03:48,066 randomForest, to actually automatically select 75 00:03:48,066 --> 00:03:51,066 the box here, to import automatically the randomForest package 76 00:03:51,200 --> 00:03:54,166 when we execute the whole code or the section. 77 00:03:54,166 --> 00:03:55,566 So that's important. 78 00:03:55,566 --> 00:03:58,066 And now time to build the regressor. 79 00:03:58,066 --> 00:03:59,200 So let's do it. 80 00:03:59,200 --> 00:04:04,300 We're going to call the regressor regressor, as usual, to keep things simple, and equals. 81 00:04:04,633 --> 00:04:07,500 And now the function that we're going to use is also 82 00:04:07,500 --> 00:04:10,733 randomForest, written the same. 83 00:04:10,966 --> 00:04:13,166 And so now let's add some parentheses. 84 00:04:13,166 --> 00:04:17,433 And now let's press F1 to have a look at the arguments. Okay. 85 00:04:17,433 --> 00:04:18,866 So the arguments are here. 86 00:04:18,866 --> 00:04:21,400 And the first argument is data. 87 00:04:21,400 --> 00:04:24,400 But as you can see it specifies that it's an optional data frame. 88 00:04:24,400 --> 00:04:27,566 And we could use this argument to build our regressor. 89 00:04:27,566 --> 00:04:31,666 But we're going to use the main arguments to specify, you know, the independent 90 00:04:31,666 --> 00:04:34,900 variables on one side and the dependent variable on the other side. 91 00:04:35,266 --> 00:04:39,233 And to do this we're going to use these two arguments x and y. 92 00:04:39,333 --> 00:04:44,300 So x will contain the matrix of features, that is, the independent variables. 93 00:04:44,533 --> 00:04:48,700 And y will contain the dependent variable vector, that is, the salary column.
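The two ways of calling randomForest mentioned here (the optional data argument with a formula, or the x and y arguments) can be sketched as follows. The data frame is a hypothetical stand-in with invented values, the column names Level and Salary are assumptions, and the ntree value and the seed are illustrative choices that the transcript has not specified.

```r
library(randomForest)

# Hypothetical stand-in for the two-column dataset (values invented for illustration).
dataset = data.frame(Level = 1:10,
                     Salary = c(45, 50, 60, 80, 110, 150, 200, 300, 500, 1000) * 1000)

# Style 1: a formula together with the optional data argument.
set.seed(1234)  # assumption: fixed seed so that reruns build the same forest
regressor_formula = randomForest(Salary ~ Level, data = dataset, ntree = 10)

# Style 2: independent variables in x, dependent variable in y.
set.seed(1234)
regressor_xy = randomForest(x = dataset[1],      # data frame of independent variables
                            y = dataset$Salary,  # numeric response vector
                            ntree = 10)          # assumption: number of trees

# Predict for a new level; newdata must use the same column name as x.
predict(regressor_xy, data.frame(Level = 6.5))
```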
94 00:04:49,333 --> 00:04:51,466 So let's first input these two arguments. 95 00:04:51,466 --> 00:04:54,300 So the first argument is x equals. 96 00:04:54,300 --> 00:04:57,466 And so we have several ways to take our independent variables. 97 00:04:57,766 --> 00:05:01,000 So one of the ways is to take our dataset here. 98 00:05:01,500 --> 00:05:05,500 And then choose the right columns of the independent variables. 99 00:05:05,833 --> 00:05:08,700 And you know our dataset is composed of two columns. 100 00:05:08,700 --> 00:05:12,000 The first column, indexed by one, which is the independent variable column. 101 00:05:12,300 --> 00:05:16,600 And the second column, indexed by two, which is our dependent variable column. 102 00:05:17,066 --> 00:05:20,700 So here we need index one because we want to take the independent variable. 103 00:05:21,566 --> 00:05:22,433 All right. 104 00:05:22,433 --> 00:05:23,433 Now next argument. 105 00:05:23,433 --> 00:05:26,433 The next argument is y, the dependent variable vector. 106 00:05:26,466 --> 00:05:30,900 And now as you can see y is expected to be a response vector. 107 00:05:30,900 --> 00:05:32,266 It's actually a vector. 108 00:05:32,266 --> 00:05:34,800 And here, for x, it expected a data frame. 109 00:05:34,800 --> 00:05:39,666 So by using this index one in brackets here, I actually get a data frame. 110 00:05:39,666 --> 00:05:43,333 But here to get a vector I actually need to use another trick, 111 00:05:43,333 --> 00:05:46,200 another technique, which is, you know, to use a dollar sign. 112 00:05:46,200 --> 00:05:49,400 And then the name of the column, which is of course salary. 113 00:05:50,366 --> 00:05:53,366 And that will give me a vector.
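The difference between the bracket index and the dollar sign can be seen in a short base R sketch. The data frame is a hypothetical stand-in with invented values, and the column names Level and Salary are assumptions.

```r
# Hypothetical stand-in for the two-column dataset (values invented for illustration).
dataset = data.frame(Level = 1:3, Salary = c(45000, 50000, 60000))

x = dataset[1]        # single-bracket indexing keeps the data frame structure
y = dataset$Salary    # the dollar sign extracts the column as a plain vector

class(x)   # "data.frame"
class(y)   # "numeric"
```

This is why x can be passed as dataset[1] while y is passed as dataset$Salary.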