1 00:00:01,710 --> 00:00:06,600 In this video, we are going to look at the code to build a regression tree. 2 00:00:07,200 --> 00:00:10,530 We will build a regression tree on our movie data set. 3 00:00:12,020 --> 00:00:19,670 And using that tree, we will predict the values of box office collection for our test set. 4 00:00:20,970 --> 00:00:27,330 The first step that we need to take to create this tree is to install some packages. 5 00:00:28,490 --> 00:00:33,410 To create this tree, we will be using the rpart package. 6 00:00:34,990 --> 00:00:41,100 And once the tree is created, we will use the rpart.plot package to plot that decision tree. 7 00:00:42,000 --> 00:00:45,620 So we need two packages, rpart and rpart.plot. 8 00:00:47,010 --> 00:00:50,310 Again, if these are already installed, you do not need to install them. 9 00:00:51,640 --> 00:00:57,580 But if they are not, you need to run these two lines to install the rpart and rpart.plot 10 00:00:57,610 --> 00:00:58,120 packages. 11 00:00:58,420 --> 00:00:59,800 So I will run these two commands. 12 00:01:02,380 --> 00:01:03,480 And the installation starts. 13 00:01:12,270 --> 00:01:14,780 Now both the packages are installed. 14 00:01:15,390 --> 00:01:21,090 But if you go to the packages section, the tick will not be there. 15 00:01:25,650 --> 00:01:27,840 You can see we have these two packages. 16 00:01:27,990 --> 00:01:32,070 These are installed, but they are not active. To make them active, 17 00:01:32,390 --> 00:01:38,140 we either run library(rpart) and library(rpart.plot), or we just tick them here. 18 00:01:38,400 --> 00:01:40,710 So we'll run these two commands to make them active. 19 00:01:42,790 --> 00:01:50,200 Now we can use these two libraries. Training a regression tree is very simple, it is just one line of 20 00:01:50,200 --> 00:01:50,590 code. 21 00:01:51,130 --> 00:01:54,670 This is the line we will use. 
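The installation and loading steps described above can be sketched roughly as follows. This is a minimal sketch, assuming an R session with internet access. The helper names `required_pkgs`, `is_installed` and `missing_pkgs` and the choice of CRAN mirror are illustrative assumptions, not something shown in the video.

```r
# Packages named in the video.
required_pkgs <- c("rpart", "rpart.plot")

# Check which of them can already be loaded (illustrative helper names).
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Install only what is missing; the mirror URL is an assumption.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}

# Installing is not enough: attach both packages to make them active.
library(rpart)
library(rpart.plot)
```

Calling `library()` has the same effect as ticking the package in the packages section, so only one of the two is needed.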
22 00:01:54,820 --> 00:02:01,530 This is the rpart function. regtree is the name of the variable in which we will store all the information of the decision 23 00:02:01,600 --> 00:02:01,990 tree. 24 00:02:03,130 --> 00:02:07,180 And regtree is getting that information from the rpart function. 25 00:02:08,520 --> 00:02:09,570 The formula is 26 00:02:10,910 --> 00:02:14,900 collection tilde dot. The meaning of collection tilde dot is this. 27 00:02:16,620 --> 00:02:20,870 Anything on the left of the tilde, this wavy symbol, 28 00:02:22,110 --> 00:02:25,590 anything to the left of this wavy symbol is the dependent variable. 29 00:02:26,310 --> 00:02:28,290 So my dependent variable is collection. 30 00:02:29,010 --> 00:02:36,900 And I want to identify the relationship of collection with all the other variables in my data. To represent 31 00:02:37,050 --> 00:02:37,980 all other variables, 32 00:02:38,040 --> 00:02:39,330 I have to write a dot. 33 00:02:40,390 --> 00:02:45,190 If I want to use a particular variable, I can write the name of that particular variable here instead. 34 00:02:46,030 --> 00:02:52,150 If I want to use two variables, I can write the names of both the variables and use a plus symbol 35 00:02:52,150 --> 00:02:52,720 between them. 36 00:02:53,470 --> 00:03:00,220 So if I want to find the relationship of collection with marketing expense and production expense, 37 00:03:00,640 --> 00:03:07,000 I'll just write collection tilde marketing expense plus production expense. 38 00:03:08,640 --> 00:03:12,660 So the formula tells the rpart function 39 00:03:13,930 --> 00:03:19,140 which is the dependent variable and which are the independent variables. Here, 40 00:03:19,270 --> 00:03:26,970 collection is the dependent variable, and the dot signifies that all other variables except collection are 41 00:03:26,970 --> 00:03:29,650 the independent or predictor variables. 42 00:03:31,000 --> 00:03:32,770 The next parameter is data. 43 00:03:33,850 --> 00:03:35,400 The name of the data is train. 
44 00:03:35,980 --> 00:03:39,640 Because we want to use only the training set to train the model. 45 00:03:41,660 --> 00:03:44,810 The third parameter we are going to use is control. 46 00:03:45,990 --> 00:03:50,340 This parameter will help us control the length of the decision tree. 47 00:03:51,580 --> 00:03:57,670 We do not want a very long decision tree. A small tree is more interpretable, 48 00:03:58,030 --> 00:04:00,880 and it has less chance of overfitting. 49 00:04:01,790 --> 00:04:05,360 So here, I'm just going to keep a max depth of three. 50 00:04:06,140 --> 00:04:08,450 That is, it will have at most 51 00:04:08,690 --> 00:04:13,250 three more layers below the initial node, which has the total population. 52 00:04:15,050 --> 00:04:23,030 If you want to know more about the rpart function, that is, what other arguments you can use 53 00:04:23,150 --> 00:04:23,980 with rpart, 54 00:04:24,680 --> 00:04:27,090 you just go to rpart and press F1. 55 00:04:28,940 --> 00:04:33,290 It will open the help of the rpart function in the Help pane. 56 00:04:34,250 --> 00:04:37,790 You can see all the different arguments that can be given. 57 00:04:38,420 --> 00:04:41,120 The arguments that we are giving are mostly mandatory. 58 00:04:41,390 --> 00:04:43,850 This control argument is not mandatory. 59 00:04:44,150 --> 00:04:48,040 It has some default values, but to control the growth of this tree, 60 00:04:48,200 --> 00:04:50,300 we have set this control argument too. 61 00:04:52,670 --> 00:04:54,380 So I will run this command now. 62 00:04:58,510 --> 00:05:07,690 And you can see a new variable, regtree, is created, and it has the information of the decision tree 63 00:05:07,780 --> 00:05:09,700 model on this training set. 64 00:05:11,480 --> 00:05:16,670 Now, if you want to look at the decision tree created by this single line of code, 65 00:05:17,740 --> 00:05:19,420 you can run this second line. 
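The training call described above, together with the plotting call mentioned as the second line, might look roughly like this. It is a sketch under stated assumptions: the movie data set is not available here, so the built-in `mtcars` data stands in for it and `mpg` stands in for the collection column. The split into `train`, the name `regtree_two` and the palette name `"RdBu"` are also assumptions for illustration.

```r
library(rpart)

# Stand-in data (assumption): mtcars replaces the movie data,
# and mpg plays the role of the dependent variable "collection".
set.seed(42)
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]

# Dependent variable on the left of ~, a dot for "all other columns".
# maxdepth = 3 limits the tree to three layers below the root node.
regtree <- rpart(formula = mpg ~ ., data = train,
                 control = rpart.control(maxdepth = 3))

# Same call restricted to two named predictors joined by a plus sign.
regtree_two <- rpart(formula = mpg ~ hp + wt, data = train,
                     control = rpart.control(maxdepth = 3))

# Text summary of the fitted tree.
print(regtree)

# Optional plot; it runs only if rpart.plot is installed.
# The palette name is an assumed example value.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(regtree, box.palette = "RdBu", digits = -3)
}
```

Because the response is numeric, `rpart` fits a regression tree by default. Running `?rpart` or `?rpart.control` in the console opens the same help page as pressing F1.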
66 00:05:20,890 --> 00:05:26,090 The second line uses the rpart.plot function of the rpart.plot package. 67 00:05:27,880 --> 00:05:35,560 The first argument is the variable which contains the decision tree information. For us, it 68 00:05:35,560 --> 00:05:36,260 was regtree. 69 00:05:39,190 --> 00:05:41,470 The second argument is not mandatory. 70 00:05:41,530 --> 00:05:46,280 It is just the color palette that we are going to use to plot this decision tree. 71 00:05:46,450 --> 00:05:48,530 You can use a different color palette. 72 00:05:49,240 --> 00:05:51,370 You can omit this argument also. 73 00:05:52,980 --> 00:05:56,450 If you do not write this, it will use some default colors of its own. 74 00:05:59,220 --> 00:06:04,480 The third argument is digits. Digits tells how many significant digits 75 00:06:04,740 --> 00:06:06,240 the decision tree should have. 76 00:06:07,050 --> 00:06:13,500 Once we plot this decision tree, I will change the value of digits to show you how the values 77 00:06:13,980 --> 00:06:15,480 in the decision tree change. 78 00:06:16,450 --> 00:06:19,190 So let us run this command and plot the decision tree. 79 00:06:24,060 --> 00:06:27,930 You can see on the right we have this decision tree plot. 80 00:06:28,890 --> 00:06:30,840 We can click on the zoom button. 81 00:06:34,160 --> 00:06:37,400 And this is the zoomed version of this decision tree. 82 00:06:38,790 --> 00:06:41,790 Let us look at this decision tree to see what it is telling us. 83 00:06:43,300 --> 00:06:49,480 So the top node, that is, the initial node, has all the observations, 84 00:06:49,570 --> 00:06:51,460 that is, a hundred percent of the observations. 85 00:06:52,270 --> 00:06:56,850 And the average collection of all the movies was nearly 45000. 86 00:06:58,520 --> 00:07:02,470 The first split is using the variable budget 87 00:07:03,540 --> 00:07:08,400 at nearly thirty-eight thousand. So, the left side 
88 00:07:08,820 --> 00:07:16,890 is for movies which have a budget less than 38000, and the right side is for movies which have a budget more 89 00:07:16,890 --> 00:07:17,850 than thirty-eight thousand. 90 00:07:20,140 --> 00:07:26,940 On the right side, you can see this node contains nearly 16 percent of the observations, 91 00:07:27,750 --> 00:07:33,330 whereas this left node contains nearly 84 percent of the observations. 92 00:07:36,050 --> 00:07:41,190 In these 16 percent of the movies, which have a budget of more than 38000, 93 00:07:42,470 --> 00:07:47,210 the average collection is nearly seventy-two thousand, whereas 94 00:07:48,640 --> 00:07:54,730 in movies with less than 38000 budget, the average collection is nearly 39000. 95 00:07:55,390 --> 00:08:02,000 So there is a huge difference between movies which have a budget more than 38000 96 00:08:02,440 --> 00:08:04,450 and a budget less than 38000. 97 00:08:05,410 --> 00:08:09,730 You can continue further to see the effect of different variables 98 00:08:10,880 --> 00:08:11,720 at each node. 99 00:08:11,930 --> 00:08:17,030 So in this node, the split is on one of the rating variables. 100 00:08:18,410 --> 00:08:26,190 And in this node, the split is on trailer views, and so on. Since we mentioned that we need a maximum 101 00:08:26,190 --> 00:08:27,630 depth of three layers, 102 00:08:28,770 --> 00:08:29,970 this is layer zero, 103 00:08:30,900 --> 00:08:31,860 this is layer one, 104 00:08:32,610 --> 00:08:33,470 this is layer two, 105 00:08:33,810 --> 00:08:35,210 and this is layer three. 106 00:08:36,620 --> 00:08:41,570 So this tree has a maximum depth of three. 107 00:08:42,590 --> 00:08:49,430 If we increase that control parameter of maximum depth to five, we will have a bigger tree, 108 00:08:50,740 --> 00:08:53,510 although that tree will be difficult to interpret. 109 00:08:55,440 --> 00:09:00,480 So here you can clearly see that the most important variable is budget. After budget, 
110 00:09:00,610 --> 00:09:02,200 it is trailer views. 111 00:09:03,190 --> 00:09:06,130 And then we have a rating variable or marketing expense. 112 00:09:07,640 --> 00:09:12,760 Now, let us go back and see the effect of mentioning the digits value 113 00:09:12,970 --> 00:09:13,510 that we set. 114 00:09:17,000 --> 00:09:22,850 So if I put digits as zero and run the command, 115 00:09:26,330 --> 00:09:28,490 you can see that now I have... 116 00:09:31,000 --> 00:09:41,080 Now, I have this plot in scientific format, that is, it is giving me values in 45 into ten to 117 00:09:41,130 --> 00:09:42,940 the power three format. 118 00:09:44,310 --> 00:09:51,120 Since this becomes a little bit difficult to read and understand, that is why I had written digits equal 119 00:09:51,120 --> 00:09:51,990 to minus three. 120 00:09:52,230 --> 00:09:52,650 That is, 121 00:09:54,620 --> 00:09:59,300 I want those three significant figures also, which were going into the power. 122 00:09:59,960 --> 00:10:05,680 So if you want them to be in the power, you increase the value of digits. 123 00:10:06,560 --> 00:10:12,320 And if you want more significant digits to be displayed like this, you reduce the value of digits. 124 00:10:14,040 --> 00:10:21,540 Till now, we have trained our regression tree on the training set, and we have used 125 00:10:22,020 --> 00:10:25,140 the rpart.plot function to plot the regression tree. 126 00:10:28,330 --> 00:10:33,030 You can use these regression trees in your presentations and in your documents. 127 00:10:33,790 --> 00:10:37,060 And these can help you make important business decisions. 128 00:10:38,900 --> 00:10:47,600 Another part of using the decision tree is to be able to predict the values for other movies or for future 129 00:10:47,990 --> 00:10:49,820 observations that we are going to get. 130 00:10:51,390 --> 00:10:54,580 To predict values, we use the predict function. 
131 00:10:55,600 --> 00:11:00,780 I'm going to add a new column in the test dataset. 132 00:11:01,540 --> 00:11:04,930 This column will contain the predicted values 133 00:11:06,380 --> 00:11:08,660 using the decision tree that we have created. 134 00:11:11,700 --> 00:11:17,880 So to find out the predicted values, we use this predict function. In this predict function, 135 00:11:18,720 --> 00:11:26,570 the first argument is the variable which contains the information of the decision tree, which was regtree 136 00:11:26,580 --> 00:11:27,090 for us. 137 00:11:28,580 --> 00:11:33,350 The second argument is the data on which you want to predict. 138 00:11:33,800 --> 00:11:37,160 So that data is contained in the test dataset. 139 00:11:38,670 --> 00:11:44,580 And the third parameter tells the type of response that you want to predict. 140 00:11:45,360 --> 00:11:49,590 Since we want continuous values, we'll use type equal to vector. 141 00:11:50,980 --> 00:11:57,550 If this was a classification decision tree, that is, we wanted to predict the classes, then we would 142 00:11:57,550 --> 00:11:59,230 have used a different type, 143 00:11:59,590 --> 00:12:01,270 that is, type equal to class. 144 00:12:02,110 --> 00:12:03,940 Here, since it is a regression tree, 145 00:12:04,060 --> 00:12:05,580 we want to predict a vector. 146 00:12:06,400 --> 00:12:13,160 So I will run this and we can go and look at the test dataset. 147 00:12:14,000 --> 00:12:14,740 Just click on it. 148 00:12:18,180 --> 00:12:24,450 And at the end, you can see this column, the collection column, has the actual values. 149 00:12:25,410 --> 00:12:31,970 And this last column, the pred column, has the predicted values for our 150 00:12:32,260 --> 00:12:33,120 test dataset. 151 00:12:34,690 --> 00:12:41,410 So using the information in the tree, it has predicted these values for the test set. 
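The prediction step described above can be sketched as follows. As before, this is an illustration under assumptions rather than the code from the video: `mtcars` stands in for the movie data, `mpg` stands in for the collection column, and the train/test split is made up for the example.

```r
library(rpart)

# Stand-in data (assumption): mtcars replaces the movie data,
# and mpg plays the role of the "collection" column.
set.seed(42)
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]
test <- mtcars[-train_rows, ]

# Fit the regression tree on the training rows only.
regtree <- rpart(mpg ~ ., data = train, control = rpart.control(maxdepth = 3))

# Add a new column with predictions for the test rows.
# type = "vector" returns numeric values; a classification tree
# would use type = "class" to get predicted classes instead.
test$pred <- predict(regtree, newdata = test, type = "vector")

# Actual and predicted values side by side.
head(test[, c("mpg", "pred")])
```

Each predicted value is the average response of the training observations in the leaf that the test row falls into, so a shallow tree produces only a few distinct predictions.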
152 00:12:43,120 --> 00:12:50,290 Now, one measure of performance of our model is the mean squared error. 153 00:12:52,510 --> 00:13:00,100 Mean squared error is the mean of the squared differences between 154 00:13:00,220 --> 00:13:01,360 actual and predicted values. 155 00:13:02,150 --> 00:13:08,470 So basically we find the difference between predicted and actual, we square this difference, 156 00:13:09,770 --> 00:13:12,620 and then we find the mean of all those values. 157 00:13:15,120 --> 00:13:18,560 If I run this, a new variable, MSE2, is created. 158 00:13:19,340 --> 00:13:20,390 It has this value, 159 00:13:21,950 --> 00:13:24,560 which is nearly one hundred thirteen million. 160 00:13:26,570 --> 00:13:32,480 And this is the mean squared difference between actual and predicted values. 161 00:13:34,030 --> 00:13:41,620 Now, when we study the advanced techniques in the future, we will calculate MSE values for all those techniques 162 00:13:42,250 --> 00:13:47,500 and we will see whether using those techniques reduces this MSE. 163 00:13:49,450 --> 00:13:56,620 So basically, the accuracy will be judged using MSE. If the MSE value is low, 164 00:13:57,190 --> 00:13:59,170 the model is more accurate. 165 00:14:03,640 --> 00:14:07,500 So this is the code to create our decision tree, 166 00:14:07,950 --> 00:14:15,940 then to plot a decision tree, and then to use that decision tree to predict values for another dataset.
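The mean squared error calculation described above can be checked by hand on a tiny example. The numbers below are made up purely to show the arithmetic, and the names `actual`, `predicted` and `mse` are illustrative.

```r
# Made-up actual and predicted values, used only to show the arithmetic.
actual <- c(10, 20, 30)
predicted <- c(12, 18, 33)

# Differences: 2, -2, 3. Squares: 4, 4, 9. Mean: 17 / 3.
mse <- mean((predicted - actual)^2)
print(mse)
```

On the test set, the same expression would be applied to the predicted column and the actual collection column. MSE is measured in squared units of the response, so a value near one hundred thirteen million corresponds to a typical error of roughly ten to eleven thousand in the units of collection.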