1 00:00:01,200 --> 00:00:06,650 In this video, we are going to understand the steps taken to build a regression tree. 2 00:00:09,400 --> 00:00:14,560 The conceptual understanding that you will get from this lecture will help you in your interview 3 00:00:14,590 --> 00:00:15,640 or viva questions. 4 00:00:16,270 --> 00:00:22,900 Plus, you'll be able to manipulate the decision tree and interpret its results much better than someone 5 00:00:23,140 --> 00:00:25,550 who just knows the code to make a decision tree. 6 00:00:27,330 --> 00:00:34,290 So as I showed you earlier, in a decision tree we are trying to create regions or segments. 7 00:00:35,680 --> 00:00:39,790 These segments have particular characteristics, such as here. 8 00:00:40,120 --> 00:00:46,930 We said that this region is a group of students who studied less than 10 hours. 9 00:00:48,530 --> 00:00:54,960 The next region is a group of students who studied more than 10 hours but scored less than sixty 10 00:00:54,960 --> 00:00:56,870 five marks in the midterms. 11 00:00:57,940 --> 00:00:58,540 And so on. 12 00:00:59,800 --> 00:01:07,240 Secondly, when we have these regions, we make a prediction for the response variable, which is usually 13 00:01:07,240 --> 00:01:10,510 the mean of the response value of the observations in that region. 14 00:01:12,240 --> 00:01:18,870 So for the first region, we have five students and we are predicting that if a student studies less 15 00:01:18,870 --> 00:01:26,000 than 10 hours, that student will score the average of the scores of these five students, which is thirty 16 00:01:26,000 --> 00:01:26,820 nine marks. 17 00:01:28,520 --> 00:01:34,560 Similarly, if a student belongs to the second region, that is, the student studied more than 10 hours 18 00:01:34,790 --> 00:01:41,120 but the midterm score is less than 65 marks, then that student will be scoring 70 marks. 19 00:01:42,380 --> 00:01:46,610 But the main question is, how do we decide these regions? 
20 00:01:48,180 --> 00:01:50,820 Which variable should we pick for the first split? 21 00:01:51,480 --> 00:01:52,530 And at what value? 22 00:01:54,100 --> 00:02:02,110 How and why did we decide that first this hours variable will be used, and that too with a splitting value 23 00:02:02,110 --> 00:02:02,790 of 10 hours? 24 00:02:03,310 --> 00:02:04,570 And why not 14 hours? 25 00:02:06,990 --> 00:02:14,550 The answer is, we will pick such a variable and such a splitting value that we get the minimum sum of squared 26 00:02:14,700 --> 00:02:14,930 error. 27 00:02:16,950 --> 00:02:18,340 This sum of squared error 28 00:02:18,580 --> 00:02:27,220 term is given by this formula, which is the actual value of the response variable in the observation minus 29 00:02:27,700 --> 00:02:31,100 the predicted value of the response in that region. 30 00:02:32,400 --> 00:02:40,690 Then we square that term and then we add all such terms. The sigma symbol here means 31 00:02:40,840 --> 00:02:44,380 that we are adding; this is the summation of all such terms. 32 00:02:44,860 --> 00:02:51,160 So for all the observations, we are going to find the difference from that predicted value. 33 00:02:52,460 --> 00:02:57,500 And we are going to square it and we are going to add all those values for all the regions. 34 00:02:59,740 --> 00:03:03,010 And our variable and splitting value will be chosen 35 00:03:03,030 --> 00:03:06,170 such that the value of this term is minimum. 36 00:03:08,050 --> 00:03:15,040 If you know or remember from linear regression, this is very similar to the ordinary least squares method. 37 00:03:16,540 --> 00:03:19,480 Let us understand what this means for decision trees. 38 00:03:21,340 --> 00:03:25,810 Let us consider only the first cut for now, which is this hours cut. 39 00:03:27,810 --> 00:03:29,100 So we have two regions. 40 00:03:30,030 --> 00:03:33,380 This is region one, which is for less than 10 hours. 
41 00:03:33,630 --> 00:03:36,270 And this is region two, which is for more than 10 hours. 42 00:03:39,750 --> 00:03:40,590 In region one, 43 00:03:41,100 --> 00:03:42,600 we have five values. 44 00:03:43,200 --> 00:03:45,030 That is 50 percent of the population. 45 00:03:45,660 --> 00:03:46,840 These first five, 46 00:03:47,590 --> 00:03:51,360 where the hours value is less than 10, belong to the first region. 47 00:03:52,580 --> 00:03:56,360 And for these five values, we have a predicted value of thirty nine, 48 00:03:57,800 --> 00:03:59,690 which is the mean score of this population. 49 00:04:01,110 --> 00:04:03,870 For region two, we have the other five values. 50 00:04:04,740 --> 00:04:10,320 These observations belong to region two, and the predicted value for them is the average value, which 51 00:04:10,320 --> 00:04:11,130 is 75. 52 00:04:13,620 --> 00:04:17,100 So as per the formula, we find the difference of 53 00:04:18,120 --> 00:04:23,550 the first value and this mean score of 39. We square it, 54 00:04:23,950 --> 00:04:25,120 and this is our first term. 55 00:04:26,430 --> 00:04:28,810 Then we do this for the second observation. 56 00:04:29,130 --> 00:04:31,290 We find the difference of 38 and 39. 57 00:04:31,500 --> 00:04:32,170 We square it. 58 00:04:32,310 --> 00:04:33,720 And this is our second term. 59 00:04:35,460 --> 00:04:43,290 Then we find the difference of 40 and 39, square it, that is our third term, and so on. When we have all these 60 00:04:43,290 --> 00:04:50,100 error terms for all the regions, we add all those error terms to get the value of RSS, 61 00:04:50,240 --> 00:04:50,730 that is, 62 00:04:50,880 --> 00:04:52,320 residual sum of squares. 63 00:04:54,890 --> 00:05:04,220 Now, instead of a value of ten for hours, if we had a splitting value of 15, we'll have these seven terms. 
64 00:05:05,060 --> 00:05:11,600 These seven observations are in region one and these three are in region two. The average of these seven 65 00:05:11,600 --> 00:05:13,040 will be taken as the mean score for region one. 66 00:05:13,280 --> 00:05:19,270 And the average of these three will be taken as the mean score for region two. We will do this exercise again 67 00:05:19,850 --> 00:05:21,470 and find out the RSS value. 68 00:05:22,490 --> 00:05:25,940 And we will choose that RSS value which is lower out of these two. 69 00:05:27,830 --> 00:05:31,990 So basically, the split is based on the value of RSS. 70 00:05:33,380 --> 00:05:38,960 We try all possible variables and all possible splitting values of those variables. 71 00:05:39,680 --> 00:05:43,220 We find out the RSS and we choose the RSS which is minimum. 72 00:05:45,120 --> 00:05:51,110 So for this scenario, it turns out that hours is the first variable that we will choose, 73 00:05:51,720 --> 00:05:55,290 and we have a splitting value of 10 hours. 74 00:05:56,940 --> 00:05:59,850 For this combination of variable and splitting value, 75 00:06:00,170 --> 00:06:02,190 we get the minimum value of RSS. 76 00:06:07,000 --> 00:06:12,760 Now, ideally, we have to do this at all possible values of hours, and then we also have to 77 00:06:12,760 --> 00:06:14,620 consider the second variable, 78 00:06:15,910 --> 00:06:16,920 which is midterm score. 79 00:06:17,770 --> 00:06:24,250 So it turns out that when we have a lot of training observations and a lot of variables, it becomes 80 00:06:24,250 --> 00:06:30,580 computationally infeasible to consider every possible partition and all such possible regions. 81 00:06:31,750 --> 00:06:36,700 That is why we take a top-down approach known as recursive binary splitting. 
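To make the RSS comparison above concrete, here is a minimal Python sketch. The ten student records are hypothetical, made up only so that the two regions at the 10-hour cut have means of 39 and 75 as in the lecture; the names `students`, `rss_for_split` and `best_split` are illustrative assumptions, not part of any library. For one cut, the sketch sums, over both regions, the squared differences between each score and its region's mean, and then tries every variable and every candidate cut.

```python
# Hypothetical data: 10 students (hours studied, midterm marks, final score).
students = [
    {"hours": 4, "midterm": 40, "score": 37},
    {"hours": 5, "midterm": 45, "score": 38},
    {"hours": 6, "midterm": 50, "score": 40},
    {"hours": 7, "midterm": 42, "score": 39},
    {"hours": 8, "midterm": 55, "score": 41},
    {"hours": 12, "midterm": 55, "score": 68},
    {"hours": 14, "midterm": 70, "score": 80},
    {"hours": 16, "midterm": 58, "score": 70},
    {"hours": 18, "midterm": 75, "score": 85},
    {"hours": 20, "midterm": 60, "score": 72},
]


def rss_for_split(rows, feature, threshold, target="score"):
    """RSS when rows are cut into feature < threshold and feature >= threshold."""
    left = [r[target] for r in rows if r[feature] < threshold]
    right = [r[target] for r in rows if r[feature] >= threshold]
    total = 0.0
    for region in (left, right):
        if region:
            region_mean = sum(region) / len(region)  # prediction for the region
            total += sum((y - region_mean) ** 2 for y in region)
    return total


def best_split(rows, features, target="score"):
    """Try every feature and every midpoint between distinct values."""
    best = None
    for feature in features:
        values = sorted({r[feature] for r in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            rss = rss_for_split(rows, feature, threshold, target)
            if best is None or rss < best[2]:
                best = (feature, threshold, rss)
    return best


print(rss_for_split(students, "hours", 10))  # RSS for the 10-hour cut
print(rss_for_split(students, "hours", 15))  # RSS for the 15-hour cut
print(best_split(students, ["hours", "midterm"]))
```

On this made-up data the 10-hour cut gives a much lower RSS than the 15-hour cut, which is the comparison the lecture walks through by hand. Using midpoints between observed values as candidate cuts is one common choice, assumed here for simplicity.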
82 00:06:38,630 --> 00:06:46,220 The approach that we take is top-down because we start at the top of the tree, that is, all the observations 83 00:06:46,370 --> 00:06:51,500 belong to a single region in the beginning, and then we start making these splits. 84 00:06:53,740 --> 00:06:58,890 Now, each split separates the predictor space into two parts. 85 00:07:00,670 --> 00:07:02,680 This is why it is called binary splitting. 86 00:07:04,840 --> 00:07:12,040 It is greedy because at each step of the tree-building process, the best split at that particular step 87 00:07:12,640 --> 00:07:13,390 is considered. 88 00:07:13,840 --> 00:07:19,870 We do not look ahead or consider picking a split that will lead to a better tree later on. 89 00:07:20,560 --> 00:07:27,250 We consider only the current split and choose that split which is giving us the minimum RSS. 90 00:07:30,490 --> 00:07:33,640 So I'll summarize the whole tree-building process. 91 00:07:35,010 --> 00:07:39,160 When our program is trying to build a decision tree, this is what is happening in the background. 92 00:07:40,780 --> 00:07:52,090 It considers all the predictors X1, X2, up to Xp one by one. Then all possible values of cut points for each 93 00:07:52,090 --> 00:07:55,720 variable are considered to divide the space into two regions. 94 00:07:57,640 --> 00:07:59,740 It calculates the squared error, 95 00:08:01,190 --> 00:08:08,720 the squared error terms, for all such possibilities and chooses the one with the least value of the sum 96 00:08:08,720 --> 00:08:09,660 of squared error. 97 00:08:12,860 --> 00:08:20,390 It continues to make splits like this, such that the resulting tree has the lowest RSS, until the 98 00:08:20,410 --> 00:08:22,220 stopping criterion is met. 99 00:08:23,220 --> 00:08:27,180 So in our example problem, we were predicting students' scores. 100 00:08:27,720 --> 00:08:29,430 The program considered 101 00:08:29,640 --> 00:08:30,920 all the variables first. 
102 00:08:31,320 --> 00:08:33,830 Initially, we had 10 students' data. 103 00:08:34,590 --> 00:08:36,450 The average was coming out to be 57. 104 00:08:37,970 --> 00:08:47,450 It considered the variables hours and midterm, and all possible values of these hours and midterm variables. 105 00:08:47,630 --> 00:08:51,870 So it considered five, six, seven, eight, nine, 10, 106 00:08:51,950 --> 00:08:53,780 all such possible values of hours, 107 00:08:54,980 --> 00:08:59,110 and all such possible splitting values of midterm score. 108 00:08:59,840 --> 00:09:07,370 It chose this particular variable, which is hours, and this splitting value, because at this step this was 109 00:09:07,370 --> 00:09:11,120 giving the minimum RSS. Once this split was made, 110 00:09:11,810 --> 00:09:19,850 it went to each of the child nodes. First it went to the left node, but it did not split it further because some 111 00:09:19,910 --> 00:09:21,850 stopping criterion was met here. 112 00:09:24,130 --> 00:09:28,000 Then it went to the right node. The stopping criterion was not met here. 113 00:09:28,690 --> 00:09:30,490 It again tried all the variables, 114 00:09:30,640 --> 00:09:32,290 that is, hours and midterm. 115 00:09:33,590 --> 00:09:38,390 It again tested all the possible splitting values for the hours and midterm variables. 116 00:09:40,300 --> 00:09:40,560 It got 117 00:09:41,000 --> 00:09:47,260 the RSS value for all such possibilities and chose the RSS which was minimum, which in 118 00:09:47,260 --> 00:09:49,770 this case came out to be the midterm variable, 119 00:09:50,290 --> 00:09:51,630 less than 65. 120 00:09:54,720 --> 00:09:57,120 All this happens in the background of the software. 121 00:09:58,220 --> 00:10:04,760 You just need to give the data: which is the one variable that you want to predict, which are the 122 00:10:04,760 --> 00:10:05,930 predictor variables, 123 00:10:06,620 --> 00:10:09,830 and what is the stopping criterion for that decision tree. 
124 00:10:11,190 --> 00:10:16,740 Once you specify all these values, the tree can run and you get an output like this.
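The whole tree-building process summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that any particular software runs: the ten student records are hypothetical, the function names are made up for this sketch, and the stopping criterion (stop when a node's own RSS is at or below `min_node_rss`, or when it has fewer than `min_samples` rows) is just one possible rule, since the lecture does not say which criterion was used.

```python
# Hypothetical data: (hours studied, midterm marks, final score) for 10 students.
RAW = [(4, 40, 37), (5, 45, 38), (6, 50, 40), (7, 42, 39), (8, 55, 41),
       (12, 55, 68), (14, 70, 80), (16, 58, 70), (18, 75, 85), (20, 60, 72)]
students = [{"hours": h, "midterm": m, "score": s} for h, m, s in RAW]


def mean(values):
    return sum(values) / len(values)


def rss(values):
    """Sum of squared differences from the mean (0 for an empty region)."""
    if not values:
        return 0.0
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)


def best_split(rows, features, target):
    """Greedy step: lowest-RSS (feature, threshold) pair for this node only."""
    best = None
    for feature in features:
        values = sorted({r[feature] for r in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for r in rows if r[feature] < threshold]
            right = [r for r in rows if r[feature] >= threshold]
            total = rss([r[target] for r in left]) + rss([r[target] for r in right])
            if best is None or total < best[0]:
                best = (total, feature, threshold, left, right)
    return best


def build_tree(rows, features, target, min_node_rss=50.0, min_samples=2):
    """Recursive binary splitting with an assumed, illustrative stopping rule."""
    scores = [r[target] for r in rows]
    leaf = {"prediction": mean(scores), "n": len(rows)}
    if len(rows) < min_samples or rss(scores) <= min_node_rss:
        return leaf  # stopping criterion met: keep this region as it is
    split = best_split(rows, features, target)
    if split is None:
        return leaf  # no distinct values left to split on
    _, feature, threshold, left, right = split
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, features, target, min_node_rss, min_samples),
        "right": build_tree(right, features, target, min_node_rss, min_samples),
    }


def predict(tree, row):
    """Walk down the tree until a leaf and return that region's mean."""
    while "feature" in tree:
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree["prediction"]


tree = build_tree(students, ["hours", "midterm"], "score")
print(tree)
print(predict(tree, {"hours": 15, "midterm": 60}))
```

With this made-up data and stopping rule, the sketch first cuts on hours, leaves the left node alone, and then cuts the right node on midterm, which mirrors the sequence described in the lecture. A different stopping rule or different data would give a different tree.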