1 00:00:00,966 --> 00:00:03,600 Hello and welcome back to the course on Machine Learning. 2 00:00:03,600 --> 00:00:04,300 In today's tutorial, 3 00:00:04,300 --> 00:00:07,600 we're talking about decision trees and the intuition behind them. 4 00:00:08,066 --> 00:00:11,666 All right, so you may have heard the term CART, which stands for classification 5 00:00:11,666 --> 00:00:13,066 and regression trees. 6 00:00:13,066 --> 00:00:17,700 And this is an umbrella term that encompasses two types of decision trees. 7 00:00:17,700 --> 00:00:19,033 And as you've correctly guessed, 8 00:00:19,033 --> 00:00:22,433 they are classification trees and regression trees. 9 00:00:22,900 --> 00:00:25,633 And in this course we're going to talk about both types. 10 00:00:25,633 --> 00:00:30,333 But specifically in this section we're focusing on regression trees. 11 00:00:30,900 --> 00:00:34,300 And I wanted to mention right away that regression trees are a bit 12 00:00:34,300 --> 00:00:36,200 more complex than classification trees. 13 00:00:36,200 --> 00:00:38,100 And that's why this tutorial is going to be a bit longer 14 00:00:38,100 --> 00:00:40,800 and is going to require some additional attention. 15 00:00:40,800 --> 00:00:44,800 But nevertheless, we're still going to break this somewhat complex 16 00:00:44,800 --> 00:00:50,100 topic into very simple, bite-sized elements of information. 17 00:00:50,100 --> 00:00:51,866 So it will all make sense. 18 00:00:51,866 --> 00:00:54,733 And towards the end of it, you'll be quite comfortable with regression trees. 19 00:00:54,733 --> 00:00:56,666 So let's get straight into it. 20 00:00:56,666 --> 00:00:57,166 All right. 21 00:00:57,166 --> 00:01:00,733 So here we've got a scatterplot which represents our data set. 22 00:01:00,733 --> 00:01:02,666 So this is the data set that has been given to us. 
23 00:01:02,666 --> 00:01:06,133 And the interesting thing about the scatterplot is that we've got two 24 00:01:06,133 --> 00:01:08,133 independent variables, x1 and x2. 25 00:01:08,133 --> 00:01:13,300 And what we're predicting is a third variable, a dependent variable, which is y. 26 00:01:13,600 --> 00:01:16,200 And you cannot actually see y on this chart. 27 00:01:16,200 --> 00:01:19,500 And that is because this is simply a two-dimensional chart. 28 00:01:19,533 --> 00:01:22,600 With the two variables plotted, y is the third dimension. 29 00:01:22,600 --> 00:01:25,733 And if you think about it, it's like it's sticking out of your screen. 30 00:01:25,733 --> 00:01:27,333 That's where that dimension is. 31 00:01:27,333 --> 00:01:31,033 And this is just a projection of all the points on the x1, x2 plane. 32 00:01:31,400 --> 00:01:34,766 And so if I added a third dimension, it would look something like that. 33 00:01:35,100 --> 00:01:37,733 But once again, we can't see y right now. 34 00:01:37,733 --> 00:01:39,433 And the interesting thing is 35 00:01:39,433 --> 00:01:43,200 that we don't actually need to see y, because we need to work with this 36 00:01:43,200 --> 00:01:47,066 scatterplot first for a little bit to build our decision tree. 37 00:01:47,200 --> 00:01:49,666 And then once we've built it, we'll return to y. 38 00:01:49,666 --> 00:01:52,666 Now, a quick point I wanted to make here is that 39 00:01:53,066 --> 00:01:57,300 I've seen decision trees explained with just one independent variable. 40 00:01:57,300 --> 00:01:59,433 So x1, or just x, and y. 41 00:01:59,433 --> 00:02:03,933 And in that case, yes, you can just put x1 over here 42 00:02:03,933 --> 00:02:06,533 and then y would go over here and you would have 43 00:02:06,533 --> 00:02:10,400 a bit of a different kind of diagram, and you'd be able to explain it that way. 44 00:02:10,400 --> 00:02:13,966 But at the same time, I think it might not really drive the point home. 
45 00:02:14,200 --> 00:02:17,533 And it can be a bit confusing when it's explained like that, 46 00:02:17,866 --> 00:02:20,600 although sometimes it is done. 47 00:02:20,600 --> 00:02:24,333 Nevertheless, I thought we'd go the full way, we'd do the full monty, 48 00:02:24,333 --> 00:02:28,533 and we'd look at this problem with two independent variables, 49 00:02:28,533 --> 00:02:30,600 because it'll be a more robust explanation. 50 00:02:30,600 --> 00:02:34,333 So it will make it a bit more complex, but it's definitely worth it in the long 51 00:02:34,333 --> 00:02:38,666 run, because that way we'll understand decision tree regression a bit, 52 00:02:38,966 --> 00:02:42,000 or actually I would say quite a bit, better. 53 00:02:42,233 --> 00:02:42,566 All right. 54 00:02:42,566 --> 00:02:44,133 So let's continue. 55 00:02:44,133 --> 00:02:45,700 We've got x1 and x2. 56 00:02:45,700 --> 00:02:47,333 These are the independent variables. 57 00:02:47,333 --> 00:02:48,433 The dependent variable, 58 00:02:48,433 --> 00:02:51,100 we cannot see it. It's the third dimension. 59 00:02:51,100 --> 00:02:54,466 And we're actually going to forget about it for a little while. 60 00:02:54,466 --> 00:02:54,666 Right. 61 00:02:54,666 --> 00:02:57,700 So we're going to just forget about it because we need to work with this 62 00:02:57,800 --> 00:03:01,200 scatterplot to see how our decision tree is going to be created. 63 00:03:01,633 --> 00:03:05,633 So once you run the regression tree, or decision 64 00:03:05,866 --> 00:03:10,500 tree algorithm in the regression sense of it, what will happen is 65 00:03:10,500 --> 00:03:15,433 your scatterplot will be split up into segments. 66 00:03:15,433 --> 00:03:18,900 And let's have a look at how an algorithm could go about doing that. 67 00:03:18,900 --> 00:03:24,066 So an algorithm would create a split over here, for example, at somewhere around 20. 
68 00:03:24,600 --> 00:03:29,000 So it would basically split your diagram or your scatterplot into two parts: 69 00:03:29,000 --> 00:03:30,133 everything that's less than 20 and 70 00:03:30,133 --> 00:03:32,933 everything that's greater than 20 for the x1 variable. 71 00:03:32,933 --> 00:03:34,700 Then another split would happen here. 72 00:03:34,700 --> 00:03:37,700 So all of the elements on this side 73 00:03:37,700 --> 00:03:40,766 would be compared to 170, greater or less. 74 00:03:41,066 --> 00:03:42,900 And then there'd be another split here 75 00:03:42,900 --> 00:03:44,766 and then maybe another split over here. 76 00:03:44,766 --> 00:03:48,566 Now, how and where these splits are conducted 77 00:03:48,933 --> 00:03:51,600 is determined by the algorithm. 78 00:03:51,600 --> 00:03:54,600 And it actually involves 79 00:03:54,600 --> 00:03:58,033 looking at something called information entropy. 80 00:03:58,266 --> 00:04:00,866 It is a mathematical concept. 81 00:04:00,866 --> 00:04:02,866 It is quite complex. 82 00:04:02,866 --> 00:04:06,366 It basically means: when I perform this split, 83 00:04:06,700 --> 00:04:09,700 is this split increasing 84 00:04:09,700 --> 00:04:12,833 the amount of information that we have about our points? 85 00:04:12,833 --> 00:04:18,700 Is it actually adding some value to the way that we want to group our points? 86 00:04:18,966 --> 00:04:22,566 And the algorithm knows when to stop: it stops when there's 87 00:04:22,566 --> 00:04:26,700 a certain minimum for the information that needs to be added, 88 00:04:27,033 --> 00:04:30,066 and once it cannot add 89 00:04:30,066 --> 00:04:33,866 any more information to our setup by splitting. 90 00:04:34,000 --> 00:04:35,966 These segments are called leaves. 91 00:04:35,966 --> 00:04:38,166 So each one of the segments created by these splits is called a leaf. 
92 00:04:38,166 --> 00:04:41,833 By splitting these leaves, once it can't add more information, then 93 00:04:41,833 --> 00:04:46,866 it stops. Or the algorithm could, let's say, stop when you have less than 5%: 94 00:04:47,666 --> 00:04:50,333 if you were to conduct a split and you'd have less than 5% 95 00:04:50,333 --> 00:04:54,066 of your total points in that leaf, then that leaf wouldn't be created. 96 00:04:54,166 --> 00:04:58,200 So there are different variations or different options for that to happen. 97 00:04:58,533 --> 00:05:02,100 But the most important thing is, of course, where the splits are happening. 98 00:05:02,400 --> 00:05:03,700 And if you'd like to learn 99 00:05:03,700 --> 00:05:07,500 more about that, you'd need to study a bit more about information entropy. 100 00:05:07,833 --> 00:05:10,533 We're not going to go into that mathematical depth right now. 101 00:05:10,533 --> 00:05:14,400 For us, it's sufficient to know that the algorithm can handle this, 102 00:05:14,400 --> 00:05:19,533 and that it is finding the optimal splits of our data set into these leaves. 103 00:05:19,533 --> 00:05:22,000 And the final leaves are called terminal leaves. 104 00:05:22,000 --> 00:05:25,700 And then we're going to focus on the practical application 105 00:05:25,700 --> 00:05:29,666 of this algorithm, how and why we're using these 106 00:05:29,666 --> 00:05:33,066 decision trees and how this regression is going to work. 107 00:05:33,566 --> 00:05:35,800 All right. So hopefully we're on the same page. 108 00:05:35,800 --> 00:05:36,433 Let's continue. 109 00:05:36,433 --> 00:05:39,533 So we're going to rewind all of this a little bit. 110 00:05:39,833 --> 00:05:42,900 And we're going to create these splits one by one. 111 00:05:42,900 --> 00:05:46,233 And alongside, we're going to actually start drawing our decision tree. 112 00:05:46,766 --> 00:05:49,333 So there's our diagram, brand new and fresh. 
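The split search and the "less than 5% of the points" stopping rule described above can be sketched in a few lines. The tutorial describes the criterion in terms of information entropy; this sketch instead scores a split by how much it reduces the sum of squared errors of y, a criterion commonly used for regression trees. That swap, the function names `sse` and `best_split`, the `min_leaf_fraction` parameter, and the toy data are all assumptions made for illustration, not the course's implementation.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys, min_leaf_fraction=0.05):
    """Return (threshold, sse_reduction) for the best split on one feature.

    Returns None when no split leaves enough points on both sides.
    """
    n = len(xs)
    # Hypothetical stopping rule: each leaf must hold at least this many points.
    min_leaf = max(1, int(min_leaf_fraction * n))
    parent = sse(ys)
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue  # a leaf would be too small, so this split is not made
        gain = parent - (sse(left) + sse(right))
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


# Toy data (made up): y jumps when x1 crosses 20.
x1 = [5, 10, 15, 25, 30, 35]
y = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(best_split(x1, y))
```

On this toy data the chosen threshold is the midpoint between 15 and 25, which mirrors the first split "at somewhere around 20". A full tree builder would repeat this search over every feature and recurse into each side until no split passes the stopping rule.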
113 00:05:49,333 --> 00:05:51,366 And there goes our first split. 114 00:05:51,366 --> 00:05:54,366 So now we're going to start creating our decision tree. 115 00:05:54,500 --> 00:05:55,800 The split happened at 20. 116 00:05:55,800 --> 00:05:57,633 So let's start drawing. 117 00:05:57,633 --> 00:05:59,666 There is our first decision. 118 00:05:59,666 --> 00:06:02,666 And we have two options, yes and no. 119 00:06:03,100 --> 00:06:03,433 All right. 120 00:06:03,433 --> 00:06:05,266 So let's see what happens next. 121 00:06:05,266 --> 00:06:07,233 Next comes split two. 122 00:06:07,233 --> 00:06:09,066 Split two happens at 170. 123 00:06:09,066 --> 00:06:12,066 And it only happens for the points where x1 is greater than 20. 124 00:06:12,266 --> 00:06:15,833 So that means you would check this condition, x1 is less than 20, 125 00:06:15,866 --> 00:06:18,333 and the answer is no. 126 00:06:18,333 --> 00:06:23,333 And then you check if x2 is less than 170. 127 00:06:23,966 --> 00:06:25,200 Then split three happens 128 00:06:25,200 --> 00:06:28,633 on the other side, and it checks if x2 is less than 200. 129 00:06:29,166 --> 00:06:31,766 Let's add that here, x2 less than 200. 130 00:06:31,766 --> 00:06:34,933 And then split four happens at 40. 131 00:06:35,066 --> 00:06:38,666 And it checks if x1 is greater or less than 40. 132 00:06:38,866 --> 00:06:42,400 And split four only happens for the points that, to split one, 133 00:06:42,400 --> 00:06:45,033 answered no, it's not less than 20. 134 00:06:45,033 --> 00:06:49,933 And to split two they answered yes, it's actually less than 170. 135 00:06:50,400 --> 00:06:52,433 So no, it's not less than 20. 136 00:06:52,433 --> 00:06:53,833 Yes, it's less than 170. 137 00:06:53,833 --> 00:06:56,500 And then this is where split four happens: 138 00:06:56,500 --> 00:06:59,400 x1 is less than 40, yes or no. 
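The four splits just described amount to a short chain of nested yes/no checks. Here is a minimal sketch of that routing; the function name `find_leaf` and the leaf labels are hypothetical, and sending points that sit exactly on a threshold to the "no" branch is an assumption, since the tutorial only speaks of "less than" and "greater than".

```python
def find_leaf(x1, x2):
    """Follow the tutorial's four splits and return a label for the leaf."""
    if x1 < 20:                       # split 1: is x1 less than 20?
        if x2 < 200:                  # split 3: only on the x1 < 20 side
            return "x1<20, x2<200"
        return "x1<20, x2>=200"
    if x2 < 170:                      # split 2: only on the x1 >= 20 side
        if x1 < 40:                   # split 4: only when x2 < 170 as well
            return "20<=x1<40, x2<170"
        return "x1>=40, x2<170"
    return "x1>=20, x2>=170"


# Four splits produce five terminal leaves.
print(find_leaf(30, 50))
```

A point with x1 = 30 and x2 = 50 answers no to split one, yes to split two and yes to split four, so it lands in the leaf bounded by 20 and 40 on x1 and by 170 on x2.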
139 00:06:59,400 --> 00:06:59,700 All right. 140 00:06:59,700 --> 00:07:01,033 So that's our decision tree. 141 00:07:01,033 --> 00:07:02,866 It's done. It's drawn. 142 00:07:02,866 --> 00:07:04,366 And so what happens next? 143 00:07:04,366 --> 00:07:07,366 What do we actually populate into those boxes? 144 00:07:07,700 --> 00:07:11,233 Well, this is where we need to remember our dependent variable, 145 00:07:11,233 --> 00:07:13,166 the third dimension. 146 00:07:13,166 --> 00:07:16,500 And what we need to check here is 147 00:07:16,866 --> 00:07:21,966 how we are going to predict the value of y 148 00:07:22,100 --> 00:07:28,066 for a new observation that gets added to our scatterplot or to our dataset. 149 00:07:28,066 --> 00:07:35,033 So let's say we add an observation which has x1 equal to 30 and x2 equal to 50. 150 00:07:35,300 --> 00:07:39,266 30 would fall somewhere over here and 50 is somewhere over here. 151 00:07:39,266 --> 00:07:40,600 It would fall somewhere over here. 152 00:07:40,600 --> 00:07:44,566 So obviously it falls into this terminal leaf. 153 00:07:44,900 --> 00:07:47,766 And how does that information... 154 00:07:47,766 --> 00:07:49,400 So as you can see, by adding 155 00:07:49,400 --> 00:07:52,933 these splits, we've added information into our system. 156 00:07:53,233 --> 00:07:57,333 So how does that information, that now we know that it falls into this 157 00:07:57,333 --> 00:07:58,400 terminal leaf, 158 00:07:58,400 --> 00:08:01,833 how does that information help us in terms of predicting 159 00:08:01,833 --> 00:08:04,833 the value of y for that new element that we're going to add? 160 00:08:05,100 --> 00:08:08,433 Well, the way it works is actually pretty straightforward. 161 00:08:08,466 --> 00:08:14,466 The way it works is you just take the average of y in each of your terminal leaves. 162 00:08:14,800 --> 00:08:18,200 So you take the average of y for all of these points. 
163 00:08:18,200 --> 00:08:19,533 And that will be the value 164 00:08:19,533 --> 00:08:24,066 that will be assigned to any new point that falls in this terminal leaf. 165 00:08:24,333 --> 00:08:25,666 Same for this terminal leaf. 166 00:08:25,666 --> 00:08:28,533 Same for this terminal leaf, same for this one and the same for this one. 167 00:08:28,533 --> 00:08:29,166 So let's have a look. 168 00:08:29,166 --> 00:08:32,433 Let's say the average for y here is 65.7. 169 00:08:32,433 --> 00:08:36,366 The average for y here is 300.5, 1023 170 00:08:36,366 --> 00:08:39,600 here, -64.1 here, 0.7 here. 171 00:08:39,900 --> 00:08:42,800 So for that point that we just 172 00:08:42,800 --> 00:08:45,800 discussed, with x1 equals 30 and x2 173 00:08:46,066 --> 00:08:50,100 equals 50, the regression tree 174 00:08:50,100 --> 00:08:53,933 algorithm would predict a value of -64.1. 175 00:08:54,400 --> 00:08:56,500 If it were to fall in any other terminal leaf, 176 00:08:56,500 --> 00:08:59,466 then that leaf's value is what it would predict. 177 00:08:59,466 --> 00:09:01,866 So as you can see, it's actually pretty straightforward. 178 00:09:01,866 --> 00:09:03,933 It's very simple. 179 00:09:03,933 --> 00:09:05,700 It's just taking averages. 180 00:09:05,700 --> 00:09:10,200 But you do need to remember that 181 00:09:10,233 --> 00:09:13,500 the whole point of this exercise is to add 182 00:09:13,500 --> 00:09:18,000 more information into our chart, into our system, to better predict 183 00:09:18,000 --> 00:09:21,666 y, because if you think about it, what was our other option? 184 00:09:21,900 --> 00:09:23,533 What is our default option? 185 00:09:23,533 --> 00:09:26,766 The default option without running any machine 186 00:09:26,766 --> 00:09:30,600 learning on this data set is to just take all of the points 187 00:09:30,600 --> 00:09:34,533 and take the average across all of the points, and whatever that is. 
188 00:09:34,666 --> 00:09:37,766 Wherever our new point, the new element of data 189 00:09:37,766 --> 00:09:41,200 that is added to our data set, wherever it falls, we just assign 190 00:09:41,200 --> 00:09:45,300 it that average for all of the points that we had existing 191 00:09:45,300 --> 00:09:45,800 previously. 192 00:09:45,800 --> 00:09:49,800 What we did now is we've split our diagram up into these terminal leaves. 193 00:09:49,800 --> 00:09:53,800 The machine learning algorithm has added information to our entire system. 194 00:09:53,966 --> 00:09:57,766 And so now we can more accurately predict the value, 195 00:09:57,766 --> 00:10:02,100 or assign the value, of y to a new incoming element. 196 00:10:02,400 --> 00:10:05,600 And as you can see now, it's an average not just across all of them. 197 00:10:05,600 --> 00:10:11,333 The average is taken in specific parts or segments of our scatterplot. 198 00:10:11,333 --> 00:10:14,733 And therefore it is, or it's supposed to be, more accurate. 199 00:10:14,766 --> 00:10:17,766 That's the whole point of the regression tree. 200 00:10:17,833 --> 00:10:19,600 And now the last thing we have left to do 201 00:10:19,600 --> 00:10:23,800 is to add the values into our decision tree. 202 00:10:23,800 --> 00:10:26,033 So basically we just add those values in here. 203 00:10:26,033 --> 00:10:29,433 And now whenever we have a new value, 204 00:10:29,433 --> 00:10:34,033 what would happen is the algorithm would go through 205 00:10:34,600 --> 00:10:37,800 these checks, and it would check where it falls and assign the value. 206 00:10:38,266 --> 00:10:39,266 And that's pretty much it. 207 00:10:39,266 --> 00:10:43,900 So the scatterplot is more for visualization, for conceptual purposes. 208 00:10:43,900 --> 00:10:46,300 So you can maybe derive some insights from there. 209 00:10:46,300 --> 00:10:50,100 But the core of the decision tree is actually held here. 
210 00:10:50,300 --> 00:10:53,300 That's why the algorithm is called a regression tree. 211 00:10:53,766 --> 00:10:55,266 I hope you enjoyed today's tutorial. 212 00:10:55,266 --> 00:10:58,333 And, hopefully we did break down this quite complex 213 00:10:58,333 --> 00:11:01,566 topic into some simple and actionable steps, 214 00:11:01,866 --> 00:11:03,933 and I'll look forward to seeing you next time. 215 00:11:03,933 --> 00:11:05,866 Until then, enjoy machine learning.
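To recap the prediction step in code: each terminal leaf stores the average y of the training points that fall into it, and a new point receives the average of its leaf, whereas the default option without a tree would be the single average over all points. The sketch below is illustrative only. The training rows are made up (they are not the tutorial's data), the leaf ids "A" to "E" and the helper names `leaf_of`, `fit_leaf_means` and `predict` are hypothetical, and the handling of points exactly on a threshold is assumed.

```python
from collections import defaultdict
from statistics import mean


def leaf_of(x1, x2):
    """Leaf id from the tutorial's four splits (boundary handling assumed)."""
    if x1 < 20:
        return "A" if x2 < 200 else "B"
    if x2 < 170:
        return "C" if x1 < 40 else "D"
    return "E"


def fit_leaf_means(rows):
    """Average y over the training rows that land in each leaf."""
    groups = defaultdict(list)
    for x1, x2, y in rows:
        groups[leaf_of(x1, x2)].append(y)
    return {leaf: mean(ys) for leaf, ys in groups.items()}


# Made-up training rows (x1, x2, y), at least one per leaf.
rows = [
    (10, 100, 300.0), (15, 150, 301.0),
    (10, 250, 65.0),
    (30, 50, -64.0), (35, 100, -66.0),
    (60, 80, 1.0),
    (50, 220, 1000.0),
]
leaf_means = fit_leaf_means(rows)
global_mean = mean(y for _, _, y in rows)  # the "default option" baseline


def predict(x1, x2):
    """Predict y for a new point as the mean of the leaf it falls into."""
    return leaf_means[leaf_of(x1, x2)]


print("tree prediction for (30, 50):", predict(30, 50))
print("baseline for any point:", global_mean)
```

In this toy data the new point (30, 50) gets the average of the two training points in its own leaf, while the baseline would hand every new point the same overall average regardless of where it falls.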