1 00:00:00,600 --> 00:00:03,300 Hello and welcome back to the course on Machine Learning. 2 00:00:03,300 --> 00:00:07,333 In the previous tutorial we talked about hierarchical clustering, 3 00:00:07,700 --> 00:00:10,633 the intuition behind it, and how it works. 4 00:00:10,633 --> 00:00:12,233 But at the same time, we didn't quite understand 5 00:00:12,233 --> 00:00:16,500 what the whole purpose of it was and what the benefit of it was. 6 00:00:16,500 --> 00:00:18,900 Yes, we went from a huge number of 7 00:00:18,900 --> 00:00:22,800 clusters, where every single point or data element was considered a cluster, 8 00:00:22,800 --> 00:00:27,033 to one big cluster, but as a result, now we have one huge cluster. 9 00:00:27,033 --> 00:00:28,833 What's the point of all of it? 10 00:00:28,833 --> 00:00:30,633 How do we get to the result? 11 00:00:30,633 --> 00:00:32,700 We want the actual clustering. 12 00:00:32,700 --> 00:00:36,366 So like in K-means, for instance, we would have 2 or 3 clusters. 13 00:00:36,366 --> 00:00:38,100 How do we get to that 14 00:00:38,100 --> 00:00:39,600 right number of clusters? 15 00:00:39,600 --> 00:00:42,200 So this is where dendrograms come in. 16 00:00:42,200 --> 00:00:43,900 And they will help us understand everything. 17 00:00:43,900 --> 00:00:45,800 So let's get straight into it. 18 00:00:45,800 --> 00:00:49,733 So here I've got a chart on the left which contains six points. 19 00:00:50,366 --> 00:00:53,100 And on the right we've got another chart. We're going to use this chart 20 00:00:53,100 --> 00:00:54,400 to create a dendrogram. 21 00:00:54,400 --> 00:00:56,233 Now, I know it might sound a bit 22 00:00:56,233 --> 00:00:59,600 confusing at first, especially because we haven't talked about dendrograms, 23 00:00:59,833 --> 00:01:02,666 but through creating one we will learn what they are. 
24 00:01:02,666 --> 00:01:06,133 So first off, just to make things a bit 25 00:01:06,400 --> 00:01:09,900 more legible, I'm going to add the points at the bottom 26 00:01:10,166 --> 00:01:13,166 so that they're a bit bigger so we can see them better. 27 00:01:13,200 --> 00:01:15,233 And so there they are. 28 00:01:15,233 --> 00:01:18,400 The points are just listed at the bottom. On the vertical axis 29 00:01:18,400 --> 00:01:19,700 we've got Euclidean distances. 30 00:01:19,700 --> 00:01:22,200 And it'll all make sense just now. So we're going to now 31 00:01:22,200 --> 00:01:26,200 go through the algorithm and slowly create those clusters. 32 00:01:26,200 --> 00:01:29,633 So to start off with, every single point is an individual cluster. 33 00:01:30,100 --> 00:01:30,333 Right. 34 00:01:30,333 --> 00:01:32,666 So every single one of these points is an individual cluster. 35 00:01:32,666 --> 00:01:35,966 Next, what we're going to do is find the two closest points, 36 00:01:35,966 --> 00:01:36,900 which are these two, 37 00:01:36,900 --> 00:01:38,566 and put them into one cluster. 38 00:01:38,566 --> 00:01:41,566 So that's our step two in our algorithm. 39 00:01:41,600 --> 00:01:42,200 So there we go. 40 00:01:42,200 --> 00:01:44,166 Those are the two closest points. 41 00:01:44,166 --> 00:01:47,166 And now we're putting them into one cluster. 42 00:01:47,333 --> 00:01:51,366 Now what we want to do on this diagram here, on the dendrogram, is we want to 43 00:01:51,366 --> 00:01:55,266 somehow signify that these were indeed the two closest points. 44 00:01:55,266 --> 00:02:00,300 Because the dendrogram is kind of like the memory of the algorithm. 45 00:02:00,333 --> 00:02:03,333 It is going to remember every single step that we perform. 46 00:02:03,600 --> 00:02:06,000 So there they are, those two points P2 and P3. 
47 00:02:06,000 --> 00:02:09,366 How do we signify that we've just connected them and that they were 48 00:02:09,566 --> 00:02:10,100 the closest? 49 00:02:10,100 --> 00:02:13,366 Well, to connect them we would use a horizontal line. 50 00:02:13,633 --> 00:02:15,333 But then where would we put it? 51 00:02:15,333 --> 00:02:17,900 Would we put it at the very bottom? Would we put it a bit higher? 52 00:02:17,900 --> 00:02:19,833 What's going to determine the distance? 53 00:02:19,833 --> 00:02:22,300 How high are we going to place this line? 54 00:02:22,300 --> 00:02:26,833 So where this line is placed, this height, actually has a meaning. 55 00:02:26,833 --> 00:02:29,833 This height is the Euclidean distance between them. 56 00:02:29,966 --> 00:02:35,466 And it also represents the computed dissimilarity between the two points, 57 00:02:35,466 --> 00:02:39,466 so the two clusters. And what that means is, look at how far apart two points are. 58 00:02:39,466 --> 00:02:42,266 So for instance P2 is that far away from P3. 59 00:02:42,266 --> 00:02:44,400 And this could be a variable. 60 00:02:44,400 --> 00:02:47,266 For instance, it could be the age of a person. 61 00:02:47,266 --> 00:02:47,533 Right. 62 00:02:47,533 --> 00:02:52,233 And this variable could be, for instance, the salary of a person. 63 00:02:52,433 --> 00:02:52,800 Right. 64 00:02:52,800 --> 00:02:56,566 Or this variable could be how long a person has been at the company. 65 00:02:56,766 --> 00:03:00,533 And this variable could be the salary of the same person. 66 00:03:00,533 --> 00:03:01,466 So something like that. 67 00:03:01,466 --> 00:03:07,333 So basically we can see that P2 and P3 have that distance between them, 68 00:03:07,633 --> 00:03:10,633 whereas P2 and P4 have a greater distance between them. 
69 00:03:10,633 --> 00:03:15,066 And that means that these two points, P2 and P3, have a certain dissimilarity, 70 00:03:15,066 --> 00:03:17,433 which is measured by the distance between them. 71 00:03:17,433 --> 00:03:21,000 So the distance represents the dissimilarity between 72 00:03:21,000 --> 00:03:24,800 the two points. And P2 and P4 also have a dissimilarity, 73 00:03:24,800 --> 00:03:27,100 and it's greater because you can see that the distance is greater. 74 00:03:27,100 --> 00:03:31,433 So, let's say this was age and this was salary. These two points, 75 00:03:31,766 --> 00:03:35,433 even though they're not identical, are less dissimilar 76 00:03:35,433 --> 00:03:38,466 in terms of age and salary than P2 and P4. 77 00:03:38,466 --> 00:03:40,566 And again, these variables are just arbitrary. 78 00:03:40,566 --> 00:03:43,200 I'm just picking arbitrary variables. They could be anything else. 79 00:03:43,200 --> 00:03:45,833 And this data set might not be employees. 80 00:03:45,833 --> 00:03:47,333 It could be machines, 81 00:03:47,333 --> 00:03:51,900 it could be certain observations from nature, pretty much anything. 82 00:03:52,433 --> 00:03:57,033 The point here is that the further away two points are, the more dissimilar 83 00:03:57,033 --> 00:03:58,800 they are, and that is being measured 84 00:03:58,800 --> 00:04:02,400 or captured in our dendrogram by the height of this bar, 85 00:04:02,400 --> 00:04:03,700 how high we're setting it. 86 00:04:03,700 --> 00:04:06,766 And then the bar itself just shows us that we connected P2 and P3. 87 00:04:07,200 --> 00:04:07,466 All right. 88 00:04:07,466 --> 00:04:09,900 So that's our first step in the dendrogram. 89 00:04:09,900 --> 00:04:11,600 Next we're going to move on. 90 00:04:11,600 --> 00:04:15,433 And we're going to proceed to the next step in our algorithm. 91 00:04:15,666 --> 00:04:17,800 We're going to perform step three. 
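To make the idea of distance as dissimilarity concrete, here is a minimal Python sketch. The (age, salary) coordinates are hypothetical values chosen only for illustration, not the points from the chart, and `dissimilarity` is a helper name introduced here.

```python
import math

# Hypothetical (age in years, salary in thousands) values; assumptions for illustration.
points = {
    "P2": (30.0, 50.0),
    "P3": (31.0, 52.0),
    "P4": (45.0, 90.0),
}

def dissimilarity(a, b):
    """Euclidean distance between two points, used as the dissimilarity."""
    return math.dist(a, b)

d23 = dissimilarity(points["P2"], points["P3"])
d24 = dissimilarity(points["P2"], points["P4"])

# The smaller distance is the height at which the dendrogram bar for that pair is drawn.
print(f"P2-P3: {d23:.2f}, P2-P4: {d24:.2f}")
print("P2 and P3 are less dissimilar:", d23 < d24)
```

In practice, variables on very different scales (such as age and salary) are often rescaled before computing distances, since otherwise the larger-scale variable dominates.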
92 00:04:17,800 --> 00:04:22,900 So we're going to find the next two closest clusters and connect them. 93 00:04:22,900 --> 00:04:27,200 So here, each point out of these four is a cluster, 94 00:04:27,200 --> 00:04:28,533 and then we've got this cluster. 95 00:04:28,533 --> 00:04:31,266 Now we need to find the two closest out of all of them. 96 00:04:31,266 --> 00:04:36,066 And from what we see, these two are the closest. 97 00:04:36,066 --> 00:04:37,800 So let's outline them. 98 00:04:37,800 --> 00:04:40,100 There we are. And so now they form their own cluster. 99 00:04:40,100 --> 00:04:42,933 Now we want to point that out in the dendrogram as well. 100 00:04:42,933 --> 00:04:45,933 So again we're going to place this horizontal line. 101 00:04:46,200 --> 00:04:48,900 Again, how high do we place it? 102 00:04:48,900 --> 00:04:51,900 Do we place it higher or lower than this line? 103 00:04:51,900 --> 00:04:56,433 Well, we agreed that this vertical axis represents the Euclidean distance. 104 00:04:56,700 --> 00:04:58,500 And Euclidean distance 105 00:04:58,500 --> 00:05:02,766 represents the dissimilarity between two of our observations. 106 00:05:02,766 --> 00:05:07,566 So here we can see that P5 and P6 are actually further apart than P2 and P3. 107 00:05:07,566 --> 00:05:11,400 And that is of course natural, because if 108 00:05:11,400 --> 00:05:15,433 P5 and P6 were closer, then in the previous step 109 00:05:15,433 --> 00:05:18,600 we wouldn't have put P2 and P3 in one cluster. 110 00:05:18,600 --> 00:05:20,700 We would have put P5 and P6 in one cluster. 111 00:05:20,700 --> 00:05:21,200 Remember, 112 00:05:21,200 --> 00:05:24,500 we're always looking for the closest and then we're moving on to the next step. 113 00:05:24,500 --> 00:05:26,600 So P2 and P3 were the closest. 
114 00:05:26,600 --> 00:05:31,033 And that's why this distance is such. P5 and P6 are further apart 115 00:05:31,033 --> 00:05:32,300 from each other than P2 and P3. 116 00:05:32,300 --> 00:05:34,200 So the distance has to be greater. 117 00:05:34,200 --> 00:05:36,566 And that's why we're going to show that on the dendrogram. 118 00:05:36,566 --> 00:05:39,566 You can see that this bar is set higher. 119 00:05:39,866 --> 00:05:40,633 All right. 120 00:05:40,633 --> 00:05:44,400 And the next step is to again repeat step three. 121 00:05:44,400 --> 00:05:49,133 So we're going to look among all of these clusters for which are the closest. 122 00:05:49,600 --> 00:05:50,600 So there we go. 123 00:05:50,600 --> 00:05:53,833 So this one was the closest. I'm going to go back here. 124 00:05:54,100 --> 00:05:57,866 This cluster is closer to this cluster than to any other cluster. 125 00:05:58,033 --> 00:06:01,766 And pretty much out of all the distances between the clusters this is 126 00:06:01,833 --> 00:06:02,466 the lowest. 127 00:06:02,466 --> 00:06:06,366 Again, a lot here is determined by how you measure distances. 128 00:06:06,366 --> 00:06:07,300 We can see that 129 00:06:07,300 --> 00:06:12,900 the distance between P4 and this cluster is quite close to this distance. 130 00:06:13,166 --> 00:06:16,833 But we're going to say that this distance is the lowest. 131 00:06:16,833 --> 00:06:17,200 All right. 132 00:06:17,200 --> 00:06:21,900 So what we do next is we combine these clusters into one cluster. 133 00:06:21,900 --> 00:06:23,100 Let's do that. 134 00:06:23,100 --> 00:06:23,633 There it is. 135 00:06:23,633 --> 00:06:26,400 So now we have one cluster. Now we need to represent that somehow here. 136 00:06:26,400 --> 00:06:30,000 So what we just did is we took this cluster that we have, P2 and P3, 137 00:06:30,200 --> 00:06:32,066 and connected it with P1. 
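The remark that a lot is determined by how you measure distances can be illustrated with a short Python sketch. The tutorial does not say which cluster-to-cluster distance is used at this point, so the two options below (closest pair and farthest pair, commonly called single and complete linkage) are assumptions, as are the function names and coordinates.

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(a, b) for a, b in product(c1, c2))

# Hypothetical clusters: a two-point cluster and a single point.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
print(d_single, d_complete)
```

The same two clusters get different distances under the two rules, so which pair of clusters counts as "closest" can change with the rule chosen.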
138 00:06:32,066 --> 00:06:37,333 So again we're going to draw a line and we're going to draw vertical lines here. 139 00:06:37,333 --> 00:06:38,366 Again. 140 00:06:38,366 --> 00:06:41,066 And once again the distance 141 00:06:41,066 --> 00:06:44,066 here from P1 to the top over here 142 00:06:44,233 --> 00:06:47,766 represents the dissimilarity between that cluster that we had 143 00:06:47,766 --> 00:06:50,766 and the point that we connected it to. 144 00:06:51,266 --> 00:06:53,700 All right. So now 145 00:06:53,700 --> 00:06:55,000 let's find what the next step is. 146 00:06:55,000 --> 00:06:56,733 Next step again is step three. 147 00:06:56,733 --> 00:07:00,766 And we're going to look, out of the one, two, three clusters that we have, 148 00:07:00,966 --> 00:07:01,866 which are the closest. 149 00:07:01,866 --> 00:07:04,866 Well, this is P4, and it's the closest to these. 150 00:07:05,133 --> 00:07:05,666 Again. 151 00:07:05,666 --> 00:07:08,033 There it is. We're expanding that cluster. 152 00:07:08,033 --> 00:07:09,866 And now we're going to represent that on the dendrogram. 153 00:07:09,866 --> 00:07:13,766 As you can see, the line is about the same height as this previous line, 154 00:07:13,933 --> 00:07:17,333 because the distance between P1 and this cluster was about the same 155 00:07:17,333 --> 00:07:22,500 as the distance between P4 and the cluster of P5 and P6. Maybe this one was a bit greater. 156 00:07:22,800 --> 00:07:24,033 Sometimes it's hard to tell. 157 00:07:24,033 --> 00:07:25,800 And that's why we have algorithms. 158 00:07:25,800 --> 00:07:27,900 That's why machines do it for us. 159 00:07:27,900 --> 00:07:31,000 One of the reasons. That's what our dendrogram looks like so far. 160 00:07:31,000 --> 00:07:35,766 And our final step is to combine these two remaining clusters, because by default, 161 00:07:35,766 --> 00:07:38,766 they are going to be the closest since there are no other clusters. 
162 00:07:38,866 --> 00:07:42,366 So we're going to combine them and represent that on the dendrogram. 163 00:07:42,600 --> 00:07:47,100 So here the line is very high because the distance, the dissimilarity, 164 00:07:47,100 --> 00:07:49,700 was very high between these two clusters. 165 00:07:49,700 --> 00:07:50,400 And there we go. 166 00:07:50,400 --> 00:07:54,166 So that is how we construct our dendrogram. Slowly, from the bottom up, 167 00:07:54,433 --> 00:07:55,833 it is being constructed. 168 00:07:55,833 --> 00:07:58,966 And at the end we've got that one cluster. 169 00:07:58,966 --> 00:08:01,000 So all of this is one cluster. 170 00:08:01,000 --> 00:08:04,000 And that is what I mean when I say that the dendrogram contains 171 00:08:04,066 --> 00:08:06,900 the memory of the hierarchical clustering algorithm. 172 00:08:06,900 --> 00:08:09,100 So you can, just by looking at the dendrogram, 173 00:08:09,100 --> 00:08:12,233 understand in which order these clusters were formed. 174 00:08:12,900 --> 00:08:14,100 And here I've got an example. 175 00:08:14,100 --> 00:08:18,066 So this is an actual example generated by a computer, 176 00:08:18,066 --> 00:08:22,233 generated by an algorithm, showing us the hierarchical clustering. 177 00:08:22,300 --> 00:08:26,433 So we've got the points here and we've got the dendrogram over here. 178 00:08:26,433 --> 00:08:29,133 So this is what it actually looks like. 179 00:08:29,133 --> 00:08:29,466 All right. 180 00:08:29,466 --> 00:08:32,066 So now we know how dendrograms are constructed. 181 00:08:32,066 --> 00:08:36,966 In the next tutorial we will learn how to use them to enhance, 182 00:08:36,966 --> 00:08:41,533 or actually execute, our hierarchical clustering algorithm. 183 00:08:42,166 --> 00:08:44,266 So there we go. I hope you enjoyed today's tutorial. 184 00:08:44,266 --> 00:08:45,533 I look forward to seeing you next time. 185 00:08:45,533 --> 00:08:47,400 And until then, enjoy machine learning.
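The walkthrough above can be sketched in plain Python as a list of merges, which plays the role of the dendrogram's "memory". This is a minimal sketch under stated assumptions, not the implementation behind the computer-generated example: the six coordinates are hypothetical (chosen so the merge order resembles the one described), the cluster distance is the closest-pair rule, and `cluster_distance` and `agglomerate` are names introduced here.

```python
import math
from itertools import combinations, product

# Hypothetical coordinates for the six points; not taken from the tutorial's chart.
points = {
    "P1": (1.0, 4.0),
    "P2": (2.0, 2.0),
    "P3": (3.0, 2.0),
    "P4": (6.0, 4.5),
    "P5": (8.0, 6.0),
    "P6": (8.0, 7.5),
}

def cluster_distance(c1, c2, pts):
    """Closest-pair (single linkage) distance between two clusters; an assumption."""
    return min(math.dist(pts[a], pts[b]) for a, b in product(c1, c2))

def agglomerate(pts):
    """Merge the two closest clusters until one remains; record every merge."""
    clusters = [(name,) for name in pts]  # start: every point is its own cluster
    merges = []                           # (cluster, cluster, bar height) per step
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], pts),
        )
        height = cluster_distance(c1, c2, pts)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2, height))
    return merges

merges = agglomerate(points)
for left, right, height in merges:
    print(f"{'+'.join(left)} joins {'+'.join(right)} at height {height:.2f}")
```

Each recorded height is where the corresponding horizontal bar would sit on the vertical axis, so reading the list from first to last retraces the dendrogram from the bottom up.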