1
00:00:00,900 --> 00:00:03,100
Hello and welcome back to the course on Machine Learning.

2
00:00:03,100 --> 00:00:06,600
We've already talked about hierarchical clustering and how the algorithm works.

3
00:00:06,666 --> 00:00:09,666
We also talked about dendrograms and how they're constructed.

4
00:00:09,700 --> 00:00:13,166
Today we're going to put the two together and learn how to get

5
00:00:13,166 --> 00:00:16,833
the maximum value out of our hierarchical clustering algorithms.

6
00:00:17,100 --> 00:00:19,066
So let's get straight into it.

7
00:00:19,066 --> 00:00:21,866
All right, so here we've got an example.

8
00:00:21,866 --> 00:00:22,733
The example that we looked

9
00:00:22,733 --> 00:00:26,900
at previously, where on the left we've got the points on a scatter plot.

10
00:00:26,900 --> 00:00:31,133
And then here on the right we've got the dendrogram, which contains the memory

11
00:00:31,200 --> 00:00:35,700
of how the clusters were formed during the hierarchical clustering algorithm.

12
00:00:35,700 --> 00:00:37,166
So here we can tell right away that,

13
00:00:37,166 --> 00:00:39,800
first of all, P2 and P3 were combined into a cluster,

14
00:00:39,800 --> 00:00:41,166
because their

15
00:00:41,166 --> 00:00:44,100
height is the lowest; the height of this bar is the lowest.

16
00:00:44,100 --> 00:00:46,800
Then we look at the next lowest bar, which is this one.

17
00:00:46,800 --> 00:00:50,700
So P5 and P6 are the least dissimilar out of the remaining points.

18
00:00:51,000 --> 00:00:53,000
And then these are pretty much the same height.

19
00:00:53,000 --> 00:00:56,766
But first we combine these into one cluster.

20
00:00:56,766 --> 00:00:59,700
So P1 was added to the cluster of P2 and P3.

21
00:00:59,700 --> 00:01:02,400
Then P4 was added to the cluster of P5 and P6.

22
00:01:02,400 --> 00:01:05,400
And then at the end all of the points were combined into one cluster.
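Editor's sketch: the merge order described above can be replayed in a few lines of plain Python. The coordinates are hypothetical (the video shows the points but gives no numbers), and complete linkage on Euclidean distance is an assumption, since the lecture does not name a linkage method. Treat it as an illustration of how the "memory" in a dendrogram is built, not as the course's implementation.

```python
from itertools import combinations
from math import dist

# Hypothetical coordinates for the six points (not taken from the video).
points = {
    "P1": (1.0, 2.2), "P2": (1.0, 1.0), "P3": (1.4, 1.0),
    "P4": (3.8, 4.0), "P5": (5.0, 4.0), "P6": (5.0, 4.6),
}

def complete_linkage(a, b):
    """Dissimilarity of two clusters: the largest point-to-point distance."""
    return max(dist(points[p], points[q]) for p in a for q in b)

def agglomerate(names):
    """Repeatedly merge the two least dissimilar clusters; return the history."""
    clusters = [frozenset([n]) for n in names]
    history = []  # one (height, left members, right members) entry per merge
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: complete_linkage(*pair))
        history.append((complete_linkage(a, b), sorted(a), sorted(b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history

history = agglomerate(points)
for height, left, right in history:
    # Each line corresponds to one horizontal bar of the dendrogram.
    print(f"height {height:.2f}: {left} + {right}")
```

With these coordinates the merges come out in the order the lecture describes: P2 with P3, then P5 with P6, then P1 joins the first pair, then P4 joins the second pair, and finally everything is combined.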
23
00:01:05,400 --> 00:01:07,466
So that's what the dendrogram is telling us.

24
00:01:07,466 --> 00:01:11,333
As you can see right away, it's giving us a lot of additional information

25
00:01:11,333 --> 00:01:12,800
on top of the scatter plot.

26
00:01:12,800 --> 00:01:17,400
And it contains that memory of the hierarchical clustering algorithm.

27
00:01:17,900 --> 00:01:21,233
So how do we use this dendrogram to understand

28
00:01:21,333 --> 00:01:25,900
how to best execute, or get the most value out of, hierarchical clustering?

29
00:01:26,266 --> 00:01:27,600
So let's have a look.

30
00:01:27,600 --> 00:01:29,500
What we need to do with the dendrogram,

31
00:01:29,500 --> 00:01:33,000
or what we can do, is look at the horizontal levels

32
00:01:33,000 --> 00:01:37,666
and set thresholds. So we can set height thresholds, or actually distance

33
00:01:37,666 --> 00:01:40,766
thresholds, which are also called dissimilarity thresholds,

34
00:01:40,933 --> 00:01:44,566
because this vertical axis measures the Euclidean distance between points,

35
00:01:44,566 --> 00:01:49,266
which also represents the dissimilarity between them, whether points or clusters.

36
00:01:49,500 --> 00:01:53,433
So what we can do is set a threshold for dissimilarity.

37
00:01:53,433 --> 00:01:57,233
And we can say that we don't want the dissimilarity to be greater

38
00:01:57,233 --> 00:01:58,366
than this level.

39
00:01:58,366 --> 00:02:01,666
So again, it doesn't matter what the absolute value is; what matters

40
00:02:01,666 --> 00:02:04,866
is the relative values and how it looks on this image.

41
00:02:04,866 --> 00:02:07,666
So we're setting the dissimilarity threshold.

42
00:02:07,666 --> 00:02:11,400
We're saying that we don't want any clusters

43
00:02:11,566 --> 00:02:13,500
that are above this threshold.

44
00:02:13,500 --> 00:02:18,900
So we don't want the dissimilarity within a cluster to be above this threshold.
45
00:02:18,900 --> 00:02:21,600
So what that will do is give us two clusters.

46
00:02:21,600 --> 00:02:22,400
And let's have a look at them.

47
00:02:22,400 --> 00:02:25,133
There's our first cluster and there's our second cluster.

48
00:02:25,133 --> 00:02:26,700
And that makes sense.

49
00:02:26,700 --> 00:02:30,900
So what it's telling us is that within each one of these clusters,

50
00:02:30,900 --> 00:02:34,100
the dissimilarity is always less than our threshold.

51
00:02:34,100 --> 00:02:36,000
So let's say we've got some values here.

52
00:02:36,000 --> 00:02:38,900
Let's say this is 1.5 and this is 2.0.

53
00:02:38,900 --> 00:02:42,600
So let's say we want to set the threshold at 1.7.

54
00:02:43,000 --> 00:02:47,966
And what this is doing is not allowing any clusters

55
00:02:47,966 --> 00:02:52,733
that would have a dissimilarity greater than 1.7 within them.

56
00:02:53,133 --> 00:02:57,600
And as you can see from the dendrogram, we can tell that for everything below

57
00:02:57,600 --> 00:03:00,600
that level, this cluster and this cluster,

58
00:03:00,633 --> 00:03:03,633
they don't reach a dissimilarity of 1.7,

59
00:03:03,900 --> 00:03:08,566
because the dissimilarity is represented by these vertical lines.

60
00:03:09,266 --> 00:03:11,600
And that's how the concept of thresholding works.

61
00:03:11,600 --> 00:03:16,600
And the interesting part about a dendrogram is that you can quickly tell how many clusters

62
00:03:16,600 --> 00:03:20,500
you will have at a certain threshold by just looking at how many

63
00:03:20,500 --> 00:03:23,666
vertical lines this horizontal threshold actually crosses.

64
00:03:23,866 --> 00:03:27,100
So here you can see it crosses one, two vertical lines.

65
00:03:27,100 --> 00:03:28,766
That means we will have two clusters.

66
00:03:28,766 --> 00:03:33,066
There will be this cluster of points P1, P2, P3 and this cluster of P4, P5, P6.

67
00:03:33,566 --> 00:03:35,833
All right.
So let's have a look at another example.

68
00:03:35,833 --> 00:03:40,466
Let's have a look at an example where we put the threshold at this level.

69
00:03:40,466 --> 00:03:45,266
So somewhere just below where we combined... As you remember,

70
00:03:45,266 --> 00:03:50,166
we had P5 and P6 in one cluster, P2 and P3 in one cluster, P4 by itself, and P1 by itself.

71
00:03:50,400 --> 00:03:54,100
And then we combined P1 with this cluster and P4 with this cluster.

72
00:03:54,300 --> 00:03:57,900
So let's say we're setting the threshold at just before

73
00:03:57,900 --> 00:04:01,033
that level of dissimilarity, which allowed us to combine

74
00:04:01,033 --> 00:04:04,200
P1 with this cluster and P4 with this cluster.

75
00:04:04,433 --> 00:04:07,833
So what that will do is give us a certain number of clusters.

76
00:04:07,833 --> 00:04:11,466
So can you tell, just by looking at the dendrogram, how many clusters we'll have?

77
00:04:11,600 --> 00:04:12,533
Exactly, correct.

78
00:04:12,533 --> 00:04:13,800
We're going to have four clusters,

79
00:04:13,800 --> 00:04:17,900
because it crosses four vertical lines: one, two, three, four.

80
00:04:17,933 --> 00:04:18,200
Right.

81
00:04:18,200 --> 00:04:20,233
So we're going to have a cluster with P1,

82
00:04:20,233 --> 00:04:24,433
a cluster with P2 and P3, a cluster with P4, and a cluster with P5 and P6.

83
00:04:24,600 --> 00:04:26,766
Let's have a look: four clusters.

84
00:04:26,766 --> 00:04:27,533
And there they are.

85
00:04:27,533 --> 00:04:31,133
So that is what we're going to get if we set the

86
00:04:31,133 --> 00:04:34,700
dissimilarity or distance threshold at that level.

87
00:04:35,233 --> 00:04:36,500
Let's try another one.

88
00:04:36,500 --> 00:04:42,000
Let's say we want to set our dissimilarity threshold very low, at 0.3,

89
00:04:42,000 --> 00:04:46,633
meaning that we don't want clusters that have any points

90
00:04:46,633 --> 00:04:50,933
within them that have a dissimilarity greater than this threshold.
91
00:04:50,933 --> 00:04:53,633
So we're not going to allow any clusters like that.

92
00:04:53,633 --> 00:04:56,900
And the interesting part here is that we're actually setting the threshold

93
00:04:57,066 --> 00:05:01,333
below the very first cluster that we created over here, P2 and P3.

94
00:05:01,333 --> 00:05:04,833
So we're not even going to allow P2 and P3 to be combined in one cluster.

95
00:05:04,833 --> 00:05:05,700
We're going to say

96
00:05:05,700 --> 00:05:09,866
that dissimilarity level, that distance between them, is too great, too high.

97
00:05:09,866 --> 00:05:12,900
We don't think, based

98
00:05:12,900 --> 00:05:16,433
on our business knowledge or based on our other internal research

99
00:05:16,433 --> 00:05:20,400
or external research, that any points with

100
00:05:20,466 --> 00:05:24,966
a dissimilarity greater than this level should be combined into a cluster.

101
00:05:25,200 --> 00:05:27,200
It just doesn't make sense

102
00:05:27,200 --> 00:05:31,066
from a financial perspective, from a business perspective,

103
00:05:31,066 --> 00:05:34,800
from the perspective of our knowledge about what this dataset is about.

104
00:05:35,166 --> 00:05:39,200
And what that will do is create six clusters, because we cross six lines:

105
00:05:39,233 --> 00:05:43,000
one, two, three, four, five, six. And there they are: every single point

106
00:05:43,000 --> 00:05:44,566
will be in its own cluster.

107
00:05:44,566 --> 00:05:47,400
As you can see, we've got six clusters.

108
00:05:47,400 --> 00:05:52,366
So that is how a dendrogram works, or how you can get value out of a dendrogram.

109
00:05:52,366 --> 00:05:56,600
And you can set this threshold at different levels to understand

110
00:05:56,933 --> 00:05:58,066
how many clusters you'll get.

111
00:05:58,066 --> 00:06:00,066
Just by looking at the dendrogram, you can tell right away.
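Editor's sketch: the three threshold examples (1.7, just below the P1 and P4 merges, and 0.3) can be checked by replaying only the merges whose bar sits at or below the threshold. The merge heights here are hypothetical values chosen to match the shape of the dendrogram described in the lecture; the video does not state them. The function name `cut` is likewise just an illustrative choice.

```python
# Hypothetical merge history read off a dendrogram: (height, left, right).
# The heights are illustrative, not values from the video.
merges = [
    (0.4, ["P2"], ["P3"]),
    (0.6, ["P5"], ["P6"]),
    (1.3, ["P1"], ["P2", "P3"]),
    (1.4, ["P4"], ["P5", "P6"]),
    (3.0, ["P1", "P2", "P3"], ["P4", "P5", "P6"]),
]
names = ["P1", "P2", "P3", "P4", "P5", "P6"]

def cut(threshold):
    """Replay only the merges at or below the threshold; return the clusters."""
    cluster_of = {n: frozenset([n]) for n in names}
    for height, left, right in merges:
        if height > threshold:
            continue  # this bar lies above the threshold line, so skip it
        merged = cluster_of[left[0]] | cluster_of[right[0]]
        for n in merged:
            cluster_of[n] = merged
    return sorted(sorted(c) for c in set(cluster_of.values()))

for t in (1.7, 1.0, 0.3):
    clusters = cut(t)
    print(f"threshold {t}: {len(clusters)} clusters -> {clusters}")
```

The number of clusters at each threshold equals the number of vertical lines the threshold crosses, which is the counting rule from the lecture: two clusters at 1.7, four at 1.0, and six at 0.3.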
112
00:06:00,066 --> 00:06:03,800
And you can then find the optimal level for the threshold,

113
00:06:03,800 --> 00:06:08,133
or the optimal number of clusters that suits your project best.

114
00:06:08,966 --> 00:06:11,433
But how do you find the actual optimal number,

115
00:06:11,433 --> 00:06:14,900
not just a number of clusters that you think is optimal?

116
00:06:14,900 --> 00:06:16,400
Is the dendrogram giving us

117
00:06:16,400 --> 00:06:19,400
any ideas about the optimal number of clusters?

118
00:06:19,500 --> 00:06:22,500
Well, what can we tell from the dendrogram

119
00:06:22,500 --> 00:06:27,233
that might be a good guide for us to select the optimal number of clusters?

120
00:06:27,600 --> 00:06:30,600
Well, there's a great giveaway that the dendrogram contains,

121
00:06:30,833 --> 00:06:35,833
and that is the vertical distance, because it is measuring dissimilarity.

122
00:06:35,833 --> 00:06:39,000
So one of the standard approaches is just to look for

123
00:06:39,000 --> 00:06:42,500
the longest vertical distance that you can find on the dendrogram.

124
00:06:42,500 --> 00:06:47,766
So basically, any vertical line that does not cross any horizontal lines.

125
00:06:48,066 --> 00:06:51,166
So for instance, this line can be considered.

126
00:06:51,200 --> 00:06:52,633
This line can be considered.

127
00:06:52,633 --> 00:06:55,500
This line cannot be considered for that purpose

128
00:06:55,500 --> 00:06:58,800
because it crosses hypothetical horizontal lines.

129
00:06:58,800 --> 00:07:02,033
So what you need to do is take every horizontal line you have

130
00:07:02,033 --> 00:07:05,033
and just imagine it extends all the way across the dendrogram.

131
00:07:05,133 --> 00:07:07,100
Every single horizontal line you have.
132
00:07:07,100 --> 00:07:13,033
And now find the longest line among your existing vertical lines

133
00:07:13,033 --> 00:07:16,866
that doesn't cross any of these extended horizontal lines.

134
00:07:16,866 --> 00:07:19,833
So for instance, even this line cannot be considered

135
00:07:19,833 --> 00:07:24,066
for that purpose because it would hypothetically cross this horizontal line

136
00:07:24,066 --> 00:07:27,700
that we have coming from this red line between P5 and P6.

137
00:07:28,100 --> 00:07:32,100
Again, this line cannot be considered because it's crossing this line.

138
00:07:32,100 --> 00:07:35,600
So you would need to look at this line, for example, or this line.

139
00:07:35,733 --> 00:07:39,433
Or if you wanted to use this line, you would need to use only a bit of it,

140
00:07:39,433 --> 00:07:41,033
that part or this part.

141
00:07:41,033 --> 00:07:44,500
So you can only use parts of lines that are between horizontal lines.

142
00:07:45,066 --> 00:07:47,566
So out of all of the lines that you have here,

143
00:07:47,566 --> 00:07:52,333
which is the longest that doesn't cross any extended horizontal lines?

144
00:07:52,700 --> 00:07:53,500
Well, that's correct.

145
00:07:53,500 --> 00:07:56,500
This one over here is the longest one.

146
00:07:56,500 --> 00:07:59,500
Or basically, in our example, the green and the red

147
00:07:59,500 --> 00:08:01,200
were about the same height.

148
00:08:01,200 --> 00:08:04,200
So this one or this one are the longest ones.

149
00:08:04,566 --> 00:08:06,933
And so this is the largest distance,

150
00:08:06,933 --> 00:08:10,800
and therefore the best or the recommended approach

151
00:08:11,100 --> 00:08:13,666
(again, it's not a set-in-stone approach,

152
00:08:13,666 --> 00:08:16,000
it's just one of the things that you could do)

153
00:08:16,000 --> 00:08:20,766
is to take a threshold that will cross this largest distance.
154
00:08:20,766 --> 00:08:23,100
So cross that largest distance with a threshold,

155
00:08:23,100 --> 00:08:25,366
and then you use that threshold to calculate

156
00:08:25,366 --> 00:08:27,900
the optimal number of clusters and actually find them.

157
00:08:27,900 --> 00:08:32,200
So once we've crossed this largest distance with our threshold,

158
00:08:32,600 --> 00:08:34,066
it doesn't matter where you set it. You can set it here,

159
00:08:34,066 --> 00:08:37,000
you can set it low or you can set it high, as long as it crosses this line.

160
00:08:37,000 --> 00:08:40,266
Then the two clusters are this one and this one.

161
00:08:40,300 --> 00:08:44,266
As you can see, that is considered to be one of the approaches.

162
00:08:44,800 --> 00:08:48,233
This approach is telling us that the optimal number of clusters is two,

163
00:08:48,233 --> 00:08:49,400
and these are them.

164
00:08:49,400 --> 00:08:51,600
And in this case it makes sense.

165
00:08:51,600 --> 00:08:56,833
You can see that indeed these points look as if they're closer together,

166
00:08:57,100 --> 00:08:59,500
and these points look as if they're closer together.

167
00:08:59,500 --> 00:09:04,166
Getting any clusters in between them, or even breaking them up

168
00:09:04,166 --> 00:09:08,000
into more clusters, wouldn't make as much sense as this does.

169
00:09:08,566 --> 00:09:09,633
And so there you go.

170
00:09:09,633 --> 00:09:12,033
That's one of the approaches that you can use.

171
00:09:12,033 --> 00:09:15,033
You can still look at this whole problem using

172
00:09:15,033 --> 00:09:18,266
a similar approach to K-means, where you use the elbow method.

173
00:09:18,266 --> 00:09:19,766
So you could use something like that.

174
00:09:19,766 --> 00:09:21,700
But in hierarchical clustering

175
00:09:21,700 --> 00:09:25,000
we're going to focus on this approach with the largest distance.
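Editor's sketch: the largest-distance rule can be written down as "find the widest band between two consecutive horizontal bars, then put the threshold anywhere inside it". The merge heights below are hypothetical, and the function name `largest_gap_cut` is made up for this illustration. As a simplification, the band between the baseline and the first merge is ignored, since cutting there would leave every point in its own cluster.

```python
# Hypothetical merge heights (the horizontal bars of a dendrogram).
heights = [0.4, 0.6, 1.3, 1.4, 3.0]

def largest_gap_cut(heights):
    """Return (threshold, n_clusters) for the widest band between bars."""
    levels = sorted(heights)
    # Extending every bar across the plot splits the vertical axis into bands;
    # a vertical segment that crosses no extended bar lies inside one band.
    gaps = [(upper - lower, lower, upper)
            for lower, upper in zip(levels, levels[1:])]
    _, lower, upper = max(gaps)
    threshold = (lower + upper) / 2  # any value inside the band works
    n_points = len(levels) + 1       # n points produce n - 1 merges
    n_clusters = n_points - sum(h <= threshold for h in levels)
    return threshold, n_clusters

threshold, n_clusters = largest_gap_cut(heights)
print(f"cut at {threshold:.2f} -> {n_clusters} clusters")
```

For these heights the widest band lies between the last two merges, so the rule suggests two clusters, in line with the example in the lecture. As the speaker says, this is one heuristic among several, not a fixed rule.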
176
00:09:25,533 --> 00:09:28,533
And now let's quickly have a knowledge test.

177
00:09:28,833 --> 00:09:32,833
So I have two charts here which are hidden. On the left

178
00:09:32,833 --> 00:09:35,733
we've got the scatter plot; on the right we've got the dendrogram.

179
00:09:35,733 --> 00:09:37,833
I'm going to show you only the dendrogram.

180
00:09:37,833 --> 00:09:40,900
And I would like you to try to understand, or try to assess

181
00:09:40,900 --> 00:09:43,900
very quickly, what's going on on the scatter plot.

182
00:09:44,266 --> 00:09:48,333
So for instance, we'd like to know, even without seeing the scatter plot

183
00:09:48,333 --> 00:09:49,066
or the dataset

184
00:09:49,066 --> 00:09:52,433
at the moment, what is the optimal number of clusters

185
00:09:52,433 --> 00:09:54,466
in this dataset just by looking at the dendrogram.

186
00:09:54,466 --> 00:09:55,800
Can you identify that?

187
00:09:55,800 --> 00:10:00,600
So if you like, you can pause the video and just look at these vertical

188
00:10:00,600 --> 00:10:03,700
and horizontal lines and try to find out, based on the method that we discussed,

189
00:10:03,700 --> 00:10:06,700
what would be the optimal number of clusters.

190
00:10:07,066 --> 00:10:12,333
So in three, two, one, I'm going to now reveal how I would solve this challenge.

191
00:10:12,333 --> 00:10:14,000
Well, what I would do is I would look for

192
00:10:14,000 --> 00:10:18,233
the longest vertical line that doesn't cross any extended horizontal lines.

193
00:10:18,233 --> 00:10:21,000
So if you extend that, extend that, extend that,

194
00:10:21,000 --> 00:10:23,300
you can see that it's probably this line over here.

195
00:10:23,300 --> 00:10:26,300
And so that's the largest distance.

196
00:10:26,400 --> 00:10:30,500
That means we need to cross it with a horizontal line, with our threshold.
197
00:10:30,900 --> 00:10:33,300
And that will give us the number of clusters,

198
00:10:33,300 --> 00:10:37,400
which is three clusters, because it crosses three lines here: one, two, three.

199
00:10:37,700 --> 00:10:42,000
And if we look at the chart, as you can see, indeed we do have three clusters.

200
00:10:42,000 --> 00:10:43,900
And it does look like that

201
00:10:43,900 --> 00:10:47,666
is the optimal number of clusters for this business problem.

202
00:10:48,100 --> 00:10:51,066
So hopefully you enjoyed this tutorial.

203
00:10:51,066 --> 00:10:53,800
We walked through all of this manually so that you have a better intuitive

204
00:10:53,800 --> 00:10:58,366
understanding of how the hierarchical clustering algorithm and dendrograms work.

205
00:10:58,633 --> 00:11:03,600
And next, Hadelin will show you around in R and Python, and together

206
00:11:03,600 --> 00:11:09,166
you will create some amazing analysis around hierarchical clustering.

207
00:11:09,600 --> 00:11:12,633
And together with him you will solve a business problem

208
00:11:12,900 --> 00:11:15,466
using the hierarchical clustering algorithm.

209
00:11:15,466 --> 00:11:17,266
So you've got some fun tutorials ahead of you,

210
00:11:17,266 --> 00:11:19,133
and I look forward to seeing you next time.

211
00:11:19,133 --> 00:11:21,100
Until then, enjoy machine learning.