1 00:00:00,200 --> 00:00:01,033 All right, my friends. 2 00:00:01,033 --> 00:00:01,866 Are you ready? 3 00:00:01,866 --> 00:00:05,800 Are you ready to build that dendrogram and use it to find the 4 00:00:05,833 --> 00:00:07,433 optimal number of clusters? 5 00:00:07,433 --> 00:00:08,166 Here we go. 6 00:00:08,166 --> 00:00:10,400 Let's implement the solution together. 7 00:00:10,400 --> 00:00:13,066 So let's start by creating a new code cell. 8 00:00:13,066 --> 00:00:15,300 And now what will be the first step? 9 00:00:15,300 --> 00:00:18,633 Well, as usual, you know, we want to implement this efficiently. 10 00:00:18,800 --> 00:00:20,633 So we're going to use a function. 11 00:00:20,633 --> 00:00:26,466 But this time, exceptionally, this function won't be imported from scikit-learn. 12 00:00:26,466 --> 00:00:31,100 It will actually be imported from another very popular library in data science. 13 00:00:31,100 --> 00:00:35,300 I would say, you know, it is in the top three most popular libraries. 14 00:00:35,300 --> 00:00:39,033 I put scikit-learn first, of course, and TensorFlow for deep learning. 15 00:00:39,333 --> 00:00:42,366 But that other library is SciPy. 16 00:00:42,400 --> 00:00:45,800 SciPy contains a lot of great tools when building 17 00:00:45,800 --> 00:00:49,000 machine learning models. And, well, for hierarchical clustering, 18 00:00:49,000 --> 00:00:52,100 it indeed contains a function which is called dendrogram 19 00:00:52,266 --> 00:00:54,866 and which will return the dendrogram itself. 20 00:00:54,866 --> 00:00:56,800 You know, the plot of the dendrogram. 21 00:00:56,800 --> 00:00:57,633 So let's do this. 22 00:00:57,633 --> 00:01:03,066 Let's directly import that, you know, module first that contains this function. 23 00:01:03,300 --> 00:01:07,266 And as we said, this module is taken first from SciPy, 24 00:01:07,733 --> 00:01:10,533 then from the module cluster 25 00:01:10,533 --> 00:01:14,500 and then from the submodule hierarchy.
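As a quick check of the import path just described (SciPy, then cluster, then hierarchy), here is a minimal sketch. The variable names has_dendrogram and has_linkage are made up for this illustration; the only thing taken from the lesson is the dotted module path.

```python
# Import the submodule by its full dotted path: scipy -> cluster -> hierarchy.
import scipy.cluster.hierarchy

# Confirm that the two functions used for dendrograms live in this submodule.
has_dendrogram = callable(scipy.cluster.hierarchy.dendrogram)
has_linkage = callable(scipy.cluster.hierarchy.linkage)
print(has_dendrogram, has_linkage)
```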
26 00:01:14,833 --> 00:01:17,433 Right. Google Colab guesses it perfectly. 27 00:01:17,433 --> 00:01:21,233 The other way to write this is of course to say from SciPy import 28 00:01:21,466 --> 00:01:23,166 cluster dot hierarchy. 29 00:01:23,166 --> 00:01:24,766 So that's just another way of writing it. 30 00:01:24,766 --> 00:01:27,600 And then we're going to add of course a shortcut to this. 31 00:01:27,600 --> 00:01:30,000 Otherwise we would need to call all of this again. 32 00:01:30,000 --> 00:01:33,600 And the shortcut will be as sch, right? 33 00:01:33,600 --> 00:01:36,566 For SciPy cluster hierarchy. Okay. 34 00:01:36,566 --> 00:01:40,400 So that's the module which contains the function we want to use 35 00:01:40,400 --> 00:01:42,833 in order to build our dendrogram. 36 00:01:42,833 --> 00:01:44,333 So now next step. 37 00:01:44,333 --> 00:01:47,833 Well, the next step is to use that function which we can now access 38 00:01:48,000 --> 00:01:51,033 from that hierarchy module which we just imported. 39 00:01:51,600 --> 00:01:55,300 And since this function returns directly the dendrogram itself, 40 00:01:55,466 --> 00:01:59,100 well, we are going to create a new variable here which we're going to call 41 00:01:59,100 --> 00:02:02,100 dendrogram, as simple as that. 42 00:02:02,200 --> 00:02:05,633 And this dendrogram variable will be the output 43 00:02:05,700 --> 00:02:09,766 of this dendrogram function which we're about to use 44 00:02:09,766 --> 00:02:14,666 from the hierarchy submodule of the cluster module from the SciPy library. 45 00:02:14,833 --> 00:02:16,700 All right. So let's do this. 46 00:02:16,700 --> 00:02:20,733 Since this function belongs to all of this here in the hierarchy module, 47 00:02:20,733 --> 00:02:24,866 well, we have to call first the shortcut leading to that module. 48 00:02:25,900 --> 00:02:26,933 And from which 49 00:02:26,933 --> 00:02:31,466 now we can call this dendrogram function. 50 00:02:31,466 --> 00:02:33,900 Perfect.
Thank you so much Google Colab. 51 00:02:33,900 --> 00:02:37,800 And now indeed in the parentheses we have to input some arguments. 52 00:02:38,033 --> 00:02:41,066 And now you can't really guess what the arguments will be. 53 00:02:41,066 --> 00:02:44,533 So I'm just going to write it and then I will explain what this means. 54 00:02:44,533 --> 00:02:47,633 So first we actually have to call sch 55 00:02:47,633 --> 00:02:52,400 again, you know, the hierarchy module from the cluster module from the SciPy library. 56 00:02:52,400 --> 00:02:55,066 So from which this time 57 00:02:55,066 --> 00:02:59,400 we're going to call another function which is the linkage function. 58 00:02:59,633 --> 00:03:02,733 And this linkage function will take as input two arguments. 59 00:03:03,000 --> 00:03:08,833 First, well, your matrix of features inside which you want to identify the clusters. 60 00:03:08,833 --> 00:03:10,733 And that's of course X. 61 00:03:10,733 --> 00:03:15,133 And then the second argument is the clustering technique. 62 00:03:15,333 --> 00:03:19,000 And in hierarchical clustering the most recommended method 63 00:03:19,000 --> 00:03:21,933 and the one that brings the most relevant results 64 00:03:21,933 --> 00:03:26,700 and the most relevant clusters is the method of minimum variance, 65 00:03:26,833 --> 00:03:31,566 which is a technique that will result in having clusters inside 66 00:03:31,566 --> 00:03:33,566 which, you know, the observation points 67 00:03:33,566 --> 00:03:34,733 don't vary too much. 68 00:03:34,733 --> 00:03:37,600 You know, have among all of them a low variance. 69 00:03:37,600 --> 00:03:38,633 And that's what it means. 70 00:03:38,633 --> 00:03:43,400 You know, the method of minimum variance consists of minimizing the variance 71 00:03:43,400 --> 00:03:47,500 in each of the clusters resulting from hierarchical clustering. 72 00:03:47,800 --> 00:03:50,133 And so this is really the method that I recommend.
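To make the minimum-variance idea above concrete, here is a small sketch that is only an illustration, not SciPy's actual implementation. It assumes three made-up 2D clusters (a, b, c) and two hypothetical helper functions. For each candidate merge, it measures how much the within-cluster sum of squared deviations (the variance scaled by the number of points) would increase. A minimum-variance approach prefers the merge with the smallest increase.

```python
import numpy as np

def within_cluster_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

def merge_cost(p, q):
    """Increase in within-cluster sum of squares caused by merging p and q."""
    merged = np.vstack([p, q])
    return within_cluster_ss(merged) - within_cluster_ss(p) - within_cluster_ss(q)

# Three small hypothetical clusters (made-up values, not the lesson's dataset).
a = np.array([[0.0, 0.0], [0.0, 2.0]])
b = np.array([[1.0, 1.0], [3.0, 1.0]])
c = np.array([[10.0, 10.0], [12.0, 10.0]])

cost_ab = merge_cost(a, b)
cost_ac = merge_cost(a, c)
cost_bc = merge_cost(b, c)

# The cheapest merge is the one a minimum-variance criterion would pick first.
print(cost_ab, cost_ac, cost_bc)
```

Here a and b sit close together, so merging them adds far less spread than merging either of them with c.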
73 00:03:50,133 --> 00:03:51,566 And speaking of this method, 74 00:03:51,566 --> 00:03:54,666 that's exactly the next argument of this linkage function 75 00:03:54,666 --> 00:03:55,966 we have to input here. 76 00:03:55,966 --> 00:03:58,733 And the name of that parameter is method. 77 00:03:58,733 --> 00:04:03,600 And the name of that minimum variance method is not called minimum variance, 78 00:04:03,600 --> 00:04:05,733 but ward. Ward. 79 00:04:05,733 --> 00:04:07,800 You can actually check this on Wikipedia. 80 00:04:07,800 --> 00:04:09,933 There is a whole page on the Ward method. 81 00:04:09,933 --> 00:04:11,133 And you will see that indeed it 82 00:04:11,133 --> 00:04:14,733 consists of minimizing the variance inside your clusters. 83 00:04:15,266 --> 00:04:15,633 All right. 84 00:04:15,633 --> 00:04:20,933 And that's it for the whole dendrogram function. Here it only expects one argument, 85 00:04:21,133 --> 00:04:23,933 which is basically the method you choose 86 00:04:23,933 --> 00:04:27,066 for your clustering that you link to your matrix 87 00:04:27,066 --> 00:04:30,033 of features X in which you want to identify the clusters. 88 00:04:30,033 --> 00:04:33,133 So that's all you need to input here in this dendrogram. 89 00:04:33,366 --> 00:04:34,300 And there you go. 90 00:04:34,300 --> 00:04:37,333 This will already return the dendrogram itself. 91 00:04:37,333 --> 00:04:39,133 You know, the plot of the dendrogram. 92 00:04:39,133 --> 00:04:41,466 But as usual we want to make it nice. 93 00:04:41,466 --> 00:04:44,966 So we're just going to add a title, an x label and a y label. 94 00:04:45,100 --> 00:04:46,266 And then we will show it. 95 00:04:46,266 --> 00:04:47,400 And now I will teach you 96 00:04:47,400 --> 00:04:51,866 how to read the dendrogram to indeed find that optimal number of clusters.
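The steps walked through above can be sketched in a single cell as follows. This is a minimal sketch under stated assumptions: the lesson's dataset is not shown in this part, so X is a synthetic stand-in, and the title and axis label strings are placeholders, not the ones used in the video. The intermediate variable linkage_matrix is also introduced only for readability.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
# Equivalent import form: from scipy.cluster import hierarchy as sch

# Hypothetical stand-in for the matrix of features X (two synthetic blobs).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(20, 2)),
])

# linkage builds the merge hierarchy; method="ward" is the minimum-variance method.
linkage_matrix = sch.linkage(X, method="ward")

# dendrogram draws the tree on the current matplotlib axes
# and returns a dictionary describing the plotted tree.
dendrogram = sch.dendrogram(linkage_matrix)

# Placeholder title and axis labels (assumed, not taken from the lesson).
plt.title("Dendrogram")
plt.xlabel("Observations")
plt.ylabel("Distance")
plt.show()
```

The nested one-line form from the lesson, sch.dendrogram(sch.linkage(X, method="ward")), does the same thing without the intermediate variable.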