1 00:00:00,100 --> 00:00:00,533 Hello my 2 00:00:00,533 --> 00:00:04,366 friends, and welcome to this new practical activity where we're 3 00:00:04,366 --> 00:00:08,400 going to build together, this time, the hierarchical clustering algorithm. 4 00:00:08,900 --> 00:00:12,266 And now we're all going to go into Part 4, Clustering, 5 00:00:12,433 --> 00:00:17,033 to this time tackle the hierarchical clustering model. 6 00:00:17,033 --> 00:00:21,266 And we're going to start with Python. As usual, we're going to work 7 00:00:21,266 --> 00:00:26,600 on the same data set, Mall_Customers, right, where each row corresponds to a customer. 8 00:00:26,600 --> 00:00:29,733 And for each of these customers, well, the mall gathered some info 9 00:00:29,733 --> 00:00:34,800 like the customer ID, the gender, the age, the annual income, and the spending score. 10 00:00:35,200 --> 00:00:38,400 And we're actually going to work only with these two features, 11 00:00:38,400 --> 00:00:42,733 the annual income and the spending score, to identify these clusters, 12 00:00:42,733 --> 00:00:45,733 but this time with hierarchical clustering. 13 00:00:45,733 --> 00:00:46,133 All right. 14 00:00:46,133 --> 00:00:51,533 So exactly the same, and therefore let's proceed directly to the implementation, 15 00:00:51,700 --> 00:00:55,800 which you can open with either Google Colaboratory, as we're about to do, 16 00:00:56,100 --> 00:01:00,566 or, if you don't like Google Colaboratory, you can open it with Jupyter Notebook. 17 00:01:00,566 --> 00:01:03,900 And for Google Colaboratory lovers, well, follow me here 18 00:01:04,100 --> 00:01:08,033 to open this implementation in Google Colab. 19 00:01:08,566 --> 00:01:09,366 And there we go. 20 00:01:09,366 --> 00:01:12,566 So that's the hierarchical clustering implementation. 21 00:01:12,900 --> 00:01:17,400 As you can see, it follows the exact same structure as K-means. 22 00:01:17,500 --> 00:01:19,400 We first import the libraries. 
23 00:01:19,400 --> 00:01:24,533 We then import the data set exactly the same way as we did for K-means. 24 00:01:24,533 --> 00:01:29,000 You know, we select these two columns of index three and four, which correspond, 25 00:01:29,000 --> 00:01:32,200 of course, to the annual income and the spending score. 26 00:01:32,500 --> 00:01:35,766 So exactly the same; we won't re-implement this together. 27 00:01:36,166 --> 00:01:39,566 And then this time, instead of using the elbow method, 28 00:01:39,600 --> 00:01:44,400 well, we're going to use the dendrogram to find the optimal number of clusters. 29 00:01:44,533 --> 00:01:47,233 And I will explain not only the implementation, 30 00:01:47,233 --> 00:01:49,200 you know, we will re-implement this together, 31 00:01:49,200 --> 00:01:52,200 but also I will explain how to find that 32 00:01:52,366 --> 00:01:55,333 optimal number of clusters in this graph. 33 00:01:55,333 --> 00:01:58,100 And then we will train the hierarchical 34 00:01:58,100 --> 00:02:02,066 clustering model on the data set using the AgglomerativeClustering class. 35 00:02:02,333 --> 00:02:05,233 And finally we will visualize the clusters 36 00:02:05,233 --> 00:02:08,466 exactly the same way as we did with K-means. 37 00:02:08,633 --> 00:02:11,300 And actually here the code is exactly the same. 38 00:02:11,300 --> 00:02:15,633 The only thing that changes is the name of the dependent variable which we create. 39 00:02:15,633 --> 00:02:16,000 Right? 40 00:02:16,000 --> 00:02:19,533 Because still, with hierarchical clustering, we're going to create 41 00:02:19,766 --> 00:02:21,433 that dependent variable. 42 00:02:21,433 --> 00:02:24,000 But this time, instead of calling it y_kmeans, 43 00:02:24,000 --> 00:02:27,566 as we did for K-means, we're calling it simply y_hc. 44 00:02:27,766 --> 00:02:30,766 And therefore here it's exactly the same code, with only 45 00:02:30,900 --> 00:02:34,233 that different name for that created dependent variable. 
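[Editor's sketch] The two steps named above (a dendrogram, then the AgglomerativeClustering class producing y_hc) could look roughly like the following. This is a minimal sketch, not the course notebook: the data is synthetic, the Ward linkage is an assumption, and five clusters is simply the number the video says the dendrogram will suggest.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the two features (annual income, spending score).
# These values are made up for illustration; they are NOT the course data set.
rng = np.random.default_rng(0)
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 15], [90, 85]])
X = np.vstack([c + rng.normal(scale=3.0, size=(20, 2)) for c in centers])

# Step 1: compute the merge history that a dendrogram is drawn from.
# Ward linkage is an assumption here; no_plot=True skips the actual drawing.
Z = linkage(X, method="ward")
tree = dendrogram(Z, no_plot=True)

# Step 2: train the model on the whole data set and get one label per row.
hc = AgglomerativeClustering(n_clusters=5, linkage="ward")
y_hc = hc.fit_predict(X)
```

Here y_hc plays the same role as y_kmeans did in the K-means notebook, which is why the visualization cell can be reused with only that name changed.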
46 00:02:34,233 --> 00:02:37,266 So we won't re-implement this either; we'll just keep the code, 47 00:02:37,500 --> 00:02:38,600 and therefore there we go. 48 00:02:38,600 --> 00:02:42,700 We are only going to re-implement two cells, which are, one, 49 00:02:42,700 --> 00:02:46,833 to build the dendrogram and figure out that optimal number of clusters, 50 00:02:47,100 --> 00:02:52,266 and two, to build a hierarchical clustering model and train it on the whole data set. 51 00:02:52,800 --> 00:02:53,666 Are you ready? 52 00:02:53,666 --> 00:02:54,600 Let's do this. 53 00:02:54,600 --> 00:02:58,200 And in order to do this, we have to create a copy of this notebook, 54 00:02:58,200 --> 00:03:00,433 because this is in read-only mode. 55 00:03:00,433 --> 00:03:02,633 And therefore we're going to go to File here, 56 00:03:02,633 --> 00:03:06,733 and then click Save a copy in Drive to indeed create 57 00:03:06,933 --> 00:03:09,933 a copy of this notebook. 58 00:03:10,066 --> 00:03:12,700 Perfect. All right, so there we go. 59 00:03:12,700 --> 00:03:13,433 That's our copy. 60 00:03:13,433 --> 00:03:14,600 Now we can modify it. 61 00:03:14,600 --> 00:03:16,866 Now we can re-implement it. 62 00:03:16,866 --> 00:03:19,700 But as we've just said, we won't re-implement everything. 63 00:03:19,700 --> 00:03:22,900 We will just re-implement these two cells here. 64 00:03:22,900 --> 00:03:26,900 First the dendrogram, how to build it and how to read it. 65 00:03:27,133 --> 00:03:28,533 And then of course, well, 66 00:03:28,533 --> 00:03:32,433 how to build the hierarchical clustering model on the data set. 67 00:03:32,433 --> 00:03:36,966 And then we keep the other cells because they're exactly the same as in K-means. 68 00:03:36,966 --> 00:03:38,266 And if you want, let's just, 69 00:03:38,266 --> 00:03:41,333 you know, remove this so that we don't see the final result. 70 00:03:41,633 --> 00:03:43,600 And perfect. Now, all right. 
71 00:03:43,600 --> 00:03:46,866 All you see here is exactly the same as with K-means. 72 00:03:47,033 --> 00:03:48,300 The only thing that will change 73 00:03:48,300 --> 00:03:52,333 are these two cells, which we will re-implement together. 74 00:03:52,900 --> 00:03:53,700 Okay. Perfect. 75 00:03:53,700 --> 00:03:58,100 So first step, let's just execute these two first cells here. 76 00:03:58,100 --> 00:04:01,366 And to do this we need, of course, to upload the data set. 77 00:04:01,633 --> 00:04:03,933 So let's click this folder here. 78 00:04:03,933 --> 00:04:07,800 Now it is connecting to a runtime to enable file browsing, 79 00:04:07,800 --> 00:04:09,633 you know, in your computer, in your machine. 80 00:04:09,633 --> 00:04:13,300 And in a second we should see the upload button. 81 00:04:13,500 --> 00:04:15,600 There we go. Upload. 82 00:04:15,600 --> 00:04:18,800 And, well, I'm already in the K-means folder. 83 00:04:18,800 --> 00:04:21,733 But let me show you again the whole path. 84 00:04:21,733 --> 00:04:25,666 So that's the folder you were given at the beginning of each section, 85 00:04:25,666 --> 00:04:26,566 including this one, 86 00:04:26,566 --> 00:04:29,566 hierarchical clustering, which you could download on your machine. 87 00:04:29,566 --> 00:04:31,200 So I hope you have it right now. 88 00:04:31,200 --> 00:04:34,166 Otherwise you would just need to go back to the previous article. 89 00:04:34,166 --> 00:04:35,800 And now we're all going to go inside. 90 00:04:35,800 --> 00:04:37,700 Then we're going to go to Part 4, 91 00:04:37,700 --> 00:04:38,600 of course, 92 00:04:38,600 --> 00:04:43,500 then Section 25, Hierarchical Clustering, then Python, and then there we go: 93 00:04:43,500 --> 00:04:46,200 Mall_Customers.csv. 94 00:04:46,200 --> 00:04:49,466 This will upload the data set into the notebook. 95 00:04:49,666 --> 00:04:54,000 And so now we can run these two first cells: first, importing the libraries. 
96 00:04:54,266 --> 00:04:59,066 And now that we have pandas, we can import that data set, 97 00:04:59,400 --> 00:05:02,600 which at the same time creates this matrix 98 00:05:02,600 --> 00:05:05,600 of two features containing only, 99 00:05:05,700 --> 00:05:10,366 let's see in the data set, containing only the annual income and the spending score. 100 00:05:10,366 --> 00:05:15,100 In other words, X is just these two columns here with all the rows. Okay. 101 00:05:15,700 --> 00:05:18,633 All right. So, data preprocessing phase done. 102 00:05:18,633 --> 00:05:23,300 Now we can focus on the heart of the hierarchical clustering model, 103 00:05:23,433 --> 00:05:26,833 which is first to build the dendrogram to indeed find 104 00:05:27,000 --> 00:05:28,633 the optimal number of clusters. 105 00:05:28,633 --> 00:05:29,800 And of course 106 00:05:29,800 --> 00:05:33,333 the optimal number of clusters that will result from this dendrogram 107 00:05:33,333 --> 00:05:38,133 will be the same number as the one we found with K-means, meaning five clusters. 108 00:05:38,133 --> 00:05:41,866 But I will explain how to read the dendrogram in order to indeed end up 109 00:05:42,066 --> 00:05:45,066 with an optimal number of five clusters. 110 00:05:45,100 --> 00:05:47,266 All right, so that was the introduction. 111 00:05:47,266 --> 00:05:49,566 And as you know, I like to take it step by step. 112 00:05:49,566 --> 00:05:53,233 So we will implement that next step of building the dendrogram 113 00:05:53,433 --> 00:05:54,833 in the next tutorial. 114 00:05:54,833 --> 00:05:55,833 So get ready. 115 00:05:55,833 --> 00:05:57,900 And until then, enjoy machine learning.
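[Editor's sketch] The data set import described above, which keeps only the columns of index three and four as the feature matrix X, might be written as follows. The rows and column names below are hypothetical stand-ins so the snippet runs without the uploaded file; in the notebook the DataFrame would presumably come from pd.read_csv on the uploaded CSV instead.

```python
import pandas as pd

# Hypothetical rows mimicking the five-column layout described in the video
# (ID, gender, age, annual income, spending score). Names and values are
# illustrative assumptions, not copied from the course file.
dataset = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Male", "Female", "Female"],
    "Age": [19, 21, 20, 23],
    "Annual Income (k$)": [15, 15, 16, 16],
    "Spending Score (1-100)": [39, 81, 6, 77],
})
# In the notebook, this would be something like:
# dataset = pd.read_csv("Mall_Customers.csv")  # file name assumed

# Keep all rows, but only the columns of index 3 and 4.
X = dataset.iloc[:, [3, 4]].values
```

Selecting by position with iloc matches how the video describes the columns ("index three and four"); selecting by column name would work equally well if the names are known.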