1 00:00:00,300 --> 00:00:02,966 Hello and welcome to this R tutorial. 2 00:00:02,966 --> 00:00:05,466 In the previous tutorial, we imported our mall dataset 3 00:00:05,466 --> 00:00:08,666 and we prepared our data correctly by taking the two columns 4 00:00:08,666 --> 00:00:11,700 we are interested in: the annual income and the spending score. 5 00:00:11,700 --> 00:00:15,233 So we created this variable X that contains these two columns. 6 00:00:15,700 --> 00:00:18,000 And now things are going to get much more interesting 7 00:00:18,000 --> 00:00:21,533 because in this tutorial we are going to build our dendrogram. 8 00:00:21,833 --> 00:00:25,266 And we will use it to find the optimal number of clusters, 9 00:00:25,566 --> 00:00:29,133 exactly like we did in the K-means section, where, as you remember, 10 00:00:29,566 --> 00:00:33,666 in step two, we used the elbow method chart to find the optimal number of clusters. 11 00:00:33,966 --> 00:00:36,966 Well, here in hierarchical clustering, step two, 12 00:00:37,166 --> 00:00:40,166 we will also look for this optimal number of clusters. 13 00:00:40,166 --> 00:00:42,400 Only this time we're not going to use the elbow method. 14 00:00:42,400 --> 00:00:43,800 We are going to use the dendrogram. 15 00:00:44,966 --> 00:00:46,833 So let's do that right now. 16 00:00:46,833 --> 00:00:48,166 The very cool thing about it 17 00:00:48,166 --> 00:00:51,433 is that we only need one line of code to build this dendrogram. 18 00:00:52,133 --> 00:00:53,100 So let's write it. 19 00:00:53,100 --> 00:00:54,733 Let's write this line of code. 20 00:00:54,733 --> 00:00:57,733 We start by creating our variable dendrogram. 21 00:00:57,900 --> 00:00:59,733 Then equals. 22 00:00:59,733 --> 00:01:02,200 And then we're going to use the hclust function. 23 00:01:02,200 --> 00:01:05,000 So let's type hclust here. 24 00:01:05,000 --> 00:01:07,000 And then let's press F1.
25 00:01:07,000 --> 00:01:10,000 And here we have all the info on this hclust function. 26 00:01:10,800 --> 00:01:12,633 So let's look at the arguments here. 27 00:01:12,633 --> 00:01:14,966 We only need the first two arguments. 28 00:01:14,966 --> 00:01:18,600 The first argument is a dissimilarity structure as produced by dist. 29 00:01:18,933 --> 00:01:20,766 And in our case this parameter is going to be 30 00:01:20,766 --> 00:01:24,833 the distance matrix of our dataset X, which is a matrix 31 00:01:24,833 --> 00:01:28,966 that tells, for each pair of customers, the Euclidean distance between the two. 32 00:01:29,100 --> 00:01:29,933 So that means that 33 00:01:29,933 --> 00:01:33,900 for each pair of customers, we take the two coordinates, annual income 34 00:01:33,900 --> 00:01:37,000 and spending score, and we compute the Euclidean 35 00:01:37,000 --> 00:01:40,000 distance between the two based on these coordinates. 36 00:01:40,200 --> 00:01:40,500 Okay. 37 00:01:40,500 --> 00:01:43,633 So that was just to explain the first parameter of the hclust function. 38 00:01:43,966 --> 00:01:45,000 And so let's input it. 39 00:01:45,000 --> 00:01:48,000 In our code we input dist 40 00:01:48,066 --> 00:01:50,600 and in parentheses X, comma. 41 00:01:50,600 --> 00:01:53,100 And then method equals 'euclidean'. 42 00:01:54,300 --> 00:01:56,300 So that specifies that we want to compute 43 00:01:56,300 --> 00:02:00,533 the Euclidean distance matrix for our data X. Okay. 44 00:02:00,533 --> 00:02:03,766 So that's the first parameter, this distance matrix. 45 00:02:04,133 --> 00:02:07,133 And now the second parameter is the method. 46 00:02:07,233 --> 00:02:12,000 So this method is simply the method used to find the clusters. 47 00:02:12,466 --> 00:02:15,466 And like in Python, we're going to choose the most common method, 48 00:02:15,466 --> 00:02:16,900 which is the Ward method. 49 00:02:16,900 --> 00:02:19,433 Here it's called ward.D.
50 00:02:19,433 --> 00:02:20,866 And it's actually a method 51 00:02:20,866 --> 00:02:24,433 that is trying to minimize the variance within each cluster. 52 00:02:25,133 --> 00:02:28,433 Kind of like what we did in K-means when we were trying to minimize the 53 00:02:28,433 --> 00:02:30,200 within-cluster sum of squares. 54 00:02:30,200 --> 00:02:31,800 Well, here it's based on the same idea. 55 00:02:31,800 --> 00:02:34,866 But instead of trying to minimize the within-cluster sum of squares, 56 00:02:35,133 --> 00:02:39,000 we are trying to minimize the within-cluster variance to find our clusters. 57 00:02:40,200 --> 00:02:43,800 So here we write method equals 'ward.D'. 58 00:02:44,766 --> 00:02:48,000 So that is the end of the line to build this dendrogram. 59 00:02:48,300 --> 00:02:50,133 And now we just need to plot it. 60 00:02:50,133 --> 00:02:53,133 So just below we are going to write plot. 61 00:02:53,700 --> 00:02:56,700 Then in parentheses, dendrogram. 62 00:02:57,000 --> 00:03:00,233 Then let's give it a title by typing main equals 63 00:03:00,233 --> 00:03:03,666 paste, parentheses, 'Dendrogram' in quotes. 64 00:03:05,500 --> 00:03:07,900 Then let's give a name to the x axis 65 00:03:07,900 --> 00:03:12,100 by adding xlab equals 'Customers', because in the dendrogram 66 00:03:12,100 --> 00:03:14,966 all our customers are going to be on the x axis. 67 00:03:14,966 --> 00:03:17,966 And then finally let's give a name to our y axis. 68 00:03:18,333 --> 00:03:20,966 We're going to call it 'Euclidean distances'. 69 00:03:20,966 --> 00:03:24,166 And that's because in the dendrogram the vertical lines that we're going to see 70 00:03:24,166 --> 00:03:27,166 are actually the Euclidean distances between the clusters, 71 00:03:27,166 --> 00:03:30,166 that is, between the centroids of the clusters. 72 00:03:30,566 --> 00:03:30,933 Okay. 73 00:03:30,933 --> 00:03:32,966 So we are good to go with our plot. 74 00:03:32,966 --> 00:03:35,233 And with our dendrogram, actually.
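As a rough sketch of the two statements dictated so far, the code might look like the following. The data frame below is a hypothetical random stand-in for the X built in the previous tutorial (the real dataset file is not shown here), and the column names are assumptions. The linkage name is spelled 'ward.D', as in the hclust documentation.

```r
# Hypothetical stand-in for X: two numeric columns playing the role of
# annual income and spending score (random values, not the real dataset)
set.seed(42)
X = data.frame(
  annual_income  = runif(50, min = 15, max = 140),
  spending_score = runif(50, min = 1,  max = 100)
)

# First argument: Euclidean distance matrix between every pair of rows.
# Second argument: the linkage method, here Ward's method ('ward.D').
dendrogram = hclust(dist(X, method = 'euclidean'), method = 'ward.D')

# Plot the dendrogram with a title and axis labels
plot(dendrogram,
     main = paste('Dendrogram'),
     xlab = 'Customers',
     ylab = 'Euclidean distances')
```

With the real X in place of the stand-in, only the first block would change; the hclust and plot calls stay the same.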
75 00:03:35,233 --> 00:03:38,866 So let's select all this code section here and execute. 76 00:03:39,266 --> 00:03:41,566 And here is our dendrogram. 77 00:03:41,566 --> 00:03:44,333 So let's have a look at it. I'm clicking on Zoom 78 00:03:44,333 --> 00:03:47,333 here to make it bigger. Okay. 79 00:03:47,400 --> 00:03:50,633 And now let's try to find the optimal number of clusters 80 00:03:50,733 --> 00:03:52,600 thanks to this dendrogram. 81 00:03:52,600 --> 00:03:55,600 So as Kirill explains in the intuition section, 82 00:03:55,933 --> 00:04:00,800 to find this optimal number of clusters, we need to find the largest 83 00:04:00,800 --> 00:04:06,100 vertical distance that we can make without crossing any other horizontal line. 84 00:04:08,266 --> 00:04:08,966 And then we just 85 00:04:08,966 --> 00:04:12,733 need to count the number of vertical lines at this level. Okay. 86 00:04:12,733 --> 00:04:15,900 So let's start by finding the largest vertical distance. 87 00:04:16,400 --> 00:04:18,866 So it's not here, obviously. 88 00:04:18,866 --> 00:04:20,333 Then maybe this one. 89 00:04:20,333 --> 00:04:22,100 It's quite a large distance. 90 00:04:22,100 --> 00:04:24,766 That would actually give us three clusters because, as you can see here, 91 00:04:24,766 --> 00:04:27,766 I'm crossing three vertical lines. 92 00:04:27,866 --> 00:04:29,633 Definitely not this one. 93 00:04:29,633 --> 00:04:32,366 And here we have another large distance. 94 00:04:32,366 --> 00:04:36,800 You see, from this point to this point is quite a large distance. 95 00:04:37,166 --> 00:04:39,966 And then below, obviously, we don't have any large distance. 96 00:04:39,966 --> 00:04:42,600 So now the question is: which is the larger distance, 97 00:04:42,600 --> 00:04:45,600 this distance or this distance? 98 00:04:45,700 --> 00:04:48,600 Well, if we take a closer look at it, we can see that 99 00:04:48,600 --> 00:04:51,600 the larger distance is actually this distance.
100 00:04:51,933 --> 00:04:54,800 And how many vertical lines do we have at this level? 101 00:04:54,800 --> 00:04:58,400 Let's see: one, two, three, four and five. 102 00:04:58,400 --> 00:05:02,033 So that means that our optimal number of clusters is five clusters. 103 00:05:02,733 --> 00:05:05,400 And that's of course a relief because that's what we obtained 104 00:05:05,400 --> 00:05:08,400 with the K-means algorithm using the elbow method. 105 00:05:08,700 --> 00:05:10,033 So everything is fine. 106 00:05:10,033 --> 00:05:12,100 Everything is perfectly coherent. 107 00:05:12,100 --> 00:05:14,633 So we have completed our second step. 108 00:05:14,633 --> 00:05:17,600 And now we are ready to move on to the next step, which is to fit 109 00:05:17,600 --> 00:05:20,666 our hierarchical clustering algorithm to our data X. 110 00:05:21,266 --> 00:05:23,733 And that's what we will be doing in the next tutorial.
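The visual rule used above (find the largest vertical stretch that no horizontal line crosses, then count the vertical lines at that level) can be approximated in code by looking at the gaps between consecutive merge heights. The sketch below is only an illustration under assumptions: X is a hypothetical random stand-in for the real data, so the number it prints will generally not be five, and reading the plot by eye remains the method used in this tutorial.

```r
# Hypothetical stand-in for X (random values, not the real dataset)
set.seed(42)
X = data.frame(
  annual_income  = runif(50, min = 15, max = 140),
  spending_score = runif(50, min = 1,  max = 100)
)

dendrogram = hclust(dist(X, method = 'euclidean'), method = 'ward.D')

# Heights at which merges happen, i.e. where the horizontal lines sit.
# Sorting makes consecutive differences equal to the vertical gaps.
heights = sort(dendrogram$height)
gaps = diff(heights)

# i merges happen below the widest gap, so a cut placed inside that gap
# crosses (n - i) vertical lines, which is the suggested number of clusters
i = which.max(gaps)
k = nrow(X) - i
print(k)

# Sanity check: cutting the tree into k groups yields exactly k labels
clusters = cutree(dendrogram, k = k)
print(table(clusters))
```

The cutree call is included here only to confirm that the cut produces k groups; fitting the clustering to the data is the subject of the next step.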