1 00:00:00,333 --> 00:00:01,366 Are you ready? 2 00:00:01,366 --> 00:00:03,400 We're going to first add a title, 3 00:00:03,400 --> 00:00:06,900 you know, with the title function by Matplotlib.pyplot. 4 00:00:07,200 --> 00:00:08,466 And this time let's say, you know, 5 00:00:08,466 --> 00:00:12,000 we're just going to choose a simple title like dendrogram. 6 00:00:12,466 --> 00:00:13,800 All right. 7 00:00:13,800 --> 00:00:17,200 Then let's add the x label. 8 00:00:17,666 --> 00:00:19,366 There you go, x label. 9 00:00:19,366 --> 00:00:21,233 And then in quotes. 10 00:00:21,233 --> 00:00:24,466 Well, you tell me what is going to be on the x axis. 11 00:00:24,900 --> 00:00:28,733 Let's not fall into the trap of, you know, answering too quickly. 12 00:00:29,100 --> 00:00:29,766 You know. 13 00:00:29,766 --> 00:00:33,266 The x axis is neither the annual income nor the spending score. 14 00:00:33,300 --> 00:00:35,366 You know, it's none of the features. 15 00:00:35,366 --> 00:00:38,966 The x axis in the dendrogram is actually 16 00:00:39,400 --> 00:00:42,466 the customers, you know, the observation points. 17 00:00:42,866 --> 00:00:50,033 Basically, the x axis is not the columns, as I just suggested, but the rows. 18 00:00:50,033 --> 00:00:53,733 Because remember in this matrix X or even this data set, 19 00:00:53,966 --> 00:00:56,366 each row corresponds to a customer. 20 00:00:56,366 --> 00:00:59,366 And this customer is actually an observation point. 21 00:00:59,600 --> 00:01:02,666 And that's what is on the x axis of the dendrogram. 22 00:01:02,866 --> 00:01:08,266 You have all the observation points, you know, just given by indexes from 1 to 200. 23 00:01:08,266 --> 00:01:10,866 Because we have 200 customers in this data set. 24 00:01:10,866 --> 00:01:14,166 And then on the y axis, what will you have? 25 00:01:14,166 --> 00:01:19,666 Well, you will have the Euclidean distances between each pair of customers. 
26 00:01:19,666 --> 00:01:24,600 And then each pair of groups of customers where the groups get bigger and bigger. 27 00:01:24,600 --> 00:01:27,700 And then by considering bigger and bigger groups, you get also 28 00:01:27,700 --> 00:01:31,266 the Euclidean distances between two of these bigger groups. 29 00:01:31,266 --> 00:01:33,766 Right. So that's how the dendrogram works. 30 00:01:33,766 --> 00:01:36,700 It will become more clear once we visualize this. 31 00:01:36,700 --> 00:01:37,333 So there you go. 32 00:01:37,333 --> 00:01:41,533 I just wanted to get you to think about what the x label and the y label are. 33 00:01:41,666 --> 00:01:45,600 So now that we know, well, let's input here the x label. 34 00:01:45,800 --> 00:01:48,633 You can either call it observation points, 35 00:01:48,633 --> 00:01:52,100 if you want to generalize this, or if you want to stay in the context 36 00:01:52,100 --> 00:01:56,133 of this case study, well, we can call it customers. 37 00:01:56,300 --> 00:01:59,000 All right, customers, that's for the x label. 38 00:01:59,000 --> 00:02:01,800 And now for the y label. 39 00:02:01,800 --> 00:02:03,900 Well, that will always be the same. 40 00:02:03,900 --> 00:02:07,366 You know, independently of whether we want to stay in that case or not. 41 00:02:07,600 --> 00:02:12,400 That's always going to be the Euclidean distances. 42 00:02:12,766 --> 00:02:13,566 All right. 43 00:02:13,566 --> 00:02:16,733 That's always what is on the y axis in a dendrogram. 44 00:02:17,100 --> 00:02:21,700 And finally we end up, remember, with plt.show 45 00:02:22,033 --> 00:02:25,433 to indeed display the graph in the output. 46 00:02:25,933 --> 00:02:26,200 All right. 47 00:02:26,200 --> 00:02:29,166 So now let's check it out. Let's see that dendrogram. 48 00:02:29,166 --> 00:02:33,666 And I will explain again what we have on the x axis and the y axis. 49 00:02:33,866 --> 00:02:36,300 All right, let's do this. Let's play the cell. 
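[Editor's sketch] The labelling steps just described (title, x label, y label, then plt.show) could look roughly like the cell below. This is a minimal sketch, not the instructor's exact cell: the data is a random stand-in for the 200-customer matrix X, and the use of scipy.cluster.hierarchy with method="ward" is an assumption, since this part of the transcript does not show how the dendrogram itself is built.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical stand-in for the matrix X: 200 rows (customers), 2 feature columns.
rng = np.random.default_rng(0)
X = rng.uniform(low=[15, 1], high=[140, 100], size=(200, 2))

# Assumption: Ward linkage; the transcript does not name the linkage method here.
linkage_matrix = sch.linkage(X, method="ward")

fig, ax = plt.subplots()
dendro = sch.dendrogram(linkage_matrix, ax=ax)

plt.title("Dendrogram")
plt.xlabel("Customers")            # the rows (observation points), not the feature columns
plt.ylabel("Euclidean distances")  # the heights at which points or groups are linked
plt.show()
```

With a real data set, X would be replaced by the two feature columns of the customer table; the three labelling calls stay the same.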
50 00:02:36,300 --> 00:02:40,100 And we're about to get the dendrogram in a second. 51 00:02:40,100 --> 00:02:41,400 There we go. 52 00:02:41,400 --> 00:02:43,600 All right. So that's our beautiful dendrogram. 53 00:02:43,600 --> 00:02:47,066 And as we said in the x axis we have 54 00:02:47,066 --> 00:02:50,700 the customers listed from 1 to 200. 55 00:02:50,700 --> 00:02:53,700 Because there are 200 customers in the data set. 56 00:02:53,900 --> 00:02:54,966 And on the Y axis. 57 00:02:54,966 --> 00:02:58,066 Well, we have indeed that as you can see, Euclidean 58 00:02:58,066 --> 00:03:01,633 distances first between each pair of customers. 59 00:03:01,633 --> 00:03:03,700 That's the little pairs you see here. 60 00:03:03,700 --> 00:03:06,000 And then as you see when you link two customers, you know, 61 00:03:06,000 --> 00:03:08,533 within the same group, well, this forms a group. 62 00:03:08,533 --> 00:03:11,933 And then with the group next to it linking to other customers. 63 00:03:12,066 --> 00:03:15,566 Well, you link these two groups within a new pair 64 00:03:15,566 --> 00:03:19,366 and you compute the Euclidean distance between these two groups 65 00:03:19,633 --> 00:03:23,566 by, you know, taking the root of the sum of the squared distances 66 00:03:23,700 --> 00:03:26,133 between the customers inside these groups. 67 00:03:26,133 --> 00:03:26,800 All right. 68 00:03:26,800 --> 00:03:30,500 And you do this for then each pair of bigger groups as we see, 69 00:03:30,500 --> 00:03:35,266 for example, the two biggest groups that we see on this dendrogram are first, 70 00:03:35,266 --> 00:03:39,900 this one, you know, that first group containing lots of customers. 71 00:03:40,400 --> 00:03:45,900 And the second group is this one, right, which was linking these two subgroups. 72 00:03:45,900 --> 00:03:48,900 And then all these subset groups inside. 
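[Editor's sketch] To make the y axis concrete, the small example below prints the heights at which points and then groups are linked. It is a sketch under assumptions: the four points are made up, and method="ward" is assumed. The exact group-to-group distance depends on the linkage method chosen, so the numbers here illustrate the idea rather than the spoken formula.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Four hypothetical observation points forming two obvious pairs.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])

# Assumption: Ward linkage. Each row of Z is one link in the dendrogram:
# [index of first group, index of second group, linking distance, size of the new group].
Z = sch.linkage(points, method="ward")

heights = Z[:, 2]            # the y-axis values of the horizontal bars
sizes = Z[:, 3].astype(int)  # how many points each newly formed group contains

for first, second, height, size in Z:
    print(f"link {int(first)} and {int(second)} at height {height:.3f} -> group of {int(size)}")
```

The first two links join single points, so their heights are plain Euclidean distances between two customers. The last link joins the two pairs, and its height is the linkage method's distance between the two groups.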
73 00:03:48,900 --> 00:03:52,266 And so now the question is how do we figure out 74 00:03:52,266 --> 00:03:54,400 that optimal number of clusters. 75 00:03:54,400 --> 00:03:55,666 Well it's super simple. 76 00:03:55,666 --> 00:03:59,933 And to show you this I will actually click these three dots here 77 00:03:59,933 --> 00:04:02,933 to view the output in full screen. 78 00:04:03,233 --> 00:04:04,466 But then that's still not enough. 79 00:04:04,466 --> 00:04:07,533 What I would like to do is now save image as 80 00:04:08,666 --> 00:04:09,333 all right 81 00:04:09,333 --> 00:04:13,766 I'm going to call this image dendrogram. I'm going to save it. 82 00:04:14,166 --> 00:04:18,333 And now we're going to go onto my desktop which is right here. 83 00:04:18,500 --> 00:04:20,733 Yes I'm recording at night. 84 00:04:20,733 --> 00:04:22,366 And here is the dendrogram. 85 00:04:22,366 --> 00:04:25,333 So let's see if we can zoom better now. 86 00:04:25,333 --> 00:04:26,466 Oops. 87 00:04:26,466 --> 00:04:29,466 Let me let me just there you go. 88 00:04:29,600 --> 00:04:32,100 Let me enlarge this. 89 00:04:32,100 --> 00:04:32,800 All right. 90 00:04:32,800 --> 00:04:33,900 Well it's not that great, 91 00:04:33,900 --> 00:04:38,133 but it's good because what I wanted to do is this horizontal line. 92 00:04:38,133 --> 00:04:41,133 That's exactly what I wanted to get, which I couldn't get on Colab. 93 00:04:41,666 --> 00:04:43,200 And the reason why I wanted to get this 94 00:04:43,200 --> 00:04:47,033 horizontal line is because it is what will help us figure out 95 00:04:47,033 --> 00:04:51,166 the optimal number of clusters, because very simply, on the dendrogram, 96 00:04:51,466 --> 00:04:54,866 the optimal number of clusters can be found 
97 00:04:54,866 --> 00:04:59,966 where you have the largest distance you can move vertically 98 00:05:01,200 --> 00:05:02,033 without 99 00:05:02,033 --> 00:05:05,133 touching one of these horizontal bars, you know, by 100 00:05:05,166 --> 00:05:08,233 horizontal bar, this is a first horizontal bar. 101 00:05:08,233 --> 00:05:12,233 Then this is the second horizontal bar, the third horizontal bar, 102 00:05:12,400 --> 00:05:13,700 fourth horizontal bar. 103 00:05:13,700 --> 00:05:17,066 And then you know this one is the next one, horizontal 104 00:05:17,066 --> 00:05:20,066 bar, horizontal bar, horizontal bar, etc. 105 00:05:20,333 --> 00:05:24,100 And the optimal number of clusters can be found 106 00:05:24,400 --> 00:05:28,266 where you can move that horizontal line. 107 00:05:28,266 --> 00:05:31,066 You know, the one that I'm creating with my mouse. 108 00:05:31,066 --> 00:05:32,500 And then I'm going to try to move 109 00:05:32,500 --> 00:05:36,566 that vertical bar inside the dendrogram, you know, starting from the top. 110 00:05:36,700 --> 00:05:39,600 And we'll see where I manage to move the most 111 00:05:39,600 --> 00:05:42,766 vertically before meeting one of these horizontal bars. 112 00:05:42,900 --> 00:05:44,200 So let's do this together. 113 00:05:44,200 --> 00:05:44,633 And you know, 114 00:05:44,633 --> 00:05:49,200 we will very easily find where we have the largest vertical move, let's say. 115 00:05:49,333 --> 00:05:53,366 And then the optimal number of clusters will actually be the number 116 00:05:53,566 --> 00:05:57,866 of vertical bars we have inside that vertical move.
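[Editor's sketch] The visual rule just described can also be written as a small calculation. Every horizontal bar sits at one linking height, so the largest vertical move without touching a bar is the largest gap between two consecutive heights. The number of vertical bars crossed inside that gap is the number of groups not yet linked at that level. The function name and the example heights below are hypothetical, chosen only to illustrate the rule.

```python
def clusters_from_largest_gap(link_heights):
    """Apply the 'largest vertical move' rule to the heights of the horizontal bars.

    link_heights: one height per link in the dendrogram (n - 1 values for n points).
    Returns (number_of_clusters, (gap_bottom, gap_top)).
    """
    heights = sorted(link_heights)
    if len(heights) < 2:
        raise ValueError("need at least two link heights to compare gaps")

    n_points = len(heights) + 1
    best_gap, best_index = -1.0, 0
    for i in range(len(heights) - 1):
        gap = heights[i + 1] - heights[i]
        if gap > best_gap:
            best_gap, best_index = gap, i

    # A horizontal line drawn inside the gap sits above (best_index + 1) links,
    # and each link reduces the number of groups by one.
    n_clusters = n_points - (best_index + 1)
    return n_clusters, (heights[best_index], heights[best_index + 1])


# Hypothetical heights for a dendrogram of 7 observation points.
example_heights = [1.0, 1.2, 1.5, 2.0, 6.0, 7.0]
n_clusters, gap = clusters_from_largest_gap(example_heights)
print(n_clusters, gap)
```

In this made-up example the largest gap lies between heights 2.0 and 6.0. A horizontal line drawn there crosses three vertical bars, so the rule suggests three clusters.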