1
00:00:00,300 --> 00:00:02,966
Hello and welcome to this R tutorial.

2
00:00:02,966 --> 00:00:05,466
In the previous tutorial,
we imported our mall data set

3
00:00:05,466 --> 00:00:08,666
and we prepared our data correctly
by taking the two columns

4
00:00:08,666 --> 00:00:11,700
we are interested in the annual income
and the spending score.

5
00:00:11,700 --> 00:00:15,233
So we created this variable
X that contains these two columns.

6
00:00:15,700 --> 00:00:18,000
And now things are going
to get much more interesting

7
00:00:18,000 --> 00:00:21,533
because in this tutorial
we are going to build our dendrogram.

8
00:00:21,833 --> 00:00:25,266
And we will use it
to find the optimal number of clusters

9
00:00:25,566 --> 00:00:29,133
exactly like we did in the K-means
section, where as you remember,

10
00:00:29,566 --> 00:00:33,666
in step two, we used the elbow method chart
to find the optimal number of clusters.

11
00:00:33,966 --> 00:00:36,966
Well, here in Hierarchical Clustering
step two,

12
00:00:37,166 --> 00:00:40,166
we will also look for this
optimal number of clusters.

13
00:00:40,166 --> 00:00:42,400
Only this time
we're not going to use the elbow method.

14
00:00:42,400 --> 00:00:43,800
We are going to use the dendrogram.

15
00:00:44,966 --> 00:00:46,833
So let's do that right now.

16
00:00:46,833 --> 00:00:48,166
The very cool thing about it

17
00:00:48,166 --> 00:00:51,433
is that we only need one line of code
to build this dendrogram.

18
00:00:52,133 --> 00:00:53,100
So let's write it.

19
00:00:53,100 --> 00:00:54,733
Let's write this line of code.

20
00:00:54,733 --> 00:00:57,733
We start
by creating our variable dendrogram.

21
00:00:57,900 --> 00:00:59,733
Then equals.

22
00:00:59,733 --> 00:01:02,200
And then we're going to use the class
hclust.

23
00:01:02,200 --> 00:01:05,000
So let's type hclust here.

24
00:01:05,000 --> 00:01:07,000
And then let's press F1.

25
00:01:07,000 --> 00:01:10,000
And here we have all the info on this
hclust class.

26
00:01:10,800 --> 00:01:12,633
So let's look at the arguments here.

27
00:01:12,633 --> 00:01:14,966
We only need the first two arguments.

28
00:01:14,966 --> 00:01:18,600
The first argument is a dissimilarity
structure as produced by dist.

29
00:01:18,933 --> 00:01:20,766
And in our case
this parameter is going to be

30
00:01:20,766 --> 00:01:24,833
the distance matrix of our data set X,
which is a matrix

31
00:01:24,833 --> 00:01:28,966
that tells, for each pair of customers,
the Euclidean distance between the two.

32
00:01:29,100 --> 00:01:29,933
So that means that

33
00:01:29,933 --> 00:01:33,900
for each pair of customers,
we take the two coordinates annual income

34
00:01:33,900 --> 00:01:37,000
and spending score,
and we compute the Euclidean

35
00:01:37,000 --> 00:01:40,000
distance between the two
based on these coordinates.

36
00:01:40,200 --> 00:01:40,500
Okay.

37
00:01:40,500 --> 00:01:43,633
So that was just to explain
the first parameter of the hclust class.

38
00:01:43,966 --> 00:01:45,000
And so let's input it.

39
00:01:45,000 --> 00:01:48,000
In our code we input dist

40
00:01:48,066 --> 00:01:50,600
and in parentheses X, comma.

41
00:01:50,600 --> 00:01:53,100
And then method equals Euclidean.

42
00:01:54,300 --> 00:01:56,300
So that specifies that we want to compute

43
00:01:56,300 --> 00:02:00,533
the Euclidean distance matrix
for our data X. Okay.

44
00:02:00,533 --> 00:02:03,766
So that's the first parameter,
this distance matrix.

45
00:02:04,133 --> 00:02:07,133
And now the second parameter
is the method.

46
00:02:07,233 --> 00:02:12,000
So this method is simply
the method used to find the clusters.

47
00:02:12,466 --> 00:02:15,466
And like in Python we're going to choose
the most common method

48
00:02:15,466 --> 00:02:16,900
which is the Ward method.

49
00:02:16,900 --> 00:02:19,433
Here it's called ward.D.

50
00:02:19,433 --> 00:02:20,866
And it's actually a method

51
00:02:20,866 --> 00:02:24,433
that is trying to minimize the variance
within each cluster.

52
00:02:25,133 --> 00:02:28,433
Kind of like what we did in K-means
when we were trying to minimize the

53
00:02:28,433 --> 00:02:30,200
within-cluster sum of squares.

54
00:02:30,200 --> 00:02:31,800
Well, here it's based on the same idea.

55
00:02:31,800 --> 00:02:34,866
But instead of trying to minimize the
within cluster sum of squares,

56
00:02:35,133 --> 00:02:39,000
we are trying to minimize the
within-cluster variance to find our clusters.

57
00:02:40,200 --> 00:02:43,800
So here we write method equals ward.D.

58
00:02:44,766 --> 00:02:48,000
So that is the end of the line
to build this dendrogram.

59
00:02:48,300 --> 00:02:50,133
And now we just need to plot it.

60
00:02:50,133 --> 00:02:53,133
So just below we are going to write plot.

61
00:02:53,700 --> 00:02:56,700
Then in parentheses, dendrogram.

62
00:02:57,000 --> 00:03:00,233
Then let's give it a title
by typing main equals

63
00:03:00,233 --> 00:03:03,666
paste parentheses dendrogram in quotes.

64
00:03:05,500 --> 00:03:07,900
Then let's give a name to the x axis

65
00:03:07,900 --> 00:03:12,100
by adding xlab equals Customers,
because in the dendrogram

66
00:03:12,100 --> 00:03:14,966
all our customers
are going to be on the X axis.

67
00:03:14,966 --> 00:03:17,966
And then finally let's
give a name to our y label.

68
00:03:18,333 --> 00:03:20,966
We're going to call it
Euclidean distances.

69
00:03:20,966 --> 00:03:24,166
And that's because in the dendrogram
the vertical lines that we're going to see

70
00:03:24,166 --> 00:03:27,166
are actually
the Euclidean distances of the clusters.

71
00:03:27,166 --> 00:03:30,166
That is between
the centroids of the clusters.

72
00:03:30,566 --> 00:03:30,933
Okay.

73
00:03:30,933 --> 00:03:32,966
So we are good to go with our plot.

74
00:03:32,966 --> 00:03:35,233
And our dendrogram, actually.

75
00:03:35,233 --> 00:03:38,866
So let's select
all this code section here and execute.

76
00:03:39,266 --> 00:03:41,566
And here is our dendrogram.

77
00:03:41,566 --> 00:03:44,333
So let's have a look at it.
I'm clicking on Zoom

78
00:03:44,333 --> 00:03:47,333
here to make it bigger. Okay.

79
00:03:47,400 --> 00:03:50,633
And now let's try to find
the optimal number of clusters

80
00:03:50,733 --> 00:03:52,600
thanks to this dendrogram.

81
00:03:52,600 --> 00:03:55,600
So as Kirill
explains in the intuition section

82
00:03:55,933 --> 00:04:00,800
to find this optimal number of clusters,
we need to find the largest

83
00:04:00,800 --> 00:04:06,100
vertical distance that we can make without
crossing any other horizontal line.

84
00:04:08,266 --> 00:04:08,966
And then we just

85
00:04:08,966 --> 00:04:12,733
need to count the number of vertical lines
at this level. Okay.

86
00:04:12,733 --> 00:04:15,900
So let's start
by finding the largest vertical distance.

87
00:04:16,400 --> 00:04:18,866
So it's not here obviously.

88
00:04:18,866 --> 00:04:20,333
Then maybe this one.

89
00:04:20,333 --> 00:04:22,100
It's quite a large distance.

90
00:04:22,100 --> 00:04:24,766
That would actually give us three clusters
because as you can see here

91
00:04:24,766 --> 00:04:27,766
I'm crossing three vertical lines.

92
00:04:27,866 --> 00:04:29,633
Definitely not this one.

93
00:04:29,633 --> 00:04:32,366
And here we have another large distance.

94
00:04:32,366 --> 00:04:36,800
You see from this point to
this point is quite a large distance.

95
00:04:37,166 --> 00:04:39,966
And then below obviously
we don't have any large distance.

96
00:04:39,966 --> 00:04:42,600
So now the question is
what is the largest distance

97
00:04:42,600 --> 00:04:45,600
between this distance and this distance?

98
00:04:45,700 --> 00:04:48,600
Well if you have a better look at it
we can see that

99
00:04:48,600 --> 00:04:51,600
the largest distance
is actually this distance.

100
00:04:51,933 --> 00:04:54,800
And how many vertical lines do
we have at this level?

101
00:04:54,800 --> 00:04:58,400
Let's see: one, two, three, four and five.

102
00:04:58,400 --> 00:05:02,033
So that means that our optimal number
of clusters is five clusters.

103
00:05:02,733 --> 00:05:05,400
And that's of course a relief
because that's what we obtained

104
00:05:05,400 --> 00:05:08,400
with the K-means algorithm
using the elbow method.

105
00:05:08,700 --> 00:05:10,033
So everything is fine.

106
00:05:10,033 --> 00:05:12,100
Everything is perfectly coherent.

107
00:05:12,100 --> 00:05:14,633
So we have completed our second step.

108
00:05:14,633 --> 00:05:17,600
And now we are ready to move on
to the next step, which is to fit

109
00:05:17,600 --> 00:05:20,666
our hierarchical
clustering algorithm to our data X.

110
00:05:21,266 --> 00:05:23,733
And that's what we will be
doing in the next tutorial.