1
00:00:00,300 --> 00:00:01,200
Hello, my friends.

2
00:00:01,200 --> 00:00:02,666
All right. Are you ready for that

3
00:00:02,666 --> 00:00:06,300
last tutorial of the hierarchical
clustering implementation?

4
00:00:06,766 --> 00:00:07,433
Here we go.

5
00:00:07,433 --> 00:00:11,400
We now have the dendrogram, which gave us
the optimal number of clusters,

6
00:00:11,633 --> 00:00:13,233
which turned out to be five.

7
00:00:13,233 --> 00:00:17,400
And so now we're going to build, train
and run the hierarchical clustering

8
00:00:17,533 --> 00:00:20,633
to indeed identify five clusters.

9
00:00:20,666 --> 00:00:22,133
All right so let's do this.

10
00:00:22,133 --> 00:00:25,733
Let's create a new code
cell here to build train

11
00:00:25,733 --> 00:00:28,733
and run
this hierarchical clustering model.

12
00:00:28,766 --> 00:00:29,266
All right.

13
00:00:29,266 --> 00:00:32,300
So remember to build our dendrogram.

14
00:00:32,300 --> 00:00:36,066
We actually used the SciPy library
because it contains

15
00:00:36,133 --> 00:00:40,800
this dendrogram function which directly
returned the dendrogram which was perfect.

16
00:00:41,133 --> 00:00:46,433
But now in order to build the hierarchical
clustering model with five clusters,

17
00:00:46,633 --> 00:00:50,400
well we're going to go back
to our best friend scikit-learn.

18
00:00:50,566 --> 00:00:52,566
Because indeed scikit-learn contains,

19
00:00:52,566 --> 00:00:57,300
if you remember, the cluster module
which contains the agglomerative

20
00:00:57,300 --> 00:01:01,333
clustering class, which is exactly,
you know, the classic version

21
00:01:01,500 --> 00:01:05,766
of hierarchical clustering, the one that
you studied in the intuition lectures.

22
00:01:06,366 --> 00:01:06,700
All right.

23
00:01:06,700 --> 00:01:09,833
So we're going to start from scikit-learn.

24
00:01:10,600 --> 00:01:14,133
There we go from which
we're going to get access to cluster

25
00:01:14,133 --> 00:01:17,633
the cluster module
from which we're going to import

26
00:01:17,800 --> 00:01:21,633
that AgglomerativeClustering class.

27
00:01:21,633 --> 00:01:24,300
Perfect. Thank you so much Google Colab.

28
00:01:24,300 --> 00:01:24,966
All right.

29
00:01:24,966 --> 00:01:28,700
Now the next natural step as usual
as most of the time,

30
00:01:28,966 --> 00:01:32,633
is to create an object
or an instance of this class.

31
00:01:32,633 --> 00:01:37,533
And we're going to call it hc
because this object will be nothing else

32
00:01:37,600 --> 00:01:39,200
than the hierarchical

33
00:01:39,200 --> 00:01:43,033
clustering model itself,
you know, with all its algorithm inside.

34
00:01:43,333 --> 00:01:44,533
So hc.

35
00:01:44,533 --> 00:01:47,666
And therefore now
we're going to call the class to indeed

36
00:01:47,666 --> 00:01:50,733
create an instance of this class.

37
00:01:50,733 --> 00:01:52,966
And then adding some parentheses.

38
00:01:52,966 --> 00:01:55,966
And now let's see what we have to input.

39
00:01:56,433 --> 00:01:56,866
All right.

40
00:01:56,866 --> 00:01:59,100
So can you guess the first parameter?

41
00:01:59,100 --> 00:02:02,700
It's pretty obvious it's
actually the same as in the K-means class.

42
00:02:03,000 --> 00:02:05,800
The first parameter
is the number of clusters

43
00:02:05,800 --> 00:02:08,766
we want to identify in our data set.

44
00:02:08,766 --> 00:02:10,300
And we know that it's five.

45
00:02:10,300 --> 00:02:12,600
But you know, I'm very curious about that

46
00:02:12,600 --> 00:02:15,566
three. You know, that number three
is the other option

47
00:02:15,566 --> 00:02:17,666
of the optimal number of clusters.

48
00:02:17,666 --> 00:02:20,233
So, you know, we'll try that at the end
we will see what we get.

49
00:02:20,233 --> 00:02:25,633
But let's start first with n_clusters
equals five.

50
00:02:25,966 --> 00:02:27,766
All right five clusters.

51
00:02:27,766 --> 00:02:30,166
And now we need to add two more arguments.

52
00:02:30,166 --> 00:02:32,366
The second one is affinity

53
00:02:32,366 --> 00:02:36,433
which is simply the type of distance
that will be computed

54
00:02:36,433 --> 00:02:39,566
in order to measure the variance
within your clusters.

55
00:02:39,566 --> 00:02:42,866
Because then you're going to see that
we will use again this Ward method

56
00:02:42,866 --> 00:02:46,766
which corresponds to the minimization
of the variance within your clusters.

57
00:02:47,033 --> 00:02:50,400
So for affinity here we're going to choose
well the

58
00:02:50,400 --> 00:02:53,500
Euclidean distance.

59
00:02:53,966 --> 00:02:58,566
And that last parameter that we need to
add is of course that method.

60
00:02:58,566 --> 00:03:03,666
But this time the name of the parameter
is not method it is directly linkage.

61
00:03:03,800 --> 00:03:04,500
All right.

62
00:03:04,500 --> 00:03:06,933
And so linkage here should be equal.

63
00:03:06,933 --> 00:03:08,500
You know there are several options.

64
00:03:08,500 --> 00:03:11,900
But the one we recommend
is the Ward method

65
00:03:11,900 --> 00:03:14,900
which corresponds to the minimum
variance method.

66
00:03:14,966 --> 00:03:15,500
All right.

67
00:03:15,500 --> 00:03:16,233
And that's it.

68
00:03:16,233 --> 00:03:19,400
So now we have
our hierarchical clustering model.

69
00:03:19,400 --> 00:03:24,933
But of course it is not yet trained
or fitted to the data set.

70
00:03:25,166 --> 00:03:27,433
And so that's exactly our next step here.

71
00:03:27,433 --> 00:03:32,166
But remember that at the same time
we want to create this dependent variable

72
00:03:32,166 --> 00:03:36,266
which contains for each customer
the future class they will belong to,

73
00:03:36,300 --> 00:03:39,300
you know, the future cluster
they will belong to.

74
00:03:39,466 --> 00:03:43,166
And therefore instead of only
using the fit method, which you know,

75
00:03:43,200 --> 00:03:46,200
usually trains your machine
learning models on your data

76
00:03:46,200 --> 00:03:49,200
set, well,
we're going to use the fit_predict method,

77
00:03:49,366 --> 00:03:53,500
which will not only train
your clustering model on your data set,

78
00:03:53,500 --> 00:03:57,333
but also will create at the same time
this dependent variable

79
00:03:57,333 --> 00:04:00,966
containing for each of the customers
the cluster they belong to.

80
00:04:01,200 --> 00:04:01,700
All right.

81
00:04:01,700 --> 00:04:05,500
And speaking of this future
created dependent variable,

82
00:04:05,666 --> 00:04:07,000
well we're going to introduce here

83
00:04:07,000 --> 00:04:10,900
a new variable
which we're going to call y underscore hc.

84
00:04:11,466 --> 00:04:13,933
And this is exactly you know that

85
00:04:13,933 --> 00:04:17,466
dependent variable
you see here with the five clusters.

86
00:04:17,600 --> 00:04:21,633
All right, so y_hc. And let's go back. That

87
00:04:21,633 --> 00:04:25,166
y_hc variable, well, will be equal to

88
00:04:25,166 --> 00:04:28,500
what is returned by this
fit_predict method.

89
00:04:28,933 --> 00:04:32,233
Not only training the hierarchical
clustering model on the data set,

90
00:04:32,466 --> 00:04:36,900
but also returning the clusters
to which each customer belongs.

91
00:04:37,400 --> 00:04:37,833
All right.

92
00:04:37,833 --> 00:04:42,366
And therefore what we have to do here
is just take our hc object

93
00:04:42,566 --> 00:04:44,466
because it's from this object

94
00:04:44,466 --> 00:04:49,766
that we have to call this fit underscore
predict method.

95
00:04:49,933 --> 00:04:53,966
And inside we of course input X, just X.

96
00:04:53,966 --> 00:04:55,833
Right. Because we just need to connect.

97
00:04:55,833 --> 00:05:00,366
We just need to fit our hc object,
our hierarchical clustering model,

98
00:05:00,533 --> 00:05:03,533
to the data set, which is exactly X.

99
00:05:03,633 --> 00:05:04,900
But only, you know,

100
00:05:04,900 --> 00:05:10,233
remember, containing the last two features,
the annual income and spending score.

101
00:05:10,500 --> 00:05:10,800
All right.

102
00:05:10,800 --> 00:05:14,200
So exactly the same as with K-means okay.

103
00:05:14,200 --> 00:05:14,966
And that's it.

104
00:05:14,966 --> 00:05:17,200
Once again
thanks to our best friend scikit-learn.

105
00:05:17,200 --> 00:05:20,400
Well, in only three lines of code,
we build, train and run

106
00:05:20,400 --> 00:05:24,433
the hierarchical clustering model
to identify five clusters.

107
00:05:24,800 --> 00:05:25,600
So let's do this.

108
00:05:25,600 --> 00:05:28,166
Let's run this cell.

109
00:05:28,166 --> 00:05:29,733
And done.

110
00:05:29,733 --> 00:05:32,633
We have our model
and it is already trained.

111
00:05:32,633 --> 00:05:36,066
So now let's actually do a little print
to see

112
00:05:36,633 --> 00:05:41,233
you know that created dependent variable
y_hc.

113
00:05:41,833 --> 00:05:45,500
And let's play
the cell and we'll see what we.
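
The three lines described in this video can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the instructor's exact notebook: the X array below is a synthetic stand-in for the two customer features (annual income and spending score), since the real data set is not part of this transcript. The video passes affinity set to euclidean; newer scikit-learn releases renamed that parameter to metric, so the sketch leaves it out and relies on the Euclidean default, which is the distance Ward linkage uses.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for X (annual income, spending score):
# five synthetic, well-separated groups of 40 "customers" each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
X = np.vstack([c + rng.normal(scale=3.0, size=(40, 2)) for c in centers])

# Build the model: five clusters, Ward (minimum variance) linkage.
# The distance is left at its default, which is Euclidean.
hc = AgglomerativeClustering(n_clusters=5, linkage="ward")

# Train the model and get the cluster label of each row in one call.
y_hc = hc.fit_predict(X)

print(y_hc)
```

Each entry of y_hc is an integer from 0 to 4 giving the cluster of the corresponding row of X. Changing n_clusters to 3 would try the other candidate mentioned in the video.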