1
00:00:00,200 --> 00:00:01,033
All right, my friends.

2
00:00:01,033 --> 00:00:01,866
Are you ready?

3
00:00:01,866 --> 00:00:05,800
Are you ready to build that dendrogram
and use it to find the

4
00:00:05,833 --> 00:00:07,433
optimal number of clusters?

5
00:00:07,433 --> 00:00:08,166
Here we go.

6
00:00:08,166 --> 00:00:10,400
Let's implement the solution together.

7
00:00:10,400 --> 00:00:13,066
So let's start
by creating a new code cell.

8
00:00:13,066 --> 00:00:15,300
And now what will be the first step?

9
00:00:15,300 --> 00:00:18,633
Well, as usual you know
we want to implement this efficiently.

10
00:00:18,800 --> 00:00:20,633
So we're going to use a function.

11
00:00:20,633 --> 00:00:26,466
But this time exceptionally this function
won't be imported from scikit-learn.

12
00:00:26,466 --> 00:00:31,100
It will actually be imported from another
very popular library in data science

13
00:00:31,100 --> 00:00:35,300
I would say, you know, it is in the top
three most popular libraries.

14
00:00:35,300 --> 00:00:39,033
I put scikit-learn first, of course,
and TensorFlow for deep learning.

15
00:00:39,333 --> 00:00:42,366
But that other library is SciPy.

16
00:00:42,400 --> 00:00:45,800
SciPy contains a lot of great tools
when building

17
00:00:45,800 --> 00:00:49,000
machine learning models
and well for hierarchical clustering.

18
00:00:49,000 --> 00:00:52,100
It indeed contains a function
which is called dendrogram

19
00:00:52,266 --> 00:00:54,866
and which will return the dendrogram
itself.

20
00:00:54,866 --> 00:00:56,800
You know the plot of the dendrogram.

21
00:00:56,800 --> 00:00:57,633
So let's do this.

22
00:00:57,633 --> 00:01:03,066
Let's directly import that, you know,
module first that contains this function.

23
00:01:03,300 --> 00:01:07,266
And as we said, this module is taken
first from SciPy,

24
00:01:07,733 --> 00:01:10,533
then from the module cluster

25
00:01:10,533 --> 00:01:14,500
and then from the submodule hierarchy.

26
00:01:14,833 --> 00:01:17,433
Right. Google Colab guesses it perfectly.

27
00:01:17,433 --> 00:01:21,233
The other way to write
this is of course to say from SciPy import

28
00:01:21,466 --> 00:01:23,166
cluster dot hierarchy.

29
00:01:23,166 --> 00:01:24,766
So that's just another way to write it.

30
00:01:24,766 --> 00:01:27,600
And then we're going to add
of course a shortcut to this.

31
00:01:27,600 --> 00:01:30,000
Otherwise
we would need to call all of this again.

32
00:01:30,000 --> 00:01:33,600
And the shortcut will be "as sch", right?

33
00:01:33,600 --> 00:01:36,566
For SciPy cluster hierarchy okay.

34
00:01:36,566 --> 00:01:40,400
So that's the module
which contains the function we want to use

35
00:01:40,400 --> 00:01:42,833
in order to build our dendrogram.

36
00:01:42,833 --> 00:01:44,333
So now next step.

37
00:01:44,333 --> 00:01:47,833
Well the next step is to use that function
which we can now access

38
00:01:48,000 --> 00:01:51,033
from that hierarchy module
which we just imported.

39
00:01:51,600 --> 00:01:55,300
And since this function returns directly
the dendrogram itself,

40
00:01:55,466 --> 00:01:59,100
well we are going to create a new
variable here which we're going to call

41
00:01:59,100 --> 00:02:02,100
dendrogram, as simple as that.

42
00:02:02,200 --> 00:02:05,633
And this dendrogram variable
will be the output

43
00:02:05,700 --> 00:02:09,766
of this dendrogram function
which we're about to use

44
00:02:09,766 --> 00:02:14,666
from the hierarchy submodule of the
cluster module from the SciPy library.

45
00:02:14,833 --> 00:02:16,700
All right. So let's do this.

46
00:02:16,700 --> 00:02:20,733
Since this function belongs to all of this
here in the hierarchy module.

47
00:02:20,733 --> 00:02:24,866
Well we have to call first
the shortcut leading to that module.

48
00:02:25,900 --> 00:02:26,933
And from which

49
00:02:26,933 --> 00:02:31,466
we can now call this
dendrogram function.

50
00:02:31,466 --> 00:02:33,900
Perfect. Thank you so much, Google Colab.

51
00:02:33,900 --> 00:02:37,800
And now indeed in the parentheses
we have to input some arguments.

52
00:02:38,033 --> 00:02:41,066
And now you can't really guess
what the arguments will be.

53
00:02:41,066 --> 00:02:44,533
So I'm just going to write it
and then I will explain what this means.

54
00:02:44,533 --> 00:02:47,633
So first we actually have to call sch

55
00:02:47,633 --> 00:02:52,400
again, you know, the hierarchy module from
the cluster module from the SciPy library,

56
00:02:52,400 --> 00:02:55,066
So from which this time

57
00:02:55,066 --> 00:02:59,400
we're going to call another function
which is the linkage function.

58
00:02:59,633 --> 00:03:02,733
And this linkage function
will take as input two arguments.

59
00:03:03,000 --> 00:03:08,833
First well your matrix of features inside
which you want to identify the clusters.

60
00:03:08,833 --> 00:03:10,733
And that's of course X.

61
00:03:10,733 --> 00:03:15,133
And then the second argument
is the clustering technique.

62
00:03:15,333 --> 00:03:19,000
And in hierarchical clustering
the most recommended method

63
00:03:19,000 --> 00:03:21,933
and the one that brings
the most relevant results

64
00:03:21,933 --> 00:03:26,700
and the most relevant clusters
is the method of minimum variance,

65
00:03:26,833 --> 00:03:31,566
which is a technique
that will result in having clusters inside

66
00:03:31,566 --> 00:03:33,566
which, you know, the observation points

67
00:03:33,566 --> 00:03:34,733
don't vary too much,

68
00:03:34,733 --> 00:03:37,600
you know, have among all of them
a low variance.

69
00:03:37,600 --> 00:03:38,633
And that's what it means.

70
00:03:38,633 --> 00:03:43,400
You know, the method of minimum variance
consists of minimizing the variance

71
00:03:43,400 --> 00:03:47,500
in each of the clusters
resulting from hierarchical clustering.

72
00:03:47,800 --> 00:03:50,133
And so this is really the method
that I recommend.

73
00:03:50,133 --> 00:03:51,566
And speaking of this method,

74
00:03:51,566 --> 00:03:54,666
that's exactly the next argument
of this linkage function.

75
00:03:54,666 --> 00:03:55,966
that we have to input here.

76
00:03:55,966 --> 00:03:58,733
And the name of that parameter is method.

77
00:03:58,733 --> 00:04:03,600
And the name of that minimum variance
method is not called minimum variance,

78
00:04:03,600 --> 00:04:05,733
but ward. Ward.

79
00:04:05,733 --> 00:04:07,800
You can actually check this on Wikipedia.

80
00:04:07,800 --> 00:04:09,933
There is a whole page on the Ward method.

81
00:04:09,933 --> 00:04:11,133
And you will see that indeed it

82
00:04:11,133 --> 00:04:14,733
consists of minimizing
the variance inside your clusters.

83
00:04:15,266 --> 00:04:15,633
All right.

84
00:04:15,633 --> 00:04:20,933
And that's it for the whole dendrogram
function here. It only expects one argument,

85
00:04:21,133 --> 00:04:23,933
which is basically the method you choose

86
00:04:23,933 --> 00:04:27,066
for your clustering
that you link to your matrix.

87
00:04:27,066 --> 00:04:30,033
of features X in which
you want to identify the clusters.

88
00:04:30,033 --> 00:04:33,133
So that's all you need to input here
in this dendrogram.

89
00:04:33,366 --> 00:04:34,300
And there you go.

90
00:04:34,300 --> 00:04:37,333
This will already return
the dendrogram itself.

91
00:04:37,333 --> 00:04:39,133
You know the plot of the dendrogram.

92
00:04:39,133 --> 00:04:41,466
But as usual we want to make it nice.

93
00:04:41,466 --> 00:04:44,966
So we're just going to add a title,
an x label and a y label.

94
00:04:45,100 --> 00:04:46,266
And then we will show it.

95
00:04:46,266 --> 00:04:47,400
And now I will teach you

96
00:04:47,400 --> 00:04:51,866
how to read the dendrogram to indeed
find that optimal number of clusters.
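
The steps walked through in this video can be put together roughly as follows. This is a minimal sketch, not the instructor's actual notebook: the feature matrix X is synthetic stand-in data, and the variable name Z, the title, and the axis labels are assumptions. Note that sch.dendrogram draws the plot on the current axes and returns a dictionary describing it, rather than returning the figure itself.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# Hypothetical stand-in for the course's matrix of features X:
# three well-separated 2-D blobs of 20 observations each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# linkage() builds the merge tree. With method='ward', each step merges
# the two clusters whose union least increases within-cluster variance.
Z = sch.linkage(X, method='ward')

# dendrogram() draws the tree and returns a dict describing the plot
# (leaf order, line coordinates, colors).
dendrogram = sch.dendrogram(Z)

plt.title('Dendrogram')            # assumed title
plt.xlabel('Observations')         # assumed label
plt.ylabel('Euclidean distances')  # assumed label
plt.show()
```

The video writes the two calls nested, as sch.dendrogram(sch.linkage(X, method='ward')); the sketch only splits them so the linkage matrix can be inspected separately.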