1
00:00:00,100 --> 00:00:00,533
Hello my

2
00:00:00,533 --> 00:00:04,366
friends, and welcome
to this new practical activity where we're

3
00:00:04,366 --> 00:00:08,400
going to build together this time
the hierarchical clustering algorithm.

4
00:00:08,900 --> 00:00:12,266
And now we're all going to go into Part 4,
Clustering,

5
00:00:12,433 --> 00:00:17,033
to this time tackle
the hierarchical clustering model.

6
00:00:17,033 --> 00:00:21,266
And we're going to start with Python.
As usual, we're going to work

7
00:00:21,266 --> 00:00:26,600
on the same data set, Mall_Customers, right,
where each row corresponds to a customer.

8
00:00:26,600 --> 00:00:29,733
And for each of these customers
well, the mall gathered some info

9
00:00:29,733 --> 00:00:34,800
like the customer ID, the gender, the age,
the annual income, and the spending score.

10
00:00:35,200 --> 00:00:38,400
And we're actually going to work only
with these two features.

11
00:00:38,400 --> 00:00:42,733
The annual income and the spending score
to identify these clusters,

12
00:00:42,733 --> 00:00:45,733
but this time
with hierarchical clustering.

13
00:00:45,733 --> 00:00:46,133
All right.

14
00:00:46,133 --> 00:00:51,533
So, exactly the same, and therefore let's
proceed directly to the implementation

15
00:00:51,700 --> 00:00:55,800
which you can open with either
Google Colaboratory, as we're about to do.

16
00:00:56,100 --> 00:01:00,566
Or if you don't like Google Colaboratory,
you can open it with Jupyter Notebook.

17
00:01:00,566 --> 00:01:03,900
And for Google Colaboratory lovers
well, follow me here

18
00:01:04,100 --> 00:01:08,033
to open this implementation
in Google Colab.

19
00:01:08,566 --> 00:01:09,366
And there we go.

20
00:01:09,366 --> 00:01:12,566
So that's the hierarchical clustering
implementation.

21
00:01:12,900 --> 00:01:17,400
As you can see, it follows
the exact same structure as K-means.

22
00:01:17,500 --> 00:01:19,400
We first import the libraries.

23
00:01:19,400 --> 00:01:24,533
We then import the data set exactly
the same way as how we did for K-means.

24
00:01:24,533 --> 00:01:29,000
You know, we select these two columns of
index three and four, which correspond,

25
00:01:29,000 --> 00:01:32,200
of course, to the annual salary
and the spending score.

26
00:01:32,500 --> 00:01:35,766
So, exactly the same;
we won't re-implement this together.

27
00:01:36,166 --> 00:01:39,566
And then this time
instead of using the elbow method,

28
00:01:39,600 --> 00:01:44,400
well we're going to use the dendrogram
to find the optimal number of clusters.

29
00:01:44,533 --> 00:01:47,233
And I will explain
not only the implementation.

30
00:01:47,233 --> 00:01:49,200
You know,
we will re-implement this together.

31
00:01:49,200 --> 00:01:52,200
And also I will explain how to find that

32
00:01:52,366 --> 00:01:55,333
optimal number of clusters in this graph.

33
00:01:55,333 --> 00:01:58,100
And then we will train the hierarchical

34
00:01:58,100 --> 00:02:02,066
clustering model on the data set
using the AgglomerativeClustering class.

35
00:02:02,333 --> 00:02:05,233
And finally we will visualize the clusters

36
00:02:05,233 --> 00:02:08,466
exactly the same way
as what we did with K-means.

37
00:02:08,633 --> 00:02:11,300
And actually here
the code is exactly the same.

38
00:02:11,300 --> 00:02:15,633
The only thing that changes is the name
of the dependent variable which we create.

39
00:02:15,633 --> 00:02:16,000
Right?

40
00:02:16,000 --> 00:02:19,533
Because still with hierarchical clustering
we're going to create

41
00:02:19,766 --> 00:02:21,433
that dependent variable.

42
00:02:21,433 --> 00:02:24,000
But this time, instead of calling it
y_kmeans,

43
00:02:24,000 --> 00:02:27,566
as we did for K-means,
we're calling it simply y_hc.

44
00:02:27,766 --> 00:02:30,766
And therefore here
it's exactly the same code with only

45
00:02:30,900 --> 00:02:34,233
that different name
for that created dependent variable.

46
00:02:34,233 --> 00:02:37,266
So we won't re-implement
this either; we'll just keep the code,

47
00:02:37,500 --> 00:02:38,600
and therefore there we go.

48
00:02:38,600 --> 00:02:42,700
We are only going to re-implement
two cells: one

49
00:02:42,700 --> 00:02:46,833
to build the dendrogram and figure out
that optimal number of clusters,

50
00:02:47,100 --> 00:02:52,266
and one to build the hierarchical clustering
model and train it on the whole data set.

51
00:02:52,800 --> 00:02:53,666
Are you ready?

52
00:02:53,666 --> 00:02:54,600
Let's do this.

53
00:02:54,600 --> 00:02:58,200
And in order to do this,
we have to create a copy of this notebook.

54
00:02:58,200 --> 00:03:00,433
because this is in read-only mode.

55
00:03:00,433 --> 00:03:02,633
And therefore we're going to go to file
here.

56
00:03:02,633 --> 00:03:06,733
And then click "Save a copy in Drive"
to indeed create

57
00:03:06,933 --> 00:03:09,933
a copy of this notebook.

58
00:03:10,066 --> 00:03:12,700
Perfect. All right so there we go.

59
00:03:12,700 --> 00:03:13,433
That's our copy.

60
00:03:13,433 --> 00:03:14,600
Now we can modify it.

61
00:03:14,600 --> 00:03:16,866
Now we can re-implement it.

62
00:03:16,866 --> 00:03:19,700
But as we've just said
we won't re-implement everything.

63
00:03:19,700 --> 00:03:22,900
We will just re-implement
these two cells here.

64
00:03:22,900 --> 00:03:26,900
First, the dendrogram: how to build it
and how to read it.

65
00:03:27,133 --> 00:03:28,533
And then of course, well,

66
00:03:28,533 --> 00:03:32,433
how to build the hierarchical
clustering model on the data set.

67
00:03:32,433 --> 00:03:36,966
And then we keep the other cells because
they're exactly the same as in K-means.

68
00:03:36,966 --> 00:03:38,266
And if you want let's just,

69
00:03:38,266 --> 00:03:41,333
you know, remove this
so that we don't see the final result.

70
00:03:41,633 --> 00:03:43,600
And perfect. Now, all right.

71
00:03:43,600 --> 00:03:46,866
All you see here is exactly the
same as with K-means.

72
00:03:47,033 --> 00:03:48,300
The only thing that will change

73
00:03:48,300 --> 00:03:52,333
are these two cells
which we will re-implement together.

74
00:03:52,900 --> 00:03:53,700
Okay. Perfect.

75
00:03:53,700 --> 00:03:58,100
So, first step: let's just execute these
first two cells here.

76
00:03:58,100 --> 00:04:01,366
And to do this
we need of course to upload the data set.

77
00:04:01,633 --> 00:04:03,933
So let's click this folder here.

78
00:04:03,933 --> 00:04:07,800
Now it is connecting to a runtime
to enable file browsing.

79
00:04:07,800 --> 00:04:09,633
You know, on your computer, on your machine.

80
00:04:09,633 --> 00:04:13,300
And in a second
we should see the upload button.

81
00:04:13,500 --> 00:04:15,600
There we go. Upload.

82
00:04:15,600 --> 00:04:18,800
And, well,
I'm already in the K-means folder.

83
00:04:18,800 --> 00:04:21,733
But let me show you again the whole path.

84
00:04:21,733 --> 00:04:25,666
So that's the folder you were given
at the beginning of each section,

85
00:04:25,666 --> 00:04:26,566
including this one.

86
00:04:26,566 --> 00:04:29,566
Hierarchical clustering,
which you could download on your machine.

87
00:04:29,566 --> 00:04:31,200
So I hope you have it right now.

88
00:04:31,200 --> 00:04:34,166
Otherwise you would just need to go back
to the previous article.

89
00:04:34,166 --> 00:04:35,800
And now we're all going to go inside.

90
00:04:35,800 --> 00:04:37,700
Then we're going to go to part four.

91
00:04:37,700 --> 00:04:38,600
Of course

92
00:04:38,600 --> 00:04:43,500
then Section 25, Hierarchical Clustering,
then Python and then there we go.

93
00:04:43,500 --> 00:04:46,200
Mall_Customers.csv.

94
00:04:46,200 --> 00:04:49,466
This will upload the data
set into the notebook.

95
00:04:49,666 --> 00:04:54,000
And so now we can run these first two
cells: first, importing the libraries.

96
00:04:54,266 --> 00:04:59,066
And now that we have pandas
we can import that data set

97
00:04:59,400 --> 00:05:02,600
which at the same time creates this matrix

98
00:05:02,600 --> 00:05:05,600
of two features containing only,

99
00:05:05,700 --> 00:05:10,366
let's see in the data set, containing only
the annual income and the spending score.

100
00:05:10,366 --> 00:05:15,100
In other words, X is just these
two columns here with all the rows okay.

101
00:05:15,700 --> 00:05:18,633
All right.
So data preprocessing phase done.

102
00:05:18,633 --> 00:05:23,300
Now we can focus on the heart
of the hierarchical clustering model

103
00:05:23,433 --> 00:05:26,833
which is first to build the dendrogram
to indeed find

104
00:05:27,000 --> 00:05:28,633
the optimal number of clusters.

105
00:05:28,633 --> 00:05:29,800
And of course

106
00:05:29,800 --> 00:05:33,333
the optimal number of clusters
that will result from this dendrogram

107
00:05:33,333 --> 00:05:38,133
will be the same number as the one we
found with K-means, meaning five clusters.

108
00:05:38,133 --> 00:05:41,866
But I will explain how to read
the dendrogram in order to indeed end up

109
00:05:42,066 --> 00:05:45,066
with an optimal number of five clusters.

110
00:05:45,100 --> 00:05:47,266
All right, so that was the introduction.

111
00:05:47,266 --> 00:05:49,566
And as you know,
I like to take it step by step.

112
00:05:49,566 --> 00:05:53,233
So we will implement that next step
of building the dendrogram

113
00:05:53,433 --> 00:05:54,833
in the next tutorial.

114
00:05:54,833 --> 00:05:55,833
So get ready.

115
00:05:55,833 --> 00:05:57,900
And until then enjoy machine learning.