1
00:00:00,066 --> 00:00:01,733
All right, my friends, let's do this.

2
00:00:01,733 --> 00:00:06,600
Let's use the elbow method
to find the optimal number of clusters.

3
00:00:06,600 --> 00:00:11,500
So we're going to use, of course, the WCSS,
which is the within-cluster

4
00:00:11,500 --> 00:00:12,400
sum of squares.

5
00:00:12,400 --> 00:00:13,700
I will remind you what this is.

6
00:00:13,700 --> 00:00:16,666
But first, let's create a new code cell

7
00:00:16,666 --> 00:00:19,600
to start this new step
of the implementation.

8
00:00:19,600 --> 00:00:21,866
All right.
So what are we going to start with here?

9
00:00:21,866 --> 00:00:25,533
Well, we're going to get back to
a very good friend, scikit-learn.

10
00:00:25,533 --> 00:00:31,033
Because we will actually implement
that elbow method with a class of

11
00:00:31,033 --> 00:00:34,700
scikit-learn, which, guess what, is called,
well, KMeans,

12
00:00:35,033 --> 00:00:39,300
because indeed the way we will implement
the elbow method will actually be

13
00:00:39,300 --> 00:00:44,100
by running the K-means algorithm
with several numbers of clusters.

14
00:00:44,100 --> 00:00:47,200
So you see, we're going to run the K-means
algorithm several

15
00:00:47,200 --> 00:00:50,433
times, each time
with a different number of clusters.

16
00:00:50,433 --> 00:00:51,300
And that's why

17
00:00:51,300 --> 00:00:55,500
we have to call that K-means class
that can run this algorithm already.

18
00:00:55,933 --> 00:01:00,600
And so, well, our first step here
will be to start from scikit-learn,

19
00:01:00,900 --> 00:01:04,900
from which we're going to get access
to the module that contains

20
00:01:04,900 --> 00:01:08,633
that K-means class,
and that module is called cluster.

21
00:01:08,966 --> 00:01:09,900
Just like that.

22
00:01:09,900 --> 00:01:12,900
And then from which we are going to import

23
00:01:13,066 --> 00:01:16,466
that KMeans class. Perfect.
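
As a minimal sketch, the import just described would look like this, assuming scikit-learn is installed (its Python package name is `sklearn`):

```python
# The KMeans class lives in the cluster module of scikit-learn.
from sklearn.cluster import KMeans
```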

24
00:01:17,033 --> 00:01:19,800
And now what do you think
the next step is going to be?

25
00:01:19,800 --> 00:01:21,900
Well, exceptionally, this time,

26
00:01:21,900 --> 00:01:22,733
the next step

27
00:01:22,733 --> 00:01:27,400
won't be to create an instance, or,
you know, an object of this KMeans class,

28
00:01:27,800 --> 00:01:32,666
because we are about to start a for loop,
which will run the K-means

29
00:01:32,666 --> 00:01:36,966
algorithm
with ten different numbers of clusters.

30
00:01:36,966 --> 00:01:41,033
So very simply, we will run the K-means
algorithm with one cluster,

31
00:01:41,033 --> 00:01:44,033
then with two clusters,
three clusters, etc.

32
00:01:44,033 --> 00:01:48,933
up to ten clusters and therefore
the way to do this is through a loop.

33
00:01:48,933 --> 00:01:52,800
And we will do a for loop, because we know
exactly the different numbers

34
00:01:52,800 --> 00:01:55,966
of clusters
we want to try, which are from 1 to 10.

35
00:01:56,466 --> 00:01:59,800
And each time
we run the K-means algorithm, you know,

36
00:01:59,800 --> 00:02:02,866
with these different numbers
of clusters, well, we will compute,

37
00:02:02,866 --> 00:02:05,966
of course, you know,
that famous metric in clustering,

38
00:02:06,100 --> 00:02:09,200
which is, as I
told you at the beginning, the WCSS,

39
00:02:10,200 --> 00:02:11,000
the within-cluster

40
00:02:11,000 --> 00:02:13,800
sum of squares, which, I remind you,

41
00:02:13,800 --> 00:02:17,700
is defined
as the sum of the squared distances

42
00:02:17,900 --> 00:02:21,000
between each observation
point of the cluster

43
00:02:21,166 --> 00:02:24,466
and, essentially,
the centroid of that cluster.
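
To make that definition concrete, here is a small sketch in plain Python. The helper name `compute_wcss` and the toy points are assumptions made for illustration; they are not from the video. The total adds up the squared distances over every cluster.

```python
def compute_wcss(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        n = len(points)
        dims = len(points[0])
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in points) / n for d in range(dims)]
        # Add the squared Euclidean distance of every point to that centroid.
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))
    return total


# Two hypothetical clusters of 2-D points.
toy_clusters = [
    [(0.0, 0.0), (2.0, 0.0)],  # centroid (1, 0): squared distances 1 + 1
    [(5.0, 5.0), (5.0, 9.0)],  # centroid (5, 7): squared distances 4 + 4
]
toy_wcss = compute_wcss(toy_clusters)
print(toy_wcss)  # 10.0
```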

44
00:02:24,833 --> 00:02:28,300
So we're going to compute
that sum of these squared distances.

45
00:02:28,500 --> 00:02:34,200
And this is exactly what will be on the
y axis in the graph of the elbow method.

46
00:02:34,233 --> 00:02:39,266
You know, remember, the graph in the elbow
method contains, on the x-axis,

47
00:02:39,266 --> 00:02:42,600
well, the different numbers of clusters
we will try from 1 to 10.

48
00:02:42,900 --> 00:02:47,300
And on the y-axis,
it contains the WCSS computed

49
00:02:47,500 --> 00:02:50,300
for each of these numbers of clusters.

50
00:02:50,300 --> 00:02:51,166
And therefore,

51
00:02:51,166 --> 00:02:56,333
here, what we have to do right before
starting this for loop is to create a list

52
00:02:56,633 --> 00:03:01,200
which will, through the for loop,
be populated with the successive

53
00:03:01,333 --> 00:03:04,766
WCSS values, you know,
for each of the numbers of clusters,

54
00:03:05,066 --> 00:03:08,733
and therefore
we're going to call that list wcss,

55
00:03:09,133 --> 00:03:12,300
which we will initialize as an empty list.

56
00:03:12,466 --> 00:03:17,066
Remember that lists in Python
are written in a pair of square brackets.

57
00:03:17,066 --> 00:03:21,566
So here in this pair of square brackets
we're going to add, one by one, the different

58
00:03:21,666 --> 00:03:26,400
WCSS values
for each of the numbers of clusters. Okay.

59
00:03:26,566 --> 00:03:28,333
And now we can start the for loop.

60
00:03:28,333 --> 00:03:31,966
So the way to write a for loop in
Python is to start with for.

61
00:03:32,333 --> 00:03:35,666
Then we choose
the name of the iterated variable,

62
00:03:35,766 --> 00:03:40,633
which you know will be incremented by one
each time in each iteration of the loop.

63
00:03:40,833 --> 00:03:43,833
And the classic name for that
variable is i.

64
00:03:44,166 --> 00:03:46,733
And then we add in range.

65
00:03:46,733 --> 00:03:48,833
And here we specify, in parentheses,

66
00:03:48,833 --> 00:03:54,600
well, the values we want this index
of the loop to take over the iterations.

67
00:03:54,800 --> 00:03:58,800
And here that's very simple:
i will actually take the different values

68
00:03:58,800 --> 00:04:03,366
of the numbers of clusters
we want to try, which are from 1 to 10,

69
00:04:03,366 --> 00:04:04,833
included.

70
00:04:04,833 --> 00:04:09,366
But remember ranges in Python
include the lower bound

71
00:04:09,366 --> 00:04:11,333
but exclude the upper bound.

72
00:04:11,333 --> 00:04:12,900
That's actually what we see here.

73
00:04:12,900 --> 00:04:15,133
You know, start defaults to zero, okay?

74
00:04:15,133 --> 00:04:16,966
So that's the default lower bound.

75
00:04:16,966 --> 00:04:19,766
And stop is omitted, right?

76
00:04:19,766 --> 00:04:21,100
It is excluded.

77
00:04:21,100 --> 00:04:23,566
So that's why
I also really like Google Colab.

78
00:04:23,566 --> 00:04:26,800
You have all the info
in this little help window.

79
00:04:26,966 --> 00:04:28,666
But I'm also here for the explanation.

80
00:04:28,666 --> 00:04:29,633
So there you go.

81
00:04:29,633 --> 00:04:33,333
The range we have to input here
is from one,

82
00:04:33,533 --> 00:04:36,400
you know, the first number of clusters
we will try

83
00:04:36,400 --> 00:04:41,500
and then up to not ten but 11
because we want to include ten.

84
00:04:41,733 --> 00:04:45,133
And therefore we have to go up to 11
which is excluded.
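
A quick way to check this bound behavior for yourself (the variable names here are only illustrative):

```python
# range(1, 11) includes the lower bound 1 and excludes the upper bound 11.
cluster_counts = list(range(1, 11))
print(cluster_counts)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# With a single argument, start defaults to 0 and the stop value is still excluded.
default_start = list(range(3))
print(default_start)  # [0, 1, 2]
```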

85
00:04:45,333 --> 00:04:46,000
All right.

86
00:04:46,000 --> 00:04:49,300
And then we add just a little colon
just like that.

87
00:04:49,300 --> 00:04:51,900
And then we start the for loop.

88
00:04:51,900 --> 00:04:52,366
All right.

89
00:04:52,366 --> 00:04:55,033
And now can come
the next natural step,

90
00:04:55,033 --> 00:04:58,133
you know, after we imported this class,
the KMeans class.

91
00:04:58,133 --> 00:05:02,300
Because indeed now
we can create our first K-means object.

92
00:05:02,533 --> 00:05:05,500
Why do I say our first K-means object?

93
00:05:05,500 --> 00:05:07,833
That's because, once again,
you know, we are going to create

94
00:05:07,833 --> 00:05:13,366
ten different K-means objects, one for each of
these numbers of clusters from 1 to 10.

95
00:05:13,733 --> 00:05:17,866
So here we're creating the first K-means
algorithm, which will be run

96
00:05:17,866 --> 00:05:22,266
with therefore one cluster,
because i here starts at one.

97
00:05:22,500 --> 00:05:22,866
All right.

98
00:05:22,866 --> 00:05:26,966
So let's create our first K-means object,
which represents

99
00:05:26,966 --> 00:05:31,133
exactly the K-means algorithm,
which will be run to identify,

100
00:05:31,166 --> 00:05:33,500
well, actually, only one cluster.

101
00:05:33,500 --> 00:05:35,833
You see what I mean?
And then i will be equal to two.

102
00:05:35,833 --> 00:05:40,100
So that new K-means algorithm
will be run to identify two clusters.

103
00:05:40,266 --> 00:05:44,400
And then a new K-means algorithm will be
run to identify three clusters, etc.

104
00:05:44,400 --> 00:05:46,466
up to ten. Okay, there you go.

105
00:05:46,466 --> 00:05:51,800
That's our first object, which we create
by calling, of course, the K-means class.

106
00:05:51,800 --> 00:05:54,733
Be careful with the capital letters
in KMeans.

107
00:05:54,733 --> 00:05:58,866
We add some parentheses,
and now we input the arguments.
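
Putting the steps so far together, the loop might look roughly like the sketch below. The arguments have not been given at this point in the video, so `n_clusters=i`, `n_init=10`, `random_state=42`, and the toy dataset `X` are all assumptions for illustration. `inertia_` is the scikit-learn attribute that holds the within-cluster sum of squares of a fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): three 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in blob_centers])

wcss = []  # will hold one WCSS value per number of clusters
for i in range(1, 11):  # i takes the values 1, 2, ..., 10
    # n_init and random_state are assumed values, not taken from the video.
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # WCSS of the model fitted with i clusters
```

Plotting `wcss` against the numbers 1 to 10 would then give the elbow graph described earlier.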