1
00:00:00,066 --> 00:00:00,966
Hello my friends,

2
00:00:00,966 --> 00:00:04,766
and welcome to the practical activity
of this new part.

3
00:00:04,800 --> 00:00:08,200
Part 4, Clustering,
where we're going to build two

4
00:00:08,233 --> 00:00:11,933
clustering models: K-means
and Hierarchical Clustering.

5
00:00:12,233 --> 00:00:13,566
And of course we're going to start with

6
00:00:13,566 --> 00:00:17,233
K-means, which is actually the most popular
model in clustering.

7
00:00:17,366 --> 00:00:20,800
And indeed we will see together
that it provides fantastic results.

8
00:00:21,100 --> 00:00:24,400
So you just saw the intuition lectures
with Kirill.

9
00:00:24,400 --> 00:00:27,400
And now we're going to put that theory
into practice

10
00:00:27,466 --> 00:00:32,066
by building this K-means
clustering model in both Python and R.

11
00:00:32,433 --> 00:00:34,800
And now we should all be on the same page.

12
00:00:34,800 --> 00:00:38,800
And therefore we're going to go to this
folder here, Part 4 - Clustering.

13
00:00:39,066 --> 00:00:42,100
And then we will attack
K-means clustering.

14
00:00:42,400 --> 00:00:44,366
And we're going to start
with Python of course.

15
00:00:44,366 --> 00:00:47,300
And this is your folder containing
two files.

16
00:00:47,300 --> 00:00:52,133
First, the K-means clustering implementation
in the ipynb format,

17
00:00:52,133 --> 00:00:56,700
which therefore you can open with either
Google Colaboratory or Jupyter Notebook.

18
00:00:57,000 --> 00:01:02,300
And then you have Mall_Customers.csv,
which is the CSV file.

19
00:01:02,300 --> 00:01:05,666
You know, the data set
with which we will work in this section

20
00:01:05,866 --> 00:01:08,866
to build our K-means clustering model.

21
00:01:09,000 --> 00:01:09,300
All right.

22
00:01:09,300 --> 00:01:13,500
So first step as usual
I will explain what this dataset is about,

23
00:01:13,700 --> 00:01:16,966
which will allow me to explain
the purpose of this mission.

24
00:01:17,133 --> 00:01:20,400
You know, why we want to build
the K-means algorithm and what for.

25
00:01:20,666 --> 00:01:24,233
And then we'll start of course,
our implementation from scratch,

26
00:01:24,266 --> 00:01:29,066
step by step, and you'll take action
with me to build the K-means algorithm.

27
00:01:29,266 --> 00:01:30,000
All right.

28
00:01:30,000 --> 00:01:32,066
So what is this dataset about?

29
00:01:32,066 --> 00:01:36,833
Well, as you can see by the title of this
data set, Mall_Customers.

30
00:01:37,033 --> 00:01:41,866
Well, it's actually a data
set made by, you know, the strategic

31
00:01:41,866 --> 00:01:46,433
team, let's say, of a mall that collected
some data about their customers.

32
00:01:46,433 --> 00:01:48,600
So here it's important to see it this way.

33
00:01:48,600 --> 00:01:52,666
Each row corresponds
to a customer of the mall.

34
00:01:52,833 --> 00:01:55,200
And for each of these customers
of the mall.

35
00:01:55,200 --> 00:01:58,833
Well, the data analyst of this team
gathered the following information.

36
00:01:58,866 --> 00:02:04,566
First the customer ID, then the gender, male or
female, then the age, the annual income.

37
00:02:04,566 --> 00:02:06,700
And let's expand this.

38
00:02:06,700 --> 00:02:10,433
Well I can't do it here, but
that last variable is the spending score.

39
00:02:10,433 --> 00:02:13,566
And it can take values between 1 and 100.

40
00:02:13,833 --> 00:02:16,733
So all these features are pretty clear.

41
00:02:16,733 --> 00:02:18,766
Let me explain what this one means.

42
00:02:18,766 --> 00:02:22,333
The spending score is a metric
made by the mall

43
00:02:22,500 --> 00:02:26,166
to measure
you know, how much each customer spends.

44
00:02:26,166 --> 00:02:29,833
And so they made this metric
which takes values from 1 to 100.

45
00:02:29,833 --> 00:02:34,066
You know, that's the scale of the metric
such that well, the lower the score,

46
00:02:34,066 --> 00:02:37,300
the less the customer spends
and the higher the score, the more

47
00:02:37,300 --> 00:02:38,233
the customer spends.

48
00:02:38,233 --> 00:02:41,666
You know, in a certain period of time,
let's say in the past year.

49
00:02:41,700 --> 00:02:42,300
Okay.

50
00:02:42,300 --> 00:02:45,800
So for example,
this customer actually spends

51
00:02:45,800 --> 00:02:49,433
a lot in this mall, you know,
because he has a score of 81.

52
00:02:49,666 --> 00:02:52,800
However, this customer spends very little

53
00:02:52,800 --> 00:02:55,733
in the mall
because she has a score of six.

54
00:02:55,733 --> 00:02:56,300
All right.

55
00:02:56,300 --> 00:03:00,000
So that's just a metric measuring
the spending of each customer.

56
00:03:00,400 --> 00:03:03,100
And so now
what is the purpose of this mission?

57
00:03:03,100 --> 00:03:07,100
What did this strategic team
or analytics team want to do?

58
00:03:07,466 --> 00:03:10,533
Well, as you might guess, since right now
we're doing clustering,

59
00:03:10,766 --> 00:03:15,233
this team wants to very simply understand
its customers.

60
00:03:15,233 --> 00:03:18,600
You know,
they want to identify some patterns

61
00:03:18,766 --> 00:03:21,833
within its customers,
within its base of customers.

62
00:03:22,266 --> 00:03:24,433
And that's the key thing to understand
here.

63
00:03:24,433 --> 00:03:28,900
You know, when doing clustering this time
as opposed to, you know, previously

64
00:03:28,900 --> 00:03:33,733
with regression and classification, where
we actually knew what to predict.

65
00:03:34,000 --> 00:03:37,766
Well, this time
we actually have no idea what to predict.

66
00:03:38,066 --> 00:03:41,566
But even though we don't know
what specifically to predict,

67
00:03:41,700 --> 00:03:45,500
we still know that
we want to identify some patterns.

68
00:03:45,500 --> 00:03:47,533
And that's the why of this mission.

69
00:03:47,533 --> 00:03:49,200
You know, the purpose of this mission.

70
00:03:49,200 --> 00:03:49,600
Okay.

71
00:03:49,600 --> 00:03:51,666
So it's good we understand the why.

72
00:03:51,666 --> 00:03:55,566
And now let's understand
the how: how are we going to identify

73
00:03:55,600 --> 00:03:56,700
such patterns?

74
00:03:56,700 --> 00:03:59,166
Well,
we will do this with K-means of course.

75
00:03:59,166 --> 00:04:02,266
And more specifically, what we will do is

76
00:04:02,266 --> 00:04:05,333
we will create a dependent
variable, right?

77
00:04:05,333 --> 00:04:09,566
We will create a dependent variable
which will take a finite number of values.

78
00:04:09,566 --> 00:04:12,066
You know, let's say 4 or 5 values.

79
00:04:12,066 --> 00:04:15,733
And actually each of the values
will be a class

80
00:04:15,733 --> 00:04:18,733
of this dependent variable
we're going to create.

81
00:04:18,733 --> 00:04:20,966
And that's exactly what clustering means.

82
00:04:20,966 --> 00:04:25,266
You know, technically, in the details,
if you want to be broad on how to explain

83
00:04:25,266 --> 00:04:29,000
clustering, you would say that we are
identifying some patterns in the data.

84
00:04:29,166 --> 00:04:33,133
But if you want to clearly explain how
to identify these patterns in the data,

85
00:04:33,333 --> 00:04:37,033
well, you would say that
we are building a dependent variable.

86
00:04:37,033 --> 00:04:38,666
You know, we are creating it

87
00:04:38,666 --> 00:04:42,733
in such a way that each of the values
of this future dependent variable,

88
00:04:42,733 --> 00:04:46,800
we are creating, is actually one of the classes
of this dependent variable.

89
00:04:47,100 --> 00:04:47,700
All right.

90
00:04:47,700 --> 00:04:51,966
So this will become much more clear once,
you know, we build our K-means

91
00:04:51,966 --> 00:04:55,233
algorithm and we get that
dependent variable we are creating.

92
00:04:55,400 --> 00:04:58,966
But please remember this:
we are creating a dependent variable.
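
To make the idea of "creating a dependent variable" concrete, here is a minimal sketch in plain Python. It is not the notebook built in this section: the customer rows, the starting centroids, and the `kmeans` helper are all made up for illustration, and the choice of four clusters is only an assumption. What it shows is that K-means outputs one label per customer, and that list of labels is the new variable whose values are the classes.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain K-means (Lloyd's algorithm). Returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point gets the index of its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical customers: (annual income, spending score from 1 to 100).
customers = [(15, 81), (16, 6), (17, 77), (18, 10),
             (80, 90), (85, 12), (82, 88), (88, 8)]

# Hypothetical starting centroids, fixed here so the run is reproducible.
initial = [(15, 80), (15, 10), (85, 90), (85, 10)]

# `labels` is the created variable: one class index per customer.
labels, centers = kmeans(customers, initial)
for customer, label in zip(customers, labels):
    print(customer, "-> cluster", label)
```

Each customer ends up with one of a finite number of values (here 0 to 3), which is exactly the kind of variable described above. In practice the starting centroids are usually chosen randomly or by a dedicated initialization method rather than by hand.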