1
00:00:00,300 --> 00:00:02,900
Hello and welcome to this R tutorial.

2
00:00:02,900 --> 00:00:06,866
So in the previous tutorials
we solved our business problem in Python

3
00:00:06,866 --> 00:00:08,833
using hierarchical clustering.

4
00:00:08,833 --> 00:00:11,033
And this time
we're going to solve it in R.

5
00:00:11,033 --> 00:00:12,900
And you're going to see that
it's exactly the same.

6
00:00:12,900 --> 00:00:15,566
We are going to import our mall data
set first.

7
00:00:15,566 --> 00:00:18,833
Then we're going to use the dendrogram
to find the optimal number of clusters.

8
00:00:19,300 --> 00:00:22,366
Then we will fit hierarchical
clustering to our mall data set.

9
00:00:22,633 --> 00:00:25,633
And then finally
we will visualize our results.

10
00:00:25,633 --> 00:00:26,633
So in this tutorial

11
00:00:26,633 --> 00:00:29,800
we're going to do the first step
which is to import the mall data set.

12
00:00:29,866 --> 00:00:32,433
So let's start doing it right now.

13
00:00:32,433 --> 00:00:35,700
But before that let's not forget
to set our working directory.

14
00:00:36,066 --> 00:00:38,133
So here I'm on my desktop.

15
00:00:38,133 --> 00:00:39,900
This is my machine learning A-Z folder.

16
00:00:39,900 --> 00:00:44,900
Let's open it, then let's go to Part 3,
Clustering, then Hierarchical Clustering.

17
00:00:45,300 --> 00:00:46,766
And now we click on this more button.

18
00:00:46,766 --> 00:00:49,300
Here we click on Set As
Working Directory.

19
00:00:49,300 --> 00:00:53,233
And that sets our hierarchical clustering
folder as working directory.

20
00:00:53,666 --> 00:00:56,666
So let's make sure we have our mall data
set in the folder.

21
00:00:56,800 --> 00:00:59,666
Here it is, perfect. We are ready to start.

22
00:00:59,666 --> 00:01:03,000
Okay so let's introduce a new section
with the comment:

23
00:01:03,000 --> 00:01:04,433
Importing the mall data set.

24
00:01:06,300 --> 00:01:07,000
Here we go.

25
00:01:07,000 --> 00:01:09,300
And now let's import our data set.

26
00:01:09,300 --> 00:01:10,900
So we create this new variable.

27
00:01:10,900 --> 00:01:14,533
dataset equals read.csv.

28
00:01:15,000 --> 00:01:19,200
And in parentheses we put the name of
our data set, mall.csv, in quotes.
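A minimal sketch of this import step, assuming the file is called mall.csv; the temporary file, its column names, and its two rows are made up so the snippet runs on its own:
```r
# Hypothetical stand-in for the real file, written to a temp folder
path <- file.path(tempdir(), "mall.csv")
writeLines(c("CustomerID,Gender,Age,AnnualIncome,SpendingScore",
             "1,Male,19,15,39",
             "2,Female,21,15,81"), path)
# In the tutorial the call would simply be read.csv('mall.csv'),
# relative to the working directory set earlier
dataset = read.csv(path)
```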

29
00:01:19,833 --> 00:01:22,833
Okay so let's select
this line and execute.

30
00:01:22,833 --> 00:01:25,533
And now our data set appears in data.

31
00:01:25,533 --> 00:01:28,500
So let's click on it. And here it is.

32
00:01:28,500 --> 00:01:31,366
So for those of you
who didn't follow the Python tutorials,

33
00:01:31,366 --> 00:01:34,233
I'll just give a quick reminder
of what this dataset is about.

34
00:01:34,233 --> 00:01:38,366
So basically this is information about
customers of a mall, which are customers

35
00:01:38,366 --> 00:01:42,866
that not only subscribe to the membership
card, but also come often to the mall

36
00:01:43,266 --> 00:01:46,966
and the mall gathered some information
on 200 of these customers:

37
00:01:47,600 --> 00:01:50,600
their gender,
their age, their annual income.

38
00:01:50,733 --> 00:01:54,100
And then for each of these customers,
they computed a spending score.

39
00:01:54,433 --> 00:01:57,700
So this spending score takes values
between 1 and 100.

40
00:01:58,133 --> 00:02:01,933
And the closer the spending score
is to 1, the less the customer spends

41
00:02:02,233 --> 00:02:05,933
and the closer the spending score
is to 100, the more the customer spends.

42
00:02:06,366 --> 00:02:08,600
Okay. So we have this information.

43
00:02:08,600 --> 00:02:12,100
And now our mission
is to find some groups of customers.

44
00:02:12,266 --> 00:02:15,333
But since we have no idea
of what kind of groups we're looking for,

45
00:02:15,600 --> 00:02:18,600
or even the number of groups of customers
we're looking for,

46
00:02:18,700 --> 00:02:21,433
this specifically
makes this business problem

47
00:02:21,433 --> 00:02:24,366
a clustering problem
because we don't know the answers.

48
00:02:24,366 --> 00:02:26,100
We don't know the final result.

49
00:02:26,100 --> 00:02:29,600
And more precisely, we don't know
the final categories of our customers.

50
00:02:30,300 --> 00:02:32,500
Okay. So we imported our data set.

51
00:02:32,500 --> 00:02:37,133
And now what we have to do is to prepare
our data because we want to do this

52
00:02:37,133 --> 00:02:40,566
clustering only based on the annual income
and the spending score.

53
00:02:41,000 --> 00:02:43,600
So let's create a new variable x

54
00:02:43,600 --> 00:02:46,100
equals dataset.

55
00:02:46,100 --> 00:02:49,266
And then in square brackets
we're going to put the two indexes

56
00:02:49,266 --> 00:02:51,666
of our columns of interest, which are...

57
00:02:51,666 --> 00:02:52,566
Let's see.

58
00:02:52,566 --> 00:02:54,900
Let's go back to our data set. Indexes

59
00:02:54,900 --> 00:02:56,466
in R start at one.

60
00:02:56,466 --> 00:02:59,666
So CustomerID is
index one, Gender is index two.

61
00:02:59,700 --> 00:03:01,200
Age is index three.

62
00:03:01,200 --> 00:03:04,200
Annual Income is index four
and Spending Score is index five.

63
00:03:04,366 --> 00:03:04,766
Okay.

64
00:03:04,766 --> 00:03:08,166
So here in the square brackets
we add four colon five (4:5).

65
00:03:08,400 --> 00:03:11,200
That takes our columns
annual income and spending score.
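As a rough sketch of this selection, using a made-up two-row data frame with hypothetical column names in place of the real 200 customers:
```r
# Hypothetical data frame with the five columns in the order described
dataset = data.frame(CustomerID = c(1, 2),
                     Gender = c("Male", "Female"),
                     Age = c(19, 21),
                     AnnualIncome = c(15, 15),
                     SpendingScore = c(39, 81))
# R column indexes start at 1, so 4:5 keeps columns four and five
x = dataset[4:5]
```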

66
00:03:11,200 --> 00:03:14,200
And now let's select
this line of code and execute it.

67
00:03:14,400 --> 00:03:15,600
And here it is.

68
00:03:15,600 --> 00:03:17,933
Our x variable appears in the data.

69
00:03:17,933 --> 00:03:20,933
Let's click on it
to make sure everything is fine.

70
00:03:20,933 --> 00:03:21,800
Okay perfect.

71
00:03:21,800 --> 00:03:26,100
We have our two columns Annual Income and
Spending Score and our 200 observations.

72
00:03:27,166 --> 00:03:27,600
Perfect.

73
00:03:27,600 --> 00:03:29,466
So we completed our first step.

74
00:03:29,466 --> 00:03:31,366
So that's the end of this tutorial.

75
00:03:31,366 --> 00:03:34,366
And in the next tutorial
things are going to get more interesting.

76
00:03:34,500 --> 00:03:37,800
We're going to use the dendrogram
to find the optimal number of clusters.

77
00:03:38,133 --> 00:03:41,200
And you're going to see what a dendrogram
looks like in R.

78
00:03:41,633 --> 00:03:44,400
Thank you for watching this video
and I look forward to seeing you

79
00:03:44,400 --> 00:03:47,400
in the next tutorial.