1 00:00:00,066 --> 00:00:00,966 Hello my friends, 2 00:00:00,966 --> 00:00:04,766 and welcome to the practical activity of this new part, 3 00:00:04,800 --> 00:00:08,200 Part 4, Clustering, where we're going to build two 4 00:00:08,233 --> 00:00:11,933 clustering models: K-means and hierarchical clustering. 5 00:00:12,233 --> 00:00:13,566 And of course we're going to start with 6 00:00:13,566 --> 00:00:17,233 K-means, which is actually the most popular model in clustering. 7 00:00:17,366 --> 00:00:20,800 And indeed we will see together that it provides fantastic results. 8 00:00:21,100 --> 00:00:24,400 So you just saw the intuition lectures with Kirill. 9 00:00:24,400 --> 00:00:27,400 And now we're going to put that theory into practice 10 00:00:27,466 --> 00:00:32,066 by building this K-means clustering model in both Python and R. 11 00:00:32,433 --> 00:00:34,800 And now we should all be on the same page. 12 00:00:34,800 --> 00:00:38,800 And therefore we're going to go to this folder here, Part 4, Clustering. 13 00:00:39,066 --> 00:00:42,100 And then we will attack K-means clustering. 14 00:00:42,400 --> 00:00:44,366 And we're going to start with Python, of course. 15 00:00:44,366 --> 00:00:47,300 And this is your folder containing two files. 16 00:00:47,300 --> 00:00:52,133 First, the K-means clustering implementation in the .ipynb format, 17 00:00:52,133 --> 00:00:56,700 which therefore you can open with either Google Colaboratory or Jupyter Notebook. 18 00:00:57,000 --> 00:01:02,300 And then you have Mall_Customers.csv, which is the CSV file, 19 00:01:02,300 --> 00:01:05,666 you know, the dataset with which we will work in this section 20 00:01:05,866 --> 00:01:08,866 to build our K-means clustering model. 21 00:01:09,000 --> 00:01:09,300 All right. 22 00:01:09,300 --> 00:01:13,500 So first step, as usual, I will explain what this dataset is about, 23 00:01:13,700 --> 00:01:16,966 which will allow me to explain the purpose of this mission.
24 00:01:17,133 --> 00:01:20,400 You know, why we want to build the K-means algorithm and what for. 25 00:01:20,666 --> 00:01:24,233 And then we'll start, of course, our implementation from scratch, 26 00:01:24,266 --> 00:01:29,066 step by step, and you'll take action with me to build the K-means algorithm. 27 00:01:29,266 --> 00:01:30,000 All right. 28 00:01:30,000 --> 00:01:32,066 So what is this dataset about? 29 00:01:32,066 --> 00:01:36,833 Well, as you can see by the title of this dataset, Mall_Customers, 30 00:01:37,033 --> 00:01:41,866 well, it's actually a dataset made by a mall, you know, the strategic 31 00:01:41,866 --> 00:01:46,433 team, let's say, of a mall that collected some data about its customers. 32 00:01:46,433 --> 00:01:48,600 So here it's important to see it this way. 33 00:01:48,600 --> 00:01:52,666 Each row corresponds to a customer of the mall. 34 00:01:52,833 --> 00:01:55,200 And for each of these customers of the mall, 35 00:01:55,200 --> 00:01:58,833 well, the data analysts of this team gathered the following information. 36 00:01:58,866 --> 00:02:04,566 First the customer ID, then the gender (male or female), then the age, the annual income. 37 00:02:04,566 --> 00:02:06,700 And let's expand this. 38 00:02:06,700 --> 00:02:10,433 Well, I can't do it here, but that last variable is the spending score. 39 00:02:10,433 --> 00:02:13,566 And it can take values between 1 and 100. 40 00:02:13,833 --> 00:02:16,733 So all these features are pretty clear. 41 00:02:16,733 --> 00:02:18,766 Let me explain what this one means. 42 00:02:18,766 --> 00:02:22,333 The spending score is a metric made by the mall 43 00:02:22,500 --> 00:02:26,166 to measure, you know, how much each customer spends. 44 00:02:26,166 --> 00:02:29,833 And so they made this metric, which takes values from 1 to 100.
45 00:02:29,833 --> 00:02:34,066 You know, that's the scale of the metric, such that, well, the lower the score, 46 00:02:34,066 --> 00:02:37,300 the less the customer spends, and the higher the score, the more 47 00:02:37,300 --> 00:02:38,233 the customer spends. 48 00:02:38,233 --> 00:02:41,666 You know, in a certain period of time, let's say in the past year. 49 00:02:41,700 --> 00:02:42,300 Okay. 50 00:02:42,300 --> 00:02:45,800 So for example, this customer actually spends 51 00:02:45,800 --> 00:02:49,433 a lot in this mall, you know, because he has a score of 81. 52 00:02:49,666 --> 00:02:52,800 However, this customer spends very little 53 00:02:52,800 --> 00:02:55,733 in the mall because she has a score of 6. 54 00:02:55,733 --> 00:02:56,300 All right. 55 00:02:56,300 --> 00:03:00,000 So that's just a metric measuring the spending of each customer. 56 00:03:00,400 --> 00:03:03,100 And so now, what is the purpose of this mission? 57 00:03:03,100 --> 00:03:07,100 What did this strategic team or analytics team want to do? 58 00:03:07,466 --> 00:03:10,533 Well, as you might guess, since right now we're doing clustering, 59 00:03:10,766 --> 00:03:15,233 this team wants to very simply understand its customers. 60 00:03:15,233 --> 00:03:18,600 You know, they want to identify some patterns 61 00:03:18,766 --> 00:03:21,833 within their customers, within their base of customers. 62 00:03:22,266 --> 00:03:24,433 And that's the key thing to understand here. 63 00:03:24,433 --> 00:03:28,900 You know, when doing clustering this time, as opposed to, you know, previously 64 00:03:28,900 --> 00:03:33,733 with regression and classification, where we actually knew what to predict, 65 00:03:34,000 --> 00:03:37,766 well, this time we actually have no idea what to predict. 66 00:03:38,066 --> 00:03:41,566 But even though we don't know what specifically to predict, 67 00:03:41,700 --> 00:03:45,500 we still know that we want to identify some patterns.
68 00:03:45,500 --> 00:03:47,533 And that's the why of this mission. 69 00:03:47,533 --> 00:03:49,200 You know, the purpose of this mission. 70 00:03:49,200 --> 00:03:49,600 Okay. 71 00:03:49,600 --> 00:03:51,666 So it's good we understand the why. 72 00:03:51,666 --> 00:03:55,566 And now let's understand the how: how are we going to identify 73 00:03:55,600 --> 00:03:56,700 such patterns? 74 00:03:56,700 --> 00:03:59,166 Well, we will do this with K-means, of course. 75 00:03:59,166 --> 00:04:02,266 And more specifically, what we will do is 76 00:04:02,266 --> 00:04:05,333 we will create a dependent variable, right? 77 00:04:05,333 --> 00:04:09,566 We will create a dependent variable which will take a finite number of values. 78 00:04:09,566 --> 00:04:12,066 You know, let's say 4 or 5 values. 79 00:04:12,066 --> 00:04:15,733 And actually each of the values will be a class 80 00:04:15,733 --> 00:04:18,733 of this dependent variable we're going to create. 81 00:04:18,733 --> 00:04:20,966 And that's exactly what clustering means, 82 00:04:20,966 --> 00:04:25,266 you know, technically, in the details. If you want to be broad on how to explain 83 00:04:25,266 --> 00:04:29,000 clustering, you would say that we are identifying some patterns in the data. 84 00:04:29,166 --> 00:04:33,133 But if you want to clearly explain how to identify these patterns in the data, 85 00:04:33,333 --> 00:04:37,033 well, you would say that we are building a dependent variable. 86 00:04:37,033 --> 00:04:38,666 You know, we are creating it 87 00:04:38,666 --> 00:04:42,733 in such a way that the values of this future dependent variable 88 00:04:42,733 --> 00:04:46,800 we are creating are actually the classes of this dependent variable. 89 00:04:47,100 --> 00:04:47,700 All right. 90 00:04:47,700 --> 00:04:51,966 So this will become much more clear once, you know, we build our K-means 91 00:04:51,966 --> 00:04:55,233 algorithm and we get that dependent variable we are creating.
92 00:04:55,400 --> 00:04:58,966 But please remember this: we are creating a dependent variable.
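
To make the idea of "creating a dependent variable" concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation: the six customers are made-up values, the `kmeans` helper is hypothetical, and the starting centroids are picked at evenly spaced positions in the list so the run is reproducible (real K-means usually starts from random or k-means++ centroids). The point is only that the output, `labels`, is a new variable with one class per customer.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=10):
    """Hand-rolled K-means sketch; returns (labels, centroids)."""
    # Assumption: deterministic start from evenly spaced points,
    # used here only to keep the example reproducible.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(pt, centroids[c]))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical customers: (annual income in k$, spending score from 1 to 100).
customers = [(15, 81), (16, 77), (17, 76), (85, 10), (88, 6), (90, 13)]

labels, centroids = kmeans(customers, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each entry of `labels` is the class assigned to the customer in the same position, which is exactly the dependent variable the clustering creates. The data never contained such a column; the algorithm produced it.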