1 00:00:00,300 --> 00:00:02,900 Hello and welcome to this R tutorial. 2 00:00:02,900 --> 00:00:06,866 So in the previous tutorials we solved our business problem in Python 3 00:00:06,866 --> 00:00:08,833 using hierarchical clustering. 4 00:00:08,833 --> 00:00:11,033 And this time we're going to solve it in R. 5 00:00:11,033 --> 00:00:12,900 And you're going to see that it's exactly the same. 6 00:00:12,900 --> 00:00:15,566 We are going to import our mall dataset first. 7 00:00:15,566 --> 00:00:18,833 Then we're going to use the dendrogram to find the optimal number of clusters. 8 00:00:19,300 --> 00:00:22,366 Then we will fit hierarchical clustering to our mall dataset. 9 00:00:22,633 --> 00:00:25,633 And then finally we will visualize our results. 10 00:00:25,633 --> 00:00:26,633 So in this tutorial 11 00:00:26,633 --> 00:00:29,800 we're going to do the first step, which is to import the mall dataset. 12 00:00:29,866 --> 00:00:32,433 So let's start doing it right now. 13 00:00:32,433 --> 00:00:35,700 But before that, let's not forget to set our working directory. 14 00:00:36,066 --> 00:00:38,133 So here I'm on my desktop. 15 00:00:38,133 --> 00:00:39,900 This is my Machine Learning A-Z folder. 16 00:00:39,900 --> 00:00:44,900 Let's open it, then let's go to Part 3 Clustering, then Hierarchical Clustering. 17 00:00:45,300 --> 00:00:46,766 And now we click on this More button. 18 00:00:46,766 --> 00:00:49,300 Here we click on Set As Working Directory. 19 00:00:49,300 --> 00:00:53,233 And that sets our Hierarchical Clustering folder as the working directory. 20 00:00:53,666 --> 00:00:56,666 So let's make sure we have our mall dataset in the folder. 21 00:00:56,800 --> 00:00:59,666 Here it is, perfect. We are ready to start. 22 00:00:59,666 --> 00:01:03,000 Okay, so let's introduce a new section with the comment 23 00:01:03,000 --> 00:01:04,433 Importing the mall dataset. 24 00:01:06,300 --> 00:01:07,000 Here we go. 
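As a side note to the working-directory step just described, the same check can be done from the R console rather than the Files pane. This is only a minimal sketch: it does not set the directory (the folder path on the presenter's machine is not given in the video), it just reports where R is currently pointing and which CSV files it can see there. The variable name `csv_files` is made up for this illustration.

```r
# Print the current working directory so it can be compared with the
# folder chosen through More > Set As Working Directory
print(getwd())

# List the CSV files R can see in that directory; the dataset file
# should show up here before read.csv() is called on it
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)
```

If the dataset file is not in the printed list, the working directory is probably not the Hierarchical Clustering folder yet.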
25 00:01:07,000 --> 00:01:09,300 And now let's import our dataset. 26 00:01:09,300 --> 00:01:10,900 So we create this new variable. 27 00:01:10,900 --> 00:01:14,533 dataset equals read.csv. 28 00:01:15,000 --> 00:01:19,200 And in parentheses we put the name of our dataset, mall.csv, in quotes. 29 00:01:19,833 --> 00:01:22,833 Okay, so let's select this line and execute. 30 00:01:22,833 --> 00:01:25,533 And now our dataset appears in Data. 31 00:01:25,533 --> 00:01:28,500 So let's click on it. And here it is. 32 00:01:28,500 --> 00:01:31,366 So for those of you who didn't follow the Python tutorials, 33 00:01:31,366 --> 00:01:34,233 I'll just give a quick reminder of what this dataset is about. 34 00:01:34,233 --> 00:01:38,366 So basically this is information about customers of a mall, which are customers 35 00:01:38,366 --> 00:01:42,866 that not only subscribe to the membership card, but also come often to the mall, 36 00:01:43,266 --> 00:01:46,966 and the mall gathered some information on 200 of these customers: 37 00:01:47,600 --> 00:01:50,600 their gender, their age, their annual income. 38 00:01:50,733 --> 00:01:54,100 And then for each of these customers, they computed a spending score. 39 00:01:54,433 --> 00:01:57,700 So this spending score takes values between 1 and 100. 40 00:01:58,133 --> 00:02:01,933 And the closer the spending score is to 1, the less the customer spends, 41 00:02:02,233 --> 00:02:05,933 and the closer the spending score is to 100, the more the customer spends. 42 00:02:06,366 --> 00:02:08,600 Okay. So we have this information. 43 00:02:08,600 --> 00:02:12,100 And now our mission is to find some groups of customers. 
44 00:02:12,266 --> 00:02:15,333 But since we have no idea of what kind of groups we're looking for, 45 00:02:15,600 --> 00:02:18,600 or even the number of groups of customers we're looking for, 46 00:02:18,700 --> 00:02:21,433 this is specifically what makes this business problem 47 00:02:21,433 --> 00:02:24,366 a clustering problem, because we don't know the answers. 48 00:02:24,366 --> 00:02:26,100 We don't know the final result. 49 00:02:26,100 --> 00:02:29,600 And more precisely, we don't know the final categories of our customers. 50 00:02:30,300 --> 00:02:32,500 Okay. So we imported our dataset. 51 00:02:32,500 --> 00:02:37,133 And now what we have to do is to prepare our data, because we want to do this 52 00:02:37,133 --> 00:02:40,566 clustering only based on the annual income and the spending score. 53 00:02:41,000 --> 00:02:43,600 So let's create a new variable x 54 00:02:43,600 --> 00:02:46,100 equals dataset. 55 00:02:46,100 --> 00:02:49,266 And then in square brackets we're going to put the two indexes 56 00:02:49,266 --> 00:02:51,666 of our columns of interest, which are... 57 00:02:51,666 --> 00:02:52,566 Let's see. 58 00:02:52,566 --> 00:02:54,900 Let's go back to our dataset. Indexes 59 00:02:54,900 --> 00:02:56,466 in R start at one. 60 00:02:56,466 --> 00:02:59,666 So Customer ID is index one, Gender is index two, 61 00:02:59,700 --> 00:03:01,200 Age is index three, 62 00:03:01,200 --> 00:03:04,200 Annual Income is index four and Spending Score is index five. 63 00:03:04,366 --> 00:03:04,766 Okay. 64 00:03:04,766 --> 00:03:08,166 So here in the square brackets we add 4:5. 65 00:03:08,400 --> 00:03:11,200 That takes our columns Annual Income and Spending Score. 66 00:03:11,200 --> 00:03:14,200 And now let's select this line of code and execute it. 67 00:03:14,400 --> 00:03:15,600 And here it is. 68 00:03:15,600 --> 00:03:17,933 Our x variable appears in Data. 69 00:03:17,933 --> 00:03:20,933 Let's click on it to make sure everything is fine. 
70 00:03:20,933 --> 00:03:21,800 Okay perfect. 71 00:03:21,800 --> 00:03:26,100 We have our two columns Annual Income and Spending Score and our 200 observations. 72 00:03:27,166 --> 00:03:27,600 Perfect. 73 00:03:27,600 --> 00:03:29,466 So we completed our first step. 74 00:03:29,466 --> 00:03:31,366 So that's the end of this tutorial. 75 00:03:31,366 --> 00:03:34,366 And in the next tutorial things are going to get more interesting. 76 00:03:34,500 --> 00:03:37,800 We're going to use the dendrogram to find the optimal number of clusters. 77 00:03:38,133 --> 00:03:41,200 And you're going to see what a dendrogram looks like in R. 78 00:03:41,633 --> 00:03:44,400 Thank you for watching this video and I look forward to seeing you 79 00:03:44,400 --> 00:03:47,400 in the next tutorial.
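The two steps covered in this tutorial (reading the CSV, then keeping columns 4 and 5) can be sketched in R as below. This is an illustration under stated assumptions, not the presenter's exact script: so that it runs without the course files, it first writes a tiny stand-in CSV with three made-up rows to a temporary file. The values, the column names (`CustomerID`, `Gender`, `Age`, `AnnualIncome`, `SpendingScore`) and the names `sample_rows` and `csv_path` are all hypothetical; only the five-column layout and the `read.csv` and `[4:5]` calls come from the video.

```r
# Hypothetical stand-in for the mall dataset: three made-up rows in the
# five-column layout described in the video (column names are assumed)
sample_rows <- data.frame(
  CustomerID = 1:3,
  Gender = c("Male", "Female", "Female"),
  Age = c(19, 21, 35),
  AnnualIncome = c(15, 16, 60),
  SpendingScore = c(39, 81, 50)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

# Importing the dataset (in the tutorial, the quoted name of the mall
# CSV in the working directory goes here instead of csv_path)
dataset = read.csv(csv_path)

# R indexes start at 1, so 4:5 selects the fourth and fifth columns:
# annual income and spending score
x = dataset[4:5]

# Check the result: the number of rows is unchanged and two columns remain
print(dim(x))
print(names(x))
```

With the real dataset in place of the stand-in, the same check should report 200 observations and two columns, matching what is shown in the video.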