1 00:00:00,066 --> 00:00:01,733 All right, my friends, let's do this. 2 00:00:01,733 --> 00:00:06,600 Let's use the elbow method to find the optimal number of clusters. 3 00:00:06,600 --> 00:00:11,500 So we're going to use, of course, the WCSS, which is the within-cluster 4 00:00:11,500 --> 00:00:12,400 sum of squares. 5 00:00:12,400 --> 00:00:13,700 I will remind you what this is. 6 00:00:13,700 --> 00:00:16,666 But first let's create a new code cell 7 00:00:16,666 --> 00:00:19,600 to start this new step of the implementation. 8 00:00:19,600 --> 00:00:21,866 All right. So what are we going to start with here? 9 00:00:21,866 --> 00:00:25,533 Well, we're going to come back to a very good friend, scikit-learn, 10 00:00:25,533 --> 00:00:31,033 because we will actually implement that elbow method with a scikit-learn 11 00:00:31,033 --> 00:00:34,700 class which, guess what, is called, well, KMeans, 12 00:00:35,033 --> 00:00:39,300 because indeed the way we will implement the elbow method will actually be 13 00:00:39,300 --> 00:00:44,100 by running the K-means algorithm with several numbers of clusters. 14 00:00:44,100 --> 00:00:47,200 So you see, we're going to run the K-means algorithm several 15 00:00:47,200 --> 00:00:50,433 times, each time with a different number of clusters. 16 00:00:50,433 --> 00:00:51,300 And that's why 17 00:00:51,300 --> 00:00:55,500 we have to call on that KMeans class, which can already run this algorithm. 18 00:00:55,933 --> 00:01:00,600 And so, well, our first step here will be to start from scikit-learn, 19 00:01:00,900 --> 00:01:04,900 from which we're going to get access to the module that contains 20 00:01:04,900 --> 00:01:08,633 that KMeans class, and that module is called cluster. 21 00:01:08,966 --> 00:01:09,900 Just like that. 22 00:01:09,900 --> 00:01:12,900 And then from which we are going to import 23 00:01:13,066 --> 00:01:16,466 that KMeans class. Perfect. 24 00:01:17,033 --> 00:01:19,800 And now what do you think the next step is going to be?
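The import described so far can be written as a single line. This is a minimal sketch that assumes scikit-learn is installed in the environment; note that the package is imported under the name `sklearn`.

```python
# KMeans lives in the cluster module of scikit-learn (imported as sklearn).
from sklearn.cluster import KMeans

# Mind the capital letters: the class is KMeans, not Kmeans or kmeans.
print(KMeans.__name__)
```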
25 00:01:19,800 --> 00:01:21,900 Well, exceptionally this time, 26 00:01:21,900 --> 00:01:22,733 the next step 27 00:01:22,733 --> 00:01:27,400 won't be to create an instance, or, you know, an object of this KMeans class, 28 00:01:27,800 --> 00:01:32,666 because we are about to start a for loop, which will run the K-means 29 00:01:32,666 --> 00:01:36,966 algorithm with ten different numbers of clusters. 30 00:01:36,966 --> 00:01:41,033 So very simply, we will run the K-means algorithm with one cluster, 31 00:01:41,033 --> 00:01:44,033 then with two clusters, three clusters, etc., 32 00:01:44,033 --> 00:01:48,933 up to ten clusters, and therefore the way to do this is through a loop. 33 00:01:48,933 --> 00:01:52,800 And we will do a for loop, because we know exactly the different numbers 34 00:01:52,800 --> 00:01:55,966 of clusters we want to try, which are from 1 to 10. 35 00:01:56,466 --> 00:01:59,800 And each time we run the K-means algorithm, you know, 36 00:01:59,800 --> 00:02:02,866 with these different numbers of clusters, well, we will compute, 37 00:02:02,866 --> 00:02:05,966 of course, you know, that famous metric in clustering, 38 00:02:06,100 --> 00:02:09,200 which is, as I told you at the beginning, the WCSS, 39 00:02:10,200 --> 00:02:11,000 the within-cluster 40 00:02:11,000 --> 00:02:13,800 sum of squares, which, I remind you, 41 00:02:13,800 --> 00:02:17,700 is defined as the sum of the squared distances 42 00:02:17,900 --> 00:02:21,000 between each observation point of a cluster 43 00:02:21,166 --> 00:02:24,466 and the centroid of that cluster. 44 00:02:24,833 --> 00:02:28,300 So we're going to compute that sum of these squared distances. 45 00:02:28,500 --> 00:02:34,200 And this is exactly what will be on the y axis in the graph of the elbow method. 46 00:02:34,233 --> 00:02:39,266 You know, remember what the graph in the elbow method contains on the x axis.
47 00:02:39,266 --> 00:02:42,600 Well, the different numbers of clusters we will try, from 1 to 10. 48 00:02:42,900 --> 00:02:47,300 And on the y axis it contains the WCSS computed 49 00:02:47,500 --> 00:02:50,300 for each of these numbers of clusters. 50 00:02:50,300 --> 00:02:51,166 And therefore, 51 00:02:51,166 --> 00:02:56,333 here what we have to do right before starting this for loop is to create a list 52 00:02:56,633 --> 00:03:01,200 which will, through the for loop, be populated with the successive 53 00:03:01,333 --> 00:03:04,766 WCSS values, you know, for each of the numbers of clusters, 54 00:03:05,066 --> 00:03:08,733 and therefore we're going to call that list wcss, 55 00:03:09,133 --> 00:03:12,300 which we will initialize as an empty list. 56 00:03:12,466 --> 00:03:17,066 Remember that lists in Python are written in a pair of square brackets. 57 00:03:17,066 --> 00:03:21,566 So here in this pair of square brackets we're going to add, one by one, the different 58 00:03:21,666 --> 00:03:26,400 WCSS values for each of the numbers of clusters. Okay. 59 00:03:26,566 --> 00:03:28,333 And now we can start the for loop. 60 00:03:28,333 --> 00:03:31,966 So the way to write a for loop in Python is to start with for. 61 00:03:32,333 --> 00:03:35,666 Then we choose the name of the iterated variable, 62 00:03:35,766 --> 00:03:40,633 which, you know, will be incremented by one in each iteration of the loop. 63 00:03:40,833 --> 00:03:43,833 And the classic name for that variable is i. 64 00:03:44,166 --> 00:03:46,733 And then we add in range. 65 00:03:46,733 --> 00:03:48,833 And here we specify in parentheses, 66 00:03:48,833 --> 00:03:54,600 well, the values we want this index of the loop to take over the iterations. 67 00:03:54,800 --> 00:03:58,800 And here that's very simple: i will actually take the different values 68 00:03:58,800 --> 00:04:03,366 of the numbers of clusters we want to try, which are from 1 to 10, 69 00:04:03,366 --> 00:04:04,833 included.
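As a reminder of what goes on the y axis, here is a rough sketch of the WCSS computed by hand in plain Python: for each cluster, take the squared distance from every point to the cluster centroid, then add everything up across clusters. The helper names `centroid` and `compute_wcss` and the toy points are made up for illustration and are not part of the video.

```python
def centroid(points):
    """Return the coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def compute_wcss(clusters):
    """Sum, over all clusters, the squared distances of points to their centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total


# Hypothetical toy data: two clusters of two points each.
toy_clusters = [
    [(0.0, 0.0), (2.0, 0.0)],  # centroid (1.0, 0.0), contributes 1 + 1
    [(5.0, 5.0), (5.0, 7.0)],  # centroid (5.0, 6.0), contributes 1 + 1
]
print(compute_wcss(toy_clusters))  # 4.0
```

With more clusters the points sit closer to their centroids, so this value shrinks, which is why the elbow plot slopes downward.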
70 00:04:04,833 --> 00:04:09,366 But remember, ranges in Python include the lower bound 71 00:04:09,366 --> 00:04:11,333 but exclude the upper bound. 72 00:04:11,333 --> 00:04:12,900 That's actually what we see here. 73 00:04:12,900 --> 00:04:15,133 You know, start defaults to zero, okay? 74 00:04:15,133 --> 00:04:16,966 So that's the default lower bound. 75 00:04:16,966 --> 00:04:19,766 And stop is omitted, right? 76 00:04:19,766 --> 00:04:21,100 It is excluded. 77 00:04:21,100 --> 00:04:23,566 So that's why I also really like Google Colab. 78 00:04:23,566 --> 00:04:26,800 You have all the info in this little help window. 79 00:04:26,966 --> 00:04:28,666 But I'm also here for the explanation. 80 00:04:28,666 --> 00:04:29,633 So there you go. 81 00:04:29,633 --> 00:04:33,333 The range we have to input here is from one, 82 00:04:33,533 --> 00:04:36,400 you know, the first number of clusters we will try, 83 00:04:36,400 --> 00:04:41,500 and then up to not ten but 11, because we want to include ten. 84 00:04:41,733 --> 00:04:45,133 And therefore we have to go up to 11, which is excluded. 85 00:04:45,333 --> 00:04:46,000 All right. 86 00:04:46,000 --> 00:04:49,300 And then we add just a little colon, just like that. 87 00:04:49,300 --> 00:04:51,900 And then we start the for loop. 88 00:04:51,900 --> 00:04:52,366 All right. 89 00:04:52,366 --> 00:04:55,033 And now can come the next natural step, 90 00:04:55,033 --> 00:04:58,133 you know, after we imported this class, the KMeans class. 91 00:04:58,133 --> 00:05:02,300 Because indeed now we can create our first KMeans object. 92 00:05:02,533 --> 00:05:05,500 Why do I say our first KMeans object? 93 00:05:05,500 --> 00:05:07,833 That's because, once again, you know, we are going to create 94 00:05:07,833 --> 00:05:13,366 ten different KMeans objects, one for each of these numbers of clusters from 1 to 10.
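The point about range bounds is easy to check in a cell. The sketch below shows why the loop goes up to 11 and sets up the empty list; the name `cluster_counts` is only there for illustration, and the loop body is left as a placeholder.

```python
# range(1, 11) includes the lower bound 1 and excludes the upper bound 11.
cluster_counts = list(range(1, 11))
print(cluster_counts)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# With a single argument, start defaults to 0 and stop is still excluded.
print(list(range(4)))  # [0, 1, 2, 3]

# Skeleton of the loop: an empty list, then one iteration per number of clusters.
wcss = []
for i in range(1, 11):
    pass  # placeholder: run K-means with i clusters and append its WCSS to wcss
```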
95 00:05:13,733 --> 00:05:17,866 So here we're creating the first K-means algorithm, which will be run 96 00:05:17,866 --> 00:05:22,266 with therefore one cluster, because i here starts at one. 97 00:05:22,500 --> 00:05:22,866 All right. 98 00:05:22,866 --> 00:05:26,966 So let's create our first KMeans object, which represents 99 00:05:26,966 --> 00:05:31,133 exactly the K-means algorithm, which will be run to identify, 100 00:05:31,166 --> 00:05:33,500 well, actually just one cluster. 101 00:05:33,500 --> 00:05:35,833 You see what I mean? And then i will be equal to two. 102 00:05:35,833 --> 00:05:40,100 So that new K-means algorithm will be run to identify two clusters. 103 00:05:40,266 --> 00:05:44,400 And then a new K-means algorithm will be run to identify three clusters, etc., 104 00:05:44,400 --> 00:05:46,466 up to ten. Okay, there you go. 105 00:05:46,466 --> 00:05:51,800 That's our first object, which we create by calling, of course, the KMeans class. 106 00:05:51,800 --> 00:05:54,733 Be careful with the capital letters in the KMeans class. 107 00:05:54,733 --> 00:05:58,866 We add some parentheses and now we input the arguments.
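Putting the pieces together, the loop might look roughly like the sketch below. The transcript stops before the arguments are given, so everything inside the parentheses here is an assumption: `n_clusters=i` follows from the explanation, while `n_init=10` and `random_state=42` are illustrative choices. The variable name `kmeans` and the synthetic dataset `X` are also made up for the example. The `inertia_` attribute is where scikit-learn stores the WCSS of a fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

wcss = []  # one WCSS value per number of clusters
for i in range(1, 11):  # i takes the values 1, 2, ..., 10
    # Assumed arguments; the video has not specified them at this point.
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares

print(len(wcss))
```

Plotting `range(1, 11)` on the x axis against `wcss` on the y axis then gives the elbow curve described earlier.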