1 00:00:00,066 --> 00:00:04,733 Hello my friends, and welcome to the final section of Part 3, Classification, 2 00:00:04,866 --> 00:00:08,400 where we're going to answer together a very important question. 3 00:00:08,400 --> 00:00:12,233 You know, one of the most frequently asked questions in the data science community, 4 00:00:12,466 --> 00:00:15,966 which is: which classification model should I select? 5 00:00:15,966 --> 00:00:18,900 You know, which one should I choose for my data set? 6 00:00:18,900 --> 00:00:23,866 And the goal of this tutorial is to show you how, with any data set, 7 00:00:23,866 --> 00:00:26,000 you know, regardless of the number of features 8 00:00:26,000 --> 00:00:29,700 you have in the data set, you can quickly 9 00:00:29,700 --> 00:00:32,700 and efficiently select the best classification model. 10 00:00:33,133 --> 00:00:33,500 All right. 11 00:00:33,500 --> 00:00:36,700 So that's why here we're back in our Machine Learning 12 00:00:36,700 --> 00:00:38,633 A-Z model selection folder, 13 00:00:38,633 --> 00:00:41,866 you know, which is a separate folder compared to the whole Machine 14 00:00:41,866 --> 00:00:45,200 Learning A-Z folder with all the codes and data sets, to figure out 15 00:00:45,200 --> 00:00:48,900 how we are going to select the best classification model. 16 00:00:49,500 --> 00:00:49,800 All right. 17 00:00:49,800 --> 00:00:53,900 So here we are in the classification folder of our big model selection folder. 18 00:00:54,166 --> 00:00:58,433 And as you can recognize, in this folder we have all the classification models 19 00:00:58,433 --> 00:01:02,833 that we implemented together throughout this Part 3. You have all of them. 20 00:01:02,866 --> 00:01:05,633 However, I slightly modified them. 
21 00:01:05,633 --> 00:01:08,733 But the only thing I did, you know, with respect to what we did before, 22 00:01:09,033 --> 00:01:12,533 is that I removed all the prints, you know, to 23 00:01:12,533 --> 00:01:16,166 lighten the implementation so that we can see more clearly. 24 00:01:16,433 --> 00:01:20,233 And also, of course, you know, at the end, I removed the two cells 25 00:01:20,233 --> 00:01:23,266 where we visualized the training set and test set results. 26 00:01:23,266 --> 00:01:23,566 Right. 27 00:01:23,566 --> 00:01:28,566 Because remember, these visualizations only work when you have two features, and here, 28 00:01:28,700 --> 00:01:32,433 as you can see, I took a classic data set with many features. 29 00:01:32,433 --> 00:01:34,000 You can see all of them here. 30 00:01:34,000 --> 00:01:35,400 So these are all the features. 31 00:01:35,400 --> 00:01:37,200 And this is the dependent variable. 32 00:01:37,200 --> 00:01:40,533 But you can see this data set as a generic data set 33 00:01:40,533 --> 00:01:43,833 containing many features, all with numerical values. 34 00:01:43,833 --> 00:01:44,100 Right. 35 00:01:44,100 --> 00:01:47,100 We won't do any kind of specific data preprocessing. 36 00:01:47,233 --> 00:01:52,200 And, indeed, it has a binary dependent variable taking the values 2 or 4. 37 00:01:52,233 --> 00:01:52,633 All right. 38 00:01:52,633 --> 00:01:54,000 So since we have 39 00:01:54,000 --> 00:01:57,166 the data set in front of us, well, let me explain what this is about. 40 00:01:57,166 --> 00:02:00,333 Even if, you know, it doesn't really matter, because the goal of this tutorial 41 00:02:00,333 --> 00:02:03,366 is just to explain how to efficiently deploy 42 00:02:03,366 --> 00:02:06,566 all your classification models and quickly figure out which is the best one 43 00:02:06,766 --> 00:02:09,933 on any data set, regardless of the number of features. 44 00:02:09,933 --> 00:02:11,933 But let me still explain what this is about. 
45 00:02:11,933 --> 00:02:15,866 So this is a classic data set which belongs to the UCI 46 00:02:15,866 --> 00:02:19,466 Machine Learning Repository and which is about breast cancer. 47 00:02:19,766 --> 00:02:23,966 So in this data set, each row corresponds to a patient, 48 00:02:24,000 --> 00:02:25,733 you know, different patients here. 49 00:02:25,733 --> 00:02:30,133 And for each of these patients we gathered, well, first the sample code number, 50 00:02:30,433 --> 00:02:35,233 the clump thickness, the uniformity of cell size, 51 00:02:35,500 --> 00:02:39,200 the uniformity of cell shape, the marginal adhesion, 52 00:02:39,200 --> 00:02:42,766 the single epithelial cell size, the bare nuclei, 53 00:02:42,800 --> 00:02:47,200 the bland chromatin, the normal nucleoli, and the mitoses. 54 00:02:47,200 --> 00:02:48,000 Okay. 55 00:02:48,000 --> 00:02:51,633 And all these variables are the features, you know, from sample code number, 56 00:02:51,633 --> 00:02:54,900 even if that's not really a feature, up to mitoses. 57 00:02:55,033 --> 00:02:59,366 And with all these features, we are predicting the class, which tells 58 00:02:59,366 --> 00:03:05,333 for each patient if the tumor is benign, in which case class takes the value 2, 59 00:03:05,533 --> 00:03:09,300 or malignant, in which case class takes the value 4. 60 00:03:09,633 --> 00:03:11,666 All right. So that's what the data set is about. 61 00:03:11,666 --> 00:03:16,033 You can find it on the UCI ML Repository by the name Breast Cancer. 62 00:03:16,033 --> 00:03:17,866 And you can take the original version. 63 00:03:17,866 --> 00:03:20,733 But really, don't worry about all these features, 64 00:03:20,733 --> 00:03:23,866 because, you know, most of us don't understand what they mean. 65 00:03:23,866 --> 00:03:27,133 You know, we're not doctors here, but we are data scientists. 
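[Editor's sketch] The layout described above (an identifier column, nine numerical features, and a class equal to 2 or 4) can be illustrated with a minimal Python sketch using only the standard library. The column names follow the UCI listing; the three data rows are made-up placeholders, not real patient records, and dropping the sample code number is an assumption based on the remark that it is not really a feature.

```python
import csv
import io

# Column names as listed for the UCI breast cancer (original) data set.
# The three rows are made-up placeholder values, NOT real patient records.
SAMPLE_CSV = """Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
1000001,5,1,1,1,2,1,3,1,1,2
1000002,8,10,10,8,7,10,9,7,1,4
1000003,3,1,1,1,2,2,3,1,1,2
"""

# Meaning of the two class values, as stated in the transcript.
CLASS_LABELS = {2: "benign", 4: "malignant"}


def split_features_and_class(csv_text):
    """Return (feature_names, X, y) from CSV text with a header row.

    Assumption: the first column (sample code number) is an identifier
    and is dropped; the last column is the class to predict.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    feature_names = header[1:-1]
    X = [[int(value) for value in row[1:-1]] for row in data]
    y = [int(row[-1]) for row in data]
    return feature_names, X, y


feature_names, X, y = split_features_and_class(SAMPLE_CSV)
print(len(feature_names), "features per patient")
print([CLASS_LABELS[label] for label in y])
```

Any classifier that accepts a numerical feature matrix and a label vector could then be trained on `X` and `y`, which is why the number of features does not matter for the templates.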
66 00:03:27,133 --> 00:03:31,733 And even if we don't have the domain knowledge here of oncology, 67 00:03:31,733 --> 00:03:34,266 you know, cancer medicine, well, that's still fine, 68 00:03:34,266 --> 00:03:37,500 because we can still build classification models to understand 69 00:03:37,666 --> 00:03:40,666 the correlations between all these features here 70 00:03:40,866 --> 00:03:44,400 and the dependent variable class, which we want to predict, telling 71 00:03:44,400 --> 00:03:48,533 if the tumor of each of these patients is benign or malignant. 72 00:03:48,866 --> 00:03:49,600 All right. 73 00:03:49,600 --> 00:03:52,400 And so we're going to use this data set to deploy 74 00:03:52,400 --> 00:03:54,600 all our classification models in a flash. 75 00:03:54,600 --> 00:03:56,300 You know, in a matter of seconds. 76 00:03:56,300 --> 00:03:59,833 And after just a few clicks we will be able to figure out 77 00:03:59,833 --> 00:04:03,166 what is the best classification model for this data set. 78 00:04:03,333 --> 00:04:04,600 All right. Great. 79 00:04:04,600 --> 00:04:06,566 So let's do this. Let's close this. 80 00:04:06,566 --> 00:04:10,466 And now, before we start the demo, there is one thing to do. 81 00:04:10,500 --> 00:04:13,966 Because, you know, this is a Google Drive folder to which 82 00:04:13,966 --> 00:04:17,633 all of you have access, and therefore you can't modify it, obviously. 83 00:04:17,633 --> 00:04:20,866 And so what you have to do in order to modify these cells, you know, 84 00:04:20,900 --> 00:04:22,333 because we will have to enter 85 00:04:22,333 --> 00:04:25,333 the name of the data set, because these are all code templates, 86 00:04:25,533 --> 00:04:28,166 well, in order to modify these cells you need to create a copy. 87 00:04:28,166 --> 00:04:30,233 So that's the first thing we'll do here. 88 00:04:30,233 --> 00:04:31,400 Let's do this quickly. 
89 00:04:31,400 --> 00:04:36,433 You know, you just need to right-click and then make a copy for each of them. 90 00:04:36,866 --> 00:04:37,400 All right. 91 00:04:38,600 --> 00:04:42,233 Then Kernel SVM, make a copy. Logistic regression. 92 00:04:42,366 --> 00:04:44,433 So you see, it's pretty fast. Sorry about that. 93 00:04:44,433 --> 00:04:47,266 But at least it only takes a few seconds. 94 00:04:47,266 --> 00:04:51,500 And then you'll get all your copies in case, you know, you want to modify them. 95 00:04:51,500 --> 00:04:53,566 But I recommend it. 96 00:04:53,566 --> 00:04:53,900 All right. 97 00:04:53,900 --> 00:04:57,966 Then your copies will naturally go to your main Drive or, 98 00:04:57,966 --> 00:05:01,300 you know, into the Colab Notebooks folder. Here, 99 00:05:01,300 --> 00:05:03,600 they just went into My Drive, so all good. 100 00:05:03,600 --> 00:05:05,133 Now we're going to open them all. 101 00:05:05,133 --> 00:05:08,133 So starting with the last one, random forest. 102 00:05:08,233 --> 00:05:11,300 All right, then we're going to open 103 00:05:11,600 --> 00:05:14,733 the decision tree classification. Open. 104 00:05:15,066 --> 00:05:15,333 All right. 105 00:05:15,333 --> 00:05:18,333 You can also open it with Jupyter Notebook if you want. 106 00:05:18,366 --> 00:05:18,866 Right. 107 00:05:18,866 --> 00:05:21,866 Then we're going to open Naive Bayes. 108 00:05:22,000 --> 00:05:26,233 All right, then we're going to open kernel SVM. 109 00:05:27,166 --> 00:05:28,433 Perfect. 110 00:05:28,433 --> 00:05:31,200 Then we're going to open SVM. 111 00:05:31,200 --> 00:05:32,466 Where is it? Right here. 112 00:05:32,466 --> 00:05:35,600 Support vector machine. Open. Then 113 00:05:36,633 --> 00:05:39,633 we're going to open the K-nearest neighbors. 114 00:05:40,366 --> 00:05:41,366 All right. 
115 00:05:41,366 --> 00:05:44,900 And finally, we're going to open logistic regression. 116 00:05:45,166 --> 00:05:48,533 I went from the last to the first because, as you can see, this way 117 00:05:48,700 --> 00:05:51,700 we now have all the files in the correct order.
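
[Editor's sketch] The goal stated in this section, running every classification model on the same data set and keeping the best one, can be sketched in a few lines of Python. This section does not name the comparison criterion, so test-set accuracy is assumed here. The model names `model_a`, `model_b`, `model_c` and all label lists are hypothetical placeholders, not results from the course notebooks.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


def best_model(y_true, predictions_by_model):
    """Score every model on the same test labels and return the top one.

    Assumption: accuracy is the selection criterion; ties go to the
    model that appears first in the dictionary.
    """
    scores = {
        name: accuracy(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


# Placeholder test labels (2 = benign, 4 = malignant) and placeholder
# predictions from three hypothetical models; not real results.
y_test = [2, 4, 2, 2, 4, 2]
predictions = {
    "model_a": [2, 4, 2, 2, 2, 2],
    "model_b": [2, 4, 2, 2, 4, 2],
    "model_c": [4, 4, 2, 2, 2, 2],
}

winner, scores = best_model(y_test, predictions)
print(winner, round(scores[winner], 3))
```

In practice each entry of `predictions` would come from one of the copied notebooks, all run on the same train/test split so that the scores are comparable.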