1
00:00:00,066 --> 00:00:04,733
Hello my friends, and welcome to the final
section of Part 3, Classification,

2
00:00:04,866 --> 00:00:08,400
where we're going to answer together
a very important question.

3
00:00:08,400 --> 00:00:12,233
You know, one of the most frequently asked
questions in the data science community,

4
00:00:12,466 --> 00:00:15,966
which is which classification model
should I select?

5
00:00:15,966 --> 00:00:18,900
You know, which one should I choose for my data set?

6
00:00:18,900 --> 00:00:23,866
And the goal of this tutorial
is to show you how with any data set,

7
00:00:23,866 --> 00:00:26,000
you know,
regardless of the number of features

8
00:00:26,000 --> 00:00:29,700
you have in the data set, well,
I will show you how to select quickly

9
00:00:29,700 --> 00:00:32,700
and efficiently
the best classification model.

10
00:00:33,133 --> 00:00:33,500
All right.

11
00:00:33,500 --> 00:00:36,700
So that's why here
we're back in our Machine Learning

12
00:00:36,700 --> 00:00:38,633
A-Z model selection folder.

13
00:00:38,633 --> 00:00:41,866
You know which is a separate folder
compared to the whole Machine

14
00:00:41,866 --> 00:00:45,200
Learning A-Z folder, with all the code
and data sets to figure out

15
00:00:45,200 --> 00:00:48,900
how we are going to select
the best classification model.

16
00:00:49,500 --> 00:00:49,800
All right.

17
00:00:49,800 --> 00:00:53,900
So here we are in the classification
folder of our big model selection folder.

18
00:00:54,166 --> 00:00:58,433
And as you can recognize in this folder
we have all the classification models

19
00:00:58,433 --> 00:01:02,833
that we implemented together throughout
Part 3. You have all of them.

20
00:01:02,866 --> 00:01:05,633
However, I slightly modified them.

21
00:01:05,633 --> 00:01:08,733
But the only thing I did, you know
with respect to what we did before,

22
00:01:09,033 --> 00:01:12,533
is that I removed all the prints,
you know, to alleviate

23
00:01:12,533 --> 00:01:16,166
or lighten the implementation
so that we can see more clearly.

24
00:01:16,433 --> 00:01:20,233
And also, of course, you know, at the end,
I removed the two cells

25
00:01:20,233 --> 00:01:23,266
where we visualize the training set
and test set results.

26
00:01:23,266 --> 00:01:23,566
Right.

27
00:01:23,566 --> 00:01:28,566
Because remember, these visualizations only
work when you have two features and here,

28
00:01:28,700 --> 00:01:32,433
as you can see, I took a classic data
set with many features.

29
00:01:32,433 --> 00:01:34,000
You can see all of them here.

30
00:01:34,000 --> 00:01:35,400
So these are all the features.

31
00:01:35,400 --> 00:01:37,200
And this is the dependent variable.

32
00:01:37,200 --> 00:01:40,533
But you can see this
data set as a generic data set

33
00:01:40,533 --> 00:01:43,833
containing many features
all with numerical values.

34
00:01:43,833 --> 00:01:44,100
Right.

35
00:01:44,100 --> 00:01:47,100
We won't do any kind of specific data
preprocessing

36
00:01:47,233 --> 00:01:52,200
and, indeed, a binary dependent variable
taking values 2 or 4.

37
00:01:52,233 --> 00:01:52,633
All right.

38
00:01:52,633 --> 00:01:54,000
So since we have

39
00:01:54,000 --> 00:01:57,166
the data set in front of us,
well let me explain what this is about.

40
00:01:57,166 --> 00:02:00,333
Even if you know it doesn't really matter,
because the goal of this tutorial

41
00:02:00,333 --> 00:02:03,366
is just to explain
how to deploy efficiently

42
00:02:03,366 --> 00:02:06,566
all your classification models and quickly
figure out what is the best one

43
00:02:06,766 --> 00:02:09,933
on any data set,
regardless of the number of features.

44
00:02:09,933 --> 00:02:11,933
But let me still explain
what this is about.

45
00:02:11,933 --> 00:02:15,866
So this is a classic data set
which belongs to the UCI

46
00:02:15,866 --> 00:02:19,466
Machine Learning Repository
and which is about breast cancer.

47
00:02:19,766 --> 00:02:23,966
So in this data set,
each row corresponds to a patient,

48
00:02:24,000 --> 00:02:25,733
you know, different patients here.

49
00:02:25,733 --> 00:02:30,133
And for each of these patients we gathered
well, first, the sample code number,

50
00:02:30,433 --> 00:02:35,233
the clump thickness,
the uniformity of cell size,

51
00:02:35,500 --> 00:02:39,200
the uniformity of cell shape,
the marginal adhesion,

52
00:02:39,200 --> 00:02:42,766
the single epithelial cell size,
the bare nuclei,

53
00:02:42,800 --> 00:02:47,200
the bland chromatin, the normal nucleoli
and the mitoses.

54
00:02:47,200 --> 00:02:48,000
Okay.

55
00:02:48,000 --> 00:02:51,633
And all these variables are the features,
you know, from sample code number,

56
00:02:51,633 --> 00:02:54,900
even if that's not really a feature,
up to mitoses.

57
00:02:55,033 --> 00:02:59,366
And with all these features,
we are predicting the class, which tells us

58
00:02:59,366 --> 00:03:05,333
for each patient if the tumor is benign,
in which case class takes the value of two

59
00:03:05,533 --> 00:03:09,300
or malignant,
in which case class takes the value of four.
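
[Editor's sketch] The column layout just described can be illustrated in a few lines of Python. This is only a minimal sketch under stated assumptions: the two rows are made up (they are not real patient data), the header spellings follow the UCI attribute list and may differ from the course CSV, and the helper name load_features_and_labels is hypothetical. As in the narration, every column except the last is treated as a feature, and the last column is the class (2 for benign, 4 for malignant).

```python
import csv
import io

# Two made-up rows in the layout described above (NOT real patient data).
# Header spellings are an assumption based on the UCI attribute list.
SAMPLE_CSV = """Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
1,5,1,1,1,2,1,3,1,1,2
2,8,10,10,8,7,10,9,7,1,4
"""

# The dependent variable takes the value 2 (benign) or 4 (malignant).
CLASS_LABELS = {2: "benign", 4: "malignant"}


def load_features_and_labels(text):
    """Split each row into features (all columns but the last) and the class."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    X = [[int(value) for value in row[:-1]] for row in data]
    y = [int(row[-1]) for row in data]
    return header, X, y


header, X, y = load_features_and_labels(SAMPLE_CSV)
labels = [CLASS_LABELS[c] for c in y]
print(header[-1], y, labels)
```

Because the split is "everything but the last column" versus "the last column", the same helper would work on a data set with any number of features, which is the point made in the narration.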

60
00:03:09,633 --> 00:03:11,666
All right.
So that's what the data set is about.

61
00:03:11,666 --> 00:03:16,033
You can find it on the UCI
ML Repository under the name breast cancer.

62
00:03:16,033 --> 00:03:17,866
And you can take the original version.

63
00:03:17,866 --> 00:03:20,733
But really don't worry
about all these features

64
00:03:20,733 --> 00:03:23,866
because, you know, most of us
don't understand what they mean.

65
00:03:23,866 --> 00:03:27,133
You know, we're not doctors here,
but we are data scientists.

66
00:03:27,133 --> 00:03:31,733
And even if we don't understand
the domain knowledge here of oncology,

67
00:03:31,733 --> 00:03:34,266
you know, cancer medicine, well,
that's still fine,

68
00:03:34,266 --> 00:03:37,500
because we can still build
classification models to understand

69
00:03:37,666 --> 00:03:40,666
the correlations
between all these features here

70
00:03:40,866 --> 00:03:44,400
and the dependent variable class,
which we want to predict, telling

71
00:03:44,400 --> 00:03:48,533
if the tumor of each of these patients
is benign or malignant.

72
00:03:48,866 --> 00:03:49,600
All right.

73
00:03:49,600 --> 00:03:52,400
And so we're going to use this data
set to deploy

74
00:03:52,400 --> 00:03:54,600
all our classification models
in a flash.

75
00:03:54,600 --> 00:03:56,300
You know in a matter of seconds.

76
00:03:56,300 --> 00:03:59,833
And after just a few clicks
we will be able to figure out

77
00:03:59,833 --> 00:04:03,166
what is the best classification model
for this data set.
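
[Editor's sketch] One way to picture "figuring out the best model" is to compare a single score across all the models. The narration so far does not say which score is used, so test-set accuracy computed from a confusion matrix is an assumption here. The model names, the counts, and the helper names accuracy and best_model are all hypothetical placeholders, not results from the breast cancer data set.

```python
def accuracy(confusion_matrix):
    """Fraction of correct predictions: diagonal entries over all entries."""
    correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
    total = sum(sum(row) for row in confusion_matrix)
    return correct / total


def best_model(confusion_matrices):
    """Return the name of the highest-accuracy model and all the scores."""
    scores = {name: accuracy(cm) for name, cm in confusion_matrices.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder confusion matrices, invented purely for illustration.
example = {
    "model_a": [[50, 5], [5, 40]],
    "model_b": [[52, 3], [4, 41]],
}

best_name, scores = best_model(example)
print(best_name, scores)
```

In practice each confusion matrix would come from running one of the copied notebooks on the same test set, so that the scores are comparable.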

78
00:04:03,333 --> 00:04:04,600
All right. Great.

79
00:04:04,600 --> 00:04:06,566
So let's do this. Let's close this.

80
00:04:06,566 --> 00:04:10,466
And now what we're going to do
in order to start the demo

81
00:04:10,500 --> 00:04:13,966
is because you know,
this is a Google Drive folder to which

82
00:04:13,966 --> 00:04:17,633
all of you have access and therefore
you can't modify it, obviously.

83
00:04:17,633 --> 00:04:20,866
And so what you have to do
in order to modify these cells, you know,

84
00:04:20,900 --> 00:04:22,333
because we will have to enter

85
00:04:22,333 --> 00:04:25,333
the name of the data set
because these are all code templates.

86
00:04:25,533 --> 00:04:28,166
In order to modify these cells
you need to create a copy.

87
00:04:28,166 --> 00:04:30,233
So that's the first thing we'll do here.

88
00:04:30,233 --> 00:04:31,400
Let's do this quickly.

89
00:04:31,400 --> 00:04:36,433
You know you just need to do right click
and then make a copy for each of them.

90
00:04:36,866 --> 00:04:37,400
All right

91
00:04:38,600 --> 00:04:42,233
then kernel
SVM, make a copy; logistic regression.

92
00:04:42,366 --> 00:04:44,433
So you see it's pretty fast.
Sorry about that.

93
00:04:44,433 --> 00:04:47,266
But at least it only takes a few seconds.

94
00:04:47,266 --> 00:04:51,500
And then you'll get all your copies
in case you know you want to modify them.

95
00:04:51,500 --> 00:04:53,566
But I recommend you do.

96
00:04:53,566 --> 00:04:53,900
All right.

97
00:04:53,900 --> 00:04:57,966
Then your copies would go naturally
to your main drive or,

98
00:04:57,966 --> 00:05:01,300
you know, in the Colab notebooks
folder here.

99
00:05:01,300 --> 00:05:03,600
They just went into my drive, so all good.

100
00:05:03,600 --> 00:05:05,133
Now we're going to open them all.

101
00:05:05,133 --> 00:05:08,133
So starting with the last one,
random forest.

102
00:05:08,233 --> 00:05:11,300
All right then we're going to open

103
00:05:11,600 --> 00:05:14,733
the decision tree classification, open.

104
00:05:15,066 --> 00:05:15,333
All right.

105
00:05:15,333 --> 00:05:18,333
You can open it with Jupyter Notebook
also if you want.

106
00:05:18,366 --> 00:05:18,866
Right.

107
00:05:18,866 --> 00:05:21,866
Then we're going to open Naive Bayes.

108
00:05:22,000 --> 00:05:26,233
All right
then we're going to open kernel SVM.

109
00:05:27,166 --> 00:05:28,433
Perfect.

110
00:05:28,433 --> 00:05:31,200
Then we're going to open SVM.

111
00:05:31,200 --> 00:05:32,466
Where is it? Right here.

112
00:05:32,466 --> 00:05:35,600
Support vector machine, open. Then

113
00:05:36,633 --> 00:05:39,633
we're
going to open the K-nearest neighbors.

114
00:05:40,366 --> 00:05:41,366
All right.

115
00:05:41,366 --> 00:05:44,900
And finally we're going to open
logistic regression.

116
00:05:45,166 --> 00:05:48,533
I went from the last to the first
because, as you can see, this way

117
00:05:48,700 --> 00:05:51,700
now we have all the files
in the correct order.