So the final classification model evaluation metric we're going to have a look at is the classification report. Now, a classification report is really a collection of different evaluation metrics rather than a single one; that's where the "report" comes from. It reports back a number of different values evaluating our classification model. Let's see one in action.

So from sklearn.metrics we import classification_report. To use it, we just pass it some true labels, y_test, and some predictions. So once again, this evaluation metric is just comparing the true labels of our data versus the predictions that our model has made.

So let's see what happens here. All right, there's a fair few things going on. We've got precision, recall, f1-score and support. Zero, maybe that's for class 0, not heart disease, and 1, maybe that's for class 1, heart disease. If we come down, we've got accuracy, 0.75, macro average and weighted average. Hmm, what's going on here?

Well, let's go to our keynote and have a look at our classification report anatomy. Over here we see the same thing: we just import classification_report and call it on the test labels and the predictions. The numbers here are slightly different to what's in our notebook, but that's okay.

So what's precision? Precision indicates the proportion of positive identifications, a.k.a. the samples the model predicted as class 1, which were actually correct. A model which produces no false positives has a precision of 1.0. That's what we can see in our precision scores: we know our model has produced some false positives, because we've looked at the confusion matrix and seen it get confused on some examples and predict them as false positives. So our precision scores are a little bit lower than 1.

Now, the next one is recall. Recall indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0. And again, we know from our confusion matrix that this isn't the case, so our recall is a little bit lower.

Now, the F1 score is a combination of precision and recall. A perfect model achieves an F1 score of 1.0. Okay.
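If you want to code along, here's a minimal sketch of that call. The synthetic data, the RandomForestClassifier and the variable names (clf, y_preds) are stand-ins assumed for illustration; in the video, the model and the heart disease data were prepared earlier in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in data: the video uses the heart disease dataset prepared earlier
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a model and make predictions on the test set
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_preds = clf.predict(X_test)

# The report compares the true test labels against the model's predictions
print(classification_report(y_test, y_preds))
```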
Okay, so we see our F1 score is not 1.0, but it's still relatively high.

Then if we go to support, that's the number of samples each metric was calculated on. So we see here, if we look at this 29, that's for class 0, and down here we've got 61. That means these metrics along the bottom were calculated on 61 different samples, while these metrics along here were calculated on the 29 samples which had the class label 0, and this row was calculated on the 32 samples with the class label 1, has heart disease. And if you total these up, 29 plus 32, you get 61.

Accuracy we've seen before. This is just the accuracy of the model in decimal form. Perfect accuracy is 1.0, a.k.a. our model is making predictions right 100 percent of the time if its accuracy is 1.0.

Then we have a look at macro avg. Now, what is that short for? Macro average, as you may have guessed. It's the average of the precision, recall and F1 scores between classes. Now, the key here is that macro average doesn't take class imbalance into account (this word on the slide should say "account"). So if you do have class imbalances, pay attention to this metric. Okay.

So in our case, what is a class imbalance? Well, if we look at our example, we don't have class imbalances here. Why is that? Well, because we have relatively the same amount of samples with class 0 and class 1, meaning about a 50/50 split here. It's not exactly 50/50, but it's not like we have 60 examples of class 1 and one example of class 0; we've got about 30 in each. So if you did have class imbalances, you'd really want to check out your macro average.

Now, because in our example the classes are relatively balanced, that's why the macro average and the weighted average are so similar. Weighted avg is short for weighted average, and the weighted average is the precision, recall and F1 score between classes, where weighted means each metric is calculated with respect to how many samples are in each class. So this metric will favour the majority class. For example, it'll give a high value when one class outperforms another due to having more samples.

So this is why our values here are quite similar: because we have balanced classes. Now you might be thinking, in what kind of situation would these values be out of whack?
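To make the macro versus weighted distinction concrete, here's a small sketch with made-up per-class precision numbers; the 0.79 / 0.72 values are assumptions for illustration, not the exact figures from the notebook, though the supports of 29 and 32 follow the report described above.

```python
import numpy as np

# Made-up per-class precision values for illustration
precision_per_class = np.array([0.79, 0.72])  # class 0, class 1
support = np.array([29, 32])                  # number of samples in each class

# Macro average: a simple mean, so every class counts equally (ignores imbalance)
macro_avg = precision_per_class.mean()

# Weighted average: each class's score is weighted by its number of samples
weighted_avg = (precision_per_class * support).sum() / support.sum()

print(macro_avg, weighted_avg)  # similar here because the classes are roughly balanced
```

With a heavy imbalance (say a support of 9,999 versus 1), those two numbers would drift apart, which is exactly the kind of situation coming up next.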
Like, where would this happen? You might also be thinking, we've learned a whole bunch of different metrics here. If we come back to our notebook, we've gone over classification reports, we've done the confusion matrix, we've done accuracy, we've done area under the ROC curve. So when should I use each of these?

Now that we've kind of gone through them, it might be tempting to think, well, why don't I just ditch all these other ones that I have trouble grasping and just use accuracy? Well, let's have an example. Let's see when other metrics come into play and why maybe you shouldn't just use accuracy, because this is a trap that I got caught in when I first started building classification models. I was like, yes, my model is getting 90-plus percent accuracy, it must be a great model. Let's set up a scenario.

So for example, let's say there were 10,000 people and one of them had a disease, and you're asked to build a model to predict who has it. All right, so let's actually do that, let's code this up. This is where precision and recall become valuable, and in fact all the metrics in our classification report become valuable here.

So disease_true: there were 10,000 people and one of them had a disease. What we might do is create a NumPy array of 10,000 zeros and then change one of them, so disease_true[0] = 1. So there's only one positive case.

Then if we come down here and say what our model predicted: disease_preds = np.zeros(10000). This means the model predicts every case as 0.

All right, and then what we're going to do is create a pandas DataFrame so we can visualize this classification report. That's what's happening here: disease_true, so there's only one positive case, one in 10,000, and we've built a model, and our model predicts that every single case is 0, so it misses the one positive case. What would happen if we built a classification report on that? So we call classification_report, pass it disease_true and over here disease_preds, and we want to pass this little parameter so it fits nicely into a DataFrame, output_dict=True. All right.

So what's happening here? If we look at this, this is just another format of the classification report up here. If we didn't pass output_dict=True, wrapping it in a DataFrame would error out.
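Here's a sketch of the cells being described, following the variable names in the narration (the exact notebook cells may differ slightly):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# 10,000 people, only one of whom actually has the disease
disease_true = np.zeros(10000)
disease_true[0] = 1  # only one positive case

# A model that predicts every single case as 0 (so it misses the one positive)
disease_preds = np.zeros(10000)

# output_dict=True returns a dictionary, which fits nicely into a DataFrame;
# passing the default string output to pd.DataFrame is what would error out
pd.DataFrame(classification_report(disease_true, disease_preds, output_dict=True))
```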
So this is just to visualize what's going on. This is a prime example of where you want to use another metric other than accuracy: when you have a very large class imbalance. In our case we have a massive class imbalance, because in our original dataset, disease_true, there's only one example where the label would be 1 and everything else is 0. And we've built a model that just predicts 0 for every case, right, because there's only one positive sample, so it's really hard for it to learn that there's a pattern there for this one particular case.

And so what happens is, if we were to measure just accuracy on our model that has predicted 0 for everything, it comes out with an accuracy of 0.9999, or in other words 99.99 percent. And so ask yourself: although the model achieves 99.99 percent accuracy, is the model still useful? Right. That's why we look at these other metrics, such as the macro average. That's where we can see, okay, our model is falling down here on precision. It's actually getting zero precision for class 1, which makes sense because it didn't even predict class 1, and here it's getting basically 100 percent on class 0, because it has only ever predicted class 0.

So this is the kind of scenario where you want to make sure you're using a wide spectrum of evaluation metrics for your classification models, and not just accuracy.

Now, we've only just touched on some of the metrics you can use for evaluating a classification algorithm, but there are a few more. So if you want to have a look at more, we go here: model evaluation, scikit-learn. We'll look at the documentation. If you come in here, a little bit of extra curriculum is to have a look at this section here. All I've done is gone into the scikit-learn documentation for model evaluation, we come to classification, and we can see a few here, including the ones that we've covered. And the ones we've covered are probably the bare minimum that you want to use, right? So accuracy, area under the ROC curve (or AUC), confusion matrix and classification report. So keep those in your tool bag, and if you need more, be sure to refer to the documentation.

So that's going to wrap up our classification-specific evaluation metrics.
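And just to put that accuracy trap in numbers before moving on, here's a short follow-up to the sketch above (it assumes the same disease_true and disease_preds arrays):

```python
from sklearn.metrics import accuracy_score, classification_report

# 9,999 of the 10,000 predictions are "correct", so accuracy looks fantastic
print(accuracy_score(disease_true, disease_preds))  # 0.9999, i.e. 99.99%

# But precision for class 1 is 0.0 (the model never predicted it),
# so the macro-averaged precision drops to roughly 0.5 and exposes the problem
report = classification_report(disease_true, disease_preds, output_dict=True)
print(report["macro avg"]["precision"])
```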
Let's dive into the next section, where we'll have a look at some regression-specific model evaluation metrics.