So the final classification model evaluation metric we're going to have a look at is the classification report. Now, a classification report is really a collection of different evaluation metrics rather than a single one; that's where the "report" comes from. It reports back a number of different values evaluating our classification model. Let's see one in action.

So from sklearn.metrics we import classification_report. To use it, we just pass it some true labels, y_test, and some predictions. So once again, this evaluation metric is just comparing the true labels of our data versus the predictions that our model has made.

So let's see what happens here. All right, there's a fair few things going on. We've got precision, recall, f1-score and support. Zero, maybe that's for class 0, not heart disease, and 1, maybe that's for class 1, heart disease. If we come down, we've got accuracy, 0.75, macro average and weighted average. Hmm, what's going on here?

Well, let's go to our keynote and have a look at our classification report anatomy. Over here we see the same thing: we just import classification_report and call it on the test labels and the predictions. The numbers here are slightly different to what's in our notebook, but that's okay.

So what's precision? Precision indicates the proportion of positive identifications, a.k.a. the samples the model predicted as class 1, which were actually correct. A model which produces no false positives has a precision of 1.0. That's what we can see in our precision scores: we know our model has produced some false positives, because we've looked at the confusion matrix and seen it get confused on some examples and predict them as false positives. So our precision scores are a little bit lower than 1.

Now, the next one is recall. Recall indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0. And again, we know from our confusion matrix that this isn't the case, so our recall is a little bit lower.

Now, the F1 score is a combination of precision and recall. A perfect model achieves an F1 score of 1.0. Okay.
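If you want to code along, here's a minimal sketch of that call. The synthetic data, the RandomForestClassifier and the variable names (clf, y_preds) are stand-ins assumed for illustration; in the video, the model and the heart disease data were prepared earlier in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in data: the video uses the heart disease dataset prepared earlier
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a model and make predictions on the test set
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_preds = clf.predict(X_test)

# The report compares the true test labels against the model's predictions
print(classification_report(y_test, y_preds))
```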
Okay, so we see our F1 score is not 1.0, but it's still relatively high.

Then if we go to support, that's the number of samples each metric was calculated on. So we see here, if we look at this 29, that's for class 0, and down here we've got 61. That means these metrics along the bottom were calculated on 61 different samples, while these metrics along here were calculated on the 29 samples which had the class label 0, and this row was calculated on the 32 samples with the class label 1, has heart disease. And if you total these up, 29 plus 32, you get 61.

Accuracy we've seen before. This is just the accuracy of the model in decimal form. Perfect accuracy is 1.0, a.k.a. our model is making predictions right 100 percent of the time if its accuracy is 1.0.

Then we have a look at macro avg. Now, what is that short for? Macro average, as you may have guessed. It's the average of the precision, recall and F1 scores between classes. Now, the key here is that macro average doesn't take class imbalance into account (this word on the slide should say "account"). So if you do have class imbalances, pay attention to this metric. Okay.

So in our case, what is a class imbalance? Well, if we look at our example, we don't have class imbalances here. Why is that? Well, because we have relatively the same amount of samples with class 0 and class 1, meaning about a 50/50 split here. It's not exactly 50/50, but it's not like we have 60 examples of class 1 and one example of class 0; we've got about 30 in each. So if you did have class imbalances, you'd really want to check out your macro average.

Now, because in our example the classes are relatively balanced, that's why the macro average and the weighted average are so similar. Weighted avg is short for weighted average, and the weighted average is the precision, recall and F1 score between classes, where weighted means each metric is calculated with respect to how many samples are in each class. So this metric will favour the majority class. For example, it'll give a high value when one class outperforms another due to having more samples.

So this is why our values here are quite similar: because we have balanced classes. Now you might be thinking, in what kind of situation would these values be out of whack?
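To make the macro versus weighted distinction concrete, here's a small sketch with made-up per-class precision numbers; the 0.79 / 0.72 values are assumptions for illustration, not the exact figures from the notebook, though the supports of 29 and 32 follow the report described above.

```python
import numpy as np

# Made-up per-class precision values for illustration
precision_per_class = np.array([0.79, 0.72])  # class 0, class 1
support = np.array([29, 32])                  # number of samples in each class

# Macro average: a simple mean, so every class counts equally (ignores imbalance)
macro_avg = precision_per_class.mean()

# Weighted average: each class's score is weighted by its number of samples
weighted_avg = (precision_per_class * support).sum() / support.sum()

print(macro_avg, weighted_avg)  # similar here because the classes are roughly balanced
```

With a heavy imbalance (say a support of 9,999 versus 1), those two numbers would drift apart, which is exactly the kind of situation coming up next.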
Like, where would this happen? You might also be thinking, we've learned a whole bunch of different metrics here. If we come back to our notebook, we've gone over classification reports, we've done the confusion matrix, we've done accuracy, we've done area under the ROC curve. So when should I use each of these?

Now that we've kind of gone through them, it might be tempting to think, well, why don't I just ditch all these other ones that I have trouble grasping and just use accuracy? Well, let's have an example. Let's see when other metrics come into play and why maybe you shouldn't just use accuracy, because this is a trap that I got caught in when I first started building classification models. I was like, yes, my model is getting 90-plus percent accuracy, it must be a great model. Let's set up a scenario.

So for example, let's say there were 10,000 people and one of them had a disease, and you're asked to build a model to predict who has it. All right, so let's actually do that, let's code this up. This is where precision and recall become valuable, and in fact all the metrics in our classification report become valuable here.

So disease_true: there were 10,000 people and one of them had a disease. What we might do is create a NumPy array of 10,000 zeros and then change one of them, so disease_true[0] = 1. So there's only one positive case.

Then if we come down here and say what our model predicted: disease_preds = np.zeros(10000). This means the model predicts every case as 0.

All right, and then what we're going to do is create a pandas DataFrame so we can visualize this classification report. That's what's happening here: disease_true, so there's only one positive case, one in 10,000, and we've built a model, and our model predicts that every single case is 0, so it misses the one positive case. What would happen if we built a classification report on that? So we call classification_report, pass it disease_true and over here disease_preds, and we want to pass this little parameter so it fits nicely into a DataFrame, output_dict=True. All right.

So what's happening here? If we look at this, this is just another format of the classification report up here. If we didn't pass output_dict=True, wrapping it in a DataFrame would error out.
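Here's a sketch of the cells being described, following the variable names in the narration (the exact notebook cells may differ slightly):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# 10,000 people, only one of whom actually has the disease
disease_true = np.zeros(10000)
disease_true[0] = 1  # only one positive case

# A model that predicts every single case as 0 (so it misses the one positive)
disease_preds = np.zeros(10000)

# output_dict=True returns a dictionary, which fits nicely into a DataFrame;
# passing the default string output to pd.DataFrame is what would error out
pd.DataFrame(classification_report(disease_true, disease_preds, output_dict=True))
```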
So this is just to visualize what's going on. This is a prime example of where you want to use another metric other than accuracy: when you have a very large class imbalance. In our case we have a massive class imbalance, because in our original dataset, disease_true, there's only one example where the label would be 1 and everything else is 0. And we've built a model that just predicts 0 for every case, right, because there's only one positive sample, so it's really hard for it to learn that there's a pattern there for this one particular case.

And so what happens is, if we were to measure just accuracy on our model that has predicted 0 for everything, it comes out with an accuracy of 0.9999, or in other words 99.99 percent. And so ask yourself: although the model achieves 99.99 percent accuracy, is the model still useful? Right. That's why we look at these other metrics, such as the macro average. That's where we can see, okay, our model is falling down here on precision. It's actually getting zero precision for class 1, which makes sense because it didn't even predict class 1, and here it's getting basically 100 percent on class 0, because it has only ever predicted class 0.

So this is the kind of scenario where you want to make sure you're using a wide spectrum of evaluation metrics for your classification models, and not just accuracy.

Now, we've only just touched on some of the metrics you can use for evaluating a classification algorithm, but there are a few more. So if you want to have a look at more, we go here: model evaluation, scikit-learn. We'll look at the documentation. If you come in here, a little bit of extra curriculum is to have a look at this section here. All I've done is gone into the scikit-learn documentation for model evaluation, we come to classification, and we can see a few here, including the ones that we've covered. And the ones we've covered are probably the bare minimum that you want to use, right? So accuracy, area under the ROC curve (or AUC), confusion matrix and classification report. So keep those in your tool bag, and if you need more, be sure to refer to the documentation.

So that's going to wrap up our classification-specific evaluation metrics.
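And just to put that accuracy trap in numbers before moving on, here's a short follow-up to the sketch above (it assumes the same disease_true and disease_preds arrays):

```python
from sklearn.metrics import accuracy_score, classification_report

# 9,999 of the 10,000 predictions are "correct", so accuracy looks fantastic
print(accuracy_score(disease_true, disease_preds))  # 0.9999, i.e. 99.99%

# But precision for class 1 is 0.0 (the model never predicted it),
# so the macro-averaged precision drops to roughly 0.5 and exposes the problem
report = classification_report(disease_true, disease_preds, output_dict=True)
print(report["macro avg"]["precision"])
```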
Let's dive into the next section, where we'll have a look at some regression-specific model evaluation metrics.