1 00:00:00,570 --> 00:00:07,740 Having already looked at accuracy as a metric, the second metric that we're going to look at is the recall 2 00:00:07,740 --> 00:00:08,740 score. 3 00:00:08,940 --> 00:00:16,020 Recall is also known as the sensitivity. The metric is defined as follows. 4 00:00:16,170 --> 00:00:25,210 Recall is the true positives divided by the sum of the true positives and the false negatives. 5 00:00:25,230 --> 00:00:30,720 Now looking at this formula and this definition is all well and good, but let's think about a situation 6 00:00:30,780 --> 00:00:34,460 in which the recall score is high and a situation in which 7 00:00:34,470 --> 00:00:37,420 the recall score is low, to understand it a little better. 8 00:00:38,820 --> 00:00:44,820 Looking at our fraction here, we can see that the number of false negatives is key. When this number goes 9 00:00:44,820 --> 00:00:46,420 down, say to 0, 10 00:00:46,530 --> 00:00:50,260 then the recall score goes up. In the extreme, 11 00:00:50,310 --> 00:00:59,360 if the false negatives are equal to zero, then the recall score has its maximum value, a value of 1. 12 00:00:59,370 --> 00:01:05,580 Now remember from our previous discussion, a false negative is when our model said that an email was 13 00:01:05,580 --> 00:01:08,970 not spam when in fact it was spam. 14 00:01:09,150 --> 00:01:12,370 The spammer got into our inbox. 15 00:01:12,570 --> 00:01:15,570 So how do we interpret the recall score? 16 00:01:16,590 --> 00:01:22,550 Well, think of recall as answering this question: of all the spam emails, 17 00:01:22,590 --> 00:01:29,190 how many of these did we actually label as spam? This whole equation is a ratio, right? 18 00:01:29,210 --> 00:01:35,840 It's the ratio between the correctly identified spam emails and all the spam emails that were present 19 00:01:36,050 --> 00:01:38,130 in the dataset. 20 00:01:38,150 --> 00:01:44,090 Here's how we can visualize the true positives and the false negatives on our chart.
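The recall definition given in this part of the lesson (true positives divided by the sum of true positives and false negatives) can be sketched in a few lines of Python. The function name `recall` and the counts used here are hypothetical, chosen only to illustrate how the score rises as the false negatives shrink; they are not figures from the course dataset.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN). Assumes at least one actual positive."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts: keep the true positives fixed at 90 and let the
# number of false negatives (spam that slipped into the inbox) fall.
for false_negatives in (30, 10, 0):
    score = recall(90, false_negatives)
    print(false_negatives, 'false negatives -> recall', round(score, 3))
```

With zero false negatives the fraction becomes 90 / 90, which is the maximum value of 1 mentioned in the lesson.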
21 00:01:44,090 --> 00:01:50,180 Remember, we had our probability that an email contains certain words given that it is spam on the y 22 00:01:50,180 --> 00:01:56,870 axis, and we had the probability that an email has certain words given that the email is non-spam on 23 00:01:56,870 --> 00:01:58,510 the x axis. 24 00:01:58,550 --> 00:02:01,880 Our decision boundary separated two regions. 25 00:02:01,880 --> 00:02:05,270 Any data point that falls into the top red triangle here, 26 00:02:05,450 --> 00:02:12,420 our Naive Bayes model will classify as spam, and any email that falls into the bottom blue triangle here, 27 00:02:12,530 --> 00:02:14,830 our model would classify as non-spam. 28 00:02:15,680 --> 00:02:19,100 So where are those true positives and those false negatives? 29 00:02:19,130 --> 00:02:26,630 Well, plotting all the data points like this, the true positives are the ones that are actually spam and 30 00:02:26,630 --> 00:02:28,130 classified correctly. 31 00:02:28,220 --> 00:02:37,920 They would be showing up like so. On the other hand, the false negatives would be on the other side. They 32 00:02:37,920 --> 00:02:45,580 would be spam emails, but they would be classified as non-spam. So what recall is trying to measure is 33 00:02:45,670 --> 00:02:51,570 our model's ability to find the data points of interest, namely our spam emails. 34 00:02:51,580 --> 00:02:54,610 It's a measure of quantity or completeness: 35 00:02:54,610 --> 00:02:59,590 how many spam emails did we get versus how many did we miss? 36 00:02:59,590 --> 00:03:02,830 Now let's calculate our recall score in Jupyter notebook. 37 00:03:02,890 --> 00:03:09,700 The first thing I'll do is insert a section heading, so I'll call the section heading here 38 00:03:09,790 --> 00:03:16,990 "Recall Score", and at this point I'm going to throw it over to you. 39 00:03:17,050 --> 00:03:22,870 I'd actually like you to calculate what the recall score is for our Naive Bayes model.
40 00:03:22,870 --> 00:03:30,580 So can you store the result in a variable called recall_score and then print the value 41 00:03:30,790 --> 00:03:37,060 of the recall score as a percentage rounded to two decimal places? 42 00:03:37,060 --> 00:03:41,140 I'll give you a few seconds to pause the video before I show you the solution. 43 00:03:44,570 --> 00:03:45,130 All right. 44 00:03:45,170 --> 00:03:46,280 Here we go. 45 00:03:46,580 --> 00:03:52,540 recall_score is equal to the true positives, 46 00:03:52,670 --> 00:04:02,520 so true_pos.sum(), divided by the sum of our true positives and false negatives, 47 00:04:02,530 --> 00:04:16,070 so true_pos.sum() plus false_neg.sum() in parentheses, and that's it, and 48 00:04:16,130 --> 00:04:19,200 we'll print the whole thing: 49 00:04:19,310 --> 00:04:26,160 "Recall score is" followed by curly braces containing colon, dot 2, percent, that is {:.2%}. 50 00:04:26,240 --> 00:04:33,050 So this is part one of how we format the whole thing as a percentage rounded to two decimal places, and 51 00:04:33,150 --> 00:04:43,170 on the string we'll call the format method and we'll provide recall_score as an argument. 52 00:04:43,220 --> 00:04:45,440 Now I've just reopened this notebook actually, 53 00:04:45,470 --> 00:04:56,020 so I'll go to Cell and Run All, and there we have it: our recall score is ninety three point two percent. 54 00:04:56,620 --> 00:04:57,610 Brilliant. 55 00:04:57,970 --> 00:05:01,210 But you know, the thing is, no metric is perfect. 56 00:05:01,480 --> 00:05:07,600 And as we've discussed earlier, a weakness of the accuracy metric was when we were dealing with unbalanced 57 00:05:07,690 --> 00:05:09,040 datasets. 58 00:05:09,130 --> 00:05:15,550 But recall also has a weakness, because there is actually a very, very easy way to manipulate 59 00:05:15,880 --> 00:05:21,600 the recall score and maximize it: simply by labeling all the emails as spam.
60 00:05:21,640 --> 00:05:28,000 Because if we do that, we have no false negatives, no misidentified spam emails. 61 00:05:28,030 --> 00:05:33,230 So our recall score would be equal to 1 and we'd have the perfect recall score. 62 00:05:33,610 --> 00:05:38,290 So as you can imagine, there exists another metric that we should look at as well, 63 00:05:38,440 --> 00:05:41,050 and this is what we'll look at in the next lesson. 64 00:05:41,090 --> 00:05:42,000 Hey, I'll see you there.
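The weakness described in this lesson can be checked with a small, self-contained sketch. It uses plain Python lists rather than the notebook's variables, and everything in it is an assumption made for illustration: the helper `recall_from_labels`, the label encoding (1 for spam, 0 for non-spam), and the eight made-up emails are not taken from the course data.

```python
def recall_from_labels(actual, predicted):
    """Count true positives and false negatives, then return TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    return true_pos / (true_pos + false_neg)


# Hypothetical labels: 1 = spam, 0 = non-spam (four of each).
actual = [1, 1, 1, 0, 0, 1, 0, 0]

# A made-up model that misses one of the four spam emails.
model_pred = [1, 0, 1, 0, 1, 1, 0, 0]

# The degenerate "classifier" from the lesson: label every email as spam.
all_spam_pred = [1] * len(actual)

model_recall = recall_from_labels(actual, model_pred)
all_spam_recall = recall_from_labels(actual, all_spam_pred)

# Same formatting idea as in the lesson: a percentage with two decimals.
print('Recall score is {:.2%}'.format(model_recall))
print('Recall score when everything is labelled spam is {:.2%}'.format(all_spam_recall))
```

In this toy data the label-everything-as-spam approach reaches a recall of 1 because it produces no false negatives, even though it also flags all four legitimate emails, which is why recall alone cannot be the whole story.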