1 00:00:00,350 --> 00:00:00,840 All right. 2 00:00:00,870 --> 00:00:08,700 So upwards and onwards in this lesson we're going to make our predictions based on comparing the two 3 00:00:08,700 --> 00:00:12,850 join probabilities that we calculated in the last lesson. 4 00:00:12,900 --> 00:00:20,350 This really is the meat of the naive bayes classifier as part of the NI phase classifier algorithm. 5 00:00:20,460 --> 00:00:29,010 We will be comparing the two probabilities and classifying email based on which probability is higher. 6 00:00:29,010 --> 00:00:32,830 Let's add a section heading here to commemorate this. 7 00:00:33,150 --> 00:00:45,410 Making predictions and for the subheading All right checking for the higher joined probability. 8 00:00:45,420 --> 00:00:51,270 Now I'm also gonna add some latex marked down as a note to ourselves and maybe our future selves when 9 00:00:51,270 --> 00:00:56,380 we come back and look at this again to remind us as to what the code that we're going to write is gonna 10 00:00:56,400 --> 00:01:06,300 do so I'll write to taller signs and then I'll put P parentheses spam and then for some spacing allowed 11 00:01:06,300 --> 00:01:08,540 the backslash and the comma. 12 00:01:08,550 --> 00:01:17,940 Then the pipe symbol then another backslash and a comma then an X backslash comma greater than backslash 13 00:01:17,940 --> 00:01:32,910 comma P parentheses ham backslash comma pipe backslash comma X closing parentheses and two dollar signs. 14 00:01:33,390 --> 00:01:38,710 Maybe even capitalized the S here and when I capitalized the X here. 15 00:01:39,380 --> 00:01:41,640 All right so that makes it pretty explicit. 16 00:01:41,640 --> 00:01:47,070 We're going to be comparing whether the probability that an email is spam given its tokens is greater 17 00:01:47,070 --> 00:01:52,830 than than the probability that it is non spam given the tokens. 18 00:01:52,830 --> 00:01:55,350 The converse of course is also true. 19 00:01:55,350 --> 00:01:58,680 So when a copy this line pasted in. 20 00:01:58,980 --> 00:02:10,800 Change this to less than and maybe put an oar in between star or star star would be quite nice though 21 00:02:10,800 --> 00:02:14,800 if this was centered instead of being aligned to the left. 22 00:02:15,000 --> 00:02:17,590 So I'll add some H2 mail mark down here. 23 00:02:17,730 --> 00:02:26,040 Angle brackets center closing angle brackets and then on the other side it'll have opening angle brackets 24 00:02:26,420 --> 00:02:35,400 forwards large center closing angle brackets and I'll add a little line break here as well with the 25 00:02:35,400 --> 00:02:39,050 letters B are enclosed by angled brackets. 26 00:02:39,150 --> 00:02:39,660 There we go. 27 00:02:40,350 --> 00:02:43,620 So these two lines of latex notation summarize what we're gonna do. 28 00:02:44,540 --> 00:02:49,470 And because we've done so much legwork making these predictions it's actually not that hard at this 29 00:02:49,470 --> 00:02:50,490 stage. 30 00:02:50,490 --> 00:02:55,260 And I think it actually makes for a nice mini challenge so I'd like you to give us a try. 31 00:02:55,260 --> 00:03:01,020 Can you create the vector of predictions are y hat. 32 00:03:01,360 --> 00:03:08,860 Now remember that the spam emails in this vector should have the value 1 or true and the non spam the 33 00:03:08,860 --> 00:03:17,070 ham emails should have the value 0 or false I'd like you to store your results in a variable called 34 00:03:17,280 --> 00:03:22,600 prediction I'll give you a few seconds to pause the video and have a go 35 00:03:25,370 --> 00:03:26,380 ready. 36 00:03:26,390 --> 00:03:27,950 Here's the solution. 37 00:03:27,950 --> 00:03:35,420 We're gonna be making use of this variable here which holds on to the joint probability that an email 38 00:03:35,420 --> 00:03:44,120 is non spam or ham given its tokens and we're gonna be making use of this variable here which holds 39 00:03:44,120 --> 00:03:50,570 onto the joint probability that an email is spam given its tokens and the way we're going to compare 40 00:03:50,570 --> 00:04:00,770 these two is simply by writing prediction as equal to joint on a school log on the school spam greater 41 00:04:00,770 --> 00:04:04,610 than joint I just got log on a score. 42 00:04:04,750 --> 00:04:06,850 Ham and that's it. 43 00:04:07,540 --> 00:04:11,160 Let's peek inside our prediction vector. 44 00:04:11,410 --> 00:04:13,630 Let's look at the last five results. 45 00:04:13,780 --> 00:04:19,630 So prediction square brackets minus five colon. 46 00:04:19,930 --> 00:04:25,000 Those are equal to False false false false and false. 47 00:04:25,000 --> 00:04:31,890 In other words the last five emails in our prediction vector are all non spam emails. 48 00:04:31,900 --> 00:04:36,730 Let's compare this to the last five emails in y underscore test. 49 00:04:37,630 --> 00:04:39,460 Now here are the actual labels. 50 00:04:39,490 --> 00:04:47,860 These are the actual classifications and what we see is five zeros meaning the last five emails are 51 00:04:47,860 --> 00:04:54,340 actually all non spam emails and these five predictions that we looked at were in fact correct. 52 00:04:54,340 --> 00:05:02,110 Now if you don't fancy looking at integers here and billions here all you have to do to convert a boolean 53 00:05:02,110 --> 00:05:06,480 to an integer is to multiply by 1. 54 00:05:06,520 --> 00:05:08,140 Now I want to show you something else. 55 00:05:08,260 --> 00:05:11,830 It turns out that we can actually simplify the calculation a little bit. 56 00:05:11,830 --> 00:05:18,790 Looking at our equation for making a comparison between this quantity here and this quantity here looking 57 00:05:18,790 --> 00:05:27,460 at our formulas above we can see that the probability of him given X is expressed by this fraction where 58 00:05:27,460 --> 00:05:35,500 we're dividing by p of x in the denominator and the probability of spam given X is given by this fraction 59 00:05:35,740 --> 00:05:42,550 where we are also dividing by p of x in the denominator since we're doing this on both sides of the 60 00:05:42,550 --> 00:05:45,910 equation effectively we're doing this for both quantities. 61 00:05:45,910 --> 00:05:49,710 We can actually remove this bit and still get the same results. 62 00:05:49,990 --> 00:05:56,470 In the case of our code this p of x this bottom part where we're dividing by the probability of a token 63 00:05:56,470 --> 00:06:02,530 occurring is given by this minus n p dot log probability of all tokens. 64 00:06:02,650 --> 00:06:08,590 We took the log of the probabilities remember so instead of dividing we're actually subtracting so I 65 00:06:08,590 --> 00:06:12,860 can take this line here and copy it. 66 00:06:12,940 --> 00:06:23,280 A little mark down so call it simplify pasted in and I'll grab this line here and also paste it in and 67 00:06:23,280 --> 00:06:30,970 then what I'll do is I'll delete the same section of code on both lines. 68 00:06:30,970 --> 00:06:39,010 The part or subtracting the probability of a token occurring and the reason I can do this is because 69 00:06:39,200 --> 00:06:44,950 we're after being able to predict the why we were interested in being able to predict whether we have 70 00:06:44,950 --> 00:06:47,510 a spam email or a non spam email. 71 00:06:47,650 --> 00:06:53,140 This prediction actually does not depend on the probability of a token occurring. 72 00:06:53,140 --> 00:06:55,600 This is why I can do this simplification. 73 00:06:55,600 --> 00:06:58,940 Now let me add some late tax code to make this a little bit more clear. 74 00:06:59,140 --> 00:07:03,680 And this is because you know the thing is mathematically those two lines right. 75 00:07:03,850 --> 00:07:09,880 The one or subtracting the log probability of the tokens and the ones where we're not doing that actually 76 00:07:09,880 --> 00:07:11,020 not the same. 77 00:07:11,170 --> 00:07:17,050 You won't actually have the same numbers in joint underscore log underscore spam and enjoined on a scroll 78 00:07:17,050 --> 00:07:18,470 log on a square ham. 79 00:07:18,580 --> 00:07:25,120 However their relationship to each other is unchanged even though these two quantities are not equal 80 00:07:25,180 --> 00:07:26,670 to each other mathematically. 81 00:07:27,120 --> 00:07:32,160 The simplification is still valid because it doesn't change our predictions. 82 00:07:32,170 --> 00:07:38,260 The reason why adding the simplification step here is because I've seen this step in many many of many 83 00:07:38,260 --> 00:07:41,690 implementations of the need based classifier. 84 00:07:41,950 --> 00:07:48,850 And I remember when first looking at this code it was very confusing because I couldn't type back to 85 00:07:48,880 --> 00:07:51,250 the formula in Bayes theorem. 86 00:07:51,340 --> 00:07:57,560 Nonetheless this simplification is perfectly valid and will not change our results. 87 00:07:57,960 --> 00:08:02,880 You might even call this the one weird trick that statisticians don't want you to know. 88 00:08:03,140 --> 00:08:08,470 And you can verify this for yourself and the next lessons where we're gonna be talking about metrics 89 00:08:08,890 --> 00:08:10,270 and evaluation. 90 00:08:10,360 --> 00:08:15,790 We're going to look at our Bayes classifier and actually check how well it's doing. 91 00:08:16,030 --> 00:08:16,850 I'll see you there.