1 00:00:00,750 --> 00:00:01,620 We're almost there. 2 00:00:01,620 --> 00:00:03,930 We're almost there. 3 00:00:03,930 --> 00:00:11,190 In this lesson we'll calculate the probability that a token occurs given that the email is spam. 4 00:00:11,220 --> 00:00:16,740 So in this case we might calculate the probability of "Viagra" occurring in an email, 5 00:00:16,740 --> 00:00:22,920 given that the email is spam. And then afterwards we'll calculate the probability of a particular token 6 00:00:22,920 --> 00:00:30,630 occurring given that the email is a non-spam message. So let's add this as a markdown cell: 7 00:00:31,550 --> 00:00:45,400 P(Token | Spam), the probability that a token occurs given the email is spam. 8 00:00:45,520 --> 00:00:48,890 Now we're not going to just do this for a single token. 9 00:00:48,960 --> 00:00:52,360 We'll do this for all the tokens simultaneously. 10 00:00:52,640 --> 00:01:00,240 I'm going to store the results of this calculation in a variable called 11 00:01:00,240 --> 00:01:04,440 prob_tokens_spam and I'll set that equal to the following. 12 00:01:04,440 --> 00:01:13,520 We're going to take our summed spam tokens and divide those by the spam word count. 13 00:01:13,520 --> 00:01:14,210 Right. 14 00:01:14,250 --> 00:01:22,440 This will divide our summed spam tokens, our two thousand five hundred values of how often each word 15 00:01:22,770 --> 00:01:29,960 occurred in spam emails, and we're dividing it by the total number of spam tokens. 16 00:01:30,690 --> 00:01:37,740 But there is one modification that I have to make to this calculation, and that modification has to do with 17 00:01:37,740 --> 00:01:39,330 what we did over here. 18 00:01:39,360 --> 00:01:42,460 It has to do with our Laplace smoothing technique. 19 00:01:42,570 --> 00:01:48,440 We added 1 to our summation because we don't want to end up dividing by zero. 
20 00:01:48,540 --> 00:01:54,450 Since there were two thousand five hundred words where we added 1, we have to balance that out again, 21 00:01:54,870 --> 00:02:02,300 and this is why we're going to add the vocabulary size to our denominator here. 22 00:02:02,430 --> 00:02:09,810 We've added 1 two thousand five hundred times, so therefore we also have to add two thousand five hundred 23 00:02:10,110 --> 00:02:15,030 to our spam word count to fully implement this smoothing technique. 24 00:02:15,630 --> 00:02:16,590 And that's it. 25 00:02:16,590 --> 00:02:22,590 We've done a lot of legwork, so it's just one line of code for that calculation, which is really, really 26 00:02:22,590 --> 00:02:23,250 neat. 27 00:02:23,250 --> 00:02:26,810 The thing you're probably wondering about, though, is what the probabilities actually look like. 28 00:02:26,880 --> 00:02:27,490 Right. 29 00:02:27,510 --> 00:02:32,430 prob_tokens_spam[:5] 30 00:02:32,430 --> 00:02:40,530 will show us the first couple of entries in this series. So here we see the actual values that 31 00:02:40,710 --> 00:02:42,220 we get. Now, 32 00:02:42,300 --> 00:02:47,760 the numbers that we're working with aren't particularly large, but if we sum them all up they should 33 00:02:47,760 --> 00:02:48,770 add up to 1. 34 00:02:48,780 --> 00:02:51,850 That's how probability works, right? 35 00:02:52,000 --> 00:03:01,560 prob_tokens_spam.sum() will show us if our math ties out, 36 00:03:02,160 --> 00:03:04,060 and I think it does. 37 00:03:04,080 --> 00:03:10,440 We've calculated the probability of the tokens given that the email is spam for all our two thousand 38 00:03:10,440 --> 00:03:12,190 five hundred entries. 39 00:03:12,190 --> 00:03:14,510 Now let's do the same thing the other way round. 40 00:03:14,550 --> 00:03:18,150 So I'll quickly copy my markdown cell here. 41 00:03:18,150 --> 00:03:19,620 Paste it in. 
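The spam calculation described above can be sketched on a toy example. This is only an illustration under stated assumptions: the vocabulary has 5 tokens instead of 2,500, the counts in `raw_spam_counts` are made up, and the names `raw_spam_counts`, `summed_spam_tokens`, `spam_wc` and `VOCAB_SIZE` are guesses at how the notebook spells the quantities the lesson talks about. Only `prob_tokens_spam` is named explicitly in the lesson.

```python
import numpy as np

# Toy stand-in for the 2,500-word vocabulary (assumption: 5 tokens only).
VOCAB_SIZE = 5

# Hypothetical raw counts of each token across all spam emails.
raw_spam_counts = np.array([12, 0, 7, 3, 8])

# Laplace smoothing: add 1 to every token count so no token ends up with a zero count.
summed_spam_tokens = raw_spam_counts + 1

# Total number of tokens in spam emails, before smoothing.
spam_wc = raw_spam_counts.sum()

# 1 was added VOCAB_SIZE times in the numerator, so the denominator
# is increased by VOCAB_SIZE to balance it out.
prob_tokens_spam = summed_spam_tokens / (spam_wc + VOCAB_SIZE)

print(prob_tokens_spam[:5])    # first few smoothed probabilities
print(prob_tokens_spam.sum())  # should be 1, up to floating-point rounding
```

Because the numerator grows by exactly `VOCAB_SIZE` in total and the denominator grows by the same amount, the smoothed probabilities still sum to 1, which is the check the lesson performs with `.sum()`.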
42 00:03:19,770 --> 00:03:27,230 Change this to markdown, and change this to ham, and change this to non-spam. 43 00:03:27,330 --> 00:03:29,250 This calculation is very similar. 44 00:03:29,250 --> 00:03:29,690 Right. 45 00:03:29,700 --> 00:03:37,740 prob_tokens_nonspam will hold on to the token probabilities. 46 00:03:37,740 --> 00:03:47,010 And that's gonna be the summed ham tokens divided by, open parentheses, the non-spam word count, so the total 47 00:03:47,010 --> 00:03:55,860 number of words in the non-spam messages, and we're going to add 2500 since we're also using the Laplace smoothing 48 00:03:55,860 --> 00:03:56,270 technique 49 00:03:56,290 --> 00:03:58,390 here. There we go. 50 00:03:58,460 --> 00:04:05,990 That line calculates all the probabilities, and I can do a quick check here that these probabilities indeed 51 00:04:06,170 --> 00:04:12,840 sum to 1, or very close to 1, by calling the summation method and hitting Shift+Enter. 52 00:04:12,980 --> 00:04:13,490 There we go. 53 00:04:14,090 --> 00:04:15,670 I'm pretty happy with that. 54 00:04:16,300 --> 00:04:18,570 OK, so where does this leave us? 55 00:04:18,590 --> 00:04:24,480 We've tackled the fraction in the numerator, but we haven't tackled the fraction in the denominator yet. 56 00:04:24,500 --> 00:04:29,270 This here is the overall probability of a particular token. 57 00:04:29,420 --> 00:04:34,260 I'll add that as a markdown cell. I'll call it P(Token). 58 00:04:34,420 --> 00:04:44,270 This is the probability that a token occurs regardless of whether we're dealing with spam or non-spam 59 00:04:44,330 --> 00:04:45,750 emails. 60 00:04:45,800 --> 00:04:53,690 So what I'll do is I'll create a variable called prob_tokens_all, and that will 61 00:04:53,690 --> 00:05:05,420 be equal to our full train features summed up across the columns, axis is equal to zero, divided by the 62 00:05:05,420 --> 00:05:07,390 total word count. 63 00:05:07,760 --> 00:05:08,720 That's it. 
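As a rough sketch of the two calculations just described, here is a toy version with 3 emails and a 4-token vocabulary. The data is invented, and `full_train_features`, `is_spam`, `summed_ham_tokens`, `nonspam_wc`, `total_wc` and `VOCAB_SIZE` are assumed names for illustration; the lesson itself only names `prob_tokens_nonspam` and `prob_tokens_all`.

```python
import numpy as np

VOCAB_SIZE = 4  # assumption: toy vocabulary instead of 2,500 tokens

# Hypothetical feature matrix: one row per email, one column per token,
# each cell holding how often that token occurs in that email.
full_train_features = np.array([
    [2, 0, 1, 0],   # spam email
    [0, 3, 1, 1],   # non-spam email
    [1, 1, 0, 2],   # non-spam email
])
is_spam = np.array([True, False, False])

# P(Token | Nonspam) with Laplace smoothing, mirroring the spam calculation.
nonspam_features = full_train_features[~is_spam]
summed_ham_tokens = nonspam_features.sum(axis=0) + 1
nonspam_wc = nonspam_features.sum()
prob_tokens_nonspam = summed_ham_tokens / (nonspam_wc + VOCAB_SIZE)

# P(Token): per-token totals (axis=0 sums down each column) over the total word count.
# No smoothing here, because every vocabulary token occurs at least once overall.
total_wc = full_train_features.sum()
prob_tokens_all = full_train_features.sum(axis=0) / total_wc

print(prob_tokens_nonspam.sum())  # should be 1, up to rounding
print(prob_tokens_all.sum())      # should be 1, up to rounding
```

Both results are probability distributions over the vocabulary, so calling `.sum()` on each is a cheap sanity check.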
64 00:05:08,720 --> 00:05:11,880 Again this probability should sum to 1. 65 00:05:11,930 --> 00:05:16,060 So let's just quickly check if we've done this calculation right. 66 00:05:16,100 --> 00:05:17,220 Looks good. 67 00:05:17,270 --> 00:05:21,410 You'll notice that I haven't done any Laplace smoothing here because I'm pretty much guaranteed I'm 68 00:05:21,410 --> 00:05:27,320 not dividing by zero, because we've taken the two thousand five hundred most frequent words in our whole 69 00:05:27,380 --> 00:05:30,150 dataset to be our features. 70 00:05:30,290 --> 00:05:37,220 I think it's a good time to save our trained model to a text file and kind of create a checkpoint for 71 00:05:37,220 --> 00:05:38,260 ourselves. 72 00:05:39,490 --> 00:05:47,180 At the top of our notebook, where we've got our constants, I'm going to copy these two lines, paste them 73 00:05:47,230 --> 00:05:51,460 in again, and there I'm going to create two new constants. 74 00:05:51,460 --> 00:05:58,400 The first one will be called TOKEN_SPAM_PROB_FILE. 75 00:05:58,410 --> 00:06:02,890 I'm going to save this work to a different folder, namely our testing folder. 76 00:06:02,890 --> 00:06:10,750 So I'm going to replace 02_Training with 03_Testing, and this text file shall 77 00:06:10,750 --> 00:06:15,490 be called prob-spam.txt. 78 00:06:15,490 --> 00:06:17,960 I'll do the same thing for the non-spam files. 79 00:06:18,040 --> 00:06:26,770 So TOKEN_HAM_PROB_FILE will save as 80 00:06:26,770 --> 00:06:27,960 prob-nonspam.txt. 81 00:06:28,000 --> 00:06:35,050 The goal is to save both of these text files under 03_Testing inside our spam data folder. 82 00:06:35,050 --> 00:06:42,700 Going back up to the constants, I'll quickly add another constant up here for all our tokens. 
83 00:06:42,970 --> 00:06:51,710 So TOKEN_ALL_PROB_FILE. I want to call this file prob-all-tokens. I'll just 84 00:06:51,910 --> 00:06:53,350 Shift+Enter on this. 85 00:06:53,680 --> 00:07:02,300 Then I'll come down here, and this is where we save that trained model. We're again gonna use NumPy's 86 00:07:02,930 --> 00:07:10,490 savetxt function, and we're gonna call it three times: first time for the spam probabilities, where we're 87 00:07:10,490 --> 00:07:19,760 gonna save prob_tokens_spam. We're gonna call it again for our ham tokens. 88 00:07:19,760 --> 00:07:29,870 So TOKEN_HAM_PROB_FILE, comma, prob_tokens_nonspam, and we're gonna do 89 00:07:29,870 --> 00:07:33,200 it one more time for all the tokens. 90 00:07:33,230 --> 00:07:42,340 So this is gonna be ham and spam combined. 91 00:07:42,640 --> 00:07:43,720 There we go. 92 00:07:43,750 --> 00:07:49,780 Saving data to the disk as a file is a really nice way of creating like a checkpoint for yourself. 93 00:07:49,780 --> 00:07:55,000 At least this way you can pick up where you left off and you don't have to rerun the calculations that 94 00:07:55,000 --> 00:07:59,020 you've done before, and you can always work off like the same text file. 95 00:07:59,020 --> 00:08:00,550 You can always work off the same data. 96 00:08:00,670 --> 00:08:01,510 It's not gonna change.
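The checkpointing step could look roughly like the sketch below. It is written under assumptions: the three probability arrays are tiny placeholders, the files go into a temporary directory rather than the course's spam data folder, and the exact spelling of the constants and file names (including the `.txt` extension on the all-tokens file) is inferred from the spoken lesson rather than confirmed. `np.savetxt` and `np.loadtxt` are the real NumPy functions.

```python
import os
import tempfile

import numpy as np

# Placeholder probabilities; in the lesson each array has 2,500 entries.
prob_tokens_spam = np.array([0.5, 0.3, 0.2])
prob_tokens_nonspam = np.array([0.2, 0.3, 0.5])
prob_tokens_all = np.array([0.35, 0.30, 0.35])

# Assumption: a temporary directory stands in for the spam data folder.
with tempfile.TemporaryDirectory() as data_dir:
    testing_dir = os.path.join(data_dir, '03_Testing')
    os.makedirs(testing_dir)

    # Constant names and file names as heard in the lesson (spelling assumed).
    TOKEN_SPAM_PROB_FILE = os.path.join(testing_dir, 'prob-spam.txt')
    TOKEN_HAM_PROB_FILE = os.path.join(testing_dir, 'prob-nonspam.txt')
    TOKEN_ALL_PROB_FILE = os.path.join(testing_dir, 'prob-all-tokens.txt')

    # Three calls to savetxt: file path first, then the data to write.
    np.savetxt(TOKEN_SPAM_PROB_FILE, prob_tokens_spam)
    np.savetxt(TOKEN_HAM_PROB_FILE, prob_tokens_nonspam)
    np.savetxt(TOKEN_ALL_PROB_FILE, prob_tokens_all)

    # Picking up from the checkpoint later: read the values back in.
    reloaded_spam = np.loadtxt(TOKEN_SPAM_PROB_FILE)
    reloaded_nonspam = np.loadtxt(TOKEN_HAM_PROB_FILE)
    reloaded_all = np.loadtxt(TOKEN_ALL_PROB_FILE)
    saved_files = sorted(os.listdir(testing_dir))

print(saved_files)
print(reloaded_spam)
```

Reading the files back with `np.loadtxt` is what makes this a checkpoint: a later notebook can start from the saved probabilities without recomputing them.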