1
00:00:00,750 --> 00:00:01,620
We're almost there.

2
00:00:01,620 --> 00:00:03,930
We're almost there.

3
00:00:03,930 --> 00:00:11,190
In this lesson we'll calculate the probability that a token occurs given that the email is spam.

4
00:00:11,220 --> 00:00:16,740
So in this case we might calculate the probability of Viagra occurring in an email.

5
00:00:16,740 --> 00:00:22,920
Given that the email is spam and then afterwards we'll calculate the probability of a particular token

6
00:00:22,920 --> 00:00:30,630
occurring given that the email is a non spam message so that this as a markdown Cell C P parentheses

7
00:00:31,550 --> 00:00:45,400
token pipe spam parentheses problem ability that a token occurs given the email is spam.

8
00:00:45,520 --> 00:00:48,890
Now we're not going to just do this for a single token.

9
00:00:48,960 --> 00:00:52,360
We did do this for all the tokens simultaneously.

10
00:00:52,640 --> 00:01:00,240
I'm going to store our results of this calculation in a variable called prob on a score tokens on a

11
00:01:00,240 --> 00:01:04,440
school spam and I'll set that equal to the following.

12
00:01:04,440 --> 00:01:13,520
We're going to take our summed spam tokens and divide those by the spam word count.

13
00:01:13,520 --> 00:01:14,210
Right.

14
00:01:14,250 --> 00:01:22,440
This will divide our summed spam tokens are two thousand five hundred values of how often each word

15
00:01:22,770 --> 00:01:29,960
occurred in spam emails and we're dividing it by the total number of spam tokens.

16
00:01:30,690 --> 00:01:37,740
But there is one modification that I have to make this calculation and that modification has to do with

17
00:01:37,740 --> 00:01:39,330
what we did over here.

18
00:01:39,360 --> 00:01:42,460
It has to do with our Laplace smoothing technique.

19
00:01:42,570 --> 00:01:48,440
We added 1 2 hour summation because we don't want to end up dividing zero.

20
00:01:48,540 --> 00:01:54,450
Since there were two thousand five hundred words where we added 1 we have to balance that out again

21
00:01:54,870 --> 00:02:02,300
and this is why we're going to add the vocabulary size to our denominator here.

22
00:02:02,430 --> 00:02:09,810
We've added 1 two thousand five hundred times so therefore we also have to add two thousand five hundred

23
00:02:10,110 --> 00:02:15,030
to our spam would count to fully implement this smoothing technique.

24
00:02:15,630 --> 00:02:16,590
And that's it.

25
00:02:16,590 --> 00:02:22,590
We've done a lot of legwork so it's just one line of code for that calculation which is really really

26
00:02:22,590 --> 00:02:23,250
neat.

27
00:02:23,250 --> 00:02:26,810
The thing you're probably wondering about though is what are the probabilities actually look like.

28
00:02:26,880 --> 00:02:27,490
Right.

29
00:02:27,510 --> 00:02:32,430
Prob on a school tokens on a school spam square brackets.

30
00:02:32,430 --> 00:02:40,530
Colon 5 will show us the first couple of entries in this series so here we see the actual values that

31
00:02:40,710 --> 00:02:42,220
we get now.

32
00:02:42,300 --> 00:02:47,760
The numbers that we're working with aren't particularly large but if we sum them all up they should

33
00:02:47,760 --> 00:02:48,770
add up to 1.

34
00:02:48,780 --> 00:02:51,850
That's how probability works right.

35
00:02:52,000 --> 00:03:01,560
Probably underscore tokens on the score spam dot some parentheses will show us if our math ties out

36
00:03:02,160 --> 00:03:04,060
and I think it does.

37
00:03:04,080 --> 00:03:10,440
We've calculated the probability of the tokens given that the email is spam for all our two thousand

38
00:03:10,440 --> 00:03:12,190
five hundred entries.

39
00:03:12,190 --> 00:03:14,510
Now let's do the same thing the other way round.

40
00:03:14,550 --> 00:03:18,150
So I'll quickly copy my mark down sell here.

41
00:03:18,150 --> 00:03:19,620
Pasted in.

42
00:03:19,770 --> 00:03:27,230
Change this to mark down and change this to him and change this to non spam.

43
00:03:27,330 --> 00:03:29,250
This calculation is very similar.

44
00:03:29,250 --> 00:03:29,690
Right.

45
00:03:29,700 --> 00:03:37,740
Prob on a score tokens on a school non spam will hold on to the token probabilities.

46
00:03:37,740 --> 00:03:47,010
And that's gonna be sometime tokens divided by Open parentheses the non spam word count so the total

47
00:03:47,010 --> 00:03:55,860
number of words in the non spam messages and we're to add 2500 since we're also using the Laplace smoothing

48
00:03:55,860 --> 00:03:56,270
technique.

49
00:03:56,290 --> 00:03:58,390
Here There we go.

50
00:03:58,460 --> 00:04:05,990
That line calculates all the probabilities and can do a quick check here that these probabilities indeed

51
00:04:06,170 --> 00:04:12,840
some to 1 are very close to 1 but calling the summation method and hitting shift enter.

52
00:04:12,980 --> 00:04:13,490
There we go.

53
00:04:14,090 --> 00:04:15,670
I'm pretty happy with that.

54
00:04:16,300 --> 00:04:18,570
OK so where does this leave us.

55
00:04:18,590 --> 00:04:24,480
We've tackled the fraction in the numerator but we haven't tackled the fraction in the denominator yet.

56
00:04:24,500 --> 00:04:29,270
This here is the overall probability of a particular token.

57
00:04:29,420 --> 00:04:34,260
I'll add that as a markdown cell I'll call it p token.

58
00:04:34,420 --> 00:04:44,270
This is the probability that a token occurs regardless of whether we're dealing with spam or non spam

59
00:04:44,330 --> 00:04:45,750
emails.

60
00:04:45,800 --> 00:04:53,690
So what I'll do is I'll create a variable called prob underscore tokens on a score all and that will

61
00:04:53,690 --> 00:05:05,420
be equal to our full train features summed up across the columns axis is equal to zero divided by the

62
00:05:05,420 --> 00:05:07,390
total word count.

63
00:05:07,760 --> 00:05:08,720
That's it.

64
00:05:08,720 --> 00:05:11,880
Again this probability should sum to 1.

65
00:05:11,930 --> 00:05:16,060
So let's just quickly check if we've done this calculation right.

66
00:05:16,100 --> 00:05:17,220
Looks good.

67
00:05:17,270 --> 00:05:21,410
You'll notice that I haven't done any lap plus smoothing him because I'm pretty much guaranteed I'm

68
00:05:21,410 --> 00:05:27,320
not dividing a zero because we've taken the two thousand five hundred most frequent words in our whole

69
00:05:27,380 --> 00:05:30,150
dataset to be our features.

70
00:05:30,290 --> 00:05:37,220
I think it's a good time to save our train model to a text file and kind of create a checkpoint for

71
00:05:37,220 --> 00:05:38,260
ourselves.

72
00:05:39,490 --> 00:05:47,180
At the top of her notebook where we've got her Constance I'm going to copy these two lines paste the

73
00:05:47,230 --> 00:05:51,460
minigun and and there I'm going to create two new constants.

74
00:05:51,460 --> 00:05:58,400
The first one will be called token and a school spam on a scroll prob file.

75
00:05:58,410 --> 00:06:02,890
I'm going to save this work to a different folder namely our testing folder.

76
00:06:02,890 --> 00:06:10,750
So I'm going to replace 0 2 on it's got training with 0 3 on a score testing and this text file shall

77
00:06:10,750 --> 00:06:15,490
be called prob hyphen spam T X T.

78
00:06:15,490 --> 00:06:17,960
I'll do the same thing for non spam files.

79
00:06:18,040 --> 00:06:26,770
So token I'll just go ham on a scope problem on the score file will save us prob hyphen non spam dot

80
00:06:26,770 --> 00:06:27,960
t t.

81
00:06:28,000 --> 00:06:35,050
The goal is to save both of these text files 100 0 3 and a score testing inside our spam data folder.

82
00:06:35,050 --> 00:06:42,700
Going back up to the constants I'll quickly add another constant up here for all our tokens.

83
00:06:42,970 --> 00:06:51,710
So token on underscore all prob file I want to call this file prob hyphen all hyphen tokens knowledge

84
00:06:51,910 --> 00:06:53,350
shift enter on this.

85
00:06:53,680 --> 00:07:02,300
Then I'll come down here and this is where we save that trained model where again gonna use num PIs

86
00:07:02,930 --> 00:07:10,490
save text function and we're gonna call it three times first time for the spam probabilities where we're

87
00:07:10,490 --> 00:07:19,760
gonna save prob on the score tokens on a school spam we're gonna call it again for our ham tokens.

88
00:07:19,760 --> 00:07:29,870
So token I'd just go ham and just go prob just go file comma prob tokens non spam and we're gonna do

89
00:07:29,870 --> 00:07:33,200
it one more time for all the tokens.

90
00:07:33,230 --> 00:07:42,340
So this is gonna be ham and spam combined.

91
00:07:42,640 --> 00:07:43,720
There we go.

92
00:07:43,750 --> 00:07:49,780
Saving data to the disk as a file is a really nice way of creating like a checkpoint for yourself.

93
00:07:49,780 --> 00:07:55,000
At least this way you can pick up where you left off and you don't have to rerun the calculations that

94
00:07:55,000 --> 00:07:59,020
you've done before and you can always work off like the same text file.

95
00:07:59,020 --> 00:08:00,550
You can always work off the same data.

96
00:08:00,670 --> 00:08:01,510
It's not gonna change.