1
00:00:01,100 --> 00:00:04,100
First off, let's add a section heading here.

2
00:00:04,130 --> 00:00:16,860
Call it "Summing the Tokens Occurring in Spam". Now, our full_train_features data

3
00:00:16,860 --> 00:00:24,270
frame has the following shape: it's got 2,500 columns and it's got

4
00:00:24,510 --> 00:00:33,330
4,014 rows, because it contains both spam and non-spam messages. What I want to do is I want to

5
00:00:33,330 --> 00:00:41,550
create a subset, again, of this features DataFrame. I want to create a subset of all the rows that

6
00:00:41,700 --> 00:00:51,800
correspond to spam messages. I'm going to call the subset train_spam_tokens and

7
00:00:52,040 --> 00:01:00,410
I'll set that equal to full_train_features.loc.

8
00:01:00,410 --> 00:01:06,110
This is one of my favorite, favorite, favorite functions to create subsets, by the way.

9
00:01:07,250 --> 00:01:14,630
And then, within the square brackets, I'll just pick the rows that are spam messages, meaning I'll add

10
00:01:14,630 --> 00:01:22,520
a condition that the category should be equal to one: full_train_data.category

11
00:01:22,720 --> 00:01:25,350
double equals 1.
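
The subsetting step described above can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the lesson's actual notebook: the numbers are invented, and the column name `category` (lowercase) is an assumption based on how it is spoken.

```python
import pandas as pd

# Toy stand-ins for the lesson's DataFrames (all values are made up).
# Rows are document IDs, columns are word IDs.
full_train_features = pd.DataFrame(
    [[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 4]],
    index=[0, 1, 5, 7],
)
# Assumed column name 'category': 1 = spam, 0 = non-spam.
full_train_data = pd.DataFrame({"category": [1, 1, 0, 1]}, index=[0, 1, 5, 7])

# A boolean condition inside .loc keeps only the rows where it is True.
train_spam_tokens = full_train_features.loc[full_train_data.category == 1]

print(train_spam_tokens.shape)   # rows kept, columns unchanged
print(train_spam_tokens.head())  # first rows of the spam-only subset
```

Both DataFrames share the same index here, which is what lets the boolean condition line up with the rows of the features table.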

12
00:01:25,620 --> 00:01:32,350
Let's take a look at train_spam_tokens.head(), the first five rows.

13
00:01:32,370 --> 00:01:43,440
There we go: emails 0, 1, 2, 3, 4 are all spam messages. The last few emails in train_spam_tokens

14
00:01:43,510 --> 00:01:49,900
we can pull up with .tail(), and they look like this.

15
00:01:50,800 --> 00:01:56,230
So we've got document IDs 1885, 1887, 1889, 1890 and

16
00:01:56,230 --> 00:01:59,700
1895 as spam messages.

17
00:02:00,100 --> 00:02:07,810
The overall shape of this DataFrame should be 1,249 by

18
00:02:07,810 --> 00:02:09,250
2,500, right?

19
00:02:09,280 --> 00:02:14,800
This is what we've ascertained: the number of spam messages in our training dataset is

20
00:02:14,800 --> 00:02:16,280
1,249.

21
00:02:16,360 --> 00:02:22,900
So let's just quickly verify that this is exactly what we've got here and that our subsetting has

22
00:02:22,900 --> 00:02:25,200
gone according to plan.

23
00:02:25,210 --> 00:02:32,790
Now, what I want to do is I want to sum up all these tokens column by column, right?

24
00:02:32,860 --> 00:02:37,620
I want to have all the tokens for word ID number zero summed up.

25
00:02:38,170 --> 00:02:43,760
I want to have all the tokens in column 1 summed up, and so on.

26
00:02:43,900 --> 00:02:52,630
Across all the word IDs, what I want to end up with is a pandas Series of all the word IDs with

27
00:02:52,630 --> 00:02:59,120
the number of times that these tokens occur in spam messages.

28
00:02:59,140 --> 00:03:07,360
This is my goal, but I'm going to save this Series in a variable as well. I'll call it summed_spam_tokens

29
00:03:07,370 --> 00:03:16,550
and set that equal to train_spam_tokens.sum().

30
00:03:16,760 --> 00:03:23,110
Now, instead of axis being one and summing across a row, I want axis to be equal to zero.

31
00:03:23,230 --> 00:03:32,130
To sum across a column, axis equal to zero will be for the column.
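
To make the axis argument concrete, here is a small sketch on invented toy counts (two emails, three word IDs). The variable names `per_word` and `per_email` are hypothetical and only used for illustration.

```python
import pandas as pd

# Toy spam subset: rows are document IDs, columns are word IDs (made-up counts).
train_spam_tokens = pd.DataFrame([[2, 0, 1], [3, 0, 4]], index=[0, 7])

# axis=0 collapses the rows: one total per column, i.e. per word ID.
per_word = train_spam_tokens.sum(axis=0)

# axis=1 collapses the columns: one total per row, i.e. per email.
per_email = train_spam_tokens.sum(axis=1)

print(per_word)   # Series indexed by word ID
print(per_email)  # Series indexed by document ID
```

With axis=0 the result has one entry per word ID, which is the shape the lesson is after.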

32
00:03:32,190 --> 00:03:37,210
Now, there's one final thing I want to do on this line, and this has to do with the fact that we're gonna

33
00:03:37,230 --> 00:03:46,440
be calculating a probability later on: I'm gonna add one to my summation.

34
00:03:46,540 --> 00:03:52,930
Would you like to guess why I'm doing this? Why am I seemingly arbitrarily adding one to this calculation?

35
00:03:54,760 --> 00:03:59,650
Well, the reason is that we're gonna be doing a division later down the line, right?

36
00:03:59,670 --> 00:04:05,290
We're gonna be dividing the number of occurrences by the total number of words, either in the numerator

37
00:04:05,740 --> 00:04:06,780
or the denominator.

38
00:04:07,880 --> 00:04:12,540
And we're gonna be doing this for all our tokens across our spam messages.

39
00:04:12,760 --> 00:04:19,990
And if one of these tokens doesn't actually occur within the spam messages, we've got zero divided by

40
00:04:20,140 --> 00:04:22,500
the total number of tokens.

41
00:04:22,500 --> 00:04:31,600
So I'm adding one here to make this calculation non-zero later on. This technique actually has a name.

42
00:04:31,660 --> 00:04:36,100
It's named after a French mathematician and it's called Laplace smoothing.
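
A rough sketch of why the added one matters, using invented counts. The exact probability formula comes in a later lesson, so dividing each count by the total, as done here, is an assumption made purely for illustration.

```python
import pandas as pd

# Toy per-word totals across spam emails; word ID 1 never occurs in spam.
raw_counts = pd.Series([5, 0, 5])

# Add-one (Laplace) smoothing: every word gets a count of at least 1.
smoothed_counts = raw_counts + 1

# Assumed normalisation: share of all token occurrences held by each word.
raw_share = raw_counts / raw_counts.sum()
smoothed_share = smoothed_counts / smoothed_counts.sum()

print(raw_share)       # word 1 has a share of exactly zero
print(smoothed_share)  # word 1 now has a small, non-zero share
```

A share of exactly zero is a problem further down the line, because multiplying by it or dividing by it wipes out or breaks the whole calculation. After smoothing, every share is strictly positive and the shares still sum to one.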

43
00:04:37,300 --> 00:04:45,380
Here's what our summed_spam_tokens look like after we've created the subset and summed up all the values.

44
00:04:45,370 --> 00:04:53,290
It's simply a pandas Series with 2,500 entries. The last few entries in this Series

45
00:04:53,790 --> 00:05:02,000
look like this: summed_spam_tokens.tail() shows me that the word with word ID

46
00:05:02,210 --> 00:05:11,220
2499 occurs a total of six times across all our spam messages. Now, I

47
00:05:11,220 --> 00:05:17,230
think you can repeat the process for ham messages as well, so I'll leave this to you as a challenge.

48
00:05:17,280 --> 00:05:24,210
Sum up the tokens that occur for the non-spam messages, and then store these values in a variable called

49
00:05:24,310 --> 00:05:26,700
summed_ham_tokens.

50
00:05:30,070 --> 00:05:30,550
Ready?

51
00:05:30,550 --> 00:05:32,320
Here's the solution.

52
00:05:32,670 --> 00:05:40,650
I'll add a markdown cell here: "Summing the Tokens Occurring in Ham".

53
00:05:41,200 --> 00:05:48,340
Once again, I'll create a DataFrame called train_ham_tokens, and this will

54
00:05:48,340 --> 00:05:53,190
be a subset of our full_train_features.

55
00:05:53,530 --> 00:06:02,500
But at the following locations: namely, where full_train_data.category is equal

56
00:06:02,500 --> 00:06:10,340
to our non-spam category. Double equals zero imposes this logical condition.

57
00:06:13,250 --> 00:06:22,590
The summed_ham_tokens are gonna be equal to this DataFrame, train_ham_tokens,

58
00:06:24,520 --> 00:06:33,330
.sum(), parentheses, axis is equal to zero, plus one.

59
00:06:33,400 --> 00:06:39,340
We're going to apply our Laplace smoothing technique once again, and that's it.

60
00:06:39,340 --> 00:06:40,530
Let's take a quick look.

61
00:06:40,660 --> 00:06:49,560
If we've done this correctly, summed_ham_tokens.shape should be, as expected,

62
00:06:49,560 --> 00:06:56,400
2,500. The last few items in this Series look as follows.

63
00:06:56,500 --> 00:07:05,990
summed_ham_tokens.tail(). And I can even do a spot check on this by going to

64
00:07:06,040 --> 00:07:14,230
train_ham_tokens, square brackets, 2499, .sum(), summing just

65
00:07:14,230 --> 00:07:20,140
that one, and adding one, and that should tie out with this entry right here.
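
The spot check described above could look something like this on toy data. The counts are invented and there are only three word IDs instead of 2,500, so the last column stands in for word ID 2499; `last_word_id` and `spot_check` are hypothetical names.

```python
import pandas as pd

# Toy ham subset: rows are document IDs, columns are word IDs (made-up counts).
train_ham_tokens = pd.DataFrame([[1, 0, 2], [0, 0, 3]], index=[5, 9])

# Column-wise totals with add-one smoothing, as in the lesson.
summed_ham_tokens = train_ham_tokens.sum(axis=0) + 1

# Spot check: sum a single column by hand, add one, and compare.
last_word_id = train_ham_tokens.columns[-1]  # plays the role of word ID 2499
spot_check = train_ham_tokens[last_word_id].sum() + 1

print(summed_ham_tokens.shape)  # one entry per word ID
print(summed_ham_tokens.tail())
print(spot_check == summed_ham_tokens[last_word_id])
```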

66
00:07:20,230 --> 00:07:21,360
Brilliant.

67
00:07:21,700 --> 00:07:29,920
In the next lesson, we're gonna be calculating the probability that a token occurs given that the email

68
00:07:29,920 --> 00:07:31,500
is spam.

69
00:07:31,570 --> 00:07:36,840
Here we're gonna be calculating our conditional probability. I'll see you in the next lesson.

70
00:07:36,860 --> 00:07:37,330
Take care.