Over the next couple of lessons, we will be training our Naive Bayes classifier. But what does this involve? The training step primarily involves calculating the token probabilities: we need to work out the probabilities for an individual word or token. How? If you haven't watched the lessons on probability theory and Bayes' theorem from the previous module, pause here and watch them now before you continue. You'll recall that the formula for conditional probability under Bayes' theorem looks like this.
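Written out (reconstructed here from the five numbers described below; the slide's exact notation may differ), the formula reads:

$$
P(\text{Spam}\mid\text{Viagra})
= \frac{P(\text{Viagra}\mid\text{Spam})\,P(\text{Spam})}{P(\text{Viagra})}
= \frac{\dfrac{\#(\text{Viagra in spam emails})}{\#(\text{words in spam emails})}\cdot P(\text{Spam})}{\dfrac{\#(\text{Viagra in all emails})}{\#(\text{words in all emails})}}
$$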
This formula gives us the probability of an email being spam, given that it contains the word Viagra. To work out the final numbers for this formula, we need a couple of things. First off, we need the overall probability of an email being spam. Second, we need the probability of Viagra occurring in any email. And third, we need the probability of an email containing the word Viagra, given that the email is spam. This calculation boils down to five numbers. So what are these numbers that we need to work this out?

The first number, at the top, is how often the token in question, in this case the word Viagra, appears in spam emails. The second number is the total number of words in spam emails. The third number is the probability of spam emails occurring in the first place. On this slide I've put this number at 55 percent, but we'll actually work it out from our training data.

In the denominator of this fraction, we have our fourth number: the total number of times that the word Viagra occurs in all emails, both spam and non-spam. And finally, we've got the fifth number: the total number of words that we looked at across all emails in the first place, both spam messages and non-spam messages, across the entire dataset.

The Python code that we're going to write next will tease out a couple of these numbers. With these numbers in hand, we can calculate the relevant probabilities for our tokens. This is effectively the training step for our Naive Bayes classifier. I'll add a quick markdown cell here that reads "Training the Naive Bayes Model", and for the subheading here we'll have "Calculating the Probability of Spam". On the slide I had this set at 55 percent, but I think we should work this out from our data. This actually makes for a very good challenge. I'd like you to quickly calculate the overall probability of spam; we'll assume that this is the percentage of spam messages in the training dataset. Store this value in a variable called prob_spam. I'll give you a few seconds to pause the video and tackle this challenge.

Ready? Here we go. The total number of messages that we have in our dataset is full_train_data.Category.size: we've got 4,014 messages. The number of spam messages is full_train_data.Category.sum(). The reason I can do this is because I've labelled the category of spam as one and the category of non-spam as zero, so simply summing up all the ones in my Category column will give me the number of spam messages, of which I've got 1,249. Therefore the probability of spam, prob_spam, is in fact equal to full_train_data.Category.sum() divided by full_train_data.Category.size, and we can print that out as "Probability of spam is", comma, prob_spam. There we go: it's around 31 percent. This is the first of the numbers that we're looking for in our Bayes theorem.
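As a minimal runnable sketch of that calculation, assuming (as in the lesson) a DataFrame called full_train_data with a Category column where spam is 1 and ham is 0; the tiny DataFrame below is a fabricated stand-in rather than the course data:

```python
import pandas as pd

# Fabricated stand-in for the course's training DataFrame:
# one row per email, Category is 1 for spam and 0 for ham.
full_train_data = pd.DataFrame({'Category': [1, 0, 1, 0, 0, 1]})

# Spam is labelled 1 and ham 0, so summing the column counts the spam
# messages, while .size gives the total number of messages.
prob_spam = full_train_data.Category.sum() / full_train_data.Category.size
print('Probability of spam is', prob_spam)  # 0.5 on this toy data
```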
The next number that we are looking for is the total number of words or tokens. We need to count the total number of tokens that we have in our dataset, and we also need to calculate the number of tokens that belong to the spam category and the number of tokens that belong to the non-spam category. But first, let's get the total number.

What I'm going to do is create a subset from our full_train_data DataFrame. I'll call the subset full_train_features, so this will be just the tokens. I'll set this equal to full_train_data.loc. The .loc is for selecting a number of rows: square brackets, colon, which selects all the rows. But I'm not going to select all the columns; I'll just select the columns with our word IDs. In other words, I leave out the category column, and .loc is perfect for this. To leave out the category column, I'll just say: take all the columns that you've got in full_train_data.columns, but don't pick the category. So I'll write not equal to, single quotes, 'Category'. That exclamation mark that you see here is Python syntax for the logical NOT, together with the equals sign. This logical condition reads "not equal to 'Category'", so I'll pick all 2,500 columns except the Category column. Now mind you, this string here is of course case sensitive. Let's look at the head of full_train_features and see what this DataFrame looks like. There you go: we've just excluded a particular column from the subset using logic.

Now, how do we get the total number of words? Well, let's take a two-step approach. Let's sum up all the tokens per email. So, if you think about it, we should sum up all the numbers in this row, then all the numbers in this row, and all the numbers in this row. That would be the total number of tokens per email. We can do this really, really easily with the sum function. I'm going to store this in a variable called email_lengths, so that will hold the total number of tokens per email. To use that sum functionality, I'll grab my features DataFrame, full_train_features, and I'll use the sum function, but I'll provide an argument here. I have to specify how this function should sum things: left to right, or top to bottom? We have to be more specific, and we can do this with the axis parameter. axis=1 will sum across the columns, giving us one total per row.

Let's take a look at what email_lengths.shape gives us. So we've done the summation, and we've got 4,014 emails, exactly what we want. The first five of these look like this: email_lengths, square brackets, colon, five. And we see that the first email has 50 tokens, the second email 76 tokens, the third email 87 tokens, and so on. So what we've got here is a handy little pandas Series of all the token counts. To get the total number of tokens, our total word count if you will, all we need to do is sum up the values in the entire series, right? I'll store this number in a variable for later use. So I'll say total_wc, for word count, and I'll set that equal to email_lengths.sum(). To see what that value is, I'll just put total_wc below and hit Shift+Enter, and what we see is that we've got approximately 446,000 tokens. Looking back at our formula here, we've now worked out this white number here and this blue number here.
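Here is a self-contained sketch of those three steps: the column filter, the per-email sums, and the grand total. The word-ID columns and counts below are fabricated stand-ins for the course data:

```python
import pandas as pd

# Fabricated stand-in: word-ID columns hold token counts per email.
full_train_data = pd.DataFrame({
    'Category': [1, 0, 1],
    0: [2, 0, 1],
    1: [0, 3, 4],
    2: [1, 1, 0],
})

# Keep every row, and every column except 'Category' (case sensitive).
full_train_features = full_train_data.loc[:, full_train_data.columns != 'Category']

# axis=1 sums across the columns: one total token count per email.
email_lengths = full_train_features.sum(axis=1)
print(email_lengths.shape)  # (number of emails,)
print(email_lengths[:5])    # token counts of the first few emails

# Summing the series gives the grand total word count for the dataset.
total_wc = email_lengths.sum()
print(total_wc)
```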
Now let's work out this green one: the total number of tokens within the spam messages, and also within the non-spam messages, which is not shown on this slide here. I'll add this as a markdown cell: "Number of Tokens in Spam and Ham Emails". I reckon this actually makes a good mini challenge. Can you create a subset of the email_lengths series that only contains the spam messages? Call the subset spam_lengths, and then count the total number of tokens that occur in this subset. Also do the same for the non-spam emails: create a subset called ham_lengths from the email_lengths series, and then count the number of words that occur in those non-spam, or ham, emails. I'll give you a few seconds to pause the video and give this a go.

You ready? Here's the solution. What I'm going to do is create this variable spam_lengths, and I'll subset it from email_lengths using the square bracket notation, where full_train_data.Category is equal to one. Double equals one, that is: that's the logical condition to create the subset from email_lengths. The number of emails that are spam is of course spam_lengths.shape, and here you can see we've got 1,249. To get the number of words, or the number of tokens, that are in the spam emails, I'll store this under spam_wc, for word count, and set that equal to spam_lengths.sum(). And if I print this out as an output, we can see that we've got approximately 195,000 tokens that are part of the spam emails.

That's the first part of the exercise. The second part of the exercise is doing the same thing for the ham emails, the non-spam emails. We can write ham_lengths, set that equal to email_lengths, and subset this whole thing based on the condition that full_train_data.Category is equal to zero. In this case, ham_lengths.shape will show us that we've got 2,765 ham emails.
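A short sketch of that subsetting, again with fabricated stand-in data; the boolean mask built from the Category column lines up with email_lengths because both share the same index:

```python
import pandas as pd

# Fabricated stand-ins: labels per email and total tokens per email,
# sharing one index, as in the lesson's DataFrame.
full_train_data = pd.DataFrame({'Category': [1, 0, 1, 0, 0]})
email_lengths = pd.Series([50, 76, 87, 23, 41])

# Boolean masks split the lengths into spam and ham subsets.
spam_lengths = email_lengths[full_train_data.Category == 1]
ham_lengths = email_lengths[full_train_data.Category == 0]
print(spam_lengths.shape)  # number of spam emails

# Total token count inside the spam emails.
spam_wc = spam_lengths.sum()
print(spam_wc)
```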
A good thing to do at this point, maybe, is to check your work: email_lengths.shape, square brackets, zero, minus spam_lengths.shape, square brackets, zero, minus ham_lengths.shape, square brackets, zero, should be equal to zero, because the two subsets together should have the same number of emails as our dataset as a whole, right? This is a quick check that you can do to make sure you've subset things correctly. The total non-spam word count, the word count for the ham messages, should be equal to ham_lengths.sum(), and this value is approximately 250,000. And again, I think this is a good time to do a quick check: spam_wc plus the non-spam word count, nonspam_wc, minus the total word count should be equal to zero. There we go.

I've actually got an add-on challenge for this; I just thought of an interesting question to ask. Do you think that the spam emails or the non-spam emails tend to be longer? Which category do you think has longer emails, on average? What I'd like you to do real quick is take a quick guess and then verify it in the data. How would you go about doing that?

Here's the solution. I want to print out the average number of words in spam emails. This is going to be spam_wc divided by spam_lengths.shape, square brackets, zero. The average number of words in the spam emails in our dataset is approximately 156.57. If this number is a little tough to read, we can actually format it much, much more nicely: with curly braces, colon, dot, zero, f (the f stands for float), I'll get rid of this comma, write .format, and put the calculation inside the method call. With this format, I'm rounding to the nearest whole number. This is a nice little trick if you want your print statements to look pretty and not give you a lot of digits after the decimal point. Let's compare this to the ham emails, the non-spam emails: we'll take the ham word count, or non-spam word count, and divide it by ham_lengths.shape, square brackets, zero. And just to show you how this string formatting works, I'm going to have a three here instead of a zero, so we'll get three decimal places in our output. There you go.
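Consolidating the sanity checks and the average-length comparison into one runnable sketch (all counts fabricated; the {:.0f} and {:.3f} format specs are the ones described above):

```python
import pandas as pd

# Fabricated stand-ins for the quantities computed earlier in the lesson.
email_lengths = pd.Series([50, 76, 87, 23, 41])
spam_lengths = email_lengths[[0, 2]]    # pretend emails 0 and 2 are spam
ham_lengths = email_lengths[[1, 3, 4]]  # the rest are ham
total_wc = email_lengths.sum()
spam_wc = spam_lengths.sum()
nonspam_wc = ham_lengths.sum()

# Both checks should print 0 if the subsets partition the data correctly.
print(email_lengths.shape[0] - spam_lengths.shape[0] - ham_lengths.shape[0])
print(spam_wc + nonspam_wc - total_wc)

# Average email length per category; {:.0f} rounds to a whole number,
# {:.3f} keeps three decimal places.
print('Average words in spam emails: {:.0f}'.format(spam_wc / spam_lengths.shape[0]))
print('Average words in ham emails: {:.3f}'.format(nonspam_wc / ham_lengths.shape[0]))
```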
So we can clearly see that spam emails actually tend to be wordier, which I think is quite interesting as well. Now that we've got the big headline numbers in our formula sorted, we need to take a closer look at the tokens themselves. We need to sum up the tokens that occur in the spam messages and the tokens that occur in the normal emails, and we need to do this for each token, for each single word ID. We'll get right on that in the next lesson. I'll see you there.