So far we've talked about basic probability, independence, joint probability and conditional probability. We're now ready to start talking about Bayes' theorem.

Bayes' theorem is where the Naive Bayes classifier gets its name. You see, a long, long time ago, in the 1700s in Kent, England, there lived a clergyman by the name of Reverend Thomas Bayes. Thomas Bayes got very, very interested in probability later on in his life, and he came up with a neat little trick.

Let's go back to our weather example. With the conditional probability formula, the hardest part was the joint probability term, namely: what is the probability of it raining and being cloudy? Now, it turns out one can phrase the conditional probability formula in another way. Instead of having that joint probability in there, we can have another conditional probability, namely: given that it is raining, what is the probability of it being cloudy? Now, this is different from: given that it is cloudy, what is the probability of rain? These two are not the same. Have you seen rain on a sunny day? Yes, but not very often, right? The probability of clouds when it is raining is probably around 95%. On the other hand, the probability of it raining given that it is cloudy is probably more around 40%. So these two numbers are as different as the two conditions. But the point I'm trying to make here is that swapping out the top part of the fraction in this formula is all Bayes' theorem is. Reverend Bayes basically figured out how to reverse a conditional probability in the formula to make the whole thing easier to calculate. The reversed conditional probability formula looks like this.

Here's the generic formula for Bayes' theorem in terms of A and B; you'll find this theorem very early on in any statistics textbook. But let's go back to our spam example. Given that the email has the word "Viagra" in it, what is the probability of this email being spam? Using Bayes' theorem, we can now calculate the same probability like this. Using this formula, the calculation becomes very, very easy, and that's why I'm harping on about this so much.

Let's tackle one term at a time, starting with the probability of spam. We've actually already discussed this: we figured out how common spam emails were.
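For reference, here is the generic formula in terms of A and B, and the same formula applied to our spam question, written out since the on-screen slide isn't visible in the transcript:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$$P(\text{spam} \mid \text{"Viagra"}) = \frac{P(\text{"Viagra"} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{"Viagra"})}$$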
That share of spam was 55% in 2017, if you remember. So let's plug that number in here: 0.55.

Now, what about the probability of the word "Viagra" coming up in an email? To calculate this number, we have to figure out how common the word "Viagra" is in emails as a whole. How often do people use the word "Viagra" in an email, in both spam and normal emails? Say we've looked at an enormous dataset of emails containing 700,000 words across thousands of emails, and we count how many times "Viagra" gets mentioned. If the word "Viagra" gets mentioned 75 times, then the probability of "Viagra" coming up in an email is 75 divided by 700,000. In this case, we're just applying basic probability, just like with our calculation of the probability of getting hit by lightning: we take the total number of times "Viagra" gets mentioned and we divide it by the total number of words that we looked at in our dataset.

Next, let's tackle the probability of "Viagra" being in an email given that the email is spam. What does this mean? This is a conditional probability: given that the email is spam, what is the probability that the email contains the word "Viagra"? How would we calculate this? Again, we look at the frequency of the word "Viagra", but this time just within the spam emails. This allows us to figure out the probability that an email contains the word "Viagra" given that it is spam. So, looking at our dataset of emails, we count the number of times the word "Viagra" occurs, and we look at the total number of words in our dataset that are in spam emails, say 370,000. If we find that the word "Viagra" appears 65 times, the conditional probability is simply 65 divided by 370,000.

And that's it. We've got all our numbers. We can work out the probability that an email is spam given that it contains the word "Viagra". And based on our data here, the probability is around 90%.

What this example shows is that the frequencies of a word really are key. By looking at the frequency of a word in spam messages versus all messages, our algorithm learns which words are spammy. The reason we think phrases like "online pharmacy", "double your cash" and so on are spammy is because these words often appear in spam emails.
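To double-check that arithmetic with the numbers from this lesson, here's a minimal sketch in plain Python:

```python
# Numbers quoted in the lesson
p_spam = 0.55                        # share of spam emails in 2017
p_viagra = 75 / 700_000              # P("Viagra") across all words in the dataset
p_viagra_given_spam = 65 / 370_000   # P("Viagra" | spam), counted within spam emails

# Bayes' theorem: P(spam | "Viagra")
p_spam_given_viagra = p_viagra_given_spam * p_spam / p_viagra
print(f"{p_spam_given_viagra:.1%}")  # prints 90.2%, the "around 90%" from the lesson
```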
But the story doesn't stop there. If we look at an email message, then we're going to have more than one word in the email body, right? Well, that depends on your friends, actually. But my point is that we can calculate the conditional probability for every single word in the emails in the entire dataset. Not only will we have worked out the conditional probability of an email being spam if it contains the word "Viagra", but we will have worked out the probabilities for all the other words as well, like "Expert", "Free", "Cash" and so on.

So now, when an email comes in, what if it has both the word "Free" and the word "Viagra" in it? What's the probability of the email being spam in that case? Let me rephrase that: given that the email has the word "Free" and the word "Viagra" in it, what is the probability that the email is spam? At this point, we've come full circle, all the way back to joint probability. Remember how, with the coin-flipping example, we simply multiplied the probability of heads times the probability of heads to figure out the chances of getting two heads in a row? The reason we could do that was because the two events were independent.

And this brings us to the naive part of the Naive Bayes classifier. The reason our algorithm is naive is because it assumes independence between the words in the email. In other words, if our email has both the word "Viagra" and the word "Free" in it, we can multiply the two probabilities together, and we can do this for every single word in the email. The more spammy the words are, the higher the final number.

Now, let's continue with this line of thinking. If an email has lots and lots of spammy words in it, it's probably spam. And if an email has very, very few spammy words in it, it's probably a normal email. So this brings us back to the final decision, where we compare two probabilities. Say we have an email with the words "Hello friend, want free viagra?". Well, using Bayes' rule we can calculate the conditional probabilities for all these words and figure out what the probability is that the email is spam. Then, using that independence assumption, we multiply all these probabilities together to get a final number, namely the probability that the email is spam. And then we can compare this number to the probability that this email is a normal email.
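To make that multiplication concrete, here's a minimal sketch. The "Viagra" number is the count from earlier in the lesson; the value for "Free" is a made-up placeholder:

```python
# Naive independence assumption: treat the words as independent, so
# P("Free" and "Viagra" | spam) = P("Free" | spam) * P("Viagra" | spam)
p_viagra_given_spam = 65 / 370_000   # from the lesson's spam word counts
p_free_given_spam = 0.002            # hypothetical placeholder value

p_both_given_spam = p_free_given_spam * p_viagra_given_spam
print(p_both_given_spam)
```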
So, back to "Hello friend, want free viagra?": for the normal-email side, the first term would be: given that this email contains the word "Hello", what is the probability that the email is a normal email? Next: given that this email contains the word "Want", what is the probability that we have a normal email? And so on. We then multiply all these probabilities together, and then we can do our comparison. This is the classification step that we talked about at the very, very beginning: we will classify this email simply based on which final number is higher.

What I've just described to you is called the bag-of-words approach for classifying documents. Each word is looked at in isolation, and the frequency of each word becomes a feature in our machine learning model. And the fact that we've looked at each word in isolation is why this model is called naive. We're treating each word separately. We're ignoring grammar. We're ignoring sarcasm. We treat the city name New York as two separate words, "New" and "York". We're treating the phrase "not bad" as two words, "not" and "bad". The context is lost with the bag-of-words approach. The dependencies between the words are ignored. We are assuming that the words are independent, like, well, words in a bag.

At this point, you're probably skeptical. Will this really work? This whole approach seems super strange, and the assumptions also seem really crude. Well, I don't want to spoil it for you. You will find out just how well this works in the coming lessons.

And now that we've covered the theory behind the Naive Bayes model, our path for the Python code is clear. The first step will involve extracting all the text from the email bodies, and this means moving on to the next parts of our project workflow, namely cleaning and exploring the data. For example, we will need to figure out just how common each word is in an email. We will need to do this for every single word, not just spammy words like "Viagra", but every word. How often do people use the word "Japan" in emails? How often do people use the word "weekend", "apple", "Android", "computer", "lawyer" or "friend" in emails?
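Here's a minimal sketch of that comparison for the "Hello friend, want free viagra?" email, assuming independence. All the per-word probabilities below are made-up placeholders; in the project, the real ones come from counting word frequencies in the dataset:

```python
# Hypothetical per-word conditional probabilities (placeholders for illustration)
p_word_given_spam = {"hello": 0.001, "friend": 0.001, "want": 0.002,
                     "free": 0.004, "viagra": 0.003}
p_word_given_normal = {"hello": 0.004, "friend": 0.003, "want": 0.002,
                       "free": 0.001, "viagra": 0.00001}
p_spam, p_normal = 0.55, 0.45        # prior chances of spam vs. normal email

email = ["hello", "friend", "want", "free", "viagra"]

# Multiply all the per-word probabilities together for each class
score_spam, score_normal = p_spam, p_normal
for word in email:
    score_spam *= p_word_given_spam[word]
    score_normal *= p_word_given_normal[word]

# Classify based on which final number is higher
print("spam" if score_spam > score_normal else "normal")
```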
We will need to count how often these words appear in all the emails in our dataset. We have over 5,000 different emails in the dataset that we're working with, and we will store all this text inside a pandas dataframe in order to manipulate it. We're going to do all that in the upcoming programming lessons. I'll see you there.

You know, I vividly remember learning about probability in school, and all my textbook exercises were constantly about urns. Always balls and urns: red balls from an urn, picking out a sequence of balls from an urn, picking out a particular ball out of an urn. Oh man, the memories.
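(As a small preview of the word-counting step mentioned above, here's a minimal pandas sketch; the toy emails and column names are made up, standing in for the real dataset of 5,000+ emails.)

```python
import pandas as pd

# Toy stand-in for the real dataset (column names are hypothetical)
emails = pd.DataFrame({
    "body": ["hello friend want free viagra",
             "how common is the word japan in emails",
             "free cash at the online pharmacy"],
    "is_spam": [1, 0, 1],
})

# Split each body into words, flatten, and count how often every word appears
word_counts = emails["body"].str.split().explode().value_counts()
print(word_counts.head())
```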