1
00:00:00,540 --> 00:00:06,900
Now that we've discussed the final step on how our Naive Bayes Classifier makes decisions, we should talk
1

2
00:00:06,900 --> 00:00:12,840
about where the probabilities that feed into this decision making actually come from.
2

3
00:00:12,840 --> 00:00:20,730
And this means backing up a bit and talking about statistics. Now this is the part that no one gets excited
3

4
00:00:20,730 --> 00:00:21,610
about.
4

5
00:00:21,630 --> 00:00:24,170
I've yet to hear somebody say "Yes, statistics!
5

6
00:00:24,180 --> 00:00:31,620
My favorite topic!", but the thing is, statistics is everything in machine learning and this project will
6

7
00:00:31,620 --> 00:00:34,770
give us a chance to get some exposure.
7

8
00:00:34,770 --> 00:00:35,100
All right.
8

9
00:00:35,190 --> 00:00:39,030
So we're interested in calculating the probability that an email is spam.
9

10
00:00:39,630 --> 00:00:45,960
However, it's not like we can simply work out the probability the same way we would work out the probability
10

11
00:00:46,020 --> 00:00:53,610
of, say, flipping heads on a coin flip or rolling a six with a die. With coins and dice, working out the
11

12
00:00:53,610 --> 00:00:56,640
probabilities is fairly straightforward.
12

13
00:00:56,640 --> 00:01:02,620
But let's review some of these concepts nonetheless because they're gonna come in handy later.
13

14
00:01:02,880 --> 00:01:12,300
Now with coins, we know that there are two sides and a flip has a 50/50 chance of showing heads. With this
14

15
00:01:12,300 --> 00:01:13,140
die here,
15

16
00:01:13,170 --> 00:01:20,550
we know that there are six sides and we've got a 1 in 6 or roughly 17% chance of rolling
16

17
00:01:20,610 --> 00:01:23,470
a six or any particular number.
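As a minimal sketch in Python, assuming a fair coin and a fair six-sided die, both figures come from dividing the favorable outcomes by the number of equally likely outcomes. The helper name equally_likely is made up for illustration.

```python
from fractions import Fraction


def equally_likely(favorable: int, total: int) -> Fraction:
    """Probability when every outcome is equally likely (hypothetical helper)."""
    return Fraction(favorable, total)


# Fair coin: one favorable side out of two.
p_heads = equally_likely(1, 2)

# Fair die: one favorable face out of six.
p_six = equally_likely(1, 6)

print(f"P(heads) = {float(p_heads):.0%}")  # 50%
print(f"P(six)   = {float(p_six):.0%}")    # 17%
```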
17

18
00:01:23,490 --> 00:01:29,490
Now let me ask you a question outside of the realm of flipping coins and rolling dice,
18

19
00:01:29,550 --> 00:01:33,690
how would you work out your probability of getting hit by lightning?
19

20
00:01:35,220 --> 00:01:41,490
Yeah, I know, you could probably ask Google and get the answer, but if you had to calculate it yourself,
20

21
00:01:41,940 --> 00:01:45,090
how would you do it? Well,
21

22
00:01:45,220 --> 00:01:51,940
the simplest way to do this is by dividing two numbers, the total number of times people get hit by lightning
22

23
00:01:53,080 --> 00:01:55,820
and the total number of lightning strikes.
23

24
00:01:55,900 --> 00:02:04,990
Now I trawled through Wikipedia for you and I've dug out these figures. About 240,000 people are injured
24

25
00:02:05,140 --> 00:02:07,220
by lightning every year.
25

26
00:02:07,360 --> 00:02:16,420
Now, 240,000 actually sounds like quite a lot of people, but there are orders of magnitude more lightning
26

27
00:02:16,420 --> 00:02:24,250
strikes. Every year around 350 million lightning bolts actually strike the ground.
27

28
00:02:24,250 --> 00:02:30,320
So then, just given these two numbers, what's the chance of you being hurt by lightning?
28

29
00:02:30,970 --> 00:02:39,600
Well, it'll be 240,000 divided by 350 million, or about 0.07%.
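The arithmetic can be checked with a few lines of Python. This is only a sketch using the two yearly figures quoted here; the variable names are chosen for illustration.

```python
# Yearly figures quoted in the lecture (approximate).
injuries_per_year = 240_000
strikes_per_year = 350_000_000

# Basic probability: observed events divided by total observations.
p_injury = injuries_per_year / strikes_per_year

print(f"{p_injury:.2%}")  # 0.07%
```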
29

30
00:02:39,640 --> 00:02:44,550
The point I'm trying to get across here is how to use basic probability.
30

31
00:02:44,650 --> 00:02:51,010
We took some observations, like the number of times lightning struck a person and the total number
31

32
00:02:51,010 --> 00:02:54,130
of times we observed lightning in a year
32

33
00:02:54,130 --> 00:02:56,290
to calculate this figure.
33

34
00:02:56,290 --> 00:03:02,430
Now, suppose we had to work out the chance of an email being spam.
34

35
00:03:02,440 --> 00:03:09,940
Any email that is, right, any email in the whole world. We can apply the same technique as in the lightning
35

36
00:03:09,940 --> 00:03:19,660
example. The chance of an email being spam should also depend on two things, namely one, how many spam
36

37
00:03:19,690 --> 00:03:29,140
emails were sent; and two, how many emails were sent in total. With these two quantities in hand,
37

38
00:03:29,200 --> 00:03:36,940
we can work it out. So I trawled the internet and here's what I pulled up. In 2017,
38

39
00:03:36,940 --> 00:03:42,910
there were an estimated 148 billion spam emails sent.
39

40
00:03:42,910 --> 00:03:52,330
That's right, billion. And the total number of emails being sent was approximately 269
40

41
00:03:52,330 --> 00:03:53,670
billion.
41

42
00:03:53,890 --> 00:04:02,740
So that means, if a new email comes into your inbox, the probability of that email being spam or having
42

43
00:04:02,740 --> 00:04:08,460
been spam in 2017 was 55%.
43

44
00:04:08,470 --> 00:04:15,580
And this is simply based on the observation of the frequencies, namely the total number of spam emails
44

45
00:04:15,700 --> 00:04:19,840
divided by the total number of emails sent.
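The same frequency-based estimate could be sketched in Python as follows. The function name relative_frequency is an assumption made for illustration, and the counts are the 2017 estimates quoted here.

```python
def relative_frequency(event_count: float, total_count: float) -> float:
    """Estimate a probability as observed events over total observations."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= event_count <= total_count:
        raise ValueError("event_count must lie between 0 and total_count")
    return event_count / total_count


# 2017 estimates quoted in the lecture.
spam_emails = 148e9
all_emails = 269e9

p_spam = relative_frequency(spam_emails, all_emails)
print(f"P(spam) = {p_spam:.0%}")  # 55%
```

The lightning example fits the same function: relative_frequency(240_000, 350_000_000) gives roughly 0.0007, or about 0.07%.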
45

46
00:04:19,840 --> 00:04:24,110
So you can think of calculating the basic probability as step one.
46

47
00:04:24,400 --> 00:04:31,510
We figured out the overall probability of spam, but we can't build a classifier with this alone.
47

48
00:04:31,510 --> 00:04:32,320
So what's step two?