Now that we've discussed the final step on how our Naive Bayes Classifier makes decisions, we should talk about where the probabilities that feed into this decision making actually come from. And this means backing up a bit and talking about statistics. Now this is the part that no one gets excited about. I've yet to hear somebody say "Yes, statistics! My favorite topic!", but the thing is, statistics is everything in machine learning and this project will give us a chance to get some exposure.
All right. So we're interested in calculating the probability that an email is spam. However, it's not like we can simply work out the probability the same way we would work out the probability of, say, flipping heads on a coin flip or rolling a six with a die. With coins and dice, working out the probabilities is fairly straightforward. But let's review some of these concepts nonetheless because they're gonna come in handy later.
Now with coins, we know that there are two sides and a flip has a 50/50 chance of showing heads. With this die here, we know that there are six sides and we've got a 1 in 6, or roughly 17%, chance of rolling a six or any other particular number.
Now let me ask you a question outside of the realm of flipping coins and rolling dice: how would you work out your probability of getting hit by lightning?
Yeah, I know, you could probably ask Google and get the answer, but if you had to calculate it yourself, how would you do it? Well, the simplest way to do this is by dividing two numbers: the total number of times people get hit by lightning and the total number of lightning strikes.
Now I trawled through Wikipedia for you and I've dug out these figures. About 240,000 people are injured by lightning every year. Now, 240,000 actually sounds like quite a lot of people, but there are more than a thousand times as many lightning strikes. Every year around 350 million lightning bolts actually strike the ground. So then, just given these two numbers, what's the chance of you being hurt by lightning? Well, it'll be 240,000 divided by 350 million, or about 0.07%.
The point I'm trying to get across here is how to use basic probability. We took some observations, like the number of times lightning struck a person and the total number of times we observed lightning in a year, to calculate this figure.
Now, suppose we had to work out the chance of an email being spam. Any email that is, right, any email in the whole world. We can apply the same technique as in the lightning example. The chance of an email being spam should also depend on two things, namely one, how many spam emails were sent; and two, how many emails were sent in total.
With these two quantities in hand, we can work it out. So I trawled the internet and here's what I pulled up. In 2017, there were an estimated 148 billion spam emails sent. That's right, billion. And the total number of emails being sent was approximately 269 billion.
So that means, if a new email came into your inbox, the probability of that email being spam, at least in 2017, was about 55%. And this is simply based on the observed frequencies, namely the total number of spam emails divided by the total number of emails sent.
So you can think of calculating the basic probability as step one. We figured out the overall probability of spam, but we can't build a classifier with this alone. So what's step two?
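As a quick check on the arithmetic, here is a minimal Python sketch of the "divide the observed count by the total count" idea used for the die, the lightning strikes, and the spam emails. The function name `relative_frequency` and the variable names are made up for illustration, and the figures are just the rough estimates quoted in this lesson, not authoritative data.

```python
def relative_frequency(event_count, total_count):
    """Estimate a probability as (times the event was observed) / (total observations)."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return event_count / total_count


# Figures as quoted in the lesson (rough estimates).
lightning_injuries_per_year = 240_000
lightning_strikes_per_year = 350_000_000
spam_emails_2017 = 148_000_000_000
total_emails_2017 = 269_000_000_000

p_six = relative_frequency(1, 6)  # one favourable face out of six
p_lightning = relative_frequency(lightning_injuries_per_year, lightning_strikes_per_year)
p_spam = relative_frequency(spam_emails_2017, total_emails_2017)

print(f"Rolling a six:        {p_six:.0%}")        # roughly 17%
print(f"Injury per strike:    {p_lightning:.2%}")  # about 0.07%
print(f"Email is spam (2017): {p_spam:.0%}")       # about 55%
```

The same one-line division covers all three cases; only the counts change. Note that the lightning figure is really "injuries per ground strike" under these two numbers, which is the simplification the lesson makes.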