So far we've talked about basic probability, independence, joint probability and conditional probability. We're now ready to start talking about Bayes' theorem.

Bayes' theorem is where the Naive Bayes classifier gets its name. You see, a long, long time ago, in the 1700s in Kent, England, there lived a clergyman by the name of Reverend Thomas Bayes. Thomas Bayes got very, very interested in probability later on in his life, and he came up with a neat little trick.

Let's go back to our weather example. With the conditional probability formula, the hardest part was the joint probability term, namely: what is the probability of it raining and being cloudy? Now, it turns out one can phrase the conditional probability formula in another way. Instead of having that joint probability in there, we can have another conditional probability, namely: given that it is raining, what is the probability of it being cloudy? Now, this is different from: given that it is cloudy, what is the probability of rain? These two are not the same. Have you seen rain on a sunny day? Yes, but not very often, right? The probability of clouds when it is raining is probably around 95%. On the other hand, the probability of it raining given that it is cloudy is probably more around 40%. So these two numbers are as different as the two conditions. But the point I'm trying to make here is that swapping out the top part of the fraction in this formula is all Bayes' theorem is. Reverend Bayes basically figured out how to reverse a conditional probability in the formula to make the whole thing easier to calculate. The reversed conditional probability formula looks like this.

Here's the generic formula for Bayes' theorem in terms of A and B; you'll find this theorem very early on in any statistics textbook. But let's go back to our spam example. Given that the email has the word "Viagra" in it, what is the probability of this email being spam? Using Bayes' theorem, we can now calculate the same probability like this. Using this formula, the calculation becomes very, very easy, and that's why I'm harping on about this so much.

Let's tackle one term at a time, starting with the probability of spam. We've actually already discussed this: we figured out how common spam emails were.
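For reference, here is the generic formula in terms of A and B, and the same formula applied to our spam question, written out since the on-screen slide isn't visible in the transcript:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$$P(\text{spam} \mid \text{"Viagra"}) = \frac{P(\text{"Viagra"} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{"Viagra"})}$$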
That share of spam was 55% in 2017, if you remember. So let's plug that number in here: 0.55.

Now, what about the probability of the word "Viagra" coming up in an email? To calculate this number, we have to figure out how common the word "Viagra" is in emails as a whole. How often do people use the word "Viagra" in an email, in both spam and normal emails? Say we've looked at an enormous dataset of emails containing 700,000 words across thousands of emails, and we count how many times "Viagra" gets mentioned. If the word "Viagra" gets mentioned 75 times, then the probability of "Viagra" coming up in an email is 75 divided by 700,000. In this case, we're just applying basic probability, just like with our calculation of the probability of getting hit by lightning: we take the total number of times "Viagra" gets mentioned and we divide it by the total number of words that we looked at in our dataset.

Next, let's tackle the probability of "Viagra" being in an email given that the email is spam. What does this mean? This is a conditional probability: given that the email is spam, what is the probability that the email contains the word "Viagra"? How would we calculate this? Again, we look at the frequency of the word "Viagra", but this time just within the spam emails. This allows us to figure out the probability that an email contains the word "Viagra" given that it is spam. So, looking at our dataset of emails, we count the number of times the word "Viagra" occurs, and we look at the total number of words in our dataset that are in spam emails, say 370,000. If we find that the word "Viagra" appears 65 times, the conditional probability is simply 65 divided by 370,000.

And that's it. We've got all our numbers. We can work out the probability that an email is spam given that it contains the word "Viagra". And based on our data here, the probability is around 90%.

What this example shows is that the frequencies of a word really are key. By looking at the frequency of a word in spam messages versus all messages, our algorithm learns which words are spammy. The reason we think phrases like "online pharmacy", "double your cash" and so on are spammy is because these words often appear in spam emails.
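To double-check that arithmetic with the numbers from this lesson, here's a minimal sketch in plain Python:

```python
# Numbers quoted in the lesson
p_spam = 0.55                        # share of spam emails in 2017
p_viagra = 75 / 700_000              # P("Viagra") across all words in the dataset
p_viagra_given_spam = 65 / 370_000   # P("Viagra" | spam), counted within spam emails

# Bayes' theorem: P(spam | "Viagra")
p_spam_given_viagra = p_viagra_given_spam * p_spam / p_viagra
print(f"{p_spam_given_viagra:.1%}")  # prints 90.2%, the "around 90%" from the lesson
```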
But the story doesn't stop there. If we look at an email message, then we're going to have more than one word in the email body, right? Well, that depends on your friends, actually. But my point is that we can calculate the conditional probability for every single word in the emails in the entire dataset. Not only will we have worked out the conditional probability of an email being spam if it contains the word "Viagra", but we will have worked out the probabilities for all the other words as well, like "Expert", "Free", "Cash" and so on.

So now, when an email comes in, what if it has both the word "Free" and the word "Viagra" in it? What's the probability of the email being spam in that case? Let me rephrase that: given that the email has the word "Free" and the word "Viagra" in it, what is the probability that the email is spam? At this point, we've come full circle, all the way back to joint probability. Remember how, with the coin-flipping example, we simply multiplied the probability of heads times the probability of heads to figure out the chances of getting two heads in a row? The reason we could do that was because the two events were independent.

And this brings us to the naive part of the Naive Bayes classifier. The reason our algorithm is naive is because it assumes independence between the words in the email. In other words, if our email has both the word "Viagra" and the word "Free" in it, we can multiply the two probabilities together, and we can do this for every single word in the email. The more spammy the words are, the higher the final number.

Now, let's continue with this line of thinking. If an email has lots and lots of spammy words in it, it's probably spam. And if an email has very, very few spammy words in it, it's probably a normal email. So this brings us back to the final decision, where we compare two probabilities. Say we have an email with the words "Hello friend, want free viagra?". Well, using Bayes' rule we can calculate the conditional probabilities for all these words and figure out what the probability is that the email is spam. Then, using that independence assumption, we multiply all these probabilities together to get a final number, namely the probability that the email is spam. And then we can compare this number to the probability that this email is a normal email.
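To make that multiplication concrete, here's a minimal sketch. The "Viagra" number is the count from earlier in the lesson; the value for "Free" is a made-up placeholder:

```python
# Naive independence assumption: treat the words as independent, so
# P("Free" and "Viagra" | spam) = P("Free" | spam) * P("Viagra" | spam)
p_viagra_given_spam = 65 / 370_000   # from the lesson's spam word counts
p_free_given_spam = 0.002            # hypothetical placeholder value

p_both_given_spam = p_free_given_spam * p_viagra_given_spam
print(p_both_given_spam)
```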
So, back to "Hello friend, want free viagra?": for the normal-email side, the first term would be: given that this email contains the word "Hello", what is the probability that the email is a normal email? Next: given that this email contains the word "Want", what is the probability that we have a normal email? And so on. We then multiply all these probabilities together, and then we can do our comparison. This is the classification step that we talked about at the very, very beginning: we will classify this email simply based on which final number is higher.

What I've just described to you is called the bag-of-words approach for classifying documents. Each word is looked at in isolation, and the frequency of each word becomes a feature in our machine learning model. And the fact that we've looked at each word in isolation is why this model is called naive. We're treating each word separately. We're ignoring grammar. We're ignoring sarcasm. We treat the city name New York as two separate words, "New" and "York". We're treating the phrase "not bad" as two words, "not" and "bad". The context is lost with the bag-of-words approach. The dependencies between the words are ignored. We are assuming that the words are independent, like, well, words in a bag.

At this point, you're probably skeptical. Will this really work? This whole approach seems super strange, and the assumptions also seem really crude. Well, I don't want to spoil it for you. You will find out just how well this works in the coming lessons.

And now that we've covered the theory behind the Naive Bayes model, our path for the Python code is clear. The first step will involve extracting all the text from the email bodies, and this means moving on to the next parts of our project workflow, namely cleaning and exploring the data. For example, we will need to figure out just how common each word is in an email. We will need to do this for every single word, not just spammy words like "Viagra", but every word. How often do people use the word "Japan" in emails? How often do people use the word "weekend", "apple", "Android", "computer", "lawyer" or "friend" in emails?
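Here's a minimal sketch of that comparison for the "Hello friend, want free viagra?" email, assuming independence. All the per-word probabilities below are made-up placeholders; in the project, the real ones come from counting word frequencies in the dataset:

```python
# Hypothetical per-word conditional probabilities (placeholders for illustration)
p_word_given_spam = {"hello": 0.001, "friend": 0.001, "want": 0.002,
                     "free": 0.004, "viagra": 0.003}
p_word_given_normal = {"hello": 0.004, "friend": 0.003, "want": 0.002,
                       "free": 0.001, "viagra": 0.00001}
p_spam, p_normal = 0.55, 0.45        # prior chances of spam vs. normal email

email = ["hello", "friend", "want", "free", "viagra"]

# Multiply all the per-word probabilities together for each class
score_spam, score_normal = p_spam, p_normal
for word in email:
    score_spam *= p_word_given_spam[word]
    score_normal *= p_word_given_normal[word]

# Classify based on which final number is higher
print("spam" if score_spam > score_normal else "normal")
```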
We will need to count how often these words appear in all the emails in our dataset. We have over 5,000 different emails in the dataset that we're working with, and we will store all this text inside a pandas dataframe in order to manipulate it. We're going to do all that in the upcoming programming lessons. I'll see you there.

You know, I vividly remember learning about probability in school, and all my textbook exercises were constantly about urns. Always balls and urns: red balls from an urn, picking out a sequence of balls from an urn, picking out a particular ball out of an urn. Oh man, the memories.
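(As a small preview of the word-counting step mentioned above, here's a minimal pandas sketch; the toy emails and column names are made up, standing in for the real dataset of 5,000+ emails.)

```python
import pandas as pd

# Toy stand-in for the real dataset (column names are hypothetical)
emails = pd.DataFrame({
    "body": ["hello friend want free viagra",
             "how common is the word japan in emails",
             "free cash at the online pharmacy"],
    "is_spam": [1, 0, 1],
})

# Split each body into words, flatten, and count how often every word appears
word_counts = emails["body"].str.split().explode().value_counts()
print(word_counts.head())
```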