Okay, so step one was calculating the basic probabilities. It was gathering information on the frequencies of an event and figuring out the probability of that event happening based on that. In step two we're going to talk about a concept called the joint probability.

Say you flip a coin twice. What are the chances that you get heads both times? What's the probability that you get heads two times in a row?

The way you see this written in mathematical syntax is with this little upside-down "U" symbol. This symbol stands for "and". Also, every textbook out there will refer to an event, like the result of a coin flip, with a letter. So you'll see this written as "what's the probability of A and B?", where in our case A is getting heads on the first flip and B is getting heads on the second flip.

If probability is a new topic for you, then maybe just quickly pause the video and figure out what the probability is of getting heads two times in a row.

Did you have a go? If so, how did you figure this out? What was your approach?

One way that you can solve this kind of problem is to draw a matrix. This matrix has all the possible combinations in it, and there you can see that there's a total of four possible outcomes in the coin-flipping problem.
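If you'd like to check that matrix yourself, here is a minimal Python sketch (the variable names are my own) that enumerates every combination of two coin flips and counts how often both come up heads:

```python
from itertools import product

# Enumerate every combination of two coin flips, like the matrix above.
outcomes = list(product(["H", "T"], repeat=2))
print(outcomes)  # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]

# Only one of the four outcomes is heads on both flips.
both_heads = outcomes.count(("H", "H"))
print(both_heads, "out of", len(outcomes))  # 1 out of 4
```

Counting favorable outcomes out of all equally likely outcomes is exactly what the matrix does on screen, just automated.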
From our little chart here we can see that both coins showing heads only happens 1 in 4 times, so therefore the probability of getting heads on two flips in a row is 25 percent.

But, you know what? There's a better way to calculate a joint probability. All we have to do in our coin-flipping example is to multiply the probability of getting heads times the probability of getting heads. Since we have a 50/50 chance of getting heads, it's 0.5 * 0.5, which is 0.25, or 25%. In other words, we're multiplying the probability of A times the probability of B, and that's how we can get the probability of A and B.

Now, I think this formula is really, really handy when you don't want to draw a huge matrix, because say you want to calculate the probability of getting heads on a coin flip 4 times in a row. Well, simply by applying this formula, you know it's 0.5 * 0.5 * 0.5 * 0.5, which is equal to 6.25%.

So now let's use this technique with a die. Pause the video and, as a challenge, work out the probability of rolling a 6 on a die like this three times in a row.

Did you have a go? With a six-sided die you essentially have a 1 in 6 chance to roll any particular number. So rolling a 6 three times in a row has a joint probability of 1/6 * 1/6 * 1/6, and that's equal to approximately 0.46%.
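The multiplication rule described above can be sketched as a tiny helper function (a hypothetical name I've chosen for illustration) that works for any number of independent events:

```python
# Joint probability of independent events: multiply the individual probabilities.
def joint_probability(*probs):
    result = 1.0
    for p in probs:
        result *= p
    return result

print(joint_probability(0.5, 0.5))            # 0.25   -> two heads in a row
print(joint_probability(0.5, 0.5, 0.5, 0.5))  # 0.0625 -> four heads in a row
print(joint_probability(1/6, 1/6, 1/6))       # ~0.0046 -> three sixes in a row
```

The same one-liner replaces the matrix entirely, which is the whole point: no 6-by-6-by-6 grid needed for the dice question.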
So less than one half of a percent. Very, very unlikely.

Now, I know these were some very, very simple examples just now, but I hope you can see how rolling more and more sixes in a row gets less and less likely, and the same with flipping coins, right? It would be strange to see somebody get heads after heads after heads after heads, on a fair coin that is.

Now, personally I think this is the very friendly and simple side of probability, but it's still interesting, because let me ask you a question. Imagine you and I are hanging out on a Friday night, chilling, flipping coins and talking about probability, as we do. And I've just flipped my coin twice. Both times I got heads. What's the probability that I get heads on my third flip as well? Is it 1/2 * 1/2 * 1/2, or 0.125? No, no, no. It is not. Most definitely not. And the reason is that the past coin flips don't affect the next coin flip. Each coin flip is independent. The probability of me getting heads on that third flip is still 50%.

Now, this might seem obvious to you, but I've seen a lot of people struggle to accept this idea of independence. Intelligent people too.
It's like, you know, people don't want to accept this. And if you don't believe me, you can actually observe this behavior for yourself. All you need to do is go to a roulette table in a crowded casino and just wait and watch. Just watch people.

Now, if you've never played roulette or you're not familiar with this game: in the casino, people spin this wheel, and they win when they've predicted where the ball will land. People can bet on a number of things, but they can also bet on the colors, right, the color that the ball will land on. So in this game a lot of people love betting on a particular color. Right? Red or black. Will the ball end up on the red or on the black, for example?

Now, when the ball has been landing on black a few times in a row, you will see that people increasingly start betting on red, because they think it's the more likely outcome for the next spin. Intuitively, many people think: it's been black four times in a row, so on the next spin the ball just has to land on red. Right? But the truth is: red is not more likely. Why? Because each spin of the roulette wheel is independent. The wheel itself doesn't have any memory of the previous outcomes. OK?
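If you don't want to camp out in a casino, you can test this claim with a quick simulation, a sketch under simplifying assumptions (red/black only, ignoring the green zero): look at what happens right after every streak of four blacks.

```python
import random

# Simulate many fair red/black spins (green zero ignored for simplicity).
random.seed(42)
spins = [random.choice(["red", "black"]) for _ in range(200_000)]

# Collect the outcome that follows every run of four blacks in a row.
after_streak = [
    spins[i + 4]
    for i in range(len(spins) - 4)
    if spins[i:i + 4] == ["black"] * 4
]

p_red_after = sum(o == "red" for o in after_streak) / len(after_streak)
print(p_red_after)  # hovers around 0.5 -- red is NOT more likely after a streak
```

The fraction stays around one half no matter how long the preceding black streak was, which is independence in action.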
So that's a real-world example of this concept of independence in action.

And we've actually covered quite a few important concepts in probability so far. We've covered basic probability, which was very much based on observing the frequency of a particular event. We've covered this idea of independence, where two events have nothing to do with one another. And finally, we've also covered this idea of joint probability, and what we saw was that for independent events like rolls of dice and coin tosses, the formula for both A and B happening was simply multiplying the probability of A times the probability of B.

Now, let's take this discussion back to spam classification. In order to work out the probability of an email being spam, we should look at something other than the frequency of spam emails out there in the wild. Right? We should look at the contents of each individual email. This means looking at the message itself, which is, well, where it gets interesting.

Now, if you fire up Hotmail or Gmail right now and you look through your email spam folder, what kind of words do you tend to see in the message bodies of these emails? I bet you can kind of spot a pattern. You can spot certain themes coming up over and over again in your spam emails.
Certain words are just much more likely to come up in spam email than in regular, legitimate email. I'm thinking of words like "free", "access", "viagra", "loan", "online pharmacy", "adult", "great offer", "winner", "casino", "Bitcoin", what have you. Like, when was the last time a friend of yours sent you a legitimate email about an online pharmacy or free bitcoins, for example? So clearly there's going to be a clue in the message body as to whether an email is going to be a spam email or not, purely based on what the email is about.

So imagine a new email arrives in your inbox. The email body contains the word "Viagra". Should this email be classified as spam? Let me rephrase that question: given that this email contains the word "Viagra", what's the probability of this email being spam? This question that I just posed is all about conditional probability.

And this is where things are going to start getting pretty wild. Conditional probability is, in my personal opinion, both unintuitive and difficult, and we're about to get our hands dirty with some of these meaty statistics. Oh yes. You see, this is really, really worthwhile doing, actually.
Many of the basic statistical concepts, like probability, independence, joint probability and conditional probability, are used everywhere in machine learning, even if you don't see it right away. Even our bog-standard linear regression models come straight out of a statistics textbook. The nice thing about the Naive Bayes classifier is that we don't just use statistics and probability under the hood; we're going to use probability explicitly. Our Naive Bayes algorithm uses statistics in its raw and pure form to classify email spam. And conditional probability is at the very heart of it.

Conditional probability measures the probability of some event given that another event has occurred. The mathematical syntax looks like this. What we have here reads as the "probability of A given B". For example: what's the probability of me getting heads on my next coin flip, given that my last coin flip was heads? Well, we just talked about this: the first coin flip doesn't affect the second coin flip, so the two events are independent. The probability of me getting heads on that next flip is still 50%.

But, but, but: what if the two events are in fact dependent? Going back to our email example, what's the probability of an email being spam given that it contains the word "Viagra"?
In this case the probability of the email being spam actually depends on the probability that the email contains the word "Viagra". What? That sounds weird, right? Now, I don't expect you to accept this at face value. Let me try and explain this with a different example, where the intuition is much more clear.

Say you look out the window and it's cloudy outside. What's the probability of it raining today? In other words: given that it is cloudy, what is the probability of rain? Now, in this picture, it looks like it's about to rain any minute, but why rely on eyesight when you can have a mathematical formula?

You see, like all good things, conditional probability has a mathematical formula associated with it, and here it is: the probability of rain given that it is cloudy is equal to the probability of rain and it being cloudy, divided by the probability of it being cloudy. Hmm. OK, this looks like a very, very strange formula indeed, but the important learning point here is that the probability of it raining depends on the probability of it being cloudy, and that's the bottom part of the fraction here.
The bottom part of the fraction is the probability of it being cloudy. If it was cloudy for 250 days last year, then this part of the fraction is 250/365. Now, what about the top part of the fraction? Well, guess what: this is our old friend, joint probability. It's the amount of overlap between the days that were cloudy and the days that it rained.

So in London, it rains on approximately 107 days per year. But, you know what, two of those rainy days were actually fairly sunny and I saw a rainbow. So the number of days where it was both raining and cloudy was only 105. So the top part of that fraction is going to be 105/365. Now, doing this calculation gives us a conditional probability of approximately 42%. The probability of it raining given that it is cloudy outside is 42%, according to this formula.

I know this may sound a little bit trivial, but the really, really cool thing is that what we've just done is calculate the probability of something knowing something else, and that sounds a lot like fundamental machine learning, right? Knowing the movie budget, we can calculate expected movie revenue. Knowing the number of rooms, we can calculate the expected property price. Think of it being cloudy outside as a feature, and think of it raining today as your target, your y.
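The rain-and-clouds calculation above is just a few divisions, and a short Python sketch (using the counts from the example) makes the structure of the formula explicit:

```python
# Conditional probability from counts:
# P(rain | cloudy) = P(rain and cloudy) / P(cloudy)
days_in_year     = 365
cloudy_days      = 250   # days that were cloudy last year
rain_and_cloud   = 105   # days that were both rainy and cloudy

p_cloudy         = cloudy_days / days_in_year      # bottom of the fraction
p_rain_and_cloud = rain_and_cloud / days_in_year   # top: the joint probability

p_rain_given_cloudy = p_rain_and_cloud / p_cloudy
print(f"{p_rain_given_cloudy:.0%}")  # 42%
```

Notice that the two `/ days_in_year` factors cancel, so the answer is really just 105 cloudy-and-rainy days out of 250 cloudy days.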
So I hope this example shows how conditional probability is conceptually very applicable to machine learning, and if you open any statistics textbook out there, you'll see that conditional probability is expressed in very general terms, just like this: the probability of A given B is equal to the probability of A and B over the probability of B.

What would this formula look like in our spam example? Substituting for A and B, we get something like this: the probability of an email being spam, given that it contains the word "Viagra", is equal to the probability of an email being spam and containing the word "Viagra", over the probability of an email containing the word "Viagra".

Looking at this formula, there's one problem though, and that's the top term in this fraction: the joint probability. Calculating the joint probability was super easy when we were dealing with dice and with coins, because each event was independent. In that case, we could just multiply the probabilities of the events together. But with dependent events, like clouds and rain, or email and the word "Viagra", this calculation is no longer so simple. And that's why, to calculate the conditional probability and create our spam classifier, we have to take it to the next level. We have to use Bayes' theorem.
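To make the spam version of the formula concrete, here is a toy sketch with a tiny, made-up labeled dataset (the emails and counts are purely illustrative, not real data). With labels in hand, we can count our way to P(spam | "viagra") exactly as we counted cloudy and rainy days:

```python
# P(spam | "viagra") = P(spam and "viagra") / P("viagra"), from made-up counts.
emails = [
    {"spam": True,  "has_viagra": True},
    {"spam": True,  "has_viagra": False},
    {"spam": False, "has_viagra": False},
    {"spam": False, "has_viagra": False},
    {"spam": True,  "has_viagra": True},
    {"spam": False, "has_viagra": True},
]

n = len(emails)
p_viagra          = sum(e["has_viagra"] for e in emails) / n
p_spam_and_viagra = sum(e["spam"] and e["has_viagra"] for e in emails) / n

p_spam_given_viagra = p_spam_and_viagra / p_viagra
print(p_spam_given_viagra)  # 2 of the 3 "viagra" emails here are spam
```

This works because the toy dataset hands us the joint counts directly; for a real classifier we won't have such a luxury, which is exactly why the next step is Bayes' theorem.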