1 00:00:00,390 --> 00:00:00,780 All right. 2 00:00:00,810 --> 00:00:04,380 So this lesson is where all the pieces are gonna come together: 3 00:00:04,470 --> 00:00:11,040 conditional probability, joint probability, the independence assumption, and Bayes' theorem. 4 00:00:11,160 --> 00:00:18,450 Now, I highly encourage you to watch the probability explainer video in the previous module if you haven't 5 00:00:18,450 --> 00:00:19,860 done so already. 6 00:00:20,070 --> 00:00:27,180 But as a quick recap: to calculate the joint probability of two events happening, say A and B, we simply 7 00:00:27,180 --> 00:00:34,350 multiply the probability of A times the probability of B. And the reason this formula is so simple 8 00:00:34,770 --> 00:00:36,970 is because we're making a strong assumption here. 9 00:00:37,050 --> 00:00:42,960 We're assuming the events are independent. In the case of our spam classifier, 10 00:00:43,050 --> 00:00:46,620 we're calculating the probabilities of these tokens, right? 11 00:00:46,620 --> 00:00:54,180 We're calculating the probability that an email is spam given that it contains a particular token. 12 00:00:54,180 --> 00:01:00,270 So if the email contains two tokens, say the word Viagra and the word free, 13 00:01:00,270 --> 00:01:07,590 then we calculate the joint probability simply by multiplying the two conditional probabilities together: 14 00:01:08,220 --> 00:01:13,800 the probability that the whole email is spam is simply the product of the probability that the email is 15 00:01:13,800 --> 00:01:14,410 spam, 16 00:01:14,490 --> 00:01:21,060 given that it contains the word Viagra, times the probability that the email is spam given that it 17 00:01:21,060 --> 00:01:24,080 contains the word free. 18 00:01:24,150 --> 00:01:25,590 So if you had a sentence like this: 19 00:01:25,860 --> 00:01:26,800 Hello friend. 20 00:01:26,820 --> 00:01:28,440 Want free Viagra.
21 00:01:28,710 --> 00:01:36,270 Then the probability that this email is spam is the result of multiplying all these conditional 22 00:01:36,270 --> 00:01:39,420 probabilities together. Now, 23 00:01:39,450 --> 00:01:44,610 so far what we've been doing is working out a lot of these numbers that are on the right- 24 00:01:44,610 --> 00:01:46,500 hand side of this equation here. 25 00:01:46,650 --> 00:01:52,290 We've been counting the occurrences of our tokens and we've been working out all these colored terms, 26 00:01:52,920 --> 00:02:00,030 and we've saved all these calculations in the text files that we've imported into this notebook. 27 00:02:00,120 --> 00:02:07,230 The thing that we haven't done, though, is multiply all these values together for all the tokens, and that's 28 00:02:07,230 --> 00:02:09,010 what we're gonna do now. 29 00:02:09,060 --> 00:02:14,970 So if we just had one term, say the word Viagra, then the formula for Bayes' theorem is pretty straightforward. 30 00:02:14,970 --> 00:02:21,750 It's what we've discussed previously: the probability that an email is spam given that it contains the 31 00:02:21,750 --> 00:02:29,610 word Viagra is equal to the probability of Viagra occurring given that the email is spam, divided by 32 00:02:29,910 --> 00:02:37,890 the probability of Viagra occurring in any email, spam or not spam, times the probability of receiving 33 00:02:37,950 --> 00:02:40,350 a spam email. 34 00:02:40,350 --> 00:02:47,130 Now what if we had a whole array of tokens instead of a single token for the word Viagra? 35 00:02:47,130 --> 00:02:53,910 What if we're dealing with our X_test? X_test, 36 00:02:53,910 --> 00:03:01,830 if you recall, looks something like this. It's a NumPy array with over a thousand rows, over a thousand 37 00:03:01,830 --> 00:03:07,210 columns, and it's got integer values for all the tokens in our emails.
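For reference, the single-token form of Bayes' theorem described above can be written out as below. The labels Spam and Viagra are just shorthand for "the email is spam" and "the email contains the word Viagra"; they are notation chosen here, not taken from the lesson's slides.

```latex
P(\text{Spam} \mid \text{Viagra})
  = \frac{P(\text{Viagra} \mid \text{Spam}) \, P(\text{Spam})}{P(\text{Viagra})}
```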
38 00:03:07,380 --> 00:03:13,840 How do we adapt Bayes' theorem to deal with all of these tokens at the same time? 39 00:03:13,980 --> 00:03:22,340 Thinking back to a distant math class on linear algebra, we can adapt this formula here for all our tokens 40 00:03:22,350 --> 00:03:23,610 like so. 41 00:03:23,610 --> 00:03:25,740 So the probability that an email is spam, 42 00:03:25,770 --> 00:03:31,260 given that it contains a set of tokens, will be equal to the same terms that we have above for one token, 43 00:03:31,920 --> 00:03:42,190 but multiplied by our array of tokens. We have to multiply our probabilities by our X_test 44 00:03:42,300 --> 00:03:50,080 NumPy array, and the technique that we're going to use is called the dot product. The dot product is 45 00:03:50,110 --> 00:03:57,000 simply one way of multiplying two matrices, and it's the dot product that will help us multiply our 46 00:03:57,000 --> 00:04:02,480 X_test NumPy array with our probability NumPy arrays. 47 00:04:02,550 --> 00:04:05,450 Let me quickly show you how the dot product works in practice 48 00:04:05,490 --> 00:04:13,660 before we implement it on our project. So let me add a quick markdown cell here that reads "Calculating the 49 00:04:13,660 --> 00:04:17,210 Joint Probability". 50 00:04:17,210 --> 00:04:23,860 I am also going to add a little subheading here that reads "The Dot Product". 51 00:04:23,860 --> 00:04:27,620 So I'll start with a toy example to show you how this works. 52 00:04:27,700 --> 00:04:36,890 Say we have two NumPy arrays. The first array is called a; it's np.array, parentheses, square brackets. 53 00:04:36,910 --> 00:04:41,610 Let's give it three values, say one, two, three. 54 00:04:41,630 --> 00:04:47,260 And b is also going to be an np.array. 55 00:04:47,260 --> 00:04:53,650 And it's also gonna have three values, say 0, 5, and 4. 56 00:04:53,650 --> 00:04:57,050 So both a and b are one-dimensional.
57 00:04:57,130 --> 00:05:02,370 In other words, they have a rank of 1, and you can think of a and b as vectors: 58 00:05:02,380 --> 00:05:05,410 just a single column of values. 59 00:05:05,560 --> 00:05:14,590 So if you print these out, print, parentheses, single quote, a is equal to, end single quote, comma, a, and 60 00:05:14,980 --> 00:05:25,330 print, single quote, b is equal to, end single quote, comma, b, the dot product between a and b would be a, dot, 61 00:05:26,230 --> 00:05:35,570 and then the word dot, d-o-t, parentheses, b, and the result of that is equal to twenty-two. 62 00:05:35,580 --> 00:05:43,180 Now the reason I can do this is because NumPy arrays have this dot product method built in, so you 63 00:05:43,180 --> 00:05:51,010 can simply put the name of the array, which in our case is a, put a dot after it, and call this method. 64 00:05:51,010 --> 00:05:55,540 And the argument we supply is our second array. 65 00:05:55,780 --> 00:05:57,830 Now why do we get twenty-two? 66 00:05:57,850 --> 00:06:00,770 Why did we get this number? 67 00:06:00,970 --> 00:06:03,420 Let me show you how we arrived at this value. 68 00:06:03,640 --> 00:06:07,330 We simply multiplied the first entry in a, 69 00:06:07,450 --> 00:06:17,950 so that's 1, times the first entry in array b, which was 0, and then we added the product of the second 70 00:06:17,950 --> 00:06:20,220 entry in array a times 71 00:06:20,350 --> 00:06:22,580 the second entry in array b, 72 00:06:22,600 --> 00:06:24,930 so two times five. 73 00:06:25,090 --> 00:06:30,810 And then we added, you guessed it, the product of the third entry in array a, 74 00:06:30,880 --> 00:06:36,850 so three, times four, the third entry in array b. 75 00:06:37,220 --> 00:06:41,080 And the result of this calculation is twenty-two. 76 00:06:41,140 --> 00:06:43,490 So that's pretty straightforward, right? 77 00:06:43,510 --> 00:06:46,800 But what if we had something a little bit more complex?
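Before moving on, the toy example above can be reproduced in a notebook cell roughly like this. It is only a sketch: the names `result` and `manual` are not from the lesson and are added here just to compare the built-in method with the hand calculation.

```python
import numpy as np

# The two one-dimensional arrays from the toy example
a = np.array([1, 2, 3])
b = np.array([0, 5, 4])

# NumPy arrays have a built-in dot product method
result = a.dot(b)

# The same calculation by hand: multiply matching entries, then add them up
manual = 1 * 0 + 2 * 5 + 3 * 4

print('a =', a)
print('b =', b)
print('a.dot(b) =', result)   # 22
print('by hand  =', manual)   # 22
```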
78 00:06:46,930 --> 00:06:49,910 What if we had a two-dimensional array that we're dealing with? 79 00:06:49,930 --> 00:06:57,460 I'm going to be very imaginative with my variable naming here and call it c. c is equal to np.array, 80 00:06:57,580 --> 00:07:07,510 parentheses, square brackets, and then another set of square brackets, and I'll put in, say, 0 and 6, and then 81 00:07:07,510 --> 00:07:15,670 I'll add a comma, add another pair of square brackets and, say, put 3 and 0, and then another pair of square 82 00:07:15,670 --> 00:07:19,600 brackets after the comma and put 5 and 1. 83 00:07:21,280 --> 00:07:28,870 So here we've got an array of two dimensions, and this array has three rows and two columns. 84 00:07:28,870 --> 00:07:37,670 Let me print out "the shape of c is" and then, comma, c.shape. 85 00:07:38,290 --> 00:07:42,630 And for good measure I'll also print out c itself. 86 00:07:42,700 --> 00:07:44,860 This is what it looks like. Our NumPy 87 00:07:44,860 --> 00:07:51,110 array is three rows, two columns, and you can see it printed below. 88 00:07:51,130 --> 00:08:00,100 Now what would you guess is gonna be the dot product of a and c, and also what's gonna be the shape of 89 00:08:00,100 --> 00:08:00,730 this dot product? 90 00:08:00,730 --> 00:08:02,590 What's gonna be the shape of the result? 91 00:08:04,590 --> 00:08:16,830 Well, let's find out, shall we? Print out a.dot, parentheses, c, and I'll also print out "the shape 92 00:08:17,310 --> 00:08:31,390 of the dot product is" in single quotes, comma, a.dot, parentheses, c, dot shape. Check it out. 93 00:08:32,190 --> 00:08:37,230 OK, so in this calculation we get two values: twenty-one and nine. 94 00:08:37,290 --> 00:08:38,340 That's pretty interesting, right? 95 00:08:39,570 --> 00:08:42,290 Let me show you how to work this out manually. 96 00:08:42,330 --> 00:08:44,920 So our first NumPy array has three values. 97 00:08:44,920 --> 00:08:47,200 It's one-dimensional, just one, two, three.
98 00:08:47,220 --> 00:08:51,640 And our second NumPy array has two columns with these values. 99 00:08:51,870 --> 00:08:54,680 Now, to record this lesson I'm often zooming in a little bit here, 100 00:08:55,080 --> 00:09:01,230 and every time I've got unsaved changes and save those changes, I've got this thing bouncing around here. 101 00:09:01,260 --> 00:09:08,400 So I want to go to View and Toggle Header to hide this and get a bit more screen real estate and less 102 00:09:08,400 --> 00:09:10,050 of this bouncing around. 103 00:09:10,980 --> 00:09:19,980 So the way we can manually work out this 21 and 9 is by multiplying the first value in a, which was 104 00:09:19,980 --> 00:09:30,420 1, times the first value of the first column in c, which was 0, and then added to that is the product of 105 00:09:30,420 --> 00:09:38,490 the second value in a, which was two, times the second value of that first column in c, which was three, 106 00:09:39,300 --> 00:09:46,320 and added to that, you guessed it, it'll be the third value in a, which was three, times the third value 107 00:09:46,410 --> 00:09:52,550 in the first column in c, which was five: three times five. 108 00:09:52,750 --> 00:10:03,420 And it's gonna be a similar pattern for that second column. It'll be one times six plus two times zero plus 109 00:10:03,870 --> 00:10:06,470 three times one. 110 00:10:06,930 --> 00:10:09,860 And the result is 21 and 9. 111 00:10:10,780 --> 00:10:19,200 Now as a challenge, can you figure out the dimensions of the dot product between X_test and 112 00:10:19,200 --> 00:10:22,010 prob_token_spam? 113 00:10:22,140 --> 00:10:28,570 I'll give you a few seconds to pause the video and give this a go. Ready? 114 00:10:28,620 --> 00:10:30,900 Here's the solution.
115 00:10:30,900 --> 00:10:39,390 Now the shape of X_test was 1,723 rows and two thousand five hundred columns. Two 116 00:10:39,380 --> 00:10:45,960 thousand five hundred was the size of our vocabulary, and one thousand seven hundred and twenty-three 117 00:10:46,380 --> 00:10:52,320 was the number of emails in our testing dataset. 118 00:10:52,710 --> 00:11:00,570 Now prob_token_spam.shape had, of course, two thousand five hundred entries, 119 00:11:01,340 --> 00:11:07,450 one entry for each token, for each word in our vocabulary. 120 00:11:07,590 --> 00:11:12,180 So if I print out the shape of the dot product, 121 00:11:16,060 --> 00:11:19,920 this is gonna be X_test, 122 00:11:19,940 --> 00:11:33,020 dot, dot, prob_token_spam, dot shape, and the result is one thousand seven 123 00:11:33,020 --> 00:11:35,110 hundred and twenty-three.
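The shape behaviour from both examples can be checked with a short sketch like the one below. The arrays a and c are the toy values from the lesson. `X_test_demo` and `prob_token_spam_demo` are hypothetical stand-ins filled with random placeholder values, assuming only the shapes quoted above (1,723 by 2,500 and 2,500 entries); they are not the real project data.

```python
import numpy as np

# Toy arrays from the lesson
a = np.array([1, 2, 3])                   # shape (3,)
c = np.array([[0, 6], [3, 0], [5, 1]])    # shape (3, 2)

print(a.dot(c))          # [21  9]
print(a.dot(c).shape)    # (2,)

# Stand-ins with the shapes from the lesson; the values are random placeholders
rng = np.random.default_rng(seed=0)
X_test_demo = rng.integers(0, 3, size=(1723, 2500))   # emails x vocabulary
prob_token_spam_demo = rng.random(2500)                # one entry per token

# The shared dimension (2500) disappears, leaving one value per email
print(X_test_demo.dot(prob_token_spam_demo).shape)    # (1723,)
```

In both cases the dimension the two arrays share is summed over, which is why the result has one value per column of c in the toy case and one value per email in the larger one.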