Over the next couple of lessons, we will be training our Naive Bayes classifier. But what does this involve? The training step primarily involves calculating the token probabilities: we need to work out the probabilities for an individual word or token. How? If you haven't watched the lessons on probability theory and Bayes' theorem from the previous module, pause here and watch them now before you continue. You'll recall that the formula for conditional probability under Bayes' theorem looks like this.
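Written out (reconstructed here from the five numbers described below; the slide's exact notation may differ), the formula reads:

$$
P(\text{Spam}\mid\text{Viagra})
= \frac{P(\text{Viagra}\mid\text{Spam})\,P(\text{Spam})}{P(\text{Viagra})}
= \frac{\dfrac{\#(\text{Viagra in spam emails})}{\#(\text{words in spam emails})}\cdot P(\text{Spam})}{\dfrac{\#(\text{Viagra in all emails})}{\#(\text{words in all emails})}}
$$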
This formula gives us the probability of an email being spam, given that it contains the word Viagra. To work out the final numbers for this formula, we need a couple of things. First off, we need the overall probability of an email being spam. Second, we need the probability of Viagra occurring in any email. And third, we need the probability of an email containing the word Viagra, given that the email is spam. This calculation boils down to five numbers. So what are these numbers that we need to work this out?

The first number, at the top, is how often the token in question, in this case the word Viagra, appears in spam emails. The second number is the total number of words in spam emails. The third number is the probability of spam emails occurring in the first place. On this slide I've put this number at 55 percent, but we'll actually work it out from our training data.

In the denominator of this fraction, we have our fourth number: the total number of times that the word Viagra occurs in all emails, both spam and non-spam. And finally, we've got the fifth number: the total number of words that we looked at across all emails in the first place, both spam messages and non-spam messages, across the entire dataset.

The Python code that we're going to write next will tease out a couple of these numbers. With these numbers in hand, we can calculate the relevant probabilities for our tokens. This is effectively the training step for our Naive Bayes classifier. I'll add a quick markdown cell here that reads "Training the Naive Bayes Model", and for the subheading here we'll have "Calculating the Probability of Spam". On the slide I had this set at 55 percent, but I think we should work this out from our data. This actually makes for a very good challenge. I'd like you to quickly calculate the overall probability of spam; we'll assume that this is the percentage of spam messages in the training dataset. Store this value in a variable called prob_spam. I'll give you a few seconds to pause the video and tackle this challenge.

Ready? Here we go. The total number of messages that we have in our dataset is full_train_data.Category.size: we've got 4,014 messages. The number of spam messages is full_train_data.Category.sum(). The reason I can do this is because I've labelled the category of spam as one and the category of non-spam as zero, so simply summing up all the ones in my Category column will give me the number of spam messages, of which I've got 1,249. Therefore the probability of spam, prob_spam, is in fact equal to full_train_data.Category.sum() divided by full_train_data.Category.size, and we can print that out as "Probability of spam is", comma, prob_spam. There we go: it's around 31 percent. This is the first of the numbers that we're looking for in our Bayes theorem.
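As a minimal runnable sketch of that calculation, assuming (as in the lesson) a DataFrame called full_train_data with a Category column where spam is 1 and ham is 0; the tiny DataFrame below is a fabricated stand-in rather than the course data:

```python
import pandas as pd

# Fabricated stand-in for the course's training DataFrame:
# one row per email, Category is 1 for spam and 0 for ham.
full_train_data = pd.DataFrame({'Category': [1, 0, 1, 0, 0, 1]})

# Spam is labelled 1 and ham 0, so summing the column counts the spam
# messages, while .size gives the total number of messages.
prob_spam = full_train_data.Category.sum() / full_train_data.Category.size
print('Probability of spam is', prob_spam)  # 0.5 on this toy data
```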
The next number that we are looking for is the total number of words or tokens. We need to count the total number of tokens that we have in our dataset, and we also need to calculate the number of tokens that belong to the spam category and the number of tokens that belong to the non-spam category. But first, let's get the total number.

What I'm going to do is create a subset from our full_train_data DataFrame. I'll call the subset full_train_features, so this will be just the tokens. I'll set this equal to full_train_data.loc. The .loc is for selecting a number of rows: square brackets, colon, which selects all the rows. But I'm not going to select all the columns; I'll just select the columns with our word IDs. In other words, I leave out the category column, and .loc is perfect for this. To leave out the category column, I'll just say: take all the columns that you've got in full_train_data.columns, but don't pick the category. So I'll write not equal to, single quotes, 'Category'. That exclamation mark that you see here is Python syntax for the logical NOT, together with the equals sign. This logical condition reads "not equal to 'Category'", so I'll pick all 2,500 columns except the Category column. Now mind you, this string here is of course case sensitive. Let's look at the head of full_train_features and see what this DataFrame looks like. There you go: we've just excluded a particular column from the subset using logic.

Now, how do we get the total number of words? Well, let's take a two-step approach. Let's sum up all the tokens per email. So, if you think about it, we should sum up all the numbers in this row, then all the numbers in this row, and all the numbers in this row. That would be the total number of tokens per email. We can do this really, really easily with the sum function. I'm going to store this in a variable called email_lengths, so that will hold the total number of tokens per email. To use that sum functionality, I'll grab my features DataFrame, full_train_features, and I'll use the sum function, but I'll provide an argument here. I have to specify how this function should sum things: left to right, or top to bottom? We have to be more specific, and we can do this with the axis parameter. axis=1 will sum across the columns, giving us one total per row.

Let's take a look at what email_lengths.shape gives us. So we've done the summation, and we've got 4,014 emails, exactly what we want. The first five of these look like this: email_lengths, square brackets, colon, five. And we see that the first email has 50 tokens, the second email 76 tokens, the third email 87 tokens, and so on. So what we've got here is a handy little pandas Series of all the token counts. To get the total number of tokens, our total word count if you will, all we need to do is sum up the values in the entire series, right? I'll store this number in a variable for later use. So I'll say total_wc, for word count, and I'll set that equal to email_lengths.sum(). To see what that value is, I'll just put total_wc below and hit Shift+Enter, and what we see is that we've got approximately 446,000 tokens. Looking back at our formula here, we've now worked out this white number here and this blue number here.
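Here is a self-contained sketch of those three steps: the column filter, the per-email sums, and the grand total. The word-ID columns and counts below are fabricated stand-ins for the course data:

```python
import pandas as pd

# Fabricated stand-in: word-ID columns hold token counts per email.
full_train_data = pd.DataFrame({
    'Category': [1, 0, 1],
    0: [2, 0, 1],
    1: [0, 3, 4],
    2: [1, 1, 0],
})

# Keep every row, and every column except 'Category' (case sensitive).
full_train_features = full_train_data.loc[:, full_train_data.columns != 'Category']

# axis=1 sums across the columns: one total token count per email.
email_lengths = full_train_features.sum(axis=1)
print(email_lengths.shape)  # (number of emails,)
print(email_lengths[:5])    # token counts of the first few emails

# Summing the series gives the grand total word count for the dataset.
total_wc = email_lengths.sum()
print(total_wc)
```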
Now let's work out this green one: the total number of tokens within the spam messages, and also within the non-spam messages, which is not shown on this slide here. I'll add this as a markdown cell: "Number of Tokens in Spam and Ham Emails". I reckon this actually makes a good mini challenge. Can you create a subset of the email_lengths series that only contains the spam messages? Call the subset spam_lengths, and then count the total number of tokens that occur in this subset. Also do the same for the non-spam emails: create a subset called ham_lengths from the email_lengths series, and then count the number of words that occur in those non-spam, or ham, emails. I'll give you a few seconds to pause the video and give this a go.

You ready? Here's the solution. What I'm going to do is create this variable spam_lengths, and I'll subset it from email_lengths using the square bracket notation, where full_train_data.Category is equal to one. Double equals one, that is: that's the logical condition to create the subset from email_lengths. The number of emails that are spam is of course spam_lengths.shape, and here you can see we've got 1,249. To get the number of words, or the number of tokens, that are in the spam emails, I'll store this under spam_wc, for word count, and set that equal to spam_lengths.sum(). And if I print this out as an output, we can see that we've got approximately 195,000 tokens that are part of the spam emails.

That's the first part of the exercise. The second part of the exercise is doing the same thing for the ham emails, the non-spam emails. We can write ham_lengths, set that equal to email_lengths, and subset this whole thing based on the condition that full_train_data.Category is equal to zero. In this case, ham_lengths.shape will show us that we've got 2,765 ham emails.
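A short sketch of that subsetting, again with fabricated stand-in data; the boolean mask built from the Category column lines up with email_lengths because both share the same index:

```python
import pandas as pd

# Fabricated stand-ins: labels per email and total tokens per email,
# sharing one index, as in the lesson's DataFrame.
full_train_data = pd.DataFrame({'Category': [1, 0, 1, 0, 0]})
email_lengths = pd.Series([50, 76, 87, 23, 41])

# Boolean masks split the lengths into spam and ham subsets.
spam_lengths = email_lengths[full_train_data.Category == 1]
ham_lengths = email_lengths[full_train_data.Category == 0]
print(spam_lengths.shape)  # number of spam emails

# Total token count inside the spam emails.
spam_wc = spam_lengths.sum()
print(spam_wc)
```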
A good thing to do at this point, maybe, is to check your work: email_lengths.shape, square brackets, zero, minus spam_lengths.shape, square brackets, zero, minus ham_lengths.shape, square brackets, zero, should be equal to zero, because the two subsets together should have the same number of emails as our dataset as a whole, right? This is a quick check that you can do to make sure you've subset things correctly. The total non-spam word count, the word count for the ham messages, should be equal to ham_lengths.sum(), and this value is approximately 250,000. And again, I think this is a good time to do a quick check: spam_wc plus the non-spam word count, nonspam_wc, minus the total word count should be equal to zero. There we go.

I've actually got an add-on challenge for this; I just thought of an interesting question to ask. Do you think that the spam emails or the non-spam emails tend to be longer? Which category do you think has longer emails, on average? What I'd like you to do real quick is take a quick guess and then verify it in the data. How would you go about doing that?

Here's the solution. I want to print out the average number of words in spam emails. This is going to be spam_wc divided by spam_lengths.shape, square brackets, zero. The average number of words in the spam emails in our dataset is approximately 156.57. If this number is a little tough to read, we can actually format it much, much more nicely: with curly braces, colon, dot, zero, f (the f stands for float), I'll get rid of this comma, write .format, and put the calculation inside the method call. With this format, I'm rounding to the nearest whole number. This is a nice little trick if you want your print statements to look pretty and not give you a lot of digits after the decimal point. Let's compare this to the ham emails, the non-spam emails: we'll take the ham word count, or non-spam word count, and divide it by ham_lengths.shape, square brackets, zero. And just to show you how this string formatting works, I'm going to have a three here instead of a zero, so we'll get three decimal places in our output. There you go.
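Consolidating the sanity checks and the average-length comparison into one runnable sketch (all counts fabricated; the {:.0f} and {:.3f} format specs are the ones described above):

```python
import pandas as pd

# Fabricated stand-ins for the quantities computed earlier in the lesson.
email_lengths = pd.Series([50, 76, 87, 23, 41])
spam_lengths = email_lengths[[0, 2]]    # pretend emails 0 and 2 are spam
ham_lengths = email_lengths[[1, 3, 4]]  # the rest are ham
total_wc = email_lengths.sum()
spam_wc = spam_lengths.sum()
nonspam_wc = ham_lengths.sum()

# Both checks should print 0 if the subsets partition the data correctly.
print(email_lengths.shape[0] - spam_lengths.shape[0] - ham_lengths.shape[0])
print(spam_wc + nonspam_wc - total_wc)

# Average email length per category; {:.0f} rounds to a whole number,
# {:.3f} keeps three decimal places.
print('Average words in spam emails: {:.0f}'.format(spam_wc / spam_lengths.shape[0]))
print('Average words in ham emails: {:.3f}'.format(nonspam_wc / ham_lengths.shape[0]))
```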
So we can clearly see that spam emails actually tend to be wordier, which I think is quite interesting as well. Now that we've got the big headline numbers in our formula sorted, we need to take a closer look at the tokens themselves. We need to sum up the tokens that occur in the spam messages and the tokens that occur in the normal emails, and we need to do this for each token, for each single word ID. We'll get right on that in the next lesson. I'll see you there.