1 00:00:01,100 --> 00:00:04,100 First off let's add a section heading here. 2 00:00:04,130 --> 00:00:16,860 Call it something the tokens occurring in spam now are full on escort train on the school features data 3 00:00:16,860 --> 00:00:24,270 frame has the following shape it's got two thousand five hundred columns and it's got four thousand 4 00:00:24,510 --> 00:00:33,330 and fourteen rows because it contains both spam and non spam messages what I want to do is I want to 5 00:00:33,330 --> 00:00:41,550 create a subset again of this feature state of frame I want to create a subset of all the rows that 6 00:00:41,700 --> 00:00:51,800 correspond to spam messages I'm going to call the subset train on a school spam on a school tokens and 7 00:00:52,040 --> 00:01:00,410 I'll set that equal to full on a squat train on a school features dot LLC. 8 00:01:00,410 --> 00:01:06,110 This is one of my favorite favorite favorite functions to create subsets by the way. 9 00:01:07,250 --> 00:01:14,630 And then within the square brackets I'll just pick the rows that are spam messages meaning I'll add 10 00:01:14,630 --> 00:01:22,520 a condition that the category should be equal to one full on it's got train on the score data category 11 00:01:22,720 --> 00:01:25,350 double equals 1. 12 00:01:25,620 --> 00:01:32,350 Let's take a look at the train spam tokens had the first five rows. 13 00:01:32,370 --> 00:01:43,440 There we go emails 0 1 2 3 4 are all spam messages the last few emails in train underscore spam and 14 00:01:43,510 --> 00:01:49,900 it's tokens we can pull up with dots tail and they look like this. 15 00:01:50,800 --> 00:01:56,230 So we've got document I.D. a thousand eight hundred eighty five eighty seven eighty nine ninety and 16 00:01:56,230 --> 00:01:59,700 ninety five as spam messages. 17 00:02:00,100 --> 00:02:07,810 The overall shape of this data frame should be a thousand two hundred and forty nine times two thousand 18 00:02:07,810 --> 00:02:09,250 five hundred right. 19 00:02:09,280 --> 00:02:14,800 This is what we've ascertained the number of spam messages in our training data set is a thousand two 20 00:02:14,800 --> 00:02:16,280 hundred and forty nine. 21 00:02:16,360 --> 00:02:22,900 So let's just quickly verifying that this is exactly what we've got here and that our sub setting has 22 00:02:22,900 --> 00:02:25,200 gone according to plan. 23 00:02:25,210 --> 00:02:32,790 Now what I want to do is I want to sum up all these tokens column by column right. 24 00:02:32,860 --> 00:02:37,620 I want to have all the tokens for word any number zero sum up. 25 00:02:38,170 --> 00:02:43,760 I don't have all the tokens in column 1 summed up and so on. 26 00:02:43,900 --> 00:02:52,630 Across all the word ideas what I want to end up with is a penned a series of all the word ideas with 27 00:02:52,630 --> 00:02:59,120 the number of times that these tokens occur in spam messages. 28 00:02:59,140 --> 00:03:07,360 This is my goal but I'm going to save this series in a variable as well I'll call it summed spam on 29 00:03:07,370 --> 00:03:16,550 a school tokens so that equal to train and a score spam and it's got tokens that sum. 30 00:03:16,760 --> 00:03:23,110 Now instead of Axis being one and summing across a row I want access to be equal to zero. 31 00:03:23,230 --> 00:03:32,130 To sum across a column so axis is equal to zero will be for the column. 32 00:03:32,190 --> 00:03:37,210 Now there's one final thing I want to do on this line and this has to do with the fact that we're gonna 33 00:03:37,230 --> 00:03:46,440 be calculating a probability later on I'm gonna add one to my summation. 34 00:03:46,540 --> 00:03:52,930 Would you like to guess why I'm doing this why am I seemingly arbitrarily adding one to this calculation 35 00:03:54,760 --> 00:03:59,650 Well the reason is is that we're gonna be doing a division like down the line right. 36 00:03:59,670 --> 00:04:05,290 Wouldn't be dividing the number of occurrences by the total number of words either in the numerator 37 00:04:05,740 --> 00:04:06,780 or the denominator. 38 00:04:07,880 --> 00:04:12,540 And we're gonna be doing this for all our tokens across our spam messages. 39 00:04:12,760 --> 00:04:19,990 And if one of these tokens doesn't actually occur within the spam messages we've got zero divided by 40 00:04:20,140 --> 00:04:22,500 the total number of tokens. 41 00:04:22,500 --> 00:04:31,600 So I'm adding one here to make this calculation non zero late on this technique actually has a name. 42 00:04:31,660 --> 00:04:36,100 It's named after a French mathematician and it's called LaPlace smoothing. 43 00:04:37,300 --> 00:04:45,380 Here's what are some spam tokens look like after we've created the subset and summed up all the values. 44 00:04:45,370 --> 00:04:53,290 It's simply a penned a series with two thousand five hundred entries the last few entries in this series 45 00:04:53,790 --> 00:05:02,000 look like this summed up the sort of spam on a token dot tail shows me that the word with word eighty 46 00:05:02,210 --> 00:05:11,220 two thousand four hundred ninety nine occurs a total of six times across all our spam messages now I 47 00:05:11,220 --> 00:05:17,230 think you can repeat the process for how messages as well so I'll leave this to you as a challenge. 48 00:05:17,280 --> 00:05:24,210 Sum up the tokens that occur for the non spam messages and then store these values in a variable called 49 00:05:24,310 --> 00:05:26,700 summed up underscore ham underscore score tokens 50 00:05:30,070 --> 00:05:30,550 ready. 51 00:05:30,550 --> 00:05:32,320 Here's the solution. 52 00:05:32,670 --> 00:05:40,650 I'll add a markdown sell here something that tokens occurring in ham. 53 00:05:41,200 --> 00:05:48,340 Once again I'll credit data frame called train on the school ham on the school tokens and this will 54 00:05:48,340 --> 00:05:53,190 be a subset of our full on this Katrina on the school features. 55 00:05:53,530 --> 00:06:02,500 But at the following locations namely where full on is Katrina and its core data that category is equal 56 00:06:02,500 --> 00:06:10,340 to our non spam category double equals zero imposes this logical condition 57 00:06:13,250 --> 00:06:22,590 the summed ham tokens are gonna be equal to the state of frame train on a go ham and a scope tokens 58 00:06:24,520 --> 00:06:33,330 not some parentheses axis is equal to zero plus one. 59 00:06:33,400 --> 00:06:39,340 We're going to apply our lab plus smoothing technique once again and that's it. 60 00:06:39,340 --> 00:06:40,530 Let's take a quick look. 61 00:06:40,660 --> 00:06:49,560 If we've done this correctly some underscore ham square tokens what shape should be as expected. 62 00:06:49,560 --> 00:06:56,400 Two thousand five hundred the last few items in this series look as follows. 63 00:06:56,500 --> 00:07:05,990 Some underscore ham underscore tokens dot tail and I can even do a spot check on this by going to train. 64 00:07:06,040 --> 00:07:14,230 On just go ham and just got tokens square brackets two thousand four hundred and ninety nine some just 65 00:07:14,230 --> 00:07:20,140 that one and add one and that should tie out with this entry right here. 66 00:07:20,230 --> 00:07:21,360 Brilliant. 67 00:07:21,700 --> 00:07:29,920 In the next lesson we're gonna be calculating the probability that a token occurs given that the email 68 00:07:29,920 --> 00:07:31,500 is spam. 69 00:07:31,570 --> 00:07:36,840 Here we're gonna be calculating our conditional probability Hussey in the next lesson. 70 00:07:36,860 --> 00:07:37,330 Take care.