0 1 00:00:00,390 --> 00:00:07,110 Now I think this is the perfect time to kind of check our understanding of all the work that we've done 1 2 00:00:07,710 --> 00:00:14,520 and maybe also look at some of these pre-processing data munging subtleties that we might have missed. 2 3 00:00:15,610 --> 00:00:18,650 I'll add a quick section heading here just for that. 3 4 00:00:18,730 --> 00:00:22,870 Now the way I'd like to tackle this is actually kind of with a challenge. 4 5 00:00:23,140 --> 00:00:28,680 You see, we started with about 5796 emails. 5 6 00:00:29,020 --> 00:00:35,650 We split these emails into 4057 emails for training and about 6 7 00:00:35,650 --> 00:00:38,770 1739 emails for testing. 7 8 00:00:39,370 --> 00:00:47,230 But how many of these individual emails were actually included in the text files that we saved to our 8 9 00:00:47,230 --> 00:00:48,250 disk? 9 10 00:00:48,250 --> 00:00:50,980 Let's just look at the testing emails for now. 10 11 00:00:50,980 --> 00:00:56,830 Can you figure out how many individual testing emails were included in the txt file that we saved 11 12 00:00:56,830 --> 00:01:03,850 to the disk using the "test_grouped" dataframe and then compare that to the number of emails 12 13 00:01:04,090 --> 00:01:10,580 in X_test? After splitting and shuffling our data, how many individual emails were actually 13 14 00:01:10,580 --> 00:01:20,060 included in the X_test dataframe? Is the number the same? And if not, which emails were excluded 14 15 00:01:20,660 --> 00:01:31,000 and why? I recommend comparing the doc ID values to find out. Did you have a go at this? 15 16 00:01:31,800 --> 00:01:34,410 This might have been a little bit of a tricky challenge actually. 16 17 00:01:34,410 --> 00:01:42,860 So let me show you the solution. The way I'm going to tackle this is using Python sets once again. 17 18 00:01:43,170 --> 00:01:48,230 Now there are other ways for sure, but I think this code is terse and it's also clear.
18 19 00:01:48,600 --> 00:01:53,970 I'll create two variables which will hold on to these sets. The first one will be "train_doc 19 20 00:01:54,030 --> 00:02:03,840 _ids" and that's gonna be our set of all the document IDs which were under "train_ 20 21 00:02:03,840 --> 00:02:12,860 grouped". Then I'll create a second set called "test_doc_ids" and set that 21 22 00:02:12,890 --> 00:02:21,470 equal to a set that I create from "test_grouped.DOC_ID". 22 23 00:02:22,370 --> 00:02:24,330 This will give me a comparison. 23 24 00:02:24,410 --> 00:02:28,370 The reason I'm doing this is because I can easily print out the number of individual emails using these 24 25 00:02:28,370 --> 00:02:38,060 sets, "len(train_doc_ids)" will give me the number of individual emails in our training 25 26 00:02:38,060 --> 00:02:38,660 set. 26 27 00:02:38,840 --> 00:02:47,310 So that's 4014 compared to the 4057. "len(test_ 27 28 00:02:47,360 --> 00:02:53,420 doc_ids)" will give me the number of individual emails that we actually saved under the testing txt 28 29 00:02:53,510 --> 00:03:04,060 file, so that's 1723 and we can compare this number to the length of X_test which was 29 30 00:03:04,050 --> 00:03:06,840 1739 emails. 30 31 00:03:07,100 --> 00:03:13,330 So 16 emails have gone missing or have been excluded, but which ones? 31 32 00:03:13,490 --> 00:03:14,890 Which emails were excluded? 32 33 00:03:15,810 --> 00:03:24,390 Well, remember how we said that Python sets are very, very useful for checking membership to find out 33 34 00:03:24,420 --> 00:03:29,240 which emails are included in one set but not in the other set? 34 35 00:03:29,250 --> 00:03:36,840 We can use, well, Python sets again, only this time we will take the difference between two sets. 35 36 00:03:36,870 --> 00:03:45,630 Check it out. If you recall "X_test.index" actually stored our document IDs. 
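As a rough sketch of the counting step just described, here is the same idea on a tiny made-up dataset. The notebook itself works with pandas dataframes; the plain-Python names `test_grouped_rows` and `x_test_index` below are hypothetical stand-ins for the `test_grouped.DOC_ID` column and `X_test.index`, and the numbers are invented for illustration.

```python
# Toy stand-in for the grouped data saved to disk:
# one (DOC_ID, WORD_ID, OCCURRENCE) row per document/word pair (made-up values).
test_grouped_rows = [
    (3, 10, 2),
    (3, 42, 1),
    (7, 10, 1),
    (9, 5, 4),
]

# Toy stand-in for X_test.index: the IDs of every email in the test split.
x_test_index = [3, 7, 8, 9, 14]

# A set keeps each document ID once, no matter how many rows mention it.
test_doc_ids = {doc_id for doc_id, _, _ in test_grouped_rows}

num_saved = len(test_doc_ids)   # emails that made it into the saved file
num_split = len(x_test_index)   # emails in the original test split
num_missing = num_split - num_saved

print(num_saved, num_split, num_missing)
```

Because a document appears in many rows (one per word), taking `len()` of the raw column would overcount; the set collapses the repeats so the two counts are comparable.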
36 37 00:03:45,660 --> 00:03:53,910 If I take this whole thing and grab the values, then I can transform my index values into a Python set. 37 38 00:03:54,780 --> 00:04:04,290 To see which values are included in this one but missing from this one, all I have to do is take the 38 39 00:04:04,290 --> 00:04:08,910 difference between the two and that will give me the answer. 39 40 00:04:08,910 --> 00:04:16,380 Here are the 16 emails that are not included in our txt file, but that still doesn't answer the question 40 41 00:04:16,380 --> 00:04:21,690 why. We've only identified the specific document IDs that were problematic. 41 42 00:04:21,690 --> 00:04:25,200 Let's dig in and take a closer look at these messages. 42 43 00:04:25,230 --> 00:04:30,680 The first one I'm going to pull up through "data.MESSAGE" is message number 14. 43 44 00:04:31,290 --> 00:04:32,490 So "data.MESSAGE[ 44 45 00:04:32,490 --> 00:04:35,520 14]" 45 46 00:04:35,580 --> 00:04:39,630 will show us the message text of the first email that didn't make it. 46 47 00:04:40,360 --> 00:04:42,760 And this email looks like this. 47 48 00:04:43,320 --> 00:04:44,820 It's, well, it's.. 48 49 00:04:44,850 --> 00:04:48,000 It looks like a private key or a public key of some sort. 49 50 00:04:48,000 --> 00:04:50,370 It's complete gibberish. 50 51 00:04:50,400 --> 00:04:51,840 What about some of the other ones? 51 52 00:04:51,840 --> 00:04:52,730 Let's check 52 53 00:04:53,290 --> 00:04:58,700 325, 416 and 445. 325 53 54 00:04:58,860 --> 00:05:01,390 looks like this; 416 54 55 00:05:01,530 --> 00:05:07,830 looks like this and 445 looks like this. 55 56 00:05:08,120 --> 00:05:11,950 Now I'm starting to spot a pattern here. 56 57 00:05:12,470 --> 00:05:18,890 All of these e-mails seem to look like a complete mess, but maybe the reason they look like this is because 57 58 00:05:18,890 --> 00:05:21,570 we had a problem reading this file, right. 
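Going back to the set difference step for a moment, here is a minimal sketch of it with invented IDs. The sets `x_test_ids` and `test_doc_ids` are hypothetical stand-ins for `set(X_test.index.values)` and the set built from `test_grouped.DOC_ID`.

```python
# Made-up document IDs for illustration only.
x_test_ids = {3, 7, 8, 9, 14}   # every email in the test split
test_doc_ids = {3, 7, 9}        # emails that ended up in the saved file

# Elements of the left set that do not appear in the right set.
excluded = x_test_ids - test_doc_ids
print(sorted(excluded))  # → [8, 14]

# The order of the operands matters: the reverse difference is empty here,
# because every saved email also belongs to the test split.
print(test_doc_ids - x_test_ids)  # → set()
```

The `-` operator is shorthand for `x_test_ids.difference(test_doc_ids)`; either spelling gives the IDs worth inspecting by hand.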
58 59 00:05:21,590 --> 00:05:26,120 Maybe this is related to the encoding that we used and this is why we're getting some sort of 59 60 00:05:26,120 --> 00:05:28,250 mojibake here or something, right. 60 61 00:05:28,310 --> 00:05:29,900 Maybe that's the source of the problem. 61 62 00:05:30,620 --> 00:05:33,740 So let me actually pull up one of these files in my text editor. 62 63 00:05:34,160 --> 00:05:36,560 I'm going to go with file number, say, 14. 63 64 00:05:36,800 --> 00:05:43,700 The way I can find out which file this actually was is by going to "data.loc" and then feeding in the 64 65 00:05:43,700 --> 00:05:45,550 number 14. 65 66 00:05:45,560 --> 00:05:52,260 This will give me the actual file name, so that starts with 00095 and then something. 66 67 00:05:52,610 --> 00:05:59,000 I've located this file under my "spam_1" folder in my SpamAssassin corpus. 67 68 00:05:59,000 --> 00:06:03,530 Opening this file in my Atom text editor gives me the following. 68 69 00:06:03,530 --> 00:06:08,820 I've got my email header up top and then down here I've got my email body. 69 70 00:06:08,960 --> 00:06:13,110 So what we saw in the Jupyter notebook is actually the email body, right, 70 71 00:06:13,130 --> 00:06:17,170 one-to-one; we didn't mess anything up with the encoding. 71 72 00:06:17,210 --> 00:06:23,240 Now would you like to venture a guess as to what we get when we stick this email through our message 72 73 00:06:23,240 --> 00:06:25,970 cleaning function? So "clean_ 73 74 00:06:25,970 --> 00:06:37,560 msg_no_html(data.at[14, 'MESSAGE'])". 74 75 00:06:37,790 --> 00:06:43,940 Would you like to venture a guess as to what we get when we stick our message body from this particular 75 76 00:06:43,940 --> 00:06:55,960 email through our cleaning function? We get a string like this. Now this particular string, as you can 76 77 00:06:55,960 --> 00:06:59,910 probably guess, is not part of our vocabulary.
77 78 00:07:00,010 --> 00:07:09,370 It's not part of the top 2500 words that we're using for our classifier and hence this email would not 78 79 00:07:09,370 --> 00:07:13,810 have been included when we created our sparse matrix. 79 80 00:07:13,990 --> 00:07:22,450 This condition here "if word in word_set" filters out any words in the email that are not part of 80 81 00:07:22,450 --> 00:07:28,150 our vocabulary. Now the story is quite similar with some of these other emails. 81 82 00:07:28,480 --> 00:07:38,610 But one exception is this one that I found here, 1096. This document "data.MESSAGE[ 82 83 00:07:38,680 --> 00:07:45,820 1096]" looks like so. And looking at this text, we see that it's 83 84 00:07:45,820 --> 00:07:47,530 not all gibberish, right. 84 85 00:07:47,530 --> 00:07:52,140 This is mostly HTML actually, right? 85 86 00:07:52,210 --> 00:07:57,820 We can see a whole bunch of HTML email tags in the body of this email. 86 87 00:07:58,180 --> 00:08:06,540 And what happens when we clean this message and remove the HTML? So "clean_msg_ 87 88 00:08:06,540 --> 00:08:06,830 no_html( 88 89 00:08:06,850 --> 00:08:18,730 data.at[1096, ' 89 90 00:08:18,730 --> 00:08:24,210 MESSAGE'])" is that we get an empty list. 90 91 00:08:24,250 --> 00:08:29,220 This particular message contains so much HTML that Beautiful Soup, 91 92 00:08:29,380 --> 00:08:37,690 the tool that we use to remove all the HTML tags and strip those from our email bodies, actually leaves 92 93 00:08:37,690 --> 00:08:38,590 nothing behind. 93 94 00:08:38,590 --> 00:08:44,530 There is not a single word that actually makes it into our list.
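To make the two failure modes concrete, here is a simplified sketch, not the notebook's actual cleaning function. It assumes a tiny made-up vocabulary, uses the standard library's `html.parser` as a stand-in for Beautiful Soup, and skips stemming and stop-word removal entirely. The helper names `strip_html` and `tokens_for` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text found between tags (stand-in for Beautiful Soup)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(message):
    """Return the message text with all HTML tags removed."""
    parser = TextExtractor()
    parser.feed(message)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical vocabulary; the real one holds the top 2500 stemmed words.
word_set = {"click", "free", "money", "http"}


def tokens_for(message):
    """Lower-case alphabetic words that survive the vocabulary filter."""
    words = [w.lower() for w in strip_html(message).split() if w.isalpha()]
    return [word for word in words if word in word_set]


normal = "<p>Click here for free money</p>"
gibberish = "qzxjv wkkpt"  # survives cleaning, but is not in the vocabulary
html_only = '<html><body><table><tr><td><img src="offer.gif"></td></tr></table></body></html>'

print(tokens_for(normal))     # → ['click', 'free', 'money']
print(tokens_for(gibberish))  # → []
print(tokens_for(html_only))  # → []
```

An email whose token list comes back empty contributes no rows to the sparse matrix, which is why its document ID never shows up in the saved file even though it is still present in the split.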
94 95 00:08:44,560 --> 00:08:50,380 The reason that I know it's Beautiful Soup and not our word stemmer or something else is that if I call our 95 96 00:08:50,470 --> 00:08:57,340 other function "clean_message()" where the HTML tags are left in and feed in the very same 96 97 00:08:57,340 --> 00:09:11,520 code, "data.at[1096, 'MESSAGE']", then I get the following. I get all my 97 98 00:09:11,520 --> 00:09:17,910 HTML tags as individual words on this list. So I think that was actually quite subtle: 98 99 00:09:17,930 --> 00:09:23,990 there was something that was happening in the background that we might not have noticed while we were 99 100 00:09:24,050 --> 00:09:30,060 writing our code, but it's always good to check if your code is doing what you expect it to do 100 101 00:09:30,340 --> 00:09:37,100 and I think this also gave us a chance to have another go at practicing using Python sets and really 101 102 00:09:37,100 --> 00:09:39,830 seeing their use in checking membership, 102 103 00:09:39,830 --> 00:09:45,620 especially if you're looking for differences and trying to see which values are included in one but 103 104 00:09:45,620 --> 00:09:46,520 not in the other one. 104 105 00:09:48,000 --> 00:09:48,760 And that's it. 105 106 00:09:48,780 --> 00:09:51,980 This wraps up our pre-processing. 106 107 00:09:52,290 --> 00:09:56,630 We've actually done an incredible amount of work in the past couple of lessons. 107 108 00:09:56,850 --> 00:10:03,750 We've extensively cleaned our data, explored our data, visualized our data, and the text files and the 108 109 00:10:03,760 --> 00:10:09,810 JSONs and all these files that we've saved to our disk create checkpoints that mean that we don't have 109 110 00:10:09,810 --> 00:10:15,430 to rerun all the work and all the code in our Jupyter notebooks.
110 111 00:10:15,480 --> 00:10:21,300 In fact, we've worked really hard to create the files that contain the data that we will feed into our 111 112 00:10:21,300 --> 00:10:24,360 naive Bayes classifier algorithm. 112 113 00:10:24,360 --> 00:10:28,890 And if you take another look at this training data and this testing data that we've created, you'll notice 113 114 00:10:28,890 --> 00:10:33,320 that all the strings, all the text, have disappeared from them. 114 115 00:10:33,390 --> 00:10:40,410 We're exclusively working with numbers now. All the stemmed words are replaced just by word IDs, by 115 116 00:10:40,410 --> 00:10:41,640 integers. 116 117 00:10:41,670 --> 00:10:43,860 It's much more abstract, right? 117 118 00:10:44,280 --> 00:10:52,320 But what we've effectively done is we've transformed our words into tokens, but not only that, we've also 118 119 00:10:52,320 --> 00:10:58,920 counted how often each token appears in every email message. 119 120 00:10:59,160 --> 00:11:03,430 With this in hand, it's time to train our naive Bayes classifier. 120 121 00:11:03,570 --> 00:11:09,230 We can move on to the next step in our project. Now our Jupyter notebook 121 122 00:11:09,250 --> 00:11:15,020 is getting very, very long and as such it's also getting a little bit unwieldy. 122 123 00:11:15,300 --> 00:11:19,470 So I'm quickly gonna go up here and rename this notebook. 123 124 00:11:19,480 --> 00:11:30,770 I'll call it "06 Bayes Classifier - Pre-Processing", click "Rename" and then for the next part where we're 124 125 00:11:30,770 --> 00:11:36,210 training our algorithm, we're gonna be doing it in a new fresh notebook. 125 126 00:11:36,320 --> 00:11:42,560 That way we've separated out all the pre-processing that we're doing in one Jupyter notebook and we're 126 127 00:11:42,560 --> 00:11:48,440 going to separate the training that we're gonna do for this project into a separate notebook.
127 128 00:11:48,440 --> 00:11:51,290 And on that note, I'll see you in the next lessons. 128 129 00:11:51,320 --> 00:11:51,890 Take care.