0
1
00:00:00,390 --> 00:00:07,110
Now I think this is the perfect time to kind of check our understanding of all the work that we've done
1

2
00:00:07,710 --> 00:00:14,520
and maybe also look at some of these pre-processing data munging subtleties that we might have missed.
2

3
00:00:15,610 --> 00:00:18,650
I'll add a quick section heading here just for that.
3

4
00:00:18,730 --> 00:00:22,870
Now the way I'd like to tackle this is actually kind of with a challenge.
4

5
00:00:23,140 --> 00:00:28,680
You see, we started with about 5796 emails.
5

6
00:00:29,020 --> 00:00:35,650
We split these emails into 4057 emails for training and about
6

7
00:00:35,650 --> 00:00:38,770
1739 emails for testing.
7

8
00:00:39,370 --> 00:00:47,230
But how many of these individual emails were actually included in the text files that we saved to our
8

9
00:00:47,230 --> 00:00:48,250
disk?
9

10
00:00:48,250 --> 00:00:50,980
Let's just look at the testing emails for now.
10

11
00:00:50,980 --> 00:00:56,830
Can you figure out how many individual testing emails were included in the txt file that we saved
11

12
00:00:56,830 --> 00:01:03,850
to the disk using the "test_grouped" dataframe and then compare that to the number of emails
12

13
00:01:04,090 --> 00:01:10,580
in X_test? After splitting and shuffling our data, how many individual emails were actually
13

14
00:01:10,580 --> 00:01:20,060
included in the X_test dataframe? Is the number the same? And if not, which emails were excluded
14

15
00:01:20,660 --> 00:01:31,000
and why? I recommend comparing the doc ID values to find out. Did you have a go at this?
15

16
00:01:31,800 --> 00:01:34,410
This might have been a little bit of a tricky challenge actually.
16

17
00:01:34,410 --> 00:01:42,860
So let me show you the solution. The way I'm going to tackle this is using Python sets once again.
17

18
00:01:43,170 --> 00:01:48,230
Now there are other ways for sure, but I think this code is terse and it's also clear.
18

19
00:01:48,600 --> 00:01:53,970
I'll create two variables which will hold on to these sets. The first one will be "train_doc
19

20
00:01:54,030 --> 00:02:03,840
_ids" and that's gonna be our set of all the document IDs which were under "train_
20

21
00:02:03,840 --> 00:02:12,860
grouped". Then I'll create a second set called "test_doc_ids" and set that
21

22
00:02:12,890 --> 00:02:21,470
equal to a set that I create from "test_grouped.DOC_ID".
22

23
00:02:22,370 --> 00:02:24,330
This will give me a comparison.
23

24
00:02:24,410 --> 00:02:28,370
The reason I'm doing this is because I can easily print out the number of individual emails using these
24

25
00:02:28,370 --> 00:02:38,060
sets, "len(train_doc_ids)" will give me the number of individual emails in our training
25

26
00:02:38,060 --> 00:02:38,660
set.
26

27
00:02:38,840 --> 00:02:47,310
So that's 4014 compared to the 4057. "len(test_
27

28
00:02:47,360 --> 00:02:53,420
doc_ids)" will give me the number of individual emails that we actually saved under the testing txt
28

29
00:02:53,510 --> 00:03:04,060
file, so that's 1723 and we can compare this number to the length of X_test which was
29

30
00:03:04,050 --> 00:03:06,840
1739 emails.
30
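
To make the comparison concrete, here's a minimal sketch on made-up data. The four document IDs, the message text and the "WORD_ID" and "OCCURRENCE" column names are assumptions for illustration only; the real "X_test" and "test_grouped" come from the notebook and are much larger.

```python
import pandas as pd

# Made-up stand-in for X_test: four emails, indexed by document ID.
X_test = pd.DataFrame(
    {"MESSAGE": ["free offer now", "xq zz kpl", "meeting today", "win cash"]},
    index=pd.Index([3, 7, 11, 20], name="DOC_ID"),
)

# Made-up stand-in for test_grouped: one row per (DOC_ID, WORD_ID) pair.
# Document 7 has no rows because none of its words are in the vocabulary.
test_grouped = pd.DataFrame({
    "DOC_ID": [3, 3, 11, 20, 20],
    "WORD_ID": [0, 5, 2, 8, 9],
    "OCCURRENCE": [1, 1, 1, 1, 1],
})

# A set keeps each document ID once, no matter how many rows it has.
test_doc_ids = set(test_grouped.DOC_ID)

print(len(test_doc_ids))  # 3 emails made it into the grouped data
print(len(X_test))        # 4 emails were in the original test split
```

Because a set drops duplicates, its length counts emails rather than rows, which is what makes it comparable to the length of "X_test".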

31
00:03:07,100 --> 00:03:13,330
So 16 emails have gone missing or have been excluded, but which ones?
31

32
00:03:13,490 --> 00:03:14,890
Which emails were excluded?
32

33
00:03:15,810 --> 00:03:24,390
Well, remember how we said that Python sets are very, very useful for checking membership? To find out
33

34
00:03:24,420 --> 00:03:29,240
which emails are included in one set but not in the other set,
34

35
00:03:29,250 --> 00:03:36,840
we can use, well, Python sets again, only this time we will take the difference between two sets.
35

36
00:03:36,870 --> 00:03:45,630
Check it out. If you recall "X_test.index" actually stored our document IDs.
36

37
00:03:45,660 --> 00:03:53,910
If I take this whole thing and grab the values, then I can transform my index values into a Python set.
37

38
00:03:54,780 --> 00:04:04,290
To see which values are included in "X_test" but missing from "test_doc_ids", all I have to do is take the
38

39
00:04:04,290 --> 00:04:08,910
difference between the two and that will give me the answer.
39
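
The set difference can be sketched like this, again on made-up data. The document IDs and messages are hypothetical; only "X_test.index.values" and the name "test_doc_ids" come from the lesson.

```python
import pandas as pd

# Made-up stand-in for X_test: the index holds the document IDs.
X_test = pd.DataFrame(
    {"MESSAGE": ["free offer now", "xq zz kpl", "meeting today", "win cash"]},
    index=[3, 7, 11, 20],
)

# Made-up IDs that actually made it into the saved file.
test_doc_ids = {3, 11, 20}

# IDs that are in X_test but NOT in test_doc_ids.
excluded = set(X_test.index.values) - test_doc_ids

# Convert to plain ints so the printed result is easy to read.
excluded_ids = sorted(int(doc_id) for doc_id in excluded)
print(excluded_ids)  # [7]
```

The order of the operands matters: "a - b" keeps what is in "a" and missing from "b", so swapping them would answer a different question.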

40
00:04:08,910 --> 00:04:16,380
Here are the 16 emails that are not included in our txt file, but that still doesn't answer the question
40

41
00:04:16,380 --> 00:04:21,690
why. We've only identified the specific document IDs that were problematic.
41

42
00:04:21,690 --> 00:04:25,200
Let's dig in and take a closer look at these messages.
42

43
00:04:25,230 --> 00:04:30,680
The first one I'm going to pull up through "data.MESSAGE" is message number 14.
43

44
00:04:31,290 --> 00:04:32,490
So "data.MESSAGE[
44

45
00:04:32,490 --> 00:04:35,520
14]"
45

46
00:04:35,580 --> 00:04:39,630
will show us the message text of the first email that didn't make it.
46

47
00:04:40,360 --> 00:04:42,760
And this email looks like this.
47

48
00:04:43,320 --> 00:04:44,820
It's, well, it's..
48

49
00:04:44,850 --> 00:04:48,000
It looks like a private key or a public key of some sort.
49

50
00:04:48,000 --> 00:04:50,370
It's complete gibberish.
50

51
00:04:50,400 --> 00:04:51,840
What about some of the other ones?
51

52
00:04:51,840 --> 00:04:52,730
Let's check
52

53
00:04:53,290 --> 00:04:58,700
325, 416 and 445. 325 
53

54
00:04:58,860 --> 00:05:01,390
looks like this; 416
54

55
00:05:01,530 --> 00:05:07,830
looks like this and 445 looks like this.
55

56
00:05:08,120 --> 00:05:11,950
Now I'm starting to spot a pattern here.
56

57
00:05:12,470 --> 00:05:18,890
All of these emails seem to look like a complete mess, but maybe the reason they look like this is because
57

58
00:05:18,890 --> 00:05:21,570
we had a problem reading this file, right.
58

59
00:05:21,590 --> 00:05:26,120
Maybe this is related to the encoding that we used and this is why we're getting some sort of
59

60
00:05:26,120 --> 00:05:28,250
mojibake here or something, right?
60

61
00:05:28,310 --> 00:05:29,900
Maybe that's the source of the problem.
61

62
00:05:30,620 --> 00:05:33,740
So let me actually pull up one of these files in my text editor.
62

63
00:05:34,160 --> 00:05:36,560
I'm going to go with file number, say 14.
63

64
00:05:36,800 --> 00:05:43,700
The way I can find out which file this actually was is by going to "data.loc" and then feeding in the
64

65
00:05:43,700 --> 00:05:45,550
number 14.
65

66
00:05:45,560 --> 00:05:52,260
This will give me the actual file name, so that starts with 00095 and then something.
66

67
00:05:52,610 --> 00:05:59,000
I've located this file under my "spam_1" folder in my SpamAssassin corpus.
67

68
00:05:59,000 --> 00:06:03,530
Opening this file in my Atom text editor gives me the following.
68

69
00:06:03,530 --> 00:06:08,820
I got my email header up top and then down here I've got my email body.
69

70
00:06:08,960 --> 00:06:13,110
So what we saw in Jupyter notebook is actually the email body, right,
70

71
00:06:13,130 --> 00:06:17,170
1-to-1, we didn't screw something up on the encoding.
71

72
00:06:17,210 --> 00:06:23,240
Now would you like to venture a guess as to what we get when we stick this email through our message
72

73
00:06:23,240 --> 00:06:25,970
cleaning function? So "clean_
73

74
00:06:25,970 --> 00:06:37,560
msg_no_html(data.at[14, 'MESSAGE'])".
74

75
00:06:37,790 --> 00:06:43,940
Would you like to venture a guess of what we get when we stick our message body from this particular
75

76
00:06:43,940 --> 00:06:55,960
email through our cleaning function? We get a string like this. Now this particular string, as you can
76

77
00:06:55,960 --> 00:06:59,910
probably guess is not part of our vocabulary.
77

78
00:07:00,010 --> 00:07:09,370
It's not part of the top 2500 words that we're using for our classifier and hence this email would not
78

79
00:07:09,370 --> 00:07:13,810
have been included when we created our sparse matrix.
79

80
00:07:13,990 --> 00:07:22,450
This condition here "if word in word_set" filters out any words in the email that are not part of
80

81
00:07:22,450 --> 00:07:28,150
our vocabulary. Now the story is quite similar with some of these other emails.
81
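
Here's a rough sketch of why that filter makes a whole email vanish. This is not the notebook's actual function: "to_rows", "word_index" and the tiny vocabulary are hypothetical names and data, and only the "if word in word_set" condition is taken from the lesson.

```python
# Tiny made-up vocabulary standing in for the top 2500 words.
vocab = ["free", "offer", "meet", "cash"]
word_index = {word: word_id for word_id, word in enumerate(vocab)}
word_set = set(vocab)


def to_rows(doc_id, words):
    """Return one (DOC_ID, WORD_ID) pair per word that is in the vocabulary."""
    return [(doc_id, word_index[word]) for word in words if word in word_set]


# An ordinary email: the unknown word is dropped, the rest survive.
rows_normal = to_rows(3, ["free", "offer", "xq"])

# A gibberish email: every word is filtered out, so no rows are produced
# and the document ID never shows up in the saved data.
rows_gibberish = to_rows(7, ["mqgxz", "bxqv", "zzkpl"])

print(rows_normal)     # [(3, 0), (3, 1)]
print(rows_gibberish)  # []
```

An email with zero surviving words contributes zero rows, which is how its document ID ends up missing from the grouped data.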

82
00:07:28,480 --> 00:07:38,610
But one exception is this one that I found here, 1096. This document "data.MESSAGE[
82

83
00:07:38,680 --> 00:07:45,820
1096]" looks like so. And looking at this text, we see that it's
83

84
00:07:45,820 --> 00:07:47,530
not all gibberish, right.
84

85
00:07:47,530 --> 00:07:52,140
This is mostly HTML actually, right?
85

86
00:07:52,210 --> 00:07:57,820
We can see a whole bunch of HTML email tags in the body of this email.
86

87
00:07:58,180 --> 00:08:06,540
And what happens when we clean this message and remove the HTML? So "clean_msg_
87

88
00:08:06,540 --> 00:08:06,830
no_html(
88

89
00:08:06,850 --> 00:08:18,730
data.at[1096, '
89

90
00:08:18,730 --> 00:08:24,210
MESSAGE'])" gives us an empty list.
90

91
00:08:24,250 --> 00:08:29,220
This particular message contains so much HTML that Beautiful Soup,
91

92
00:08:29,380 --> 00:08:37,690
the tool that we use to remove all the HTML tags and strip those from our email bodies actually leaves
92

93
00:08:37,690 --> 00:08:38,590
nothing behind.
93

94
00:08:38,590 --> 00:08:44,530
Not a single word actually makes it into our list.
94
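
As an illustration of how a message can come out empty, here's a small sketch. To keep it runnable with the standard library only, it swaps in Python's built-in "html.parser" as a stand-in for Beautiful Soup, and it skips the stop word removal and stemming that the lesson's function does. The name "words_without_html" and the sample messages are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words_without_html(message):
    """Strip the markup, then keep the purely alphabetic words in lowercase."""
    parser = TextOnly()
    parser.feed(message)
    parser.close()
    text = " ".join(parser.chunks)
    return [word.lower() for word in text.split() if word.isalpha()]


# A message that is nothing but markup: an image and an empty link.
html_only = '<html><body><img src="banner.gif"><a href="http://example.com"></a></body></html>'

# A message with some visible text inside the markup.
mixed = "<p>Hello world</p>"

print(words_without_html(html_only))  # []
print(words_without_html(mixed))      # ['hello', 'world']
```

When all of the content lives in tags and attributes, stripping the markup leaves no text to tokenize, so the word list is empty.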

95
00:08:44,560 --> 00:08:50,380
The reason that I know it's Beautiful Soup and not our word stemmer or something else is if I call our
95

96
00:08:50,470 --> 00:08:57,340
other function "clean_message()" where the HTML tags are left in and feed in the very same
96

97
00:08:57,340 --> 00:09:11,520
code, "data.at[1096, 'MESSAGE']", ugh typo, then I get the following. I get all my
97

98
00:09:11,520 --> 00:09:17,910
HTML tags as individual words on this list. So I think that was actually quite subtle,
98

99
00:09:17,930 --> 00:09:23,990
there was something that was happening in the background that we might not have noticed while we were
99

100
00:09:24,050 --> 00:09:30,060
writing our code, but it's always good to check if your code is doing what you expect it to do
100

101
00:09:30,340 --> 00:09:37,100
and I think this also gave us a chance to have another go at practicing using Python sets and really
101

102
00:09:37,100 --> 00:09:39,830
seeing their use in checking membership,
102

103
00:09:39,830 --> 00:09:45,620
especially if you're looking for differences and trying to see which values are included in one but
103

104
00:09:45,620 --> 00:09:46,520
not in the other one.
104

105
00:09:48,000 --> 00:09:48,760
And that's it.
105

106
00:09:48,780 --> 00:09:51,980
This wraps up our pre-processing.
106

107
00:09:52,290 --> 00:09:56,630
We've actually done an incredible amount of work in the past couple of lessons.
107

108
00:09:56,850 --> 00:10:03,750
We've extensively cleaned our data, explored our data, visualized our data and the text files and the
108

109
00:10:03,760 --> 00:10:09,810
JSONs and all these files that we've saved to our disk create checkpoints that mean that we don't have
109

110
00:10:09,810 --> 00:10:15,430
to rerun all the work and all the code in our Jupyter notebooks.
110

111
00:10:15,480 --> 00:10:21,300
In fact, we've worked really hard to create the files that contain the data that we will feed into our
111

112
00:10:21,300 --> 00:10:24,360
Naive Bayes classifier algorithm.
112

113
00:10:24,360 --> 00:10:28,890
And if you take another look at this training data and this testing data that we've created, you'll notice
113

114
00:10:28,890 --> 00:10:33,320
that all the strings, all the text has disappeared from them.
114

115
00:10:33,390 --> 00:10:40,410
We're exclusively working with numbers now. All the stemmed words are replaced just by word IDs, by
115

116
00:10:40,410 --> 00:10:41,640
integers.
116

117
00:10:41,670 --> 00:10:43,860
It's much more abstract, right?
117

118
00:10:44,280 --> 00:10:52,320
But what we've effectively done is we've transformed our words into tokens, but not only that, we've also
118

119
00:10:52,320 --> 00:10:58,920
counted how often each token appears in every email message.
119
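
The idea of replacing words with IDs and counting them per email can be sketched as follows. The vocabulary, the emails and the row layout of (document ID, word ID, count) are assumptions for illustration, not the notebook's actual code or file format.

```python
from collections import Counter

# Tiny made-up vocabulary: each stemmed word maps to an integer word ID.
vocab = ["free", "offer", "meet"]
word_index = {word: word_id for word_id, word in enumerate(vocab)}

# Made-up emails, already cleaned and stemmed, keyed by document ID.
emails = {
    3: ["free", "free", "offer"],
    11: ["meet"],
}

rows = []
for doc_id, words in emails.items():
    # Count how often each token (word ID) appears in this email.
    counts = Counter(word_index[word] for word in words if word in word_index)
    for word_id, occurrence in sorted(counts.items()):
        rows.append((doc_id, word_id, occurrence))

print(rows)  # [(3, 0, 2), (3, 1, 1), (11, 2, 1)]
```

No strings are left in the result: every row is just integers, which is the form the classifier will work with.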

120
00:10:59,160 --> 00:11:03,430
With this in hand, it's time to train our Naive Bayes classifier.
120

121
00:11:03,570 --> 00:11:09,230
We can move on to the next step in our project. Now our Jupyter notebook
121

122
00:11:09,250 --> 00:11:15,020
is getting very, very long and as such it's also getting a little bit unwieldy.
122

123
00:11:15,300 --> 00:11:19,470
So I'm quickly gonna go up here and rename this notebook.
123

124
00:11:19,480 --> 00:11:30,770
I'll call it "06 Bayes Classifier - Pre-Processing", click "Rename" and then for the next part where we're
124

125
00:11:30,770 --> 00:11:36,210
training our algorithm, we're gonna be doing it in a new fresh notebook.
125

126
00:11:36,320 --> 00:11:42,560
That way we've separated out all the pre-processing that we're doing in one Jupyter notebook and we're
126

127
00:11:42,560 --> 00:11:48,440
going to separate the training that we're gonna do for this project into a separate notebook.
127

128
00:11:48,440 --> 00:11:51,290
And on that note, I'll see you in the next lesson.
128

129
00:11:51,320 --> 00:11:51,890
Take care.