Welcome back. Having just completed some very beautiful visualizations, let's get back to pre-processing our data for our Bayes classifier. Now, there are lots of individual words among the 5800-odd emails that constitute our dataset. We won't actually use every single word that came up in these email bodies; we're just going to use the 2500 most frequent words. I'm going to add a markdown cell here to commemorate this, and it's going to read "Generate Vocabulary & Dictionary". The 2500 most frequent words in our dataset are going to form our vocabulary, and we will generate this vocabulary from our stemmed list of words. To get our stemmed list of words, we're once again going to call the "clean_msg_no_html" function that we created earlier. Now, I know that for the word cloud we commented out this line. So if you have it commented out, comment it back in, because we actually do want the stemmed words this time round. And if you made this change, make sure that you do two things: if you comment this line back in, make sure you comment this one back out, and also press Shift+Enter on this cell. Otherwise you're going to get some very unexpected results later on. All right, so let's use our apply method and call this function right here.
I'll create a variable called "stemmed_nested_list" and set that equal to "data.MESSAGE.apply()", and then I'll feed in the name of our function, "clean_msg_no_html". And because this is a nested list, I'm going to flatten it and store it under "flat_stemmed_list", setting that equal to the result of some Python list comprehension: "[item for sublist in stemmed_nested_list for item in sublist]". So far, nothing new. Let's run this cell and move on. The next step will be getting a unique set of words; this is going to make up our vocabulary. The easiest way to do this, I think, is to generate a pandas Series and then use the "value_counts" method. Once again, you can ignore this warning here; it's aimed at people who are trying to use Beautiful Soup to open a URL. Now, to create that series of unique words, I'll quickly create a variable called "unique_words" and set that equal to "pd.Series()". I'll provide the "flat_stemmed_list" that we created a minute ago, and then I'm going to call a method by the name of "value_counts". Let me show you what we've just done here.
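The apply-and-flatten pattern above can be sketched as follows. This is a minimal, self-contained version: the two-email DataFrame is toy data, and the "clean_msg_no_html" body here is a simplified stand-in, since the real function from the earlier lessons also strips HTML and stems each word.

```python
import pandas as pd

def clean_msg_no_html(message):
    # Toy stand-in: lowercase and split into words. The real function
    # from the earlier lessons also strips HTML and stems each word.
    return message.lower().split()

# Toy dataset standing in for the ~5800 email bodies
data = pd.DataFrame({'MESSAGE': ['Free money now', 'money back guarantee']})

# apply() returns a Series of word lists, one list per email body
stemmed_nested_list = data.MESSAGE.apply(clean_msg_no_html)

# Flatten the nested lists into one long word list
flat_stemmed_list = [item for sublist in stemmed_nested_list for item in sublist]
print(flat_stemmed_list)
```

The list comprehension reads left to right like two nested for loops: for each sublist in the outer series, take each item.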
I can print out the number of unique words using the shape of my variable, so I'll say "Nr of unique words" inside a print statement, then a comma, and then "unique_words.shape[0]". This will print out the number of unique words in this series. And to look at the first five rows, the first five entries in this series, I'll say "unique_words.head()". Let's see what we get. What we see here is that after cleaning and stemming, we are left with 27320 words in our dataset: some 27000 unique words across all our email bodies. Now, this is an absolutely huge number, and we're actually only going to train our classifier with a subset of it, namely the 2500 most frequent words. Now, you might be wondering why "http" is up here. Well, "http" pretty much precedes every single URL, so this goes to show how many hyperlinks people have included in their emails. Now, to get the 2500 most frequent words, I want to throw this over to you as a challenge, and the reason is that this is another good opportunity to practice subsetting and working with these series. Can you create a subset of this unique words series and store it in a variable called "frequent_words" which will only contain the most frequent 2500 words out of the total? And then afterwards, print out the top 10 words.
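Here is a small sketch of the "value_counts" step, using a six-word toy list in place of the real flat_stemmed_list so the counts are easy to verify by eye:

```python
import pandas as pd

# Toy word list standing in for the real flat_stemmed_list
flat_stemmed_list = ['money', 'free', 'money', 'http', 'money', 'http']

# value_counts() collapses duplicates into a frequency table,
# sorted from most to least frequent
unique_words = pd.Series(flat_stemmed_list).value_counts()

print('Nr of unique words', unique_words.shape[0])
print(unique_words.head())
```

With the toy data, "money" appears three times and tops the series, just as "http" tops the real one.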
These, of course, are going to overlap with the top 5 words that you see above. I'll give you a few seconds to pause the video and give this a go. Here's the solution: "frequent_words = unique_words", and then we're going to use that square bracket notation, "[0:2500]". That's how we create a subset. And to print out the top 10, we'll say "print('Most common words')", and then I'm even going to use an escape character and a new line, a comma, and "frequent_words". Once again I'm going to create a subset, this time from the beginning, so I'm even going to leave out the zero, going up to 10. These are the top 10 words, and here they are. With the first notation, when you're creating a subset, you're setting a starting point and an ending point. With the second notation, you're going from the beginning to an end point. So I hope this was a useful review. One thing that we can do to improve our code, though, is removing some of these magic numbers that we see here. So instead of having 2500 float around in my code, I'd like to define a constant at the very top called "VOCAB_SIZE" and set that equal to the size of the vocabulary that I'm going to use in my code later on.
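The two slicing notations from the solution can be sketched like this, on a four-word toy series with VOCAB_SIZE shrunk to 3 so the effect is visible (the lesson itself uses 2500):

```python
import pandas as pd

# Toy stand-in for the ~27000-word unique_words series
unique_words = pd.Series({'http': 50, 'money': 30, 'free': 20, 'click': 10})

VOCAB_SIZE = 3  # the lesson uses 2500

# Start point and end point: positions 0 up to (but not including) VOCAB_SIZE
frequent_words = unique_words[0:VOCAB_SIZE]

# Leaving out the start defaults to the beginning of the series
print('Most common words\n', frequent_words[:2])
```

Integer slices on a series with a string index are positional, which is why this picks the first VOCAB_SIZE rows, i.e. the most frequent words, since value_counts already sorted them.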
That way, if I ever want to make a change, all I have to do is change this number here, and it will filter through, as long as I replace this number here with my constant "VOCAB_SIZE". Now, you've got to remember: if you've changed a cell up top, you've got to press Shift+Enter, otherwise you're going to get an error. Now, with frequent words we're working with a series, right? So with "type(frequent_words)" we can see that we have a pandas Series. And not only that: if we look at frequent words, we see that this bit here, the actual words, forms our index, and the numbers of occurrences are actually the values in this series. Let's practice how we would go between a series and a dataframe and how to work with these indices. We're also going to take this opportunity to assign a word ID to each word, similar to how we assigned a doc ID in an earlier lesson. I'll add a markdown cell here real quick that's going to read "Create Vocabulary DataFrame with a WORD_ID". Now, our word IDs are just going to be integers ranging from zero to 2499, meaning we're going to work again with this range object, and we can create a range very easily with "range(0,)" and then going up to our "VOCAB_SIZE", right? This is how we create our range. Now, what we can do is store all these numbers in a list.
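The point about the index holding the words and the values holding the counts can be checked directly. A tiny sketch, with a two-word toy series:

```python
import pandas as pd

# Toy stand-in for the top-2500 frequent_words series
frequent_words = pd.Series({'http': 50, 'money': 30})

print(type(frequent_words))          # a pandas Series
print(frequent_words.index.values)   # the words live in the index
print(frequent_words.values)         # the occurrence counts are the values
```

So indexing by position or slicing gives you counts, while the index is where the word strings themselves sit.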
So I'll wrap this call to range in a list, and I'm actually going to store this in a variable called "word_ids". So "word_ids = list(range(0, VOCAB_SIZE))", and then closing the parentheses. Now let's create our dataframe with "pd.DataFrame()", and then what I'm going to do is provide a dictionary. So I'll have those two curly braces; our dictionary is going to consist of a key and a value. The key will be whatever I want that column heading to read, and "VOCAB_WORD" sounds good to me. Now, the values: I want those to be the actual words. Scrolling up a little bit, the words are here in our series. So this means that if I use "frequent_words" like so, I'm actually accessing the frequencies, the numbers. What I need to do to get the words is work with our index, right? So "index.values" will be the way I can store all these different strings in a column for our dataframe. So far so good. Let's see what we've got. Let me hit Shift+Enter on this cell. At the moment, our dataframe looks like this. Fair enough. Let's add our word IDs explicitly to this dataframe. We can do that by setting the dataframe's index, right?
So "index = word_ids", and then we can also give that index a name. But first, let me give our dataframe a name as well, so I'll say "vocab = pd.DataFrame", and on the line below I'll say "vocab.index.name = 'WORD_ID'". Let's look at the first five rows in our dataframe with "vocab.head()" and then Shift+Enter. There we go. Fantastic. We've generated the vocabulary that we're going to train our classifier with. Now, previously we've had a pandas dataframe and used the "to_json" functionality to save a file in the JSON format to our disk. "to_json" is all well and good, but of course pandas can save many different file types. A common one that you're going to be working with a lot is a CSV file: comma-separated values. This is a file format that can easily be opened and nicely formatted with Microsoft Excel or Google Sheets. Now, you've probably already surmised that you're going to need a file path for this "to_csv" function that we're going to call. So let's go back up to our constants and create a constant that will hold on to our file path and file name for the CSV file that we're going to create. I'm going to copy this constant right here, paste it below, and then make a few changes to it, of course.
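Putting the last few steps together, the vocabulary DataFrame can be sketched like this, again with a three-word toy series and VOCAB_SIZE shrunk to match:

```python
import pandas as pd

# Toy frequent_words series; the lesson's holds the top 2500 words
frequent_words = pd.Series({'http': 50, 'money': 30, 'free': 20})
VOCAB_SIZE = 3

# Word IDs are just the integers 0 .. VOCAB_SIZE - 1
word_ids = list(range(0, VOCAB_SIZE))

# The dictionary key becomes the column heading, and index.values
# pulls the word strings out of the series' index
vocab = pd.DataFrame({'VOCAB_WORD': frequent_words.index.values},
                     index=word_ids)
vocab.index.name = 'WORD_ID'

print(vocab.head())
```

Note that the frequencies are deliberately dropped here: the vocabulary only pairs each word with its WORD_ID.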
So I'll call this one, in all caps, "WORD_ID_FILE". We're still going to save it in our Processing folder, but we're going to call it "word-by-id" and then add the ".csv" extension to it. So that's our file path, and I'll hit Shift+Enter to make sure this is saved. Then down here I'm going to add a quick section heading; it's going to read "Save the Vocabulary as a CSV File". As you can guess, we'll access the "to_csv" method from our vocab object, so "vocab.to_csv()", and then we're going to pass in our "WORD_ID_FILE" path and name. Now, if I hit Shift+Tab on this, I can see some of the other parameters that I can specify, and that includes a header and an index. What does this mean? Coming down here, I can see that our header can be a list of strings, and our index label can also be a string. By default, no index label is provided. So let's provide these two additional inputs to our "to_csv" method call. I'll add a comma here, and the first thing I'll do is provide the index label. Now, I could provide it as a string with single quotes and say 'WORD_ID', but instead of typing this out and risking a typo, if I want to make sure it matches what I've got in my dataframe, I can access it directly with "vocab.index.name". And for our header, it's actually the very same thing.
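The "to_csv" call with both extra parameters can be sketched as below. The temp-directory path is a hypothetical stand-in, since the lesson saves into its own Processing folder; the header is passed as a one-element list, matching the "list of strings" form that the docstring describes:

```python
import os
import tempfile
import pandas as pd

# Toy two-word vocabulary DataFrame
vocab = pd.DataFrame({'VOCAB_WORD': ['http', 'money']})
vocab.index.name = 'WORD_ID'

# Hypothetical path; the lesson's WORD_ID_FILE points into its Processing folder
WORD_ID_FILE = os.path.join(tempfile.gettempdir(), 'word-by-id.csv')

# Pulling index_label and header from the DataFrame itself
# avoids re-typing the strings and risking a typo
vocab.to_csv(WORD_ID_FILE,
             index_label=vocab.index.name,
             header=[vocab.VOCAB_WORD.name])

print(open(WORD_ID_FILE).read())
```

The first line of the resulting file is "WORD_ID,VOCAB_WORD", followed by one "id,word" row per vocabulary entry.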
I could either write 'VOCAB_WORD', which is our header right here, or alternatively I can grab our column, so "vocab.VOCAB_WORD.name". This will accomplish the very, very same thing. Now, I'm not going to hit Shift+Enter on this right away. What I want to do instead is bring up my folder here on the right-hand side and then hit Shift+Enter, so you can see the file appearing. Here you go. There it is. Now, you can open this in a text editor, say Atom, and the CSV file will be formatted something like this. It's not particularly impressive. But if you have a spreadsheet program like Microsoft Excel or Google Sheets, or in my case this Numbers program that comes with the Mac, then you'll see the values formatted nicely in these columns. So as you can see, the CSV format is very, very handy. Now, we're covering quite a lot of stuff in these lessons, and with programming the best way of learning is by doing. So the next two lessons will consist of some very quick exercises to review some of the concepts that we've talked about. I'm off to grab some more coffee, and then I'll see you there.