0
1
00:00:00,620 --> 00:00:07,070
Now that we've extracted the body text of an email from a single email, we need to do this for all our
1

2
00:00:07,070 --> 00:00:13,180
emails and for that we need to create a function. The kind of function that we're going to create
2

3
00:00:13,340 --> 00:00:19,310
in this lesson is a special type of function in Python called a generator function.
3

4
00:00:19,310 --> 00:00:25,360
In other words we will create a function that reads all the files in a folder.
4

5
00:00:25,370 --> 00:00:31,060
Now the functions that we've encountered so far run once and they return a value.
5

6
00:00:31,250 --> 00:00:37,430
If you recall a standard Python function just has that "return" keyword and then it spits out a value
6

7
00:00:37,640 --> 00:00:40,040
following whatever comes after that keyword.
7

8
00:00:40,100 --> 00:00:41,360
And that's it.
8

9
00:00:41,360 --> 00:00:47,110
And what this means is that a function needs to return all the results at once.
9

10
00:00:47,270 --> 00:00:51,640
It needs to return all the results at the same time.
10

11
00:00:51,630 --> 00:00:58,610
Now if we wrote a function that read all the emails and extracted the body text from all the 5000 e-mails
11

12
00:00:58,820 --> 00:01:05,170
all at once, then we would have to return 5000 email bodies all at the same time as well.
12

13
00:01:05,190 --> 00:01:09,650
Now if this sounds like a lot of work, then you're absolutely right.
13

14
00:01:09,650 --> 00:01:13,910
And we don't have to write our Python code to do it this way.
14

15
00:01:13,910 --> 00:01:20,600
There is an alternative and this is where generator functions come into play. In our Python notebook,
15

16
00:01:20,600 --> 00:01:30,310
let's add a Markdown cell that reads "Generator Functions" and in the cell below we're going to go over
16

17
00:01:30,310 --> 00:01:36,250
this advanced functional pattern that you're going to encounter every time you want to spit out a series
17

18
00:01:36,250 --> 00:01:37,580
of values.
18

19
00:01:37,720 --> 00:01:40,480
We're gonna be combining two very powerful programming tools.
19

20
00:01:40,480 --> 00:01:41,920
The first one is loops
20

21
00:01:41,920 --> 00:01:45,290
and the second one is this generator function.
21

22
00:01:45,340 --> 00:01:52,330
So before we parse 5000 e-mails, let's go through a practice generator function. Starts out the same way
22

23
00:01:52,420 --> 00:02:01,060
as every other function, with a definition - "def" keyword, then I'll give it a name "generate_
23

24
00:02:01,810 --> 00:02:09,970
squares" and then I'll give it maybe a capital N as a single parameter and then inside this function
24

25
00:02:10,510 --> 00:02:16,690
I'll write a loop: "for my_number in range(N):",
25

26
00:02:16,690 --> 00:02:27,050
this is gonna be up to N, "yield my_number**2".
26

27
00:02:27,070 --> 00:02:30,130
This here is my generator function.
27

28
00:02:30,130 --> 00:02:36,040
It will take in a single value, N, and it will run the loop N times.
28

29
00:02:36,040 --> 00:02:39,420
Now, one difference that you'll notice is that we don't have a return keyword.
29

30
00:02:39,430 --> 00:02:41,320
Instead we've got this other keyword here.
30

31
00:02:41,560 --> 00:02:42,850
Yield.
31

32
00:02:42,850 --> 00:02:47,500
Let's call this generator function to see how it behaves, then we'll talk a little bit more about the
32

33
00:02:47,500 --> 00:02:48,330
syntax.
33

34
00:02:48,460 --> 00:02:54,490
Having pressed Shift+Enter on the cell, you might think that all we have to do is call the function by
34

35
00:02:54,490 --> 00:03:00,190
using its name like so "generate_squares(3)", say, and press Shift+Enter.
35

36
00:03:01,210 --> 00:03:06,620
But in this case the output looks a bit unexpected. Instead of squaring say the number three,
36

37
00:03:06,760 --> 00:03:09,910
what we get is a generator object.
37
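A minimal sketch of the practice cell as narrated so far. The variable name `result` is not from the lesson and is only there for illustration. Calling the function does not run the loop, it just hands back a generator object.

```python
def generate_squares(N):
    # The loop runs up to (but not including) N
    for my_number in range(N):
        # yield hands back one value and pauses here until the next request
        yield my_number ** 2


# Calling the function does not square anything yet
result = generate_squares(3)
print(result)  # shows something like <generator object generate_squares at 0x...>
```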

38
00:03:09,910 --> 00:03:13,830
So how do we call this function in a more useful way?
38

39
00:03:13,840 --> 00:03:20,410
One thing we can do is wrap this whole thing in a loop and then you'll also see how this generator function
39

40
00:03:20,740 --> 00:03:22,030
actually works.
40

41
00:03:22,030 --> 00:03:34,710
So if I say "for i in generate_squares(3):", "print(i", then a comma, and then with the "end" argument put
41

42
00:03:34,710 --> 00:03:40,110
a little arrow in between the results and let's hit Shift+Enter now.
42

43
00:03:41,140 --> 00:03:42,650
So this is interesting, right.
43

44
00:03:42,740 --> 00:03:49,460
We get 0, 1, 4 and then each number is separated by a little arrow here.
44

45
00:03:49,640 --> 00:03:55,530
What's going on? Our loop will run three times because N is equal to three.
45

46
00:03:55,820 --> 00:04:02,840
But what we're doing here is we're feeding in the values into our generator function one at a time - the
46

47
00:04:02,840 --> 00:04:09,590
first value that we feed in is the value 0 and 0 squared is equal to zero.
47

48
00:04:09,590 --> 00:04:17,120
Then we feed in the value 1, 1 squared is equal to, well, 1, then we feed in the value 2, 2 squared is
48

49
00:04:17,120 --> 00:04:18,430
equal to 4.
49
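Here is a sketch of the loop described above. The exact arrow string passed to `end` is an assumption, since the narration only says "a little arrow".

```python
def generate_squares(N):
    for my_number in range(N):
        yield my_number ** 2


# The for loop asks the generator for one value at a time
for i in generate_squares(3):
    print(i, end=' -> ')  # prints: 0 -> 1 -> 4 ->
```

Changing the argument to 5 continues the same sequence with 9 and 16.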

50
00:04:18,620 --> 00:04:24,260
And the amazing thing here is that this function using the yield keyword remembers where it left
50

51
00:04:24,260 --> 00:04:25,370
off.
51

52
00:04:25,370 --> 00:04:30,430
So let's change our argument here to the number 5 and see how this goes.
52

53
00:04:30,440 --> 00:04:40,160
Now our sequence looks like this: 0, 1, 4, 9, 16. In contrast to the return keyword for a normal function where
53

54
00:04:40,160 --> 00:04:42,780
the function basically exits with a value
54

55
00:04:42,830 --> 00:04:46,280
and we're done for good, with the yield keyword
55

56
00:04:46,280 --> 00:04:53,450
it's sort of exiting the function but it remembers the state where we had exited from.
56

57
00:04:53,450 --> 00:05:00,500
So in this case we're iterating through our loop and it remembers the previous value that it was at
57

58
00:05:00,500 --> 00:05:04,410
and we're starting from the point where we had yielded from.
58
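One way to see the "remembers where it left off" behaviour is to step through the generator by hand with Python's built-in `next()`. This is an illustration that is not part of the lesson's notebook, and the variable names are made up.

```python
def generate_squares(N):
    for my_number in range(N):
        yield my_number ** 2


squares = generate_squares(3)

# Each call resumes the loop right after the previous yield
first = next(squares)   # 0
second = next(squares)  # 1
third = next(squares)   # 4

# The loop is now finished, so one more request raises StopIteration
try:
    next(squares)
    exhausted = False
except StopIteration:
    exhausted = True
```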

59
00:05:04,580 --> 00:05:06,500
But why is this interesting?
59

60
00:05:06,500 --> 00:05:12,530
Why does this matter? At first glance it looks like we could achieve the very, very same thing with a
60

61
00:05:12,530 --> 00:05:19,490
normal function that uses the return keyword instead of having these loops and iterating through a generator
61

62
00:05:19,490 --> 00:05:20,210
function.
62

63
00:05:20,330 --> 00:05:22,220
Why would we do this?
63

64
00:05:22,220 --> 00:05:29,000
Well, here's the thing, with a generator function we don't have to do all the upfront work.
64

65
00:05:29,360 --> 00:05:37,430
So in our case we've got 5000 e-mails that we have to parse. With a large dataset like that or an incredibly
65

66
00:05:37,430 --> 00:05:38,600
long list,
66

67
00:05:38,750 --> 00:05:44,240
it takes an incredible amount of computation to even produce a single value let alone thousands of them
67

68
00:05:44,330 --> 00:05:46,180
at the same time.
68
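To make the "no upfront work" point concrete, here is a rough sketch that counts how many numbers each approach actually squares before the first result is available. The `squares_list` function and the `computed` list are hypothetical helpers, not something from the lesson.

```python
computed = []  # records every number that actually gets squared


def squares_list(N):
    # Regular function: does all the work, then returns everything at once
    results = []
    for my_number in range(N):
        computed.append(my_number)
        results.append(my_number ** 2)
    return results


def generate_squares(N):
    # Generator function: does one step of work per requested value
    for my_number in range(N):
        computed.append(my_number)
        yield my_number ** 2


first_eager = squares_list(5000)[0]
eager_work = len(computed)  # all 5000 values were computed

computed.clear()
first_lazy = next(generate_squares(5000))
lazy_work = len(computed)  # only 1 value was computed
```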

69
00:05:46,190 --> 00:05:52,550
So what we're going to do now is we're gonna apply this generator function to loop over and iterate
69

70
00:05:52,850 --> 00:06:00,320
over all the files in our directory that holds onto the spam emails and then we're basically going to
70

71
00:06:00,320 --> 00:06:02,970
parse one email at a time.
71

72
00:06:03,020 --> 00:06:07,140
That's how we're going to use this generator function.
72

73
00:06:07,160 --> 00:06:15,440
Let me add another Markdown cell here that reads "Email body extraction" and what we'll do here is we'll
73

74
00:06:15,680 --> 00:06:24,270
define a generator function that walks over all the file names in a particular folder.
74

75
00:06:24,290 --> 00:06:26,720
This is a function from the operating system.
75

76
00:06:26,780 --> 00:06:28,460
Here's how we're going to use it.
76

77
00:06:28,460 --> 00:06:31,130
So we'll wrap the whole thing in a function.
77

78
00:06:31,130 --> 00:06:31,910
Yeah.
78

79
00:06:31,940 --> 00:06:37,450
"def email_body_generator()",
79

80
00:06:37,790 --> 00:06:45,920
and this is going to take a single parameter, namely the relative path to one of our folders,
80

81
00:06:46,040 --> 00:06:50,960
the spam folder or the folder with the legitimate emails.
81

82
00:06:51,010 --> 00:06:54,020
Now what we'll do is we'll write a loop.
82

83
00:06:54,020 --> 00:07:09,040
We're going to say "for root, dirnames, filenames in walk(path):".
83

84
00:07:09,640 --> 00:07:17,710
This walk function is where our operating system comes in. The walk function generates the file names
84

85
00:07:18,070 --> 00:07:25,570
in a directory by walking the tree from the top to the bottom and it yields,
85

86
00:07:25,600 --> 00:07:34,390
that's right, doesn't return, it yields a tuple, so three things consisting of the directory path which
86

87
00:07:34,390 --> 00:07:41,770
is this first one here, the directory names which is the second one here and the file names, which is
87

88
00:07:41,860 --> 00:07:50,950
this third one here. The directory path is obviously the path to our spam folder in this case. The directory
88

89
00:07:50,950 --> 00:07:58,900
names are the sub directories which we're actually not going to use and the file names is the bit that
89

90
00:07:58,900 --> 00:08:07,060
we're actually interested in. This is gonna be a list of names of all the files in our directory.
90

91
00:08:07,300 --> 00:08:13,960
In other words, if we point this function to "easy_ham_1", then we're gonna get all
91

92
00:08:13,960 --> 00:08:20,290
these file names right here. We're gonna get all the file names in this "easy_ham" directory.
92

93
00:08:21,220 --> 00:08:29,080
This is what we're after. Now, the walk function is not built in. It belongs to the os module. So let's
93

94
00:08:29,140 --> 00:08:40,810
import it at the very top of our notebook. Scrolling up, we're going to say "from os import walk" and while we're
94

95
00:08:40,810 --> 00:08:45,670
up here, we're also going to import something else that we're gonna be using in this function, namely
95

96
00:08:46,000 --> 00:08:47,330
the join function.
96

97
00:08:47,470 --> 00:08:53,170
So "from os.path import join".
97
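A small sketch of what `walk` and `join` give us. It uses a throwaway temporary folder with made-up file names instead of the real email folders, so those names are assumptions for the demo only.

```python
import tempfile
from os import walk
from os.path import join

with tempfile.TemporaryDirectory() as folder:
    # Create two placeholder files so there is something to walk over
    for name in ('first.txt', 'second.txt'):
        with open(join(folder, name), 'w') as f:
            f.write('placeholder')

    # walk yields one (root, dirnames, filenames) tuple per directory
    walked = list(walk(folder))

root, dirnames, filenames = walked[0]

# join glues the directory path and a file name into one file path
file_paths = [join(root, file_name) for file_name in sorted(filenames)]
```

Since the demo folder has no subdirectories, `walk` yields a single tuple and `dirnames` is an empty list.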

98
00:08:53,220 --> 00:09:02,020
Now let me hit Shift+Enter and scroll back down. Let me add a colon and let's write the inner part
98

99
00:09:02,320 --> 00:09:08,530
of this loop. The inner part of this loop is going to make use of all the file names that we're retrieving
99

100
00:09:08,890 --> 00:09:10,780
using the walk function.
100

101
00:09:10,780 --> 00:09:17,110
So what we want to do with a single file is actually very, very similar to this bit of code that we've
101

102
00:09:17,110 --> 00:09:23,290
written earlier, but since this function is going to return all the files to us, we're gonna have to tackle
102

103
00:09:23,770 --> 00:09:25,850
each file one by one.
103

104
00:09:25,850 --> 00:09:35,380
But let me copy this code nonetheless and then down here, we're going to add another loop, namely we'll say "for
104

105
00:09:36,100 --> 00:09:46,760
file_name in filenames:" and then let's paste in this code. I'm going to select this bit
105

106
00:09:46,760 --> 00:09:56,690
here and just hit Tab on my keyboard to indent it and make sure it's in the body of my inner loop and
106

107
00:09:56,690 --> 00:10:03,320
then I'm going to have to make another change. We're not gonna be targeting our example file. We need
107

108
00:10:03,320 --> 00:10:13,300
to be targeting a particular file in this list of file names. How do we get that? Well we'll say the file
108

109
00:10:13,300 --> 00:10:15,930
path of a particular file
109

110
00:10:15,970 --> 00:10:24,790
is gonna be equal to joining the root, which we're getting here from our outer loop, to a particular
110

111
00:10:24,790 --> 00:10:32,470
file name that we're iterating over in our inner loop. So we'll say "combine the path for the root directory
111

112
00:10:32,770 --> 00:10:40,390
with a file name that we're iterating over in our loop". And then in our open function, we can replace
112

113
00:10:40,600 --> 00:10:44,380
example_file with filepath.
113

114
00:10:44,500 --> 00:10:48,180
Everything here will remain the same.
114

115
00:10:48,220 --> 00:10:52,640
The only thing that's gonna change is that we're not gonna be printing out the email body.
115

116
00:10:52,870 --> 00:10:57,610
We want this function to spit out two pieces of information - 
116

117
00:10:57,610 --> 00:11:03,880
one is the file name and the other one is the email body. And this is where we're gonna use that yield
117

118
00:11:04,030 --> 00:11:05,190
keyword once again.
118

119
00:11:05,220 --> 00:11:12,220
So we'll say "yield file_name, email_body".
119
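Putting the pieces together, the generator might look roughly like this. The body-extraction step is copied from an earlier lesson that is not shown here, so the version below (everything after the first blank line, read with latin-1 encoding) is an assumption, and the temporary folder with its `msg_1` file is a made-up test fixture.

```python
import tempfile
from os import walk
from os.path import join


def email_body_generator(path):
    for root, dirnames, filenames in walk(path):
        for file_name in filenames:
            filepath = join(root, file_name)

            # Assumed extraction step: skip the headers and keep
            # every line that comes after the first blank line
            lines = []
            is_body = False
            with open(filepath, encoding='latin-1') as stream:
                for line in stream:
                    if is_body:
                        lines.append(line)
                    elif line == '\n':
                        is_body = True
            email_body = ''.join(lines)

            # One (file name, body) pair per file, one at a time
            yield file_name, email_body


# Try it on a throwaway folder holding a single fake email
with tempfile.TemporaryDirectory() as folder:
    with open(join(folder, 'msg_1'), 'w', encoding='latin-1') as f:
        f.write('Subject: hello\n\nFirst line\nSecond line\n')
    results = list(email_body_generator(folder))
```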

120
00:11:12,220 --> 00:11:18,580
Now I know this bit of code looks very, very involved, but we've broken it down quite a bit in the previous
120

121
00:11:18,580 --> 00:11:20,040
lessons already.
121

122
00:11:20,080 --> 00:11:27,670
So for example, we know that this bit of code extracts an email body from a particular file and we know
122

123
00:11:27,670 --> 00:11:37,300
that using the yield keyword, this function here will give us a result every time it loops over a particular
123

124
00:11:37,300 --> 00:11:38,980
file in our directory.
124

125
00:11:39,460 --> 00:11:45,040
So it'll spit out this file name and this email body, then it'll spit out this file name and this email body, then it'll spit out
125

126
00:11:45,040 --> 00:11:52,520
this file name and this email body and so on. The only thing that's really new is this walk function
126

127
00:11:52,850 --> 00:12:00,830
from the os library which spits out a tuple, which we're using in our loop and we're nesting an inner
127

128
00:12:00,830 --> 00:12:06,830
loop inside this one here to go over all the files one by one.
128

129
00:12:07,980 --> 00:12:10,710
Now that's half of the work done.
129

130
00:12:10,880 --> 00:12:17,560
If we look back up here, we've essentially done this bit. We now need to write the second piece of code
130

131
00:12:17,770 --> 00:12:21,040
that actually calls our generator function.
131

132
00:12:21,040 --> 00:12:25,680
We need to write a loop that repeatedly calls our generator function.
132

133
00:12:25,840 --> 00:12:29,430
Let's put the second piece of code inside a function as well.
133

134
00:12:29,950 --> 00:12:37,480
So I'm gonna go down here and I'm going to call this function "dataframe from directory",
134

135
00:12:37,570 --> 00:12:45,480
so "df_from_directory" and it's going to take two inputs, it's gonna take a
135

136
00:12:45,480 --> 00:12:55,180
path and a classification; and by classification I just mean whether this email folder is going to contain
136

137
00:12:55,210 --> 00:13:01,690
spam emails or legitimate emails. To create our data frame we will start out with two empty lists.
137

138
00:13:01,690 --> 00:13:11,680
So I'll say "rows = []" and "row_names" is also equal to a
138

139
00:13:11,680 --> 00:13:18,600
pair of empty square brackets. Our generator function is going to be called inside a loop.
139

140
00:13:18,970 --> 00:13:30,320
So we'll say "for file_name, email_body" which is what our generator function
140

141
00:13:30,590 --> 00:13:43,030
is yielding; "in email_body_generator" and then our generator function here needs
141

142
00:13:43,210 --> 00:13:49,900
one input, namely a path, and by the way if you haven't pressed Shift+Enter on this it's a good idea to
142

143
00:13:49,900 --> 00:13:58,860
do so now and once you've done that all we need to do is supply a path to our generator function as
143

144
00:13:58,860 --> 00:14:05,430
an argument and I'm just going to feed through the path that is being passed into this data frame from
144

145
00:14:05,430 --> 00:14:11,570
directory function to our generator function right here. Inside the loop,
145

146
00:14:11,610 --> 00:14:15,370
we're gonna append our email bodies to our rows list, so I'll say "rows.
146

147
00:14:15,360 --> 00:14:29,370
append({'MESSAGE':  email_body, 
147

148
00:14:29,370 --> 00:14:31,980
'CATEGORY':
148

149
00:14:31,980 --> 00:14:34,950
classification})".
149

150
00:14:35,240 --> 00:14:43,060
So what I've done here is I've created a Python dictionary using the values that our generator function
150

151
00:14:43,330 --> 00:14:51,200
spits out. Each time this loop runs it's gonna give us a file name and an email body and we're storing
151

152
00:14:51,200 --> 00:15:00,010
this in a list where we're appending the email body one by one as it goes over the files.
152

153
00:15:00,050 --> 00:15:08,240
Next we'll do something very similar for the row names. So "row_names.append(
153

154
00:15:08,990 --> 00:15:11,380
file_name)".
154

155
00:15:11,380 --> 00:15:15,330
Now, this dataframe from directory function here is gonna be a regular function.
155

156
00:15:15,350 --> 00:15:16,710
It's not going to yield anything.
156

157
00:15:16,760 --> 00:15:29,540
It's going to return a dataframe, so "pd" for pandas, ".DataFrame(rows, index = 
157

158
00:15:29,540 --> 00:15:33,940
row_names)" and that's it.
158
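A sketch of the finished function as narrated. To keep it runnable on its own, `email_body_generator` is replaced here by a stand-in that yields two made-up emails. In the notebook it would be the real generator that walks the folder.

```python
import pandas as pd


def email_body_generator(path):
    # Stand-in for the real generator: yields hypothetical (file name, body) pairs
    yield '00001.txt', 'First email body'
    yield '00002.txt', 'Second email body'


def df_from_directory(path, classification):
    rows = []
    row_names = []

    # Each pass through the loop pulls one email out of the generator
    for file_name, email_body in email_body_generator(path):
        rows.append({'MESSAGE': email_body, 'CATEGORY': classification})
        row_names.append(file_name)

    # A regular function: it returns the whole dataframe at the end
    return pd.DataFrame(rows, index=row_names)


demo = df_from_directory('any/path', 1)
```

The file names end up as the index of the dataframe, and every row gets the same category.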

159
00:15:34,920 --> 00:15:43,140
Except that we need to "import pandas as pd" at the top of our notebook of course.
159

160
00:15:43,140 --> 00:15:55,310
So let's do that now "import pandas as pd", Shift+Enter, scroll down and hit Shift+Enter on this as well.
160

161
00:15:56,620 --> 00:16:01,850
Now we've written quite a bit of code and we haven't tested it at all.
161

162
00:16:02,110 --> 00:16:06,880
So we're not even sure if all of this will work or if we've made an error.
162

163
00:16:07,990 --> 00:16:13,960
Let's try and call this df_from_directory function and see if it works.
163

164
00:16:14,200 --> 00:16:20,780
But before we do that let's add all our paths to the top of our notebook
164

165
00:16:20,860 --> 00:16:27,610
under this constants heading. The paths that I'm interested in are the paths to easy_ham_1,
165

166
00:16:28,240 --> 00:16:33,570
the path to "easy_ham_2", "spam_1" and "spam_2".
166

167
00:16:33,880 --> 00:16:40,480
So let's create four constants with these paths. The kind of path that we're gonna be working with
167

168
00:16:40,510 --> 00:16:49,420
is gonna be a relative path. The Bayes Classifier notebook is located under MLProjects, so our path
168

169
00:16:49,570 --> 00:16:51,950
will have to go into SpamData,
169

170
00:16:52,180 --> 00:16:56,140
01_Processing, spam_assassin_corpus,
170

171
00:16:56,140 --> 00:17:00,340
and then we'll have these folder names afterwards.
171

172
00:17:00,340 --> 00:17:01,630
So here we go.
172

173
00:17:01,630 --> 00:17:09,880
"SPAM_1_PATH" is going to be equal to this first bit here, which I'm just gonna
173

174
00:17:10,000 --> 00:17:12,130
copy and paste,
174

175
00:17:12,130 --> 00:17:17,230
then we said it was "spam_assassin_corpus",
175

176
00:17:17,230 --> 00:17:22,740
and then we said it was gonna be "spam_1".
176

177
00:17:22,940 --> 00:17:30,970
This is the relative path from our Bayes Classifier notebook to our spam_1 folder.
177

178
00:17:30,990 --> 00:17:37,640
Now remember, everything is case sensitive and you've got to use forward slashes between the folder names
178

179
00:17:38,120 --> 00:17:40,500
to avoid getting any errors.
179

180
00:17:40,580 --> 00:17:43,280
Let's tackle the other three relative paths now.
180

181
00:17:43,310 --> 00:17:50,500
So I'm just going to copy this, paste it three more times and rename our constants here as well.
181

182
00:17:50,570 --> 00:17:56,240
So "SPAM_2_PATH"; this one I'll call "EASY_
182

183
00:17:56,240 --> 00:18:05,380
NONSPAM_1_PATH", and this one I'll call "EASY_NONSPAM_2_PATH". The folder that these ones
183

184
00:18:05,380 --> 00:18:06,820
are going to point to
184

185
00:18:06,850 --> 00:18:19,180
is gonna be "easy_ham_2", "easy_ham_1", "spam_2" and "spam_1" of course.
185

186
00:18:19,180 --> 00:18:20,500
And that's it.
186
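The four constants might look like this, assuming the folder names are spelled exactly as narrated and the notebook sits directly above the SpamData folder.

```python
# Relative paths from the notebook to the four email folders
# (case sensitive, forward slashes between the folder names)
SPAM_1_PATH = 'SpamData/01_Processing/spam_assassin_corpus/spam_1'
SPAM_2_PATH = 'SpamData/01_Processing/spam_assassin_corpus/spam_2'
EASY_NONSPAM_1_PATH = 'SpamData/01_Processing/spam_assassin_corpus/easy_ham_1'
EASY_NONSPAM_2_PATH = 'SpamData/01_Processing/spam_assassin_corpus/easy_ham_2'
```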

187
00:18:20,500 --> 00:18:28,570
If we hit Shift+Enter on this and just make sure we haven't made any typos, we're good to go. Let's call
187

188
00:18:28,570 --> 00:18:35,140
our df_from_directory and create a dataframe of spam emails.
188

189
00:18:35,430 --> 00:18:43,580
So I'm going to call this dataframe "spam_emails" and I'm going to set it equal to "df_from_directory".
189

190
00:18:44,360 --> 00:18:53,760
Note I don't have to type all this out, I can just hit Tab on my keyboard, "(SPAM_1_PATH,)"
190

191
00:18:54,300 --> 00:19:01,100
and then the category for spam is gonna be the number 1. Now before I continue going,
191

192
00:19:01,110 --> 00:19:05,110
let's look at the head of this dataframe to check out the first few rows.
192

193
00:19:05,250 --> 00:19:13,780
So "spam_emails.head()" and Shift+Enter.
193

194
00:19:13,810 --> 00:19:18,750
Voila! Here we can see the file names of the first five rows.
194

195
00:19:19,050 --> 00:19:20,440
We've got a category.
195

196
00:19:20,580 --> 00:19:24,400
Category is going to be 1 for spam and 0 for non spam.
196

197
00:19:24,690 --> 00:19:26,550
And then we've got the messages,
197

198
00:19:26,550 --> 00:19:32,530
in other words the bodies of all the emails as a column in our dataframe as well.
198

199
00:19:33,500 --> 00:19:40,310
Let's take a look at the shape of this dataframe to see if we've got all our emails, "spam_
199

200
00:19:40,310 --> 00:19:50,570
emails.shape", Shift+Enter gives us 501 and 2. Two for the number of columns, the
200

201
00:19:50,570 --> 00:19:56,800
category and our email bodies and 501 for the number of rows,
201

202
00:19:56,870 --> 00:20:00,890
in other words the number of messages in this folder.