0 1 00:00:00,620 --> 00:00:07,070 Now that we've extracted the body text of an email from a single email, we need to do this for all our 1 2 00:00:07,070 --> 00:00:13,180 emails and for that we need to create a function. The kind of function that we're going to create 2 3 00:00:13,340 --> 00:00:19,310 in this lesson is a special type of function in Python called a generator function. 3 4 00:00:19,310 --> 00:00:25,360 In other words we will create a function that reads all the files in a folder. 4 5 00:00:25,370 --> 00:00:31,060 Now the functions that we've encountered so far run once and they return a value. 5 6 00:00:31,250 --> 00:00:37,430 If you recall a standard Python function just has that "return" keyword and then it spits out a value 6 7 00:00:37,640 --> 00:00:40,040 following whatever comes after that keyword. 7 8 00:00:40,100 --> 00:00:41,360 And that's it. 8 9 00:00:41,360 --> 00:00:47,110 And what this means is that a function needs to return all the results at once. 9 10 00:00:47,270 --> 00:00:51,640 It needs to return all the results at the same time. 10 11 00:00:51,630 --> 00:00:58,610 Now if we wrote a function that read all the emails and extracted the body text from all the 5000 e-mails 11 12 00:00:58,820 --> 00:01:05,170 all at once, then we would have to return 5000 email bodies all at the same time as well. 12 13 00:01:05,190 --> 00:01:09,650 Now if this sounds like a lot of work, then you're absolutely right. 13 14 00:01:09,650 --> 00:01:13,910 And we don't have to write our Python code to do it this way. 14 15 00:01:13,910 --> 00:01:20,600 There is an alternative and this is where generator functions come into play. 
In our Python notebook, 15 16 00:01:20,600 --> 00:01:30,310 let's add a Markdown cell that reads "Generator Functions" and in the cell below we're going to go over 16 17 00:01:30,310 --> 00:01:36,250 this advanced functional pattern that you're going to encounter every time you want to spit out a series 17 18 00:01:36,250 --> 00:01:37,580 of values. 18 19 00:01:37,720 --> 00:01:40,480 We're gonna be combining two very powerful programming tools. 19 20 00:01:40,480 --> 00:01:41,920 The first one is loops 20 21 00:01:41,920 --> 00:01:45,290 and the second one is this generator function. 21 22 00:01:45,340 --> 00:01:52,330 So before we parse 5000 e-mails, let's go through a practice generator function. Starts out the same way 22 23 00:01:52,420 --> 00:02:01,060 as every other function, with a definition - "def" keyword, then I'll give it a name "generate_ 23 24 00:02:01,810 --> 00:02:09,970 squares" and then I'll give it maybe a capital N as a single parameter and then inside this function 24 25 00:02:10,510 --> 00:02:16,690 I'll write a loop: "for my_number in range(N):", 25 26 00:02:16,690 --> 00:02:27,050 this is gonna be up to N, "yield my_number**2". 26 27 00:02:27,070 --> 00:02:30,130 This here is my generator function. 27 28 00:02:30,130 --> 00:02:36,040 It will take in a single value, N, and it will run the loop N times. 28 29 00:02:36,040 --> 00:02:39,420 Now, one difference that you'll notice is that we don't have a return keyword. 29 30 00:02:39,430 --> 00:02:41,320 Instead we've got this other keyword here. 30 31 00:02:41,560 --> 00:02:42,850 Yield. 31 32 00:02:42,850 --> 00:02:47,500 Let's call this generator function to see how it behaves, then we'll talk a little bit more about the 32 33 00:02:47,500 --> 00:02:48,330 syntax. 
33 34 00:02:48,460 --> 00:02:54,490 Having pressed Shift+Enter on the cell, you might think that all we have to do is call the function by 34 35 00:02:54,490 --> 00:03:00,190 using its name like so "generate_squares(3)", say, and press Shift+Enter. 35 36 00:03:01,210 --> 00:03:06,620 But in this case the output looks a bit unexpected. Instead of squaring say the number three, 36 37 00:03:06,760 --> 00:03:09,910 what we get is a generator object. 37 38 00:03:09,910 --> 00:03:13,830 So how do we call this function in a more useful way? 38 39 00:03:13,840 --> 00:03:20,410 One thing we can do is wrap this whole thing in a loop and then you'll also see how this generator function 39 40 00:03:20,740 --> 00:03:22,030 actually works. 40 41 00:03:22,030 --> 00:03:34,710 So if I say "for i in generate_squares(3):", "print(i", and then a comma, and then with the "end" argument put 41 42 00:03:34,710 --> 00:03:40,110 a little arrow in between the results and let's hit Shift+Enter now. 42 43 00:03:41,140 --> 00:03:42,650 So this is interesting, right. 43 44 00:03:42,740 --> 00:03:49,460 We get 0, 1, 4 and then each number is separated by a little arrow here. 44 45 00:03:49,640 --> 00:03:55,530 What's going on? Our loop will run three times because N is equal to three. 45 46 00:03:55,820 --> 00:04:02,840 But what we're doing here is we're feeding the values into our generator function one at a time - the 46 47 00:04:02,840 --> 00:04:09,590 first value that we feed in is the value 0 and 0 squared is equal to zero. 47 48 00:04:09,590 --> 00:04:17,120 Then we feed in the value 1, 1 squared is equal to, well, 1, then we feed in the value 2, 2 squared is 48 49 00:04:17,120 --> 00:04:18,430 equal to 4. 49 50 00:04:18,620 --> 00:04:24,260 And the amazing thing here is that this function using the yield keyword remembers where it left 50 51 00:04:24,260 --> 00:04:25,370 off. 51 52 00:04:25,370 --> 00:04:30,430 So let's change our argument here to the number 5 and see how this goes.
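Written out as a cell, the practice function and the two ways of calling it described above might look like the sketch below. The function name, the parameter N and the loop variable come from the narration; the exact arrow string passed to "end" is an assumption, since the transcript only says "a little arrow".

```python
def generate_squares(N):
    # Loop N times and hand back one square per pass instead of returning a list.
    for my_number in range(N):
        yield my_number ** 2


# Calling the function on its own does not run the loop yet.
# It only creates a generator object.
squares = generate_squares(3)
print(type(squares).__name__)

# Wrapping the call in a loop pulls the values out one at a time.
# The ' -> ' separator is an assumed spelling of the "little arrow".
for i in generate_squares(3):
    print(i, end=' -> ')
print()
```

The first print shows that the bare call produces a generator object rather than a number; the loop is what actually makes the function body run.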
52 53 00:04:30,440 --> 00:04:40,160 Now our sequence looks like this: 0, 1, 4, 9, 16. In contrast to the return keyword for a normal function where 53 54 00:04:40,160 --> 00:04:42,780 the function basically exits with a value 54 55 00:04:42,830 --> 00:04:46,280 and we're done for good, with the yield keyword 55 56 00:04:46,280 --> 00:04:53,450 it's sort of exiting the function but it remembers the state where we had exited from. 56 57 00:04:53,450 --> 00:05:00,500 So in this case we're iterating through our loop and it remembers the previous value that it was at 57 58 00:05:00,500 --> 00:05:04,410 and we're starting from the point where we had yielded from. 58 59 00:05:04,580 --> 00:05:06,500 But why is this interesting? 59 60 00:05:06,500 --> 00:05:12,530 Why does this matter? At first glance it looks like we could achieve the very, very same thing with a 60 61 00:05:12,530 --> 00:05:19,490 normal function that uses the return keyword instead of having these loops and iterating through a generator 61 62 00:05:19,490 --> 00:05:20,210 function. 62 63 00:05:20,330 --> 00:05:22,220 Why would we do this? 63 64 00:05:22,220 --> 00:05:29,000 Well, here's the thing, with a generator function we don't have to do all the upfront work. 64 65 00:05:29,360 --> 00:05:37,430 So in our case we've got 5000 e-mails that we have to parse. With a large dataset like that or an incredibly 65 66 00:05:37,430 --> 00:05:38,600 long list, 66 67 00:05:38,750 --> 00:05:44,240 it takes an incredible amount of computation to even produce a single value let alone thousands of them 67 68 00:05:44,330 --> 00:05:46,180 at the same time. 68 69 00:05:46,190 --> 00:05:52,550 So what we're going to do now is we're gonna apply this generator function to loop over and iterate 69 70 00:05:52,850 --> 00:06:00,320 over all the files in our directory that holds onto the spam emails and then we're basically going to 70 71 00:06:00,320 --> 00:06:02,970 parse one email at a time.
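One way to see both points from the passage above, that the function remembers where it left off and that no work is done up front, is to step through the generator by hand with Python's built-in next(). This is only a sketch: the work_log list is a hypothetical helper added here to record when each value is actually computed, and is not part of the lesson's code.

```python
work_log = []  # hypothetical helper: records which numbers have been processed so far


def generate_squares(N):
    for my_number in range(N):
        work_log.append(my_number)  # note that this value is being worked on now
        yield my_number ** 2


gen = generate_squares(5)
before = list(work_log)       # nothing has been computed yet

first = next(gen)             # runs the body up to the first yield
after_first = list(work_log)

second = next(gen)            # resumes right after the previous yield
remaining = list(gen)         # drains the values that are left

print(before, after_first)
print(first, second, remaining)
```

A regular function with "return" would have had to square all five numbers before handing anything back; here each square is produced only when the caller asks for it.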
71 72 00:06:03,020 --> 00:06:07,140 That's how we're going to use this generator function. 72 73 00:06:07,160 --> 00:06:15,440 Let me add another Markdown cell here that reads "Email body extraction" and what we'll do here is we'll 73 74 00:06:15,680 --> 00:06:24,270 define a generator function that walks over all the file names in a particular folder. 74 75 00:06:24,290 --> 00:06:26,720 This is a function from the operating system. 75 76 00:06:26,780 --> 00:06:28,460 Here's how we're going to use it. 76 77 00:06:28,460 --> 00:06:31,130 So we'll wrap the whole thing in a function. 77 78 00:06:31,130 --> 00:06:31,910 Yeah. 78 79 00:06:31,940 --> 00:06:37,450 "def email_body_generator()", 79 80 00:06:37,790 --> 00:06:45,920 and this is going to take a single parameter, namely the relative path to one of our folders, 80 81 00:06:46,040 --> 00:06:50,960 the spam folder or the folder with the legitimate emails. 81 82 00:06:51,010 --> 00:06:54,020 Now what we'll do is we'll write a loop. 82 83 00:06:54,020 --> 00:07:09,040 We're going to say "for root, dirnames, filenames in walk(path):". 83 84 00:07:09,640 --> 00:07:17,710 This walk function is where our operating system comes in. The walk function generates the file names 84 85 00:07:18,070 --> 00:07:25,570 in a directory by walking the tree from the top to the bottom and it yields, 85 86 00:07:25,600 --> 00:07:34,390 that's right, doesn't return, it yields a tuple, so three things consisting of the directory path which 86 87 00:07:34,390 --> 00:07:41,770 is this first one here, the directory names which is the second one here and the file names, which is 87 88 00:07:41,860 --> 00:07:50,950 this third one here. The directory path is obviously the path to our spam folder in this case. The directory 88 89 00:07:50,950 --> 00:07:58,900 names are the sub directories which we're actually not going to use and the file names is the bit that 89 90 00:07:58,900 --> 00:08:07,060 we're actually interested in. 
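To make the three-part tuple from walk concrete, here is a small self-contained sketch. It does not touch the spam folders: it builds a throwaway directory with tempfile, and the file and folder names in it (a.txt, b.txt, sub_folder, c.txt) are made up for illustration.

```python
import tempfile
from os import makedirs, walk
from os.path import join

with tempfile.TemporaryDirectory() as top:
    # Build a tiny tree: two files at the top and one file in a sub directory.
    makedirs(join(top, 'sub_folder'))
    for name in ('a.txt', 'b.txt'):
        with open(join(top, name), 'w') as f:
            f.write('hello')
    with open(join(top, 'sub_folder', 'c.txt'), 'w') as f:
        f.write('hello')

    walked = []
    # walk yields one (directory path, sub directory names, file names) tuple
    # per directory, starting at the top of the tree.
    for root, dirnames, filenames in walk(top):
        print(root, dirnames, filenames)
        walked.append((root == top, sorted(dirnames), sorted(filenames)))
```

The first tuple describes the top folder itself, and the file names in its third slot are the part the lesson goes on to use.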
This is gonna be a list of names of all the files in our directory. 90 91 00:08:07,300 --> 00:08:13,960 In other words, if we point this function to "easy_ham_1", then we're gonna get all 91 92 00:08:13,960 --> 00:08:20,290 these file names right here. We're gonna get all the file names in this "easy_ham" directory. 92 93 00:08:21,220 --> 00:08:29,080 This is what we're after. Now, the walk function is not built in. It belongs to the os library. So let's 93 94 00:08:29,140 --> 00:08:40,810 import it at the very top of our notebook. Scrolling up, we're going to say "from os import walk" and while we're 94 95 00:08:40,810 --> 00:08:45,670 up here, we're also going to import something else that we're gonna be using in this function, namely 95 96 00:08:46,000 --> 00:08:47,330 the join function. 96 97 00:08:47,470 --> 00:08:53,170 So "from os.path import join". 97 98 00:08:53,220 --> 00:09:02,020 Now let me hit Shift+Enter and scroll back down. Let me add a colon and let's write the inner part 98 99 00:09:02,320 --> 00:09:08,530 of this loop. The inner part of this loop is going to make use of all the file names that we're retrieving 99 100 00:09:08,890 --> 00:09:10,780 using the walk function. 100 101 00:09:10,780 --> 00:09:17,110 So what we want to do with a single file is actually very, very similar to this bit of code that we've 101 102 00:09:17,110 --> 00:09:23,290 written earlier, but since this function is going to return all the files to us, we're gonna have to tackle 102 103 00:09:23,770 --> 00:09:25,850 each file one by one. 103 104 00:09:25,850 --> 00:09:35,380 But let me copy this code nonetheless and then down here, we're going to add another loop, namely we'll say "for 104 105 00:09:36,100 --> 00:09:46,760 file_name in filenames:" and then let's paste in this code.
I'm going to select this bit 105 106 00:09:46,760 --> 00:09:56,690 here and just hit Tab on my keyboard to indent it and make sure it's in the body of my inner loop and 106 107 00:09:56,690 --> 00:10:03,320 then I'm going to have to make another change. We're not gonna be targeting our example file. We need 107 108 00:10:03,320 --> 00:10:13,300 to be targeting a particular file in this list of file names. How do we get that? Well we'll say the file 108 109 00:10:13,300 --> 00:10:15,930 path of a particular file 109 110 00:10:15,970 --> 00:10:24,790 is gonna be equal to joining the root, which we're getting here from our outer loop, to a particular 110 111 00:10:24,790 --> 00:10:32,470 file name that we're iterating over in our inner loop. So we'll say "combine the path for the root directory 111 112 00:10:32,770 --> 00:10:40,390 with a file name that we're iterating over in our loop". And then in our open function, we can replace 112 113 00:10:40,600 --> 00:10:44,380 example_file with filepath. 113 114 00:10:44,500 --> 00:10:48,180 Everything here will remain the same. 114 115 00:10:48,220 --> 00:10:52,640 The only thing that's gonna change is that we're not gonna be printing out the email body. 115 116 00:10:52,870 --> 00:10:57,610 We want this function to spit out two pieces of information - 116 117 00:10:57,610 --> 00:11:03,880 one is the file name and the other one is the email body. And this is where we're gonna use that yield 117 118 00:11:04,030 --> 00:11:05,190 keyword once again. 118 119 00:11:05,220 --> 00:11:12,220 So we'll say "yield file_name, email_body". 119 120 00:11:12,220 --> 00:11:18,580 Now I know this bit of code looks very, very involved, but we've broken it down quite a bit in the previous 120 121 00:11:18,580 --> 00:11:20,040 lessons already.
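Put together, the generator function dictated above might look roughly like the sketch below. The names email_body_generator, root, dirnames, filenames, file_name, filepath and email_body come from the narration. The body-extraction step in the middle is NOT shown in this part of the transcript, so what stands in for it here is an assumption: it treats everything after the first blank line as the body and reads the file as latin-1. The demo file msg_1 is made up for illustration.

```python
import tempfile
from os import walk
from os.path import join


def email_body_generator(path):
    for root, dirnames, filenames in walk(path):
        for file_name in filenames:
            # Combine the root directory with the current file name.
            filepath = join(root, file_name)

            # ASSUMPTION: stand-in for the extraction code pasted from the
            # earlier lesson. The body is taken to be everything after the
            # first blank line, and the encoding is a guess.
            lines = []
            is_body = False
            with open(filepath, encoding='latin-1') as stream:
                for line in stream:
                    if is_body:
                        lines.append(line)
                    elif line == '\n':
                        is_body = True
            email_body = ''.join(lines)

            # Hand back one (file name, body) pair per file, then pause.
            yield file_name, email_body


# Tiny demonstration on a throwaway folder with one made-up message.
with tempfile.TemporaryDirectory() as folder:
    with open(join(folder, 'msg_1'), 'w', encoding='latin-1') as f:
        f.write('Subject: hi\nFrom: someone@example.com\n\nHello there\nBye\n')
    results = list(email_body_generator(folder))

print(results)
```

Because of the yield, only one file is open and parsed at any moment, no matter how many files the folder holds.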
121 122 00:11:20,080 --> 00:11:27,670 So for example, we know that this bit of code extracts an email body from a particular file and we know 122 123 00:11:27,670 --> 00:11:37,300 that using the yield keyword, this function here will give us a result every time it loops over a particular 123 124 00:11:37,300 --> 00:11:38,980 file in our directory. 124 125 00:11:39,460 --> 00:11:45,040 So it'll spit out this file name and this email body, then it'll spit out this file name and this email body, then it'll spit out 125 126 00:11:45,040 --> 00:11:52,520 this file name and this email body and so on. The only thing that's really new is this walk function 126 127 00:11:52,850 --> 00:12:00,830 from the os library which spits out a tuple, which we're using in our loop and we're nesting an inner 127 128 00:12:00,830 --> 00:12:06,830 loop inside this one here to go over all the files one by one. 128 129 00:12:07,980 --> 00:12:10,710 Now that's half of the work done. 129 130 00:12:10,880 --> 00:12:17,560 If we look back up here, we've essentially done this bit. We now need to write the second piece of code 130 131 00:12:17,770 --> 00:12:21,040 that actually calls our generator function. 131 132 00:12:21,040 --> 00:12:25,680 We need to write a loop that iterates over the results of our generator function. 132 133 00:12:25,840 --> 00:12:29,430 Let's put the second piece of code inside a function as well. 133 134 00:12:29,950 --> 00:12:37,480 So I'm gonna go down here and I'm going to call this function "dataframe from directory", 134 135 00:12:37,570 --> 00:12:45,480 so "df_from_directory" and it's going to take two inputs, it's gonna take a 135 136 00:12:45,480 --> 00:12:55,180 path and a classification; and by classification I just mean whether this email folder is going to contain 136 137 00:12:55,210 --> 00:13:01,690 spam emails or legitimate emails. To create our data frame we will start out with two empty lists.
137 138 00:13:01,690 --> 00:13:11,680 So I'll say "rows = []" and "row_names" is also equal to a 138 139 00:13:11,680 --> 00:13:18,600 pair of empty square brackets. Our generator function is going to be called inside a loop. 139 140 00:13:18,970 --> 00:13:30,320 So we'll say "for file_name, email_body" which is what our generator function 140 141 00:13:30,590 --> 00:13:43,030 is yielding; "in email_body_generator" and then our generator function here needs 141 142 00:13:43,210 --> 00:13:49,900 one input, namely a path, and by the way if you haven't pressed Shift+Enter on this it's a good idea to 142 143 00:13:49,900 --> 00:13:58,860 do so now and once you've done that all we need to do is supply a path to our generator function as 143 144 00:13:58,860 --> 00:14:05,430 an argument and I'm just going to feed through the path that is being passed into this data frame from 144 145 00:14:05,430 --> 00:14:11,570 directory function to our generator function right here. Inside the loop, 145 146 00:14:11,610 --> 00:14:15,370 we're gonna append our email bodies to our rows list, so I'll say "rows. 146 147 00:14:15,360 --> 00:14:29,370 append({'MESSAGE': email_body, 147 148 00:14:29,370 --> 00:14:31,980 'CATEGORY': 148 149 00:14:31,980 --> 00:14:34,950 classification})". 149 150 00:14:35,240 --> 00:14:43,060 So what I've done here is I've created a Python dictionary using the values that our generator function 150 151 00:14:43,330 --> 00:14:51,200 spits out. Each time this loop runs it's gonna give us a file name and an email body and we're storing 151 152 00:14:51,200 --> 00:15:00,010 this in a list where we're appending the email body one by one as it goes over the files. 152 153 00:15:00,050 --> 00:15:08,240 Next we'll do something very similar for the row names. So "row_names.append( 153 154 00:15:08,990 --> 00:15:11,380 file_name)". 154 155 00:15:11,380 --> 00:15:15,330 Now, this dataframe from directory function here is gonna be a regular function.
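The accumulation loop described above can be tried on its own with a stand-in generator. In this sketch the 'MESSAGE' and 'CATEGORY' keys and the rows and row_names lists follow the narration, while fake_email_body_generator and collect_rows are hypothetical names invented so the snippet runs without any email files on disk.

```python
def fake_email_body_generator(path):
    # Hypothetical stand-in for email_body_generator: it ignores the path
    # and yields two made-up (file name, body) pairs.
    yield 'file_a', 'Body of the first email'
    yield 'file_b', 'Body of the second email'


def collect_rows(path, classification):
    rows = []
    row_names = []

    # Each pass through the loop receives one pair from the generator.
    for file_name, email_body in fake_email_body_generator(path):
        rows.append({'MESSAGE': email_body, 'CATEGORY': classification})
        row_names.append(file_name)

    return rows, row_names


rows, row_names = collect_rows('some/folder', 1)
print(row_names)
print(rows)
```

Each dictionary becomes one row of data and each file name becomes that row's label, which is why the two lists are filled in step with each other.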
155 156 00:15:15,350 --> 00:15:16,710 It's not going to yield anything. 156 157 00:15:16,760 --> 00:15:29,540 It's going to return a dataframe, so "pd" for pandas, ".DataFrame(rows, index = 157 158 00:15:29,540 --> 00:15:33,940 row_names)" and that's it. 158 159 00:15:34,920 --> 00:15:43,140 Except that we need to "import pandas as pd" at the top of our notebook of course. 159 160 00:15:43,140 --> 00:15:55,310 So let's do that now "import pandas as pd", Shift+Enter, scroll down and hit Shift+Enter on this as well. 160 161 00:15:56,620 --> 00:16:01,850 Now we've written quite a bit of code and we haven't tested it at all. 161 162 00:16:02,110 --> 00:16:06,880 So we're not even sure if all of this will work or if we've made an error. 162 163 00:16:07,990 --> 00:16:13,960 Let's try and call this df_from_directory function and see if it works. 163 164 00:16:14,200 --> 00:16:20,780 But before we do that let's add all our paths to the top of our notebook 164 165 00:16:20,860 --> 00:16:27,610 under this constants heading. The paths that I'm interested in are the paths to "easy_ham_1", 165 166 00:16:28,240 --> 00:16:33,570 the path to "easy_ham_2", "spam_1" and "spam_2". 166 167 00:16:33,880 --> 00:16:40,480 So let's create four constants with these paths. The kind of path that we're gonna be working with 167 168 00:16:40,510 --> 00:16:49,420 is gonna be a relative path. The Bayes Classifier notebook is located under MLProjects, so our path 168 169 00:16:49,570 --> 00:16:51,950 will have to go into SpamData, 169 170 00:16:52,180 --> 00:16:56,140 01_Processing, spam_assassin_corpus, 170 171 00:16:56,140 --> 00:17:00,340 and then we'll have these folder names afterwards. 171 172 00:17:00,340 --> 00:17:01,630 So here we go.
172 173 00:17:01,630 --> 00:17:09,880 "SPAM_1_PATH" is going to be equal to this first bit here, which I'm just gonna 173 174 00:17:10,000 --> 00:17:12,130 copy and paste, 174 175 00:17:12,130 --> 00:17:17,230 then we said it was "spam_assassin_corpus", 175 176 00:17:17,230 --> 00:17:22,740 and then we said it was gonna be "spam_1". 176 177 00:17:22,940 --> 00:17:30,970 This is the relative path from our Bayes Classifier notebook to our spam_1 folder. 177 178 00:17:30,990 --> 00:17:37,640 Now remember, everything is case sensitive and you've got to use forward slashes between the folder names 178 179 00:17:38,120 --> 00:17:40,500 to avoid getting any errors. 179 180 00:17:40,580 --> 00:17:43,280 Let's tackle the other three relative paths now. 180 181 00:17:43,310 --> 00:17:50,500 So I'm just going to copy this, paste it three more times and rename our constants here as well. 181 182 00:17:50,570 --> 00:17:56,240 So "SPAM_2_PATH"; this one I'll call "EASY_ 182 183 00:17:56,240 --> 00:18:05,380 NONSPAM_1_PATH", and this one I'll call "EASY_NONSPAM_2_PATH". The folders that these ones 183 184 00:18:05,380 --> 00:18:06,820 are going to point to, 184 185 00:18:06,850 --> 00:18:19,180 going from the last constant back up to the first, are "easy_ham_2", "easy_ham_1", "spam_2" and "spam_1" of course. 185 186 00:18:19,180 --> 00:18:20,500 And that's it. 186 187 00:18:20,500 --> 00:18:28,570 If we hit Shift+Enter on this and just make sure we haven't made any typos, we're good to go. Let's call 187 188 00:18:28,570 --> 00:18:35,140 our df_from_directory and create a dataframe of spam emails. 188 189 00:18:35,430 --> 00:18:43,580 So I'm going to call this dataframe "spam_emails" and I'm going to set it equal to "df_from_directory". 189 190 00:18:44,360 --> 00:18:53,760 Note I don't have to type all this out, I can just hit Tab on my keyboard, "(SPAM_1_PATH, 1)", 190 191 00:18:54,300 --> 00:19:01,100 because the category for spam is gonna be the number 1.
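Spelled out, the four constants described above would look something like the cell below, assuming the folder names are joined with forward slashes exactly as listed in the narration (SpamData, 01_Processing, spam_assassin_corpus, then the folder). The small isdir check at the end is not part of the lesson; it is an added, optional way to catch typos before calling df_from_directory, and it will simply report "missing" on a machine without the dataset.

```python
from os.path import isdir

# Relative paths from the notebook's folder, as listed in the narration.
SPAM_1_PATH = 'SpamData/01_Processing/spam_assassin_corpus/spam_1'
SPAM_2_PATH = 'SpamData/01_Processing/spam_assassin_corpus/spam_2'
EASY_NONSPAM_1_PATH = 'SpamData/01_Processing/spam_assassin_corpus/easy_ham_1'
EASY_NONSPAM_2_PATH = 'SpamData/01_Processing/spam_assassin_corpus/easy_ham_2'

# Added sanity check (an assumption, not from the lesson): a misspelled or
# wrongly cased folder name shows up here as "missing".
all_paths = {
    'SPAM_1_PATH': SPAM_1_PATH,
    'SPAM_2_PATH': SPAM_2_PATH,
    'EASY_NONSPAM_1_PATH': EASY_NONSPAM_1_PATH,
    'EASY_NONSPAM_2_PATH': EASY_NONSPAM_2_PATH,
}
for name, path in all_paths.items():
    status = 'found' if isdir(path) else 'missing'
    print(f'{name}: {status}')
```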
Now before I continue going, 191 192 00:19:01,110 --> 00:19:05,110 let's look at the head of this dataframe to check out the first few rows. 192 193 00:19:05,250 --> 00:19:13,780 So "spam_emails.head()" and Shift+Enter. 193 194 00:19:13,810 --> 00:19:18,750 Voila! Here we can see the file names of the first five rows. 194 195 00:19:19,050 --> 00:19:20,440 We've got a category. 195 196 00:19:20,580 --> 00:19:24,400 Category is going to be 1 for spam and 0 for non spam. 196 197 00:19:24,690 --> 00:19:26,550 And then we've got the messages, 197 198 00:19:26,550 --> 00:19:32,530 in other words the bodies of all the emails as a column in our dataframe as well. 198 199 00:19:33,500 --> 00:19:40,310 Let's take a look at the shape of this dataframe to see if we've got all our emails, "spam_ 199 200 00:19:40,310 --> 00:19:50,570 emails.shape", Shift+Enter gives us 501 and 2. Two for the number of columns, the 200 201 00:19:50,570 --> 00:19:56,800 category and our email bodies and 501 for the number of rows, 201 202 00:19:56,870 --> 00:20:00,890 in other words the number of messages in this folder.