0 1 00:00:00,340 --> 00:00:00,990 All right. 1 2 00:00:01,020 --> 00:00:09,600 So in the last lesson, we've looked at one particular folder "spam_1" and we've loaded all the message 2 3 00:00:09,600 --> 00:00:12,140 bodies into a dataframe. 3 4 00:00:12,200 --> 00:00:16,780 What I want to do in this lesson is call our "df_from_directory" function 4 5 00:00:16,890 --> 00:00:24,210 a few more times to load all the emails that we've got into a single dataframe and have extracted all 5 6 00:00:24,210 --> 00:00:26,620 the bodies from those emails. 6 7 00:00:26,670 --> 00:00:32,190 So that means that in addition to our "spam_1" folder, we're going to be loading in our "spam_2" folder, 7 8 00:00:32,700 --> 00:00:38,040 our non-spam 1 and our non-spam 2 folders as well. 8 9 00:00:38,040 --> 00:00:40,240 So let's get right on that. 9 10 00:00:40,320 --> 00:00:48,620 I'm going to modify this cell right here and what I'm going to do is I'm going to take my "spam_emails" 10 11 00:00:48,630 --> 00:00:51,250 dataframe and we're going to overwrite it. 11 12 00:00:51,270 --> 00:00:59,640 We're going to append the emails from the other folder and update it. So "spam_emails.append". And what 12 13 00:00:59,640 --> 00:01:00,790 are we appending? 13 14 00:01:00,810 --> 00:01:06,880 Well, we can append the return value from our df_from_directory function. 14 15 00:01:06,990 --> 00:01:18,630 So "df_from_directory(SPAM_2_PATH, 1)" will extract all the emails from our second folder containing 15 16 00:01:18,630 --> 00:01:25,260 the spam emails and then just append all of those values to our dataframe. 16 17 00:01:25,320 --> 00:01:33,370 If I hit Shift+Enter on this and also hit Shift+Enter on shape, we can now see that we've got 17 18 00:01:33,490 --> 00:01:39,980 1898 values instead of the 500 or so that we had earlier. 
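A note for anyone following along on a recent pandas release: "DataFrame.append" was deprecated in pandas 1.4 and removed in pandas 2.0, so the "spam_emails.append(...)" call described above will raise an AttributeError there. Below is a minimal sketch of the same step using "pd.concat". The two small frames are toy stand-ins for what "df_from_directory" would return, and the column names MESSAGE and CATEGORY are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the frames df_from_directory would return
# (hypothetical data; MESSAGE and CATEGORY column names are assumed).
spam_1 = pd.DataFrame(
    {"MESSAGE": ["win a prize", "cheap pills"], "CATEGORY": [1, 1]},
    index=["spam_1_a.txt", "spam_1_b.txt"],
)
spam_2 = pd.DataFrame(
    {"MESSAGE": ["free money"], "CATEGORY": [1]},
    index=["spam_2_a.txt"],
)

# Equivalent of the older: spam_emails = spam_1.append(spam_2)
spam_emails = pd.concat([spam_1, spam_2])

# Rows from both folders, same two columns
print(spam_emails.shape)
```

In the notebook itself, the second frame would come from a call like df_from_directory(SPAM_2_PATH, 1) rather than from hand-written data.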
18 19 00:01:40,000 --> 00:01:47,350 Now one other thing we can do to make our code slightly more readable is to change this value here, 19 20 00:01:47,350 --> 00:01:49,660 this 1 to a constant 20 21 00:01:49,660 --> 00:01:55,000 that's a little bit more descriptive, tells us a little bit more about what this 1 actually stands 21 22 00:01:55,000 --> 00:01:58,850 for. Scrolling back up to our constants, 22 23 00:01:58,920 --> 00:02:06,550 we can add another constant here, namely SPAM_CAT, short for category, 23 24 00:02:06,720 --> 00:02:14,910 set that equal to 1 and while we're at it, we can also add a HAM_CATEGORY constant and 24 25 00:02:14,910 --> 00:02:17,690 set that equal to 0. 25 26 00:02:17,820 --> 00:02:25,170 So henceforth every time we need to refer to the category, we can use these constants right here. Using 26 27 00:02:25,170 --> 00:02:32,640 the word "ham" to refer to non-spam emails is something that you'll actually see a lot in the literature 27 28 00:02:32,700 --> 00:02:35,220 on spam classification. 28 29 00:02:35,220 --> 00:02:39,740 I'm not exactly sure why, but I suspect it's because this group of people really liked wordplay. 29 30 00:02:40,140 --> 00:02:43,120 So spam and ham it is for us as well. 30 31 00:02:44,130 --> 00:02:47,340 Now, scrolling back down to our last cell where we left off, 31 32 00:02:47,380 --> 00:02:48,790 I want to pose a challenge to you. 32 33 00:02:49,780 --> 00:02:57,070 I want you to create a dataframe that contains all the emails from the non-spam directories and then 33 34 00:02:57,070 --> 00:03:04,430 I want you to also print out the shape of this dataframe and then we'll take it from there. 34 35 00:03:04,570 --> 00:03:07,950 So pause the video and give that a shot. 35 36 00:03:08,020 --> 00:03:15,080 Create a dataframe with all the non-spam emails similar to what I've done for the spam emails. 36 37 00:03:15,240 --> 00:03:16,340 Did you have a go? 37 38 00:03:16,710 --> 00:03:16,980 All right. 
38 39 00:03:16,980 --> 00:03:18,540 Here's the solution. 39 40 00:03:18,930 --> 00:03:20,430 "ham_emails" 40 41 00:03:20,580 --> 00:03:23,680 is gonna be what I'm going to call my dataframe. 41 42 00:03:23,790 --> 00:03:30,600 I'm going to use my df_from_directory function and I'm going to point it to "EASY_NONSPAM_1_PATH" 42 43 00:03:30,600 --> 00:03:36,880 and use the ham category. After that, 43 44 00:03:37,030 --> 00:03:39,950 I'm also going to do the same thing I did before. 44 45 00:03:40,060 --> 00:03:49,190 I'm going to use my ham_emails dataframe and I'm going to append the result of df_from_directory, pointing it 45 46 00:03:49,190 --> 00:03:59,190 to "EASY_NONSPAM_2_PATH" and again using the ham category. Finally, we said we'd print out the shape, right? 46 47 00:03:59,200 --> 00:04:06,730 So "ham_emails.shape" should give us what we're looking for. 47 48 00:04:07,010 --> 00:04:08,320 Hitting Shift+Enter, 48 49 00:04:08,330 --> 00:04:09,850 let's see what we get. 49 50 00:04:09,920 --> 00:04:16,880 So I'm getting 3902 files loaded into this dataframe. 50 51 00:04:16,910 --> 00:04:17,870 Brilliant. 51 52 00:04:17,870 --> 00:04:25,010 Now what we can do is we can get a dataframe that holds onto all our emails, both spam and non-spam. 52 53 00:04:25,010 --> 00:04:27,680 So I'm just gonna call this dataframe "data". 53 54 00:04:27,890 --> 00:04:30,460 Got a lot of imagination as you can tell. 54 55 00:04:30,560 --> 00:04:41,240 And I'm going to use pandas' concat function, so "pd.concat([spam_emails, 55 56 00:04:41,580 --> 00:04:52,760 ham_emails])", then I'll add a print statement that reads "Shape of entire dataframe is", and I'll print 56 57 00:04:52,760 --> 00:04:56,950 out "data.shape" and on the next line 57 58 00:04:57,050 --> 00:05:00,690 let's take a look at the head of this dataframe, 58 59 00:05:00,690 --> 00:05:06,280 so the first five rows. Now let me hit Shift+Enter and print this out.
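The steps just described can be sketched end to end as follows. This is not the course's own code: the "df_from_directory" here is a hypothetical stand-in with the same call shape (a folder path plus a category), "make_folder" is a helper invented for the demo, and the folders are tiny temporary directories rather than the real spam and non-spam paths. The MESSAGE and CATEGORY column names are also assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Category constants as named in the lesson
SPAM_CAT = 1
HAM_CATEGORY = 0


def df_from_directory(path, classification):
    """Hypothetical stand-in: one row per file, indexed by file name."""
    rows, names = [], []
    for file in sorted(Path(path).iterdir()):
        rows.append({"MESSAGE": file.read_text(), "CATEGORY": classification})
        names.append(file.name)
    return pd.DataFrame(rows, index=names)


def make_folder(root, name, bodies):
    """Demo helper (not part of the lesson): write toy emails to disk."""
    folder = Path(root) / name
    folder.mkdir()
    for i, body in enumerate(bodies):
        (folder / f"{name}_{i}.txt").write_text(body)
    return folder


with tempfile.TemporaryDirectory() as root:
    spam_dir = make_folder(root, "spam", ["buy now", "free offer"])
    ham_dir = make_folder(root, "ham", ["meeting at noon", "lunch?", "see attached"])

    spam_emails = df_from_directory(spam_dir, SPAM_CAT)
    ham_emails = df_from_directory(ham_dir, HAM_CATEGORY)

# Stack spam and non-spam rows into one dataframe
data = pd.concat([spam_emails, ham_emails])
print("Shape of entire dataframe is", data.shape)
```

Because "pd.concat" stacks the frames in the order given, the spam rows come first and the non-spam rows follow.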
59 60 00:05:06,380 --> 00:05:13,280 What we see is that this dataframe has 5800 rows and 2 columns. 60 61 00:05:13,820 --> 00:05:15,110 Just like before, 61 62 00:05:15,110 --> 00:05:19,800 I've got the file names here as an index. 62 63 00:05:19,900 --> 00:05:27,880 I've got my category showing whether I've got spam or non-spam and I've got the message body here in 63 64 00:05:27,880 --> 00:05:29,910 the message column. 64 65 00:05:30,070 --> 00:05:37,260 If you're curious where the non-spam emails are hiding in our dataframe, they're gonna be in the tail. 65 66 00:05:37,300 --> 00:05:44,080 So here we have a couple of category zero non-spam emails hiding out. 66 67 00:05:44,080 --> 00:05:45,070 All right. 67 68 00:05:45,070 --> 00:05:46,390 That's it. 68 69 00:05:46,420 --> 00:05:54,670 We've basically taken 5800 files from our local disk and we've converted them 69 70 00:05:55,240 --> 00:05:57,820 into a pandas dataframe. 70 71 00:05:57,820 --> 00:06:04,460 We've converted them into a format that we can manipulate and work with in our Python code. 71 72 00:06:04,480 --> 00:06:08,170 So I think that's quite an achievement. I'll see you in the next lesson. 72 73 00:06:08,170 --> 00:06:08,710 Take care.
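As a quick way to check where each category sits in the combined dataframe, here is a small sketch using "head", "tail" and "value_counts". The data is a toy frame built in place, and the MESSAGE and CATEGORY column names are assumptions; in the notebook you would run the same three calls on the real "data" dataframe.

```python
import pandas as pd

# Toy combined frame: spam rows (category 1) first, non-spam rows (category 0) after
data = pd.concat([
    pd.DataFrame(
        {"MESSAGE": ["spam body"] * 3, "CATEGORY": 1},
        index=[f"spam_{i}" for i in range(3)],
    ),
    pd.DataFrame(
        {"MESSAGE": ["ham body"] * 4, "CATEGORY": 0},
        index=[f"ham_{i}" for i in range(4)],
    ),
])

print(data.head(2))  # first rows: spam
print(data.tail(2))  # last rows: non-spam

# Number of rows per category
counts = data["CATEGORY"].value_counts()
print(counts)
```

Counting rows per category is a handy sanity check that both groups of folders were actually loaded.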