0
1
00:00:00,340 --> 00:00:00,990
All right.
1

2
00:00:01,020 --> 00:00:09,600
So in the last lesson, we looked at one particular folder, "spam_1", and we loaded all the message
2

3
00:00:09,600 --> 00:00:12,140
bodies into a dataframe.
3

4
00:00:12,200 --> 00:00:16,780
What I want to do in this lesson is call our "df_from_directory" function
4

5
00:00:16,890 --> 00:00:24,210
a few more times to load all the emails that we've got into a single dataframe and extract all
5

6
00:00:24,210 --> 00:00:26,620
the bodies from those emails.
6

7
00:00:26,670 --> 00:00:32,190
So that means that in addition to our "spam_1" folder, we're going to be loading in our "spam_2" folder,
7

8
00:00:32,700 --> 00:00:38,040
our non-spam 1 and our non-spam 2 folders as well.
8

9
00:00:38,040 --> 00:00:40,240
So let's get right on that.
9

10
00:00:40,320 --> 00:00:48,620
I'm going to modify this cell right here and what I'm going to do is I'm going to take my "spam_emails"
10

11
00:00:48,630 --> 00:00:51,250
dataframe and we're going to overwrite it.
11

12
00:00:51,270 --> 00:00:59,640
We're going to append the emails from the other folder and update it. So "spam_emails.append". And what
12

13
00:00:59,640 --> 00:01:00,790
are we appending?
13

14
00:01:00,810 --> 00:01:06,880
Well, we can append the return value from our df_from_directory function.
14

15
00:01:06,990 --> 00:01:18,630
So "df_from_directory(SPAM_2_PATH, 1)" will extract all the emails from our second folder containing
15

16
00:01:18,630 --> 00:01:25,260
the spam emails and then just append all of those values to our dataframe.
16
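
The append step described above can be sketched as follows. This is only a minimal sketch, not the notebook code from the lesson: the two small dataframes stand in for what df_from_directory returns, and the file names, message text and the column names MESSAGE and CATEGORY are assumptions made for illustration. Note that DataFrame.append was removed in pandas 2.0, so the sketch uses pd.concat, which produces the same combined dataframe.

```python
import pandas as pd

# Stand-ins (assumed data) for df_from_directory(SPAM_1_PATH, 1)
# and df_from_directory(SPAM_2_PATH, 1).
spam_1 = pd.DataFrame(
    {"MESSAGE": ["Win a prize now", "Cheap loans today"], "CATEGORY": [1, 1]},
    index=["spam_1_a.txt", "spam_1_b.txt"],
)
spam_2 = pd.DataFrame(
    {"MESSAGE": ["Lose weight fast"], "CATEGORY": [1]},
    index=["spam_2_a.txt"],
)

spam_emails = spam_1
# On pandas versions before 2.0 the spoken form also works:
#   spam_emails = spam_emails.append(spam_2)
# pd.concat does the same job on both old and new versions.
spam_emails = pd.concat([spam_emails, spam_2])

print(spam_emails.shape)
```

Either way, the result has to be assigned back to spam_emails, because neither append nor concat changes the original dataframe in place.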

17
00:01:25,320 --> 00:01:33,370
If I hit Shift+Enter on this and also hit Shift+Enter on shape, we can now see that we've got
17

18
00:01:33,490 --> 00:01:39,980
1898 values instead of the 500 or so that we had earlier.
18

19
00:01:40,000 --> 00:01:47,350
Now one other thing we can do to make our code slightly more readable is to change this value here,
19

20
00:01:47,350 --> 00:01:49,660
this 1 to a constant
20

21
00:01:49,660 --> 00:01:55,000
that's a little bit more descriptive, tells us a little bit more about what this 1 actually stands
21

22
00:01:55,000 --> 00:01:58,850
for. Scrolling back up to our constants,
22

23
00:01:58,920 --> 00:02:06,550
we can add another constant here, namely SPAM_CAT, short for category,
23

24
00:02:06,720 --> 00:02:14,910
set that equal to 1 and while we're at it, we can also add a HAM_CAT constant and
24

25
00:02:14,910 --> 00:02:17,690
set that equal to 0.
25
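
The two constants might look like this in the notebook. The ham constant is written here as HAM_CAT to mirror SPAM_CAT, which is an assumption about the exact spelling, and category_name is a hypothetical helper added only to show the constants in use.

```python
SPAM_CAT = 1  # category label for spam
HAM_CAT = 0   # category label for non-spam ("ham")


def category_name(category):
    """Hypothetical helper: turn a numeric label back into a readable name."""
    return "spam" if category == SPAM_CAT else "ham"


print(category_name(SPAM_CAT), category_name(HAM_CAT))
```

With names like these, a call such as df_from_directory(SPAM_2_PATH, SPAM_CAT) says what the second argument means without a comment.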

26
00:02:17,820 --> 00:02:25,170
So henceforth every time we need to refer to the category, we can use these constants right here. Using
26

27
00:02:25,170 --> 00:02:32,640
the word "ham" to refer to non-spam emails is something that you'll actually see a lot in the literature
27

28
00:02:32,700 --> 00:02:35,220
on spam classification.
28

29
00:02:35,220 --> 00:02:39,740
I'm not exactly sure why, but I suspect it's because this group of people really liked wordplay.
29

30
00:02:40,140 --> 00:02:43,120
So spam and ham it is for us as well.
30

31
00:02:44,130 --> 00:02:47,340
Now, scrolling back down to our last cell where we left off,
31

32
00:02:47,380 --> 00:02:48,790
I want to pose a challenge to you.
32

33
00:02:49,780 --> 00:02:57,070
I want you to create a dataframe that contains all the emails from the non-spam directories and then
33

34
00:02:57,070 --> 00:03:04,430
I want you to also print out the shape of this dataframe and then we'll take it from there.
34

35
00:03:04,570 --> 00:03:07,950
So pause the video and give that a shot.
35

36
00:03:08,020 --> 00:03:15,080
Create a dataframe with all the non-spam emails similar to what I've done for the spam emails.
36

37
00:03:15,240 --> 00:03:16,340
Did you have a go?
37

38
00:03:16,710 --> 00:03:16,980
All right.
38

39
00:03:16,980 --> 00:03:18,540
Here's the solution.
39

40
00:03:18,930 --> 00:03:20,430
"ham_emails"
40

41
00:03:20,580 --> 00:03:23,680
is gonna be what I'm going to call my dataframe.
41

42
00:03:23,790 --> 00:03:30,600
I'm going to use my df_from_directory function and I'm going to point it to
42

43
00:03:30,600 --> 00:03:36,880
"EASY_NONSPAM_1_PATH" and use the ham category. After that,
43

44
00:03:37,030 --> 00:03:39,950
I'm also going to do the same thing I did before.
44

45
00:03:40,060 --> 00:03:49,190
I'm going to use my ham_emails dataframe and I'm going to append the df_from_directory, point it
45

46
00:03:49,190 --> 00:03:59,190
to "EASY_NONSPAM_2_PATH" and also use the ham category. Finally, we said we'd print out the shape, right?
46

47
00:03:59,200 --> 00:04:06,730
So "ham_emails.shape" should give us what we're looking for.
47

48
00:04:07,010 --> 00:04:08,320
Hitting Shift+Enter,
48

49
00:04:08,330 --> 00:04:09,850
let's see what we get.
49

50
00:04:09,920 --> 00:04:16,880
So I'm getting 3902 files being appended to this dataframe.
50
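
The challenge solution can be tried end to end without the real email folders. The sketch below is built on assumptions: df_from_directory here is a simplified, hypothetical stand-in that stores each whole file as the message (the function from the earlier lessons extracts only the email body), the two temporary folders play the role of the two non-spam directories, and the column names MESSAGE and CATEGORY are assumed.

```python
import os
import tempfile

import pandas as pd

HAM_CAT = 0  # category label for non-spam


def df_from_directory(path, classification):
    """Hypothetical stand-in: one row per file, indexed by file name."""
    rows, row_names = [], []
    for file_name in sorted(os.listdir(path)):
        with open(os.path.join(path, file_name), encoding="latin-1") as stream:
            rows.append({"MESSAGE": stream.read(), "CATEGORY": classification})
        row_names.append(file_name)
    return pd.DataFrame(rows, index=row_names)


with tempfile.TemporaryDirectory() as root:
    # Two throwaway folders with 3 and 2 made-up files.
    ham_1_path = os.path.join(root, "ham_1")
    ham_2_path = os.path.join(root, "ham_2")
    for folder, count in ((ham_1_path, 3), (ham_2_path, 2)):
        os.makedirs(folder)
        for i in range(count):
            file_name = f"{os.path.basename(folder)}_{i}.txt"
            with open(os.path.join(folder, file_name), "w", encoding="latin-1") as stream:
                stream.write(f"Meeting notes {i}")

    # Same two steps as in the lesson: load the first folder, then add the second.
    ham_emails = df_from_directory(ham_1_path, HAM_CAT)
    ham_emails = pd.concat([ham_emails, df_from_directory(ham_2_path, HAM_CAT)])

print(ham_emails.shape)
```

pd.concat is used in place of the append call from the video because DataFrame.append is no longer available from pandas 2.0 onwards.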

51
00:04:16,910 --> 00:04:17,870
Brilliant.
51

52
00:04:17,870 --> 00:04:25,010
Now what we can do is we can get a dataframe that holds onto all our emails, both spam and non-spam.
52

53
00:04:25,010 --> 00:04:27,680
So I'm just gonna call this dataframe "data".
53

54
00:04:27,890 --> 00:04:30,460
Got a lot of imagination as you can tell.
54

55
00:04:30,560 --> 00:04:41,240
And I'm going to use pandas' concat function, so "pd.concat([spam_emails,
55

56
00:04:41,580 --> 00:04:52,760
ham_emails])", then I'll add a print statement that reads "Shape of entire dataframe is", and I'll print
56

57
00:04:52,760 --> 00:04:56,950
out "data.shape" and on the next line
57

58
00:04:57,050 --> 00:05:00,690
let's take a look at the head of this dataframe,
58

59
00:05:00,690 --> 00:05:06,280
so the first five rows. Now let me hit Shift+Enter and print this out.
59

60
00:05:06,380 --> 00:05:13,280
What we see is that this dataframe has 5800 rows and 2 columns.
60
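
Combining the two dataframes and printing the shape and head could look like the sketch below. The tiny spam_emails and ham_emails dataframes are made-up stand-ins for the real ones, and the column names MESSAGE and CATEGORY are assumed.

```python
import pandas as pd

SPAM_CAT, HAM_CAT = 1, 0

# Made-up stand-ins for the dataframes built earlier in the lesson.
spam_emails = pd.DataFrame(
    {"MESSAGE": ["Win a prize", "Cheap loans", "Free pills"], "CATEGORY": SPAM_CAT},
    index=["spam_a.txt", "spam_b.txt", "spam_c.txt"],
)
ham_emails = pd.DataFrame(
    {"MESSAGE": ["Lunch on Friday?", "Minutes attached"], "CATEGORY": HAM_CAT},
    index=["ham_a.txt", "ham_b.txt"],
)

# pd.concat stacks the rows of each dataframe in the order given.
data = pd.concat([spam_emails, ham_emails])

print("Shape of entire dataframe is", data.shape)
print(data.head())
```

Because the rows keep the order of the list passed to pd.concat, the spam rows come first and the ham rows follow.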

61
00:05:13,820 --> 00:05:15,110
Just like before,
61

62
00:05:15,110 --> 00:05:19,800
I've got the file names here as an index.
62

63
00:05:19,900 --> 00:05:27,880
I've got my category showing whether I've got spam or non-spam and I've got the message body here in
63

64
00:05:27,880 --> 00:05:29,910
the message column.
64

65
00:05:30,070 --> 00:05:37,260
If you're curious where the non-spam emails are hiding in our dataframe, they're gonna be in the tail.
65

66
00:05:37,300 --> 00:05:44,080
So here we have a couple of category zero non-spam emails hiding out.
66
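
To look at the non-spam rows, tail works because the ham dataframe was placed last in the concat call. A small sketch with made-up rows and assumed column names, which also shows filtering on the category column as a way that does not depend on row order:

```python
import pandas as pd

# Made-up rows: two spam (category 1) followed by two ham (category 0).
spam_emails = pd.DataFrame(
    {"MESSAGE": ["Win a prize", "Cheap loans"], "CATEGORY": 1},
    index=["spam_a.txt", "spam_b.txt"],
)
ham_emails = pd.DataFrame(
    {"MESSAGE": ["Lunch on Friday?", "Minutes attached"], "CATEGORY": 0},
    index=["ham_a.txt", "ham_b.txt"],
)
data = pd.concat([spam_emails, ham_emails])

# The last rows are the ham emails, since they were concatenated last.
last_rows = data.tail(2)
print(last_rows)

# Filtering on the category picks out the ham rows wherever they sit.
ham_only = data[data["CATEGORY"] == 0]
print(ham_only.shape)
```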

67
00:05:44,080 --> 00:05:45,070
All right.
67

68
00:05:45,070 --> 00:05:46,390
That's it.
68

69
00:05:46,420 --> 00:05:54,670
We've basically taken 5800 files from our local disk and we've converted them
69

70
00:05:55,240 --> 00:05:57,820
into a pandas dataframe.
70

71
00:05:57,820 --> 00:06:04,460
We've converted them into a format that we can manipulate and work with in our Python code.
71

72
00:06:04,480 --> 00:06:08,170
So I think that's quite an achievement. I'll see you in the next lesson.
72

73
00:06:08,170 --> 00:06:08,710
Take care.