We're going to group our words by email, and to do that we'll use the pandas groupby method. Let me add a markdown cell here that reads "Combine Occurrences with the Pandas groupby Method". Now, I'm guessing that you will have used Microsoft Excel at some point in the past, and in Excel there is a very powerful feature called a pivot table. The groupby method works in a very similar way: it will allow us to group the occurrences by document ID, word ID and label, and then we can sum up the occurrences. Let me show you.

I'll write "train_grouped = sparse_train_df.groupby()" and provide a list inside the parentheses: square brackets, single quotes, "['DOC_ID', 'WORD_ID', 'LABEL']". At the very end we're going to chain another method, namely the summation, so ".sum()" will add up our occurrences after they've been grouped. But seeing is believing, so let me show you what this looks like: "train_grouped.head()" will show us the result.

Here we go. What we see is that for the document with ID 0, our first document, we've got a bunch of words grouped together by their IDs. The word with ID 0 occurs twice in this first email. That's all this table is showing us.

Now you might ask, "All right, but what is the word with ID 0?" We can pull that up. We can go to our vocabulary and write "vocab.at[0, 'VOCAB_WORD']". 'VOCAB_WORD', if you recall, was the column name in our dataframe, and 0 is the index value, which corresponds to our word ID. The word from our vocabulary that occurs twice is "http". Why does it occur twice? Well, it's because there are two hyperlinks in the original email, the one with document ID 0. We can pull that email up with "data.MESSAGE[0]", and it reads "Dear homeowner... Interest rates are at their lowest level... blah blah". If I look further down in the text, I see the first hyperlink here and the second hyperlink here. This is why the word "http" appears twice in this document.
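Putting those dictated pieces together, here is a minimal sketch of this step, assuming sparse_train_df, vocab and data carry over from the earlier lessons with the column names used in this course:

```python
# Assumes sparse_train_df has columns DOC_ID, WORD_ID, LABEL, OCCURENCE,
# vocab is a dataframe indexed by word ID, and data holds the raw emails.
train_grouped = sparse_train_df.groupby(['DOC_ID', 'WORD_ID', 'LABEL']).sum()
train_grouped.head()

# Sense check: look up which word has ID 0 in the vocabulary...
vocab.at[0, 'VOCAB_WORD']   # 'http'

# ...and inspect the original email to see why it occurs twice.
data.MESSAGE[0]
```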
So our groupby call combined with the summation method seems to have worked really well. The one thing I would quite like, though, is to have less of this pivot-table feel and to repeat the document ID on every single row. We can do that with "train_grouped = train_grouped.reset_index()", so we're simply overwriting it. "reset_index" will make the document ID appear on every single row, and "train_grouped.head()" will show you exactly that. Fantastic.

Let's take a quick look at the tail of this dataframe as well. "train_grouped.tail()" gives us this result, and we're going to do the same very quick sense check on it. In particular, let's see which word corresponds to ID 1923; it appears to occur twice in this email. Now, you can either work ahead or follow along with me, but we've done this already: "vocab.at[1923, 'VOCAB_WORD']". This gives us the result "welch", which is a very odd word. It doesn't quite look like a real word, but it could be a stemmed word, so maybe that's why it seems a bit strange. Let's pull up the actual message and see why this word appears twice: "data.MESSAGE[5795]". This brings up quite a short email, and it turns out that "welch" is actually a name. It's the last name of this guy Brent Welch, a software architect, and the word appears again in his email address at the very end of the message. So that's why it's here. I'm quite happy with this; it passes the sense check.

The only thing I'd be curious to find out now is how big a reduction in the number of rows we've achieved. If I write "train_grouped.shape", I can see that we've shrunk our dataframe quite a bit: we've gone from around 450,000 rows to approximately 265,000. That's still a lot, but it's about a 40% reduction.
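Again as a sketch, with the same assumptions as above, the flattening and the sense checks look like this:

```python
# Flatten the grouped index so DOC_ID, WORD_ID and LABEL
# appear as ordinary columns on every row.
train_grouped = train_grouped.reset_index()
train_grouped.head()
train_grouped.tail()

# Sense check on the tail: word 1923 is 'welch', a surname
# that appears twice in email 5795.
vocab.at[1923, 'VOCAB_WORD']
data.MESSAGE[5795]

# How much smaller is the dataframe after grouping?
train_grouped.shape   # roughly 265,000 rows, down from about 450,000
```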
I think this puts us in a really good place to save our work, so let's do that now. I'll add a very quick markdown cell here and call this one "Save Training Data as .txt File". In previous lessons we've saved our files as a .json file and as a .csv file; that's how we saved our work to disk before. Now let's save a file as a plain text file, and for this we're going to use numpy's functionality. But before we do that, we're going to need a relative file path at the top of our notebook, so that it sits nicely with its friends. Now, I'm planning to save this file in a slightly different folder, but first let's give it a name. I'm going to call the constant "TRAINING_DATA_FILE" and set it equal to "SpamData/02_Training/train-data.txt". I'll be adding our text file to this folder right here. Now, be sure to hit Shift+Enter on the cell with your constants, then join me down at the bottom of the notebook.

There we write "np.savetxt(TRAINING_DATA_FILE, train_grouped, fmt='%d')". If I hit Shift+Tab on my keyboard to bring up the quick documentation, we see that the first argument is the file name, including the relative file path. The second argument is the data, and the third argument is "fmt", which stands for format. If I hit the plus sign here and scroll down, I can see that the format argument takes a string or a sequence of strings; it essentially lets us specify the number format. Lucky for us, we're only dealing with integers.

If I bring up my folder side by side and hit Shift+Enter now, I should see my text file appear right here. Before I open it and peek inside, let me show you what the columns are called in the Jupyter notebook: "train_grouped.columns" brings up our column names, namely "DOC_ID", "WORD_ID", "LABEL" and "OCCURENCE".

Now let's look at the text file. If I open it in my text editor, I can see the four columns clearly laid out. The first number is the document ID, the second is the word ID, and the third is the category or label, so the first message is in fact a spam email. The fourth number is the occurrence of the word with that ID. Looking at line 16 here, the word with ID 105 occurs twice in this spam email.
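As a sketch of the saving step, with TRAINING_DATA_FILE defined among the constants at the top of the notebook:

```python
import numpy as np

# Relative path for the grouped training data.
TRAINING_DATA_FILE = 'SpamData/02_Training/train-data.txt'

# Write the four integer columns as plain text;
# fmt='%d' formats every value as a whole number.
np.savetxt(TRAINING_DATA_FILE, train_grouped, fmt='%d')

train_grouped.columns   # DOC_ID, WORD_ID, LABEL, OCCURENCE
```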
So I think that almost wraps it up, except we haven't done this for our test data yet, and that's where I want to throw it over to you. As a challenge, can you create a sparse matrix for the test data and then group all the occurrences of the same word in the same email together, just like we did with the training data? After you've done all that, save the data as a .txt file. Now, I realize you're going to have to save that data somewhere and give it a file name, so let's do that right now, so that you and I have the same file names going forward. Scrolling up to our constants, I'm just quickly going to copy this relative path, paste it in, change the file name to "test", and change the constant name from "TRAINING" to "TEST" as well. So "TEST_DATA_FILE" is equal to this relative path, file name and extension right here. Now I don't have to ask you to pause the video, because I'm going to show you the solution in the next lesson. I'll see you there.
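For reference, here is a sketch of the new constant plus the rough shape of the challenge, mirroring the training steps (sparse_test_df is an assumed name for the test-data sparse matrix; the worked solution follows in the next lesson):

```python
# New constant alongside TRAINING_DATA_FILE in the constants cell.
TEST_DATA_FILE = 'SpamData/02_Training/test-data.txt'

# Challenge outline, assuming sparse_test_df is built the same
# way as sparse_train_df was for the training data.
test_grouped = sparse_test_df.groupby(['DOC_ID', 'WORD_ID', 'LABEL']).sum()
test_grouped = test_grouped.reset_index()
np.savetxt(TEST_DATA_FILE, test_grouped, fmt='%d')
```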