Okay, so previously we've split and shuffled our data and put everything into a dataframe. Now let's create that sparse matrix. To do that, we will use three things: our X_train dataframe, our "y_train" pandas series, and our vocabulary words. Remember those? They look like this. Our vocabulary of 2,500 words is stored in a dataframe where the index holds the word IDs and the individual strings are in a column called VOCAB_WORD. Now, if we've got this dataframe here and we want to know which word has word ID number 3, we can find that really, really easily, and we've done this before, because all you'd have to specify is the index and the column, and then you'd get a string that reads 'email'. But say we know the word and we want to know the word ID. Now we're asking the question in reverse; we're asking it the other way around. For example, what is the word ID for the string "email"? An easy way to answer this question with some Python code is to create an index from this column here, the VOCAB_WORD column, and then look up the position of a particular string in the index. Let me show you what I mean. I'll add a quick markdown cell here that reads "Create a Sparse Matrix for the Training Data".
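That forward lookup (word ID to word) can be sketched like this. The dataframe below is a tiny made-up stand-in for the real 2,500-word vocabulary, arranged so that ID 3 maps to 'email' as in the lesson:

```python
import pandas as pd

# Toy stand-in for the vocabulary dataframe: word IDs in the index,
# the strings themselves in a column called VOCAB_WORD.
vocab = pd.DataFrame({'VOCAB_WORD': ['http', 'get', 'thu', 'email']},
                     index=pd.Index([0, 1, 2, 3], name='WORD_ID'))

# Forward lookup: which word has word ID number 3?
# Specify the index value and the column name.
print(vocab.at[3, 'VOCAB_WORD'])  # 'email'
```

The `.at` accessor is the fast scalar lookup for a single (index, column) pair.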
Now, to turn a particular column of our dataframe into an index, all we have to do is select our dataframe, select the column, wrap that whole thing in parentheses, and put it inside "pd.Index". This will create an index from a particular column in a dataframe. Let's store this in a variable called "word_index", so "word_index" is equal to "pd.Index(vocab.VOCAB_WORD)". Now we know we're dealing with a pandas index, because the type of "word_index" is "pandas.core.indexes.base.Index", and this index is composed of individual strings like "http", "email", "get" and so on, the ones we saw earlier. And I can verify this if I, say, pull up the fourth word, at index position number 3, and check the type of this word. There you can see that we are indeed dealing with strings inside our index. If I come back up here and take a look at the first email in our X_train dataframe, I can see that it's probably the stemmed word for "thursday", right, "thu". Now, if I wanted to know the word ID for "thu" in our word index, I can simply take the index and use the "get_loc" method with "thu" passed in as an argument. And I see that this word is at position number 395.
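Here is that reverse lookup as a small self-contained sketch, again with a toy four-word vocabulary standing in for the real one (so the positions here won't match the 395 from the lesson):

```python
import pandas as pd

# Toy stand-in for the vocabulary dataframe used in the lesson.
vocab = pd.DataFrame({'VOCAB_WORD': ['http', 'get', 'thu', 'email']})

# Turn the VOCAB_WORD column into a pandas Index.
word_index = pd.Index(vocab.VOCAB_WORD)

# Reverse lookup: at what position does a given string sit?
print(word_index.get_loc('email'))  # 3

# The entries of the index are plain Python strings.
print(type(word_index[3]))  # <class 'str'>
```

`Index.get_loc` returns the integer position of a label, which is exactly what we use as the word ID.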
So here's what we're going to do: we're going to take our X_train dataframe and use the information contained therein to create our sparse matrix. Let's walk through how we would do it with the first row that I've shown here. The document ID for this row is 4844. We can retrieve this information from X_train using its index, since our document IDs are stored as the index values. Since 4844 is the very first entry, the very first value in the index, we can just say "give us the value of the index at position 0". That's what we can use to fill in the document ID of the sparse matrix. Now what about the label, the category that this email belongs to? For that, we simply look towards the "y_train" pandas series. There, at the entry named 4844, we would either get a 1 or a 0. This email is actually a non-spam email, so it would have the label 0. Now let's tackle that first stemmed word, "thu" for "thursday". Here our word index comes into play. We can get the word ID for this string "thu" from our word index using that get_loc method. And as we saw earlier, "thu" is at position number 395. For that last column in the sparse matrix we will simply add a 1, and that's simply because we've counted one occurrence. In fact, on our first pass we'll add a 1 for every occurrence; we'll combine the occurrences later. Let's move on to the next word, "jul", short for "july". The document ID and the label of course stay the same, 4844 and non-spam, and then we simply use our word index again and get the location for this particular string. This string has the word ID 494, and again it occurs a single time. Now, because we've actually saved all our word IDs as a CSV file, we can verify the word IDs in Microsoft Excel or Google Sheets. If I double-click on this file and scroll down to, what do we see, position 494, then I see my stemmed word right here. Next up is that third word, "rodent". So let's see what happens when we check whether the word "rodent" is part of our word index. If we do this, "word_index.get_loc('rodent')", we will actually get an error, and that's because the word "rodent" doesn't occur frequently enough to have made it into our vocabulary. In other words, the word "rodent" will not be added to our sparse matrix. Instead, we move on to the next word. Our next word is in fact this one right here. Checking our index, we find that the word ID is 2386. This is essentially the workflow of how we will build up our sparse matrix. We're going to put all of this work into a loop and then wrap all of that into a function. So let's get on it.
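The per-word workflow just described can be sketched in a few lines. The vocabulary, document ID and word list below are illustrative stand-ins for the real X_train row, so the word IDs are small toy numbers rather than 395 and 494:

```python
import pandas as pd

word_index = pd.Index(['http', 'get', 'thu', 'jul'])  # toy vocabulary
y_train = pd.Series({4844: 0})                        # doc 4844 is non-spam
words_in_doc = ['thu', 'jul', 'rodent']               # stemmed words from one email

word_set = set(word_index)  # fast membership checks
rows = []
for word in words_in_doc:
    # 'rodent' is not in the vocabulary, so it produces no row.
    if word in word_set:
        rows.append({'LABEL': y_train.at[4844],
                     'DOC_ID': 4844,
                     'OCCURENCE': 1,  # spelling follows the lesson's column name
                     'WORD_ID': word_index.get_loc(word)})

print(len(rows))  # 2 -- only 'thu' and 'jul' made it in
```

Checking membership first avoids the error that `get_loc` raises for out-of-vocabulary words like "rodent".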
Here's what our function will look like. As always, we're going to start out with our "def" keyword to define our function. We'll call this function "make_sparse_matrix", a very imaginative name, but it's very clear. I reckon this function should take three inputs: a dataframe; an index for the word IDs, so let's call this one "indexed_words"; and third, the labels, namely the y values. So that'll be our third input. Put a colon at the end. Now we can add a quick description of what this function should do: three double quotes for a docstring, and we'll provide a very quick description. "Returns sparse matrix as dataframe." Our inputs are going to be as follows: "df" is a dataframe with words in the columns and a document id as an index (X_train or X_test). The "indexed_words" parameter is going to be an index of words ordered by word ID. The labels should be the category as a series, in other words y_train or y_test. I think that'll do for our docstring. Now let's add the body of the function.
Now, as you can see from the docstring, this function should work regardless of whether we feed in the X_train dataframe or the X_test dataframe, so let's capture the dimensions of the incoming dataframe ahead of time. I'll create a variable for the number of rows, "nr_rows", and that'll be equal to "df.shape[0]", and the number of columns, "nr_cols", is going to be equal to "df.shape[1]". So that's that. Now, within the body of this function, I know that I'm going to be doing a lot of lookups. I'm going to be checking if the words in the dataframe are part of our vocabulary list. There are a lot of checks that are gonna be running as part of our loop, so I want to be working with a data structure called a Python set, as you recall. So I'll say "word_set = set(indexed_words)". Here I'm creating a Python set from the index that is being fed into this function as an argument. Now I'm going to add a nested loop, and within that loop I'm going to be adding dictionaries to a Python list. Let me write the outline of this loop first. I'll create my empty list, which I'll call "dict_list", with two square brackets, and at the very end of our function I'm going to return a pandas dataframe that is created from the list we're gonna be populating inside our loop.
Now, in between these two lines of code is gonna go the meat of our code. There'll be two loops: "for i in range(nr_rows)" will be the outer loop, and then there'll be an inner loop, "for j in range(nr_cols)". So we're gonna go through the dataframe that we're feeding in row by row and column by column. Within this inner loop, we're gonna be appending a dictionary to our list every time the loop runs. Here's how it's gonna work. The very first thing we're gonna do is get hold of a particular string, and by a particular string I mean the value in a particular cell, because we're gonna iterate through this dataframe row by row and column by column. To get hold of a particular word, we'll say "df.iat[i, j]". In other words, we'll be retrieving the word in the i-th row and the j-th column. Then we'll check if the word that we picked out of our dataframe is in our word set, and if it is, then we should fetch the document ID, the word ID and the category. The document ID is gonna be equal to the value of the index in the i-th row, so "df.index[i]". The word ID is going to be equal to "indexed_words.get_loc(word)".
"word" is a string, so we can feed it into our get_loc method to retrieve the position of this word in our index of words, and that will be our word ID. Now it's time to get the category. The category is gonna be our y values at, well, at the document ID. Right. The y values, we said, we'd feed in as this labels parameter here. So we'll say "labels.at[doc_id]". Now we've got the three things that we need, and from them we can create a little dictionary to put everything into one data structure. I'll call that "item = {'LABEL': category, 'DOC_ID': doc_id, 'OCCURENCE': 1, 'WORD_ID': word_id}". Here I've created a dictionary with four entries. The first one has the key LABEL and gets the y value, the category, spam or not spam. The second is our document ID, which gets the document ID that we've extracted here. The third is OCCURENCE, which is always gonna be equal to 1, because we're kind of doing a first pass on this, and every time we discover a word that's part of our vocabulary we'll add it to our dataframe. And the last one here is the word ID, which we've retrieved here. So now that we have a dictionary for a single item, what we can do is take our "dict_list" and append our item.
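Putting all the pieces together, the function walked through above looks roughly like this. This is a reconstruction from the spoken walkthrough, so the notebook's actual version may differ in small details:

```python
import pandas as pd

def make_sparse_matrix(df, indexed_words, labels):
    """
    Returns sparse matrix as dataframe.

    df: dataframe with words in the columns and a document id
        as an index (X_train or X_test)
    indexed_words: index of words ordered by word id
    labels: category as a series (y_train or y_test)
    """
    nr_rows = df.shape[0]
    nr_cols = df.shape[1]
    word_set = set(indexed_words)  # set gives O(1) membership checks in the loop
    dict_list = []

    for i in range(nr_rows):
        for j in range(nr_cols):
            word = df.iat[i, j]          # value in the i-th row, j-th column
            if word in word_set:         # skip words outside the vocabulary
                doc_id = df.index[i]     # document IDs live in the index
                word_id = indexed_words.get_loc(word)
                category = labels.at[doc_id]
                item = {'LABEL': category, 'DOC_ID': doc_id,
                        'OCCURENCE': 1, 'WORD_ID': word_id}
                dict_list.append(item)

    return pd.DataFrame(dict_list)
```

Each vocabulary word in each email becomes one row with OCCURENCE set to 1; duplicate occurrences get collapsed in a later step.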
So we're appending each dictionary that we create as this loop runs to our list, which starts off empty but gets populated, and the dataframe that we return from this function gets created using this list. Fantastic. So here's the whole function body in its entirety. Let me press Shift+Enter on this. And now let's try and run this baby. I'm going to scroll down into the next cell, and the first thing I'll do is actually add some micro-benchmarking code. So, "%%time". This will time how long this cell takes to run. Now I'm going to store the result of this function call, this dataframe, in a variable called "sparse_train_df", and I'll set that equal to "make_sparse_matrix" and, you guessed it, I'm going to give it the training data, right: "X_train, word_index, y_train". Now let me hit Shift+Enter and let's see what happens. This cell can take quite a long time to execute. It's processing a lot of data, going through a dataframe that has thousands of rows and thousands of columns, one entry at a time. On the machine that I'm currently on, this cell takes between 5 and 10 minutes to run.
So I typically step away and grab a croissant or a coffee or something and come back when it's done. And I really encourage you to do the same; there's no point in waiting around. This is actually one of those times where you'll see a dramatic performance difference depending on whether or not you're using a set data structure. The check in our inner loop runs thousands of times, so any minimal difference in the time that this lookup, this check, takes will build up to quite a significant amount of time. If we go back up to our constants, you will spot another thing that really determines the size of the dataset we're working with. Yes, we imported approximately 5,800 emails that we parsed and so on, but one of the key inputs, one of the key constraints that we set, is actually our vocabulary size. We set this at 2,500, and this vocabulary size will determine how big a matrix we end up with at the end. The reason I picked 2,500 is because it's relatively large, so it's going to make our computer work quite hard, but it's nowhere near the size you'd use for a commercial spam filter built on a naive Bayes model.
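To see why the set matters so much, here is a rough, hypothetical micro-benchmark. The vocabulary below is made-up filler words, and the absolute timings will vary by machine, but the relative gap is the point:

```python
import timeit

# A fake 2,500-word vocabulary, as a list and as a set.
vocab_list = [f'word{n}' for n in range(2500)]
vocab_set = set(vocab_list)

# Membership in a list is O(n): Python scans entries until it finds a match,
# and a word near the end forces a scan of nearly all 2,500 entries.
list_time = timeit.timeit(lambda: 'word2499' in vocab_list, number=10_000)

# Membership in a set is O(1) on average: a single hash lookup.
set_time = timeit.timeit(lambda: 'word2499' in vocab_set, number=10_000)

print(f'list: {list_time:.4f}s   set: {set_time:.4f}s')
```

Multiplied by the hundreds of thousands of checks our nested loop performs, that per-lookup difference is the gap between minutes and much longer.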
If we were operating Hotmail or Gmail or something and we had to build this naive Bayes based classifier, we would typically set our vocabulary size at 10,000 to 50,000 words, and you can imagine how much data we would have to crunch and how long we would have to run our machines for. All right, I'm pretty chuffed to see that it's all done now: 6 minutes and 28 seconds. This means that we can take a look at the results and see if they make sense. Let's take a look at the first five rows, so "sparse_train_df[:5]" will give me the first five rows, and here I can see my word IDs. Each of these words occurs only once, and all of these words occur in email number 4844, which is a non-spam email. Let's take a look at the shape of this dataframe now, so "sparse_train_df.shape". Shift+Enter shows us that we've got approximately 450,000 rows in this dataframe. That's an absolutely huge amount, almost half a million. This is one of the reasons why this whole thing took a good 6 minutes to run on my machine. The last five rows of this dataframe look like this: "sparse_train_df[-5:]". Here you go. All of these rows pertain to email number 860.
Now, one of the reasons why there are 450,000 rows in this dataframe is that we've put each and every single word from X_train into a separate row. So if the word "thursday" occurs twice in the same email, it gets two separate rows in this dataframe. What we're gonna do now is combine these occurrences. If a word occurs more than once in the same email, we should combine it in this dataframe; we should have an occurrence of two for that particular word ID.
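One way that combining step could look, sketched with a pandas groupby on a few made-up rows (the course may take a slightly different route in the next step):

```python
import pandas as pd

# Toy version of sparse_train_df: 'thu' (word ID 395) appears twice in doc 4844.
sparse_df = pd.DataFrame({'DOC_ID':    [4844, 4844, 4844],
                          'WORD_ID':   [395, 395, 494],
                          'LABEL':     [0, 0, 0],
                          'OCCURENCE': [1, 1, 1]})

# Summing OCCURENCE per (DOC_ID, WORD_ID, LABEL) collapses duplicate rows
# into one row per word per document, with the total count.
combined = sparse_df.groupby(['DOC_ID', 'WORD_ID', 'LABEL']).sum().reset_index()
print(combined)
```

After the groupby, word ID 395 occupies a single row with an OCCURENCE of 2, which is exactly the shape the naive Bayes calculations will want.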