0 1 00:00:00,270 --> 00:00:06,180 Throughout the next lessons I want to do the final steps of the data preparation for our Bayes' classifier 1 2 00:00:06,630 --> 00:00:13,830 and also show you a common format for data sets that you'll often encounter out there in the wild. 2 3 00:00:14,040 --> 00:00:19,530 Over the next lessons we'll be writing some Python code to create our feature vectors, but not only that, 3 4 00:00:19,980 --> 00:00:23,850 we'll create our features as a sparse matrix. 4 5 00:00:23,850 --> 00:00:25,420 What do I mean by that? 5 6 00:00:25,440 --> 00:00:28,140 Well, let's take it one step at a time. 6 7 00:00:28,320 --> 00:00:31,050 Let's consider a full matrix first. 7 8 00:00:31,080 --> 00:00:38,110 In that case, we have our document IDs in one column and then we'll have our words in another column. 8 9 00:00:39,020 --> 00:00:44,490 Now of course these will be the word IDs, but for illustration, I've written an actual word here on this 9 10 00:00:44,490 --> 00:00:45,830 slide. 10 11 00:00:45,830 --> 00:00:52,490 Then, in the third column, we'll have our Label. Our Label will be equal to 1 if the email is spam 11 12 00:00:52,940 --> 00:00:54,680 and it will be equal to 0 12 13 00:00:54,710 --> 00:00:56,740 if our email is non-spam. 13 14 00:00:56,960 --> 00:01:04,280 So in this case, email number 5795 is a non-spam email and the 14 15 00:01:04,280 --> 00:01:05,840 label is equal to 0. 15 16 00:01:06,110 --> 00:01:13,190 In the last column, where it says Occurrence, we will capture how often the word in the Word column 16 17 00:01:13,400 --> 00:01:15,040 appears in the email. 17 18 00:01:15,080 --> 00:01:20,480 So if the word "free" appears 3 times, then occurrence will be equal to 3. 18 19 00:01:20,510 --> 00:01:22,010 Makes sense, right?
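To make this layout concrete, here's a tiny runnable sketch of such a full-matrix table. The words, counts, and the single document ID are made up for illustration; the real table would use word IDs from the vocabulary rather than the words themselves.

```python
import pandas as pd

# Toy illustration of the "full matrix" layout from the slide: one row per
# (document, word) pair. Email 5795 is non-spam, so LABEL is 0 throughout.
full_matrix = pd.DataFrame({
    'DOC_ID':     [5795, 5795, 5795, 5795],
    'WORD':       ['free', 'mortgage', 'pay', 'meeting'],
    'LABEL':      [0, 0, 0, 0],   # 1 = spam, 0 = non-spam
    'OCCURRENCE': [3, 0, 0, 1],   # how often the word appears in the email
})
print(full_matrix)
```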
19 20 00:01:22,010 --> 00:01:29,330 The reason that this kind of table is called a full matrix is because for each email, there's an entry 20 21 00:01:29,450 --> 00:01:30,890 for each word, 21 22 00:01:30,890 --> 00:01:38,390 even if that word doesn't occur in the email. The full matrix has an entry for each word in the vocabulary 22 23 00:01:38,780 --> 00:01:41,360 for each and every single email. 23 24 00:01:41,430 --> 00:01:45,110 Now we set our vocabulary size at 2500, 24 25 00:01:45,140 --> 00:01:52,430 remember? Therefore for each document ID, there will be 2500 entries, many of 25 26 00:01:52,430 --> 00:01:54,450 which will be 0. 26 27 00:01:54,530 --> 00:01:59,700 And this is where the sparse matrix comes in. With a sparse matrix, 27 28 00:01:59,760 --> 00:02:03,320 we will remove the entries which are 0. 28 29 00:02:03,790 --> 00:02:12,250 And that means we actually only include the rows which have a word that occurs in the email. 29 30 00:02:12,270 --> 00:02:18,750 In this example, the word "mortgage" and the word "pay" do not appear in email number 30 31 00:02:18,750 --> 00:02:20,150 5795, 31 32 00:02:20,190 --> 00:02:28,300 therefore those rows will not be present in the sparse matrix. So you can see how the sparse matrix is 32 33 00:02:28,310 --> 00:02:32,480 just a compressed version of the full matrix with fewer rows. 33 34 00:02:32,480 --> 00:02:36,390 Now let's head over to the Jupyter notebook and write some Python code. 34 35 00:02:36,620 --> 00:02:41,410 First, let me add a markdown cell to document what we're gonna do. 35 36 00:02:41,570 --> 00:02:52,040 We're going to generate features and a sparse matrix and the first part of that is going to be creating 36 37 00:02:52,160 --> 00:02:59,240 a dataframe with one word per column. 37 38 00:02:59,520 --> 00:03:02,340 Now I say creating a dataframe with one word per column, 38 39 00:03:02,540 --> 00:03:08,900 but what are we creating the dataframe from? Where are we at in terms of our data?
39 40 00:03:08,900 --> 00:03:16,980 We're going to be working with our stemmed nested list. Our stemmed nested list looks like this, 40 41 00:03:17,060 --> 00:03:21,230 so we've got a list of words for each document, for each email. 41 42 00:03:21,770 --> 00:03:29,310 The thing about this is that our stemmed nested list is a series, right? 42 43 00:03:29,720 --> 00:03:34,790 It's a pandas series that holds on to individual lists. 43 44 00:03:34,790 --> 00:03:41,360 So if I want to access the email at position 2 then I would see that even though our stemmed nested 44 45 00:03:41,360 --> 00:03:47,240 list as a whole is a pandas series, it contains individual lists. 45 46 00:03:47,450 --> 00:03:53,270 So it's actually a series of lists; each entry is a list. That makes it a little bit of an unwieldy data 46 47 00:03:53,270 --> 00:03:56,420 structure, but it's one we're gonna work with, 47 48 00:03:56,420 --> 00:04:04,130 and what we're gonna do is we're going to convert it from a series containing lists to a list 48 49 00:04:04,370 --> 00:04:05,840 containing lists. 49 50 00:04:05,840 --> 00:04:11,330 And the reason for doing that is that there exists a very, very handy method to convert a list of lists 50 51 00:04:11,450 --> 00:04:13,090 into a dataframe. 51 52 00:04:13,160 --> 00:04:20,810 So let me quickly copy and paste this cell and show you the method we're gonna use to convert our series 52 53 00:04:20,810 --> 00:04:21,850 to a list. 53 54 00:04:22,130 --> 00:04:29,090 And it's simply called "to_list". Our pandas series has a method called "to_list" which converts the whole 54 55 00:04:29,090 --> 00:04:31,460 thing to a Python list. 55 56 00:04:31,490 --> 00:04:32,450 Fair enough.
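As a tiny stand-in for the lesson's stemmed nested list (the real one comes from the earlier preprocessing steps), the "series of lists" structure and the to_list() conversion look like this:

```python
import pandas as pd

# Toy stand-in for the lesson's stemmed nested list: a pandas Series whose
# entries are lists of (already stemmed) words, one list per email.
stemmed_nested_list = pd.Series([
    ['free', 'offer', 'free'],
    ['meet', 'tomorrow'],
    ['claim', 'prize', 'now'],
])

# Each individual entry is a plain Python list...
print(type(stemmed_nested_list[2]))

# ...and to_list() converts the whole Series into a list of lists.
nested = stemmed_nested_list.to_list()
print(type(nested))
```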
56 57 00:04:32,630 --> 00:04:37,640 The way this looks is what you might expect, namely two square brackets, 57 58 00:04:37,670 --> 00:04:41,520 and then in each pair of square brackets you have an individual email, 58 59 00:04:41,520 --> 00:04:42,830 so the first email ends here, 59 60 00:04:42,830 --> 00:04:44,800 the second one starts here. 60 61 00:04:44,840 --> 00:04:45,370 It's... 61 62 00:04:45,850 --> 00:04:51,650 It's kind of messy actually, but we don't have to worry about this so much, because we're going to create 62 63 00:04:51,740 --> 00:04:56,570 a pandas dataframe using this code here. 63 64 00:04:56,720 --> 00:05:04,560 And the way we're gonna do that is with "pd.DataFrame", but instead of just adding the parentheses and enclosing 64 65 00:05:04,560 --> 00:05:13,890 it like so, what we're gonna do is put a dot and call a method called "from_records", 65 66 00:05:14,130 --> 00:05:21,330 "from_records" that is. And this is where we're going to feed in our "stemmed_nested_ 66 67 00:05:21,350 --> 00:05:30,320 list.to_list()". Now before I hit Shift+Enter on this, let's store this whole thing, this whole 67 68 00:05:30,320 --> 00:05:39,470 dataframe, in a variable called "word_columns_df", "df" for dataframe and we'll 68 69 00:05:39,470 --> 00:05:47,730 set that equal to "pd.DataFrame.from_records" and then below, we're going to print out the head 69 70 00:05:47,880 --> 00:05:54,130 of this dataframe, so "word_columns_df.head()". 70 71 00:05:54,210 --> 00:06:00,750 Now let's take a look at the first five rows, and what we see here is as the index, we've got our document 71 72 00:06:00,780 --> 00:06:08,880 IDs, we'll add a label for this shortly, and then we have each word split up as an individual data point 72 73 00:06:09,330 --> 00:06:16,890 in a column and it looks like we have a total of 7661 columns. 73 74 00:06:17,660 --> 00:06:21,940 The overall shape of our dataframe looks like this.
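The from_records step can be sketched with the same toy nested list as before; the real call is identical, just fed with the full stemmed_nested_list.to_list():

```python
import pandas as pd

# Toy nested list standing in for stemmed_nested_list.to_list().
nested = [
    ['free', 'offer', 'free'],
    ['meet', 'tomorrow'],
    ['claim', 'prize', 'now'],
]

# from_records treats each inner list as one row. Shorter emails get padded
# with missing values, which is why the real frame ends up with as many
# columns as the longest email has words.
word_columns_df = pd.DataFrame.from_records(nested)
print(word_columns_df)
print(word_columns_df.shape)
```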
74 75 00:06:22,280 --> 00:06:28,970 We've got 5796 rows and 7661 75 76 00:06:29,540 --> 00:06:30,800 columns. 76 77 00:06:30,830 --> 00:06:32,230 Now here's a question for you. 77 78 00:06:32,480 --> 00:06:40,940 Why does the dataframe have this shape? Well, 5796 is the total 78 79 00:06:40,940 --> 00:06:48,440 number of emails that we have and 7661 is the number of words, stemmed 79 80 00:06:48,440 --> 00:06:56,210 words that is, in the longest email. And we know this to be true because we've worked this out in a previous 80 81 00:06:56,480 --> 00:07:09,990 exercise. Now it's time to take the next step and that is splitting the data into a training and testing 81 82 00:07:10,170 --> 00:07:11,340 dataset. 82 83 00:07:11,790 --> 00:07:14,540 It's time to shuffle and split the data, 83 84 00:07:14,580 --> 00:07:18,450 now that we've got our "word_columns_df". 84 85 00:07:18,860 --> 00:07:20,530 Now, we've actually done this before. 85 86 00:07:20,550 --> 00:07:24,310 So I want to throw this over to you as a challenge. 86 87 00:07:24,390 --> 00:07:31,470 Can you split the data into a training and testing dataset using scikit-learn? As you're doing this, 87 88 00:07:31,950 --> 00:07:35,820 set the test size at 30%. 88 89 00:07:35,820 --> 00:07:44,190 That means that the training data should include around 4057 emails and also as you're shuffling 89 90 00:07:44,430 --> 00:07:47,300 set the seed value to 42. 90 91 00:07:47,400 --> 00:07:53,460 And as you pause this video and try to solve this challenge, have a think about what the target values 91 92 00:07:53,670 --> 00:07:55,170 should be as you're doing this. 92 93 00:07:58,490 --> 00:07:58,890 All right. 93 94 00:07:58,890 --> 00:08:00,470 So here's the solution. 94 95 00:08:00,660 --> 00:08:05,830 We're gonna be using scikit-learn's "train_test_split" function to accomplish this.
95 96 00:08:05,910 --> 00:08:15,200 So we have to import this whole thing into our notebook, so we'll say "from sklearn.model_ 96 97 00:08:15,200 --> 00:08:16,390 selection"; 97 98 00:08:16,410 --> 00:08:18,540 this is where the whole thing lives; 98 99 00:08:18,720 --> 00:08:28,180 "import train_test_split". Hitting Tab on your keyboard after typing a few of the letters should help you 99 100 00:08:28,270 --> 00:08:32,570 avoid any typos on this relatively long import statement. 100 101 00:08:32,620 --> 00:08:40,160 Now hit Shift+Enter on this cell and let's continue where we left off at the bottom of the notebook. The 101 102 00:08:40,160 --> 00:08:41,550 "train_test_split" 102 103 00:08:41,570 --> 00:08:47,800 function will give us four outputs. So let's store them in four separate variables. 103 104 00:08:47,990 --> 00:08:55,620 The first one I'll call "X_train", the next one I'll call "X_test", then lowercase 104 105 00:08:55,620 --> 00:08:58,020 "y_train", 105 106 00:08:58,050 --> 00:08:58,720 comma 106 107 00:08:58,930 --> 00:09:01,250 "y_test". 107 108 00:09:01,250 --> 00:09:10,720 And that's gonna be equal to "train_test_split(word_columns_df)". 108 109 00:09:10,760 --> 00:09:12,830 This is going to be our first argument. 109 110 00:09:13,280 --> 00:09:19,160 Then we have to supply our y-values. The dataframe that we just created after all was just the features, 110 111 00:09:19,240 --> 00:09:25,610 right, the different words. The y-values that we're trying to predict are actually our categories and 111 112 00:09:25,610 --> 00:09:32,800 I'm going to grab those from our "data" dataframe. It has a column called CATEGORY and this is what I'll 112 113 00:09:32,950 --> 00:09:34,530 supply here. 113 114 00:09:35,020 --> 00:09:46,720 Next I'll set the test size to 0.3, so 30%, and then I'll set my "random_ 114 115 00:09:46,960 --> 00:09:51,140 state" to 42. 115 116 00:09:51,200 --> 00:09:54,500 This is where I'm specifying a seed value. 
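Collected into one runnable cell, the split looks like this. The small word_columns_df and the "categories" series below are toy stand-ins for the real features dataframe and the CATEGORY column of the data dataframe, so the sketch runs on its own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for word_columns_df and data.CATEGORY (1 = spam, 0 = non-spam).
word_columns_df = pd.DataFrame.from_records([
    ['free', 'offer'], ['meet', None], ['prize', 'now'], ['lunch', None],
    ['win', 'cash'], ['report', 'due'], ['sale', 'ends'], ['call', 'me'],
    ['cheap', 'meds'], ['hi', None],
])
categories = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], name='CATEGORY')

# 30% held out for testing; random_state=42 is the seed, so the shuffle
# is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    word_columns_df, categories, test_size=0.3, random_state=42)

print(X_train.shape[0], X_test.shape[0])   # 7 training rows, 3 test rows
print(X_train.shape[0] / word_columns_df.shape[0])   # training fraction: 0.7
```

With the real data, the same call with test_size=0.3 leaves roughly 70% of the 5796 emails, about 4057, in the training set.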
116 117 00:09:54,500 --> 00:10:01,460 And if you and I specify the same seed value, we'll get the same results, we'll get exactly the same shuffle. 117 118 00:10:01,460 --> 00:10:08,770 So let me run this and then we'll look at our analytics. First, let me print out the number of training 118 119 00:10:08,770 --> 00:10:17,690 samples and I'll print out "X_train.shape[ 119 120 00:10:18,000 --> 00:10:18,730 0]". 120 121 00:10:18,760 --> 00:10:25,090 This is the number of training samples and we've got 70% of the total, which is 121 122 00:10:25,150 --> 00:10:28,340 4057. Next, 122 123 00:10:28,930 --> 00:10:36,480 let's just verify the fraction of the training set. So the fraction of the training set 123 124 00:10:36,680 --> 00:10:44,690 is gonna be this number, 4057 divided by the total number of entries, say in 124 125 00:10:44,690 --> 00:10:46,840 our features dataframe. 125 126 00:10:46,880 --> 00:10:59,370 So this was "word_columns_df.shape[0]" and that's 70% or very close 126 127 00:10:59,370 --> 00:11:07,440 to that, meaning the test set is going to be one minus this number, which is 30%, which we've specified 127 128 00:11:07,620 --> 00:11:08,850 here. 128 129 00:11:08,850 --> 00:11:11,290 Now let's take a look at what we've actually got. 129 130 00:11:11,370 --> 00:11:16,200 So I'll take "X_train.head()". 130 131 00:11:16,200 --> 00:11:22,330 These are the first five rows of our shuffled "word_columns_df". 131 132 00:11:22,400 --> 00:11:26,230 Now I said that these numbers here would refer to the index. 132 133 00:11:26,240 --> 00:11:26,470 Right. 133 134 00:11:26,480 --> 00:11:30,690 Our document IDs, and we can add this label very, very easily. 134 135 00:11:30,950 --> 00:11:40,610 So we'll say "X_train.index.name = 'DOC_ 135 136 00:11:40,610 --> 00:11:44,290 ID'", Shift+Enter, 136 137 00:11:44,610 --> 00:11:51,210 we'll have our index name show up in the output right here.
But say we wanted to add this index name 137 138 00:11:51,720 --> 00:11:55,830 to both the training dataset and the testing dataset. 138 139 00:11:55,860 --> 00:11:57,860 So "X_test" as well. 139 140 00:11:58,200 --> 00:12:03,900 We can actually do that in the very same line of code by inserting another equal sign before we assign 140 141 00:12:03,900 --> 00:12:14,880 this string value here and write "X_test.index.name". In this case we're setting 141 142 00:12:14,880 --> 00:12:22,950 both of these index names equal to "DOC_ID". Let me hit Shift+Enter and show you that the document 142 143 00:12:22,980 --> 00:12:32,310 IDs actually match up after shuffling. Let's pull up "y_train.head()" and there we see 143 144 00:12:32,310 --> 00:12:38,610 the first five rows of our target values. You can see here that the document IDs match up with what 144 145 00:12:38,610 --> 00:12:44,450 we see in the training dataset. Now of course, that should be true regardless of whether this says 145 146 00:12:44,450 --> 00:12:51,720 X_train or X_test, the order of the features and the target values will be the same, 146 147 00:12:52,770 --> 00:12:54,450 but the proof is in the pudding, right? 147 148 00:12:54,540 --> 00:13:03,930 "X_test" looks like this and "y_test" looks like this. In the next lesson 148 149 00:13:04,140 --> 00:13:11,550 we're going to create a sparse matrix from our training dataset and we're gonna do that by transforming 149 150 00:13:11,550 --> 00:13:14,840 the values in our dataframe. I'll see you there.