Okay, ready? Hope you had a go at doing all this. Here's the solution. Let's very quickly recap where we are in terms of the data that we're working with. "X_test.head()" will show us the first five rows of our feature matrix, and the first five rows of our target values, "y_test", look like this. So you've got document 4675 as the very first row in both. "X_test.shape" will show us the size of the dataframe that we're working with. This is going to be much smaller than "X_train", and as such the call that we're going to make to our "make_sparse_matrix" function will take a lot less time to run. Let's check it out.

The "%%time" cell magic will benchmark this for us, and I'm going to create a variable called "sparse_test_df" to hold on to the result of our function call, so "sparse_test_df = make_sparse_matrix(X_test, word_index)" so far. The word index is the same for both our training data and our test data, and then, after another comma, the third argument is "y_test". Let's run this and see what we get.

Scroll down a bit, add a few rows in the meantime. There we go. Now we play the waiting game. I could try to yodel for you to help pass the time, but I think neither of us would enjoy that very much. Oh man, come on, come on. These are the times when you start feeling a bit of the pain of working on a four-year-old laptop, but my patience has paid off: 2 minutes 45 seconds to complete this calculation.

Let's take a look at how many rows we've got here. "sparse_test_df.shape" reveals that we've got about 190,000 individual rows.

Let me create another variable called "test_grouped" and set it equal to "sparse_test_df.groupby([])", and I've got to get those column names right. The first one was "DOC_ID". The second one was "WORD_ID". Everything's case sensitive, of course, and spelling really matters, which makes this extra difficult for me. The third one is "LABEL". All of these go in a list, at the end we sum it up with ".sum()", and I'm going to chain ".reset_index()" straight onto that. Finally, I'm going to look at the first five rows: "test_grouped.head()" gives me the following.

As you can see, it summed up the occurrences of word number 19 in email number 8 quite nicely. "test_grouped.shape" also shows me that it has far fewer rows, only 110,000 as opposed to 190,000.
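To make that aggregation concrete, here's a minimal, self-contained sketch of the same groupby pattern on a toy DataFrame. The "DOC_ID", "WORD_ID", and "LABEL" column names come straight from the lesson; the "OCCURRENCE" column name, the toy values, and the "make_sparse_matrix" signature shown in the comment are stand-ins for the notebook's earlier cells, not something taken from the video.

```python
import pandas as pd

# In the notebook the real data comes from the helper built earlier:
#   sparse_test_df = make_sparse_matrix(X_test, word_index, y_test)
# Here we fake a tiny version of its output: one row per word occurrence.
# (OCCURRENCE is an assumed name for the fourth column.)
sparse_test_df = pd.DataFrame({
    'DOC_ID':     [8,  8,  8,  9],
    'WORD_ID':    [19, 19, 30, 19],
    'LABEL':      [1,  1,  1,  0],
    'OCCURRENCE': [1,  1,  1,  1],
})

# Collapse duplicate (DOC_ID, WORD_ID, LABEL) rows into one row each,
# summing the occurrences, then flatten the MultiIndex back into columns.
test_grouped = (sparse_test_df
                .groupby(['DOC_ID', 'WORD_ID', 'LABEL'])
                .sum()
                .reset_index())
print(test_grouped)
#    DOC_ID  WORD_ID  LABEL  OCCURRENCE
# 0       8       19      1           2   <- word 19 in doc 8, summed
# 1       8       30      1           1
# 2       9       19      0           1
```

On the full test set, this merging of repeated (document, word) pairs into single rows is exactly why the row count drops from roughly 190,000 to roughly 110,000.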
To save this as a txt file we'll use numpy again, "np.savetxt()", and now we supply that constant we created: "np.savetxt(TEST_DATA_FILE, test_grouped, fmt='%d')".

I'll move my browser over a little bit to keep that Finder window in view, hit Shift+Enter, and there it is: there's my "test-data.txt" file. If I open it in Atom, I can see that the first five rows that we print out in Jupyter here mirror what we see in the text file exactly. The document ID is the first column, the word ID is the second one, the label is the third, and the occurrence is the fourth column.
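If you want to reproduce that save step, here's a hedged sketch that follows on from the toy "test_grouped" above. In the notebook, "TEST_DATA_FILE" is the path constant you defined earlier; the placeholder value below is mine, and the "loadtxt" round trip is just an optional sanity check, not something from the video.

```python
import numpy as np

# Placeholder: substitute the path constant you created earlier
# in the notebook.
TEST_DATA_FILE = 'test-data.txt'

# fmt='%d' writes every cell as a plain integer, one grouped row per line,
# in column order: DOC_ID, WORD_ID, LABEL, OCCURRENCE.
np.savetxt(TEST_DATA_FILE, test_grouped, fmt='%d')

# Optional round-trip check: the first rows should mirror test_grouped.head().
reloaded = np.loadtxt(TEST_DATA_FILE, dtype=int)
print(reloaded[:5])
```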