Okay, ready? Hope you had a go at doing all this. Here's the solution. Let's very quickly recap where we are in terms of the data that we're working with. "X_test.head()" will show us the first five rows of our feature matrix, and the first five rows of our target values, "y_test", look like this. So you've got document 4675 as the very first row in both. "X_test.shape" will show us the size of the dataframe that we're working with. This is going to be much smaller than "X_train", and as such the call that we're going to make to our "make_sparse_matrix" function will take a lot less time to run. Let's check it out.

The "%%time" cell magic will benchmark this for us, and I'm going to create a variable called "sparse_test_df" to hold on to the result of our function call, so "sparse_test_df = make_sparse_matrix(X_test, word_index)" so far. The word index is the same for both our training data and our test data, and then, after another comma, the third argument is "y_test". Let's run this and see what we get.

Scroll down a bit, add a few rows in the meantime. There we go. Now we play the waiting game. I could try to yodel for you to help pass the time, but I think neither of us would enjoy that very much. Oh man, come on, come on. These are the times when you start feeling a bit of the pain of working on a four-year-old laptop, but my patience has paid off: 2 minutes 45 seconds to complete this calculation.

Let's take a look at how many rows we've got here. "sparse_test_df.shape" reveals that we've got about 190,000 individual rows.

Let me create another variable called "test_grouped" and set it equal to "sparse_test_df.groupby([])", and I've got to get those column names right. The first one was "DOC_ID". The second one was "WORD_ID". Everything's case sensitive, of course, and spelling really matters, which makes this extra difficult for me. The third one is "LABEL". All of these go in a list, at the end we sum it up with ".sum()", and I'm going to chain ".reset_index()" straight onto that. Finally, I'm going to look at the first five rows: "test_grouped.head()" gives me the following.

As you can see, it summed up the occurrences of word number 19 in email number 8 quite nicely. "test_grouped.shape" also shows me that it has far fewer rows, only 110,000 as opposed to 190,000.
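To make that aggregation concrete, here's a minimal, self-contained sketch of the same groupby pattern on a toy DataFrame. The "DOC_ID", "WORD_ID", and "LABEL" column names come straight from the lesson; the "OCCURRENCE" column name, the toy values, and the "make_sparse_matrix" signature shown in the comment are stand-ins for the notebook's earlier cells, not something taken from the video.

```python
import pandas as pd

# In the notebook the real data comes from the helper built earlier:
#   sparse_test_df = make_sparse_matrix(X_test, word_index, y_test)
# Here we fake a tiny version of its output: one row per word occurrence.
# (OCCURRENCE is an assumed name for the fourth column.)
sparse_test_df = pd.DataFrame({
    'DOC_ID':     [8,  8,  8,  9],
    'WORD_ID':    [19, 19, 30, 19],
    'LABEL':      [1,  1,  1,  0],
    'OCCURRENCE': [1,  1,  1,  1],
})

# Collapse duplicate (DOC_ID, WORD_ID, LABEL) rows into one row each,
# summing the occurrences, then flatten the MultiIndex back into columns.
test_grouped = (sparse_test_df
                .groupby(['DOC_ID', 'WORD_ID', 'LABEL'])
                .sum()
                .reset_index())
print(test_grouped)
#    DOC_ID  WORD_ID  LABEL  OCCURRENCE
# 0       8       19      1           2   <- word 19 in doc 8, summed
# 1       8       30      1           1
# 2       9       19      0           1
```

On the full test set, this merging of repeated (document, word) pairs into single rows is exactly why the row count drops from roughly 190,000 to roughly 110,000.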
To save this as a txt file we'll use numpy again, "np.savetxt()", and now we supply that constant we created: "np.savetxt(TEST_DATA_FILE, test_grouped, fmt='%d')".

I'll move my browser over a little bit to keep that Finder window in view, hit Shift+Enter, and there it is: there's my "test-data.txt" file. If I open it in Atom, I can see that the first five rows that we print out in Jupyter here mirror what we see in the text file exactly. The document ID is the first column, the word ID is the second one, the label is the third, and the occurrence is the fourth column.
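If you want to reproduce that save step, here's a hedged sketch that follows on from the toy "test_grouped" above. In the notebook, "TEST_DATA_FILE" is the path constant you defined earlier; the placeholder value below is mine, and the "loadtxt" round trip is just an optional sanity check, not something from the video.

```python
import numpy as np

# Placeholder: substitute the path constant you created earlier
# in the notebook.
TEST_DATA_FILE = 'test-data.txt'

# fmt='%d' writes every cell as a plain integer, one grouped row per line,
# in column order: DOC_ID, WORD_ID, LABEL, OCCURRENCE.
np.savetxt(TEST_DATA_FILE, test_grouped, fmt='%d')

# Optional round-trip check: the first rows should mirror test_grouped.head().
reloaded = np.loadtxt(TEST_DATA_FILE, dtype=int)
print(reloaded[:5])
```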