We're going to group our words by email, and to do that we'll use the pandas groupby method. Let me add a markdown cell here that reads "Combine Occurrences with the Pandas groupby Method". Now, I'm guessing that you will have used Microsoft Excel at some point in the past, and in Excel there is a very powerful feature called a pivot table. The groupby method works in a very similar way: it will allow us to group the occurrences by document ID, word ID and label, and then we can sum up the occurrences. Let me show you.

I'll write "train_grouped = sparse_train_df.groupby()" and provide a list inside the parentheses: square brackets, single quotes, "['DOC_ID', 'WORD_ID', 'LABEL']". At the very end we're going to chain another method, namely the summation, so ".sum()" will add up our occurrences after they've been grouped. But seeing is believing, so let me show you what this looks like: "train_grouped.head()" will show us the result.

Here we go. What we see is that for the document with ID 0, our first document, we've got a bunch of words grouped together by their IDs. The word with ID 0 occurs twice in this first email. That's all this table is showing us.

Now you might ask, "All right, but what is the word with ID 0?" We can pull that up. We can go to our vocabulary and write "vocab.at[0, 'VOCAB_WORD']". 'VOCAB_WORD', if you recall, was the column name in our dataframe, and 0 is the index value, which corresponds to our word ID. The word from our vocabulary that occurs twice is "http". Why does it occur twice? Well, it's because there are two hyperlinks in the original email, the one with document ID 0. We can pull that email up with "data.MESSAGE[0]", and it reads "Dear homeowner... Interest rates are at their lowest level... blah blah". If I look further down in the text, I see the first hyperlink here and the second hyperlink here. This is why the word "http" appears twice in this document.
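Putting those dictated pieces together, here is a minimal sketch of this step, assuming sparse_train_df, vocab and data carry over from the earlier lessons with the column names used in this course:

```python
# Assumes sparse_train_df has columns DOC_ID, WORD_ID, LABEL, OCCURENCE,
# vocab is a dataframe indexed by word ID, and data holds the raw emails.
train_grouped = sparse_train_df.groupby(['DOC_ID', 'WORD_ID', 'LABEL']).sum()
train_grouped.head()

# Sense check: look up which word has ID 0 in the vocabulary...
vocab.at[0, 'VOCAB_WORD']   # 'http'

# ...and inspect the original email to see why it occurs twice.
data.MESSAGE[0]
```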
So our groupby call combined with the summation method seems to have worked really well. The one thing I would quite like, though, is to have less of this pivot-table feel and to repeat the document ID on every single row. We can do that with "train_grouped = train_grouped.reset_index()", so we're simply overwriting it. "reset_index" will make the document ID appear on every single row, and "train_grouped.head()" will show you exactly that. Fantastic.

Let's take a quick look at the tail of this dataframe as well. "train_grouped.tail()" gives us this result, and we're going to do the same very quick sense check on it. In particular, let's see which word corresponds to ID 1923; it appears to occur twice in this email. Now, you can either work ahead or follow along with me, but we've done this already: "vocab.at[1923, 'VOCAB_WORD']". This gives us the result "welch", which is a very odd word. It doesn't quite look like a real word, but it could be a stemmed word, so maybe that's why it seems a bit strange. Let's pull up the actual message and see why this word appears twice: "data.MESSAGE[5795]". This brings up quite a short email, and it turns out that "welch" is actually a name. It's the last name of this guy Brent Welch, a software architect, and the word appears again in his email address at the very end of the message. So that's why it's here. I'm quite happy with this; it passes the sense check.

The only thing I'd be curious to find out now is how big a reduction in the number of rows we've achieved. If I write "train_grouped.shape", I can see that we've shrunk our dataframe quite a bit: we've gone from around 450,000 rows to approximately 265,000. That's still a lot, but it's about a 40% reduction.
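Again as a sketch, with the same assumptions as above, the flattening and the sense checks look like this:

```python
# Flatten the grouped index so DOC_ID, WORD_ID and LABEL
# appear as ordinary columns on every row.
train_grouped = train_grouped.reset_index()
train_grouped.head()
train_grouped.tail()

# Sense check on the tail: word 1923 is 'welch', a surname
# that appears twice in email 5795.
vocab.at[1923, 'VOCAB_WORD']
data.MESSAGE[5795]

# How much smaller is the dataframe after grouping?
train_grouped.shape   # roughly 265,000 rows, down from about 450,000
```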
I think this puts us in a really good place to save our work, so let's do that now. I'll add a very quick markdown cell here and call this one "Save Training Data as .txt File". In previous lessons we've saved our files as a .json file and as a .csv file; that's how we saved our work to disk before. Now let's save a file as a plain text file, and for this we're going to use numpy's functionality. But before we do that, we're going to need a relative file path at the top of our notebook, so that it sits nicely with its friends. Now, I'm planning to save this file in a slightly different folder, but first let's give it a name. I'm going to call the constant "TRAINING_DATA_FILE" and set it equal to "SpamData/02_Training/train-data.txt". I'll be adding our text file to this folder right here. Now, be sure to hit Shift+Enter on the cell with your constants, then join me down at the bottom of the notebook.

There we write "np.savetxt(TRAINING_DATA_FILE, train_grouped, fmt='%d')". If I hit Shift+Tab on my keyboard to bring up the quick documentation, we see that the first argument is the file name, including the relative file path. The second argument is the data, and the third argument is "fmt", which stands for format. If I hit the plus sign here and scroll down, I can see that the format argument takes a string or a sequence of strings; it essentially lets us specify the number format. Lucky for us, we're only dealing with integers.

If I bring up my folder side by side and hit Shift+Enter now, I should see my text file appear right here. Before I open it and peek inside, let me show you what the columns are called in the Jupyter notebook: "train_grouped.columns" brings up our column names, namely "DOC_ID", "WORD_ID", "LABEL" and "OCCURENCE".

Now let's look at the text file. If I open it in my text editor, I can see the four columns clearly laid out. The first number is the document ID, the second is the word ID, and the third is the category or label, so the first message is in fact a spam email. The fourth number is the occurrence of the word with that ID. Looking at line 16 here, the word with ID 105 occurs twice in this spam email.
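As a sketch of the saving step, with TRAINING_DATA_FILE defined among the constants at the top of the notebook:

```python
import numpy as np

# Relative path for the grouped training data.
TRAINING_DATA_FILE = 'SpamData/02_Training/train-data.txt'

# Write the four integer columns as plain text;
# fmt='%d' formats every value as a whole number.
np.savetxt(TRAINING_DATA_FILE, train_grouped, fmt='%d')

train_grouped.columns   # DOC_ID, WORD_ID, LABEL, OCCURENCE
```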
So I think that almost wraps it up, except we haven't done this for our test data yet, and that's where I want to throw it over to you. As a challenge, can you create a sparse matrix for the test data and then group all the occurrences of the same word in the same email together, just like we did with the training data? After you've done all that, save the data as a .txt file. Now, I realize you're going to have to save that data somewhere and give it a file name, so let's do that right now, so that you and I have the same file names going forward. Scrolling up to our constants, I'm just quickly going to copy this relative path, paste it in, change the file name to "test", and change the constant name from "TRAINING" to "TEST" as well. So "TEST_DATA_FILE" is equal to this relative path, file name and extension right here. Now I don't have to ask you to pause the video, because I'm going to show you the solution in the next lesson. I'll see you there.
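For reference, here is a sketch of the new constant plus the rough shape of the challenge, mirroring the training steps (sparse_test_df is an assumed name for the test-data sparse matrix; the worked solution follows in the next lesson):

```python
# New constant alongside TRAINING_DATA_FILE in the constants cell.
TEST_DATA_FILE = 'SpamData/02_Training/test-data.txt'

# Challenge outline, assuming sparse_test_df is built the same
# way as sparse_train_df was for the training data.
test_grouped = sparse_test_df.groupby(['DOC_ID', 'WORD_ID', 'LABEL']).sum()
test_grouped = test_grouped.reset_index()
np.savetxt(TEST_DATA_FILE, test_grouped, fmt='%d')
```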