Okay, so previously we've split and shuffled our data and put everything into a dataframe. Now let's create that sparse matrix. To do that, we will use three things: our X_train dataframe, our "y_train" pandas series, and our vocabulary words. Remember those? They look like this. Our vocabulary of 2,500 words is stored in a dataframe where the index holds the word IDs and the individual strings are in a column called VOCAB_WORD. Now, if we've got this dataframe here and we want to know which word has word ID number 3, we can find that really, really easily, and we've done this before, because all you'd have to specify is the index and the column, and then you'd get a string that reads 'email'. But say we know the word and we want to know the word ID. Now we're asking the question in reverse; we're asking it the other way around. For example, what is the word ID for the string "email"? An easy way to answer this question with some Python code is to create an index from this column here, the VOCAB_WORD column, and then look up the position of a particular string in the index. Let me show you what I mean. I'll add a quick markdown cell here that reads "Create a Sparse Matrix for the Training Data".
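That forward lookup (word ID to word) can be sketched like this. The dataframe below is a tiny made-up stand-in for the real 2,500-word vocabulary, arranged so that ID 3 maps to 'email' as in the lesson:

```python
import pandas as pd

# Toy stand-in for the vocabulary dataframe: word IDs in the index,
# the strings themselves in a column called VOCAB_WORD.
vocab = pd.DataFrame({'VOCAB_WORD': ['http', 'get', 'thu', 'email']},
                     index=pd.Index([0, 1, 2, 3], name='WORD_ID'))

# Forward lookup: which word has word ID number 3?
# Specify the index value and the column name.
print(vocab.at[3, 'VOCAB_WORD'])  # 'email'
```

The `.at` accessor is the fast scalar lookup for a single (index, column) pair.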
Now, to turn a particular column of our dataframe into an index, all we have to do is select our dataframe, select the column, wrap that whole thing in parentheses, and put it inside "pd.Index". This will create an index from a particular column in a dataframe. Let's store this in a variable called "word_index", so "word_index" is equal to "pd.Index(vocab.VOCAB_WORD)". Now we know we're dealing with a pandas index, because the type of "word_index" is "pandas.core.indexes.base.Index", and this index is composed of individual strings like "http", "email", "get" and so on, the ones we saw earlier. And I can verify this if I, say, pull up the fourth word, at index position number 3, and check the type of this word. There you can see that we are indeed dealing with strings inside our index. If I come back up here and take a look at the first email in our X_train dataframe, I can see that it's probably the stemmed word for "thursday", right, "thu". Now, if I wanted to know the word ID for "thu" in our word index, I can simply take the index and use the "get_loc" method with "thu" passed in as an argument. And I see that this word is at position number 395.
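Here is that reverse lookup as a small self-contained sketch, again with a toy four-word vocabulary standing in for the real one (so the positions here won't match the 395 from the lesson):

```python
import pandas as pd

# Toy stand-in for the vocabulary dataframe used in the lesson.
vocab = pd.DataFrame({'VOCAB_WORD': ['http', 'get', 'thu', 'email']})

# Turn the VOCAB_WORD column into a pandas Index.
word_index = pd.Index(vocab.VOCAB_WORD)

# Reverse lookup: at what position does a given string sit?
print(word_index.get_loc('email'))  # 3

# The entries of the index are plain Python strings.
print(type(word_index[3]))  # <class 'str'>
```

`Index.get_loc` returns the integer position of a label, which is exactly what we use as the word ID.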
So here's what we're going to do: we're going to take our X_train dataframe and use the information contained therein to create our sparse matrix. Let's walk through how we would do it with the first row that I've shown here. The document ID for this row is 4844. We can retrieve this information from X_train using its index, since our document IDs are stored as the index values. Since 4844 is the very first entry, the very first value in the index, we can just say "give us the value of the index at position 0". That's what we can use to fill in the document ID of the sparse matrix. Now what about the label, the category that this email belongs to? For that, we simply look towards the "y_train" pandas series. There, at the entry named 4844, we would either get a 1 or a 0. This email is actually a non-spam email, so it would have the label 0. Now let's tackle that first stemmed word, "thu" for "thursday". Here our word index comes into play. We can get the word ID for this string "thu" from our word index using that get_loc method. And as we saw earlier, "thu" is at position number 395. For that last column in the sparse matrix we will simply add a 1, and that's simply because we've counted one occurrence. In fact, on our first pass we'll add a 1 for every occurrence; we'll combine the occurrences later. Let's move on to the next word, "jul", short for "july". The document ID and the label of course stay the same, 4844 and non-spam, and then we simply use our word index again and get the location for this particular string. This string has the word ID 494, and again it occurs a single time. Now, because we've actually saved all our word IDs as a CSV file, we can verify the word IDs in Microsoft Excel or Google Sheets. If I double-click on this file and scroll down to, what do we see, position 494, then I see my stemmed word right here. Next up is that third word, "rodent". So let's see what happens when we check whether the word "rodent" is part of our word index. If we do this, "word_index.get_loc('rodent')", we will actually get an error, and that's because the word "rodent" doesn't occur frequently enough to have made it into our vocabulary. In other words, the word "rodent" will not be added to our sparse matrix. Instead, we move on to the next word. Our next word is in fact this one right here. Checking our index, we find that the word ID is 2386. This is essentially the workflow of how we will build up our sparse matrix. We're going to put all of this work into a loop and then wrap all of that into a function. So let's get on it.
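The per-word workflow just described can be sketched in a few lines. The vocabulary, document ID and word list below are illustrative stand-ins for the real X_train row, so the word IDs are small toy numbers rather than 395 and 494:

```python
import pandas as pd

word_index = pd.Index(['http', 'get', 'thu', 'jul'])  # toy vocabulary
y_train = pd.Series({4844: 0})                        # doc 4844 is non-spam
words_in_doc = ['thu', 'jul', 'rodent']               # stemmed words from one email

word_set = set(word_index)  # fast membership checks
rows = []
for word in words_in_doc:
    # 'rodent' is not in the vocabulary, so it produces no row.
    if word in word_set:
        rows.append({'LABEL': y_train.at[4844],
                     'DOC_ID': 4844,
                     'OCCURENCE': 1,  # spelling follows the lesson's column name
                     'WORD_ID': word_index.get_loc(word)})

print(len(rows))  # 2 -- only 'thu' and 'jul' made it in
```

Checking membership first avoids the error that `get_loc` raises for out-of-vocabulary words like "rodent".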
Here's what our function will look like. As always, we're going to start out with our "def" keyword to define our function. We'll call this function "make_sparse_matrix", a very imaginative name, but it's very clear. I reckon this function should take three inputs: a dataframe; an index for the word IDs, so let's call this one "indexed_words"; and third, the labels, namely the y values. So that'll be our third input. Put a colon at the end. Now we can add a quick description of what this function should do: three double quotes for a docstring, and we'll provide a very quick description. "Returns sparse matrix as dataframe." Our inputs are going to be as follows: "df" is a dataframe with words in the columns and a document id as an index (X_train or X_test). The "indexed_words" parameter is going to be an index of words ordered by word ID. The labels should be the category as a series, in other words y_train or y_test. I think that'll do for our docstring. Now let's add the body of the function.
Now, as you can see from the docstring, this function should work regardless of whether we feed in the X_train dataframe or the X_test dataframe, so let's capture the dimensions of the incoming dataframe ahead of time. I'll create a variable for the number of rows, "nr_rows", and that'll be equal to "df.shape[0]", and the number of columns, "nr_cols", is going to be equal to "df.shape[1]". So that's that. Now, within the body of this function, I know that I'm going to be doing a lot of lookups. I'm going to be checking if the words in the dataframe are part of our vocabulary list. There are a lot of checks that are gonna be running as part of our loop, so I want to be working with a data structure called a Python set, as you recall. So I'll say "word_set = set(indexed_words)". Here I'm creating a Python set from the index that is being fed into this function as an argument. Now I'm going to add a nested loop, and within that loop I'm going to be adding dictionaries to a Python list. Let me write the outline of this loop first. I'll create my empty list, which I'll call "dict_list", with two square brackets, and at the very end of our function I'm going to return a pandas dataframe that is created from the list we're gonna be populating inside our loop.
Now, in between these two lines of code is gonna go the meat of our code. There'll be two loops: "for i in range(nr_rows)" will be the outer loop, and then there'll be an inner loop, "for j in range(nr_cols)". So we're gonna go through the dataframe that we're feeding in row by row and column by column. Within this inner loop, we're gonna be appending a dictionary to our list every time the loop runs. Here's how it's gonna work. The very first thing we're gonna do is get hold of a particular string, and by a particular string I mean the value in a particular cell, because we're gonna iterate through this dataframe row by row and column by column. To get hold of a particular word, we'll say "df.iat[i, j]". In other words, we'll be retrieving the word in the i-th row and the j-th column. Then we'll check if the word that we picked out of our dataframe is in our word set, and if it is, then we should fetch the document ID, the word ID and the category. The document ID is gonna be equal to the value of the index in the i-th row, so "df.index[i]". The word ID is going to be equal to "indexed_words.get_loc(word)".
"word" is a string, so we can feed it into our get_loc method to retrieve the position of this word in our index of words, and that will be our word ID. Now it's time to get the category. The category is gonna be our y values at, well, at the document ID. Right. The y values, we said, we'd feed in as this labels parameter here. So we'll say "labels.at[doc_id]". Now we've got the three things that we need, and from them we can create a little dictionary to put everything into one data structure. I'll call that "item = {'LABEL': category, 'DOC_ID': doc_id, 'OCCURENCE': 1, 'WORD_ID': word_id}". Here I've created a dictionary with four entries. The first one has the key LABEL and gets the y value, the category, spam or not spam. The second is our document ID, which gets the document ID that we've extracted here. The third is OCCURENCE, which is always gonna be equal to 1, because we're kind of doing a first pass on this, and every time we discover a word that's part of our vocabulary we'll add it to our dataframe. And the last one here is the word ID, which we've retrieved here. So now that we have a dictionary for a single item, what we can do is take our "dict_list" and append our item.
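Putting all the pieces together, the function walked through above looks roughly like this. This is a reconstruction from the spoken walkthrough, so the notebook's actual version may differ in small details:

```python
import pandas as pd

def make_sparse_matrix(df, indexed_words, labels):
    """
    Returns sparse matrix as dataframe.

    df: dataframe with words in the columns and a document id
        as an index (X_train or X_test)
    indexed_words: index of words ordered by word id
    labels: category as a series (y_train or y_test)
    """
    nr_rows = df.shape[0]
    nr_cols = df.shape[1]
    word_set = set(indexed_words)  # set gives O(1) membership checks in the loop
    dict_list = []

    for i in range(nr_rows):
        for j in range(nr_cols):
            word = df.iat[i, j]          # value in the i-th row, j-th column
            if word in word_set:         # skip words outside the vocabulary
                doc_id = df.index[i]     # document IDs live in the index
                word_id = indexed_words.get_loc(word)
                category = labels.at[doc_id]
                item = {'LABEL': category, 'DOC_ID': doc_id,
                        'OCCURENCE': 1, 'WORD_ID': word_id}
                dict_list.append(item)

    return pd.DataFrame(dict_list)
```

Each vocabulary word in each email becomes one row with OCCURENCE set to 1; duplicate occurrences get collapsed in a later step.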
So we're appending each dictionary that we create as this loop runs to our list, which starts off empty but gets populated, and the dataframe that we return from this function gets created using this list. Fantastic. So here's the whole function body in its entirety. Let me press Shift+Enter on this. And now let's try and run this baby. I'm going to scroll down into the next cell, and the first thing I'll do is actually add some micro-benchmarking code. So, "%%time". This will time how long this cell takes to run. Now I'm going to store the result of this function call, this dataframe, in a variable called "sparse_train_df", and I'll set that equal to "make_sparse_matrix" and, you guessed it, I'm going to give it the training data, right: "X_train, word_index, y_train". Now let me hit Shift+Enter and let's see what happens. This cell can take quite a long time to execute. It's processing a lot of data, going through a dataframe that has thousands of rows and thousands of columns, one entry at a time. On the machine that I'm currently on, this cell takes between 5 and 10 minutes to run.
So I typically step away and grab a croissant or a coffee or something and come back when it's done. And I really encourage you to do the same; there's no point in waiting around. This is actually one of those times where you'll see a dramatic performance difference depending on whether or not you're using a set data structure. The check in our inner loop runs thousands of times, so any minimal difference in the time that this lookup, this check, takes will build up to quite a significant amount of time. If we go back up to our constants, you will spot another thing that really determines the size of the dataset we're working with. Yes, we imported approximately 5,800 emails that we parsed and so on, but one of the key inputs, one of the key constraints that we set, is actually our vocabulary size. We set this at 2,500, and this vocabulary size will determine how big a matrix we end up with at the end. The reason I picked 2,500 is because it's relatively large, so it's going to make our computer work quite hard, but it's nowhere near the size you'd use for a commercial spam filter built on a naive Bayes model.
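To see why the set matters so much, here is a rough, hypothetical micro-benchmark. The vocabulary below is made-up filler words, and the absolute timings will vary by machine, but the relative gap is the point:

```python
import timeit

# A fake 2,500-word vocabulary, as a list and as a set.
vocab_list = [f'word{n}' for n in range(2500)]
vocab_set = set(vocab_list)

# Membership in a list is O(n): Python scans entries until it finds a match,
# and a word near the end forces a scan of nearly all 2,500 entries.
list_time = timeit.timeit(lambda: 'word2499' in vocab_list, number=10_000)

# Membership in a set is O(1) on average: a single hash lookup.
set_time = timeit.timeit(lambda: 'word2499' in vocab_set, number=10_000)

print(f'list: {list_time:.4f}s   set: {set_time:.4f}s')
```

Multiplied by the hundreds of thousands of checks our nested loop performs, that per-lookup difference is the gap between minutes and much longer.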
If we were operating Hotmail or Gmail or something and we had to build this naive Bayes based classifier, we would typically set our vocabulary size at 10,000 to 50,000 words, and you can imagine how much data we would have to crunch and how long we would have to run our machines for. All right, I'm pretty chuffed to see that it's all done now: 6 minutes and 28 seconds. This means that we can take a look at the results and see if they make sense. Let's take a look at the first five rows, so "sparse_train_df[:5]" will give me the first five rows, and here I can see my word IDs. Each of these words occurs only once, and all of these words occur in email number 4844, which is a non-spam email. Let's take a look at the shape of this dataframe now, so "sparse_train_df.shape". Shift+Enter shows us that we've got approximately 450,000 rows in this dataframe. That's an absolutely huge amount, almost half a million. This is one of the reasons why this whole thing took a good 6 minutes to run on my machine. The last five rows of this dataframe look like this: "sparse_train_df[-5:]". Here you go. All of these rows pertain to email number 860.
Now, one of the reasons why there are 450,000 rows in this dataframe is that we've put each and every single word from X_train into a separate row. So if the word "thursday" occurs twice in the same email, it gets two separate rows in this dataframe. What we're gonna do now is combine these occurrences. If a word occurs more than once in the same email, we should combine it in this dataframe; we should have an occurrence of two for that particular word ID.
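One way that combining step could look, sketched with a pandas groupby on a few made-up rows (the course may take a slightly different route in the next step):

```python
import pandas as pd

# Toy version of sparse_train_df: 'thu' (word ID 395) appears twice in doc 4844.
sparse_df = pd.DataFrame({'DOC_ID':    [4844, 4844, 4844],
                          'WORD_ID':   [395, 395, 494],
                          'LABEL':     [0, 0, 0],
                          'OCCURENCE': [1, 1, 1]})

# Summing OCCURENCE per (DOC_ID, WORD_ID, LABEL) collapses duplicate rows
# into one row per word per document, with the total count.
combined = sparse_df.groupby(['DOC_ID', 'WORD_ID', 'LABEL']).sum().reset_index()
print(combined)
```

After the groupby, word ID 395 occupies a single row with an OCCURENCE of 2, which is exactly the shape the naive Bayes calculations will want.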