1 00:00:00,550 --> 00:00:07,500 In this lesson we're gonna go from a sparse matrix to a full matrix. 2 00:00:07,570 --> 00:00:12,850 First I'll show you how to create an empty DataFrame that we can populate with values later on, and 3 00:00:12,850 --> 00:00:20,000 then we will write the Python function that will do this transformation from sparse to full matrix. 4 00:00:20,170 --> 00:00:22,580 Let me add a quick markdown cell here. 5 00:00:22,780 --> 00:00:30,310 That reads: How to create an empty DataFrame. 6 00:00:30,320 --> 00:00:32,020 Now why might you want to do this? 7 00:00:32,060 --> 00:00:39,440 Well, sometimes you want to create the DataFrame first and then populate it with values later on. The key 8 00:00:39,440 --> 00:00:45,530 things that you need with this, of course, is to specify what sort of column names you want, how 9 00:00:45,530 --> 00:00:51,710 many columns you want to have, how many rows you want, what the index names should be, and perhaps also 10 00:00:51,710 --> 00:00:54,320 provide some sort of dummy value for the cells, right? 11 00:00:54,410 --> 00:01:00,610 Do you want the cells to be equal to zero or None or 1 or something else? 12 00:01:00,650 --> 00:01:02,290 Let's tackle one thing at a time. 13 00:01:02,360 --> 00:01:03,610 Let's start with the column names. 14 00:01:03,710 --> 00:01:09,790 So I'm going to create a variable that will hold onto all of these: column underscore names. 15 00:01:09,780 --> 00:01:12,440 It's going to be equal to square brackets. 16 00:01:12,590 --> 00:01:19,190 Single quotes DOC underscore ID. We'll have our document ID, for example, as a column. Then we'll 17 00:01:19,190 --> 00:01:24,190 have a plus and let's add the category as well. 18 00:01:24,320 --> 00:01:25,940 So square brackets. 19 00:01:25,940 --> 00:01:27,170 Single quotes CATEGORY. 
20 00:01:27,740 --> 00:01:33,560 And then let's add our word IDs. They go 0, 1, 2, 3, 4, all the way up to two thousand four hundred ninety 21 00:01:33,560 --> 00:01:34,560 nine. 22 00:01:34,610 --> 00:01:36,220 I'll pass this in as a range. 23 00:01:36,290 --> 00:01:47,200 So list, parentheses, range, parentheses, zero comma VOCAB underscore SIZE. Our vocab size is saved as a constant up 24 00:01:47,200 --> 00:01:52,300 top. If you want to take a look at what the first five columns will look like, they're gonna look like 25 00:01:52,300 --> 00:01:52,750 this. 26 00:01:52,750 --> 00:01:56,980 column underscore names, square brackets, colon five. 27 00:01:56,980 --> 00:01:58,300 There you go. 28 00:01:58,330 --> 00:02:04,330 So as you can see, we've simply created a list that starts with two strings and then some numbers from 29 00:02:04,330 --> 00:02:05,990 our range object. 30 00:02:05,990 --> 00:02:07,420 Now of course the list is quite long, right? 31 00:02:07,420 --> 00:02:15,580 So len, parentheses, column underscore names shows us that we've got two thousand five hundred and two columns. 32 00:02:15,580 --> 00:02:21,150 So two additional columns plus the number of words in our vocabulary. 33 00:02:21,520 --> 00:02:23,200 Now for our index names. 34 00:02:23,200 --> 00:02:30,760 So for index underscore names we'll use maybe the unique values of the different document IDs in our sparse 35 00:02:31,000 --> 00:02:32,200 training data. 36 00:02:32,200 --> 00:02:41,350 So I want to do this with NumPy, so np, and I'm going to use the unique function from NumPy, and I can 37 00:02:41,350 --> 00:02:50,410 specify my sparse training data here and select one particular column of this NumPy array with square 38 00:02:50,410 --> 00:02:58,210 brackets, colon, comma, and then zero for the first column. index underscore names 39 00:02:58,210 --> 00:03:09,250 now looks like this: 0, 1, 2, and then the last three values are 5791, 5794 and 5795. 
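As a rough sketch of the two steps just described, the snippet below builds the column names and the index names on a toy example. The tiny VOCAB_SIZE and the contents of sparse_train_data are made-up stand-ins, not the lesson's real data, and the upper-case spelling of the 'DOC_ID' and 'CATEGORY' strings is an assumption.

```python
import numpy as np

VOCAB_SIZE = 5  # assumption: tiny stand-in for the 2500-word vocabulary

# Hypothetical sparse data. Columns: document ID, word ID, category, occurrence
sparse_train_data = np.array([
    [0, 1, 1, 2],
    [0, 3, 1, 1],
    [2, 0, 0, 4],
    [5, 4, 0, 1],
])

# Two string columns followed by one integer column per word ID
column_names = ['DOC_ID'] + ['CATEGORY'] + list(range(0, VOCAB_SIZE))

# Unique document IDs taken from the first column of the sparse array
index_names = np.unique(sparse_train_data[:, 0])

print(column_names[:5])
print(len(column_names))
print(index_names)
```

With the real data, len(column_names) would be the vocabulary size plus the two extra columns.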
40 00:03:09,250 --> 00:03:14,110 Now mind you, these are not consecutive numbers of course, because some of the document IDs 41 00:03:14,200 --> 00:03:20,170 belong to our training data and other documents have been excluded, as we've seen in the last lesson 42 00:03:20,260 --> 00:03:22,330 of the previous module. 43 00:03:22,440 --> 00:03:22,710 All right. 44 00:03:22,720 --> 00:03:29,380 So with these two things in hand we can create a DataFrame like so: full underscore train 45 00:03:29,380 --> 00:03:40,480 underscore data is equal to pd, for pandas, dot DataFrame, parentheses, and then we'll specify some arguments 46 00:03:40,480 --> 00:03:40,880 here. 47 00:03:40,990 --> 00:03:48,610 The index is gonna be equal to our index names and the columns of this DataFrame are gonna be equal 48 00:03:48,610 --> 00:03:50,980 to our column names. 49 00:03:50,980 --> 00:03:52,180 Fair enough. 50 00:03:52,180 --> 00:03:52,730 Easy, right? 51 00:03:53,440 --> 00:04:01,180 So let me shift enter on this. It'll run for a little while and we can look at the head of this 52 00:04:01,180 --> 00:04:03,510 DataFrame and see what it looks like. 53 00:04:03,550 --> 00:04:06,000 So this is interesting, right? 54 00:04:06,040 --> 00:04:08,980 We've got NaN, not a number. 55 00:04:09,110 --> 00:04:12,100 NaN values for all the cells. 56 00:04:12,100 --> 00:04:19,240 What if we wanted this DataFrame instead to be filled with zero values as the default instead of these 57 00:04:19,240 --> 00:04:26,760 NaN values? We can do that with a handy little function called fillna. So we can take our DataFrame, 58 00:04:27,790 --> 00:04:32,770 put a dot after it and write fillna, parentheses, 59 00:04:33,040 --> 00:04:36,850 value equals zero. 60 00:04:36,850 --> 00:04:45,520 This here is a method of the pandas DataFrame that allows you to fill a DataFrame using a value 61 00:04:45,610 --> 00:04:47,310 of your choice. 
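A minimal sketch of this step, assuming a shortened column list and made-up document IDs: creating a DataFrame from only an index and columns leaves every cell as NaN, and fillna with value=0 swaps those for zeros. Here the result is assigned to a new name, filled, which is a hypothetical variable used only for this illustration.

```python
import numpy as np
import pandas as pd

column_names = ['DOC_ID', 'CATEGORY', 0, 1, 2]  # shortened, hypothetical
index_names = np.array([0, 2, 5])               # hypothetical document IDs

# No data passed in, so pandas fills every cell with NaN
full_train_data = pd.DataFrame(index=index_names, columns=column_names)
print(full_train_data.head())

# fillna returns a new DataFrame with the NaN cells replaced by 0
filled = full_train_data.fillna(value=0)
print(filled.head())
```

Depending on the pandas version, fillna on these all-NaN columns may print a deprecation warning about downcasting; the zeros are filled in either way.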
62 00:04:47,380 --> 00:04:50,880 We will fill this DataFrame with the value 0. 63 00:04:51,070 --> 00:04:53,590 And we're also gonna use this inplace argument here. 64 00:04:54,220 --> 00:05:00,950 So instead of writing something like full underscore train underscore data is equal to full underscore train 65 00:05:01,510 --> 00:05:10,720 underscore data dot fillna, we will use inplace to just replace our existing DataFrame: inplace 66 00:05:10,930 --> 00:05:13,230 is equal to True. 67 00:05:13,230 --> 00:05:13,690 There we go. 68 00:05:14,110 --> 00:05:21,220 So let's shift enter on this, refresh this cell, and then let's take a look at the head of our DataFrame. 69 00:05:21,220 --> 00:05:27,430 Voilà, instead of the NaN values we have our default values equal to zero now. 70 00:05:27,590 --> 00:05:35,800 Now let's create a full matrix from a sparse matrix. 71 00:05:36,640 --> 00:05:43,930 So we've imported a sparse matrix earlier from the text file, and let's write a function that will create 72 00:05:43,930 --> 00:05:46,800 our full matrix for us. 73 00:05:46,810 --> 00:05:55,720 Now remember, the structure of our sparse matrix was document ID, word ID, label and then occurrence 74 00:05:56,230 --> 00:05:57,500 of this particular word. 75 00:05:57,580 --> 00:05:59,610 That's the structure that we're working with. 76 00:05:59,680 --> 00:06:04,130 These are columns 0, 1, 2 and 3. 77 00:06:04,240 --> 00:06:11,920 So let's define a function, make underscore full underscore matrix, and this function shall have quite 78 00:06:11,920 --> 00:06:13,360 a few arguments. 79 00:06:13,540 --> 00:06:19,660 The first one of course will be the sparse matrix that serves as a starting point. 80 00:06:19,660 --> 00:06:26,310 The second argument shall be the number of words, nr underscore words, in the vocabulary. 81 00:06:26,430 --> 00:06:29,700 The third one may be the document index. 82 00:06:29,860 --> 00:06:33,220 So our document ID is at index 0. 
83 00:06:33,220 --> 00:06:37,510 So I'll say doc underscore idx is equal to zero. 84 00:06:37,690 --> 00:06:43,660 So I'll just give that one a default value. Then I'll specify as well what the index is for the word, 85 00:06:43,750 --> 00:06:44,070 right? 86 00:06:44,070 --> 00:06:52,300 So that way we don't forget: the word index is one, the category index was two, so cat underscore idx 87 00:06:52,300 --> 00:06:54,350 is equal to 2. 88 00:06:54,430 --> 00:07:04,020 And lastly our number of occurrences, our frequencies: freq underscore idx is equal to three. 89 00:07:04,060 --> 00:07:06,910 This is gonna be the signature of our function. 90 00:07:07,780 --> 00:07:09,240 Here's how I'd like it to work. 91 00:07:09,410 --> 00:07:18,550 I'll add a docstring with three pairs of single quotes, and this will read: Form a full matrix from a 92 00:07:18,550 --> 00:07:26,640 sparse matrix. And it's going to return a pandas DataFrame. 93 00:07:26,660 --> 00:07:30,790 The keyword arguments that we've specified above are as follows. 94 00:07:31,390 --> 00:07:34,100 So the first one was the sparse matrix, right? 95 00:07:34,120 --> 00:07:44,140 So this will be a NumPy array. Then we're gonna have the number of words, which was the size of the vocabulary, 96 00:07:45,560 --> 00:07:50,270 in other words the total number of tokens, right? 97 00:07:50,460 --> 00:07:52,250 The document index. 98 00:07:52,250 --> 00:08:02,000 It's just the position of the document ID in the sparse matrix, and by default 99 00:08:02,210 --> 00:08:04,170 it's the first column. 100 00:08:04,170 --> 00:08:08,440 Now I'll add the description for the other keyword arguments as well. 101 00:08:08,490 --> 00:08:15,190 And by then I'll have a brief description of what this function should do and how one should use it. 
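Put together, the signature and docstring described so far might look roughly like the sketch below. The exact wording of the docstring and the descriptions of word_idx, cat_idx and freq_idx are assumptions filled in by analogy with doc_idx, and the body is left as a placeholder.

```python
def make_full_matrix(sparse_matrix, nr_words, doc_idx=0, word_idx=1,
                     cat_idx=2, freq_idx=3):
    """
    Form a full matrix from a sparse matrix. Return a pandas DataFrame.

    Keyword arguments:
    sparse_matrix: NumPy array
    nr_words: size of the vocabulary, i.e. the total number of tokens
    doc_idx: position of the document ID in the sparse matrix (default: first column)
    word_idx: position of the word ID in the sparse matrix (default: second column)
    cat_idx: position of the label in the sparse matrix (default: third column)
    freq_idx: position of the occurrence count in the sparse matrix (default: fourth column)
    """
    # Placeholder only: the body is built up step by step in the lesson
    raise NotImplementedError


print(make_full_matrix.__doc__)
```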
102 00:08:15,200 --> 00:08:20,720 This is handy for ourselves, who might be looking at this code in the future and not remembering how 103 00:08:20,720 --> 00:08:22,270 it works exactly. 104 00:08:22,310 --> 00:08:27,260 And it's also handy for anybody else working with our code. And in Jupyter Notebook, when you press shift 105 00:08:27,260 --> 00:08:31,240 tab, the docstring is what you see pop up. 106 00:08:31,250 --> 00:08:33,380 Now how are we gonna do this? 107 00:08:33,470 --> 00:08:38,260 Well, let's first start out with an empty DataFrame. 108 00:08:38,390 --> 00:08:42,020 We can reuse some of this code that we've had above. 109 00:08:42,020 --> 00:08:44,330 So let's just copy this line here. 110 00:08:46,070 --> 00:08:50,110 Paste it in for the column names. I'll copy 111 00:08:50,100 --> 00:09:00,350 this one here, paste it in for the index names, but I'm going to rename this instead to doc underscore id 112 00:09:00,350 --> 00:09:01,720 underscore names. 113 00:09:01,850 --> 00:09:03,840 I think that's a lot more helpful. 114 00:09:04,100 --> 00:09:12,110 And we're also gonna make sure that we replace sparse underscore train underscore data with our parameter sparse 115 00:09:12,230 --> 00:09:14,000 underscore matrix. 116 00:09:14,000 --> 00:09:18,080 Otherwise we're going to have a big problem later on. 117 00:09:18,080 --> 00:09:18,740 So there you go. 118 00:09:19,800 --> 00:09:28,440 Then I'll copy these two lines of code and I'll come down and I'll add those in here. 119 00:09:28,560 --> 00:09:33,360 Now one of the things that you always want to check in Python is the indentation. Here 120 00:09:33,540 --> 00:09:37,560 I'm working within the body of my function, but here I am not. 121 00:09:37,600 --> 00:09:43,830 So I'm going to move this over a bit so that I'm within the body of the function, and then I'm going to have 122 00:09:43,830 --> 00:09:50,460 to change my index names to doc underscore id underscore names. 
123 00:09:50,550 --> 00:09:54,300 After all, we've renamed this on the line above. 124 00:09:54,300 --> 00:10:03,590 And then at the end I'll return this full matrix. Right here, in between these two lines of code, 125 00:10:03,630 --> 00:10:08,820 we're gonna be doing all our work: we're gonna be populating this full matrix. 126 00:10:08,820 --> 00:10:13,130 Here's what I've got in mind. Our sparse matrix kind of looks like this, right? 127 00:10:13,200 --> 00:10:20,190 We've got four columns and they're labeled document ID, word ID, category and occurrence. 128 00:10:20,220 --> 00:10:24,210 Now what I'm after for the full matrix is something like this. 129 00:10:24,330 --> 00:10:26,550 I want the document IDs in one column. 130 00:10:26,550 --> 00:10:32,850 I want the categories in one column, but then I want the word IDs as separate columns and I want zero 131 00:10:32,850 --> 00:10:36,020 values for all the words that don't occur. 132 00:10:36,090 --> 00:10:42,030 In other words, the frequencies or the occurrences of the individual tokens will of course be in the 133 00:10:42,030 --> 00:10:44,450 rows below the word IDs. 134 00:10:44,820 --> 00:10:51,810 Here's a fuller picture of the DataFrame that we're going to populate. Every single token has a separate 135 00:10:51,810 --> 00:10:54,680 column in this DataFrame. 136 00:10:54,690 --> 00:11:00,570 These columns go from zero to two thousand four hundred and ninety nine because we've got two thousand 137 00:11:00,570 --> 00:11:03,820 five hundred words in our vocabulary. 138 00:11:03,930 --> 00:11:07,350 Now of course many, many of the cells in these columns will be equal to zero. 139 00:11:07,890 --> 00:11:13,740 But what we're gonna do is go through our sparse matrix row by row by row by row, and we're 140 00:11:13,740 --> 00:11:20,150 going to populate the entries of our full matrix which are non-zero with the correct occurrence. 
141 00:11:20,160 --> 00:11:26,850 So without further ado, let's get started. The way we're gonna go through the sparse matrix is with a 142 00:11:26,850 --> 00:11:27,660 loop. 143 00:11:27,660 --> 00:11:32,150 Now I'm quite partial to the for loop, so for i in range. 144 00:11:32,370 --> 00:11:39,990 It's gonna go from zero to the number of rows in the sparse matrix, so that's going to be sparse underscore matrix 145 00:11:40,860 --> 00:11:47,090 dot shape, square brackets zero, colon at the end. 146 00:11:47,090 --> 00:11:53,480 Now let's grab the data that's in each individual row. Our document ID, or our document number if you 147 00:11:53,480 --> 00:11:54,380 will. 148 00:11:54,500 --> 00:11:59,580 It's gonna be equal to sparse underscore matrix, square brackets 149 00:11:59,570 --> 00:12:07,690 i, so the i-th row, and then it'll be the first value, so it'll be zero. 150 00:12:07,820 --> 00:12:12,200 The word ID is gonna be sparse underscore matrix, square brackets 151 00:12:12,200 --> 00:12:15,740 i, square brackets one, right? 152 00:12:15,740 --> 00:12:16,520 Our label, 153 00:12:16,640 --> 00:12:20,860 the category, will be sparse underscore matrix, square brackets 154 00:12:20,900 --> 00:12:23,440 i, square brackets two. 155 00:12:23,480 --> 00:12:29,480 Now what you can see is that for these hardcoded values we can actually use the parameters that we specified 156 00:12:29,480 --> 00:12:30,070 above. 157 00:12:30,110 --> 00:12:39,680 So for our document ID I'll replace the 0 with doc underscore idx. For our word ID I'll replace 158 00:12:39,680 --> 00:12:48,830 this one with word underscore idx, and for our label I'll replace this with cat underscore 159 00:12:48,830 --> 00:12:56,270 idx. The occurrence is of course equal to sparse underscore matrix, 160 00:12:56,270 --> 00:13:03,650 square brackets i, square brackets freq underscore idx. These four lines of code here 161 00:13:03,740 --> 00:13:10,400 store all the values in a particular row in four separate variables. 
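To make the loop concrete, here is a small sketch of just the row-reading part, run on a made-up three-row sparse array. The variable names doc_nr, word_id, label and occurrence follow the spoken description but their exact spelling is an assumption, and the rows list exists only so the values can be inspected afterwards.

```python
import numpy as np

# Hypothetical sparse data. Columns: document ID, word ID, category, occurrence
sparse_matrix = np.array([
    [0, 1, 1, 2],
    [0, 3, 1, 1],
    [2, 0, 0, 4],
])
doc_idx, word_idx, cat_idx, freq_idx = 0, 1, 2, 3

rows = []  # collected only for inspection in this sketch
for i in range(sparse_matrix.shape[0]):
    # First bracket picks the row, second bracket picks the value in that row
    doc_nr = sparse_matrix[i][doc_idx]
    word_id = sparse_matrix[i][word_idx]
    label = sparse_matrix[i][cat_idx]
    occurrence = sparse_matrix[i][freq_idx]
    rows.append((int(doc_nr), int(word_id), int(label), int(occurrence)))

print(rows)
```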
162 00:13:10,400 --> 00:13:16,220 If you'd like to see this square bracket notation in action, I can show you what I'm planning to do up 163 00:13:16,220 --> 00:13:16,780 here. 164 00:13:16,910 --> 00:13:27,180 So if I take our sparse training data and say we look at the rows between, I don't know, 10 and 13. 165 00:13:27,200 --> 00:13:30,890 In that case we'll see these values printed out here. 166 00:13:30,890 --> 00:13:36,980 What we're doing with our sparse training data, our NumPy array, and the square bracket notation is 167 00:13:36,980 --> 00:13:41,000 we're going into a particular row, say 10. 168 00:13:41,000 --> 00:13:42,900 So this will be the value of i. 169 00:13:43,150 --> 00:13:48,110 And then with that second pair of square brackets we're selecting a particular value. 170 00:13:48,110 --> 00:13:51,500 So if that value is zero we're getting the document ID. 171 00:13:52,070 --> 00:13:59,690 If that value is equal to one we're getting the word ID, for two we're getting the category, and for 172 00:13:59,690 --> 00:14:02,320 3 we're getting the frequency. 173 00:14:02,330 --> 00:14:09,230 This is all we're doing inside of our loop. Make sense? Now that we've extracted this data from a particular 174 00:14:09,230 --> 00:14:16,820 row, we can come down here and we can populate our full matrix with data at a particular cell. 175 00:14:17,390 --> 00:14:29,540 So full underscore matrix dot at, square brackets, doc underscore nr, comma, single quotes DOC underscore ID, end single quotes 176 00:14:29,990 --> 00:14:31,050 and square brackets. 177 00:14:31,340 --> 00:14:33,380 It's gonna be equal to our doc underscore nr. 178 00:14:37,010 --> 00:14:41,240 full underscore matrix dot at, square brackets, 179 00:14:41,240 --> 00:14:46,260 doc underscore nr, comma, single quotes 180 00:14:46,610 --> 00:14:56,280 CATEGORY, is gonna be equal to our label, and full underscore matrix dot at, square brackets, 181 00:14:56,650 --> 00:15:02,030 doc underscore nr, comma, word underscore id. 
182 00:15:02,740 --> 00:15:05,640 It's gonna be equal to the occurrence. 183 00:15:05,790 --> 00:15:12,850 Now if these three lines look a little bit mysterious, what we're doing here with dot at is we're selecting 184 00:15:13,090 --> 00:15:19,050 a particular cell in this DataFrame that we've created. Which cell do we select? 185 00:15:19,100 --> 00:15:21,460 Well, the second value here is the column name. 186 00:15:21,670 --> 00:15:26,200 So this one here will always go to the document ID column. 187 00:15:26,200 --> 00:15:28,670 This one here will always go to the category column. 188 00:15:28,780 --> 00:15:29,640 Right? 189 00:15:29,710 --> 00:15:32,650 And this first part here is the row label. 190 00:15:33,190 --> 00:15:38,950 So the row label will correspond to the document ID from the sparse matrix. 191 00:15:39,270 --> 00:15:43,760 So we're selecting a single cell here and then we're setting its value. 192 00:15:43,990 --> 00:15:49,780 The document ID of course has to go in the document ID column, the category of course has to go in the 193 00:15:49,780 --> 00:15:58,570 category column, and the frequency of a particular token is gonna go under the word ID of a particular 194 00:15:58,690 --> 00:16:01,010 document in this DataFrame. 195 00:16:01,300 --> 00:16:07,120 And this pretty much completes the body of our loop. We're grabbing the data from our sparse matrix and 196 00:16:07,120 --> 00:16:11,260 we're populating our previously empty DataFrame inside our loop. 197 00:16:11,290 --> 00:16:15,130 The only thing left to do is maybe set the index, right? 198 00:16:15,160 --> 00:16:24,520 So full underscore matrix dot set underscore index. It's gonna be the document ID column, single quotes 199 00:16:24,860 --> 00:16:32,100 DOC underscore ID, and as a second argument I will provide inplace is equal to True. 200 00:16:32,940 --> 00:16:33,250 Okay. 
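Assembled from the steps described so far, the whole function might look roughly like the sketch below. It is a reconstruction from the spoken walkthrough, not a copy of the notebook: the identifier spellings are assumptions, the column list is sized from the nr_words parameter rather than a global constant, and toy_sparse and toy_full are made-up names for a small demonstration.

```python
import numpy as np
import pandas as pd


def make_full_matrix(sparse_matrix, nr_words, doc_idx=0, word_idx=1,
                     cat_idx=2, freq_idx=3):
    """Form a full matrix from a sparse matrix. Return a pandas DataFrame."""
    # Assumption: columns are sized from nr_words instead of a global constant
    column_names = ['DOC_ID'] + ['CATEGORY'] + list(range(0, nr_words))
    doc_id_names = np.unique(sparse_matrix[:, doc_idx])

    # Start with an all-NaN DataFrame, then default every cell to zero
    full_matrix = pd.DataFrame(index=doc_id_names, columns=column_names)
    full_matrix.fillna(value=0, inplace=True)

    for i in range(sparse_matrix.shape[0]):
        doc_nr = sparse_matrix[i][doc_idx]
        word_id = sparse_matrix[i][word_idx]
        label = sparse_matrix[i][cat_idx]
        occurrence = sparse_matrix[i][freq_idx]

        # .at selects one cell by row label and column label
        full_matrix.at[doc_nr, 'DOC_ID'] = doc_nr
        full_matrix.at[doc_nr, 'CATEGORY'] = label
        full_matrix.at[doc_nr, word_id] = occurrence

    full_matrix.set_index('DOC_ID', inplace=True)
    return full_matrix


# Hypothetical sparse data. Columns: document ID, word ID, category, occurrence
toy_sparse = np.array([
    [0, 1, 1, 2],
    [0, 3, 1, 1],
    [2, 0, 0, 4],
])
toy_full = make_full_matrix(toy_sparse, nr_words=5)
print(toy_full)
```

Writing cell by cell with .at keeps the logic easy to follow, at the cost of one Python-level loop iteration per sparse row.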
201 00:16:33,280 --> 00:16:40,410 So let's shift enter on this and try this out. I'll include our benchmarking code here with 202 00:16:40,450 --> 00:16:47,740 percent percent time in the cell below, and then I'll create a variable which will hold on to our full 203 00:16:47,740 --> 00:16:51,820 matrix: full underscore train underscore data. 204 00:16:51,820 --> 00:16:54,150 This is gonna be all our training data. 205 00:16:54,640 --> 00:17:01,480 And then I'll make a function call to make underscore full underscore matrix, and in the parentheses we'll provide our sparse 206 00:17:01,480 --> 00:17:05,640 training data and our number of words. 207 00:17:05,710 --> 00:17:10,590 So our vocabulary size, VOCAB underscore SIZE. 208 00:17:10,960 --> 00:17:13,430 Let me hit shift enter on this. 209 00:17:13,480 --> 00:17:17,020 This should take 10 to 15 seconds to run. 210 00:17:17,020 --> 00:17:18,130 There we go. 211 00:17:18,130 --> 00:17:21,670 Now let's take a look at the head of this DataFrame. 212 00:17:21,670 --> 00:17:26,050 See what we've got: full underscore train underscore data dot head. 213 00:17:26,050 --> 00:17:30,110 So the good news is that this is kind of what I've expected to see. 214 00:17:30,160 --> 00:17:38,160 I've expected to see far more occurrences for the frequent words with word IDs 0, 1, 2, 3, 4 215 00:17:38,290 --> 00:17:44,920 than I see for the less frequent words with the word IDs close to two thousand five hundred, and 216 00:17:44,920 --> 00:17:52,230 this is the pattern that we should also see in the tail of our DataFrame. Brilliant. In the next lessons 217 00:17:52,620 --> 00:17:58,710 we're gonna start training our Naive Bayes model. In particular this means calculating the probabilities 218 00:17:58,950 --> 00:18:01,680 for the individual tokens. 219 00:18:01,680 --> 00:18:06,610 This is where all the probability theory that we covered in the previous module comes into play. 
220 00:18:06,720 --> 00:18:09,530 I'm pretty excited about this so I'll see you there.