1
00:00:00,550 --> 00:00:07,500
In this lesson we're gonna go from a sparse matrix to a full matrix.

2
00:00:07,570 --> 00:00:12,850
First I'll show you how to create an empty data frame that we can populate with values later on and

3
00:00:12,850 --> 00:00:20,000
then we will write the python function that will do this transformation from sparse to full matrix.

4
00:00:20,170 --> 00:00:22,580
Let me add a quick markdown cell here.

5
00:00:22,780 --> 00:00:30,310
That reads how to create an M T data frame.

6
00:00:30,320 --> 00:00:32,020
Now why might you want to do this.

7
00:00:32,060 --> 00:00:39,440
Well sometimes you want to create the data frame first and then populated with values later on the key

8
00:00:39,440 --> 00:00:45,530
things that you need of course with this is you need to specify what sort of column names you want how

9
00:00:45,530 --> 00:00:51,710
many columns you want to have how many rows you want what the index names should be and perhaps also

10
00:00:51,710 --> 00:00:54,320
provide some sort of dummy value for the cells right.

11
00:00:54,410 --> 00:01:00,610
Do you want the cells to be equal to zero or none or 1 or something else.

12
00:01:00,650 --> 00:01:02,290
Let's tackle one thing at a time.

13
00:01:02,360 --> 00:01:03,610
Let's start with the column names.

14
00:01:03,710 --> 00:01:09,790
So I'm going to create a variable that will hold onto all of these column and the school names.

15
00:01:09,780 --> 00:01:12,440
It's going to be equal to square brackets.

16
00:01:12,590 --> 00:01:19,190
Single quotes doc on the school I.D. When have our document I.D. for example as a column then we'll

17
00:01:19,190 --> 00:01:24,190
have a plus and let's add the category as well.

18
00:01:24,320 --> 00:01:25,940
So square brackets.

19
00:01:25,940 --> 00:01:27,170
Single quotes category.

20
00:01:27,740 --> 00:01:33,560
And then let's add our word ideas to go 0 1 2 3 4 all the way up to two thousand four hundred ninety

21
00:01:33,560 --> 00:01:34,560
nine.

22
00:01:34,610 --> 00:01:36,220
I'll pass this in as a range.

23
00:01:36,290 --> 00:01:47,200
So list parentheses range parentheses zero comma vocab size our vocab size is saved as a constant up

24
00:01:47,200 --> 00:01:52,300
top if you want to take a look at what the first five columns will look like they're gonna look like

25
00:01:52,300 --> 00:01:52,750
this.

26
00:01:52,750 --> 00:01:56,980
Column names square brackets colon five.

27
00:01:56,980 --> 00:01:58,300
There you go.

28
00:01:58,330 --> 00:02:04,330
So as you can see we've simply created a list that starts with two strings and then some numbers from

29
00:02:04,330 --> 00:02:05,990
our range object.

30
00:02:05,990 --> 00:02:07,420
Now of course the list is quite long right.

31
00:02:07,420 --> 00:02:15,580
So alien parentheses column names shows us that we've got two thousand five hundred and two columns.

32
00:02:15,580 --> 00:02:21,150
So two additional columns plus the number of words in our vocabulary.

33
00:02:21,520 --> 00:02:23,200
Now for our index names.

34
00:02:23,200 --> 00:02:30,760
So index underscore names we'll use maybe the unique values of the different document I.D. in our sparse

35
00:02:31,000 --> 00:02:32,200
training data.

36
00:02:32,200 --> 00:02:41,350
So I want to do this with NUM pi so NDP and I'm going to use the unique function from num Pi and I can

37
00:02:41,350 --> 00:02:50,410
specify my sparse training data here and select one particular column of this name pie array with square

38
00:02:50,410 --> 00:02:58,210
brackets colon comma and then zero for the first column index on the code names.

39
00:02:58,210 --> 00:03:09,250
Now looks like this 0 1 2 and then the last three values are 5 7 9 1 5 7 9 4 and 5 7 9 5.

40
00:03:09,250 --> 00:03:14,110
Now mind you these are not in numerical order of course because some of the documentaries of course

41
00:03:14,200 --> 00:03:20,170
belong to our training data and other documents have been excluded as we've seen in the last lesson

42
00:03:20,260 --> 00:03:22,330
of the previous module.

43
00:03:22,440 --> 00:03:22,710
All right.

44
00:03:22,720 --> 00:03:29,380
So with these two things in hand we can say create a data frame like so full on a school train on the

45
00:03:29,380 --> 00:03:40,480
school data is equal to PD for pandas don't data frame parentheses and then we'll specify some arguments

46
00:03:40,480 --> 00:03:40,880
here.

47
00:03:40,990 --> 00:03:48,610
The index is gonna be equal to our index names and the columns of this data frame are gonna be equal

48
00:03:48,610 --> 00:03:50,980
to our column names.

49
00:03:50,980 --> 00:03:52,180
Fair enough.

50
00:03:52,180 --> 00:03:52,730
Easy right.

51
00:03:53,440 --> 00:04:01,180
So let me shift enter on this so I'll run for a little while and we can look at our head of this data

52
00:04:01,180 --> 00:04:03,510
frame and see what it looks like.

53
00:04:03,550 --> 00:04:06,000
So this is interesting right.

54
00:04:06,040 --> 00:04:08,980
We've got any n not a number.

55
00:04:09,110 --> 00:04:12,100
Nan values for all the cells.

56
00:04:12,100 --> 00:04:19,240
What if we wanted this data frame instead to be filled with zero values as the default instead of these

57
00:04:19,240 --> 00:04:26,760
Nan values we can do that with a handy little function called Phil and a so we can take our data frame

58
00:04:27,790 --> 00:04:32,770
put a dollar for it and write fill in a parentheses.

59
00:04:33,040 --> 00:04:36,850
Value equals zero.

60
00:04:36,850 --> 00:04:45,520
This here is a method from the panel's data frame that allows you to facilitate a frame using a value

61
00:04:45,610 --> 00:04:47,310
of your choice.

62
00:04:47,380 --> 00:04:50,880
We will fill this data frame with the value 0.

63
00:04:51,070 --> 00:04:53,590
And we're also gonna use this in-place argument here.

64
00:04:54,220 --> 00:05:00,950
So instead of writing something like full on a train on it's got data is equal to full on a square train

65
00:05:01,510 --> 00:05:10,720
underscore data don't fill in a we will use in place to just replace our existing data frame in place

66
00:05:10,930 --> 00:05:13,230
is equal to true.

67
00:05:13,230 --> 00:05:13,690
There we go.

68
00:05:14,110 --> 00:05:21,220
So let's shift enter on this refresh this cell and then let's take a look at the head of our data frame.

69
00:05:21,220 --> 00:05:27,430
What are instead of the nine values we have our default values equal to zero now.

70
00:05:27,590 --> 00:05:35,800
Now let's create a full matrix from a sparse matrix.

71
00:05:36,640 --> 00:05:43,930
So if imported a sparse matrix earlier from the text file and let's write a function that will create

72
00:05:43,930 --> 00:05:46,800
our full matrix for us.

73
00:05:46,810 --> 00:05:55,720
Now remember the structure of our sparse matrix was document 80 word i.e. label and then occurrence

74
00:05:56,230 --> 00:05:57,500
of this particular word.

75
00:05:57,580 --> 00:05:59,610
That the structure that we're working with.

76
00:05:59,680 --> 00:06:04,130
These are columns 0 1 2 and 3.

77
00:06:04,240 --> 00:06:11,920
So let's define a function make underscore full on a school matrix and this function shall have quite

78
00:06:11,920 --> 00:06:13,360
a few arguments.

79
00:06:13,540 --> 00:06:19,660
The first one of course will be the sparse matrix that serves as a starting point.

80
00:06:19,660 --> 00:06:26,310
The second argument shall be the number of words an R on a score words in the vocabulary.

81
00:06:26,430 --> 00:06:29,700
The third one maybe the document index.

82
00:06:29,860 --> 00:06:33,220
So our document I.D. art index 0.

83
00:06:33,220 --> 00:06:37,510
So I'll say doc on the school index is equal to zero.

84
00:06:37,690 --> 00:06:43,660
So I'll just give that one a default value then I'll specify as well what the index is for the word

85
00:06:43,750 --> 00:06:44,070
right.

86
00:06:44,070 --> 00:06:52,300
So that way we don't forget the word index it's one that category index was two capped the score Ida

87
00:06:52,300 --> 00:06:54,350
X is equal to 2.

88
00:06:54,430 --> 00:07:04,020
And lastly are number of occurrences are frequencies frequency on the school index is equal to three.

89
00:07:04,060 --> 00:07:06,910
This is gonna be the signature of our function.

90
00:07:07,780 --> 00:07:09,240
Here's how I'd like it to work.

91
00:07:09,410 --> 00:07:18,550
Melody Doc string with three pairs of single quotes and this will read um form a full matrix from a

92
00:07:18,550 --> 00:07:26,640
sparse matrix and it's going to uh return a panda's data frame.

93
00:07:26,660 --> 00:07:30,790
The keyword arguments that we've specified above are as follows.

94
00:07:31,390 --> 00:07:34,100
So the first one was the sparse matrix right.

95
00:07:34,120 --> 00:07:44,140
So this will be a non pi array then we're gonna have the number of words which was the size of the vocabulary

96
00:07:45,560 --> 00:07:50,270
in other words the total number of tokens right.

97
00:07:50,460 --> 00:07:52,250
The document index.

98
00:07:52,250 --> 00:08:02,000
It's just the position of the document I.D. in the sparse matrix and by default.

99
00:08:02,210 --> 00:08:04,170
It's the first column.

100
00:08:04,170 --> 00:08:08,440
Now a lot of the description for the other keyword arguments as well.

101
00:08:08,490 --> 00:08:15,190
And by then I'll have a brief description of what this function should do and how one should use it.

102
00:08:15,200 --> 00:08:20,720
This is handy for ourselves who might be looking at this code in the future and not remembering how

103
00:08:20,720 --> 00:08:22,270
it works exactly.

104
00:08:22,310 --> 00:08:27,260
And it's also handy for anybody else working with our code and in Jupiter notebook when you press shift

105
00:08:27,260 --> 00:08:31,240
tab the dock string is what you see pop up.

106
00:08:31,250 --> 00:08:33,380
Now how we're gonna do this.

107
00:08:33,470 --> 00:08:38,260
Well let's first start out with an empty data frame.

108
00:08:38,390 --> 00:08:42,020
We can reuse some of this code that we've had above.

109
00:08:42,020 --> 00:08:44,330
So let's just copy this line here.

110
00:08:46,070 --> 00:08:50,110
Pasted in for the column names a copy.

111
00:08:50,100 --> 00:09:00,350
This one here pasted in for the index names but I'm going to rename this instead to document I.D. and

112
00:09:00,350 --> 00:09:01,720
it's got names.

113
00:09:01,850 --> 00:09:03,840
I think that's a lot more helpful.

114
00:09:04,100 --> 00:09:12,110
And we also gonna make sure that we replace spots on a train and a squat ditto with our parameter spouse

115
00:09:12,230 --> 00:09:14,000
on a square matrix.

116
00:09:14,000 --> 00:09:18,080
Otherwise we're going to have a big problem later on.

117
00:09:18,080 --> 00:09:18,740
So there you go.

118
00:09:19,800 --> 00:09:28,440
Then I'll copy these two lines of code and I'll come down and I'll add those in here.

119
00:09:28,560 --> 00:09:33,360
Now one of the things that you always want to check in Python is the indentation here.

120
00:09:33,540 --> 00:09:37,560
I'm working within the body of my function but here I am not.

121
00:09:37,600 --> 00:09:43,830
So when I move this over a bit so that I'm within the body of the function and then I'm going to have

122
00:09:43,830 --> 00:09:50,460
to change my index names to document I.D. names.

123
00:09:50,550 --> 00:09:54,300
After all we've renamed this on the line above.

124
00:09:54,300 --> 00:10:03,590
And then at the end I'll return this full matrix right here in between these two lines of code.

125
00:10:03,630 --> 00:10:08,820
We're gonna be doing all our work we're gonna be populating this full matrix.

126
00:10:08,820 --> 00:10:13,130
Here's what I've got in mind our sparse matrix kind of looks like this right.

127
00:10:13,200 --> 00:10:20,190
We've got four columns and they're labeled document I.D. word Daddy category and occurrence.

128
00:10:20,220 --> 00:10:24,210
Now what I'm after for the full matrix is something like this.

129
00:10:24,330 --> 00:10:26,550
I want the document is in one column.

130
00:10:26,550 --> 00:10:32,850
I want the categories in one column but then I want the word I.D. as separate columns and I want zero

131
00:10:32,850 --> 00:10:36,020
values for all the words that don't occur.

132
00:10:36,090 --> 00:10:42,030
In other words the frequencies or the occurrences of the individual tokens will of course be in the

133
00:10:42,030 --> 00:10:44,450
rows below the word I.D..

134
00:10:44,820 --> 00:10:51,810
Here's a fuller picture of the data frame that we're going to populate every single token has a separate

135
00:10:51,810 --> 00:10:54,680
column in this data frame.

136
00:10:54,690 --> 00:11:00,570
These columns go from zero to two thousand four hundred and ninety nine because we've got two thousand

137
00:11:00,570 --> 00:11:03,820
five hundred words in our vocabulary.

138
00:11:03,930 --> 00:11:07,350
Now of course many many of these columns will be equal to zero.

139
00:11:07,890 --> 00:11:13,740
But what we're gonna do is going to go through our sparse matrix row by row by row by row and we're

140
00:11:13,740 --> 00:11:20,150
going to populate the entries of our full matrix which are non-zero with the correct occurrence.

141
00:11:20,160 --> 00:11:26,850
So without further ado let's get started the way we're gonna go through the sparse matrix is with a

142
00:11:26,850 --> 00:11:27,660
loop.

143
00:11:27,660 --> 00:11:32,150
Now I'm quite partial to the for loop so for I in range.

144
00:11:32,370 --> 00:11:39,990
It's gonna go from zero to the number of rows in the sparse matrix so that's going to be sparse matrix

145
00:11:40,860 --> 00:11:47,090
dot shape square brackets zero colon at the end.

146
00:11:47,090 --> 00:11:53,480
Now let's grab the data that's in each individual row our document I.D. or our document number if you

147
00:11:53,480 --> 00:11:54,380
will.

148
00:11:54,500 --> 00:11:59,580
It's gonna be equal to sparse matrix square brackets.

149
00:11:59,570 --> 00:12:07,690
I so the eighth row and then it'll be the first value so it'll be zero.

150
00:12:07,820 --> 00:12:12,200
The word I.D. is gonna be sparse matrix square brackets.

151
00:12:12,200 --> 00:12:15,740
I square brackets one right.

152
00:12:15,740 --> 00:12:16,520
Our label.

153
00:12:16,640 --> 00:12:20,860
The category will be the sparse matrix square brackets.

154
00:12:20,900 --> 00:12:23,440
I square brackets too.

155
00:12:23,480 --> 00:12:29,480
Now what you can see is that for these hardcoded values we can actually use the parameters that we specified

156
00:12:29,480 --> 00:12:30,070
above.

157
00:12:30,110 --> 00:12:39,680
So far our document 90 I'll replace the 0 with Doc on a school I.D. X for our word I.D. I'll replace

158
00:12:39,680 --> 00:12:48,830
this one with word on the school I.D. X and for our label of replace this with Cat on a school I.D.

159
00:12:48,830 --> 00:12:56,270
X the occurrence is of course equal to spots on the school matrix.

160
00:12:56,270 --> 00:13:03,650
Square brackets I square brackets frequency I.D. x these four lines of code here.

161
00:13:03,740 --> 00:13:10,400
Store all the values in a particular row in four separate variables.

162
00:13:10,400 --> 00:13:16,220
If you'd like to see this square bracket notation in action I can show you what I'm planning to do up

163
00:13:16,220 --> 00:13:16,780
here.

164
00:13:16,910 --> 00:13:27,180
So if I take our response training data and say we look at the rows between I don't know 10 and 13.

165
00:13:27,200 --> 00:13:30,890
In that case we'll see these values printed out here.

166
00:13:30,890 --> 00:13:36,980
What we're doing with our sparse training data our no higher rate and the square bracket notation is

167
00:13:36,980 --> 00:13:41,000
we're going into a particular row with say 10.

168
00:13:41,000 --> 00:13:42,900
So this will be the value of AI.

169
00:13:43,150 --> 00:13:48,110
And then for that second pair of square brackets we're selecting a particular value.

170
00:13:48,110 --> 00:13:51,500
So if that value is zero we're getting the document.

171
00:13:52,070 --> 00:13:59,690
If that value is equal to one we're getting the word I.D. for it to work getting the category and for

172
00:13:59,690 --> 00:14:02,320
3 we're getting the frequency.

173
00:14:02,330 --> 00:14:09,230
This is all we're doing inside of our loop makes sense now that we've extracted this data from a particular

174
00:14:09,230 --> 00:14:16,820
row we can come down here and we can populate our full matrix with data at a particular cell.

175
00:14:17,390 --> 00:14:29,540
So full matrix dot at square brackets document no comma single quotes document I.D. and single quotes

176
00:14:29,990 --> 00:14:31,050
and square brackets.

177
00:14:31,340 --> 00:14:33,380
It's gonna be equal to our document

178
00:14:37,010 --> 00:14:41,240
fool on his call matrix at square brackets.

179
00:14:41,240 --> 00:14:46,260
Document no comma single quotes.

180
00:14:46,610 --> 00:14:56,280
Category is gonna be equal to our label and full on a school matrix at square brackets.

181
00:14:56,650 --> 00:15:02,030
Document no comma word I.D..

182
00:15:02,740 --> 00:15:05,640
It's gonna be equal to the occurrence.

183
00:15:05,790 --> 00:15:12,850
Now if these three lines look a little bit mysterious what we're doing here with dot at is we're selecting

184
00:15:13,090 --> 00:15:19,050
a particular cell in this data frame that we've created which cell that we select.

185
00:15:19,100 --> 00:15:21,460
Well second value here is the column name.

186
00:15:21,670 --> 00:15:26,200
So this one here will always go to the document I.D. column.

187
00:15:26,200 --> 00:15:28,670
This one here will always go to the category column.

188
00:15:28,780 --> 00:15:29,640
Right.

189
00:15:29,710 --> 00:15:32,650
And this first part here is the row number.

190
00:15:33,190 --> 00:15:38,950
So the row number will correspond to the document i.e. from the sparse matrix.

191
00:15:39,270 --> 00:15:43,760
So we're selecting a single cell here and then we're setting its value.

192
00:15:43,990 --> 00:15:49,780
The document I.D. of course has to go the document and column the category of course has to go in the

193
00:15:49,780 --> 00:15:58,570
category column and the frequency of a particular token it's gonna go under the word I.D. of a particular

194
00:15:58,690 --> 00:16:01,010
document in this data frame.

195
00:16:01,300 --> 00:16:07,120
And this pretty much completes the body of our loop we're grabbing the data from our sparse matrix and

196
00:16:07,120 --> 00:16:11,260
we're populating our previously empty data frame inside our loop.

197
00:16:11,290 --> 00:16:15,130
The only thing left to do is maybe set the index right.

198
00:16:15,160 --> 00:16:24,520
So full matrix don't set underscore index it's gonna be equal to the document that column single quotes

199
00:16:24,860 --> 00:16:32,100
doc on a school I.D. and as a second argument I will provide in place is equal to true.

200
00:16:32,940 --> 00:16:33,250
Okay.

201
00:16:33,280 --> 00:16:40,410
So let's shift enter on this and try this out I'll include our benchmarking code here with.

202
00:16:40,450 --> 00:16:47,740
Percent percent time in the cell below and then I'll create a variable which we'll hold on to our full

203
00:16:47,740 --> 00:16:51,820
matrix full honest quatrain on a score data.

204
00:16:51,820 --> 00:16:54,150
This is gonna be all our training data.

205
00:16:54,640 --> 00:17:01,480
And when I make a function call to make full matrix and in the parentheses we'll provide our sparse

206
00:17:01,480 --> 00:17:05,640
training data and our number of words.

207
00:17:05,710 --> 00:17:10,590
So our vocabulary size vocab on a school science.

208
00:17:10,960 --> 00:17:13,430
Let me hit shift enter on this.

209
00:17:13,480 --> 00:17:17,020
This should take 10 to 15 seconds to run.

210
00:17:17,020 --> 00:17:18,130
There we go.

211
00:17:18,130 --> 00:17:21,670
Now let's take a look at the head of the state of frame.

212
00:17:21,670 --> 00:17:26,050
See what we've got full on as Katrina was called data don't hit.

213
00:17:26,050 --> 00:17:30,110
So the good news is that this is kind of what I've expected to see.

214
00:17:30,160 --> 00:17:38,160
I've expected to see far more occurrences for the frequent words with word idea 0 1 2 3 4.

215
00:17:38,290 --> 00:17:44,920
Then I see for the less frequent words with the word that he's close to two thousand five hundred and

216
00:17:44,920 --> 00:17:52,230
this is the pattern that we should also see in the tail of our data frame brilliant in the next lessons

217
00:17:52,620 --> 00:17:58,710
we're gonna start training our naive pays model in particular this means calculating the probabilities

218
00:17:58,950 --> 00:18:01,680
for the individual tokens.

219
00:18:01,680 --> 00:18:06,610
This is where all the probability theory that we covered in the previous module comes into play.

220
00:18:06,720 --> 00:18:09,530
I'm pretty excited about this so I'll see you there.