1 00:00:00,133 --> 00:00:02,466 Hello and welcome to this R tutorial. 2 00:00:02,466 --> 00:00:03,300 So that's it. 3 00:00:03,300 --> 00:00:04,300 We did the first step 4 00:00:04,300 --> 00:00:07,500 of natural language processing, which consisted of cleaning the text 5 00:00:07,500 --> 00:00:08,566 we're working with. 6 00:00:08,566 --> 00:00:11,666 And now it's time to create the sparse matrix of features 7 00:00:11,900 --> 00:00:13,766 containing all the different reviews 8 00:00:13,766 --> 00:00:17,300 in the rows and all the different words of the reviews in the columns. 9 00:00:17,633 --> 00:00:21,600 So as a reminder, what we're about to build is a huge table 10 00:00:21,800 --> 00:00:24,333 in which the rows are the 1000 reviews. 11 00:00:24,333 --> 00:00:26,800 So we're going to have one row for each review. 12 00:00:26,800 --> 00:00:28,700 So we're going to have 1000 rows. 13 00:00:28,700 --> 00:00:31,133 And then the columns are going to contain all the words 14 00:00:31,133 --> 00:00:33,900 we can find in the 1000 reviews in this corpus, 15 00:00:33,900 --> 00:00:36,700 that is, the 1000 cleaned reviews. 16 00:00:36,700 --> 00:00:39,633 And so basically what this means is that we are going to take 17 00:00:39,633 --> 00:00:43,133 all the different words in the 1000 cleaned reviews in the corpus, 18 00:00:43,333 --> 00:00:46,433 and we're going to create one column for each word. 19 00:00:46,733 --> 00:00:51,700 So suppose we count in total 1,500 words in this corpus of reviews. 20 00:00:51,933 --> 00:00:56,700 Well, that means that our huge table is going to contain 1,500 columns. 21 00:00:57,066 --> 00:01:01,233 And then for each cell in this huge table, well, each cell will correspond 22 00:01:01,233 --> 00:01:05,633 to one review corresponding to the row and one word corresponding to the column.
23 00:01:05,966 --> 00:01:07,700 And so the value contained in 24 00:01:07,700 --> 00:01:11,233 the cell is the number of times the word appears in the review. 25 00:01:11,666 --> 00:01:15,000 So, as we explained earlier, since most of the words don't appear 26 00:01:15,000 --> 00:01:18,600 in a given review, well, most of the cells will contain a zero. 27 00:01:19,166 --> 00:01:22,633 And then of course we'll get a few ones because each review is composed 28 00:01:22,633 --> 00:01:24,233 of 5 to 10 words. 29 00:01:24,233 --> 00:01:28,766 So in each row we're going to have 5 to 10 cells having a one, 30 00:01:28,766 --> 00:01:31,800 and all the other cells will have a zero, and the cells that have 31 00:01:31,800 --> 00:01:35,266 a one will be in the columns corresponding to the words that are in the review. 32 00:01:35,733 --> 00:01:39,566 And maybe sometimes, but very rarely, we'll get a 2 or a 3. 33 00:01:39,800 --> 00:01:43,633 That happens when the word appears twice or three times in a review. 34 00:01:43,966 --> 00:01:45,400 I can give you a simple example. 35 00:01:45,400 --> 00:01:46,766 Let's imagine that we have 36 00:01:46,766 --> 00:01:50,933 a super positive review saying, "I love this restaurant very, very much." 37 00:01:51,166 --> 00:01:54,366 Well, in this review the word "very" appears twice. 38 00:01:54,600 --> 00:01:58,200 So for this particular review in the table, that is, let's say it's 39 00:01:58,200 --> 00:01:59,400 row 100, 40 00:01:59,400 --> 00:02:01,566 well, in the cell that belongs to this row 41 00:02:01,566 --> 00:02:04,266 and that belongs to the column corresponding to the word 42 00:02:04,266 --> 00:02:08,466 "very", well, we'll get a 2 because "very" appears twice in this review. 43 00:02:09,033 --> 00:02:11,700 So that could happen. But it's very rare.
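To make the "very, very much" example concrete, here is a minimal sketch in base R that counts the words of one review. The `review` string is a hypothetical, hand-cleaned version of the sentence from the example (lowercased, punctuation removed); it is not taken from the course data set.

```r
# Hypothetical cleaned review (lowercased, punctuation removed)
review = "i love this restaurant very very much"

# Split on spaces to get the individual words
words = strsplit(review, " ")[[1]]

# Count how many times each distinct word occurs in the review
counts = table(words)

counts[["very"]]  # 2, because "very" appears twice
counts[["love"]]  # 1
```

Each distinct word would be one column of the table, and these counts would be the values in the row for this review; every other column would hold a zero.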
44 00:02:11,700 --> 00:02:15,700 And what's most important to understand is that in this huge table 45 00:02:15,700 --> 00:02:20,633 we'll get mostly zeros, a few ones, and very few twos or threes. 46 00:02:20,966 --> 00:02:25,366 And we'll get so many zeros that we call this table a sparse matrix. 47 00:02:25,633 --> 00:02:28,833 A sparse matrix is a table that contains a lot of zeros. 48 00:02:29,066 --> 00:02:31,433 It contains very few non-zero values. 49 00:02:31,433 --> 00:02:32,333 And that's exactly 50 00:02:32,333 --> 00:02:35,333 what we're about to obtain because of what we've just explained. 51 00:02:36,000 --> 00:02:39,900 And it's from the table we're building that we get the word sparsity. 52 00:02:40,200 --> 00:02:43,733 Sparsity refers to that situation where we have a lot of zeros. 53 00:02:44,266 --> 00:02:48,066 And speaking of sparsity, let's also keep in mind that if we cleaned 54 00:02:48,100 --> 00:02:51,900 all the reviews here in this first step of natural language processing, it's 55 00:02:51,900 --> 00:02:55,766 in order to reduce as much as possible the future sparsity 56 00:02:55,766 --> 00:02:58,900 that will occur in this huge table that we're about to build. 57 00:02:59,333 --> 00:03:02,666 So that's the whole point behind this first step here, 58 00:03:02,666 --> 00:03:03,833 cleaning the text. 59 00:03:03,833 --> 00:03:05,766 It's to avoid having too much sparsity. 60 00:03:05,766 --> 00:03:09,900 That is, it's to avoid having a table that is too big, with too many columns. 61 00:03:09,900 --> 00:03:13,800 Because remember, one column is created for each word in the corpus. 62 00:03:14,100 --> 00:03:17,700 So by doing all these steps here, we removed a lot of words 63 00:03:17,700 --> 00:03:21,166 and a lot of characters, punctuation, numbers, etc., 64 00:03:21,166 --> 00:03:24,200 so that in this final huge table we get the minimum 65 00:03:24,200 --> 00:03:27,200 number of words and therefore the minimum number of columns.
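As a rough illustration of what "sparsity" measures, the sketch below computes the share of zero cells in a small table. The matrix `m` is a made-up toy example (3 reviews, 5 words), not the real table of 1000 reviews.

```r
# Hypothetical toy table: 3 reviews (rows) x 5 words (columns)
m = matrix(c(1, 0, 0, 0, 0,
             0, 2, 0, 0, 0,
             0, 0, 0, 1, 0),
           nrow = 3, byrow = TRUE)

# Sparsity = number of zero cells divided by the total number of cells
sparsity = sum(m == 0) / length(m)

sparsity  # 12 zero cells out of 15, i.e. 0.8
```

Dropping a column that holds only a single non-zero value shrinks the table, which is why removing rare words is a way to keep the table manageable.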
66 00:03:27,500 --> 00:03:30,833 And one last quick reminder: we are creating this table 67 00:03:30,833 --> 00:03:33,966 in order to have the framework of classification models, 68 00:03:34,200 --> 00:03:38,133 that is, you know, having several independent variables and one dependent 69 00:03:38,133 --> 00:03:38,733 variable. 70 00:03:38,733 --> 00:03:41,266 We haven't created the dependent variable yet. 71 00:03:41,266 --> 00:03:44,266 It's actually in this data set here: we will just take 72 00:03:44,466 --> 00:03:47,900 the second column of this data set because this contains the outcome, 73 00:03:47,900 --> 00:03:51,000 whether the review is positive or negative. We can see that here. 74 00:03:51,000 --> 00:03:53,166 It's the second column, Liked: one 75 00:03:53,166 --> 00:03:56,166 if the review is positive and zero if the review is negative. 76 00:03:56,166 --> 00:03:58,333 So that's the dependent variable column. 77 00:03:58,333 --> 00:04:03,566 And the independent variables are going to be nothing else than these columns 78 00:04:03,566 --> 00:04:07,200 corresponding to each one of the words in the cleaned reviews of the corpus. 79 00:04:07,433 --> 00:04:10,800 Because for each review, that is, each observation, we can link 80 00:04:10,800 --> 00:04:13,833 the review to each of the columns, because for each of the reviews, 81 00:04:13,833 --> 00:04:18,000 we can associate a value for each of the columns, and this value is 82 00:04:18,000 --> 00:04:21,966 the number of times the word corresponding to the column appears in the review. 83 00:04:22,200 --> 00:04:24,833 So that's how we create our independent variables. 84 00:04:24,833 --> 00:04:27,333 And then we'll create our dependent variable. 85 00:04:27,333 --> 00:04:31,200 And therefore we'll get the classification framework we're used to working with. 86 00:04:31,200 --> 00:04:34,200 And eventually we win, because we will have everything.
87 00:04:34,333 --> 00:04:38,300 We will have our independent variables, we will have our dependent variable. 88 00:04:38,566 --> 00:04:41,400 And we already have all our classification models. 89 00:04:41,400 --> 00:04:43,633 Those are the models we made in part three. 90 00:04:43,633 --> 00:04:44,500 So we will just need 91 00:04:44,500 --> 00:04:48,433 to apply these models on the new data set that we're about to create, 92 00:04:48,600 --> 00:04:52,633 which contains the independent variables as the words and the dependent variable 93 00:04:52,800 --> 00:04:55,800 as the Liked column in our original data set. 94 00:04:55,800 --> 00:04:56,166 All right. 95 00:04:56,166 --> 00:04:58,200 So let's do it. Let's create this table. 96 00:04:58,200 --> 00:05:01,533 And in R we can do it very efficiently using a function, 97 00:05:01,866 --> 00:05:04,666 a function that is called DocumentTermMatrix. 98 00:05:04,666 --> 00:05:08,400 And it's super easy because this function will only take one argument. 99 00:05:08,600 --> 00:05:11,533 And as you might have guessed, it's going to be the corpus. 100 00:05:11,533 --> 00:05:12,400 And that's it. 101 00:05:12,400 --> 00:05:17,333 This will create this huge sparse matrix with all the 1000 reviews in the rows, 102 00:05:17,333 --> 00:05:20,333 and with all the words of the reviews in the columns. 103 00:05:20,466 --> 00:05:21,466 So let's do it. 104 00:05:21,466 --> 00:05:25,733 Let's call this sparse matrix of features dtm, 105 00:05:25,733 --> 00:05:29,333 because the function we're about to use is DocumentTermMatrix. 106 00:05:29,333 --> 00:05:31,700 So, so far we'll call it dtm. 107 00:05:31,700 --> 00:05:32,733 So equals. 108 00:05:32,733 --> 00:05:35,400 And then we use this super function, 109 00:05:36,533 --> 00:05:37,366 DocumentTermMatrix. 110 00:05:37,366 --> 00:05:39,633 Here it is. I just need to press enter. 111 00:05:39,633 --> 00:05:44,966 And as I just said, we just need to input one argument, which is our corpus.
112 00:05:45,400 --> 00:05:45,966 All right. 113 00:05:45,966 --> 00:05:47,966 And that's done. Here it is, corpus. 114 00:05:47,966 --> 00:05:50,966 This will create our sparse matrix of features. 115 00:05:51,033 --> 00:05:54,033 So I'm going to select this line and execute, 116 00:05:54,166 --> 00:05:57,166 and done: the sparse matrix of features is created. 117 00:05:57,233 --> 00:05:59,166 It appears right here, dtm. 118 00:05:59,166 --> 00:06:02,166 We can click on this button here to have some info. 119 00:06:02,233 --> 00:06:04,366 And actually what's interesting to see now is the total 120 00:06:04,366 --> 00:06:08,166 number of words counted in the corpus to create all the columns. 121 00:06:08,166 --> 00:06:12,000 And we can see this total count here: 1577. 122 00:06:12,433 --> 00:06:15,800 So that means that the number of columns, indicated by ncol 123 00:06:15,800 --> 00:06:18,900 here in our document-term matrix 124 00:06:18,900 --> 00:06:22,766 or sparse matrix, is 1577. 125 00:06:23,133 --> 00:06:26,600 So that means that this huge table has 1000 rows. 126 00:06:26,600 --> 00:06:29,600 We expected this because of course we have 1000 reviews, 127 00:06:29,833 --> 00:06:33,566 but we couldn't predict the number of columns, simply because 128 00:06:33,566 --> 00:06:35,633 that was the total number of words in the reviews. 129 00:06:35,633 --> 00:06:36,966 So we couldn't count them. 130 00:06:36,966 --> 00:06:40,100 But we can see this number here: 1577. 131 00:06:40,500 --> 00:06:41,933 So that's already a big table. 132 00:06:41,933 --> 00:06:44,966 But be prepared if you're working with more complicated 133 00:06:44,966 --> 00:06:47,966 text or longer text like articles or books. 134 00:06:48,266 --> 00:06:50,933 Well, you might get a lot more columns here 135 00:06:50,933 --> 00:06:53,066 because you will get a lot more words.
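A minimal sketch of the call described above, assuming the tm package is installed. The three short strings in `reviews` are hypothetical stand-ins for the cleaned corpus built in the previous step, so the numbers will differ from the 1000 rows and 1577 columns seen in the video.

```r
# Requires the tm package: install.packages("tm")
library(tm)

# Hypothetical stand-in for the cleaned reviews from the previous step
reviews = c("love restaur much", "crust good", "wait long food bad")
corpus = VCorpus(VectorSource(reviews))

# One row per review, one column per distinct word
dtm = DocumentTermMatrix(corpus)

nrow(dtm)  # number of reviews (rows)
ncol(dtm)  # number of distinct words (columns)
dtm        # printing shows the dimensions and the sparsity
```

With the default settings, `DocumentTermMatrix` ignores words shorter than three characters, which is one reason the column count can be lower than a manual count of distinct words.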
136 00:06:53,066 --> 00:06:54,300 So what you'll have to do, 137 00:06:54,300 --> 00:06:58,200 and you can ask me any questions about that in the Q&A, is reduce 138 00:06:58,200 --> 00:07:02,666 the sparsity even more by filtering the words in your text. 139 00:07:03,033 --> 00:07:06,200 And speaking of filtering, that's what we're going to do right now. 140 00:07:06,600 --> 00:07:10,900 We are going to apply a filter to clean the reviews even more 141 00:07:11,233 --> 00:07:14,233 by only considering the most frequent words. 142 00:07:14,266 --> 00:07:18,233 That means that it's like we're going to add a step in this text 143 00:07:18,233 --> 00:07:19,433 cleaning process, 144 00:07:19,433 --> 00:07:23,900 which will consist of only taking the words that are the most frequent. 145 00:07:24,200 --> 00:07:27,200 For example, the words that appear in only one review, 146 00:07:27,300 --> 00:07:30,200 well, they might be removed because they're not frequent, 147 00:07:30,200 --> 00:07:31,533 they only appear once. 148 00:07:31,533 --> 00:07:35,600 Only one cell in the column contains a one, because these words only appear 149 00:07:35,600 --> 00:07:36,666 in one review. 150 00:07:36,666 --> 00:07:38,700 And these words, of course, are not very relevant 151 00:07:38,700 --> 00:07:42,600 because since they only appear in one review, well, our machine learning 152 00:07:42,600 --> 00:07:45,900 classification model will not be able to establish any correlation 153 00:07:45,900 --> 00:07:49,766 between this word and the outcome, whether the review is positive or negative, 154 00:07:49,900 --> 00:07:52,400 because indeed, to understand such correlations, 155 00:07:52,400 --> 00:07:55,400 the word would need to appear in at least two reviews. 156 00:07:55,600 --> 00:07:58,133 So that's the kind of words we're going to remove. 157 00:07:58,133 --> 00:08:01,100 And again, this is in order to reduce sparsity.
158 00:08:01,100 --> 00:08:03,900 And speaking of sparsity, I will show you something very interesting 159 00:08:03,900 --> 00:08:08,100 right now. If we go to the console and type dtm here, 160 00:08:08,433 --> 00:08:13,000 then we'll get other information about this sparse matrix of features. 161 00:08:13,333 --> 00:08:14,700 And the information that I want to 162 00:08:14,700 --> 00:08:18,333 highlight here is of course this sparsity information. 163 00:08:18,733 --> 00:08:22,666 And as you can see, the sparsity is 100% right now. 164 00:08:22,866 --> 00:08:25,333 And that's because there are a lot of zeros in the matrix, 165 00:08:25,333 --> 00:08:29,033 and also because we haven't filtered any non-frequent words yet. 166 00:08:29,233 --> 00:08:30,533 So that's what we'll do right now. 167 00:08:30,533 --> 00:08:33,600 We will filter all the words that appear only once. 168 00:08:33,800 --> 00:08:36,900 We will filter all the words that are not frequent in the reviews. 169 00:08:37,400 --> 00:08:37,666 All right. 170 00:08:37,666 --> 00:08:38,666 So let's do it. 171 00:08:38,666 --> 00:08:42,900 To do this we are going to update our document-term matrix. 172 00:08:43,066 --> 00:08:45,900 So we're taking dtm again here. 173 00:08:45,900 --> 00:08:46,366 All right. 174 00:08:46,366 --> 00:08:50,400 Because we're updating our sparse matrix. And equals. 175 00:08:50,833 --> 00:08:53,833 And now we're going to use a function, a very practical function 176 00:08:53,833 --> 00:08:57,500 that will filter the non-frequent words of our sparse matrix, 177 00:08:57,766 --> 00:09:00,766 which so far is nothing else than dtm. 178 00:09:01,000 --> 00:09:03,233 So dtm will be one of the inputs. 179 00:09:03,233 --> 00:09:07,200 And we will filter all the non-frequent words by specifying a proportion 180 00:09:07,200 --> 00:09:10,633 of non-frequent words that we want to remove from the sparse matrix.
181 00:09:10,966 --> 00:09:13,600 And this proportion of non-frequent words will be obtained 182 00:09:13,600 --> 00:09:15,766 thanks to the second input of this function, 183 00:09:15,766 --> 00:09:17,033 because the second input is 184 00:09:17,033 --> 00:09:20,600 the percentage of the most frequent words we want to keep in the reviews. 185 00:09:20,900 --> 00:09:23,800 So let's say we want to keep 99% 186 00:09:23,800 --> 00:09:26,800 of the words in the reviews that are the most frequent words. 187 00:09:26,866 --> 00:09:30,000 Well, this second input will take the value of 99%. 188 00:09:30,633 --> 00:09:31,933 So let's use this function. 189 00:09:31,933 --> 00:09:35,566 This function is removeSparseTerms. 190 00:09:35,766 --> 00:09:37,800 Here it is, removeSparseTerms. 191 00:09:37,800 --> 00:09:41,666 So pressing enter, and ready to input the two arguments. 192 00:09:41,666 --> 00:09:43,100 So the first argument 193 00:09:43,100 --> 00:09:47,066 is of course the sparse matrix on which we want to apply this filtering. 194 00:09:47,066 --> 00:09:49,700 So of course it's dtm. 195 00:09:49,700 --> 00:09:50,666 All right. 196 00:09:50,666 --> 00:09:54,366 And the second input is the proportion of words that are the most frequent words 197 00:09:54,533 --> 00:09:57,066 and that will be kept in this sparse matrix. 198 00:09:57,066 --> 00:10:00,833 So let's say we want to keep 99% of the most frequent words. 199 00:10:01,100 --> 00:10:04,233 Well, we would need to input 0.99 here. 200 00:10:04,533 --> 00:10:07,333 And therefore we will build the same sparse matrix, 201 00:10:07,333 --> 00:10:10,333 but this time containing 99% of the words 202 00:10:10,366 --> 00:10:13,600 that are the most frequent in this sparse matrix of features.
203 00:10:13,900 --> 00:10:17,133 And therefore, you know, we're not looking at the corpus containing all the words 204 00:10:17,233 --> 00:10:20,866 and counting the most frequent words of this corpus. What this function, 205 00:10:20,866 --> 00:10:25,033 removeSparseTerms, will do is look at all the columns of the sparse matrix here, 206 00:10:25,200 --> 00:10:30,000 and then keep the 99% of the columns that have the most ones in them. 207 00:10:30,000 --> 00:10:33,800 Because each column corresponds to a word, and therefore, when there are very 208 00:10:33,800 --> 00:10:38,133 few ones in a column, that means that this word appears in very few reviews, 209 00:10:38,533 --> 00:10:39,833 and therefore these are the words 210 00:10:39,833 --> 00:10:43,533 that are non-frequent in the reviews and accordingly not relevant. 211 00:10:43,800 --> 00:10:46,200 And that's why we can remove them. 212 00:10:46,200 --> 00:10:46,500 All right. 213 00:10:46,500 --> 00:10:47,166 So let's do it. 214 00:10:47,166 --> 00:10:49,533 Let's apply the filter. To be cautious, 215 00:10:49,533 --> 00:10:53,500 let's maybe take a higher proportion of frequent words we keep, 216 00:10:53,766 --> 00:10:56,700 because actually with 99% we might remove a lot of words. 217 00:10:56,700 --> 00:10:59,333 You can try it in your RStudio to see. 218 00:10:59,333 --> 00:11:02,766 But here, since we don't have many reviews, you know, we have 1000 reviews, 219 00:11:02,766 --> 00:11:05,433 which is not much compared to other texts 220 00:11:05,433 --> 00:11:08,000 we can work with in natural language processing, 221 00:11:08,000 --> 00:11:12,966 let's be careful here and apply a 99.9% proportion of frequent words. 222 00:11:13,166 --> 00:11:15,233 So here I'm going to add a nine, 223 00:11:15,233 --> 00:11:18,666 and you'll see that it will already remove quite a lot of words. 224 00:11:18,800 --> 00:11:22,566 So let's try it. I'm going to select this and execute.
225 00:11:23,133 --> 00:11:25,866 And indeed, as you can see, we now have 226 00:11:25,866 --> 00:11:29,033 691 columns in the sparse matrix. 227 00:11:29,266 --> 00:11:32,366 That is, we only kept 691 words. 228 00:11:32,700 --> 00:11:34,200 So clearly we can see that 229 00:11:34,200 --> 00:11:37,866 keeping 99.9% of the words that are the most frequent, 230 00:11:38,133 --> 00:11:41,133 well, that already removes almost 1000 words, 231 00:11:41,366 --> 00:11:45,133 because originally we had, remember, more than 1,500 words. 232 00:11:45,466 --> 00:11:46,800 So be careful with this. 233 00:11:46,800 --> 00:11:49,800 Be careful not to apply too low a 234 00:11:49,800 --> 00:11:53,033 proportion of frequent words you want to keep. And to choose that, 235 00:11:53,033 --> 00:11:55,466 remember to look at the total number of words 236 00:11:55,466 --> 00:11:58,200 that is counted when you build this first sparse matrix. 237 00:11:58,200 --> 00:12:01,000 And of course, you can also choose this number by considering 238 00:12:01,000 --> 00:12:04,000 the total number of reviews you have in your original data set. 239 00:12:04,300 --> 00:12:06,966 And you know, since we only had 1000 reviews, 240 00:12:06,966 --> 00:12:09,966 well, that's why we take such a high proportion here. 241 00:12:10,200 --> 00:12:12,500 And let's see by how much we reduced the sparsity. 242 00:12:12,500 --> 00:12:17,133 So we have to type dtm again because our document-term matrix, 243 00:12:17,133 --> 00:12:21,166 that is, our sparse matrix, was just updated with all these words removed. 244 00:12:21,366 --> 00:12:26,000 So pressing enter here, and the sparsity has now become 99%. 245 00:12:26,400 --> 00:12:27,133 So better. 246 00:12:27,133 --> 00:12:30,133 But anyway, that was fine because we didn't have too many columns.
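The filtering step can be sketched in base R as follows. This is an illustration of the rule as the tm documentation for `removeSparseTerms` describes it (a word is dropped when the share of documents in which it does not occur is at least the `sparse` value), not tm's actual implementation. The helper `keep_frequent_terms` and the toy matrix are hypothetical.

```r
# Hypothetical helper, not part of tm: keep the words (columns) that occur
# in more than nrow(m) * (1 - sparse) documents.
keep_frequent_terms = function(m, sparse) {
  doc_freq = colSums(m > 0)  # number of reviews containing each word
  m[, doc_freq > nrow(m) * (1 - sparse), drop = FALSE]
}

# Toy table: 4 reviews (rows) x 4 words (columns)
m = matrix(c(1, 0, 1, 1,
             1, 0, 0, 1,
             2, 1, 0, 0,
             1, 0, 0, 1),
           nrow = 4, byrow = TRUE,
           dimnames = list(NULL, c("good", "crust", "wait", "food")))

# With sparse = 0.5, a word must occur in more than 2 of the 4 reviews
filtered = keep_frequent_terms(m, sparse = 0.5)

colnames(filtered)  # "good" "food"
```

Under this reading, with 1000 reviews and a value of 0.999 the cutoff is one review, so the words that occur in a single review are the ones removed. That is in line with the drop from 1577 to 691 columns seen above.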
247 00:12:30,533 --> 00:12:32,566 You will see that if you work with larger texts, 248 00:12:32,566 --> 00:12:35,766 you will get a lot more words and therefore a lot more columns. 249 00:12:36,366 --> 00:12:38,633 All right. So that will be all for this tutorial. 250 00:12:38,633 --> 00:12:40,800 We built our Bag of Words model. 251 00:12:40,800 --> 00:12:42,333 Congratulations on that. 252 00:12:42,333 --> 00:12:45,233 And now it's time to make the classification model. 253 00:12:45,233 --> 00:12:48,466 So that's what we'll do in the next and final tutorial of this section. 254 00:12:48,700 --> 00:12:50,400 And until then, enjoy machine learning.