1 00:00:00,233 --> 00:00:01,100 Hello, my friends. 2 00:00:01,100 --> 00:00:05,133 Are you ready for the most essential step of this implementation, 3 00:00:05,133 --> 00:00:07,800 which is at the heart of sentiment analysis, 4 00:00:07,800 --> 00:00:12,100 that is, creating the bag of words model, which we are ready to do now 5 00:00:12,100 --> 00:00:15,333 because all our reviews are properly cleaned? 6 00:00:15,333 --> 00:00:19,366 So we're going to get them into the bag of words model to create, 7 00:00:19,366 --> 00:00:23,100 you know, the sparse matrix which will contain, in the rows, 8 00:00:23,100 --> 00:00:27,100 well, the different reviews, you know, the same reviews as the ones in our corpus, 9 00:00:27,333 --> 00:00:30,300 and in the columns, all the different words 10 00:00:30,300 --> 00:00:33,300 taken from all the different reviews, you know, all of them. 11 00:00:33,466 --> 00:00:36,766 And each cell will get either a 0 or a 1. 12 00:00:36,966 --> 00:00:38,400 It will get a zero 13 00:00:38,400 --> 00:00:43,866 if the word of the column is not in the review of the row, and it will get a one 14 00:00:44,033 --> 00:00:48,000 if the word of the column is indeed part of the words 15 00:00:48,000 --> 00:00:52,200 in the review of the row. All right, so that's what the sparse matrix is about. 16 00:00:52,200 --> 00:00:55,866 And the process of creating all these columns corresponding 17 00:00:55,900 --> 00:01:00,700 to each of the words taken from all the reviews is called tokenization. 18 00:01:00,700 --> 00:01:03,100 So that's exactly what we'll do in this new cell. 19 00:01:03,100 --> 00:01:05,933 But first let me actually show you what we created. 20 00:01:05,933 --> 00:01:08,300 You know, I just want to show you the corpus. 21 00:01:08,300 --> 00:01:12,033 So actually right here we're going to create a new code cell.
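To make the matrix described above concrete, here is a minimal pure-Python sketch. The two-review corpus and the names `vocabulary` and `matrix` are assumptions made up for this illustration; the tutorial itself builds the matrix with scikit-learn later on.

```python
# Toy corpus of already-cleaned reviews (hypothetical, for illustration only).
corpus = ["wow love place", "crust good"]

# One column per distinct word found anywhere in the corpus.
vocabulary = sorted({word for review in corpus for word in review.split()})

# One row per review: 1 if the column's word occurs in the review, else 0.
matrix = [
    [1 if word in review.split() else 0 for word in vocabulary]
    for review in corpus
]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has as many entries as there are distinct words in the whole corpus, which is why the matrix becomes mostly zeros (sparse) once the corpus grows.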
22 00:01:12,333 --> 00:01:16,766 And I'm just going to do a print of the corpus 23 00:01:17,133 --> 00:01:20,000 so that I can show you indeed what we created. 24 00:01:20,000 --> 00:01:21,500 So let's press play here. 25 00:01:21,500 --> 00:01:24,233 And this will show the corpus. 26 00:01:24,233 --> 00:01:24,900 All right. 27 00:01:24,900 --> 00:01:27,766 So this is the first review after the cleaning. 28 00:01:27,766 --> 00:01:30,800 You know, after all this cleaning process in different steps. 29 00:01:31,000 --> 00:01:34,000 Remember, I can actually show you the original review here. 30 00:01:34,200 --> 00:01:37,933 The original review was "Wow" with, you know, capital letters, 31 00:01:37,933 --> 00:01:42,033 something to here with the three little dots and then "Loved this place". 32 00:01:42,300 --> 00:01:47,233 And after the cleaning process it became "wow love place". 33 00:01:47,233 --> 00:01:51,300 Indeed, we removed all the stopwords such as, you know, "this". 34 00:01:51,333 --> 00:01:54,600 You know, that's an article that doesn't give any hint on 35 00:01:54,600 --> 00:01:56,700 whether the review is positive or negative. 36 00:01:56,700 --> 00:01:57,600 However, of course 37 00:01:57,600 --> 00:02:01,800 we kept "loved" because "loved" means of course that the review is positive. 38 00:02:01,966 --> 00:02:05,166 However, we transformed "loved" into "love". 39 00:02:05,400 --> 00:02:07,133 That's the process of stemming. 40 00:02:07,133 --> 00:02:10,333 So we can simplify all the words to their roots. 41 00:02:10,633 --> 00:02:14,133 And then of course, we kept "place" because that's of course not a stopword. 42 00:02:14,333 --> 00:02:17,366 All right, then let's have a look at the second one. 43 00:02:17,433 --> 00:02:19,800 "Crust is not good". 44 00:02:19,800 --> 00:02:20,100 All right. 45 00:02:20,100 --> 00:02:22,966 So let's try to guess actually how it was transformed. 46 00:02:22,966 --> 00:02:27,333 So "Crust" was just transformed into "crust" with a lowercase c.
47 00:02:27,733 --> 00:02:29,866 Then "is" was probably removed. 48 00:02:29,866 --> 00:02:30,233 Right. 49 00:02:30,233 --> 00:02:33,133 Because it doesn't give any hint on whether the review was positive 50 00:02:33,133 --> 00:02:33,900 or negative. 51 00:02:33,900 --> 00:02:37,700 "Not" was definitely kept because that's a negative statement. 52 00:02:37,966 --> 00:02:41,266 And "good" was of course kept. Okay. 53 00:02:41,266 --> 00:02:44,666 So after the transformation, you know, after all the cleaning, 54 00:02:44,666 --> 00:02:48,433 this review must become "crust", with a lowercase c, 55 00:02:48,700 --> 00:02:50,166 "not good". 56 00:02:50,166 --> 00:02:52,000 Let's check that it is the case. 57 00:02:52,000 --> 00:02:54,600 And oh, okay. 58 00:02:54,600 --> 00:02:58,433 So actually they removed the "not", which is a bit strange 59 00:02:58,433 --> 00:03:02,166 actually because, you know, "not" clearly indicates a negative thing. 60 00:03:02,166 --> 00:03:03,333 You know, a negative review. 61 00:03:03,333 --> 00:03:07,733 We clearly have a difference between "crust is good" and "crust is not good". 62 00:03:08,133 --> 00:03:13,300 So I think we need to do some extra work here in order to exclude 63 00:03:13,566 --> 00:03:17,033 the word "not" from the stopwords. 64 00:03:17,033 --> 00:03:19,133 And I'm going to show you how you can do this. 65 00:03:19,133 --> 00:03:20,633 It's very easy. 66 00:03:20,633 --> 00:03:22,733 So we're going to work again on this code. 67 00:03:22,733 --> 00:03:25,733 I'm actually going to take this here, you know, 68 00:03:26,033 --> 00:03:28,666 stopwords, or the English stopwords. 69 00:03:28,666 --> 00:03:31,366 I'm going to cut that, then 70 00:03:31,366 --> 00:03:35,300 right here in a new line of code, I'm going to paste that. 71 00:03:35,700 --> 00:03:38,933 Then I'm going to create actually a new variable 72 00:03:38,966 --> 00:03:42,866 which I'm going to call all_stopwords. 73 00:03:43,333 --> 00:03:44,133 Right.
74 00:03:44,133 --> 00:03:47,133 And which will be equal to exactly this. 75 00:03:47,166 --> 00:03:50,166 But then what I'm going to do just below 76 00:03:50,300 --> 00:03:53,100 is to take this again, which was now 77 00:03:53,100 --> 00:03:56,400 created as, you know, this whole ensemble of all the stopwords. 78 00:03:56,400 --> 00:03:59,800 But we don't want to include "not" in the stopwords, because that's 79 00:03:59,800 --> 00:04:04,033 clearly a negative term indicating therefore a negative review. 80 00:04:04,266 --> 00:04:06,066 So I'm going to paste that here. 81 00:04:06,066 --> 00:04:09,066 And I'm just going to add here a dot remove. 82 00:04:09,600 --> 00:04:14,366 And in the parentheses I'm simply going to include, in quotes, "not". 83 00:04:14,700 --> 00:04:15,000 All right. 84 00:04:15,000 --> 00:04:19,233 So that will exclude the word "not" from the stopwords. 85 00:04:19,433 --> 00:04:20,500 And therefore here 86 00:04:21,700 --> 00:04:22,733 instead of 87 00:04:22,733 --> 00:04:26,233 taking the set of the original ensemble of stopwords, 88 00:04:26,366 --> 00:04:29,933 well, we're now going to take the original ensemble of stopwords, 89 00:04:29,933 --> 00:04:33,166 excluding this time the word "not". 90 00:04:33,166 --> 00:04:34,333 Let's see if it works. 91 00:04:34,333 --> 00:04:38,500 I'm kind of improvising things here, but it might work. 92 00:04:38,500 --> 00:04:41,833 So we actually have to restart the runtime. 93 00:04:41,833 --> 00:04:42,700 So let's do this. 94 00:04:42,700 --> 00:04:44,100 Restart runtime. 95 00:04:44,100 --> 00:04:47,100 Yes, we still have our data set. All good. 96 00:04:47,300 --> 00:04:48,533 And now let's see if this works. 97 00:04:48,533 --> 00:04:50,633 So we're going to re-execute the cells. 98 00:04:50,633 --> 00:04:51,500 I cannot do a Run 99 00:04:51,500 --> 00:04:54,433 all here because the implementation is not over. 100 00:04:54,433 --> 00:04:56,333 But let's import the libraries first.
101 00:04:56,333 --> 00:04:57,966 Now the data set. 102 00:04:57,966 --> 00:05:00,966 And now let's clean the text. 103 00:05:01,000 --> 00:05:03,033 I hope this will work. 104 00:05:03,033 --> 00:05:04,500 Let's play. 105 00:05:04,500 --> 00:05:07,300 All right. This seems to be good. Good. 106 00:05:07,300 --> 00:05:09,000 Now let's remove this output. 107 00:05:09,000 --> 00:05:11,900 This was the previous output, right. 108 00:05:11,900 --> 00:05:13,300 And now let's print the corpus. 109 00:05:13,300 --> 00:05:18,166 And let's hope that the second review is now no longer, you know, "crust good", 110 00:05:18,166 --> 00:05:21,200 but indeed "crust not good". Okay. 111 00:05:21,666 --> 00:05:22,666 So let's press play. 112 00:05:22,666 --> 00:05:26,200 And perfect. Okay, good, I'm relieved. 113 00:05:26,233 --> 00:05:29,633 You know, it was really bad to remove the "not" because it's 114 00:05:29,633 --> 00:05:33,066 clearly a negative term indicating a negative review. 115 00:05:33,600 --> 00:05:34,733 All right. So much better now. 116 00:05:34,733 --> 00:05:36,766 And actually, you know, same for the next one. 117 00:05:36,766 --> 00:05:39,133 "Not tasty", "texture nasty". 118 00:05:39,133 --> 00:05:41,433 That definitely means a negative review. 119 00:05:41,433 --> 00:05:44,166 Let's actually check that, right. 120 00:05:44,166 --> 00:05:45,600 Yes. "Not tasty". 121 00:05:45,600 --> 00:05:48,233 And whatever, zero, negative review. 122 00:05:48,233 --> 00:05:49,666 And same for this one. 123 00:05:49,666 --> 00:05:50,400 All right. So good. 124 00:05:50,400 --> 00:05:52,466 We have actually a much better model now. 125 00:05:52,466 --> 00:05:53,700 So we can continue. 126 00:05:53,700 --> 00:05:57,000 And we can mostly create the bag of words model. 127 00:05:57,666 --> 00:05:58,800 All right. So let's do this. 128 00:05:58,800 --> 00:06:03,700 Let's actually scroll down a bit and there we go, new code cell.
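The stopword fix walked through above can be sketched roughly as follows. This is only an illustration under stated assumptions: the short list below is a hypothetical stand-in for NLTK's English stopword list (presumably obtained in the tutorial with `stopwords.words('english')` from `nltk.corpus`), the helper name `clean` is made up, and the stemming step is left out.

```python
# Hypothetical stand-in for NLTK's English stopword list, so that the
# sketch runs without downloading any corpus data.
all_stopwords = ["is", "the", "this", "a", "not"]

# Keep "not" as a meaningful word by removing it from the stopword list.
all_stopwords.remove("not")


def clean(review):
    """Lowercase a review and drop stopwords (stemming omitted for brevity)."""
    kept = [word for word in review.lower().split() if word not in set(all_stopwords)]
    return " ".join(kept)


print(clean("Crust is not good"))
```

Without the `remove("not")` line, the same review would be reduced to "crust good", which is the misleading result spotted in the video.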
129 00:06:03,700 --> 00:06:07,100 And now let's proceed with this tokenization to create 130 00:06:07,100 --> 00:06:11,500 a sparse matrix containing all the reviews in different rows and all the words 131 00:06:11,500 --> 00:06:13,433 from all the reviews in the different columns, 132 00:06:13,433 --> 00:06:18,166 where the cells will get a one if the word is in the review, and a zero otherwise. 133 00:06:18,533 --> 00:06:18,966 All right. 134 00:06:18,966 --> 00:06:22,500 So we're going to do this with, actually, scikit-learn. 135 00:06:22,500 --> 00:06:26,866 You know, the tokenization process will be done thanks to a class from 136 00:06:26,866 --> 00:06:27,300 scikit-learn, 137 00:06:27,300 --> 00:06:31,433 more specifically from a module of scikit-learn called feature_extraction. 138 00:06:31,700 --> 00:06:34,700 And that class is called CountVectorizer. 139 00:06:35,033 --> 00:06:35,333 All right. 140 00:06:35,333 --> 00:06:35,933 So let's do this. 141 00:06:35,933 --> 00:06:38,733 Let's start with "from" scikit-learn, 142 00:06:38,733 --> 00:06:41,833 you know this library very well, sklearn, 143 00:06:42,200 --> 00:06:45,400 from which we're going to call that feature_extraction, 144 00:06:45,400 --> 00:06:46,200 there we go, 145 00:06:46,200 --> 00:06:49,966 module, from which, actually, you know, it's not over, 146 00:06:50,000 --> 00:06:54,633 we're going to get access to the submodule called text, 147 00:06:54,900 --> 00:06:58,333 from which we're going to import that 148 00:06:58,966 --> 00:07:00,666 CountVectorizer class. Perfect. 149 00:07:00,666 --> 00:07:04,000 I really love Google Colab when it assists me this 150 00:07:04,000 --> 00:07:05,066 well. Okay. 151 00:07:05,066 --> 00:07:06,233 So we have the class. 152 00:07:06,233 --> 00:07:08,433 Now, you know, what is the next natural step? 153 00:07:08,433 --> 00:07:11,333 It is to create an instance of this class.
154 00:07:11,333 --> 00:07:15,933 And we're going to call that cv, as in CountVectorizer, which will be created 155 00:07:15,933 --> 00:07:19,433 as, you know, an instance of this 156 00:07:19,833 --> 00:07:24,600 CountVectorizer class, perfect, which has to take as input 157 00:07:24,700 --> 00:07:27,033 only one important parameter. 158 00:07:27,033 --> 00:07:29,300 Can you actually guess what it is? 159 00:07:29,300 --> 00:07:33,133 Well, it is actually the maximum size of the sparse matrix. 160 00:07:33,133 --> 00:07:35,000 You know, the maximum number of columns. 161 00:07:35,000 --> 00:07:37,133 Therefore the maximum number of words 162 00:07:37,133 --> 00:07:40,133 you want to include in the columns of the sparse matrix. 163 00:07:40,600 --> 00:07:44,433 And why is this important? That's because, you know, in our corpus of reviews 164 00:07:44,433 --> 00:07:48,800 now, with all the simplifications, well, we actually still have some words 165 00:07:48,800 --> 00:07:50,766 that are not relevant or, 166 00:07:50,766 --> 00:07:54,000 you know, not helpful to predict if a review is positive or negative, 167 00:07:54,200 --> 00:07:55,933 even if they were not part of the stopwords. 168 00:07:55,933 --> 00:07:58,766 And these include, for example, you know, "texture", 169 00:07:58,766 --> 00:08:02,466 you know, "texture" doesn't really help to predict if a review is positive 170 00:08:02,466 --> 00:08:05,900 or negative, or, you know, "bank", you know, or "holiday" 171 00:08:06,133 --> 00:08:09,366 or "Rick" and even "Steve", you know, "Steve" doesn't help at all. 172 00:08:09,600 --> 00:08:13,333 So we still have these words which, even if they're not part of the stopwords, 173 00:08:13,466 --> 00:08:16,633 don't help at all to predict if a review is positive or negative.
174 00:08:17,100 --> 00:08:21,100 And the way to get rid of them is by, you know, entering this parameter 175 00:08:21,100 --> 00:08:25,300 that we're about to enter. The way to get rid of them is just to take 176 00:08:25,300 --> 00:08:29,800 actually the most frequent words, you know, the words that appear 177 00:08:29,800 --> 00:08:34,200 most frequently in the reviews, because probably here "Steve" only appears once. 178 00:08:34,200 --> 00:08:37,466 So if we only take the most frequent words, we won't include 179 00:08:37,466 --> 00:08:41,700 "Steve" in this sparse matrix, you know, in the tokenization process. 180 00:08:42,066 --> 00:08:43,866 So that's the trick. 181 00:08:43,866 --> 00:08:48,300 And so now we need to just choose a maximum size of the sparse matrix. 182 00:08:48,433 --> 00:08:51,500 However, we can't really know now how many words 183 00:08:51,500 --> 00:08:54,533 there are in total, you know, before we take the most frequent ones. 184 00:08:54,733 --> 00:08:57,600 So what we'll do in fact is we will leave this for now. 185 00:08:57,600 --> 00:08:59,400 You know, we won't enter this parameter 186 00:08:59,400 --> 00:09:03,633 now. We will run this cell once we create the sparse matrix, 187 00:09:03,633 --> 00:09:07,100 which is actually going to be the matrix of features when training 188 00:09:07,100 --> 00:09:11,100 our Naive Bayes model on the training set, it's going to be the matrix of features. 189 00:09:11,233 --> 00:09:15,500 And therefore we will do a print in order to know the total number of columns. 190 00:09:15,733 --> 00:09:17,966 And we will get, therefore, the total number of words. 191 00:09:17,966 --> 00:09:22,533 And then we can reduce that total number of words to a lower number 192 00:09:22,533 --> 00:09:23,933 of the most frequent words 193 00:09:23,933 --> 00:09:28,200 in the sparse matrix, so that we can simplify even more the bag of words model. 194 00:09:28,266 --> 00:09:29,900 Okay, so that's what we'll do.
195 00:09:29,900 --> 00:09:32,200 Therefore, so far, let's not enter anything. 196 00:09:32,200 --> 00:09:35,566 Let's just continue to create that bag of words model. 197 00:09:36,700 --> 00:09:37,200 All right. 198 00:09:37,200 --> 00:09:41,733 And actually, speaking of the matrix of features, that's exactly our next step. 199 00:09:41,733 --> 00:09:45,000 Here we are ready, thanks to this CountVectorizer class, 200 00:09:45,233 --> 00:09:49,566 to create the matrix of features, which is indeed that sparse matrix. 201 00:09:49,800 --> 00:09:53,233 So we're going to call it X, as usual, like each 202 00:09:53,233 --> 00:09:55,200 of our previous matrices of features. 203 00:09:55,200 --> 00:09:56,566 So X equals. 204 00:09:56,566 --> 00:09:59,266 And now, according to you, what is the next step here? 205 00:09:59,266 --> 00:10:01,033 Well, you guessed that we're going 206 00:10:01,033 --> 00:10:05,066 to create this sparse matrix thanks to our cv object. 207 00:10:05,200 --> 00:10:09,333 So there we go, I'm calling cv first, from which I'm going to call now 208 00:10:09,333 --> 00:10:13,800 a method which you know very well, which we already called many times. 209 00:10:14,133 --> 00:10:19,000 And that method is the fit_transform method. 210 00:10:19,333 --> 00:10:20,033 All right. 211 00:10:20,033 --> 00:10:23,633 The fit_transform method, which will indeed fit, well, 212 00:10:23,633 --> 00:10:26,233 you know, the input of this fit_transform method, which will be, 213 00:10:26,233 --> 00:10:30,166 you know, I'll tell you now, the corpus. It will fit the corpus to X. 214 00:10:30,433 --> 00:10:31,500 And what does it mean? 215 00:10:31,500 --> 00:10:33,066 It means exactly that 216 00:10:33,066 --> 00:10:36,766 it will take all the words from all the reviews in the corpus. 217 00:10:36,966 --> 00:10:40,566 And then using this transform part of the method, it will put all these 218 00:10:40,566 --> 00:10:43,500 words in different columns. So you see, that's very simple.
219 00:10:43,500 --> 00:10:45,300 The fit method will just take all the 220 00:10:45,300 --> 00:10:49,033 words, and the transform method will put all these words into the columns. 221 00:10:49,033 --> 00:10:49,800 That's it. 222 00:10:49,800 --> 00:10:51,200 Nothing more. Okay. 223 00:10:51,200 --> 00:10:55,533 So of course inside this fit_transform method we have to input our corpus 224 00:10:55,533 --> 00:10:58,533 of reviews, of fully cleaned reviews. 225 00:10:58,666 --> 00:11:01,900 And then we just need to add here a toarray. 226 00:11:02,133 --> 00:11:06,600 Because actually, you know, remember that the matrix of features must be a 2D array. 227 00:11:06,600 --> 00:11:08,366 It has to be a 2D array. 228 00:11:08,366 --> 00:11:11,900 Because then, you know, we will train the Naive Bayes model on the training set. 229 00:11:12,300 --> 00:11:16,033 And this expects, of course, an array as the format of its input, 230 00:11:16,033 --> 00:11:17,433 you know, the matrix of features. 231 00:11:17,433 --> 00:11:19,866 So, you know, X will be an array here. 232 00:11:19,866 --> 00:11:21,966 Then it will be split into the training set 233 00:11:21,966 --> 00:11:25,100 and test set, you know, with X_train, y_train, X_test and y_test. 234 00:11:25,333 --> 00:11:26,433 And then there we go. 235 00:11:26,433 --> 00:11:29,666 We'll have the right array format to train the Naive Bayes model 236 00:11:29,666 --> 00:11:32,700 on the training set composed of X_train and y_train. 237 00:11:33,000 --> 00:11:33,933 So toarray. 238 00:11:33,933 --> 00:11:36,000 Let's not forget the parentheses. 239 00:11:36,000 --> 00:11:38,800 And now there we go. We're almost done. 240 00:11:38,800 --> 00:11:42,666 Our final step here is to create the dependent variable vector y. 241 00:11:43,000 --> 00:11:47,166 And actually I will let you do this now because you know exactly how to do it, 242 00:11:47,466 --> 00:11:48,000 right.
243 00:11:48,000 --> 00:11:51,800 We simply need to take that second column here because that's exactly 244 00:11:51,800 --> 00:11:53,766 the dependent variable vector. 245 00:11:53,766 --> 00:11:55,533 And we don't have anything to do here 246 00:11:55,533 --> 00:11:58,566 because it's already ready with the binary outcome zero one. 247 00:11:58,866 --> 00:11:59,500 And so. 248 00:11:59,500 --> 00:12:03,033 Well, the way to get this is actually very simple. 249 00:12:03,033 --> 00:12:06,033 And I'm actually thinking right now of an even simpler way, 250 00:12:06,100 --> 00:12:09,100 which is to go to our data preprocessing template, 251 00:12:09,166 --> 00:12:12,633 then take this line of code, because, you know, I'm very lazy. 252 00:12:12,633 --> 00:12:16,066 And so I'm copying this and pasting it. 253 00:12:16,366 --> 00:12:17,000 Right. 254 00:12:17,000 --> 00:12:19,800 You know, deleting this and right here. 255 00:12:19,800 --> 00:12:22,033 And that's exactly our dependent variable. 256 00:12:22,033 --> 00:12:22,533 Right. 257 00:12:22,533 --> 00:12:24,166 It is just taking 258 00:12:24,166 --> 00:12:27,833 the last column of our data set, which is the same as the second column. 259 00:12:27,833 --> 00:12:28,166 Right. 260 00:12:28,166 --> 00:12:31,133 You can either put a minus one here or the index one. 261 00:12:31,133 --> 00:12:33,600 But we want to make this a code template if we can. 262 00:12:33,600 --> 00:12:36,300 So let's just keep that okay. 263 00:12:36,300 --> 00:12:37,233 Wow. So good. 264 00:12:37,233 --> 00:12:39,833 We are done with actually the bag of words model. 265 00:12:39,833 --> 00:12:43,033 So now as we said we're going to run this to figure out 266 00:12:43,033 --> 00:12:44,700 the number of columns in the matrix 267 00:12:44,700 --> 00:12:48,266 X meaning the total number of words in that sparse matrix. 268 00:12:48,400 --> 00:12:52,533 So let's play this cell in order to first create x and y. 
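The creation of X described above might look roughly like the sketch below. The three-review corpus is a made-up stand-in for the cleaned restaurant reviews, and the vectorizer is created without any parameter, as at this point in the video. Note that `CountVectorizer` stores word counts by default, so a word repeated within one review yields a value above 1; passing `binary=True` would give strictly 0/1 cells.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus of cleaned, stemmed reviews (hypothetical).
corpus = ["wow love place", "crust not good", "not tasti textur nasti"]

# No parameter yet: every distinct word gets its own column.
cv = CountVectorizer()

# fit learns the vocabulary, transform fills the matrix, and toarray
# converts the sparse result into a dense 2D NumPy array.
X = cv.fit_transform(corpus).toarray()

print(sorted(cv.vocabulary_))  # the words, in column order
print(X.shape)                 # (number of reviews, number of words)
```

The dependent variable vector y is then taken from the last column of the data set, as the video goes on to describe.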
269 00:12:52,533 --> 00:12:57,933 And then we'll do what is necessary to indeed get that total number of columns in X. 270 00:12:57,933 --> 00:12:59,933 And that's exactly what we're ready to do 271 00:12:59,933 --> 00:13:02,933 now. You saw that this cell executed properly. 272 00:13:02,933 --> 00:13:07,033 And now the trick to get that number of columns in X, 273 00:13:07,033 --> 00:13:10,900 or, you know, that number of words resulting from the tokenization, 274 00:13:11,233 --> 00:13:16,166 is just to call the len function here, which is going to take as input 275 00:13:16,366 --> 00:13:21,500 this matrix of features X, and then only the first row. 276 00:13:21,533 --> 00:13:21,900 Right. 277 00:13:21,900 --> 00:13:23,500 Remember that the first index here 278 00:13:23,500 --> 00:13:27,233 in the pair of square brackets corresponds to the index of the row. 279 00:13:27,600 --> 00:13:27,966 All right. 280 00:13:27,966 --> 00:13:31,266 So this will give us exactly the number of elements 281 00:13:31,266 --> 00:13:35,466 basically in the first row, therefore the number of columns of X. 282 00:13:35,666 --> 00:13:38,233 So let's see, let's press play. 283 00:13:38,233 --> 00:13:40,200 And we're going to get now that indeed, 284 00:13:40,200 --> 00:13:47,200 well, okay, there are 1566 words resulting from the tokenization. 285 00:13:47,500 --> 00:13:52,400 Basically we have 1566 words that were taken from all the reviews. 286 00:13:52,633 --> 00:13:55,866 And for each of the reviews, we have a one in the columns 287 00:13:55,866 --> 00:13:59,133 corresponding to the words that are in the review and a zero 288 00:13:59,133 --> 00:14:03,300 in all the other columns corresponding to the words that are not in the review. 289 00:14:03,766 --> 00:14:04,033 All right. 290 00:14:04,033 --> 00:14:07,733 So basically we have 1566 words.
291 00:14:07,733 --> 00:14:10,766 And we can simplify this even more by, for example, 292 00:14:10,766 --> 00:14:16,200 taking the 1500 most frequent words so that we can, you know, get rid of words 293 00:14:16,200 --> 00:14:20,333 such as "Rick", "Steve" and maybe, you know, "holiday" or, 294 00:14:20,700 --> 00:14:23,366 or let's say, you know, "faux". 295 00:14:23,366 --> 00:14:24,466 I don't know what that means. 296 00:14:25,433 --> 00:14:26,233 "Rubber". 297 00:14:26,233 --> 00:14:29,033 You know, this probably appears only once. 298 00:14:29,033 --> 00:14:33,000 And, you know, words like that, words that don't help at all 299 00:14:33,000 --> 00:14:36,000 to predict if the review is positive or negative. 300 00:14:36,000 --> 00:14:38,133 Okay. So that's the idea. 301 00:14:38,133 --> 00:14:43,133 So let's just take, you know, the 1500 most frequent words. 302 00:14:43,400 --> 00:14:46,666 And therefore to do this we just need to enter the 303 00:14:46,833 --> 00:14:49,833 max_features parameter. 304 00:14:49,833 --> 00:14:50,833 There we go. 305 00:14:50,833 --> 00:14:54,633 And in order to get only the 1500 most frequent words, 306 00:14:54,633 --> 00:14:57,900 we just need to enter here 1500. 307 00:14:57,900 --> 00:14:58,900 And feel free to 308 00:14:58,900 --> 00:15:02,633 try with other values, like for example, the 1000 most frequent words. 309 00:15:02,633 --> 00:15:05,300 But be careful not to remove too many words. 310 00:15:05,300 --> 00:15:08,133 Okay? All right. So good. 311 00:15:08,133 --> 00:15:11,833 Therefore now we're going to, you know, rerun that cell. 312 00:15:12,133 --> 00:15:13,200 So let's do this. 313 00:15:13,200 --> 00:15:16,200 Let's press play. Okay, good. 314 00:15:16,200 --> 00:15:19,800 And now if we rerun that cell we should get 1500 here. 315 00:15:19,800 --> 00:15:21,033 Perfect.
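The two steps above, counting the columns with `len(X[0])` and then capping the vocabulary with `max_features`, can be sketched on a toy scale as follows. The four-review corpus and the cap of 3 are assumptions chosen so that the effect is visible; the video uses 1500 on the real data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned corpus: "not" appears 3 times, "love" and "place"
# twice each, and every other word only once.
corpus = [
    "wow love place",
    "crust not good",
    "not tasti textur nasti",
    "love place not",
]

# Without a cap, every distinct word becomes a column.
X_all = CountVectorizer().fit_transform(corpus).toarray()
print(len(X_all[0]))  # number of columns = total number of words

# With max_features, only the most frequent words across the corpus are kept.
cv = CountVectorizer(max_features=3)
X = cv.fit_transform(corpus).toarray()
print(len(X[0]))
print(sorted(cv.vocabulary_))
```

Words that occur only once are the first to be dropped. Keep in mind that frequency is only a rough proxy for usefulness, which is why the video warns against setting the cap too low.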
316 00:15:21,033 --> 00:15:24,966 So now we have a nice bag of words model with only relevant words, 317 00:15:24,966 --> 00:15:27,666 you know, that appear at least a certain number of times, 318 00:15:27,666 --> 00:15:30,900 and without all the irrelevant words that appear once, 319 00:15:30,900 --> 00:15:35,766 like "Rick", "Steve" or that weird word we saw in one of the reviews. 320 00:15:35,766 --> 00:15:37,500 Okay, good. 321 00:15:37,500 --> 00:15:40,933 And so now, well, we basically did the most difficult part. 322 00:15:41,166 --> 00:15:43,266 We created the bag of words model. 323 00:15:43,266 --> 00:15:46,100 And so now I actually have an exercise for you 324 00:15:46,100 --> 00:15:49,266 which you're going to do by yourself first before we do it together. 325 00:15:49,400 --> 00:15:53,166 It is, of course, to do all the rest of the different steps here. 326 00:15:53,166 --> 00:15:56,833 And you know how to do them because you basically have everything. 327 00:15:57,066 --> 00:16:00,233 You have the matrix of features X and the dependent variable vector y, 328 00:16:00,433 --> 00:16:04,033 which you can therefore split into a training set and test set, 329 00:16:04,033 --> 00:16:09,166 you know, composed respectively of X_train and y_train, and X_test and y_test. 330 00:16:09,500 --> 00:16:13,333 Then you're going to use the training set composed of X_train and y_train 331 00:16:13,333 --> 00:16:16,500 to train the Naive Bayes model on the training set. 332 00:16:16,800 --> 00:16:20,566 Then you're going to predict the test results using the test set, containing 333 00:16:20,566 --> 00:16:24,766 therefore reviews and their outcomes on which the model wasn't trained. 334 00:16:24,966 --> 00:16:29,000 And finally, you're going to make the confusion matrix and compute the accuracy. 335 00:16:29,400 --> 00:16:32,300 Of course you're going to do this using your machine 336 00:16:32,300 --> 00:16:36,200 learning toolkit containing all the code templates we built so far.
337 00:16:36,300 --> 00:16:38,033 So you totally have the right to do that. 338 00:16:38,033 --> 00:16:40,400 And I actually hope that you're going to do this 339 00:16:40,400 --> 00:16:44,166 because I want you to be as efficient as possible. 340 00:16:44,366 --> 00:16:47,066 And therefore that's exactly what we will do 341 00:16:47,066 --> 00:16:51,233 in the next and final tutorial of this section. I will show you how 342 00:16:51,233 --> 00:16:56,666 to juggle with our diverse toolkit, and especially the classification toolkit, to, 343 00:16:56,666 --> 00:17:00,300 in a flash, split the data set into the training set and test set, 344 00:17:00,433 --> 00:17:03,033 then train the Naive Bayes model on the training set, 345 00:17:03,033 --> 00:17:06,500 predict the test results and make the confusion matrix. 346 00:17:06,500 --> 00:17:11,200 I will show you that I will do this with only copy-paste, nothing else. 347 00:17:11,200 --> 00:17:15,266 We won't type any code now; we have everything in our diverse toolkit. 348 00:17:15,600 --> 00:17:16,933 But please try it first. 349 00:17:16,933 --> 00:17:18,533 Please do it on your own first 350 00:17:18,533 --> 00:17:21,900 and we will implement the solution together in the next tutorial. 351 00:17:22,200 --> 00:17:24,166 Until then, enjoy machine learning.