1 00:00:00,233 --> 00:00:00,766 All right. 2 00:00:00,766 --> 00:00:01,300 Are you ready 3 00:00:01,300 --> 00:00:02,333 to do some cleaning? 4 00:00:02,333 --> 00:00:03,500 Yes? Good. 5 00:00:03,500 --> 00:00:04,433 Because now we're about 6 00:00:04,433 --> 00:00:07,766 to do a deep cleaning of the text, right? 7 00:00:07,766 --> 00:00:10,133 So far, the texts have punctuation, 8 00:00:10,133 --> 00:00:11,333 different characters 9 00:00:11,333 --> 00:00:14,100 other than letters, they have capital letters, 10 00:00:14,100 --> 00:00:17,000 lowercase, the verbs are conjugated differently. 11 00:00:17,000 --> 00:00:19,200 Well, we're going to simplify all this. 12 00:00:19,200 --> 00:00:23,100 And that's indeed an essential step when doing natural language processing. 13 00:00:23,233 --> 00:00:26,700 We need to clean the text as much as we can in order to 14 00:00:26,700 --> 00:00:29,700 ease the learning process of the 15 00:00:29,700 --> 00:00:31,133 future machine learning model, 16 00:00:31,133 --> 00:00:31,666 which we will 17 00:00:31,666 --> 00:00:34,633 train to understand the reviews, understand English, 18 00:00:34,633 --> 00:00:35,366 basically, 19 00:00:35,366 --> 00:00:38,433 and predict if the reviews are positive or negative. 20 00:00:38,966 --> 00:00:40,366 All right, so let's do this. 21 00:00:40,366 --> 00:00:43,200 Let's first import the libraries, meaning the tools 22 00:00:43,200 --> 00:00:45,700 which will allow us to clean these texts. 23 00:00:45,700 --> 00:00:49,066 So the first one, and that's the main one, the most essential one, 24 00:00:49,333 --> 00:00:50,800 is the re library. 25 00:00:50,800 --> 00:00:51,900 All right. 26 00:00:51,900 --> 00:00:54,066 So let me just import it first. 27 00:00:54,066 --> 00:00:55,800 It's simply called re. 28 00:00:55,800 --> 00:00:58,233 And that's the library, you know, we'll use to... 29 00:00:58,233 --> 00:01:00,000 Yes, this one, the first one, that's 
30 00:01:00,000 --> 00:01:03,100 the library we'll use to simplify, basically, the reviews. 31 00:01:03,100 --> 00:01:06,000 But this is not the library that will allow us to do the 32 00:01:06,000 --> 00:01:08,766 stemming, which I will recall and explain later 33 00:01:08,766 --> 00:01:10,766 on. Okay, so re. 34 00:01:10,766 --> 00:01:12,900 Then we will import, of course, the 35 00:01:12,900 --> 00:01:15,700 NLTK library, 36 00:01:15,700 --> 00:01:17,033 a very classic library 37 00:01:17,033 --> 00:01:18,933 in natural language processing, which 38 00:01:18,933 --> 00:01:20,100 will allow us to download 39 00:01:20,100 --> 00:01:22,166 the set of stopwords. 40 00:01:22,166 --> 00:01:23,833 So what are the stopwords? 41 00:01:23,833 --> 00:01:25,833 These are, you know, the words 42 00:01:25,833 --> 00:01:27,933 we don't want to include in our reviews, 43 00:01:27,933 --> 00:01:30,700 you know, after cleaning the texts, which, you know, are 44 00:01:30,700 --> 00:01:32,766 words that are not relevant 45 00:01:32,766 --> 00:01:35,033 to help the predictions of whether 46 00:01:35,033 --> 00:01:36,933 a review is positive or negative. 47 00:01:36,933 --> 00:01:40,266 And these words include, you know, the simple ones like "the", you know, 48 00:01:40,266 --> 00:01:44,433 all the articles like "the", and, you know, all these 49 00:01:44,433 --> 00:01:45,333 words which 50 00:01:45,333 --> 00:01:46,966 don't give any hint of 51 00:01:46,966 --> 00:01:49,033 whether a review is positive or negative. 52 00:01:49,033 --> 00:01:52,200 So we will remove all these words, you know, all the words that are not 53 00:01:52,200 --> 00:01:55,000 helpful to predict if a review is positive 54 00:01:55,000 --> 00:01:56,000 or negative. 55 00:01:56,000 --> 00:01:57,966 And speaking of these stopwords, 56 00:01:57,966 --> 00:02:02,833 well, now that we've imported NLTK, we can call nltk, 57 00:02:03,366 --> 00:02:06,300 from which we're going to download 
58 00:02:06,300 --> 00:02:08,066 well, all the stopwords. 59 00:02:08,066 --> 00:02:09,333 And to specify this 60 00:02:09,333 --> 00:02:13,166 we need to enter here, in quotes, inside this download function from the NLTK 61 00:02:13,166 --> 00:02:13,900 library, 62 00:02:13,900 --> 00:02:15,633 stopwords. 63 00:02:15,633 --> 00:02:17,666 And this will get all the stopwords, and 64 00:02:17,666 --> 00:02:18,833 you'll see later on how 65 00:02:18,833 --> 00:02:19,700 we use this 66 00:02:19,700 --> 00:02:23,800 to indeed not include these non-relevant words in our 67 00:02:23,800 --> 00:02:26,800 reviews. Okay, so good. 68 00:02:26,900 --> 00:02:29,000 Now, we're not done with NLTK 69 00:02:29,000 --> 00:02:30,600 yet, because from 70 00:02:30,600 --> 00:02:31,900 NLTK, okay, 71 00:02:31,900 --> 00:02:32,766 and then from the 72 00:02:32,766 --> 00:02:36,300 corpus module of the NLTK library, we 73 00:02:36,300 --> 00:02:37,333 will import 74 00:02:37,333 --> 00:02:38,033 all these 75 00:02:38,033 --> 00:02:40,833 stopwords that we just downloaded before. 76 00:02:40,833 --> 00:02:41,633 There we go. 77 00:02:41,633 --> 00:02:44,300 So basically this line of code downloads them, 78 00:02:44,300 --> 00:02:48,100 and this line of code imports them into our notebook. 79 00:02:48,266 --> 00:02:48,933 Okay. 80 00:02:48,933 --> 00:02:50,333 So all these stopwords. 81 00:02:50,333 --> 00:02:53,433 And finally, still from NLTK, 82 00:02:53,466 --> 00:02:54,300 and from 83 00:02:54,300 --> 00:02:57,300 the stem module of the NLTK 84 00:02:57,300 --> 00:02:58,000 library, 85 00:02:58,000 --> 00:03:02,233 and then once again from the porter submodule of the stem 86 00:03:02,233 --> 00:03:04,100 module of the NLTK library, 87 00:03:04,100 --> 00:03:06,366 well, we're going to import 88 00:03:06,366 --> 00:03:09,300 a class, which is the 89 00:03:09,300 --> 00:03:11,300 PorterStemmer class. Perfect. 
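Written out, the imports described so far might look like the sketch below. It assumes the nltk package is installed and that the machine can reach the network for the stopwords download. The `stemmer` variable at the end is an assumed name used only for illustration; it is not part of the walkthrough at this point.

```python
import re    # regular expressions, used later to strip non-letter characters
import nltk  # Natural Language Toolkit

# Download the stopword lists (requires network access the first time).
nltk.download('stopwords')

# Make the downloaded stopwords and the Porter stemmer available in the notebook.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Illustration only (assumed name): an instance that can later stem single words.
stemmer = PorterStemmer()
```

Keeping the download call before the `nltk.corpus` import mirrors the order in which the two lines are introduced in the video.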
90 00:03:11,300 --> 00:03:13,066 And this is the class we'll use, of course, 91 00:03:13,066 --> 00:03:14,000 to apply 92 00:03:14,000 --> 00:03:15,966 stemming to our reviews. 93 00:03:15,966 --> 00:03:18,166 So now let me recall what this is about. 94 00:03:18,166 --> 00:03:21,166 Stemming consists of taking only the 95 00:03:21,166 --> 00:03:22,633 root of a word 96 00:03:22,633 --> 00:03:25,800 that indicates enough about what this word means. 97 00:03:25,800 --> 00:03:27,300 So for example, let's say 98 00:03:27,300 --> 00:03:28,266 there is a review that 99 00:03:28,266 --> 00:03:31,533 says, oh, I loved this restaurant. Okay. 100 00:03:31,900 --> 00:03:33,766 And let's say we want to apply stemming to 101 00:03:33,766 --> 00:03:35,200 the word loved. 102 00:03:35,200 --> 00:03:37,200 Well, what it will do is that it will 103 00:03:37,200 --> 00:03:39,266 transform loved into 104 00:03:39,266 --> 00:03:41,000 love, just to simplify 105 00:03:41,000 --> 00:03:43,233 the review, because whether we say, oh, I 106 00:03:43,233 --> 00:03:45,833 loved this restaurant, or, oh, I love this 107 00:03:45,833 --> 00:03:47,833 restaurant, well, you know, that means the same. 108 00:03:47,833 --> 00:03:49,500 That means that the review is positive. 109 00:03:49,500 --> 00:03:52,600 So we can totally remove all the conjugation of the verbs, 110 00:03:52,600 --> 00:03:56,233 you know, just keeping the present tense, so that we can indeed 111 00:03:56,233 --> 00:03:57,166 simplify the reviews. 112 00:03:57,166 --> 00:04:01,200 Because remember, at the end, you know, after cleaning the text, when creating 113 00:04:01,200 --> 00:04:04,233 the bag-of-words model, we will actually create a 114 00:04:04,233 --> 00:04:08,433 sparse matrix where in the columns we will have all the different words 115 00:04:08,433 --> 00:04:10,500 of all our different reviews. 116 00:04:10,500 --> 00:04:11,866 And therefore, in order to
117 00:04:11,866 --> 00:04:15,866 optimize or, you know, minimize the dimension of this sparse matrix, 118 00:04:15,866 --> 00:04:17,566 where the dimension is exactly 119 00:04:17,566 --> 00:04:21,366 the number of columns, well, we need to simplify as much as we can 120 00:04:21,366 --> 00:04:22,300 the words. 121 00:04:22,300 --> 00:04:25,600 And if we don't apply stemming, well, you know, in the sparse matrix 122 00:04:25,600 --> 00:04:28,866 we will have one column for love and one column for loved. 123 00:04:29,100 --> 00:04:32,133 And since that means the same thing, that would be redundant, 124 00:04:32,133 --> 00:04:33,300 and that would make the 125 00:04:33,300 --> 00:04:36,633 sparse matrix even more complex, you know, with a higher dimension. 126 00:04:36,633 --> 00:04:37,800 So that would be wasteful. 127 00:04:37,800 --> 00:04:38,766 So that's exactly what 128 00:04:38,766 --> 00:04:40,933 stemming is about. It consists of 129 00:04:40,933 --> 00:04:41,666 reducing 130 00:04:41,666 --> 00:04:46,366 the final dimension of the sparse matrix so that our machine learning model doesn't have too much 131 00:04:46,366 --> 00:04:48,333 trouble learning 132 00:04:48,333 --> 00:04:51,166 from the text. All right. 133 00:04:51,166 --> 00:04:54,166 So that's what this PorterStemmer class will do. 134 00:04:54,300 --> 00:04:55,866 And now there you go, my friend. 135 00:04:55,866 --> 00:04:57,900 We can start cleaning the text. 136 00:04:57,900 --> 00:05:00,500 We have all the tools we need. 137 00:05:00,500 --> 00:05:01,500 So the first thing we'll 138 00:05:01,500 --> 00:05:03,600 do is create a new list, 139 00:05:03,600 --> 00:05:06,300 which we'll call corpus. All right. 140 00:05:06,300 --> 00:05:07,766 And we will initialize this list 141 00:05:07,766 --> 00:05:09,500 as an empty list. 142 00:05:09,500 --> 00:05:11,700 And what will this list be, exactly? 143 00:05:11,700 --> 00:05:13,100 You know, what will it contain?
144 00:05:13,100 --> 00:05:15,033 Well, it will simply contain 145 00:05:15,033 --> 00:05:16,633 all our different reviews, 146 00:05:16,633 --> 00:05:19,200 you know, all the different reviews from our data set, 147 00:05:19,200 --> 00:05:21,766 but all cleaned and all put into 148 00:05:21,766 --> 00:05:23,666 this list, corpus. 149 00:05:23,666 --> 00:05:25,300 So what we'll do, actually, 150 00:05:25,300 --> 00:05:26,233 is, you know, we will make 151 00:05:26,233 --> 00:05:29,666 a for loop to iterate through all the different reviews of our 152 00:05:29,733 --> 00:05:30,833 data set. 153 00:05:30,833 --> 00:05:31,233 And for 154 00:05:31,233 --> 00:05:33,233 each of these reviews, we will apply 155 00:05:33,233 --> 00:05:35,166 a cleaning process, you know, by 156 00:05:35,166 --> 00:05:37,200 putting all the letters in lowercase 157 00:05:37,200 --> 00:05:40,533 and removing the punctuation and removing the 158 00:05:40,533 --> 00:05:42,266 stopwords, all these things. 159 00:05:42,266 --> 00:05:43,800 And we will do that one review 160 00:05:43,800 --> 00:05:48,166 after another, and each time we clean a review, well, we will add it to this 161 00:05:48,166 --> 00:05:48,800 corpus. 162 00:05:48,800 --> 00:05:49,733 So this corpus will 163 00:05:49,733 --> 00:05:52,233 in the end contain only the cleaned reviews. 164 00:05:52,233 --> 00:05:53,166 Okay. 165 00:05:53,166 --> 00:05:54,200 And we do this, of course, 166 00:05:54,200 --> 00:05:57,333 because the functions we'll use in the next steps 167 00:05:57,333 --> 00:05:58,833 expect our reviews 168 00:05:58,833 --> 00:06:00,400 in a list and all cleaned. 169 00:06:00,400 --> 00:06:02,166 Okay. So, corpus. 170 00:06:02,166 --> 00:06:03,900 And now that we've initialized 171 00:06:03,900 --> 00:06:05,366 this list, well, we're going to 172 00:06:05,366 --> 00:06:09,266 populate it with the cleaned reviews through a for loop.
173 00:06:09,633 --> 00:06:12,300 A for loop which will iterate, 174 00:06:12,300 --> 00:06:15,300 you know, with this classic looping variable i, 175 00:06:15,533 --> 00:06:17,800 in the range from 176 00:06:17,800 --> 00:06:20,166 zero to, well, guess what. 177 00:06:20,166 --> 00:06:21,833 Guess what the upper bound is. 178 00:06:21,833 --> 00:06:23,266 You know, we will simply iterate 179 00:06:23,266 --> 00:06:25,566 through all the reviews. And since we have one 180 00:06:25,566 --> 00:06:27,533 thousand reviews in our data set, well, 181 00:06:27,533 --> 00:06:28,833 i will simply go from 182 00:06:28,833 --> 00:06:31,266 zero to 1000. 183 00:06:31,266 --> 00:06:32,700 Right? As simple as that. 184 00:06:32,700 --> 00:06:33,666 It will iterate 185 00:06:33,666 --> 00:06:39,266 through the indexes of the reviews, which go effectively from zero to 999, since the upper bound is excluded. 186 00:06:39,466 --> 00:06:40,200 Okay. 187 00:06:40,200 --> 00:06:41,900 So, for loop ready. 188 00:06:41,900 --> 00:06:43,766 And now we can go inside the for loop. 189 00:06:43,766 --> 00:06:46,200 And there we go. Now we're going to apply different 190 00:06:46,200 --> 00:06:47,000 steps to 191 00:06:47,000 --> 00:06:49,500 clean each and every single review 192 00:06:49,500 --> 00:06:51,000 of our data set. 193 00:06:51,000 --> 00:06:52,400 So first of all, 194 00:06:52,400 --> 00:06:55,500 we will create a new variable, which we will call review. 195 00:06:55,800 --> 00:06:58,000 And that variable will be exactly that 196 00:06:58,000 --> 00:07:00,866 cleaned review. But, you know, we will clean it step by step. 197 00:07:00,866 --> 00:07:02,866 So we will update review 198 00:07:02,866 --> 00:07:05,500 each time we proceed to a new kind of cleaning. 199 00:07:05,500 --> 00:07:07,933 And the first kind of cleaning we'll do will be 200 00:07:07,933 --> 00:07:10,133 to remove all punctuation. 201 00:07:10,133 --> 00:07:12,433 In other words, it will be to keep only the
202 00:07:12,433 --> 00:07:14,866 letters in our reviews. All right. 203 00:07:14,866 --> 00:07:15,400 And to do 204 00:07:15,400 --> 00:07:18,400 this we're going to call our re library, 205 00:07:18,600 --> 00:07:19,466 from which 206 00:07:19,466 --> 00:07:21,200 we're going to call this sub 207 00:07:21,200 --> 00:07:24,233 function, which is a function that can 208 00:07:24,233 --> 00:07:24,900 replace 209 00:07:24,900 --> 00:07:27,466 anything in a text, you know, in a string, actually, 210 00:07:27,466 --> 00:07:29,166 by anything else you want. 211 00:07:29,166 --> 00:07:30,366 And what we're going to 212 00:07:30,366 --> 00:07:31,600 replace, actually, 213 00:07:31,600 --> 00:07:34,633 is any element that is not a letter, 214 00:07:35,000 --> 00:07:36,333 you know, from A to Z, 215 00:07:36,333 --> 00:07:37,866 by a space. 216 00:07:37,866 --> 00:07:40,033 So that every punctuation mark, like 217 00:07:40,033 --> 00:07:42,333 quotes, double quotes, commas or 218 00:07:42,333 --> 00:07:44,633 colons or anything you want, will 219 00:07:44,633 --> 00:07:46,166 be replaced by a space. 220 00:07:46,166 --> 00:07:48,833 And it has to be replaced by a space, because otherwise 221 00:07:48,833 --> 00:07:50,833 we can have two words that stick together. 222 00:07:50,833 --> 00:07:51,933 So we need to make sure we 223 00:07:51,933 --> 00:07:52,500 replace the 224 00:07:52,500 --> 00:07:55,333 punctuation by spaces so that we can indeed 225 00:07:55,333 --> 00:07:57,600 still separate the words. All right. 226 00:07:57,600 --> 00:07:58,766 And the way to do this, 227 00:07:58,766 --> 00:07:59,233 thanks to 228 00:07:59,233 --> 00:08:02,066 this sub function, is to enter here in the parameters, 229 00:08:02,066 --> 00:08:04,033 first, what we want to replace. 230 00:08:04,033 --> 00:08:04,933 And the trick 231 00:08:04,933 --> 00:08:05,633 to say that 232 00:08:05,633 --> 00:08:08,133 we want to replace anything that is not a
233 00:08:08,133 --> 00:08:10,200 letter is to do it this way. 234 00:08:10,200 --> 00:08:13,200 You start with a pair of square brackets here, just like that. 235 00:08:13,333 --> 00:08:15,133 So what's inside 236 00:08:15,133 --> 00:08:16,633 this pair of square brackets 237 00:08:16,633 --> 00:08:18,700 will be what will be replaced, 238 00:08:18,700 --> 00:08:20,100 you know, by the spaces. 239 00:08:20,100 --> 00:08:23,000 And the trick to say that what we want to replace is anything but 240 00:08:23,000 --> 00:08:25,233 letters is to include a hat here, 241 00:08:25,233 --> 00:08:27,866 and I'll explain what this means, and then add 242 00:08:27,866 --> 00:08:30,466 a. So you have to do it like that, actually: 243 00:08:30,466 --> 00:08:33,366 hat, a, okay, a, 244 00:08:33,366 --> 00:08:34,766 dash, z. 245 00:08:34,766 --> 00:08:38,633 So all the lowercase letters from a to z, but also then 246 00:08:38,633 --> 00:08:39,600 all the capital 247 00:08:39,600 --> 00:08:42,600 letters from A to Z. 248 00:08:43,000 --> 00:08:43,700 All right. 249 00:08:43,700 --> 00:08:46,700 And what this hat means is exactly 250 00:08:46,733 --> 00:08:48,600 not. You know, this symbol, 251 00:08:48,600 --> 00:08:52,566 placed first inside the square brackets, means not, meaning not 252 00:08:52,566 --> 00:08:55,166 all the letters from a to z in lowercase, 253 00:08:55,166 --> 00:08:57,700 nor the capital letters from A to Z, which 254 00:08:57,700 --> 00:08:58,733 is exactly what we want. 255 00:08:58,733 --> 00:09:02,700 We want to replace anything that is not one of the letters from A to Z, in lowercase 256 00:09:02,700 --> 00:09:03,633 or capitals, 257 00:09:03,633 --> 00:09:05,033 by spaces. 258 00:09:05,033 --> 00:09:07,166 And the way to specify that 259 00:09:07,166 --> 00:09:07,500 we want to 260 00:09:07,500 --> 00:09:11,600 replace all these by spaces is, well, exactly
261 00:09:11,600 --> 00:09:14,033 what we have to enter here as a second parameter, 262 00:09:14,033 --> 00:09:16,066 and which we will enter in, you know, 263 00:09:16,066 --> 00:09:18,500 quotes, but with a space inside. 264 00:09:18,500 --> 00:09:21,133 Right, what's inside these quotes is exactly 265 00:09:21,133 --> 00:09:22,500 what we want to replace 266 00:09:22,500 --> 00:09:24,900 those non-letters here by. 267 00:09:24,900 --> 00:09:26,766 All right. So we're going to replace everything 268 00:09:26,766 --> 00:09:29,000 that is not a letter, meaning all the punctuation, 269 00:09:29,000 --> 00:09:31,000 by this space, okay. 270 00:09:31,000 --> 00:09:34,400 And then finally we have to enter one final argument, 271 00:09:34,766 --> 00:09:35,800 which is, of course, 272 00:09:35,800 --> 00:09:37,633 where we want to do all these 273 00:09:37,633 --> 00:09:39,533 replacements, you know, inside what, 274 00:09:39,533 --> 00:09:40,633 which review, 275 00:09:40,633 --> 00:09:43,200 right, inside which text. And so, very simply, 276 00:09:43,200 --> 00:09:45,333 the third parameter we have to enter here 277 00:09:45,333 --> 00:09:47,100 is the review in which we want to 278 00:09:47,100 --> 00:09:48,600 do all these replacements. 279 00:09:48,600 --> 00:09:50,066 And to access the review, 280 00:09:50,066 --> 00:09:51,100 well, that's very easy. 281 00:09:51,100 --> 00:09:53,300 We need to take our data set. 282 00:09:53,300 --> 00:09:55,500 There we go. Then we need to take the right 283 00:09:55,500 --> 00:09:58,066 column of the data set, which contains the reviews. And that's, 284 00:09:58,066 --> 00:09:58,833 of course, the 285 00:09:58,833 --> 00:10:02,100 first column, which we can either access with the iloc indexer 286 00:10:02,100 --> 00:10:03,866 and then specifying the index zero, 287 00:10:03,866 --> 00:10:04,566 or, I'll 288 00:10:04,566 --> 00:10:07,000 show you another trick, by adding here a pair
289 00:10:07,000 --> 00:10:08,066 of square brackets 290 00:10:08,066 --> 00:10:10,366 and then just entering, in quotes, 291 00:10:10,366 --> 00:10:12,666 well, the name of the column, which is 292 00:10:12,666 --> 00:10:14,766 Review, right? 293 00:10:14,766 --> 00:10:15,900 If we go back 294 00:10:15,900 --> 00:10:18,133 to our data set, you will see that 295 00:10:18,133 --> 00:10:20,700 the name of the first column is Review. Okay. 296 00:10:20,700 --> 00:10:23,333 So, as you want; iloc works very well too. 297 00:10:23,333 --> 00:10:25,866 And now, do we need to add something here? 298 00:10:25,866 --> 00:10:28,533 Well, of course, because this only gets 299 00:10:28,533 --> 00:10:31,066 the first column, containing all the reviews. 300 00:10:31,066 --> 00:10:33,566 But now we are dealing with a specific review, 301 00:10:33,566 --> 00:10:35,133 the one of index i. 302 00:10:35,133 --> 00:10:37,500 And therefore, to catch that specific review, 303 00:10:37,500 --> 00:10:40,266 which we want to clean right now inside this for loop, 304 00:10:40,266 --> 00:10:42,333 well, we simply need to add, in a new 305 00:10:42,333 --> 00:10:43,800 pair of square brackets, 306 00:10:43,800 --> 00:10:44,833 i. All right. 307 00:10:44,833 --> 00:10:46,700 So this will get the review 308 00:10:46,700 --> 00:10:49,700 of index i in this first column, Review, 309 00:10:49,800 --> 00:10:50,700 of the data set. 310 00:10:50,700 --> 00:10:52,800 And that's exactly what we want. 311 00:10:52,800 --> 00:10:53,366 All right. 312 00:10:53,366 --> 00:10:55,000 First cleaning done. 313 00:10:55,000 --> 00:10:57,766 Now we will proceed to two other types of cleaning. 314 00:10:57,766 --> 00:11:00,600 Then we will take a little break. And then we will proceed to 315 00:11:00,600 --> 00:11:04,266 stemming, which will consist of, you know, simplifying the words 316 00:11:04,266 --> 00:11:07,900 in order to get only the root and therefore simplifying eventually 
317 00:11:08,100 --> 00:11:10,000 the sparse matrix. 318 00:11:10,000 --> 00:11:10,300 All right. 319 00:11:10,300 --> 00:11:12,533 So the next step will be to transform 320 00:11:12,533 --> 00:11:15,866 all the capital letters into lowercase. 321 00:11:15,866 --> 00:11:16,600 And that's very 322 00:11:16,600 --> 00:11:16,900 easy. 323 00:11:16,900 --> 00:11:20,600 To do this we just need to take our review variable, 324 00:11:20,800 --> 00:11:22,800 from which we can now call 325 00:11:22,800 --> 00:11:25,633 a specific function. Because, you know, we created this variable 326 00:11:25,633 --> 00:11:27,633 as the output of this sub 327 00:11:27,633 --> 00:11:29,700 function of the re library, so it is a string, 328 00:11:29,700 --> 00:11:31,100 and therefore, like any object, 329 00:11:31,100 --> 00:11:33,466 it has some attributes and functions, 330 00:11:33,466 --> 00:11:35,933 or, you know, methods, that you can call. And the function 331 00:11:35,933 --> 00:11:37,133 we want now, to 332 00:11:37,133 --> 00:11:41,700 turn all the letters into lowercase, is the lower 333 00:11:42,166 --> 00:11:43,700 function. And that's it. 334 00:11:43,700 --> 00:11:45,300 You just have to enter it like that: 335 00:11:45,300 --> 00:11:47,300 review dot lower. 336 00:11:47,300 --> 00:11:49,066 But this will only return 337 00:11:49,066 --> 00:11:51,333 the review with only lowercase letters. 338 00:11:51,333 --> 00:11:52,566 And since we want to 339 00:11:52,566 --> 00:11:57,000 update our review variable, well, we simply need to add here review 340 00:11:57,733 --> 00:12:00,333 equals, right, equals the result of 341 00:12:00,333 --> 00:12:02,333 applying the lower function to 342 00:12:02,333 --> 00:12:04,633 our previous version of the review. 343 00:12:04,633 --> 00:12:06,533 Okay. So, very easy. 344 00:12:06,533 --> 00:12:08,400 And then one final cleaning 345 00:12:08,400 --> 00:12:09,600 before we proceed to 346 00:12:09,600 --> 00:12:11,666 stemming in the next.
Tutorial. 347 00:12:11,666 --> 00:12:12,866 Well, actually, what 348 00:12:12,866 --> 00:12:14,600 we have to do now is something 349 00:12:14,600 --> 00:12:16,966 to prepare for the stemming. 350 00:12:16,966 --> 00:12:19,100 And that something is to split the 351 00:12:19,100 --> 00:12:20,933 different elements of the reviews 352 00:12:20,933 --> 00:12:22,466 into different words, 353 00:12:22,466 --> 00:12:24,966 actually, because the different elements are now words. 354 00:12:24,966 --> 00:12:25,500 So we're going to 355 00:12:25,500 --> 00:12:26,700 split the review 356 00:12:26,700 --> 00:12:30,100 into its different words so that then we can apply stemming 357 00:12:30,100 --> 00:12:31,300 to each of these 358 00:12:31,300 --> 00:12:33,300 words, you know, by reducing them 359 00:12:33,300 --> 00:12:34,800 to their root, okay. 360 00:12:34,800 --> 00:12:36,966 So the way to do this is very simple. 361 00:12:36,966 --> 00:12:38,666 Once again, you know, I'm going to copy 362 00:12:38,666 --> 00:12:40,333 this and 363 00:12:40,333 --> 00:12:41,366 paste it here. 364 00:12:41,366 --> 00:12:44,566 And instead of calling the lower function I'm simply going to call 365 00:12:44,566 --> 00:12:45,466 the split 366 00:12:45,466 --> 00:12:47,400 function. As simple as that. 367 00:12:47,400 --> 00:12:48,533 This will split your 368 00:12:48,533 --> 00:12:50,633 review into its different words. 369 00:12:50,633 --> 00:12:53,100 And now, my friends, we're ready 370 00:12:53,100 --> 00:12:53,800 for the stemming.
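Put together, the three cleaning steps from this part could be sketched as below. To stay self-contained, the sketch loops over a small made-up list, `sample_reviews`, rather than the 1,000-review data set; in the notebook the text would instead come from something like `dataset['Review'][i]` inside `for i in range(0, 1000)`, where the variable name `dataset` is an assumption. Appending the split review to `corpus` at this stage is only for illustration, since the stemming and stopword steps still come before that in the full version.

```python
import re

# Hypothetical stand-in for the data set's review column (made-up sample text).
sample_reviews = ["Wow... Loved this place.", "Crust is not good!"]

corpus = []  # will collect the cleaned reviews
for i in range(0, len(sample_reviews)):
    # 1) Replace everything that is not a letter (a-z or A-Z) with a space.
    review = re.sub('[^a-zA-Z]', ' ', sample_reviews[i])
    # 2) Put all letters in lowercase.
    review = review.lower()
    # 3) Split the review into a list of words (runs of spaces are ignored).
    review = review.split()
    corpus.append(review)

print(corpus)
```

Because `split()` with no argument treats any run of whitespace as one separator, the extra spaces left behind by the punctuation replacement do not produce empty words.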