1
00:00:00,166 --> 00:00:02,766
Hello and welcome to this R tutorial.

2
00:00:02,766 --> 00:00:05,100
So we did the big part of the cleaning process.

3
00:00:05,100 --> 00:00:07,500
We started by creating this corpus here.

4
00:00:07,500 --> 00:00:09,633
Then we put all the words in lowercase.

5
00:00:09,633 --> 00:00:12,633
Then we removed all the numbers, all the punctuation,

6
00:00:12,866 --> 00:00:15,666
and we also removed all the non-relevant words.

7
00:00:15,666 --> 00:00:19,800
And finally we took the roots of all the words in the 1000 reviews.

8
00:00:20,466 --> 00:00:24,033
So all these steps considerably simplified the words in the reviews.

9
00:00:24,033 --> 00:00:27,166
So thanks to all these steps here, all the simplifications,

10
00:00:27,400 --> 00:00:28,766
well, our final sparse

11
00:00:28,766 --> 00:00:32,066
matrix of features will get a much smaller number of columns.

12
00:00:32,300 --> 00:00:35,300
So that's good for the algorithm because we reduced sparsity.

13
00:00:35,700 --> 00:00:38,400
And now we have one final little step to do

14
00:00:38,400 --> 00:00:41,700
that will not be done to simplify the corpus even more,

15
00:00:42,000 --> 00:00:44,900
but that just consists of removing the extra spaces.

16
00:00:44,900 --> 00:00:48,300
Because by doing all these simplifications here, well,

17
00:00:48,300 --> 00:00:52,133
you know, we removed some things, and those things that were removed

18
00:00:52,333 --> 00:00:55,200
could actually have been replaced by a space,

19
00:00:55,200 --> 00:00:58,333
and that would cause an extra space in the review.

20
00:00:58,566 --> 00:00:59,933
And therefore, if we want to have

21
00:00:59,933 --> 00:01:03,566
perfectly clean reviews, we must remove these extra spaces.

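As a rough illustration of why these extra spaces appear, here is a minimal sketch in base R. It assumes the number is simply deleted with `gsub`, which is an assumption for illustration and not necessarily how the cleaning functions used in this tutorial work internally. The `review` string is a made-up fragment, not a review from the dataset.

```r
# Hypothetical sample text, not taken from the dataset.
review <- "table for 2 people"

# Deleting the digits leaves the spaces on both sides of the number in place,
# so two consecutive spaces remain.
no_numbers <- gsub("[[:digit:]]+", "", review)
print(no_numbers)  # value: "table for  people" (two spaces)

# Collapsing every run of whitespace into a single space cleans this up.
cleaned <- gsub("[[:space:]]+", " ", no_numbers)
print(cleaned)     # value: "table for people"
```

The second `gsub` call is only a stand-in for the whitespace-stripping step described in this tutorial.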
22
00:01:04,033 --> 00:01:08,300
If we remove all the extra spaces, well, the columns in our final sparse matrix

23
00:01:08,300 --> 00:01:11,466
of features will only contain the words that are relevant

24
00:01:11,700 --> 00:01:14,400
and will not contain any space or anything else.

25
00:01:14,400 --> 00:01:18,200
So that's what we'll remove in this final step of the cleaning process.

26
00:01:18,366 --> 00:01:20,200
So let's do it. Again,

27
00:01:20,200 --> 00:01:21,533
it's going to be very simple.

28
00:01:21,533 --> 00:01:24,100
We'll copy this line here.

29
00:01:24,100 --> 00:01:27,366
Copy and paste it below and replace

30
00:01:27,366 --> 00:01:31,600
stemDocument here by stripWhitespace.

31
00:01:31,600 --> 00:01:37,300
Here it is. Pressing Enter, and all good: final step ready to be executed.

32
00:01:37,300 --> 00:01:41,066
And therefore the whole cleaning process is ready to be completed.

33
00:01:41,600 --> 00:01:44,400
So before we execute this line, let's have a look

34
00:01:44,400 --> 00:01:47,400
at a review that now has an extra space.

35
00:01:47,400 --> 00:01:51,800
I remember that when we removed the number in review 841,

36
00:01:52,033 --> 00:01:55,000
well, we got an extra space. Here it is.

37
00:01:55,000 --> 00:01:58,833
You know, before, the review was "For 40 bucks a head,

38
00:01:59,100 --> 00:02:00,400
I really expect better food."

39
00:02:00,400 --> 00:02:04,466
And when we applied removeNumbers here to the corpus, well,

40
00:02:04,466 --> 00:02:06,333
the number 40 here disappeared.

41
00:02:06,333 --> 00:02:07,900
But it actually didn't disappear.

42
00:02:07,900 --> 00:02:10,900
It was just replaced by this extra space here.

43
00:02:11,000 --> 00:02:15,033
Because indeed we can see that we have two spaces between "for" and "bucks".

44
00:02:15,366 --> 00:02:17,966
And what we want to get is only one space here.

45
00:02:17,966 --> 00:02:20,433
So we're just removing the extra space here.

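The line being described would look roughly like the last `tm_map` call in the sketch below. This is only a sketch under assumptions: it assumes the `tm` package (with `SnowballC` installed for stemming), a corpus variable named `corpus`, and `tm_map` calls for the earlier steps, since the transcript does not show the code itself. The first sample review is the one quoted in the tutorial; the second is made up.

```r
library(tm)  # stemDocument also needs the SnowballC package installed

# Two sample reviews: the first is quoted in the tutorial, the second is made up.
reviews <- c("For 40 bucks a head, I really expect better food.",
             "We loved this place, 10 out of 10!")

# Cleaning steps in the order the tutorial lists them
# (the variable name corpus is an assumption).
corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stemDocument)

# Final step: collapse runs of whitespace into a single space.
corpus <- tm_map(corpus, stripWhitespace)

# Inspect the first cleaned review.
as.character(corpus[[1]])
```

With the full dataset loaded, the same inspection on `corpus[[841]]` would show the review discussed here.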
46
00:02:20,433 --> 00:02:21,633
So let's check it out.

47
00:02:21,633 --> 00:02:26,400
When we select this final line of code of our cleaning process,

48
00:02:26,600 --> 00:02:28,733
well, let's make sure that this extra space

49
00:02:28,733 --> 00:02:32,166
in review 841 disappears, this time for good.

50
00:02:32,700 --> 00:02:33,066
All right.

51
00:02:33,066 --> 00:02:35,800
So I'm going to press Command + Enter to execute.

52
00:02:35,800 --> 00:02:36,633
Here we go.

53
00:02:36,633 --> 00:02:40,033
And now let's have a look at review 841.

54
00:02:40,033 --> 00:02:42,900
So let's press the up arrow here to find it.

55
00:02:44,500 --> 00:02:45,433
Here it is.

56
00:02:45,433 --> 00:02:48,366
We're going to get the new version of review 841

57
00:02:48,366 --> 00:02:51,966
after we applied this stripWhitespace to the corpus.

58
00:02:52,166 --> 00:02:54,866
So let's do it, pressing Command here.

59
00:02:54,866 --> 00:02:56,766
Well, not only was the space removed,

60
00:02:56,766 --> 00:02:59,266
we can see that we don't have any extra space here,

61
00:02:59,266 --> 00:03:02,366
but we also get all the other steps of the cleaning process.

62
00:03:02,666 --> 00:03:06,433
Since indeed we can clearly see that the non-relevant words were removed.

63
00:03:06,666 --> 00:03:09,866
So the non-relevant words were, for example, "for".

64
00:03:10,066 --> 00:03:11,633
So this was removed.

65
00:03:11,633 --> 00:03:15,366
And then "a", "I"; I think that's all.

66
00:03:15,966 --> 00:03:20,533
Yes. As you can see, "a" was removed here and "I" was removed here.

67
00:03:21,066 --> 00:03:23,366
Okay. So that's the first thing we see very clearly.

68
00:03:23,366 --> 00:03:26,800
And the second thing we see very clearly is the stemming, of course,

69
00:03:26,800 --> 00:03:29,700
because we hardly recognize the words here.

70
00:03:29,700 --> 00:03:31,933
"Bucks" was replaced by "buck".

71
00:03:31,933 --> 00:03:34,800
You know, it's not only about the past tense of verbs,

72
00:03:34,800 --> 00:03:37,900
it's also about the singular and plural of nouns.

73
00:03:38,133 --> 00:03:41,100
So "bucks" became "buck",

74
00:03:41,100 --> 00:03:45,900
and "head" was not replaced because the root of "head" is "head".

75
00:03:45,900 --> 00:03:48,900
So that's why we kept "head" here.

76
00:03:49,166 --> 00:03:51,600
And then, "really":

77
00:03:51,600 --> 00:03:54,600
"really" with a y became "realli" with an i.

78
00:03:54,766 --> 00:03:57,466
So that's simply how R interpreted the root.

79
00:03:57,466 --> 00:03:59,233
So that's okay. It's not a mistake.

80
00:03:59,233 --> 00:04:00,800
And finally "expect", "better"

81
00:04:00,800 --> 00:04:04,966
and "food" were not replaced because, you know, these are kind of already

82
00:04:04,966 --> 00:04:06,133
the root of the words.

83
00:04:06,133 --> 00:04:09,133
And we cannot really simplify the words more.

84
00:04:09,200 --> 00:04:09,566
All right.

85
00:04:09,566 --> 00:04:10,833
So that's a good example.

86
00:04:10,833 --> 00:04:12,600
We can clearly see what happened here.

87
00:04:12,600 --> 00:04:16,033
And for what we just did here with stripWhitespace,

88
00:04:16,166 --> 00:04:18,900
we can see that this extra space here was removed.

89
00:04:18,900 --> 00:04:20,733
Well, there is still a space here.

90
00:04:20,733 --> 00:04:25,200
But you know, before, there was the "for" and two spaces between "for" and "buck".

91
00:04:25,466 --> 00:04:28,400
And then after the stopwords step, the "for" was removed.

92
00:04:28,400 --> 00:04:32,600
So we can still see a space here before "buck", but that's actually one space

93
00:04:32,600 --> 00:04:33,600
instead of two.

94
00:04:33,600 --> 00:04:34,300
If we want to get

95
00:04:34,300 --> 00:04:38,033
even more convinced, well, we can see that we have an extra space here

96
00:04:38,300 --> 00:04:39,433
between "love" and "place".

97
00:04:39,433 --> 00:04:42,500
This review is the first review in the previous version

98
00:04:42,500 --> 00:04:45,900
of the corpus, before we applied stripWhitespace here.

99
00:04:46,200 --> 00:04:48,766
And so now we'll see that if we have a look at the first review

100
00:04:48,766 --> 00:04:54,000
of the new version of our corpus, well, this extra space here will be removed,

101
00:04:54,200 --> 00:04:57,666
and we will get only one space between "love" and "place" instead of two.

102
00:04:57,666 --> 00:04:59,733
Here. Let's check it out.

103
00:04:59,733 --> 00:05:00,266
Here we go.

104
00:05:00,266 --> 00:05:04,533
We can clearly see that we have only one space here instead of two spaces here.

105
00:05:04,900 --> 00:05:06,033
All right, so good.

106
00:05:06,033 --> 00:05:08,566
Removing the extra spaces worked properly.

107
00:05:08,566 --> 00:05:11,466
And so we are done with the cleaning process.

108
00:05:11,466 --> 00:05:12,366
So great.

109
00:05:12,366 --> 00:05:16,566
That means that now we are ready to build the sparse matrix of features.

110
00:05:16,833 --> 00:05:18,566
We'll do that in the next tutorial.

111
00:05:18,566 --> 00:05:20,400
And until then, enjoy machine learning.