1 00:00:00,200 --> 00:00:02,466 Hello and welcome to this R tutorial. 2 00:00:02,466 --> 00:00:05,166 So now let's proceed to the next step of the cleaning process, 3 00:00:05,166 --> 00:00:08,166 which is about removing all the numbers from the reviews. 4 00:00:08,433 --> 00:00:12,066 So we will just copy this line and paste it below, 5 00:00:12,333 --> 00:00:15,900 because you'll see that now, during the next steps of the cleaning process, 6 00:00:15,900 --> 00:00:17,066 it is going to be very easy 7 00:00:17,066 --> 00:00:21,366 because we will always use the same line here to update the corpus. 8 00:00:21,700 --> 00:00:24,700 And at each new step, we are going to apply the right 9 00:00:24,700 --> 00:00:28,600 transformation that we want to do for the particular step of the cleaning process 10 00:00:28,866 --> 00:00:31,500 to the whole corpus, thanks to the tm_map function. 11 00:00:31,500 --> 00:00:35,400 So basically what we just need to do is replace this content_transformer 12 00:00:35,400 --> 00:00:39,766 with the tolower input by removeNumbers. 13 00:00:40,066 --> 00:00:41,666 And that's how we apply this 14 00:00:41,666 --> 00:00:45,300 removeNumbers function to the reviews in the corpus through the tm_map function. 15 00:00:45,533 --> 00:00:50,166 And that will remove all the numbers from all the 1000 reviews in the corpus. 16 00:00:50,466 --> 00:00:51,700 So let's check it out. 17 00:00:51,700 --> 00:00:52,433 To check that out, 18 00:00:52,433 --> 00:00:55,600 we cannot do it with this first review because this first review doesn't 19 00:00:55,600 --> 00:00:58,666 contain any number, so nothing will be removed here. 20 00:00:58,666 --> 00:01:00,733 But I had a look at the data set, 21 00:01:00,733 --> 00:01:04,766 and there is the review 841 that contains a number. 22 00:01:05,066 --> 00:01:07,100 Let's have a look at this review. 23 00:01:07,100 --> 00:01:10,533 To do this we will use the same line as we used to look at the first review. 
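For readers following along without the video, the step described so far might look like the sketch below. It assumes the tm package is installed and uses a small made-up character vector, `reviews`, in place of the course data set, so index 2 here stands in for review 841. Only `tm_map`, `content_transformer`, `tolower` and `removeNumbers` come from the tutorial itself; the other names are illustrative.

```r
library(tm)

# Hypothetical stand-in for the review column of the data set (an assumption)
reviews <- c("Wow... Loved this place.",
             "For 40 bucks a head, I really expect better food.")

# Build the corpus from the character vector
corpus <- VCorpus(VectorSource(reviews))

# Previous cleaning step: put every review in lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

# Same line as above, with content_transformer(tolower) replaced by removeNumbers
corpus <- tm_map(corpus, removeNumbers)

# Look at one review by index to check the result
as.character(corpus[[2]])
```

Because every step reassigns `corpus`, each new cleaning step only changes the second argument of `tm_map`.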
24 00:01:10,833 --> 00:01:14,633 So I'm pressing the up arrow here to take this line of code 25 00:01:14,700 --> 00:01:16,866 that gives us access to the reviews. 26 00:01:16,866 --> 00:01:21,966 And instead of inputting the index 1 here, we will input the 841 index. 27 00:01:22,400 --> 00:01:25,100 And now let's press enter to have a look at the review. 28 00:01:25,100 --> 00:01:27,866 And the review is: for 40 bucks a head, 29 00:01:27,866 --> 00:01:29,700 I really expect better food. 30 00:01:29,700 --> 00:01:30,933 So that's a negative review. 31 00:01:30,933 --> 00:01:34,800 And what should be highlighted in this review is this number 40 here, 32 00:01:34,833 --> 00:01:37,900 which we want to see if it's going to be removed 33 00:01:38,100 --> 00:01:42,133 once we apply this removeNumbers function to the reviews of the corpus, 34 00:01:42,333 --> 00:01:46,500 so that all the numbers in the reviews can be removed thanks to the tm_map function. 35 00:01:46,766 --> 00:01:48,200 So let's check it out. 36 00:01:48,200 --> 00:01:51,200 We will select this line and execute. 37 00:01:51,666 --> 00:01:52,466 Done. 38 00:01:52,466 --> 00:01:56,100 And now let's have a look at this review 841. 39 00:01:56,500 --> 00:01:59,466 So I'm pressing the up arrow twice 40 00:01:59,466 --> 00:02:03,233 to get back to this line of code giving us the written review. 41 00:02:03,433 --> 00:02:05,333 So now the corpus is updated. 42 00:02:05,333 --> 00:02:08,233 Let's see if the number 40 disappeared, 43 00:02:08,233 --> 00:02:11,100 and it did: for bucks a head, 44 00:02:11,100 --> 00:02:12,933 I really expect better food. 45 00:02:12,933 --> 00:02:14,033 40 disappeared. 46 00:02:14,033 --> 00:02:16,566 And that's exactly what we wanted. So great. 47 00:02:16,566 --> 00:02:21,166 Basically all the numbers are now removed from the reviews in the corpus. 
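To make the before and after concrete, here is a minimal base R sketch of what happens to the text of this one review. It is not the tm implementation: the `gsub` call is an assumed stand-in that deletes runs of digits, and the review string is typed in by hand as it is read out in the tutorial, already in lowercase from the previous step.

```r
# Review 841 as read out in the tutorial, after the lowercase step
review <- "for 40 bucks a head, i really expect better food."

# Assumption: a plain gsub that deletes every run of digits,
# used here only to illustrate the effect of the number-removal step
cleaned <- gsub("[[:digit:]]+", "", review)

print(cleaned)
# Only the digits are deleted, so two spaces remain between "for" and "bucks"
```

The leftover double space is harmless at this point, since only the words matter for the columns built later.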
48 00:02:21,433 --> 00:02:24,733 This step is done, and we are ready to move on to the next step, 49 00:02:25,033 --> 00:02:29,666 which will be about removing any kind of punctuation in the reviews, 50 00:02:29,966 --> 00:02:33,733 because of course, in our sparse matrix, in the end, we don't want to get a column 51 00:02:33,733 --> 00:02:37,700 for a comma, or another column for a colon, or another column 52 00:02:37,700 --> 00:02:41,700 for a dot, or for a semicolon, or any kind of punctuation. 53 00:02:42,033 --> 00:02:45,333 Of course, we only want to create some columns for relevant words 54 00:02:45,333 --> 00:02:48,200 that will help the machine learning classification algorithm 55 00:02:48,200 --> 00:02:49,300 to see the correlations 56 00:02:49,300 --> 00:02:51,533 between the presence of the words and the outcome, 57 00:02:51,533 --> 00:02:54,000 whether the review is positive or negative. 58 00:02:54,000 --> 00:02:54,333 All right. 59 00:02:54,333 --> 00:02:56,100 So let's do that in the next tutorial. 60 00:02:56,100 --> 00:02:57,733 And until then, enjoy machine learning.