1 00:00:00,166 --> 00:00:02,533 Hello and welcome to this R tutorial. 2 00:00:02,533 --> 00:00:05,900 So far we've been doing a great deal of simplification 3 00:00:05,900 --> 00:00:09,400 for our corpus and therefore for our future sparse matrix of features. 4 00:00:09,766 --> 00:00:14,566 But we can still do better, and doing better is what we'll do in this tutorial. 5 00:00:14,800 --> 00:00:18,533 It's a new step of the cleaning process that is called stemming. 6 00:00:18,866 --> 00:00:20,366 So what is stemming? 7 00:00:20,366 --> 00:00:23,766 Well, stemming is about getting the root of each word. 8 00:00:24,333 --> 00:00:28,800 For example, if we look at the first review, we have this word "loved" here. 9 00:00:29,100 --> 00:00:32,100 And the root of this word would be "love". 10 00:00:32,433 --> 00:00:35,433 So what is the purpose of getting the root of the word? 11 00:00:35,566 --> 00:00:39,033 Well, it's of course still related to our goal to reduce the total 12 00:00:39,033 --> 00:00:42,600 number of words that will be in our future sparse matrix of features. 13 00:00:43,000 --> 00:00:45,866 And we can do this by taking the root of the words. 14 00:00:45,866 --> 00:00:50,433 Because whether we have "loved" or "love" or "will love" or "loving", 15 00:00:50,766 --> 00:00:53,900 well, this actually means the same thing for our algorithm. 16 00:00:54,333 --> 00:00:57,966 And not only does it mean the same thing, but it also gives the same hint 17 00:00:58,166 --> 00:01:00,600 about whether the review is positive or negative. 18 00:01:00,600 --> 00:01:04,500 So we don't really need to have different tenses of the same verb. 19 00:01:04,733 --> 00:01:07,266 And we don't really need to have derivative words. 20 00:01:07,266 --> 00:01:09,200 We just need the root of the words. 
21 00:01:09,200 --> 00:01:12,400 And that will be perfectly enough for our machine learning classification 22 00:01:12,400 --> 00:01:15,600 model to train on the future sparse matrix of features, 23 00:01:15,866 --> 00:01:18,866 which will therefore contain only the roots of the words. 24 00:01:18,900 --> 00:01:21,833 And you can imagine how we will considerably reduce 25 00:01:21,833 --> 00:01:23,500 the final total number of words, 26 00:01:23,500 --> 00:01:27,133 that is, the final total number of columns in the sparse matrix of features. 27 00:01:27,333 --> 00:01:29,566 Because of course, by only keeping the roots 28 00:01:29,566 --> 00:01:31,500 of the different versions of the same word, 29 00:01:31,500 --> 00:01:34,700 well, of course that simplifies it very well and therefore 30 00:01:34,733 --> 00:01:37,733 considerably reduces the final total number of words. 31 00:01:38,166 --> 00:01:39,366 So that's stemming. 32 00:01:39,366 --> 00:01:42,500 That's also a very important step in natural language processing. 33 00:01:42,800 --> 00:01:45,633 You will most of the time apply stemming to your text, 34 00:01:45,633 --> 00:01:50,633 whether you are working with reviews or articles or books or HTML pages. 35 00:01:50,833 --> 00:01:53,666 Well, for any kind of text, it really helps your machine learning 36 00:01:53,666 --> 00:01:57,466 algorithm to do an even better job for your classification problem. 37 00:01:57,866 --> 00:01:59,766 So let's do it for our reviews. 38 00:01:59,766 --> 00:02:02,400 And it is still going to be very simple. 39 00:02:02,400 --> 00:02:04,300 We will do another copy-paste here. 40 00:02:04,300 --> 00:02:08,400 So I will actually copy this line because we only need two parameters: 41 00:02:08,633 --> 00:02:12,633 the corpus and a function that will perform the stemming. 
42 00:02:12,633 --> 00:02:16,066 So paste here, and I will replace 43 00:02:16,066 --> 00:02:20,466 removePunctuation by the appropriate function to proceed to the stemming, 44 00:02:20,700 --> 00:02:24,900 which is stemDocument, with a capital D. 45 00:02:24,900 --> 00:02:25,733 Here it is. 46 00:02:25,733 --> 00:02:30,366 That's the function we use to perform stemming on all the reviews of our corpus. 47 00:02:30,766 --> 00:02:32,366 So let's check it out. 48 00:02:32,366 --> 00:02:35,166 Let's select this line. Right now, 49 00:02:35,166 --> 00:02:38,166 our first review is "wow loved place". 50 00:02:38,366 --> 00:02:42,033 And you'll see that after stemming, "loved" becomes "love". 51 00:02:42,600 --> 00:02:44,533 All right. So let's execute now. 52 00:02:44,533 --> 00:02:45,600 Press Command or Control plus 53 00:02:45,600 --> 00:02:46,933 Enter to execute. 54 00:02:46,933 --> 00:02:49,600 Here we go. New corpus updated. 55 00:02:49,600 --> 00:02:53,233 And now let's have a look at the first review of this new corpus. 56 00:02:53,666 --> 00:02:57,433 So I'm pressing the up arrow here to get this line of code. 57 00:02:57,700 --> 00:02:59,533 And now pressing Enter. 58 00:02:59,533 --> 00:03:00,733 And here we go. 59 00:03:00,733 --> 00:03:03,166 "Wow", "love" and "place". 60 00:03:03,166 --> 00:03:05,566 So "loved" was replaced by "love". 61 00:03:05,566 --> 00:03:08,133 Because the root of "loved" is "love". 62 00:03:08,133 --> 00:03:09,500 All right. 63 00:03:09,500 --> 00:03:11,400 So that's the same for all the reviews: 64 00:03:11,400 --> 00:03:15,066 in all the other reviews, the words were replaced by their roots. 65 00:03:15,666 --> 00:03:18,100 So that's done for this new step. 66 00:03:18,100 --> 00:03:20,900 And actually we are almost done with the cleaning process. 67 00:03:20,900 --> 00:03:24,866 We have one final step and we will do this final step in the next tutorial. 68 00:03:25,166 --> 00:03:26,733 Until then, enjoy machine learning.
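
The stemming step described in this tutorial can be sketched roughly as follows in R. This is a minimal sketch, not the exact code shown on screen: it assumes the tm and SnowballC packages are installed, that the copied line uses tm_map (the transcript only names stemDocument and removePunctuation), and the three sample reviews are hypothetical stand-ins for the already cleaned corpus.

```r
# Sketch only: assumes the tm and SnowballC packages are installed.
library(tm)

# Hypothetical stand-in for the reviews after the earlier cleaning steps
# (lowercased, punctuation and stop words already removed).
reviews <- c("wow loved place", "crust not good", "loving food will love again")

# Build a corpus from the character vector.
corpus <- VCorpus(VectorSource(reviews))

# Assumed form of the line from the tutorial: apply stemDocument to every
# document in the corpus, the same way removePunctuation was applied before.
corpus <- tm_map(corpus, stemDocument)

# Inspect the first review after stemming.
first_review <- as.character(corpus[[1]])
print(first_review)
```

With this sample data, the first review should come out with "loved" reduced to "love", matching what the tutorial reports. Because "loved", "loving" and "love" collapse to a single root, they would later share one column in the sparse matrix of features instead of three.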