1 00:00:00,166 --> 00:00:02,533 Hello and welcome to this R tutorial. 2 00:00:02,533 --> 00:00:05,466 So we're still trying as hard as we can to simplify 3 00:00:05,466 --> 00:00:09,166 the corpus in order to reduce the future sparse matrix of features 4 00:00:09,166 --> 00:00:10,666 as much as possible. 5 00:00:10,666 --> 00:00:13,466 And so far we put all the words in lowercase. 6 00:00:13,466 --> 00:00:16,700 We removed all the numbers and removed all the punctuation. 7 00:00:16,933 --> 00:00:18,333 And in today's tutorial 8 00:00:18,333 --> 00:00:21,866 we are going to remove all the non-relevant words of our reviews. 9 00:00:22,200 --> 00:00:23,766 So what is a non-relevant word? 10 00:00:23,766 --> 00:00:26,466 It's for example the word "this" here. 11 00:00:26,466 --> 00:00:31,300 Indeed, if we know that this first review, "wow loved this place", is a positive review, 12 00:00:31,400 --> 00:00:32,833 it's not thanks to this word. 13 00:00:32,833 --> 00:00:35,833 It's of course thanks to the "loved" word. 14 00:00:35,866 --> 00:00:39,633 So that means that this word here is not a relevant word. 15 00:00:39,833 --> 00:00:43,633 And so it's totally non-relevant for us and not useful to include it 16 00:00:43,633 --> 00:00:45,433 in the future sparse matrix. 17 00:00:45,433 --> 00:00:48,166 That will be nothing else than the matrix of features 18 00:00:48,166 --> 00:00:50,900 and the outcome, whether the review is positive or negative. 19 00:00:50,900 --> 00:00:52,800 So that's what we'll do in this tutorial. 20 00:00:52,800 --> 00:00:55,566 We will remove all these non-relevant words. 21 00:00:55,566 --> 00:00:59,166 And we will update our corpus of reviews by removing all these words. 22 00:00:59,600 --> 00:01:01,333 All right. So same as before. 23 00:01:01,333 --> 00:01:06,133 Very simply, we take this line, copy it and paste it below.
24 00:01:06,400 --> 00:01:09,200 And in this line we're going to replace 25 00:01:09,200 --> 00:01:12,600 removePunctuation by removeWords. 26 00:01:13,933 --> 00:01:15,100 But that's not all this time. 27 00:01:15,100 --> 00:01:16,900 It's not as simple as before. 28 00:01:16,900 --> 00:01:18,600 We need to add something 29 00:01:18,600 --> 00:01:22,200 that will specify which words we want to remove. 30 00:01:22,500 --> 00:01:25,866 And actually there is a built-in list of non-relevant words 31 00:01:26,100 --> 00:01:28,066 that is called stopwords. 32 00:01:28,066 --> 00:01:31,500 And that basically contains all the non-relevant words like "this", 33 00:01:31,800 --> 00:01:36,500 all the articles, prepositions, like "and", "or", well, you know, all these words 34 00:01:36,500 --> 00:01:38,033 that don't help the machine 35 00:01:38,033 --> 00:01:41,066 learning algorithm figure out if the review is positive or negative. 36 00:01:41,533 --> 00:01:46,366 So this list of non-relevant words, the stopwords list, is practically 37 00:01:46,366 --> 00:01:50,033 always used in natural language processing, because indeed, these words 38 00:01:50,033 --> 00:01:54,066 will never help you or your algorithm to classify your texts. 39 00:01:54,300 --> 00:01:56,400 So you will most of the time use it. 40 00:01:56,400 --> 00:01:59,600 And therefore that's a very important step because of course 41 00:01:59,600 --> 00:02:03,733 that simplifies the corpus and reduces the future sparse matrix very much. 42 00:02:03,933 --> 00:02:08,400 So as I just said, this is not the only thing we need to input in this tm_map function. 43 00:02:08,666 --> 00:02:10,866 We need to input a third parameter. 44 00:02:10,866 --> 00:02:13,733 And that corresponds to the words we want to remove. 45 00:02:13,733 --> 00:02:15,633 That is, all the non-relevant words. 46 00:02:15,633 --> 00:02:18,866 And these words are in this stopwords 47 00:02:20,000 --> 00:02:21,233 function.
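The call described so far can be sketched roughly as follows. This is only a minimal illustration, assuming the tm package is installed and the corpus was built with VCorpus; the one-review corpus and the variable names are made up for the example and are not taken from the recording.

```r
library(tm)

# Hypothetical one-review corpus, already lowercased and without punctuation
reviews <- c("wow loved this place")
corpus <- VCorpus(VectorSource(reviews))

# tm_map passes its third argument on to removeWords as the words to drop;
# stopwords() with no argument returns tm's English stop word list
corpus <- tm_map(corpus, removeWords, stopwords())

# Text of the first review after the stop words have been removed
first_review <- as.character(corpus[[1]])
print(first_review)
```

Note that removeWords only deletes the words themselves, so the spaces around a removed word stay in the text.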
48 00:02:21,233 --> 00:02:22,200 All right. 49 00:02:22,200 --> 00:02:26,600 So basically this returns all the words that are not relevant for our model. 50 00:02:26,733 --> 00:02:29,700 And therefore thanks to this function here we will remove 51 00:02:29,700 --> 00:02:32,700 all the words returned by this stopwords function. 52 00:02:32,866 --> 00:02:33,233 All right. 53 00:02:33,233 --> 00:02:35,000 So that's all for this line of code. 54 00:02:35,000 --> 00:02:38,300 But we need to add a little something, 55 00:02:38,333 --> 00:02:39,866 well, especially for you 56 00:02:39,866 --> 00:02:42,866 if you're doing natural language processing for the first time, 57 00:02:43,033 --> 00:02:46,833 which is a library that we need to install and import 58 00:02:47,000 --> 00:02:51,466 to be able to use the stopwords function, because this function is not included 59 00:02:51,466 --> 00:02:53,100 in the default packages of R, 60 00:02:53,100 --> 00:02:56,100 so we need to install the required package to use this function. 61 00:02:56,400 --> 00:02:58,533 And this package actually has a funny name. 62 00:02:58,533 --> 00:03:01,200 It is called SnowballC. 63 00:03:01,200 --> 00:03:06,733 And so let's now install this package, so we can, you know, copy this line, 64 00:03:07,500 --> 00:03:08,333 paste it here. 65 00:03:08,333 --> 00:03:13,900 And in the quotes here in the parentheses we input SnowballC. 66 00:03:13,900 --> 00:03:16,900 So it's spelled this way: Snowball, 67 00:03:16,966 --> 00:03:18,533 and then C. 68 00:03:18,533 --> 00:03:18,833 All right. 69 00:03:18,833 --> 00:03:22,800 So right now it's in comments, but you know, check in your packages list 70 00:03:22,800 --> 00:03:25,933 if you already have this SnowballC package, we never know. 71 00:03:26,400 --> 00:03:29,500 And if you don't have it, well, you can execute this line 72 00:03:29,500 --> 00:03:32,500 without the comment to install the package. 73 00:03:32,533 --> 00:03:33,366 All right.
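One way to automate the "check your packages list first" advice is sketched below. It is an illustration rather than the line shown in the recording: requireNamespace is used to test for the package, and the CRAN mirror URL is an assumption that any other mirror could replace.

```r
# Install SnowballC only when it is not already in the package library.
# The mirror URL is an assumption; any CRAN mirror should work.
if (!requireNamespace("SnowballC", quietly = TRUE)) {
  install.packages("SnowballC", repos = "https://cloud.r-project.org")
}

# TRUE when the package can be loaded, FALSE otherwise
snowball_available <- requireNamespace("SnowballC", quietly = TRUE)
print(snowball_available)
```

As far as I know, recent releases of tm export stopwords() themselves, and SnowballC mainly provides the stemmer that tm's stemDocument relies on, so the package is in any case useful for the stemming step announced for the next tutorial.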
74 00:03:33,366 --> 00:03:37,166 And now of course, as usual, we will import the package 75 00:03:37,200 --> 00:03:40,000 automatically thanks to this library function. 76 00:03:40,000 --> 00:03:44,033 So, same, we will copy this line, paste 77 00:03:44,066 --> 00:03:49,533 it below and replace tm by actually Snowball 78 00:03:50,733 --> 00:03:52,766 C. All right, SnowballC. 79 00:03:52,766 --> 00:03:56,000 Now the required packages are installed 80 00:03:56,000 --> 00:03:59,133 and imported to be able to use this stopwords function. 81 00:03:59,366 --> 00:04:01,200 So everything is all good. 82 00:04:01,200 --> 00:04:04,266 And now let's try it on the first review. 83 00:04:04,266 --> 00:04:08,400 Because the first review contains some irrelevant words like "this". 84 00:04:08,400 --> 00:04:12,433 Here, it might be the only word that is removed because, you know, 85 00:04:12,433 --> 00:04:16,700 "wow" might not be a word of this stopwords list because, you know, 86 00:04:16,700 --> 00:04:21,500 the stopwords list is a list of common words like articles and prepositions, 87 00:04:21,766 --> 00:04:25,300 common words, but irrelevant words, and "wow" is not common. 88 00:04:25,300 --> 00:04:26,966 So it might not be removed. 89 00:04:26,966 --> 00:04:30,100 But definitely "this" will be removed because it is a common 90 00:04:30,300 --> 00:04:31,833 and non-relevant word. 91 00:04:31,833 --> 00:04:33,133 So let's check it out. 92 00:04:33,133 --> 00:04:36,433 We will select this line and execute. 93 00:04:36,833 --> 00:04:37,533 Here we go. 94 00:04:37,533 --> 00:04:41,700 And now let's have a look at the first review by pressing the up arrow. 95 00:04:42,066 --> 00:04:44,266 Here it is. Let's press enter. 96 00:04:44,266 --> 00:04:47,266 And as I just said, "this" was removed. 97 00:04:47,700 --> 00:04:51,166 So now the first review becomes "wow loved place". 98 00:04:51,533 --> 00:04:54,900 And you know, even if we really simplified the review,
99 00:04:54,900 --> 00:04:57,333 and now it doesn't look the same as the original review, 100 00:04:57,333 --> 00:05:00,700 well, we can still understand that it's a positive review. 101 00:05:00,700 --> 00:05:04,333 And especially our machine learning model will understand it very well. 102 00:05:04,600 --> 00:05:08,033 And that will be thanks to this "loved" word here, 103 00:05:08,133 --> 00:05:11,833 which is a word that, of course, might be present in some other reviews, 104 00:05:12,066 --> 00:05:15,066 which themselves will be positive reviews as well. 105 00:05:15,233 --> 00:05:18,600 So that's how our machine learning algorithm will understand 106 00:05:18,733 --> 00:05:22,900 that "loved" indicates a positive review, and therefore 107 00:05:22,900 --> 00:05:26,900 that's all it needs to establish that kind of correlation. 108 00:05:26,900 --> 00:05:29,633 And "this" is totally useless for us. 109 00:05:29,633 --> 00:05:31,900 And we were right to remove it. 110 00:05:31,900 --> 00:05:32,266 All right. 111 00:05:32,266 --> 00:05:33,900 So that's done for this step. 112 00:05:33,900 --> 00:05:35,800 That was another very important step. 113 00:05:35,800 --> 00:05:36,733 But that's not all. 114 00:05:36,733 --> 00:05:37,666 In the next tutorial 115 00:05:37,666 --> 00:05:41,800 we'll do another very important step, which is the stemming step. 116 00:05:42,200 --> 00:05:43,666 And so I'll explain to you what it is 117 00:05:43,666 --> 00:05:47,266 and how we perform this new cleaning step in the next tutorial. 118 00:05:47,566 --> 00:05:49,166 Until then, enjoy machine learning.
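To recap which words of the first review are on the stop word list, here is a small hedged sketch. It assumes the tm package is installed and applies removeWords directly to a character string instead of a corpus; the variable names are illustrative only.

```r
library(tm)

# English stop word list shipped with tm (the default for stopwords())
stop_list <- stopwords()

# Membership checks for the two words discussed in the tutorial
this_is_stopword <- "this" %in% stop_list
wow_is_stopword <- "wow" %in% stop_list

# removeWords also accepts a plain character vector
cleaned <- removeWords("wow loved this place", stop_list)

# Split on whitespace to list the words that survive the cleaning
remaining_words <- strsplit(trimws(cleaned), "\\s+")[[1]]
print(remaining_words)
```

Splitting on whitespace is just a convenient way to inspect the result here, since the removal leaves a double space where the stop word used to be.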