1 00:00:00,233 --> 00:00:02,533 Hello and welcome to this R tutorial. 2 00:00:02,533 --> 00:00:05,933 So in the previous tutorials we imported the data set and we started 3 00:00:05,933 --> 00:00:09,000 with the big first step of natural language processing, 4 00:00:09,000 --> 00:00:12,000 which is about cleaning the texts we are working with. 5 00:00:12,333 --> 00:00:18,066 So this first step consisted of creating a corpus that basically is a new data set, 6 00:00:18,066 --> 00:00:21,666 but this time only containing the reviews, the text of the reviews. 7 00:00:21,900 --> 00:00:24,733 And basically it is in this corpus that we are going to clean 8 00:00:24,733 --> 00:00:28,133 all the 1000 reviews, and we are going to clean them step by step. 9 00:00:28,333 --> 00:00:31,333 And in this tutorial we're going to do the first cleaning step. 10 00:00:31,700 --> 00:00:33,166 All right, so let's do it. 11 00:00:33,166 --> 00:00:37,866 This first cleaning step will consist of putting all the reviews in lowercase. 12 00:00:38,100 --> 00:00:39,966 And what is the purpose of doing that? 13 00:00:39,966 --> 00:00:42,966 Well, we are doing this so that in the final sparse matrix 14 00:00:42,966 --> 00:00:46,800 containing all the words of our 1000 reviews, we don't get 15 00:00:46,800 --> 00:00:49,966 the same word twice, you know, with one word starting with a capital letter 16 00:00:49,966 --> 00:00:53,000 and the same word, but not starting with a capital letter. 17 00:00:53,200 --> 00:00:56,300 And of course, we only want to have one version of this same word, 18 00:00:56,600 --> 00:00:59,400 and therefore we'll keep the one in lowercase. 19 00:00:59,400 --> 00:01:02,400 So that's why right now, in this first step of the cleaning process, 20 00:01:02,600 --> 00:01:07,133 we will put all the words of our 1000 reviews in lowercase. 21 00:01:07,333 --> 00:01:08,266 So let's do it. 
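To make the point about duplicate words concrete, here is a minimal base R sketch. The three sample reviews and the variable names (reviews, words, vocab_raw, vocab_lower) are made up for illustration and are not the tutorial's data set. It only counts distinct words, which is roughly what decides how many columns the sparse matrix would get.

```r
# Hypothetical mini-sample; not the 1000 reviews used in the tutorial.
reviews <- c("Wow love this place", "wow the food was great", "Great service")

# Split every review on spaces and pool all the words together.
words <- unlist(strsplit(reviews, " ", fixed = TRUE))

# Each distinct word would become one column of the sparse matrix.
vocab_raw   <- sort(unique(words))           # "Wow"/"wow" and "Great"/"great" are separate
vocab_lower <- sort(unique(tolower(words)))  # capitalised duplicates collapse

length(vocab_raw)    # 11
length(vocab_lower)  # 9
```

In this toy sample, lowercasing removes two columns, because "Wow"/"wow" and "Great"/"great" each collapse to a single word.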
22 00:01:08,266 --> 00:01:12,766 To do this we are going to update the corpus this way. 23 00:01:13,333 --> 00:01:15,700 Then equals, because we are updating it. 24 00:01:15,700 --> 00:01:17,566 That is, it will contain new reviews 25 00:01:17,566 --> 00:01:21,000 which are going to be the same reviews but in lowercase. 26 00:01:21,466 --> 00:01:25,366 And to put all the words of the reviews in this corpus in lowercase, we will use the 27 00:01:25,600 --> 00:01:29,966 tm_map function that will do the job for us. 28 00:01:30,266 --> 00:01:31,433 So, function. 29 00:01:31,433 --> 00:01:33,266 So we need to add some parentheses. 30 00:01:33,266 --> 00:01:35,300 And now we need to add two parameters. 31 00:01:35,300 --> 00:01:39,100 The first parameter is actually the corpus itself. 32 00:01:39,100 --> 00:01:43,733 But, you know, the old version of the corpus, that is the corpus that we have here 33 00:01:44,000 --> 00:01:46,500 which contains the original versions of the reviews, 34 00:01:46,500 --> 00:01:49,500 that is the 1000 reviews we have in our data set. 35 00:01:49,700 --> 00:01:54,433 And this corpus here will be the new updated version of the corpus, 36 00:01:54,700 --> 00:01:57,900 that is the corpus containing all the reviews in lowercase. 37 00:01:58,500 --> 00:02:01,733 So that's the first parameter, the old version of the corpus. 38 00:02:01,733 --> 00:02:07,366 And the second parameter is a function that is some kind of transform function. 39 00:02:07,800 --> 00:02:10,433 And that will simply transform each word of the corpus 40 00:02:10,433 --> 00:02:13,433 by replacing the capital letters with lowercase ones. 41 00:02:13,433 --> 00:02:16,766 So this function is content_transformer. 42 00:02:16,766 --> 00:02:18,900 Here it is. Let's press enter here. 43 00:02:18,900 --> 00:02:22,900 And actually this content_transformer function can perform 44 00:02:22,900 --> 00:02:24,100 several transformations. 
45 00:02:24,100 --> 00:02:28,700 As we can see in the yellow rectangle here, the parameter of this function is FUN. 46 00:02:29,000 --> 00:02:32,000 And so we need to add the function that we want 47 00:02:32,166 --> 00:02:35,233 to put all the words of the reviews in lowercase. 48 00:02:35,500 --> 00:02:38,500 And this function is called tolower. 49 00:02:39,100 --> 00:02:39,500 All right. 50 00:02:39,500 --> 00:02:42,600 So that's the function parameter that we need to input 51 00:02:42,600 --> 00:02:46,766 in this content_transformer, which is like a transform function 52 00:02:46,766 --> 00:02:50,366 having several transformation possibilities as input here. 53 00:02:50,366 --> 00:02:54,066 And the possibility that we choose is this tolower function 54 00:02:54,066 --> 00:02:56,633 which will put all the words in lowercase. 55 00:02:56,633 --> 00:03:01,200 And basically this tm_map function is used so that we can apply this 56 00:03:01,200 --> 00:03:05,766 content_transformer tolower function to all the words of the 1000 reviews of the corpus. 57 00:03:06,266 --> 00:03:08,600 Great. So that's actually done. 58 00:03:08,600 --> 00:03:09,400 That's actually all 59 00:03:09,400 --> 00:03:13,033 we need to put all the words of the 1000 reviews in lowercase. 60 00:03:13,400 --> 00:03:16,133 So I'm going to show you now what it does. 61 00:03:16,133 --> 00:03:19,900 So what we'll do before selecting this and executing this, 62 00:03:20,200 --> 00:03:24,166 what we'll do is have a look at, you know, one review of the corpus. 63 00:03:24,166 --> 00:03:27,533 Let's take the first review and then we'll run this line of code 64 00:03:27,733 --> 00:03:30,900 and you'll see what it does to this first review, okay? 65 00:03:30,900 --> 00:03:32,433 So let's access the first review. 66 00:03:32,433 --> 00:03:36,266 And to do this we need to use as.character. 67 00:03:36,933 --> 00:03:39,933 And then in parentheses we input the corpus. 
68 00:03:40,166 --> 00:03:41,233 But then since we want to look 69 00:03:41,233 --> 00:03:45,433 at the first review of this corpus, well, we need to add some double brackets, 70 00:03:45,433 --> 00:03:46,166 actually. 71 00:03:46,166 --> 00:03:50,200 And one, because this is the index of the first review, because indexes in 72 00:03:50,200 --> 00:03:51,400 R start at one. 73 00:03:51,400 --> 00:03:53,666 So this way we will have a look at the first review. 74 00:03:53,666 --> 00:03:56,466 And you know, since the corpus is kind of a complicated object, 75 00:03:56,466 --> 00:04:00,566 we need to use these double brackets here to access the written review. 76 00:04:00,833 --> 00:04:03,666 And besides, we need to use this as.character 77 00:04:03,666 --> 00:04:06,666 function to have this written review displayed. 78 00:04:07,033 --> 00:04:07,700 All right. 79 00:04:07,700 --> 00:04:10,200 So I'm going to press enter here. 80 00:04:10,200 --> 00:04:13,366 And as I just told you, we get the written review. 81 00:04:13,400 --> 00:04:15,200 Wow, love this place. 82 00:04:15,200 --> 00:04:16,266 Which is of course 83 00:04:16,266 --> 00:04:20,333 the first review, as we can see in our data set: wow, love this place. 84 00:04:20,933 --> 00:04:22,333 All right. So that's the first review. 85 00:04:22,333 --> 00:04:24,766 That's the original version of the first review. 86 00:04:24,766 --> 00:04:28,500 And now we are going to apply the first step of the cleaning process, 87 00:04:28,800 --> 00:04:30,933 which is to put all the reviews in lowercase. 88 00:04:30,933 --> 00:04:31,800 So let's do it. 89 00:04:31,800 --> 00:04:34,800 Let's select this line and execute. 90 00:04:35,200 --> 00:04:35,966 All right. 91 00:04:35,966 --> 00:04:38,066 So as you can see it was very fast. 92 00:04:38,066 --> 00:04:41,433 All the 1000 reviews were just transformed to lowercase. 93 00:04:41,433 --> 00:04:42,800 So let's check it out. 
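The steps described so far can be sketched in R roughly as follows. This is a sketch under assumptions, not necessarily the exact script from the video: the three sample reviews are made up, and building the corpus with VCorpus(VectorSource(...)) is an assumption, since the corpus creation happened in a previous tutorial. The calls tm_map, content_transformer, tolower and as.character are the ones named in the narration.

```r
library(tm)

# Hypothetical stand-in for the review column of the data set.
reviews <- c("Wow love this place", "Crust is not good", "Not tasty")

# Assumption: the corpus was built this way in the previous tutorial.
corpus <- VCorpus(VectorSource(reviews))

# Look at the first review before cleaning (R indexes start at 1).
before <- as.character(corpus[[1]])

# First cleaning step: overwrite the corpus with its lowercase version.
# tm_map applies the transformation to every document in the corpus;
# content_transformer wraps tolower so it acts on each document's text.
corpus <- tm_map(corpus, content_transformer(tolower))

# Look at the same review after cleaning.
after <- as.character(corpus[[1]])

print(before)
print(after)
```

Because the result is assigned back to corpus, the original mixed-case version is no longer available afterwards, which is why the first review is inspected before running the tm_map line.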
94 00:04:42,800 --> 00:04:44,633 Let's check it out for the first review. 95 00:04:44,633 --> 00:04:48,466 So we just need to, you know, press the up arrow 96 00:04:48,633 --> 00:04:51,766 to get the previous command, which is this one. 97 00:04:52,200 --> 00:04:52,966 And you know, since 98 00:04:52,966 --> 00:04:56,833 our new corpus is also called corpus, we just updated the corpus. 99 00:04:57,133 --> 00:05:00,033 Well, we can run this and hopefully we'll get 100 00:05:00,033 --> 00:05:03,033 the same review written in lowercase. 101 00:05:03,300 --> 00:05:04,233 So let's check it out. 102 00:05:04,233 --> 00:05:06,400 Let's press enter here. 103 00:05:06,400 --> 00:05:07,800 And here we go. 104 00:05:07,800 --> 00:05:11,400 As you can see, the capital W became this little w. 105 00:05:11,700 --> 00:05:14,800 And this capital L became this lowercase l. 106 00:05:15,100 --> 00:05:18,066 Perfect. So first simplification. 107 00:05:18,066 --> 00:05:21,100 Now in the final big table, the final sparse matrix, 108 00:05:21,300 --> 00:05:25,133 we won't get two versions of the same word, one with a capital letter 109 00:05:25,133 --> 00:05:26,500 and one in lowercase. 110 00:05:26,500 --> 00:05:28,466 We'll get one unique version of the word. 111 00:05:28,466 --> 00:05:31,133 And therefore we did the first simplification 112 00:05:31,133 --> 00:05:33,233 of our future sparse matrix. 113 00:05:33,233 --> 00:05:34,800 So that's the first good thing done. 114 00:05:34,800 --> 00:05:37,833 And now we will proceed to the next step of the cleaning process, 115 00:05:38,066 --> 00:05:42,933 which will be to remove all the numbers from the reviews, because indeed the numbers 116 00:05:42,933 --> 00:05:46,833 are not very relevant in telling if a review is positive or negative. 117 00:05:46,933 --> 00:05:50,266 Well, we need to be cautious actually, because maybe some reviews are, 118 00:05:50,266 --> 00:05:53,300 you know, on a scale of 1 to 10, I give it ten. 
119 00:05:53,600 --> 00:05:57,266 Well, that's definitely a number that is fully correlated to the outcome, 120 00:05:57,266 --> 00:05:59,266 whether the review is positive or negative. 121 00:05:59,266 --> 00:06:01,133 So we should pay attention to that. 122 00:06:01,133 --> 00:06:04,033 But we could have other numbers that are totally irrelevant, 123 00:06:04,033 --> 00:06:07,600 like, you know, some addresses that contain numbers or phone numbers. 124 00:06:07,600 --> 00:06:10,800 Well, that would be a little weird in a review, but we never know. 125 00:06:11,100 --> 00:06:14,700 Well, in general, when we are dealing with text, when we're working with text, 126 00:06:14,933 --> 00:06:18,133 we want to get rid of the numbers because these are most of the time 127 00:06:18,133 --> 00:06:19,233 not very relevant. 128 00:06:19,233 --> 00:06:21,366 And you know, this could add a lot more columns. 129 00:06:21,366 --> 00:06:24,100 So in general it's better to remove the numbers. 130 00:06:24,100 --> 00:06:26,100 So that's what we'll do in the next tutorial. 131 00:06:26,100 --> 00:06:27,766 And until then enjoy machine learning.
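As a rough illustration of the trade-off just described, the base R sketch below counts distinct words with and without digits. The sample reviews, the helper distinct_words and the gsub pattern are all hypothetical; this is not the tm function used in the next tutorial, only a way to see why numbers add columns and what is lost by removing them.

```r
# Hypothetical reviews containing a rating, a phone number and a duration.
reviews <- c("on a scale of 1 to 10 i give it 10",
             "call 5551234 to book a table",
             "we waited 45 minutes")

# Hypothetical helper: distinct non-empty words across all reviews.
distinct_words <- function(texts) {
  w <- unlist(strsplit(texts, " +"))
  unique(w[nzchar(w)])
}

with_numbers    <- distinct_words(reviews)
without_numbers <- distinct_words(gsub("[[:digit:]]+", "", reviews))

length(with_numbers)     # 18
length(without_numbers)  # 14
```

Four columns disappear ("1", "10", "5551234", "45"), but so does the "10" rating in the first review, which is exactly the kind of number the narration warns could be informative.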