1 00:00:00,200 --> 00:00:02,466 Hello and welcome to this R tutorial. 2 00:00:02,466 --> 00:00:04,933 So in the previous tutorial, we imported the data set 3 00:00:04,933 --> 00:00:08,733 which contains 1000 reviews of a restaurant. 4 00:00:09,066 --> 00:00:10,300 And for each of the reviews, 5 00:00:10,300 --> 00:00:13,500 we have the information whether the review is positive or negative. 6 00:00:13,800 --> 00:00:17,666 So one when the review is positive and zero when the review is negative 7 00:00:18,066 --> 00:00:20,933 and we are trying to build a machine learning model 8 00:00:20,933 --> 00:00:23,933 that will be able to classify each new review 9 00:00:24,000 --> 00:00:27,233 and tell if this new review is positive or negative. 10 00:00:27,700 --> 00:00:31,166 So basically what we are about to do is something we already did. 11 00:00:31,166 --> 00:00:33,600 That is classification in part three. 12 00:00:33,600 --> 00:00:37,533 But this time we are working on text and therefore we need to figure out a way 13 00:00:37,533 --> 00:00:41,266 to create a model where we can have some independent variables 14 00:00:41,500 --> 00:00:44,000 to train a machine learning classification model, 15 00:00:44,000 --> 00:00:47,000 to learn some correlations between the independent variables 16 00:00:47,066 --> 00:00:50,400 and the dependent variable, which is of course the liked column. 17 00:00:50,966 --> 00:00:55,133 So now the goal is simply to create some independent variables. 18 00:00:55,433 --> 00:00:57,633 And what could those independent variables be? 19 00:00:57,633 --> 00:01:00,666 Well, the idea in natural language processing, 20 00:01:00,666 --> 00:01:05,800 and that's the bag of words model, is to create a model which will basically 21 00:01:05,800 --> 00:01:10,566 be a huge table, a table where the rows are nothing else than the reviews.
22 00:01:10,600 --> 00:01:14,200 So this table will have 1000 rows because we have 1000 reviews, 23 00:01:14,333 --> 00:01:18,700 one row for each review and the columns will simply be 24 00:01:19,100 --> 00:01:22,033 all the words that we can find in the reviews. 25 00:01:22,033 --> 00:01:25,000 You know, we take the 1000 reviews, we look at all the words 26 00:01:25,000 --> 00:01:29,466 in those 1000 reviews, and we are going to create one column for each word. 27 00:01:30,033 --> 00:01:33,733 And then each cell in the table will correspond to one row, that is one 28 00:01:33,733 --> 00:01:38,933 review, and one column, that is one word among all the words in these 1000 reviews. 29 00:01:39,233 --> 00:01:39,966 And in this cell, 30 00:01:39,966 --> 00:01:43,533 there is going to be the number of times the word appears in the review. 31 00:01:43,833 --> 00:01:45,633 So for example, here is the first word. 32 00:01:45,633 --> 00:01:49,866 Wow. Well, there's going to be a column for this wow word here. 33 00:01:50,166 --> 00:01:53,166 And so for the first line that corresponds to the first review, 34 00:01:53,366 --> 00:01:56,400 well, this wow column will get a one because the word 35 00:01:56,400 --> 00:01:59,400 wow appears once in the first review. 36 00:01:59,500 --> 00:02:02,400 But then for all the other reviews, that is all the other rows, 37 00:02:02,400 --> 00:02:03,766 we don't see any wow here. 38 00:02:03,766 --> 00:02:07,866 So all the cells in this first column for the other rows will get a zero. 39 00:02:08,333 --> 00:02:11,400 And that's what will be our model in the end. 40 00:02:11,766 --> 00:02:16,066 It's going to be this huge table with 1000 rows that are going to be the reviews. 41 00:02:16,300 --> 00:02:19,300 And a lot of columns that are going to be all the words 42 00:02:19,300 --> 00:02:22,300 that we can find in these 1000 reviews here.
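To make the table described above concrete, here is a minimal sketch in base R. The three reviews are made up for illustration and are not the course data set, and the names `reviews`, `tokens`, `vocabulary` and `bag_of_words` are hypothetical. The tutorial itself builds this table later with dedicated tools, so treat this only as a picture of the idea.

```r
# Toy reviews (made up for illustration; not the 1000-review data set)
reviews <- c("wow loved this place",
             "crust is not good",
             "loved the food loved the service")

# Split each review into lowercase words on whitespace
tokens <- strsplit(tolower(reviews), "\\s+")

# One column per distinct word found across all the reviews
vocabulary <- sort(unique(unlist(tokens)))

# One row per review; each cell counts how often the word appears in that review
bag_of_words <- t(sapply(tokens, function(words) {
  table(factor(words, levels = vocabulary))
}))

print(bag_of_words)
```

In this toy table the "wow" column holds a one for the first review and zeros for the others, which is the pattern described for the real data set.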
43 00:02:22,600 --> 00:02:26,433 So now you might start to figure out why we would need to clean the reviews. 44 00:02:27,000 --> 00:02:27,733 It's because 45 00:02:27,733 --> 00:02:32,066 since we are going to take all the words in these reviews here, well, we are going 46 00:02:32,066 --> 00:02:35,766 to get a lot of columns because we create one column for each word. 47 00:02:36,233 --> 00:02:40,033 But we don't want to get too many columns, because the more columns we get, 48 00:02:40,233 --> 00:02:40,933 the harder 49 00:02:40,933 --> 00:02:45,133 it will be for our machine learning model to run properly, to execute efficiently. 50 00:02:45,366 --> 00:02:48,866 Not only will our machine learning model have more trouble executing properly, 51 00:02:49,166 --> 00:02:51,900 but it will also have more trouble understanding the correlations 52 00:02:51,900 --> 00:02:55,333 between the presence of the words in the reviews and the information 53 00:02:55,333 --> 00:02:57,733 whether the review is positive or negative. 54 00:02:57,733 --> 00:03:01,500 Because of course, if we keep all the words in these reviews, well, 55 00:03:01,500 --> 00:03:05,566 we will get some irrelevant words, some words that will not help the machine 56 00:03:05,566 --> 00:03:09,400 learning algorithm to predict if a review is positive or negative. 57 00:03:09,733 --> 00:03:12,700 Because you know, among the words that we can find in the reviews here, 58 00:03:12,700 --> 00:03:13,566 well, some words 59 00:03:13,566 --> 00:03:17,366 give a much better hint in telling if the review is positive or negative. 60 00:03:17,600 --> 00:03:19,266 Let me give you a simple example. 61 00:03:19,266 --> 00:03:23,100 We have this loved word here in this review. 62 00:03:23,100 --> 00:03:26,833 This is the word that basically tells us that the review is positive 63 00:03:27,000 --> 00:03:28,833 because there is this loved word.
64 00:03:28,833 --> 00:03:32,866 But then if we look at the word this or even place, 65 00:03:33,266 --> 00:03:36,433 well, these two words don't give the machine learning algorithm a hint 66 00:03:36,600 --> 00:03:38,866 whether the review is positive or negative. 67 00:03:38,866 --> 00:03:41,300 It's of course this word loved 68 00:03:41,300 --> 00:03:45,033 by which the machine learning algorithm will understand some correlations 69 00:03:45,033 --> 00:03:48,800 between this loved word and the fact that the review is positive. 70 00:03:49,366 --> 00:03:54,300 So that's the whole reason why right now we are going to clean the text. 71 00:03:54,600 --> 00:03:58,733 It's not only to reduce the big table we're going to get in the end, 72 00:03:59,100 --> 00:04:02,600 because we want our algorithm to run properly and not be saturated. 73 00:04:03,333 --> 00:04:07,433 The other reason is that we want to get the most relevant words 74 00:04:07,600 --> 00:04:11,200 to find the best correlations between the presence of the words and the outcome, 75 00:04:11,200 --> 00:04:13,700 whether the review is positive or negative. 76 00:04:13,700 --> 00:04:14,100 All right. 77 00:04:14,100 --> 00:04:15,633 So now we get the point. 78 00:04:15,633 --> 00:04:17,766 So let's start cleaning the reviews. 79 00:04:17,766 --> 00:04:20,800 And an important thing to understand is that what we'll do here 80 00:04:20,800 --> 00:04:21,966 to clean the reviews 81 00:04:21,966 --> 00:04:25,600 is the same technique used to clean any other kind of text. 82 00:04:26,100 --> 00:04:27,933 I will give you the main tools. 83 00:04:27,933 --> 00:04:31,566 You will be able to use these tools to clean any text you're working with.
84 00:04:31,833 --> 00:04:32,366 And of course, 85 00:04:32,366 --> 00:04:36,533 if your text is a little more complicated, like for example, an HTML page 86 00:04:36,533 --> 00:04:40,600 that contains HTML tags, well, you would need to add a few more tools, 87 00:04:40,800 --> 00:04:43,500 but you would still use the tools that we are about to use. 88 00:04:43,500 --> 00:04:44,566 And the good news is that 89 00:04:44,566 --> 00:04:48,466 if you want to use more tools to clean more sophisticated text, well, 90 00:04:48,466 --> 00:04:50,700 you just need to ask me some questions in the Q&A. 91 00:04:50,700 --> 00:04:54,100 And I'll help you add these tools to your problem and your text. 92 00:04:54,733 --> 00:04:58,433 But what we'll do here, you will definitely do it to perform 93 00:04:58,433 --> 00:05:01,600 natural language processing on your text files for your problems. 94 00:05:02,233 --> 00:05:02,566 All right. 95 00:05:02,566 --> 00:05:05,000 So let's do it. Let's clean the text. 96 00:05:05,000 --> 00:05:07,700 So that's the next step in natural language processing. 97 00:05:07,700 --> 00:05:09,300 We will clean all the text. 98 00:05:09,300 --> 00:05:13,600 And then we will create our bag of words model which is this huge table 99 00:05:13,600 --> 00:05:17,633 which is actually called a sparse matrix because we'll get a lot of zeros 100 00:05:17,633 --> 00:05:18,933 in the sparse matrix. 101 00:05:18,933 --> 00:05:22,733 And then that means that we'll get a model where we have some independent variables 102 00:05:22,733 --> 00:05:24,333 and one dependent variable. 103 00:05:24,333 --> 00:05:27,133 And that's when we'll be able to use our machine 104 00:05:27,133 --> 00:05:31,066 learning classification models that we built in part three to predict 105 00:05:31,066 --> 00:05:34,833 the class of a new review that the model will not have seen yet. 106 00:05:35,333 --> 00:05:36,666 So let's do it.
107 00:05:36,666 --> 00:05:39,866 Let's start with the first step of cleaning the text. 108 00:05:40,366 --> 00:05:45,100 And this first step is going to be about initializing a corpus, because, 109 00:05:45,433 --> 00:05:48,900 you know, we will not clean the reviews directly in the data set. 110 00:05:49,166 --> 00:05:53,366 We will instead create a corpus which will contain all the reviews, 111 00:05:53,600 --> 00:05:56,666 and it will be in this corpus that we will clean all the reviews. 112 00:05:57,033 --> 00:05:59,400 So let's start by creating this corpus. 113 00:05:59,400 --> 00:06:02,766 And in order to create this corpus we need to import a package. 114 00:06:03,100 --> 00:06:04,366 And you might need to install it 115 00:06:04,366 --> 00:06:07,133 if it's the first time you're doing natural language processing. 116 00:06:07,133 --> 00:06:10,900 So I'm going to type here the command to install this package. 117 00:06:11,133 --> 00:06:13,033 This package is called the tm package. 118 00:06:13,033 --> 00:06:16,700 It's a very famous package in R for natural language processing. 119 00:06:16,800 --> 00:06:21,566 So let's install this package by typing install.packages. 120 00:06:21,566 --> 00:06:22,500 Here it is. 121 00:06:22,500 --> 00:06:28,800 And so the name of the package, which is tm, has to be input in quotes. 122 00:06:29,166 --> 00:06:30,066 All right. 123 00:06:30,066 --> 00:06:35,133 So if I go to my packages I will find 124 00:06:36,700 --> 00:06:37,500 the tm package. 125 00:06:37,500 --> 00:06:38,166 Here it is. 126 00:06:38,166 --> 00:06:41,166 So I already have it installed so I don't need to install it again. 127 00:06:41,400 --> 00:06:44,400 So I will put that in a comment. 128 00:06:44,566 --> 00:06:45,433 Here we go. 129 00:06:45,433 --> 00:06:49,900 And of course if you don't have the tm package here, you will need to install it 130 00:06:49,900 --> 00:06:53,266 by executing this line and everything will run properly.
131 00:06:53,700 --> 00:06:54,666 All right. 132 00:06:54,666 --> 00:06:58,900 So of course after we install the package we need to import the package. 133 00:06:59,200 --> 00:07:02,400 Well mine is already imported but we need to automate all this. 134 00:07:02,400 --> 00:07:05,533 So as usual we are going to use the library command. 135 00:07:06,133 --> 00:07:09,500 And in parentheses we input the name of the package. 136 00:07:09,500 --> 00:07:12,500 So tm again, not in quotes. All right. 137 00:07:12,833 --> 00:07:16,000 And now we are ready to build the corpus. 138 00:07:16,666 --> 00:07:18,600 And so what are we going to call our corpus? 139 00:07:18,600 --> 00:07:20,800 We're going to call it corpus. 140 00:07:20,800 --> 00:07:22,966 All right. corpus equals. 141 00:07:22,966 --> 00:07:25,900 And now we're going to use the VCorpus function spelled 142 00:07:25,900 --> 00:07:28,900 this way, VCorpus with capital V and capital C. 143 00:07:28,933 --> 00:07:33,266 And then in parentheses we need to input VectorSource. 144 00:07:33,800 --> 00:07:36,800 This way. And again new parentheses. 145 00:07:36,900 --> 00:07:39,866 And it's in these parentheses that we input 146 00:07:39,866 --> 00:07:42,900 the column that contains the text that we want to clean 147 00:07:43,166 --> 00:07:44,500 in this corpus. 148 00:07:44,500 --> 00:07:48,100 So this column is of course the review column of our data set. 149 00:07:48,500 --> 00:07:51,833 So here we will input this column which can be taken the following way 150 00:07:52,133 --> 00:07:55,200 by typing dataset dollar sign. 151 00:07:55,200 --> 00:07:59,100 And then as you can see we have the two variables here that we can pick. 152 00:07:59,433 --> 00:08:01,933 And the one we want to pick is review. 153 00:08:01,933 --> 00:08:02,433 All right. 154 00:08:02,433 --> 00:08:06,200 And that takes the whole review column of our data set.
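Written out, the commands dictated above might look like the following sketch. The two-row `dataset` is a made-up stand-in so the snippet runs on its own, and the column names `Review` and `Liked` are assumptions about how the columns are spelled in the course file. `install.packages`, `library`, `VCorpus` and `VectorSource` are the functions named in the tutorial.

```r
# install.packages("tm")  # uncomment and run once if tm is not installed yet
library(tm)

# Stand-in for the imported data set (hypothetical rows; the real one has 1000 reviews,
# and the column names Review and Liked are assumed here)
dataset <- data.frame(
  Review = c("Wow... Loved this place.", "Crust is not good."),
  Liked = c(1, 0),
  stringsAsFactors = FALSE
)

# Build the corpus from the review column; cleaning is done on the corpus,
# not directly in the data set
corpus <- VCorpus(VectorSource(dataset$Review))

# Look at the text of the first document in the corpus
as.character(corpus[[1]])
```

With the real data set, only the `dataset` definition changes; the `corpus` line stays the same.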
155 00:08:06,200 --> 00:08:09,633 And that's exactly what we want because this column contains the text. 156 00:08:09,833 --> 00:08:12,800 And we want to clean the text in the corpus. 157 00:08:12,800 --> 00:08:13,200 All right. 158 00:08:13,200 --> 00:08:15,666 So that's the first step of cleaning the text. 159 00:08:15,666 --> 00:08:17,266 So let's select this and press 160 00:08:17,266 --> 00:08:20,900 Command or Control plus Enter to execute and create the corpus. 161 00:08:21,133 --> 00:08:22,233 Here we go. 162 00:08:22,233 --> 00:08:25,666 As you can see it already says that it's a large corpus. 163 00:08:25,866 --> 00:08:30,600 And we can even see here that our corpus takes 3.7 MB of space. 164 00:08:31,100 --> 00:08:35,000 We will simplify this corpus by cleaning all the reviews step by step. 165 00:08:35,000 --> 00:08:37,566 And that's what we'll do in the next tutorials. 166 00:08:37,566 --> 00:08:39,166 Until then, enjoy machine learning.