1 00:00:00,166 --> 00:00:02,400 Hello and welcome to this R tutorial. 2 00:00:02,400 --> 00:00:03,933 So we did the main steps. 3 00:00:03,933 --> 00:00:06,233 We cleaned all the texts, all the reviews. 4 00:00:06,233 --> 00:00:10,200 We created our Bag of Words model and now we have to do one more thing, 5 00:00:10,200 --> 00:00:13,466 which is of course to build our machine learning classification model. 6 00:00:13,733 --> 00:00:16,766 And we can do that because we have all our independent variables 7 00:00:16,766 --> 00:00:21,900 in this sparse matrix DTM here, built thanks to the DocumentTermMatrix function. 8 00:00:22,300 --> 00:00:26,633 And besides, we applied a filter here to remove the infrequent words. 9 00:00:26,666 --> 00:00:27,900 Well, a few of them. 10 00:00:27,900 --> 00:00:32,000 But still, this considerably reduced the number of words in the matrix. 11 00:00:32,033 --> 00:00:35,033 So that's always good for our model to run faster. 12 00:00:35,400 --> 00:00:37,233 All right. So now let's build the model. 13 00:00:37,233 --> 00:00:43,066 So what we'll do is go to our files to get back to Part 3, Classification. 14 00:00:43,500 --> 00:00:47,100 Because what we'll do of course is take a classification model 15 00:00:47,100 --> 00:00:48,400 that we already built. 16 00:00:48,400 --> 00:00:50,633 And we will apply it on our text here. 17 00:00:50,633 --> 00:00:54,333 Because out of this text, we managed to create a matrix of features 18 00:00:54,333 --> 00:00:56,133 containing the independent variables. 19 00:00:56,133 --> 00:00:57,100 And of course we have 20 00:00:57,100 --> 00:01:00,700 one dependent variable, which is the second column of our data 21 00:01:00,700 --> 00:01:03,766 set, the liked column, which tells whether, yes or no, 22 00:01:03,900 --> 00:01:05,100 the review is positive. 23 00:01:06,300 --> 00:01:06,900 So we have 24 00:01:06,900 --> 00:01:11,033 everything and therefore we only need to take our model now. 
25 00:01:11,033 --> 00:01:13,766 And therefore we go to Part 3, Classification. 26 00:01:13,766 --> 00:01:16,966 And here we can find all our classification models. 27 00:01:17,233 --> 00:01:19,100 All right, so which one to pick? 28 00:01:19,100 --> 00:01:21,933 Which one to choose for natural language processing? 29 00:01:21,933 --> 00:01:27,333 Well, in general, based on experience, the most common classification models used 30 00:01:27,333 --> 00:01:30,400 for natural language processing are Naive 31 00:01:30,400 --> 00:01:33,400 Bayes, decision tree, or random forest. 32 00:01:33,433 --> 00:01:37,400 You also have the CART model, which is another type of decision tree model, 33 00:01:37,666 --> 00:01:41,700 and you also have the maximum entropy model, which is based on entropy as well, 34 00:01:41,700 --> 00:01:43,800 like decision trees. 35 00:01:43,800 --> 00:01:46,933 So these models work very well for natural language processing. 36 00:01:46,933 --> 00:01:50,933 And therefore we will pick one that is related to entropy. 37 00:01:51,233 --> 00:01:54,133 And that's the case for our decision tree classification model 38 00:01:54,133 --> 00:01:56,833 as well as our random forest classification model. 39 00:01:56,833 --> 00:01:57,666 Because of course 40 00:01:57,666 --> 00:02:01,933 a random forest is a combination of trees making the same predictions together. 41 00:02:02,400 --> 00:02:05,033 And keep in mind that you can also use Naive Bayes, 42 00:02:05,033 --> 00:02:08,033 which is commonly used as well for natural language processing. 43 00:02:08,666 --> 00:02:12,433 But here in this tutorial we will choose random forest classification. 44 00:02:12,866 --> 00:02:16,433 So let's go into this section and here are all the files. 45 00:02:16,433 --> 00:02:21,300 You know, the data set, the classification templates, and our model in Python and R. 46 00:02:21,600 --> 00:02:23,100 So let's take the one in R. 
47 00:02:23,100 --> 00:02:25,866 So I'm just clicking on the file. Here we go. 48 00:02:25,866 --> 00:02:27,766 The model opens here. 49 00:02:27,766 --> 00:02:29,133 So what do we need here? 50 00:02:29,133 --> 00:02:33,100 Well, first of all, let's notice that, you know, when we are using the random 51 00:02:33,100 --> 00:02:36,366 forest classification model we are starting with a data set 52 00:02:36,366 --> 00:02:38,100 which is a data frame 53 00:02:38,100 --> 00:02:42,466 and that contains both the independent variables and the dependent variable. 54 00:02:42,800 --> 00:02:46,466 So what we have to do right now is go back to our natural language 55 00:02:46,466 --> 00:02:50,033 processing file and create exactly the same. 56 00:02:50,033 --> 00:02:51,366 That is, create a data 57 00:02:51,366 --> 00:02:54,900 set containing independent variables and one dependent variable. 58 00:02:55,300 --> 00:02:58,333 And that will be the input of this model here. 59 00:02:58,333 --> 00:03:00,400 Because, you know, we have our data set here. 60 00:03:00,400 --> 00:03:03,466 And then we use our data set in each of the code sections here. 61 00:03:03,700 --> 00:03:07,033 And then, you know, we split the data set into a training set and a test set. 62 00:03:07,366 --> 00:03:11,400 And we train our machine learning classification model on the training 63 00:03:11,400 --> 00:03:12,400 set here. 64 00:03:12,400 --> 00:03:15,300 So the only thing we have to do is create this data 65 00:03:15,300 --> 00:03:18,533 set containing the independent variables and the dependent variable. 66 00:03:18,700 --> 00:03:19,866 So that's very simple. 67 00:03:19,866 --> 00:03:22,833 We already have our independent variables. 68 00:03:22,833 --> 00:03:27,000 But the problem is that our independent variables right now are in a matrix. 69 00:03:27,266 --> 00:03:31,633 Because, you know, this DocumentTermMatrix function returns a matrix. 70 00:03:31,633 --> 00:03:33,300 So DTM is a matrix right now. 
71 00:03:33,300 --> 00:03:37,100 And as you remember, in our classification 72 00:03:37,100 --> 00:03:40,133 models in R, well, this data set is a data frame. 73 00:03:40,500 --> 00:03:41,833 It's not a matrix. 74 00:03:41,833 --> 00:03:45,466 So we need to make sure that here, for the input of the model 75 00:03:45,466 --> 00:03:49,466 that we are going to apply on this Bag of Words model that we just created 76 00:03:49,466 --> 00:03:50,866 in the previous tutorials, 77 00:03:50,866 --> 00:03:53,633 well, we need to make sure that we have a data frame, 78 00:03:53,633 --> 00:03:55,133 but that's actually very simple. 79 00:03:55,133 --> 00:03:58,633 We just need to take our matrix and use the function 80 00:03:58,633 --> 00:04:01,633 as.data.frame. 81 00:04:01,700 --> 00:04:04,566 And we will input our sparse matrix DTM inside. 82 00:04:04,566 --> 00:04:09,500 And that will transform our DTM sparse matrix into a data frame. 83 00:04:09,833 --> 00:04:10,966 So let's do this. 84 00:04:10,966 --> 00:04:14,500 And since, you know, we are just going to copy and paste 85 00:04:14,500 --> 00:04:15,966 our random forest classification 86 00:04:15,966 --> 00:04:19,900 here, well, since the input of this model is basically this data set, 87 00:04:20,300 --> 00:04:24,133 well, here we will use the same name to create this data frame. 88 00:04:24,133 --> 00:04:26,966 And so we will call it dataset. 89 00:04:26,966 --> 00:04:28,433 dataset equals. 90 00:04:28,433 --> 00:04:33,700 And then that's when we use as.data.frame. 91 00:04:33,700 --> 00:04:35,633 Here it is. That's the first one. 92 00:04:35,633 --> 00:04:36,566 Here we go. 93 00:04:36,566 --> 00:04:38,566 And so now we need to input the matrix 94 00:04:38,566 --> 00:04:41,000 we want to transform into a data frame. 95 00:04:41,000 --> 00:04:43,600 And that's of course DTM. 
96 00:04:43,600 --> 00:04:46,400 And just to make sure we have the matrix type 97 00:04:46,400 --> 00:04:49,333 expected by this as.data.frame function, 98 00:04:49,333 --> 00:04:54,600 well, we need to use here the function as.matrix 99 00:04:54,600 --> 00:04:58,100 and put DTM as input of this as.matrix function. 100 00:04:58,533 --> 00:05:03,300 Because, you know, this sparse matrix DTM here is definitely a matrix, 101 00:05:03,433 --> 00:05:07,300 but it doesn't have the type expected by the as.data.frame function. 102 00:05:07,633 --> 00:05:09,966 And to make sure we have the right matrix type, 103 00:05:09,966 --> 00:05:12,900 well, we need to use this as.matrix function. 104 00:05:12,900 --> 00:05:13,366 All right. 105 00:05:13,366 --> 00:05:14,366 And now let's be careful. 106 00:05:14,366 --> 00:05:16,500 We are missing one parenthesis. 107 00:05:16,500 --> 00:05:19,366 So I'm just adding it. All right. Now we're good. 108 00:05:19,366 --> 00:05:24,566 We are ready to transform our sparse matrix of features into a data frame. 109 00:05:24,933 --> 00:05:28,366 So let's do it. I'm going to select this line and execute. 110 00:05:28,700 --> 00:05:29,600 All right. 111 00:05:29,600 --> 00:05:33,000 And now what's interesting to see is that we have the real data set. 112 00:05:33,300 --> 00:05:37,333 You know, with all the reviews in the rows and all the words that we took 113 00:05:37,333 --> 00:05:38,100 from the corpus 114 00:05:38,100 --> 00:05:42,066 and that were then filtered thanks to this removeSparseTerms function. 115 00:05:42,400 --> 00:05:46,166 Well, we can see the full data set here with these 1000 rows 116 00:05:46,300 --> 00:05:49,400 and all these 691 columns, 117 00:05:49,766 --> 00:05:53,700 each one corresponding to a word that comes from the reviews in the corpus 118 00:05:53,700 --> 00:05:57,233 and that was not filtered out by the removeSparseTerms function. 119 00:05:57,700 --> 00:05:58,433 All right. 
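As a minimal sketch of the conversion just described, the snippet below builds a tiny document-term matrix and turns it into a data frame. The three review strings are invented for illustration, and the tm package is assumed to be installed; only DocumentTermMatrix, as.matrix, and as.data.frame come from the tutorial itself.

```r
library(tm)

# Three made-up reviews standing in for the 1000 real ones (assumption).
reviews = c("good food", "bad service", "good service")

# Build a corpus and its document-term matrix (stored in a sparse format).
corpus = VCorpus(VectorSource(reviews))
dtm = DocumentTermMatrix(corpus)

# as.matrix() expands the sparse representation into an ordinary dense matrix,
# which as.data.frame() can then convert: one row per review, one column per word.
dataset = as.data.frame(as.matrix(dtm))

dim(dataset)
```

With the real data, the same two nested calls would give the 1000 by 691 data frame shown in the video.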
120 00:05:58,433 --> 00:06:01,333 So here you can have a look at this huge table. 121 00:06:01,333 --> 00:06:02,933 And we can clearly see here 122 00:06:02,933 --> 00:06:07,033 that this is a sparse matrix because basically we can only see zeros. 123 00:06:07,200 --> 00:06:08,766 Well, we have very few ones. 124 00:06:08,766 --> 00:06:10,333 We have one here, one here. 125 00:06:10,333 --> 00:06:12,200 But all the rest is zeros. 126 00:06:12,200 --> 00:06:14,666 And so for example, if I take this one here, 127 00:06:14,666 --> 00:06:20,400 well, this one belongs to the "also" column and to the 23rd row. 128 00:06:20,400 --> 00:06:22,333 That is the 23rd review. 129 00:06:22,333 --> 00:06:25,166 And so this one here means that the word 130 00:06:25,166 --> 00:06:28,166 "also" appears in review 23. 131 00:06:28,533 --> 00:06:30,266 All right. So that's the sparse matrix. 132 00:06:30,266 --> 00:06:33,066 And now you can really see what it is with your own eyes. 133 00:06:33,066 --> 00:06:33,466 All right. 134 00:06:33,466 --> 00:06:36,466 So let's go back to our natural language processing file. 135 00:06:36,600 --> 00:06:40,366 So we have our data set, which is now a data 136 00:06:40,366 --> 00:06:43,600 frame as we wanted, but still incomplete. 137 00:06:43,900 --> 00:06:44,466 You know why? 138 00:06:44,466 --> 00:06:49,500 It's because the data set we start with in this random forest classification model, 139 00:06:49,633 --> 00:06:53,066 and in general in our classification models, is a data frame. 140 00:06:53,066 --> 00:06:54,300 So we're good on that. 141 00:06:54,300 --> 00:06:57,866 But it is a data frame containing both the independent variables 142 00:06:58,066 --> 00:06:59,933 and the dependent variable. 143 00:06:59,933 --> 00:07:01,366 So what we need to do right now 144 00:07:01,366 --> 00:07:06,066 is add the dependent variable to this data frame, dataset. 145 00:07:06,266 --> 00:07:09,266 Because right now it only contains the independent variables. 
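To make the idea of sparsity concrete, here is a small base R sketch that measures the share of zeros in a word table and looks up one cell, the way the "also" cell of review 23 is read in the video. The table values are made up; only the column name also is borrowed from the tutorial.

```r
# Toy word table: 4 reviews (rows) by 3 words (columns), values invented.
toy = data.frame(also  = c(0, 1, 0, 0),
                 good  = c(1, 0, 0, 1),
                 place = c(0, 0, 0, 1))

# Share of cells equal to zero: the closer to 1, the sparser the table.
sparsity = mean(as.matrix(toy) == 0)

# A 1 in row 2 of the "also" column means the word "also" appears in review 2.
toy[2, "also"]

# Rows (reviews) in which the word "also" appears at least once.
reviews_with_also = which(toy$also > 0)
```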
146 00:07:09,600 --> 00:07:09,966 All right. 147 00:07:09,966 --> 00:07:14,100 So you might remember how to add the dependent variable column to a data set 148 00:07:14,100 --> 00:07:15,366 that is a data frame. 149 00:07:15,366 --> 00:07:17,800 Remember, we need to take our data set, 150 00:07:17,800 --> 00:07:19,566 then add a dollar sign here. 151 00:07:19,566 --> 00:07:22,500 And then after this dollar sign we can either take one of 152 00:07:22,500 --> 00:07:26,200 the existing columns here, if we want to update the column, 153 00:07:26,533 --> 00:07:30,566 or create a new column to add to this data set. 154 00:07:30,900 --> 00:07:32,500 And that's exactly what we want to do. 155 00:07:32,500 --> 00:07:34,966 We want to add a new column to this data set. 156 00:07:34,966 --> 00:07:36,266 Well, it's an existing column. 157 00:07:36,266 --> 00:07:38,033 It's the liked column. 158 00:07:38,033 --> 00:07:41,066 But for this data set it is a new column, so we create it. 159 00:07:41,400 --> 00:07:44,833 And so we'll give this column the same name as the real dependent 160 00:07:44,833 --> 00:07:47,100 variable column. That is liked. 161 00:07:48,200 --> 00:07:48,566 All right. 162 00:07:48,566 --> 00:07:52,200 So by doing this we are adding this new column that we call liked 163 00:07:52,833 --> 00:07:54,400 and then equals. 164 00:07:54,400 --> 00:07:55,600 And then after this equals sign 165 00:07:55,600 --> 00:07:59,133 we need to specify what we want to add in this new column. 166 00:07:59,400 --> 00:08:02,700 And what we want to add is nothing else than the existing 167 00:08:02,900 --> 00:08:05,733 liked column of our data set. 168 00:08:05,733 --> 00:08:09,866 But be careful, because our data set was just updated to this new data frame, 169 00:08:10,133 --> 00:08:14,100 and therefore we no longer have the data set that we imported originally. 170 00:08:14,366 --> 00:08:15,900 So what we'll do is very simple. 
171 00:08:15,900 --> 00:08:20,566 We'll just rename the imported data set by adding an underscore and then original. 172 00:08:21,300 --> 00:08:22,000 Here we go. 173 00:08:22,000 --> 00:08:25,433 And we will select this line again and execute. 174 00:08:25,800 --> 00:08:28,266 All right. So now we have our original data set. 175 00:08:28,266 --> 00:08:31,500 And therefore we can have access to the liked column of this 176 00:08:31,500 --> 00:08:34,866 original data set, which is going to be our dependent variable. 177 00:08:35,333 --> 00:08:38,700 So let's add this dependent variable right now to our data set. 178 00:08:39,133 --> 00:08:44,866 And so to take this dependent variable we need to take our dataset_original. 179 00:08:44,866 --> 00:08:45,300 Here it is, 180 00:08:45,300 --> 00:08:49,433 because that's the original data set containing the dependent variable liked. 181 00:08:49,766 --> 00:08:52,766 And so to take this dependent variable vector 182 00:08:52,800 --> 00:08:55,766 we need to add a dollar sign here, same as before, 183 00:08:55,766 --> 00:08:59,033 and then take the column we want, which is the liked column. 184 00:08:59,533 --> 00:09:00,900 All right. So that's good. 185 00:09:00,900 --> 00:09:04,066 By selecting this line and executing it 186 00:09:04,466 --> 00:09:07,833 we add the liked dependent variable vector column 187 00:09:07,966 --> 00:09:11,866 to our data set already containing the independent variables, 188 00:09:12,100 --> 00:09:15,500 which are all the filtered words of our cleaned reviews in the corpus. 189 00:09:16,200 --> 00:09:18,066 All right. So now we have everything we need. 190 00:09:18,066 --> 00:09:21,566 And we are ready to take our machine learning classification model 191 00:09:21,800 --> 00:09:25,500 because we have our data set that not only is a data frame, 192 00:09:25,500 --> 00:09:28,866 but also contains both the independent variables and the dependent variable. 193 00:09:29,100 --> 00:09:30,066 So we have everything. 
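The two steps described here, keeping the imported data under a new name and copying its label column into the word table, can be sketched in base R as follows. The review texts and word columns are invented, and the column names Review and Liked (with that capitalization) are assumptions about the imported file.

```r
# Imported data kept under its own name so the labels stay reachable.
# Values are made up; the column names are assumed.
dataset_original = data.frame(
  Review = c("Loved this place", "Crust is not good", "Also great service"),
  Liked = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Word table standing in for as.data.frame(as.matrix(dtm)).
dataset = data.frame(love = c(1, 0, 0), crust = c(0, 1, 0), also = c(0, 0, 1))

# dataset$Liked does not exist yet, so this assignment creates a new last column.
dataset$Liked = dataset_original$Liked

# Position of the dependent variable: last column, here 4 (692 in the tutorial).
liked_index = which(names(dataset) == "Liked")
```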
194 00:09:30,066 --> 00:09:33,200 It is exactly what a random forest classification model is expecting here. 195 00:09:33,466 --> 00:09:38,200 So the only thing we need to do here is take everything from here, and not from here. 196 00:09:38,200 --> 00:09:40,733 You know, because this section is to import the data set. 197 00:09:40,733 --> 00:09:44,700 But we already have our data set that is ready for the classification model. 198 00:09:44,900 --> 00:09:48,433 So we just need to take everything from here because this is where 199 00:09:48,600 --> 00:09:50,900 the data set starts to be processed. 200 00:09:50,900 --> 00:09:53,066 And so we take everything from here to 201 00:09:54,233 --> 00:09:55,300 here. 202 00:09:55,300 --> 00:09:59,233 And we do not take this because this is to plot the results in 2D, 203 00:09:59,233 --> 00:10:00,766 that is, with two independent variables. 204 00:10:00,766 --> 00:10:04,166 And here, since of course we have a lot more than two independent variables, 205 00:10:04,266 --> 00:10:06,633 well, we cannot use this to plot the results, 206 00:10:06,633 --> 00:10:09,466 but we will definitely have a look at the confusion matrix 207 00:10:09,466 --> 00:10:11,300 to see the number of correct predictions, 208 00:10:11,300 --> 00:10:13,766 as well as the number of incorrect predictions, 209 00:10:13,766 --> 00:10:16,466 so that we can evaluate the model performance. 210 00:10:16,466 --> 00:10:16,800 All right. 211 00:10:16,800 --> 00:10:19,800 So let's get back to our natural language processing file. 212 00:10:20,166 --> 00:10:24,600 And we will paste our random forest classification model right here. 213 00:10:25,166 --> 00:10:26,033 All right. 214 00:10:26,033 --> 00:10:27,866 So now we just need to modify 215 00:10:27,866 --> 00:10:30,866 very few things because everything is basically ready. 216 00:10:30,900 --> 00:10:33,333 But let's see what we can modify. 
217 00:10:33,333 --> 00:10:37,033 Well, first, here in the section that encodes the target feature as factor, 218 00:10:37,266 --> 00:10:41,066 well, of course we need to replace this purchased dependent variable, 219 00:10:41,200 --> 00:10:43,433 which was the dependent variable in Part 3. 220 00:10:43,433 --> 00:10:47,533 Well, we need to replace it with our new dependent variable, which is liked. 221 00:10:48,266 --> 00:10:49,200 All right. 222 00:10:49,200 --> 00:10:53,000 And same here, we replace purchased by liked. 223 00:10:53,966 --> 00:10:56,000 All right, good for this section. 224 00:10:56,000 --> 00:10:59,533 Then in the next section we split the data set into the training set 225 00:10:59,533 --> 00:11:00,600 and the test set. 226 00:11:00,600 --> 00:11:02,166 Well, that's very important to do this. 227 00:11:02,166 --> 00:11:04,533 Unless you want to create a new review. 228 00:11:04,533 --> 00:11:07,600 But, you know, we will train our random forest classification 229 00:11:07,600 --> 00:11:10,600 model on, say for example, 800 reviews. 230 00:11:10,766 --> 00:11:15,066 And we will test the predictive power of random forest on 200 231 00:11:15,100 --> 00:11:19,366 new reviews on which our random forest classification model was not trained. 232 00:11:19,500 --> 00:11:21,100 And therefore these 200 reviews 233 00:11:21,100 --> 00:11:24,933 in the test set will be new reviews for our random forest classification model. 234 00:11:25,333 --> 00:11:28,500 And so we will see how it manages to predict 235 00:11:28,500 --> 00:11:31,900 whether each of these 200 reviews is positive or negative. 236 00:11:32,233 --> 00:11:34,966 And then it's in the confusion matrix that we will see 237 00:11:34,966 --> 00:11:37,866 the number of correct predictions and the number of incorrect 238 00:11:37,866 --> 00:11:40,866 predictions in these 200 new reviews. 239 00:11:40,866 --> 00:11:42,866 All right. So that's what is done in this section. 
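The encoding step mentioned at the start of this passage can be sketched in base R like this. The tiny data frame is invented, and the column name Liked with levels 0 and 1 is an assumption consistent with a yes/no review label.

```r
# Toy data frame: one word column and a numeric 0/1 label (values invented).
dataset = data.frame(good = c(1, 0, 1, 0), Liked = c(1, 0, 1, 0))

# Turn the numeric label into a factor so a classifier treats it as two classes,
# not as a number to regress on.
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

levels(dataset$Liked)
```

Without this step a model such as a random forest would typically treat the label as a continuous quantity rather than as a class.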
240 00:11:42,866 --> 00:11:46,833 And since I just gave as an example 800 reviews to train the model 241 00:11:46,833 --> 00:11:50,700 and 200 reviews to test it, well, let's go with this choice of numbers. 242 00:11:51,000 --> 00:11:57,233 And so we need to change the split ratio here to 0.8 because that's 80%. 243 00:11:57,233 --> 00:11:58,866 And we have 1000 reviews. 244 00:11:58,866 --> 00:12:03,266 So 80% of 1000 reviews is 800 reviews to go to the training set, 245 00:12:03,533 --> 00:12:06,533 and therefore 200 reviews to go to the test set. 246 00:12:06,533 --> 00:12:07,333 All right. So that's good. 247 00:12:07,333 --> 00:12:09,533 And of course, let's not forget to replace 248 00:12:09,533 --> 00:12:13,100 the purchased variable here by our new dependent variable. 249 00:12:13,100 --> 00:12:15,233 That is liked. 250 00:12:15,233 --> 00:12:15,600 All right. 251 00:12:15,600 --> 00:12:17,566 So I think we're good with this section. 252 00:12:17,566 --> 00:12:19,466 So now let's move on to the next one. 253 00:12:19,466 --> 00:12:21,533 The next one is about feature scaling. 254 00:12:21,533 --> 00:12:24,166 And so here, do we need to apply feature scaling? 255 00:12:24,166 --> 00:12:24,966 Well, not really, 256 00:12:24,966 --> 00:12:29,366 because we only have zeros and ones in the sparse matrix of features. 257 00:12:29,700 --> 00:12:30,966 And therefore we don't have one 258 00:12:30,966 --> 00:12:34,133 independent variable dominating another independent variable. 259 00:12:34,300 --> 00:12:36,133 So we don't need to apply feature scaling. 260 00:12:36,133 --> 00:12:39,133 So we will remove this section. 261 00:12:39,266 --> 00:12:40,033 All right. 262 00:12:40,033 --> 00:12:41,233 And so what about this one? 263 00:12:41,233 --> 00:12:45,400 Yes, of course we keep this one because this is the section where we build 264 00:12:45,500 --> 00:12:49,200 a random forest classification model that will classify the reviews. 
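Here is a minimal sketch of the 80/20 split, assuming the split is done with sample.split from the caTools package (the transcript only mentions a split ratio, so the package is an assumption). The data are synthetic, with 500 positive and 500 negative labels chosen purely for illustration.

```r
library(caTools)

set.seed(123)  # make the random split reproducible

# Synthetic stand-in: 1000 rows, one word column, balanced labels (assumption).
dataset = data.frame(
  word = rbinom(1000, 1, 0.1),
  Liked = factor(rep(c(0, 1), each = 500), levels = c(0, 1))
)

# TRUE marks rows for the training set; the ratio is applied within each label.
split = sample.split(dataset$Liked, SplitRatio = 0.8)

training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
```

Because the ratio is applied per label, both sets keep the same proportion of positive and negative reviews as the full data.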
265 00:12:49,500 --> 00:12:50,000 And that's 266 00:12:50,000 --> 00:12:53,733 where we train the random forest classification model on the training set. 267 00:12:53,900 --> 00:12:56,400 And therefore here we need to change two things. 268 00:12:56,400 --> 00:13:01,100 First, the index here that, you know, is the index of the dependent variable 269 00:13:01,100 --> 00:13:04,900 that we need to remove from x, because x is supposed 270 00:13:04,900 --> 00:13:08,200 to be the training set without the dependent variable. 271 00:13:08,666 --> 00:13:11,866 So we need to replace it with the index of our new dependent variable 272 00:13:11,866 --> 00:13:16,100 liked, which is not 3 but 692. 273 00:13:16,433 --> 00:13:18,166 We can see that here very easily. 274 00:13:18,166 --> 00:13:21,600 So let's replace 3 by 692. 275 00:13:22,166 --> 00:13:23,766 All right. Good. 276 00:13:23,766 --> 00:13:28,033 And now the second thing that we need to change is of course this purchased here 277 00:13:28,033 --> 00:13:31,166 that we still need to replace by liked, 278 00:13:32,500 --> 00:13:33,600 this way. 279 00:13:33,600 --> 00:13:37,300 And then if we want we can train our random forest 280 00:13:37,300 --> 00:13:39,100 classification model with more trees. 281 00:13:39,100 --> 00:13:40,933 Right now we have ten trees. 282 00:13:40,933 --> 00:13:42,266 So we will keep ten trees. 283 00:13:42,266 --> 00:13:44,966 That might be enough for our 1000 reviews, 284 00:13:44,966 --> 00:13:49,600 which is quite a small number of reviews, and especially our 691 285 00:13:49,800 --> 00:13:53,366 word columns that we have in our sparse matrix of features. 286 00:13:53,700 --> 00:13:56,400 Ten trees might be enough, but of course, you're welcome 287 00:13:56,400 --> 00:13:59,933 to try more random forest classification models with more trees. 288 00:14:00,500 --> 00:14:01,933 So we're good with this section. 289 00:14:01,933 --> 00:14:04,100 And now let's move on to the next one. 
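The training step can be sketched as below, assuming the randomForest package is installed. The data are synthetic and much smaller than in the tutorial, so the dependent variable sits at index 4 here rather than 692; looking the index up by name is a convenience added for this sketch and not something shown in the video.

```r
library(randomForest)

set.seed(123)  # make the toy data and the forest reproducible

# Synthetic stand-in for the real data frame: three word columns plus Liked.
n = 100
dataset = data.frame(good = rbinom(n, 1, 0.3),
                     bad  = rbinom(n, 1, 0.3),
                     also = rbinom(n, 1, 0.1))
dataset$Liked = factor(as.integer(dataset$good >= dataset$bad), levels = c(0, 1))

# Position of the dependent variable: 692 in the tutorial, 4 in this toy example.
liked_index = which(names(dataset) == "Liked")

# Plain first-80-rows split, used only to keep the sketch short.
training_set = dataset[1:80, ]

# x is the training set without the dependent variable, y is the label, 10 trees.
classifier = randomForest(x = training_set[-liked_index],
                          y = training_set$Liked,
                          ntree = 10)
```

Raising ntree is the simplest way to try a larger forest, at the cost of longer training time.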
290 00:14:04,100 --> 00:14:06,933 The next one is about predicting the test results. 291 00:14:06,933 --> 00:14:10,566 So making the predictions on 200 new reviews 292 00:14:10,766 --> 00:14:13,466 that our model won't know anything about. 293 00:14:13,466 --> 00:14:16,700 And therefore for these new reviews, our model is going to try to predict 294 00:14:17,000 --> 00:14:21,166 if those reviews are positive or negative, and therefore it will be very interesting 295 00:14:21,166 --> 00:14:24,233 to see if it's making some correct predictions on new reviews. 296 00:14:24,800 --> 00:14:26,000 So right now it's the same. 297 00:14:26,000 --> 00:14:29,233 We have to replace this index here that corresponds to the index 298 00:14:29,233 --> 00:14:30,700 of the dependent variable. 299 00:14:30,700 --> 00:14:34,700 And so we need to replace 3 by, of course, 692. 300 00:14:34,700 --> 00:14:37,800 That's exactly the same as we did for the training set here. 301 00:14:38,333 --> 00:14:40,533 And so now we're good for this section. 302 00:14:40,533 --> 00:14:44,000 We're finally getting to the last section, which is making the confusion matrix. 303 00:14:44,300 --> 00:14:47,233 That's the interesting section that will tell us the number of correct 304 00:14:47,233 --> 00:14:51,300 predictions and the number of incorrect predictions for these 200 new reviews. 305 00:14:51,633 --> 00:14:52,966 So we will see that right now. 306 00:14:52,966 --> 00:14:56,600 But of course we need to replace this index 3 that corresponds 307 00:14:56,600 --> 00:14:59,600 to the index of the dependent variable, still the same, 308 00:14:59,633 --> 00:15:02,166 and replace it by 692. 309 00:15:03,300 --> 00:15:03,666 All right. 310 00:15:03,666 --> 00:15:05,000 So now everything is good. 311 00:15:05,000 --> 00:15:08,400 We are ready to train our random forest classification model 312 00:15:08,700 --> 00:15:11,233 on our 800 reviews of the training set. 
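Prediction and the confusion matrix can be sketched together as follows, again on synthetic data and assuming the randomForest package. The dependent variable is the last column in this toy table, playing the role of index 692 in the tutorial.

```r
library(randomForest)

set.seed(123)  # reproducible toy data and forest

# Synthetic stand-in: two word columns and a 0/1 label (values invented).
n = 100
dataset = data.frame(good = rbinom(n, 1, 0.3), bad = rbinom(n, 1, 0.3))
dataset$Liked = factor(as.integer(dataset$good >= dataset$bad), levels = c(0, 1))
liked_index = ncol(dataset)

training_set = dataset[1:80, ]
test_set = dataset[81:100, ]   # rows the model never sees during training

classifier = randomForest(x = training_set[-liked_index],
                          y = training_set$Liked,
                          ntree = 10)

# Predict on the test features only: the label column is dropped by its index.
y_pred = predict(classifier, newdata = test_set[-liked_index])

# Rows are the true labels, columns the predicted ones.
cm = table(test_set[, liked_index], y_pred)
```

The diagonal of cm holds the correct predictions and the off-diagonal cells hold the incorrect ones.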
313 00:15:11,233 --> 00:15:13,966 And then evaluate the predictive power of our model 314 00:15:13,966 --> 00:15:16,833 on our 200 new reviews in the test set. 315 00:15:16,833 --> 00:15:18,133 So let's do it. 316 00:15:18,133 --> 00:15:22,733 Since we already executed everything up to here, what we need to do now 317 00:15:22,733 --> 00:15:26,400 is just select everything from here to the bottom. 318 00:15:26,866 --> 00:15:27,800 And now we're good. 319 00:15:27,800 --> 00:15:32,066 We just need to press Command or Control plus Enter to execute, to train the model 320 00:15:32,066 --> 00:15:34,033 and test it on the test set, 321 00:15:34,033 --> 00:15:37,200 and eventually have a look at the number of correct predictions and the number 322 00:15:37,200 --> 00:15:40,200 of incorrect predictions on 200 new reviews. 323 00:15:40,366 --> 00:15:41,466 So let's do it. 324 00:15:41,466 --> 00:15:43,366 I'm going to press Command plus Enter to execute. 325 00:15:45,200 --> 00:15:46,166 And here we go. 326 00:15:46,166 --> 00:15:48,200 Everything worked properly. Great. 327 00:15:48,200 --> 00:15:49,633 So let's have a look. 328 00:15:49,633 --> 00:15:52,200 We will have a look at the confusion matrix, 329 00:15:52,200 --> 00:15:56,100 of course by typing cm here in the console. 330 00:15:56,433 --> 00:15:59,100 Here we go. So let's see what we have. 331 00:15:59,100 --> 00:16:04,400 We have 79 correct predictions of negative reviews, 70 correct 332 00:16:04,400 --> 00:16:10,100 predictions of positive reviews, 21 incorrect predictions of negative reviews, 333 00:16:10,500 --> 00:16:13,800 and 30 incorrect predictions of positive reviews. 334 00:16:14,233 --> 00:16:16,133 All right, so that's actually not too bad. 335 00:16:16,133 --> 00:16:19,366 You know, because we only had 800 reviews to train the model. 336 00:16:19,500 --> 00:16:21,733 That's not much when you're working with text. 337 00:16:21,733 --> 00:16:24,666 And therefore 30 plus 21 equals 51 incorrect predictions. 
338 00:16:24,666 --> 00:16:28,700 That is not bad out of 200 new reviews, 339 00:16:29,033 --> 00:16:33,600 when you know that you trained your classification model on only 800 reviews. 340 00:16:33,900 --> 00:16:36,200 And actually, let's have a look at the accuracy. 341 00:16:36,200 --> 00:16:41,066 The accuracy is the number of correct predictions, that is 79 342 00:16:41,300 --> 00:16:45,266 plus 70, divided by the total number 343 00:16:45,266 --> 00:16:48,600 of observations in the test set, and that is 200. 344 00:16:49,200 --> 00:16:51,133 So let's have a look at the accuracy. 345 00:16:51,133 --> 00:16:52,900 Pressing Enter here. 346 00:16:52,900 --> 00:16:55,900 And the accuracy is 74.5%. 347 00:16:56,333 --> 00:16:57,033 So again, 348 00:16:57,033 --> 00:17:01,500 that's not bad considering the fact that we trained our model on only 800 reviews. 349 00:17:01,733 --> 00:17:04,866 And you'll clearly see that if you had a lot more reviews to train 350 00:17:05,066 --> 00:17:08,800 your classification model, you would get a much better accuracy. 351 00:17:09,566 --> 00:17:12,033 All right, so that's the end of natural language processing 352 00:17:12,033 --> 00:17:15,433 in R. Congratulations on having completed all this: 353 00:17:15,433 --> 00:17:18,733 creating the Bag of Words model and training a classification model 354 00:17:18,733 --> 00:17:20,066 on this data set. 355 00:17:20,066 --> 00:17:23,133 But that's not the end of your natural language processing journey. 356 00:17:23,133 --> 00:17:26,933 Because right after this video you'll get a little challenge. 357 00:17:27,233 --> 00:17:29,133 So we'll let you find out about that. 358 00:17:29,133 --> 00:17:31,100 And until then, enjoy machine learning.
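The accuracy calculation from the video can be reproduced in base R from the four reported counts. The counts 79, 70, 21, and 30 come from the transcript; which off-diagonal cell holds 21 and which holds 30 is an assumption, and it does not affect the accuracy.

```r
# Confusion matrix rebuilt from the reported counts (filled column by column).
# Off-diagonal placement is assumed; the diagonal holds the correct predictions.
cm = matrix(c(79, 30, 21, 70), nrow = 2,
            dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

correct = sum(diag(cm))      # 79 + 70 correct predictions
incorrect = sum(cm) - correct  # 21 + 30 incorrect predictions

# Accuracy = correct predictions / total observations in the test set.
accuracy = correct / sum(cm)
accuracy
```

This gives 149 correct predictions out of 200, which matches the 74.5% stated in the video.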