1 00:00:00,233 --> 00:00:02,733 Okay, my friends, are you ready to finish this? 2 00:00:02,733 --> 00:00:06,300 It's actually very simple now, and I'm sure that you're confident 3 00:00:06,400 --> 00:00:10,400 that the solution I'm about to give you will be the same as your solution. 4 00:00:10,666 --> 00:00:11,566 Because indeed, 5 00:00:11,566 --> 00:00:15,233 now the only thing that we have to do is juggle with our different 6 00:00:15,300 --> 00:00:18,966 machine learning toolkits, including the data preprocessing toolkit 7 00:00:18,966 --> 00:00:23,300 and the classification toolkit, to complete this implementation. 8 00:00:23,333 --> 00:00:24,466 So let's do this, 9 00:00:24,466 --> 00:00:28,200 starting with splitting the data set into the training set and the test set. 10 00:00:28,366 --> 00:00:29,800 Well, that's super easy. 11 00:00:29,800 --> 00:00:33,566 We're ready to do this in only one copy-paste, because we have indeed 12 00:00:33,566 --> 00:00:36,966 the matrix of features X and the dependent variable vector y. 13 00:00:37,133 --> 00:00:39,466 Therefore, the only thing that we have to do here 14 00:00:39,466 --> 00:00:42,700 is just to go to our data preprocessing template 15 00:00:42,966 --> 00:00:46,366 and then take exactly these two lines of code 16 00:00:46,633 --> 00:00:52,000 to indeed split our data set, composed of the matrix of features X 17 00:00:52,000 --> 00:00:55,300 and the dependent variable vector y, into, well, 18 00:00:55,300 --> 00:00:58,300 a new training set and test set. 19 00:00:58,366 --> 00:01:00,100 And that's our first copy-paste. 20 00:01:00,100 --> 00:01:03,100 And of course here we have nothing to change. 21 00:01:03,233 --> 00:01:07,133 Now, next step: training the Naive Bayes model on the training set. 22 00:01:07,333 --> 00:01:10,766 So here we'll have to juggle with another of our machine 23 00:01:10,766 --> 00:01:13,833 learning toolkits, which is of course the classification toolkit.
24 00:01:14,100 --> 00:01:17,266 So we're going to go back into our whole Machine Learning 25 00:01:17,266 --> 00:01:18,466 A to Z folder. 26 00:01:18,466 --> 00:01:19,633 Then we're going to, 27 00:01:19,633 --> 00:01:24,000 you know, use this little shortcut here to go back to the base of the folder, 28 00:01:24,000 --> 00:01:27,266 which is this one, Machine Learning A to Z, codes and data sets. 29 00:01:27,600 --> 00:01:30,600 Then we're going to go into Part 3, Classification. 30 00:01:30,600 --> 00:01:35,200 And then we'll see all our different models, including Naive Bayes. 31 00:01:35,500 --> 00:01:39,300 But I would just like to remind you that, you know, the choice of Naive 32 00:01:39,300 --> 00:01:41,766 Bayes was just based on my experience. 33 00:01:41,766 --> 00:01:45,800 I observed that Naive Bayes does very well with natural language 34 00:01:45,800 --> 00:01:47,166 processing problems. 35 00:01:47,166 --> 00:01:50,766 But I'll give you another exercise at the end of this tutorial, 36 00:01:50,900 --> 00:01:54,266 which will be to beat me or, you know, to beat the score 37 00:01:54,300 --> 00:01:57,200 that we're going to get at the end of this implementation. 38 00:01:57,200 --> 00:02:00,600 And so your goal will be to get an even better accuracy. 39 00:02:00,866 --> 00:02:03,433 And if you get it, you can post it in the comments, 40 00:02:03,433 --> 00:02:06,900 or you can send me a private message to say that indeed, you managed 41 00:02:06,900 --> 00:02:09,900 to beat the accuracy score that we're about to get together 42 00:02:09,966 --> 00:02:13,566 and that you probably got by yourself when doing this exercise. 43 00:02:14,133 --> 00:02:15,366 All right, so there we go. 44 00:02:15,366 --> 00:02:17,700 Let's just choose for now the Naive Bayes model. 45 00:02:17,700 --> 00:02:20,933 So we're going to go into the section, Section 18, Naive Bayes. 46 00:02:21,266 --> 00:02:23,100 Then we're going to go into Python.
47 00:02:23,100 --> 00:02:26,533 And then we're going to open this Naive Bayes implementation, 48 00:02:26,833 --> 00:02:29,166 open with Google Colaboratory. 49 00:02:29,166 --> 00:02:32,166 And, you know, I can just put it here. 50 00:02:32,466 --> 00:02:35,866 It is now opening, loading, laying out the notebook. 51 00:02:36,100 --> 00:02:36,933 And there we go. 52 00:02:36,933 --> 00:02:38,866 And now you have everything. 53 00:02:38,866 --> 00:02:42,933 You can just find the cell that's, you know, "Training the Naive Bayes 54 00:02:42,933 --> 00:02:47,100 model on the training set", which is right here. 55 00:02:47,100 --> 00:02:50,133 By the way, you could also take, you know, your model selection folder 56 00:02:50,133 --> 00:02:53,133 containing all the classification models, if you want. 57 00:02:53,366 --> 00:02:54,366 But there we go. 58 00:02:54,366 --> 00:02:57,266 What we need right now is this cell and nothing else. 59 00:02:57,266 --> 00:03:00,266 And we actually don't have anything to change inside. 60 00:03:00,366 --> 00:03:01,666 So that's all good. 61 00:03:01,666 --> 00:03:05,266 Now let's go back to our copy of our natural 62 00:03:05,266 --> 00:03:07,133 language processing implementation. 63 00:03:07,133 --> 00:03:09,266 Let's create a new code cell here. 64 00:03:09,266 --> 00:03:12,133 And let's just paste that cell to train, 65 00:03:12,133 --> 00:03:16,133 well, the Gaussian Naive Bayes model on the training set 66 00:03:16,133 --> 00:03:20,066 composed of X_train and y_train that were created just before. 67 00:03:20,566 --> 00:03:21,300 All right. 68 00:03:21,300 --> 00:03:21,800 Good. 69 00:03:21,800 --> 00:03:24,400 Now, next step: predicting the test set results. 70 00:03:24,400 --> 00:03:27,166 Well, once again here we won't have anything to do 71 00:03:27,166 --> 00:03:31,866 except a simple copy-paste, still from our Naive Bayes implementation. 72 00:03:32,233 --> 00:03:33,200 Because you know what?
73 00:03:33,200 --> 00:03:38,600 What we just want to do here is display next to each other the vector of predictions 74 00:03:38,600 --> 00:03:42,400 and the vector of real results containing the real reviews. 75 00:03:42,400 --> 00:03:45,400 You know, the real outcomes of the reviews, whether they're positive, 76 00:03:45,433 --> 00:03:48,566 which gives a one, or negative, which gives a zero. 77 00:03:48,900 --> 00:03:55,200 So here I just copied, and I'm about to paste that here in a new code cell. 78 00:03:55,600 --> 00:03:56,066 All right. 79 00:03:56,066 --> 00:03:59,100 And this will indeed print next to each other, 80 00:03:59,100 --> 00:04:03,200 first, the vector of predictions, which we got here in this first line of code, 81 00:04:03,466 --> 00:04:08,133 and the vector of real results containing the real outcomes of the reviews. 82 00:04:08,566 --> 00:04:09,233 Perfect. 83 00:04:09,233 --> 00:04:11,700 And finally, making the confusion matrix. 84 00:04:11,700 --> 00:04:13,166 That's our last step here. 85 00:04:13,166 --> 00:04:16,100 And once again we're going to go back to our toolkits. 86 00:04:16,100 --> 00:04:18,333 You know, the classification toolkit for Naive Bayes. 87 00:04:18,333 --> 00:04:20,666 We're going to scroll down a bit more. 88 00:04:20,666 --> 00:04:23,866 And we're going to find indeed the confusion matrix, 89 00:04:24,300 --> 00:04:27,533 computing as well the accuracy score. 90 00:04:27,600 --> 00:04:31,266 You know, the accuracy being simply the number of correct predictions 91 00:04:31,433 --> 00:04:34,966 divided by the total number of observations in the test set. 92 00:04:34,966 --> 00:04:35,866 Of course. 93 00:04:35,866 --> 00:04:36,166 All right. 94 00:04:36,166 --> 00:04:40,000 So let's copy and paste that in this new code cell. 95 00:04:40,000 --> 00:04:41,700 And there you go, my friends. 96 00:04:41,700 --> 00:04:43,800 Now this implementation is over.
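The four pasted cells described above (split, train, predict, confusion matrix) can be sketched roughly as follows. This is a minimal sketch and not the course notebook itself: the small synthetic `X` and `y` are stand-ins for the bag-of-words matrix and the review labels, and the `test_size` and `random_state` values are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Assumption: synthetic stand-in for the bag-of-words matrix X and the labels y.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))    # 200 fake "reviews", 20 word columns
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # made-up rule giving 0/1 labels

# Step 1: split the data set into the training set and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Step 2: train the Gaussian Naive Bayes model on the training set.
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Step 3: predict the test set results and put predictions next to real labels.
y_pred = classifier.predict(X_test)
side_by_side = np.column_stack((y_pred, y_test))
print(side_by_side[:5])

# Step 4: confusion matrix and accuracy
# (accuracy = correct predictions / number of test observations).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```

In the confusion matrix returned here, rows are the real classes and columns are the predicted classes, so the diagonal holds the correct predictions.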
97 00:04:43,800 --> 00:04:46,066 We finished it in just a few seconds, 98 00:04:46,066 --> 00:04:48,233 or, you know, in a few minutes with the explanation. 99 00:04:48,233 --> 00:04:49,200 But there we go. 100 00:04:49,200 --> 00:04:52,466 That's what I mean by juggling with your different toolkits. 101 00:04:52,500 --> 00:04:56,266 You can be super efficient at implementing a classification 102 00:04:56,266 --> 00:04:59,533 or regression model by using your code templates. 103 00:05:00,233 --> 00:05:00,933 All right. 104 00:05:00,933 --> 00:05:02,066 Now it's showtime. 105 00:05:02,066 --> 00:05:06,100 We will execute the cells that we haven't executed so far. 106 00:05:06,100 --> 00:05:10,200 So the last one we executed was this one, basically everything that is related 107 00:05:10,200 --> 00:05:11,800 to the bag of words model. 108 00:05:11,800 --> 00:05:16,100 So now let's play the rest of the cells, starting with this one, 109 00:05:16,100 --> 00:05:19,800 which will split the data set into the training set and test set. 110 00:05:20,000 --> 00:05:20,866 All good. 111 00:05:20,866 --> 00:05:22,200 Now that we have the training set, 112 00:05:22,200 --> 00:05:25,400 we're going to train the Naive Bayes model on the training set. 113 00:05:25,400 --> 00:05:26,900 And all good again. 114 00:05:26,900 --> 00:05:27,533 And now we're 115 00:05:27,533 --> 00:05:31,066 going to predict the test set results by displaying next to each other 116 00:05:31,333 --> 00:05:36,566 the vector of predictions and the vector of real outcomes of the reviews. 117 00:05:37,000 --> 00:05:38,866 And well, well, look at this. 118 00:05:38,866 --> 00:05:43,200 We don't start well, because we start with three incorrect predictions. 119 00:05:43,366 --> 00:05:44,066 Right? 120 00:05:44,066 --> 00:05:47,833 For the first review, which, be careful, is not, well, "Love this place", 121 00:05:47,833 --> 00:05:51,500 because these are the reviews of the test set, not the whole data set.
122 00:05:51,500 --> 00:05:53,233 So this is not the first review, 123 00:05:53,233 --> 00:05:54,000 "I love this place". 124 00:05:54,000 --> 00:05:55,033 This is just a random 125 00:05:55,033 --> 00:05:58,800 review taken from the original data set and put into the test set. 126 00:05:59,100 --> 00:06:02,800 But anyway, for this first review of the test set, our model predicted 127 00:06:02,800 --> 00:06:06,700 this review to be positive, whereas in reality it is negative. 128 00:06:06,933 --> 00:06:10,766 Same for the second review: predicted positive, but in reality negative. 129 00:06:10,766 --> 00:06:14,466 The same for the third review: predicted positive, but in reality negative. 130 00:06:14,733 --> 00:06:16,400 And then this is correct. 131 00:06:16,400 --> 00:06:19,500 That's a negative review which was indeed predicted as negative. 132 00:06:19,633 --> 00:06:20,500 Same for this one. 133 00:06:20,500 --> 00:06:22,933 Negative review predicted as negative. Here, 134 00:06:22,933 --> 00:06:24,066 another mistake: 135 00:06:24,066 --> 00:06:26,966 negative review predicted as positive. Here, 136 00:06:26,966 --> 00:06:28,000 correct prediction: 137 00:06:28,000 --> 00:06:31,500 positive review predicted as positive. Then incorrect prediction: 138 00:06:31,500 --> 00:06:35,433 negative review predicted as positive. Then an incorrect prediction again: 139 00:06:35,433 --> 00:06:38,266 negative review predicted as positive. And correct, 140 00:06:38,266 --> 00:06:40,166 correct, correct, incorrect. 141 00:06:40,166 --> 00:06:41,766 Anyway. So yeah. 142 00:06:41,766 --> 00:06:45,200 But by scrolling down we can see, you know, that 143 00:06:45,200 --> 00:06:48,666 we actually have many correct predictions. 144 00:06:49,000 --> 00:06:52,800 And anyway, we're going to check that right away with our confusion 145 00:06:52,800 --> 00:06:53,966 matrix below. 146 00:06:53,966 --> 00:06:56,600 So I'm actually going to scroll down from here.
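The side-by-side reading above can be replayed in a few lines. This is only an illustrative sketch: the nine (prediction, real outcome) pairs are transcribed from the walkthrough of the first nine test-set reviews, with 1 for positive and 0 for negative, and the variable names are made up for this example.

```python
# (prediction, real outcome) for the first nine test-set reviews,
# as read out in the walkthrough: 1 = positive, 0 = negative.
pairs = [(1, 0), (1, 0), (1, 0), (0, 0), (0, 0), (1, 0), (1, 1), (1, 0), (1, 0)]

# Print each pair with a correct/incorrect verdict.
for i, (pred, real) in enumerate(pairs, start=1):
    status = "correct" if pred == real else "incorrect"
    print(f"review {i}: predicted={pred} real={real} -> {status}")

# Tally the verdicts for these nine reviews only.
correct = sum(1 for pred, real in pairs if pred == real)
incorrect = len(pairs) - correct
print(correct, incorrect)
```

Nine rows are far too few to judge the model, which is why the confusion matrix over the whole test set is the better summary.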
147 00:06:56,600 --> 00:06:58,966 There we go. And perfect. 148 00:06:58,966 --> 00:07:02,800 So this is our last cell, which will display the confusion matrix 149 00:07:02,800 --> 00:07:06,300 and compute the accuracy score, which I will want you to beat 150 00:07:06,300 --> 00:07:07,633 right after this tutorial, 151 00:07:07,633 --> 00:07:09,833 as a final exercise of NLP. 152 00:07:09,833 --> 00:07:14,533 And so let's play the cell to see what the confusion matrix looks like, 153 00:07:14,533 --> 00:07:19,666 and mostly to see the final accuracy, which is 73%. All right. 154 00:07:19,666 --> 00:07:20,833 So that's pretty good. 155 00:07:20,833 --> 00:07:22,800 But I'm sure we can do better. 156 00:07:22,800 --> 00:07:24,666 You know, there are many ways to do better. 157 00:07:24,666 --> 00:07:27,600 And so I really look forward to seeing your results, 158 00:07:27,600 --> 00:07:31,100 you know, after you experiment with more classification models or even 159 00:07:31,333 --> 00:07:34,800 by doing a better cleaning of the text, you know, the reviews. 160 00:07:34,966 --> 00:07:39,066 Maybe you can add some more exclusions in the list of stopwords. 161 00:07:39,066 --> 00:07:40,166 You know, we excluded 162 00:07:40,166 --> 00:07:43,700 "not" from the list of stopwords, but maybe you can exclude as well 163 00:07:43,833 --> 00:07:47,600 "isn't". You know, I know that "isn't" is actually part of the stopwords list. 164 00:07:47,766 --> 00:07:50,900 So, you know, you can do some extra work in order to improve this, 165 00:07:51,100 --> 00:07:55,100 because I'm sure that we can get a better accuracy than 73%. 166 00:07:55,100 --> 00:07:56,700 But still, this is pretty good. 167 00:07:56,700 --> 00:08:00,600 You know, remember that we actually trained a machine to understand English. 168 00:08:00,800 --> 00:08:03,333 And, you know, a couple of years ago, maybe decades ago, 169 00:08:03,333 --> 00:08:05,533 this would definitely seem very challenging.
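The stopword suggestion above (keep "not", and perhaps "isn't" as well, so negations survive the cleaning) can be sketched like this. It is a rough illustration only: the stopword set is a tiny hand-written stand-in rather than the list used in the notebook, and `clean_review` is a hypothetical helper.

```python
# Assumption: tiny hand-written stopword list, standing in for the real one.
stop_words = {"the", "is", "a", "this", "was", "not", "isn't"}

# Words to exclude from the stopword list so that negations are kept.
keep = {"not", "isn't"}
filtered_stop_words = stop_words - keep

def clean_review(review, stop_words):
    """Lowercase the review and drop every word found in stop_words."""
    words = review.lower().split()
    return " ".join(word for word in words if word not in stop_words)

print(clean_review("This place is not good", stop_words))           # place good
print(clean_review("This place is not good", filtered_stop_words))  # place not good
```

With the full list, a negative review collapses to the same words as a positive one; keeping the negation preserves the difference the classifier needs.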
170 00:08:05,533 --> 00:08:07,633 But here we did it in just a few minutes. 171 00:08:07,633 --> 00:08:09,433 And so that's absolutely incredible. 172 00:08:09,433 --> 00:08:12,766 And the model predicts whether these reviews written in English 173 00:08:12,766 --> 00:08:16,933 are positive or negative correctly 73% of the time. 174 00:08:16,933 --> 00:08:18,133 So that's really, really good. 175 00:08:18,133 --> 00:08:23,933 And that's the confusion matrix, with 55 correct predictions of negative reviews, 176 00:08:23,933 --> 00:08:29,400 91 correct predictions of positive reviews, 42 incorrect predictions 177 00:08:29,400 --> 00:08:33,800 of positive reviews, and 12 incorrect predictions of negative reviews. 178 00:08:33,833 --> 00:08:37,266 All right, so try to reduce these two numbers here, 179 00:08:37,366 --> 00:08:39,900 and let me know, or let everybody know in the comments, 180 00:08:39,900 --> 00:08:43,466 what you managed to get, which solution you managed to improve. 181 00:08:43,466 --> 00:08:46,666 And I look forward to seeing if you managed to, you know, 182 00:08:46,666 --> 00:08:48,566 for example, go over 80%. 183 00:08:48,566 --> 00:08:50,000 That would be fantastic. 184 00:08:50,000 --> 00:08:53,100 You know, you would definitely beat me by pretty far. 185 00:08:53,700 --> 00:08:54,200 All right. 186 00:08:54,200 --> 00:08:57,066 So now we're done with natural language processing. 187 00:08:57,066 --> 00:09:00,733 I hope you liked this introduction to sentiment analysis. 188 00:09:00,966 --> 00:09:03,033 If you liked NLP and if you want to, 189 00:09:03,033 --> 00:09:06,666 you know, study this branch of machine learning more in depth, 190 00:09:06,766 --> 00:09:10,500 well, know that we have other courses about chatbots, about the BERT model. 191 00:09:10,633 --> 00:09:12,433 So I really recommend checking them out. 192 00:09:12,433 --> 00:09:17,266 But first I recommend that, you know, you complete this journey of machine learning.
193 00:09:17,266 --> 00:09:20,333 And speaking of this, well, the next step of our journey here 194 00:09:20,333 --> 00:09:25,133 is to enter the fascinating world of deep learning, where we will, you know, 195 00:09:25,133 --> 00:09:30,133 mimic the process of the human brain to actually give, for the first time, 196 00:09:30,133 --> 00:09:35,533 to our AI an artificial brain, which will itself perform some predictions. 197 00:09:35,700 --> 00:09:37,233 It's super fascinating. 198 00:09:37,233 --> 00:09:40,233 It's actually, you know, the most fascinating branch of machine learning, 199 00:09:40,233 --> 00:09:43,800 because this is the closest one to human intelligence. 200 00:09:44,066 --> 00:09:47,166 So now I just can't wait to see you in this next 201 00:09:47,166 --> 00:09:50,166 part to enter the world of deep learning. 202 00:09:50,300 --> 00:09:52,366 And until then, enjoy machine learning.
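As a check on the figures quoted earlier in this tutorial, the 73% accuracy follows directly from the four confusion matrix counts, using the definition given in the video (correct predictions divided by the number of test observations). The variable names below are illustrative.

```python
# Counts read out from the confusion matrix in the video.
true_negatives = 55    # negative reviews correctly predicted as negative
true_positives = 91    # positive reviews correctly predicted as positive
false_positives = 42   # negative reviews incorrectly predicted as positive
false_negatives = 12   # positive reviews incorrectly predicted as negative

# Total number of reviews in the test set.
total = true_negatives + true_positives + false_positives + false_negatives

# Accuracy = correct predictions / total observations in the test set.
accuracy = (true_negatives + true_positives) / total

print(total)     # 200
print(accuracy)  # 0.73
```

Reducing the 42 and the 12, the two off-diagonal counts, is what raises this ratio, which is the goal of the closing exercise.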