1 00:00:00,300 --> 00:00:02,900 Hello my friends, and welcome to this new part 2 00:00:02,900 --> 00:00:05,700 on Natural Language Processing. 3 00:00:05,700 --> 00:00:09,233 I'm super excited to start this part because this is the branch of machine 4 00:00:09,233 --> 00:00:12,933 learning with which you can build chatbots and machine translation. 5 00:00:13,200 --> 00:00:14,666 So of course, this is not 6 00:00:14,666 --> 00:00:18,000 what we're going to do in this part because this is really advanced NLP. 7 00:00:18,166 --> 00:00:22,800 So we'll just cover the basics with sentiment analysis, which consists 8 00:00:22,833 --> 00:00:26,100 of training a machine to understand some text 9 00:00:26,233 --> 00:00:29,233 and predict a certain outcome for this text. 10 00:00:29,366 --> 00:00:33,700 So in our case study here, these texts will be reviews of a restaurant. 11 00:00:33,900 --> 00:00:35,266 And we'll have to train a machine 12 00:00:35,266 --> 00:00:39,133 to understand if each review is positive or negative. 13 00:00:39,366 --> 00:00:41,400 So very simple, very classic, 14 00:00:41,400 --> 00:00:44,400 but the best way to be introduced to NLP. 15 00:00:44,700 --> 00:00:45,000 All right. 16 00:00:45,000 --> 00:00:48,233 So before we start, let's make sure everyone here is on the same page. 17 00:00:48,266 --> 00:00:51,666 This is the folder containing all the code and data sets, 18 00:00:51,800 --> 00:00:55,566 for which I gave you the link right before this tutorial in the article. 19 00:00:55,633 --> 00:00:57,600 So make sure to connect to that link. 20 00:00:57,600 --> 00:00:58,600 And now there we go. 21 00:00:58,600 --> 00:01:02,366 We can enter Part 7, Natural Language Processing. 22 00:01:02,900 --> 00:01:05,633 So in this part you will only find one section. 23 00:01:05,633 --> 00:01:10,066 That's because we only do one case study of NLP, about sentiment analysis.
24 00:01:10,333 --> 00:01:12,300 However, you will see that you can try 25 00:01:12,300 --> 00:01:15,766 diverse machine learning models to tackle the problem. 26 00:01:16,066 --> 00:01:17,533 Indeed, the essential 27 00:01:17,533 --> 00:01:21,433 part of our implementation will be to build the Bag of Words model. 28 00:01:21,600 --> 00:01:26,100 But then once it is built, we can try several classification models. 29 00:01:26,233 --> 00:01:27,633 Why classification models? 30 00:01:27,633 --> 00:01:31,966 That's because we'll have to predict, you know, a binary outcome, 1 or 0: one 31 00:01:31,966 --> 00:01:36,400 meaning the review is positive and zero meaning the review is negative. 32 00:01:36,600 --> 00:01:40,866 So you'll actually have the flexibility to try several machine learning models. 33 00:01:40,866 --> 00:01:45,166 And this will actually be your final exercise of this section. 34 00:01:45,533 --> 00:01:46,866 So there we go. Let's do this. 35 00:01:46,866 --> 00:01:49,000 Let's enter Section 36 now, 36 00:01:49,000 --> 00:01:51,300 Natural Language Processing. 37 00:01:51,300 --> 00:01:55,733 And as usual we're going to start with Python, in which you will find two files: 38 00:01:55,933 --> 00:01:59,300 the implementation, natural language processing dot ipynb, 39 00:01:59,633 --> 00:02:03,366 which you can open with either Google Colaboratory or Jupyter Notebook, 40 00:02:03,666 --> 00:02:07,400 and our data set, restaurant reviews dot, this time, 41 00:02:07,400 --> 00:02:11,633 not CSV but TSV, and this will be a good opportunity 42 00:02:11,633 --> 00:02:16,100 for me to train you on how to import a TSV data set. 43 00:02:16,400 --> 00:02:22,200 TSV means tab-separated values instead of comma-separated values like in a CSV.
44 00:02:22,200 --> 00:02:23,666 So basically the only difference 45 00:02:23,666 --> 00:02:26,733 is that in the previous data sets we worked with, well, 46 00:02:26,733 --> 00:02:30,466 you know, the features and the dependent variable were separated by commas. 47 00:02:30,633 --> 00:02:34,166 And in this one, well, instead of being separated by commas, 48 00:02:34,200 --> 00:02:37,766 the reviews and the dependent variable will be separated by a tab. 49 00:02:37,900 --> 00:02:39,000 And this makes sense, right? 50 00:02:39,000 --> 00:02:41,466 Because in the reviews we already have commas, 51 00:02:41,466 --> 00:02:43,733 and therefore they would create nonsense features. 52 00:02:43,733 --> 00:02:46,500 But let me show you what this data set looks like. 53 00:02:46,500 --> 00:02:47,166 So as you can see, 54 00:02:47,166 --> 00:02:51,033 there are only two columns, the first one containing all the reviews. 55 00:02:51,033 --> 00:02:52,600 So for example, this is the first one: 56 00:02:52,600 --> 00:02:54,233 Wow, loved this place. 57 00:02:54,233 --> 00:02:57,166 A second one: crust is not good. 58 00:02:57,166 --> 00:02:59,666 Another one: a great touch, etc. 59 00:02:59,666 --> 00:03:03,000 So you have in total, let's see, 1000 reviews. 60 00:03:03,366 --> 00:03:03,666 Right. 61 00:03:03,666 --> 00:03:05,200 So we're going to train our machine learning model 62 00:03:05,200 --> 00:03:09,466 to actually understand text and predict if the texts are positive or negative, 63 00:03:09,466 --> 00:03:11,666 with 1000 texts. 64 00:03:11,666 --> 00:03:12,200 All right. 65 00:03:12,200 --> 00:03:17,033 And then the second column is of course whether the review is positive or negative. 66 00:03:17,033 --> 00:03:20,033 So one means that it is positive, meaning the customer liked it. 67 00:03:20,300 --> 00:03:23,300 And zero means that the review is negative.
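The point about tabs versus commas can be sketched with Python's standard `csv` module. This is only an illustration, not the notebook's code: the two sample reviews and the column names `Review` and `Liked` are made up for the example, and `load_reviews` is a hypothetical helper.

```python
import csv
import io

# Made-up sample in the two-column layout described above:
# review text, a tab, then the label (1 = positive, 0 = negative).
# The header names "Review" and "Liked" are assumptions for this sketch.
SAMPLE_TSV = (
    "Review\tLiked\n"
    "Great food, friendly staff, and fair prices.\t1\n"
    "Cold fries, slow service.\t0\n"
)


def load_reviews(tsv_text):
    """Return (reviews, labels) parsed from tab-separated text."""
    # delimiter="\t" splits on tabs only, so commas stay inside the review.
    # QUOTE_NONE treats quote characters in a review as ordinary text.
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    next(reader)  # skip the header row
    reviews, labels = [], []
    for text, label in reader:
        reviews.append(text)
        labels.append(int(label))
    return reviews, labels


reviews, labels = load_reviews(SAMPLE_TSV)
print(reviews)
print(labels)
```

Because the separator is a tab, the commas inside the first review do not create extra columns. In a notebook, pandas can read the same layout with `pd.read_csv(path, sep="\t")`.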
68 00:03:23,466 --> 00:03:26,933 And of course we have the real outcomes in order to train our machine 69 00:03:26,933 --> 00:03:31,800 learning model to understand if each of these texts is positive or negative. 70 00:03:32,066 --> 00:03:35,466 So that's purely, in the end, you know, a classification problem. 71 00:03:35,700 --> 00:03:40,566 But the essential part of it is that we'll train the machine to understand 72 00:03:40,566 --> 00:03:44,966 these texts first and then to predict if they are positive or negative. 73 00:03:45,500 --> 00:03:45,933 All right. 74 00:03:45,933 --> 00:03:48,600 So very simple case study, very simple data set. 75 00:03:48,600 --> 00:03:51,900 That means we are ready to start the implementation 76 00:03:51,900 --> 00:03:53,966 of natural language processing. 77 00:03:53,966 --> 00:03:55,300 So as you prefer, 78 00:03:55,300 --> 00:03:59,100 feel free to open it with either Google Colaboratory or Jupyter Notebook. 79 00:03:59,400 --> 00:04:00,200 I'm opening it 80 00:04:00,200 --> 00:04:03,900 with Google Colab as usual, so feel free to do the same if you'd like. 81 00:04:04,000 --> 00:04:06,800 And now the notebook is loading. 82 00:04:06,800 --> 00:04:10,766 In a second it will be laying out. All right, loading, laying out. 83 00:04:10,766 --> 00:04:13,766 Perfect. And this is the implementation. 84 00:04:13,833 --> 00:04:16,333 And as usual this is in read-only mode. 85 00:04:16,333 --> 00:04:19,333 And we want to re-implement this from scratch. 86 00:04:19,366 --> 00:04:20,066 Therefore we're going 87 00:04:20,066 --> 00:04:24,200 to create a copy right away so that we can modify the code inside. 88 00:04:24,466 --> 00:04:27,200 So we're going to click Save a copy in Drive here. 89 00:04:27,200 --> 00:04:30,633 This will create a copy, after which we will be able 90 00:04:30,633 --> 00:04:33,633 to modify the code and re-implement this from scratch. 91 00:04:34,233 --> 00:04:37,200 And speaking of re-implementing it from scratch,
92 00:04:37,200 --> 00:04:41,466 well, let's delete all the code cells because we will re-implement them. 93 00:04:41,666 --> 00:04:45,600 So let's click this trash button here on each of the code cells, 94 00:04:45,600 --> 00:04:49,866 but not the text, so that we can keep that well highlighted structure 95 00:04:50,100 --> 00:04:54,000 and see where we're going at each point of the implementation. 96 00:04:54,366 --> 00:04:57,366 All right. So almost done. 97 00:04:57,400 --> 00:05:00,533 It's actually an implementation in about ten steps. 98 00:05:00,833 --> 00:05:04,500 But you will recognize some of the steps as steps we did before. 99 00:05:04,966 --> 00:05:07,233 You'll see, I'm going to show you in a second. 100 00:05:07,233 --> 00:05:07,700 All right. 101 00:05:07,700 --> 00:05:10,466 So let's have a look at the structure of this implementation. 102 00:05:10,466 --> 00:05:13,633 We will start first by importing the libraries as usual, 103 00:05:13,633 --> 00:05:17,000 because indeed we will need several libraries to preprocess 104 00:05:17,000 --> 00:05:20,000 our texts and train our future machine learning model. 105 00:05:20,333 --> 00:05:21,833 Then we will import the data set. 106 00:05:21,833 --> 00:05:23,966 So that's actually the data preprocessing phase. 107 00:05:23,966 --> 00:05:26,333 But not only that, the data preprocessing 108 00:05:26,333 --> 00:05:30,200 phase will also contain the next two cells, cleaning the text. 109 00:05:30,366 --> 00:05:33,900 Indeed, we will have to simplify the text as much as we can 110 00:05:34,166 --> 00:05:38,100 in order to ease the learning process of the machine learning model. 111 00:05:38,100 --> 00:05:40,200 You know, we'll have to remove all the punctuation. 112 00:05:40,200 --> 00:05:42,500 We'll have to put all the letters in lowercase. 113 00:05:42,500 --> 00:05:44,066 Then we'll have to apply stemming.
114 00:05:44,066 --> 00:05:47,366 You know, we'll have to make very clean text to ease 115 00:05:47,466 --> 00:05:50,600 the learning process of the future classification model we'll build. 116 00:05:50,800 --> 00:05:52,866 So that's a compulsory process. 117 00:05:52,866 --> 00:05:56,333 When doing NLP you have to preprocess the text, basically. 118 00:05:56,733 --> 00:05:59,400 Then we'll create the Bag of Words model, 119 00:05:59,400 --> 00:06:02,400 which is at the heart of sentiment analysis. 120 00:06:02,533 --> 00:06:03,933 And then there you go. 121 00:06:03,933 --> 00:06:06,200 That's where you will recognize everything. 122 00:06:06,200 --> 00:06:07,633 Once we have the Bag of Words 123 00:06:07,633 --> 00:06:11,400 model, we basically have a data set ready to be used for training, right? 124 00:06:11,400 --> 00:06:14,800 We have a data set ready to be learned from by a machine learning model. 125 00:06:14,800 --> 00:06:19,466 And that's why we will then just apply the classic process of training a model. 126 00:06:19,666 --> 00:06:23,366 First, we will split the data set into the training set and test set 127 00:06:23,566 --> 00:06:26,400 so that we can indeed have a set where we train the model 128 00:06:26,400 --> 00:06:30,133 to understand text and predict if the texts are positive or negative, 129 00:06:30,366 --> 00:06:33,366 and the test set so that we can evaluate the performance 130 00:06:33,500 --> 00:06:36,500 on new texts on which the model wasn't trained. 131 00:06:36,600 --> 00:06:38,433 And then there we go, we train. 132 00:06:38,433 --> 00:06:41,366 So I chose a Naive Bayes model on the training set. 133 00:06:41,366 --> 00:06:42,633 But you will see that your 134 00:06:42,633 --> 00:06:46,933 final exercise at the end will be to try the other classification models 135 00:06:46,933 --> 00:06:51,200 and see if you can beat the accuracy I will get in this implementation.
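The cleaning steps and the Bag of Words idea mentioned above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the notebook's implementation: `crude_stem` is a toy suffix stripper standing in for a real stemmer, and the function names and the two sample reviews are invented for the example.

```python
import re
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper. NOT a real stemming algorithm such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters remain, so "was" stays "was".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(review):
    """Replace non-letters with spaces, lowercase, then stem each word."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", review).lower()
    return [crude_stem(word) for word in letters_only.split()]


def bag_of_words(reviews):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in review i."""
    tokenized = [clean(review) for review in reviews]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(["Loved the food!", "The food was not good."])
print(vocabulary)
for row in matrix:
    print(row)
```

Each review becomes one row of word counts, and each distinct cleaned word becomes one column. Those rows are the features, and the 0 or 1 labels are the dependent variable, which is why any classification model can be trained on the result.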
136 00:06:51,600 --> 00:06:54,466 Then we will predict the test set results, and finally 137 00:06:54,466 --> 00:06:58,133 we will make the confusion matrix and get the final accuracy. 138 00:06:58,566 --> 00:06:59,566 So that's our structure. 139 00:06:59,566 --> 00:07:01,433 That's our NLP journey. 140 00:07:01,433 --> 00:07:05,133 So now, as soon as you're ready, let's start in the next tutorial 141 00:07:05,133 --> 00:07:08,133 with the simple data preprocessing phase. 142 00:07:08,433 --> 00:07:09,600 I can't wait to start. 143 00:07:09,600 --> 00:07:12,800 See you in the next tutorial and until then, enjoy machine learning!
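The last step, the confusion matrix and the accuracy, can be sketched as follows for binary labels. The label lists are made up for illustration and are not results from the course, and the helper names are hypothetical. The layout (rows are actual classes, columns are predicted classes) is a choice made for this sketch.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row index is the actual class, column index is the predicted class.
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Share of correct predictions: the diagonal divided by the total."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)            # [[3, 1], [1, 3]]
print(accuracy(cm))  # 0.75
```

Here six of the eight made-up predictions match the real labels, so the accuracy is 0.75. The off-diagonal entries show one negative review predicted as positive and one positive review predicted as negative.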