1 00:00:00,133 --> 00:00:01,133 All right, let's do this. 2 00:00:01,133 --> 00:00:05,266 Let's begin our implementation of natural language processing, 3 00:00:05,266 --> 00:00:07,266 well, you know, a branch of machine learning, but more 4 00:00:07,266 --> 00:00:08,333 specifically 5 00:00:08,333 --> 00:00:11,966 of an NLP model made for sentiment analysis. 6 00:00:12,600 --> 00:00:12,900 All right. 7 00:00:12,900 --> 00:00:16,900 So as usual, we're going to start as efficiently as we can. 8 00:00:16,966 --> 00:00:19,800 We're going to use our data preprocessing template, 9 00:00:19,800 --> 00:00:21,500 which I've of course prepared 10 00:00:21,500 --> 00:00:22,900 for this implementation, 11 00:00:22,900 --> 00:00:27,133 which contains, you know, the code to import the libraries and import 12 00:00:27,133 --> 00:00:28,033 the dataset. 13 00:00:28,033 --> 00:00:29,766 So let's quickly start with the libraries. 14 00:00:29,766 --> 00:00:31,333 Here I'm going to take them. 15 00:00:31,333 --> 00:00:33,033 I'm going to paste that right 16 00:00:33,033 --> 00:00:35,100 here in a new code cell 17 00:00:35,100 --> 00:00:37,333 to indeed import the essential libraries. 18 00:00:37,333 --> 00:00:38,700 You know, just in case we need them. 19 00:00:38,700 --> 00:00:40,566 It doesn't mean that we will necessarily 20 00:00:40,566 --> 00:00:43,600 use all of them, but at least we have them in case we need them. 21 00:00:43,633 --> 00:00:44,500 Okay. 22 00:00:44,500 --> 00:00:45,600 Then, importing the data 23 00:00:45,600 --> 00:00:47,866 set. Let's create a new code cell. 24 00:00:47,866 --> 00:00:50,600 And now, according to you, do I have to take 25 00:00:50,600 --> 00:00:53,700 all the lines of code here or just this one to get 26 00:00:53,700 --> 00:00:55,000 the data set? 27 00:00:55,000 --> 00:00:58,200 Well, as you might guess, now we're going to do a different kind 28 00:00:58,200 --> 00:00:59,533 of data preprocessing. 
29 00:00:59,533 --> 00:01:00,166 And therefore 30 00:01:00,166 --> 00:01:01,033 we'll just take 31 00:01:01,033 --> 00:01:04,600 this line of code to indeed import the reviews inside 32 00:01:04,600 --> 00:01:06,566 still a data set variable. 33 00:01:06,566 --> 00:01:07,566 But then you will see 34 00:01:07,566 --> 00:01:09,966 that there will be some work needed 35 00:01:09,966 --> 00:01:11,733 before creating these two features. 36 00:01:11,733 --> 00:01:14,200 We will indeed create these two features at some point, 37 00:01:14,200 --> 00:01:17,700 you know, the matrix of features and the dependent variable, but not now. 38 00:01:17,700 --> 00:01:18,700 This is too early. 39 00:01:18,700 --> 00:01:22,433 We will have to clean the text first and prepare the bag of words model. 40 00:01:22,433 --> 00:01:23,133 And in fact, 41 00:01:23,133 --> 00:01:24,766 we will create these two 42 00:01:24,766 --> 00:01:28,366 entities, a matrix of features and a dependent variable vector, in the 43 00:01:28,366 --> 00:01:30,300 cell where we create the bag of 44 00:01:30,300 --> 00:01:31,066 words model. 45 00:01:31,066 --> 00:01:32,300 Okay. So let's just 46 00:01:32,300 --> 00:01:34,300 take this for now, the data set, 47 00:01:34,300 --> 00:01:37,833 and go back into our NLP implementation. 48 00:01:38,100 --> 00:01:40,566 Let's paste that right here. And now 49 00:01:40,566 --> 00:01:42,966 indeed we have to adapt this a little, 50 00:01:42,966 --> 00:01:44,766 because now we're not dealing with a 51 00:01:44,766 --> 00:01:46,000 CSV file. 52 00:01:46,000 --> 00:01:48,966 We're dealing with a TSV file where the features, 53 00:01:48,966 --> 00:01:52,100 meaning the text and the binary variable 0 or 1, 54 00:01:52,300 --> 00:01:55,300 are separated by a tab instead of a comma. 55 00:01:55,533 --> 00:01:56,600 So first things 56 00:01:56,600 --> 00:02:00,000 first, let's replace this data set by the 57 00:02:00,000 --> 00:02:00,833 right name. 
58 00:02:00,833 --> 00:02:04,466 You notice that I even included the extension because we'll have to change it. 59 00:02:04,766 --> 00:02:06,800 So the name of our data set, 60 00:02:06,800 --> 00:02:07,766 let's have a look 61 00:02:07,766 --> 00:02:08,500 again, 62 00:02:08,500 --> 00:02:12,533 is Restaurant Reviews dot TSV. 63 00:02:12,900 --> 00:02:13,566 All right. 64 00:02:13,566 --> 00:02:15,500 So that's exactly what we'll replace here. 65 00:02:15,500 --> 00:02:18,733 Restaurant underscore reviews 66 00:02:20,200 --> 00:02:22,700 dot TSV. Okay. 67 00:02:22,700 --> 00:02:24,300 And now since it is a 68 00:02:24,300 --> 00:02:26,333 TSV, we'll have to add 69 00:02:26,333 --> 00:02:28,966 some extra parameters to specify 70 00:02:28,966 --> 00:02:32,333 that indeed we're dealing with a TSV file instead of a comma 71 00:02:32,433 --> 00:02:34,533 separated value file, CSV. 72 00:02:34,533 --> 00:02:34,900 All right. 73 00:02:34,900 --> 00:02:36,500 And the way to do this is just to add 74 00:02:36,500 --> 00:02:39,500 one parameter here, which is delimiter. 75 00:02:39,700 --> 00:02:40,533 All right. 76 00:02:40,533 --> 00:02:43,266 For which the default value is actually 77 00:02:43,266 --> 00:02:44,133 the comma, meaning 78 00:02:44,133 --> 00:02:46,966 that the default data set that we can import 79 00:02:46,966 --> 00:02:50,366 with this read underscore CSV is indeed a CSV. 80 00:02:50,700 --> 00:02:51,900 But you know, we can also 81 00:02:51,900 --> 00:02:52,900 use this read 82 00:02:52,900 --> 00:02:55,333 underscore CSV function to import 83 00:02:55,333 --> 00:02:57,466 a TSV file. And that's exactly 84 00:02:57,466 --> 00:02:59,000 what we're about to do now. 85 00:02:59,000 --> 00:03:01,200 But the way to specify that we're dealing with a 86 00:03:01,200 --> 00:03:02,800 TSV file is to 87 00:03:02,800 --> 00:03:05,000 enter the following value for this delimiter 
88 00:03:05,000 --> 00:03:07,666 parameter, which is, in quotes, 89 00:03:07,666 --> 00:03:10,766 this slash here, backslash t. 90 00:03:11,166 --> 00:03:13,033 All right. That's the value of the delimiter 91 00:03:13,033 --> 00:03:13,833 you should enter to 92 00:03:13,833 --> 00:03:16,833 specify that your data set is a TSV file. 93 00:03:17,133 --> 00:03:18,200 But then that's not all. 94 00:03:18,200 --> 00:03:20,300 We need to add one final parameter, 95 00:03:20,300 --> 00:03:23,433 a very important one when you're working with text. 96 00:03:23,766 --> 00:03:26,833 I'm going to show you something now, not in this 97 00:03:26,833 --> 00:03:27,566 data set, 98 00:03:27,566 --> 00:03:29,833 because we couldn't see the whole reviews, 99 00:03:29,833 --> 00:03:33,666 but I'm going to show you the whole data set inside the folder machine learning 100 00:03:33,700 --> 00:03:36,700 data set, which you could download once again in the article 101 00:03:36,700 --> 00:03:39,600 right before this tutorial. So let's open it. 102 00:03:39,600 --> 00:03:43,766 Let's go into part seven NLP, then NLP again and Python. 103 00:03:43,766 --> 00:03:45,666 And that's the whole data set. 104 00:03:45,666 --> 00:03:47,800 So I'm on Mac here. So I'm going to open it 105 00:03:47,800 --> 00:03:50,633 with a classic text editor like TextEdit. 106 00:03:50,633 --> 00:03:51,433 Perfect. 107 00:03:51,433 --> 00:03:53,866 We just need to have a look at the text quickly. 108 00:03:53,866 --> 00:03:55,000 So there we go. 109 00:03:55,000 --> 00:03:58,900 And now I'm just going to do a command or control F to find something, 110 00:03:59,366 --> 00:04:02,933 which is a double quote, just like that. Okay. 111 00:04:03,633 --> 00:04:06,300 And as we can see, we have many 112 00:04:06,300 --> 00:04:09,300 double quotes within the text. All right. 113 00:04:09,600 --> 00:04:10,733 And in order to 114 00:04:10,733 --> 00:04:12,066 process this the right 
115 00:04:12,066 --> 00:04:15,233 way, you know, when our machine learning models learn how to 116 00:04:15,233 --> 00:04:17,700 read text, well, we'll have to tell 117 00:04:17,700 --> 00:04:20,433 our model to ignore the double quotes. 118 00:04:20,433 --> 00:04:24,133 Otherwise, you know, if you don't do it, this can cause some processing 119 00:04:24,133 --> 00:04:25,366 or parsing errors 120 00:04:25,366 --> 00:04:29,033 which you want to avoid, you know, because this can lead to an execution error. 121 00:04:29,233 --> 00:04:30,866 So I always recommend to 122 00:04:30,866 --> 00:04:33,166 add this quoting parameter and set its 123 00:04:33,166 --> 00:04:34,366 value to three, 124 00:04:34,366 --> 00:04:37,800 which means actually no quotes or, you know, ignore the quotes, 125 00:04:38,066 --> 00:04:40,933 so that indeed you can be free from processing errors. 126 00:04:40,933 --> 00:04:43,200 You can see there are many quotes, right? 127 00:04:43,200 --> 00:04:43,633 So we're 128 00:04:43,633 --> 00:04:45,933 just going to ignore all of them as if, you know, 129 00:04:45,933 --> 00:04:48,066 they're just some different characters in the 130 00:04:48,066 --> 00:04:48,933 text. 131 00:04:48,933 --> 00:04:49,400 All right. 132 00:04:49,400 --> 00:04:51,466 So that's all I wanted to show you. 133 00:04:51,466 --> 00:04:53,366 So now let's close this. 134 00:04:53,366 --> 00:04:56,200 And let's go back to our implementation. 135 00:04:56,200 --> 00:05:01,466 And to add this final parameter we need to add here quoting equals. 136 00:05:01,633 --> 00:05:06,600 And the value of this quoting parameter to ignore all the double quotes is three. 137 00:05:06,900 --> 00:05:07,600 All right. 138 00:05:07,600 --> 00:05:08,733 And now perfect. 139 00:05:08,733 --> 00:05:09,400 That's how 140 00:05:09,400 --> 00:05:09,900 you import 141 00:05:09,900 --> 00:05:12,400 correctly a TSV file, which should 
142 00:05:12,400 --> 00:05:13,600 be the format of, 143 00:05:13,600 --> 00:05:18,166 you know, a data set separating text and a binary outcome like zero or one. 144 00:05:18,300 --> 00:05:20,400 That's the classic way to proceed 145 00:05:20,400 --> 00:05:22,333 with sentiment analysis. 146 00:05:22,333 --> 00:05:23,133 So there we go. 147 00:05:23,133 --> 00:05:25,833 Well, actually let's import the data set to make sure 148 00:05:25,833 --> 00:05:27,000 everything's all right. 149 00:05:27,000 --> 00:05:29,633 So we're going to click this folder here. 150 00:05:29,633 --> 00:05:31,800 Then it's going to take a little time, 151 00:05:31,800 --> 00:05:34,800 you know, a few seconds to connect this notebook 152 00:05:35,100 --> 00:05:38,100 to a runtime to enable file browsing. 153 00:05:38,366 --> 00:05:39,466 But in a second 154 00:05:39,466 --> 00:05:41,466 we should see that upload 155 00:05:41,466 --> 00:05:43,466 button here to indeed upload, 156 00:05:43,466 --> 00:05:45,366 there we go, that data set. 157 00:05:45,366 --> 00:05:46,800 So let's click it. 158 00:05:46,800 --> 00:05:50,900 And now please find your machine learning A to Z folder on your machine, 159 00:05:50,900 --> 00:05:51,833 which you had to download 160 00:05:51,833 --> 00:05:55,366 either in the previous tutorial or at the beginning of each section. 161 00:05:55,600 --> 00:05:56,833 So now let's go inside. 162 00:05:56,833 --> 00:06:00,133 Let's go once again into part seven Natural Language Processing. 163 00:06:00,366 --> 00:06:03,766 Then this section, then Python, and then Restaurant 164 00:06:03,766 --> 00:06:05,666 Reviews dot TSV. 165 00:06:05,666 --> 00:06:07,966 Let's click open. Let's click okay. 166 00:06:07,966 --> 00:06:08,733 And now we're going to 167 00:06:08,733 --> 00:06:10,333 have the data 168 00:06:10,333 --> 00:06:12,500 set inside the notebook. 169 00:06:12,500 --> 00:06:13,200 All right. Perfect. 170 00:06:13,200 --> 00:06:14,866 So now let's run the cells. 
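The import cell described so far can be sketched as follows. This is only a minimal sketch under stated assumptions, not necessarily the exact cell from the course notebook: the two sample rows and the column names `Review` and `Liked` are made up for illustration, and the temporary file merely stands in for the real `Restaurant_Reviews.tsv` that gets uploaded to the notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real file: two made-up reviews with
# assumed column names, separated from their 0/1 label by a tab.
sample = "Review\tLiked\nWow... Loved this place.\t1\nCrust is not good.\t0\n"

path = os.path.join(tempfile.mkdtemp(), "Restaurant_Reviews.tsv")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# delimiter='\t' tells read_csv the columns are tab-separated;
# quoting=3 treats double quotes as ordinary characters.
dataset = pd.read_csv(path, delimiter="\t", quoting=3)

print(dataset.shape)
print(dataset.head())
```

In the notebook itself, the path would simply be the name of the uploaded file, and the two keyword arguments stay the same.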
171 00:06:14,866 --> 00:06:16,266 First this cell where 172 00:06:16,266 --> 00:06:17,866 we import the libraries. 173 00:06:17,866 --> 00:06:19,600 So, a simple one. 174 00:06:19,600 --> 00:06:21,966 And now this cell where we import 175 00:06:21,966 --> 00:06:23,000 the data set. 176 00:06:23,000 --> 00:06:25,000 Let's do this. Let's make sure everything goes well. 177 00:06:25,000 --> 00:06:26,833 And there we go. 178 00:06:26,833 --> 00:06:28,966 Now we have the data set ready. 179 00:06:28,966 --> 00:06:31,466 So that means we're ready for the next step: 180 00:06:31,466 --> 00:06:32,600 cleaning the text. 181 00:06:32,600 --> 00:06:35,866 That's an essential step in natural language processing. 182 00:06:36,100 --> 00:06:40,900 I will show you all the techniques to make your text as clean as possible. 183 00:06:40,900 --> 00:06:43,666 And we will do all this in the next tutorial. 184 00:06:43,666 --> 00:06:45,533 Until then, enjoy machine learning.
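To see what the quoting value of three changes in practice, here is a small sketch on in-memory text. The rows and column names are invented for illustration and are not taken from the course data set. The value 3 corresponds to the constant `csv.QUOTE_NONE` in Python's standard library, which pandas accepts for its `quoting` parameter.

```python
import csv
import io

import pandas as pd

# Made-up rows; the first review is wrapped in double quotes.
balanced = 'Review\tLiked\n"Loved it"\t1\nNot tasty\t0\n'

# Default quoting: the double quotes are read as field wrappers and dropped.
default_df = pd.read_csv(io.StringIO(balanced), delimiter="\t")

# quoting=csv.QUOTE_NONE (the integer 3): quotes stay in the text as-is.
literal_df = pd.read_csv(
    io.StringIO(balanced), delimiter="\t", quoting=csv.QUOTE_NONE
)

print(repr(default_df["Review"][0]))
print(repr(literal_df["Review"][0]))

# A review that opens a quote and never closes it is still read
# one row per line when quotes are treated as ordinary characters.
unbalanced = 'Review\tLiked\n"Loved it\t1\nNot tasty\t0\n'
safe_df = pd.read_csv(
    io.StringIO(unbalanced), delimiter="\t", quoting=csv.QUOTE_NONE
)
print(len(safe_df))
```

With the default setting, a stray opening quote like the one in the last example can make the parser read across tabs and line breaks while it looks for the closing quote, which is the kind of processing error the quoting parameter is meant to avoid.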