Now, the first step is to head over to the course resources list and find the skeleton project on GitHub. This skeleton project doesn't have any code in it; it only has the bare-bones storyboard design. So go ahead and git clone, or download and unzip, this project and open it inside Xcode 10.

You can see that all we have onscreen is a simple app with a single label, the sentiment label, as well as a text field and a predict button. Those are linked up as two IBOutlets, for the textField and the sentimentLabel, and one IBAction for when the user presses the predict button. That's all I placed inside the skeleton project.
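For reference, the skeleton's view controller is roughly shaped like the sketch below. The outlet names come from the description above, but the action name (predictPressed) and the exact wiring are assumptions, so check them against the actual skeleton file.

```swift
import UIKit

class ViewController: UIViewController {

    // Outlets described above; names should match the skeleton project's storyboard connections.
    @IBOutlet weak var textField: UITextField!
    @IBOutlet weak var sentimentLabel: UILabel!

    // Action name is an assumption; it fires when the user taps the predict button.
    @IBAction func predictPressed(_ sender: Any) {
        // The sentiment prediction will be wired up here later in the module.
    }
}
```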
Once we're ready with that, we can go ahead and close it down, because we first need to build the machine learning model that is going to do the sentiment analysis. In order to build any machine learning model, we're going to need some data, so download the Twitter Sentiment Data from the Course Resources list as well. There's a download link there, and if you click on it, the download should start automatically. Make sure that you unzip the file so that you end up with a CSV.

This public data set comes from a company called Sanders Analytics. At one point, they collected around 1,000 tweets that mentioned AAPL, so either the Apple stock ticker or the @Apple handle. Then they employed a bunch of people, actual humans, to read through each of the tweets and give it a rating of positive, negative, or neutral.

So let's take a look at this data and see what it looks like. Head over to Google Sheets, create a brand-new spreadsheet, and open the CSV file you just downloaded and extracted by uploading it. This is what our data looks like. If you scroll down, you can see that there are around a thousand tweets, found by searching for the handle @Apple, so basically whenever somebody mentions the Apple brand. Humans have read through all of these tweets, some sensible, some nonsensical, and rated each one as positive, negative, or neutral. This is the point where we can be grateful that we didn't have to do what we did in previous modules to use Create ML, which is generating our own data. I'm pretty sure it would take us days to trawl through Twitter, download the tweets relevant to Apple, and then read through them individually and decide whether each one is neutral, positive, or negative.

So here we have our data. The next thing we need to do is use it to create our machine learning model, so that our model learns what positive tweets look like, what neutral tweets look like, and what negative tweets look like, and we can then use it to analyze future tweets that it has never seen, tweets that don't exist over here. As we did previously, in order to use Create ML to generate our machine learning model, we're going to use a blank macOS playground. I'm going to save it on my desktop and call it TweetSentimentModelMaker.

Now, unfortunately, unlike the image recognition builder, there is no live view for creating natural language processing models yet. But that's not a problem; all we have to do is write a few lines of code to load up our data and get Create ML to start working on it. The first thing we're going to do is delete the default string variable, but keep the Cocoa framework imported. In addition, we're going to import CreateML, of course, in order to use it to create our machine learning model.

Next, we need to load up our CSV file and specify the data that we're going to use to train our machine learning model. To do that, we'll create a new constant called data, initialized as a new instance of an MLDataTable. An MLDataTable can be initialized from either a JSON file or a CSV file, and in our case, our data is a CSV. So we're going to convert the CSV into an MLDataTable using the contentsOf initializer, which takes a URL, and that URL is simply the location where our data lives. Let's go ahead and create it: we'll write URL and initialize it using fileURLWithPath, and that path has to be a String representing the entire file path to the file we want it to read. For me, that's /Users/angelayu/Downloads/twitter-sanders-apple3.csv.

Now, if you don't want to type all of that out and risk making a mistake, a neat trick is to head over to Terminal, type cd followed by a space, and then simply click and drag the file into the window. It will automatically fill in the entire file path required to reach that file from the root of your computer. I can then highlight all of it, copy it, and paste it in here as the String that I'm going to use to create a file URL to load up and convert into an MLDataTable.

At this point, we get an error telling us that the initializer of the MLDataTable can throw an error but is not marked with "try." That means that if we try to convert a file into an MLDataTable and the file isn't actually a CSV or a JSON, or it plainly doesn't exist, then we'll get an error. So we need to mark our initialization with a "try," and if the file isn't there and this fails, we just don't want our code to continue.
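Putting that together, the playground so far looks roughly like this; the file path is the one from this lesson's machine, so swap in wherever your CSV actually lives.

```swift
import Cocoa
import CreateML

// Load the CSV into an MLDataTable. The initializer throws, hence the "try".
// Replace the path below with the location of your own twitter-sanders-apple3.csv.
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/Users/angelayu/Downloads/twitter-sanders-apple3.csv"))
```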
Now, the next thing we need to do: remember that when we created our image recognition model, we had some training data and some testing data, so that we could train our model up and then evaluate how good it is on images it had never seen before. We have to do the same thing here with our Twitter sentiment data. You could go in here and randomly pick out 20 percent of the tweets, along with their sentiment, save them as a new table, and load that into our script separately. Or we can use a very simple method that just splits our data randomly into 80 percent for training and 20 percent for testing.

The way to do that is to write "let" and create two constants, one called trainingData and the other called testingData, and these are going to be equal to the data object we already have, but split using a method called randomSplit. This method is only available in Create ML, so we can't use it on just anything we want; it's specifically made so that it's easy to split a data set into an 80 percent trainingData set and a 20 percent testingData set.

This method takes two parameters, or two inputs. The first is how you want the random split to be carried out. Do you want it 80-20? Do you want it 70-30? You just use a Double: for example, 0.8 to represent 80 percent trainingData and 20 percent testingData. If you wanted, say, 20 percent trainingData and 80 percent testingData, then you would put 0.2 instead.

The second parameter is a little bit more complex. It's something called a seed, and the seed relates to how random number generators work. As we learned previously, random number generation on a computer is never truly random in the same way that it's random in nature. In certain cases, if you want to be able to reproduce the randomness from a random number generator, you can use a seed. For example, we're currently splitting our data randomly into 80 percent and 20 percent, but if we need to reproduce this in the future, say to debug our code or to figure out why our model isn't working, then we can use the same seed to generate the same randomSplit. So in here, I'm just going to put 5, and in the future, if I need my data to be split randomly in exactly the same way, I can simply use 5 as the seed again to achieve that.
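In code, that split is a single line:

```swift
// Split the table randomly: 80 percent for training, 20 percent for testing.
// Reusing the same seed (5) reproduces exactly the same split in future runs.
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 5)
```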
Now that we've split our data randomly, we can go ahead and create a new instance of an MLTextClassifier. This is going to create our machine learning model from our training data. Let's call it sentimentClassifier, and it's going to be equal to an MLTextClassifier. We're going to initialize this brand-new text classifier using the initializer that takes three parameters: an MLDataTable of trainingData, which we created with our 80-20 split; the name of the textColumn, so the column inside that table that contains the text; and the name of the column that contains the label, so whether the tweet was positive, neutral, or negative. You'll notice that this initializer also throws, so in addition to writing it, we have to mark it with a "try."

The part we're going to use as the trainingData is the constant we created earlier, the random 80 percent of our data that was loaded from our CSV. Whatever you called it over there has to match what you pass in here.

Now, how do we figure out the name of the text column? Well, let's head back to our CSV data set. You can see that right at the top we've got a name for each column: the column for the text is just called text, with a lowercase "t," and the column for the labels is called class. So now we just have to put those in here as Strings: the textColumn is called "text" and the labelColumn is called "class." This line of code will train up our text classifier using our trainingData.
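Here is that line in full, with the column names taken from the CSV header:

```swift
// Train a text classifier on the 80 percent training split.
// "text" is the tweet column and "class" is the sentiment label column in this CSV.
let sentimentClassifier = try MLTextClassifier(trainingData: trainingData,
                                               textColumn: "text",
                                               labelColumn: "class")
```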
Once that's done, this object, the sentimentClassifier, is now our machine learning model. But before we're done, we should test it against our testingData to evaluate how accurate it is. We can generate some evaluationMetrics by taking our sentimentClassifier and calling the method evaluation(on:), passing in an object of type MLDataTable, which is of course going to be our testingData. So here, we pass in our testingData set.

At this stage, we have evaluation metrics that we can use to figure out the accuracy. Let's call it evaluationAccuracy. It can be derived by taking the evaluation metrics we got and reading one of their properties, classificationError, which is the fraction of examples that were incorrectly labeled by this model. This is a Double: for example, if 90 percent of the examples were incorrectly labeled, this number will be 0.9; if only 10 percent were incorrectly labeled, it will be 0.1. We can then take 1.0 and subtract the error to get the accuracy, so the proportion it got right, essentially, and multiply that by 100 to get a percentage.
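As a sketch, the evaluation step looks like this. (On newer Create ML versions the evaluation method may also ask for the column names, so adjust if Xcode complains.)

```swift
// Evaluate the trained classifier on the 20 percent of data it has never seen.
let evaluationMetrics = sentimentClassifier.evaluation(on: testingData)

// classificationError is the fraction of test examples labeled incorrectly,
// so (1.0 - error) * 100 gives the accuracy as a percentage.
let evaluationAccuracy = (1.0 - evaluationMetrics.classificationError) * 100
```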
At this stage, we can already go ahead and click the play button to see what we get as a result. You can see that it did all of this in a very short period of time, and we end up with an evaluation accuracy of close to 70 percent. In an ideal world, we would feed our model more data to get it closer to 75 or 80 percent, but for our purposes, 70 percent is good enough, and we're going to go ahead and save our model so that we can use it inside our app.

First, I'm going to generate some metadata using MLModelMetadata, and I'm going to initialize it with a few of its parameters. It will have an author and a short description. It doesn't really need a license, but it will have a version, which needs to be a String, so inside quotes, "1.0". There's nothing additional I want to add either, so I'm going to delete that parameter as well.

Now that I've created my metadata, I'm going to take my sentimentClassifier and call write(to:) with a URL, so that I can save it. Again, we generate our URL with URL(fileURLWithPath:), and the path I'm going to use is everything up to the Downloads folder, so the model is placed inside the same folder as my initial data and I can locate it easily. After the last forward slash, I'll put the name I want to give my model, which is going to be TweetSentimentClassifier, and remember, you have to end this with .mlmodel as the file extension. Finally, this write method can also throw, so we must mark it with a "try."

All right. Now that we've written the two lines of code that give our sentimentClassifier some metadata and tell it where to save to, we can go ahead and click the play button to run them. You can see that because this part is blue, when we press the play button, it's only going to start running the code from line 13 onwards, so we don't repeat the part that we've already run before. Let's go ahead and do that. And now, if we go into our Downloads folder, you can see our brand-new TweetSentimentClassifier.mlmodel. So we've now saved our model.

But while we're still inside the playground, let's test our sentimentClassifier. We can do that using a method called prediction(from:), which takes a String, and that String can be any sort of tweet we might come across in the wild: "@Apple is a terrible company!" Remember that, again, we have to mark it with a "try," and then we can click play to see what we get. The output of this method was the label "Negative," so that's working pretty well.
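Collected together, the saving and quick-test code from this step looks roughly like this. The author and description strings are placeholders, and the output path again points at the Downloads folder used in this lesson.

```swift
// Metadata that gets embedded in the saved .mlmodel file.
// Author and description are placeholder values; fill in your own.
let metadata = MLModelMetadata(author: "Your Name",
                               shortDescription: "A model trained to classify the sentiment of tweets.",
                               version: "1.0")

// Save the trained model next to the training data so it's easy to find.
try sentimentClassifier.write(to: URL(fileURLWithPath: "/Users/angelayu/Downloads/TweetSentimentClassifier.mlmodel"),
                              metadata: metadata)

// Quick sanity check: classify a tweet the model has never seen before.
try sentimentClassifier.prediction(from: "@Apple is a terrible company!")
```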
So now let's try it with a positive-sounding tweet, say something like, "I just found the best restaurant ever, and it's @DuckandWaffle." Most humans would recognize this as a very positive-sounding tweet, and our classifier is working pretty well: it classified that as "Positive" as well. The last one I'm going to try is something that sounds a bit more neutral, so I'll type a tweet that I think sounds kind of neutral. Let's try: "I think @CocaCola ads are just ok." Let's run this and see what we get. And, again, "Neutral."

So our sentimentClassifier is actually predicting pretty accurately on these three tweets that I've just come up with out of the blue. Now, one of the reasons we have a much lower accuracy for our natural language processing model compared to our image classification model is that image classification has less ambiguity. If I see a picture of a giraffe, I will label it as a picture of a giraffe, and I would expect my image recognition model to pick up that it's a picture of a giraffe. But with something like sentiment, whether a piece of text is negative, positive, or neutral, even if you show two people the same tweet, they might well come to different conclusions. Some people might read sarcasm into it, some might be more sensitive, or they might simply interpret the tweet in a different way. So don't be too disheartened if your natural language models come out a little less accurate after training than your image recognition models. It's a very different task, and it's actually a much harder one for a machine to perform. And looking at our sample tweets, it's not doing a bad job at all.

So, in the next lesson, we're going to create our Twittermenti project, add our brand-new sentiment-classifying model to it, and continue building our app.