All right. So this is the very final lesson on Naive Bayes. And in this lesson we're going to implement the Naive Bayes classifier the quick and dirty way: we're going to use the scikit-learn module to do all the heavy lifting for us. And this is why I'm calling this lesson "Bayes: Brisk, Brief and Better". You'll find out why; stick with me.

The very first thing we're going to do, in your Jupyter projects folder, is check inside SpamData/01_Processing that you see this JSON file here: email-text-data.json. This is the data file that we're going to be using in this lesson.

So back in your projects folder, I'd like you to create a new Python 3 notebook, and I'd like you to give it the title "08 Naive Bayes with scikit-learn". Click rename, and then you can go ahead and click View > Toggle Header, and you'll have a bit more screen real estate to play with.

Now, in our first cell we're going to add a couple of import statements: we're going to import numpy as np, and we're going to import pandas as pd. In the next cell we add the string for our path to that JSON file that I showed you earlier. So I'll say DATA_JSON_FILE is equal to, in single quotes, 'SpamData' (this is going to be your folder name), forward slash, '01_Processing', forward slash, 'email-text-data.json'. So this string is going to be our relative path to the resources that you downloaded earlier as part of this module and saved in your project folder.

Now let's import this JSON as a DataFrame, so we'll use pandas for this. I'll store all of that information in a variable called data. So I'll set data equal to pd.read_json; read_json is pandas' method for reading that JSON and converting it into a pandas DataFrame. And then I'm going to pass in the relative path to this JSON file. There we go.

Let's see what this looks like. data.tail() shows the last five rows. Here we see our spam messages, our file names, our labels (which are our categories) and an index here. Now, you might be wondering why I'm using a JSON file instead of something like, say, a CSV, which you could open in Microsoft Excel. And the answer is: I've actually tried this.
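Taken together, here's a sketch of what these first notebook cells might look like, assuming your resource folder is laid out as SpamData/01_Processing exactly as described:

    import numpy as np
    import pandas as pd

    # Relative path to the JSON data file downloaded with this module
    DATA_JSON_FILE = 'SpamData/01_Processing/email-text-data.json'

    # Load the JSON file into a pandas DataFrame
    data = pd.read_json(DATA_JSON_FILE)

    # Peek at the last five rows: message text, file name and category label
    data.tail()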
I took our entire dataset and I put it into a CSV file, and then I tried to open it with a 32-bit version of Microsoft Excel, and that completely failed on my Windows machine. And it kind of shows you that when you're working with larger amounts of data, you quickly run into the limitations of these spreadsheet programs. So in order not to tempt you, I've gone for a JSON instead. That way we're not tempted to open it in a spreadsheet.

Now let's take a look at the shape of our DataFrame. I've got three columns and I've got five thousand seven hundred and ninety-six rows. So this is the number of emails that I've got. And even though I'm looking at my last five rows up here, it's got an index of nine hundred and ninety-nine, meaning we can probably sort our DataFrame, right? If we want it to be in the order of our index. So if I type data.sort_index(inplace=True) (the inplace argument means I'm not storing the result in a different variable) and I hit Shift+Enter on this, and now I look at my tail, then I've got all my indices in order, and I can see that the last row here has the index value five thousand seven hundred and ninety-five.

All right. So much for managing data, importing data and doing general setup. Now, we're actually not going to do any data cleaning and data exploration in this lesson. I've given you a dataset here which allows us to take a shortcut. We're going to dive straight into generating our vocabulary for our Naive Bayes classifier, and here's how we're going to do it. We're going to jump up to our import statements at the top, and then we'll say: from sklearn.feature_extraction.text import CountVectorizer. This is the component that will allow us to generate our vocabulary very, very quickly and very efficiently.

Let me add a few rows here, scroll down, and now I'll create my vectorizer. So I'm going to store this vectorizer in a variable called, well, vectorizer. Let's put that equal to CountVectorizer(), and then I can specify some arguments; I can customize the kind of vectorizer I want. And what I'm going to do is say: use the English stop words. Right, stop_words is equal to, in single quotes, 'english'. And this means that when our vectorizer generates the vocabulary from the email dataset, it will remove the English stop words. So let me hit Shift+Enter on this.
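As a sketch, the sorting cell and the vectorizer cell described above might look like this:

    from sklearn.feature_extraction.text import CountVectorizer

    # Sort the DataFrame by its index, in place (no new variable needed)
    data.sort_index(inplace=True)

    # A vectorizer that drops common English stop words when building the vocabulary
    vectorizer = CountVectorizer(stop_words='english')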
And now that I've got my vectorizer there, I can create my vocabulary, and I can also create my document-term matrix. Now, if you remember, the document-term matrix is what we were laboriously building in the previous lessons. We did a lot of data manipulation and counted how many times particular words, or particular tokens, occurred in our corpus. Here's how we can do all of this in one line. Remember, the individual words were our features, right? So I'll say all_features is equal to (this is where we're going to store my document-term matrix) vectorizer.fit_transform(data.MESSAGE). So what I'm doing here is using the fit_transform method from the vectorizer, and I'm supplying this column here, this MESSAGE column from our DataFrame. Let me hit Shift+Enter on this. It's going to run for a little while, a few seconds longer than all the other cells. But what we get at the end, all_features.shape, is a sparse matrix: we get five thousand seven hundred and ninety-six rows and just a little over a hundred thousand columns. These columns correspond to the tokens in our emails; they correspond to the individual words.

Now, in this line of code our vectorizer has actually already learnt our vocabulary as well, and we can pull this up, so we can take a look at the vocabulary that's present in the vectorizer. So vectorizer.vocabulary_ (with an underscore at the end) will pull it up for us. Here we see the individual words: dear, homeowner, rates, lowest, point, help, best, rate, situation, matching, needs, and so on. This is the vocabulary that will help us determine if an email is spam or not spam. And notice that these actually aren't even stemmed words, right?

So now that we've got our features matrix and our vocabulary, it's time to split and shuffle our training and our test data. And the way we're going to do this is, of course, with our tried and trusted train_test_split method from scikit-learn. So, going up to the top, I can import this and say: from sklearn.model_selection import train_test_split. And if it looks like I'm typing this out really quickly, it's because I'm typing a few of the letters and then hitting Tab on my keyboard to insert the rest.
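Here's a minimal sketch of that one-liner and the follow-up checks, assuming the text column in the DataFrame is named MESSAGE as described:

    # Learn the vocabulary and build the (sparse) document-term matrix in one step
    all_features = vectorizer.fit_transform(data.MESSAGE)

    all_features.shape        # (5796, 100k+): one row per email, one column per token
    vectorizer.vocabulary_    # dict mapping each token to its column index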
So, Shift+Enter, and coming down here I'll create four variables, right: X_train, X_test, y_train and y_test. And those will be equal to the result of train_test_split. As arguments to this method from scikit-learn we're going to supply four values. Well, the first thing we need are our features. The second thing we need are our labels, right? So data.CATEGORY is where our labels are stored; these will read 1 for a spam and 0 for a non-spam email. So this is the column that we're going to use. And then we're going to decide on the size of our training and testing datasets. So, test size: test_size is going to be equal to 0.3, so I'm going to go with a 30 percent test size. And then I'll select a random_state, and I'll just say random_state is equal to 88. So 88 is the number that you also should type in, in case you want to get the same results on the shuffle as myself.

So let me run this, and now we can take a look at the shape of our training and our testing data. So X_train.shape is equal to four thousand and fifty-seven rows and a little over a hundred thousand on the columns. So we've got four thousand and fifty-seven emails, and we've got the rest in X_test: its shape is equal to one thousand seven hundred and thirty-nine. Brilliant. So we're all set to go to train our model.

Now, training our Naive Bayes model could not be easier. The reason being: if we go to the very top once again, from scikit-learn we can actually import a Naive Bayes classifier model. So check it out: from sklearn.naive_bayes import MultinomialNB, a multinomial Naive Bayes. Shift+Enter on this guy, and coming down here, this will allow us to create our model very quickly. All we need to do is store it somewhere, say classifier is equal to MultinomialNB(). That's it; that creates our model for us.

Now that we've got our model, we can train it, right? So classifier.fit(X_train, y_train) trains our model. This is it. The fit method, supplied with two arguments, our training data and our training labels, will completely train our model. Shift+Enter.
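Put together, the split and training cells might look like this sketch, assuming the label column is named CATEGORY as described:

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # 70/30 split; random_state=88 reproduces the same shuffle as in the lesson
    X_train, X_test, y_train, y_test = train_test_split(
        all_features, data.CATEGORY, test_size=0.3, random_state=88)

    # Create and train the multinomial Naive Bayes model
    classifier = MultinomialNB()
    classifier.fit(X_train, y_train)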
And now we've got a trained model. So, how did we do? That's the next question, and I'd really like to throw this over to you as a challenge, because in the previous lessons we've talked a lot about metrics. What I'd like you to do is calculate the following for the test dataset, so X_test and y_test: can you work out the number of documents that we classified correctly, and the number of documents that were classified incorrectly? And finally, I'd like you to work out the accuracy of our Naive Bayes model on the test dataset. I'll give you a few seconds to pause the video and give this a go. I'll see you on the other side.

All right. So let's tackle one of these at a time. The number of correct documents we can work out by comparing the y_test data with what the classifier predicted, right? So, y_test double-equals classifier.predict(X_test): our test data, fed into the predict method from our classifier, and then summed up, will be the number of documents classified correctly.

So the trick for this challenge was googling for the MultinomialNB documentation on scikit-learn. And there, when you scroll down, you can see that under the methods there is a predict method, and this performs the classification on an array of test vectors. And this is exactly what we've done here: we've used our classifier, put a dot after it, called the predict method, supplied our X_test, and this is what we're comparing with our actual values. Because y_test looks like this, and classifier.predict(X_test) looks like this: these are two arrays where we can check, with the double equals sign, if the values match. And then all we need to do is sum up the number of Trues in this comparison to get the number of documents that we predicted correctly. So nr_correct is equal to this. Say we want to print this out, so print... let's use an f-string. So print, f, single quotes, curly braces, nr_correct, then 'documents classified correctly', end single quotes and parentheses. So if I execute this cell here and then execute my print statement, I will get this value here,
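As a sketch, the solution cell for this first part of the challenge might look like this:

    # Element-wise comparison of true labels vs. predictions; each True counts as 1 when summed
    nr_correct = (y_test == classifier.predict(X_test)).sum()
    print(f'{nr_correct} documents classified correctly')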
this variable inserted here in my string, using these curly braces and the f in front of the quotes. So here you can see we've correctly predicted one thousand six hundred and sixty documents. Now, what about the number of documents incorrectly predicted? Right, our nr_incorrect shall be equal to y_test.size (so the number of documents in the test dataset, the number of emails in y_test) minus that one thousand six hundred and sixty, right, our nr_correct. So if we want to print this out using an f-string, I can say 'number of documents incorrectly classified is', curly braces, nr_incorrect. Shift+Enter will show us that seventy-nine documents have been classified incorrectly by our classifier. Brilliant.

So, what does this mean for accuracy? Now that we've worked out the number of documents classified correctly and the number of documents classified incorrectly, we can calculate the fraction that were classified incorrectly, right? So fraction_wrong is equal to nr_incorrect divided by nr_correct plus nr_incorrect. This is the fraction of documents classified incorrectly. So if I wanted to print out the accuracy of the model, I could write something like print, parentheses, f, single quotes, 'the testing accuracy (to be specific) of the model is', curly braces, 1 minus fraction_wrong. And here we see that our model is in fact around ninety-five percent accurate.

Now, if I wanted to format this as a percentage, all I need to do is put my cursor in front of this closing curly brace, put a colon there, then put a dot, then a 2 and then a percent sign, and this will format my percentage to two decimal places. There we go. So that looks quite pretty now, right?

Now, if you were studying this documentation a little more closely, then you might have noticed that we didn't even have to go through all that trouble, because there is in fact a score method which will report our accuracy for us. So we could have also done it this way: had we said classifier.score(X_test, y_test), we would have gotten the same result.

Now, as a follow-up challenge, to see how our model is doing we should really look beyond accuracy, right?
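Here's a consolidated sketch of those accuracy cells, including the :.2% format specifier and the built-in score shortcut:

    # Everything that isn't correct is incorrect
    nr_incorrect = y_test.size - nr_correct
    print(f'Number of documents incorrectly classified is {nr_incorrect}')

    # Fraction misclassified, then accuracy formatted to two decimal places
    fraction_wrong = nr_incorrect / (nr_correct + nr_incorrect)
    print(f'The testing accuracy of the model is {1 - fraction_wrong:.2%}')

    # Same result in one call: mean accuracy on the given test data and labels
    classifier.score(X_test, y_test)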
We talked about this in the previous lessons. So what I'd like you to do is work out the recall, the precision and the F1 score for our classifier. Once again, I encourage you to google for the scikit-learn documentation on this topic to work this out. I'll give you a few seconds to pause the video, and I'll see you on the other side.

So if I go ahead and google "scikit-learn recall precision" and I scroll down to the very first result, then what I see is that there's a brief description of what precision and recall are. But scrolling further down, I can see that this example seems to be talking a little bit more about the precision-recall curve. I'm not after something this fancy. What I'm actually after is just the simple metrics, right: the recall score, the precision score and the F1 score. And these live in sklearn.metrics dot, and then the name of the score. So let's take recall, for example. Here's the detailed description and how to use it, and here's a very quick example: from sklearn.metrics import recall_score. I'll copy that line, and here's how I can use it.

So, coming back to our notebook and scrolling to the very top: I want to import the recall score, but also I want to import the precision score while I'm at it, and I'm also going to import the F1 score. So, three import statements: from sklearn.metrics import recall_score, from sklearn.metrics import precision_score, and from sklearn.metrics import f1_score. And we don't actually have to copy-paste all of these; all three metrics live under sklearn.metrics, meaning we can put a comma here, right, write precision_score, put another comma, and write f1_score. So now we've got one line: from sklearn.metrics import recall_score, precision_score and f1_score. Let me hit Shift+Enter on this, scroll back down, and now it's time to work it out.

Calculating our recall_score just needs two inputs, right? It needs the correct labels, so y_test, and it needs our predictions. And as we've seen before, we can get our predictions using our classifier, using the predict method and supplying our test data, X_test. And what we see is that our recall is around 86 percent. Our precision_score we can get in a very, very similar way, right:
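A sketch of the combined import line and the recall calculation:

    from sklearn.metrics import recall_score, precision_score, f1_score

    # Recall: what fraction of the actual spam did we catch? (~0.86 here)
    recall_score(y_test, classifier.predict(X_test))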
y_test, comma, classifier.predict(X_test). And precision is at around ninety-nine percent. Very high. And finally, our F1 score: same thing. I can even copy the line above, change the name of the method (right, f1_score) and work it out. That's right: 92 percent. So these are our metrics, and they're looking really strong, right? Looking very, very strong. Once we've got our data and we've split it up into our testing and training datasets, training our model, making predictions and working out our metrics is actually super, super quick.

So, the last thing I'm going to show you in this lesson is that now that we've trained our classifier, we can actually do some pretty cool stuff with it, like evaluate some sentences or some emails that we're going to write on the fly, just like that. We're going to try our own example sentences. Since we've trained our classifier already, we can add some sentences or some emails to a list and then check how spammy they really are. Let me show you what I mean.

So I'll add a few cells here, and I'll just call this list example. It's going to contain a couple of strings, right? So the classic one is 'get viagra for free now'. Right. But we can also try 'need a mortgage? Reply to arrange a call with a specialist and get a quote'. That's pretty spammy, right? For the next one, maybe let's try something that isn't very spammy. Maybe something like, uh, 'could you please help me with the project for tomorrow'. Try that one. Then maybe, um... 'Hello Jonathan, how about a game of golf tomorrow?'. I imagine this is how the Monopoly man talks to his friends. And for the last one, mm-hmm, I'm going to go to Wikipedia and just search for a favourite Austrian pastime, namely ski jumping, and I'm going to grab the first couple of sentences here, copy them, come back in here and, in single quotes, paste them all in. And then I'm going to have to hunt around for an apostrophe, because you can tell from the syntax highlighting that this rogue apostrophe here needs escaping, meaning it should be treated as part of the string. I can do that with a backslash. There we go.
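The remaining metric calls and the example list might look like this sketch; the sentences are reconstructed as best as they can be made out from the lesson, and the ski-jumping string is a stand-in for whatever Wikipedia sentences you paste in, with its apostrophes escaped:

    # Precision (~0.99) and F1 (~0.92) take the same two inputs as recall
    precision_score(y_test, classifier.predict(X_test))
    f1_score(y_test, classifier.predict(X_test))

    example = [
        'get viagra for free now',
        'need a mortgage? reply to arrange a call with a specialist and get a quote',
        'could you please help me with the project for tomorrow',
        'hello Jonathan, how about a game of golf tomorrow',
        # Stand-in for the pasted Wikipedia text, apostrophe escaped with a backslash
        'Ski jumping is one of Austria\'s favourite winter pastimes.',
    ]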
So now my string ends here, and I've got my list of example emails, or example sentences, that we can try to make a prediction on using our classifier.

So, how do we do this? Well, first up we need our vectorizer, right? The vectorizer is going to process this new piece of data, right, this list of sentences. So I want to use vectorizer.transform; that's the method to process these emails, and I'll feed in example. So this is the code that will process this list and get it ready for our classifier. I'll tell you what it's going to spit out: a document-term matrix, right? So I can maybe store that under doc_term_matrix, and set that equal to the output from the vectorizer. Right, the next line of code: I'm going to take my classifier, use the predict method and feed in, you guessed it, the doc_term_matrix. And let's see what we get.

The very first sentence was very, very spammy, right? And our classifier actually predicts this sentence to be from a spam email. Same with the second one, and that's because the words "mortgage" and "quote" probably tipped it off. But the third, fourth and fifth entries here are not classified as spam. So, so far so good. I think at this point you can probably try a couple of your own sentences and see how the classifier behaves.

In any case, I hope this lesson was useful and that it kind of rounded off our Naive Bayes module. I really wanted to show you how you might build a Naive Bayes classifier and train it with the power of these libraries, in this case scikit-learn. Now, of course, there are pros and cons to using libraries. You can't just apply them blindly; you have to understand how they work, because there's so much going on under the hood. And this is why we spent a lot of time in the previous lessons covering many of the mechanics and actually built this Naive Bayes classifier from the ground up. That way, these last couple of lines of code don't come across like forbidden magic or something.

So where does this leave us? Well, more and more companies are giving job applicants these kinds of case studies to solve as part of their job interviews.
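And the final two cells, as a sketch:

    # Reuse the SAME fitted vectorizer: transform (not fit_transform) the new sentences
    doc_term_matrix = vectorizer.transform(example)

    # 1 = spam, 0 = non-spam; expect something like array([1, 1, 0, 0, 0])
    classifier.predict(doc_term_matrix)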
If you're in the job market these days, very often you'll be tasked with some sort of data science or machine learning assignment as part of the interview process. And my recommendation is that if you're working on a case study like this, or an assignment, make sure that you can demonstrate to your interviewers that you're not just a copy-paste coder, that you're not just plugging libraries together, but that you truly understand what's going on. And this will be an important aspect of both the work that you're submitting to the company, as well as what you want to show your interviewer when they bring you in to talk about things.

So what's coming up next? Well, the coming modules are going to be really exciting, because in the upcoming modules we're going to be taking this whole classification game up a notch. We're no longer going to classify things into just two categories; we're going to classify amongst many different categories. And to do that, we're going to take the opportunity to talk about another incredibly powerful tool, namely a neural network. Neural networks are super exciting. I'm really looking forward to seeing you in the next lessons. And if you get a chance, go watch some ski jumping. It's pretty cool.