All right. So this is the very final lesson on naive Bayes. In this lesson we're going to implement the naive Bayes classifier the quick and dirty way: we're going to use the scikit-learn module to do all the heavy lifting for us. And this is why I'm calling this lesson "Bayes: brisk, brief, and better." You'll find out. Stick with me.

The very first thing we're going to do: in your Jupyter projects folder, check inside SpamData/01_Processing that you see this JSON file here, email-text-data.json. This is the data file that we're going to be using in this lesson. So back in your projects folder, I'd like you to create a new Python 3 notebook and give it the title 08 Naive Bayes with scikit-learn. Click Rename, and then you can go ahead and choose View, Toggle Header, and you'll have a bit more screen real estate to play with.

Now in our first cell we're going to add a couple of import statements: import numpy as np and import pandas as pd. In the next cell we add the string for our path to that JSON file that I showed you earlier. So I'll say DATA_JSON_FILE is equal to, in single quotes, SpamData, so this is going to be your folder name, forward slash 01_Processing, forward slash email-text-data.json. This string is our relative path to the resources that you downloaded earlier as part of this module and saved in your projects folder.

Now let's import this JSON as a DataFrame. We'll use pandas for this. I'll store all of that information in a variable called data, so I'll set data equal to pd.read_json. read_json is the pandas function for reading that JSON and converting it into a pandas DataFrame. Then I'm going to pass in the relative path to this JSON file. There we go. Let's see what this looks like: data.tail() shows the last five rows.
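As a rough sketch of the loading step just described, here is pd.read_json applied to a tiny in-memory stand-in for the real file. The three rows, the file names, and the column names MESSAGE, CATEGORY, and FILE_NAME are assumptions made up for illustration; in the notebook you would pass the relative path to email-text-data.json instead.

```python
import io

import pandas as pd

# Toy stand-in for email-text-data.json. The rows and the column names
# are assumptions for illustration, not the real dataset.
toy_json = io.StringIO(
    '{"MESSAGE": {"0": "Get rich quick", "1": "Lunch at noon?", "2": "Win a prize now"},'
    ' "CATEGORY": {"0": 1, "1": 0, "2": 1},'
    ' "FILE_NAME": {"0": "a.txt", "1": "b.txt", "2": "c.txt"}}'
)

# read_json turns the JSON into a DataFrame (one column per top-level key)
data = pd.read_json(toy_json)

print(data.tail(2))   # last two rows
print(data.shape)     # (rows, columns)
```

Passing a file path works the same way as passing the in-memory buffer used here.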
Here we see our email messages, our file names, our labels, which are our categories, and an index. Now you might be wondering why I'm using a .json file instead of something like, say, a CSV, which you could open in Microsoft Excel. And the answer is I've actually tried this. I took our entire dataset and put it into a CSV file, and then I tried to open it with a 32-bit version of Microsoft Excel, and that completely failed on my Windows machine. It kind of shows you that when you're working with larger amounts of data, you quickly run into the limitations of these spreadsheet programs. So I've gone for JSON instead. That way we're not tempted to open it in a spreadsheet.

Now let's take a look at the shape of our DataFrame with data.shape. I've got three columns and I've got 5,796 rows, so that's the number of emails that I've got. And even though I'm looking at my last five rows up here, the last one has an index of 999, meaning the rows aren't in index order, and we should probably sort our DataFrame if we want it to be in the order of our index. So if I type data.sort_index(inplace=True), which means I'm modifying data directly rather than storing the result in a different variable, hit Shift+Enter on this, and look at my tail again, then I've got all my indices in order, and I can see that the last row has the index value 5,795.

All right, so much for importing data and general setup. We're actually not going to do any data cleaning or data exploration in this lesson. I've given you a dataset here which allows us to take a shortcut. We're going to dive straight into generating the vocabulary for our Bayes classifier, and here's how we're going to do it. I'm going to jump up to our import statements at the top and add: from sklearn.feature_extraction.text import CountVectorizer.
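Before moving on to the vectorizer, the index sort from a moment ago can be sketched on a miniature frame. The frame below is hypothetical and only there to show what sort_index with inplace=True does to the row order.

```python
import pandas as pd

# Hypothetical miniature frame whose index is out of order
data = pd.DataFrame(
    {"MESSAGE": ["c", "a", "b"], "CATEGORY": [0, 1, 0]},
    index=[2, 0, 1],
)

# Reorder the rows by index label, modifying data itself
# instead of returning a sorted copy
data.sort_index(inplace=True)

print(data.tail(1))   # the last row now carries the largest index
```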
CountVectorizer is the component that will allow us to generate our vocabulary very quickly and very efficiently. Let me add a few cells here, scroll down, and now I'll create my vectorizer. I'm going to store this vectorizer in a variable called, well, vectorizer. Let's put that equal to CountVectorizer with parentheses, and then I can specify some arguments to customize the kind of vectorizer I want. What I'm going to do is say: use the English stop words. So stop_words is equal to, in single quotes, english. This means that when our vectorizer generates the vocabulary from the email dataset, it will remove the English stop words. So let me hit Shift+Enter on this.

Now that I've got my vectorizer, I can create my vocabulary and I can also create my document-term matrix. If you remember, the document-term matrix is what we were laboriously building in the previous lessons. We did a lot of data manipulation and counting of how many times particular words, or particular tokens, occurred in our corpus. Here's how we can do all of this in one line. Remember, the individual words were our features, right? So I'll write all_features is equal to, and this is where we're going to store our document-term matrix, vectorizer.fit_transform(data.MESSAGE). What I'm doing here is using the fit_transform method of the vectorizer and supplying the MESSAGE column from our DataFrame. Let me hit Shift+Enter on this. It's going to run for a little while, a few seconds longer than all the other cells. What we get at the end, as all_features.shape tells us, is a sparse matrix with 5,796 rows and just a little over a hundred thousand columns. These columns correspond to the tokens in our emails; they correspond to the individual words.
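To make the fit_transform step concrete, here is a minimal sketch on three made-up messages instead of the real email column. The sentences are invented for illustration; the calls themselves (CountVectorizer with stop_words='english', fit_transform, and the vocabulary_ attribute) are the ones used in the lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages standing in for the MESSAGE column
messages = [
    "Dear homeowner, lowest mortgage rates",
    "Lunch at noon with the project team",
    "Lowest rates, best rates",
]

# Drop common English stop words such as "the" and "with"
vectorizer = CountVectorizer(stop_words="english")

# Learn the vocabulary and build the sparse document-term matrix in one call
all_features = vectorizer.fit_transform(messages)

print(all_features.shape)         # (number of documents, number of tokens)
print(vectorizer.vocabulary_)     # token -> column index

# Count of the token "rates" in the third message
rates_column = vectorizer.vocabulary_["rates"]
print(all_features[2, rates_column])
```

Each row is one document and each column is one token, so a single cell holds how often that token occurs in that document.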
Now in this line of code our vectorizer has actually already learnt our vocabulary as well, and we can take a look at the vocabulary that's present in the vectorizer. So vectorizer.vocabulary_, with an underscore at the end, will pull this up for us. Here we see the individual words: dear, homeowner, rates, lowest, point, help, best, rate, situation, matching, needs, and so on. This is the vocabulary that will help us determine if an email is spam or not spam. And notice that these actually aren't even stemmed words, right?

So now that we've got our features matrix and our vocabulary, it's time to split and shuffle our training and our test data. The way we're going to do this is of course with our tried and trusted train_test_split function from scikit-learn. Going up to the top, I can import this and say from sklearn.model_selection import train_test_split. And if it looks like I'm typing this out really quickly, it's because I'm typing a few of the letters and then hitting Tab on my keyboard to insert the rest. So, Shift+Enter.

Coming down here, I'll create four variables: X_train, X_test, y_train, and y_test, and those will be equal to the result of train_test_split. As arguments to this function from scikit-learn we're going to supply four values. The first thing we need are our features, all_features. The second thing we need are our labels, so data.CATEGORY is where our labels are stored. These will read 1 for spam and 0 for a non-spam email. So this is the column that we're going to use. And then we're going to decide on the size of our training and testing datasets. So test_size is going to be equal to 0.3.
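Here is a small sketch of the split on placeholder data, so you can see how test_size turns into row counts. The ten-row feature array and the labels are invented; the random_state of 88 is the seed the lesson settles on for the fourth argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels: 10 samples, 2 feature columns
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Shuffle, then hold out 30 percent of the rows for testing.
# A fixed random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=88
)

print(X_train.shape, X_test.shape)
```

The test share is rounded up to a whole number of rows, so with 5,796 emails a test_size of 0.3 gives 1,739 test rows and leaves 4,057 for training.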
So I'm going to go with a 30 percent test size, and then I'll select a random_state. I'll just say random_state is equal to 88, so 88 is the number that you should also type in if you want to get the same results on the shuffle as me. Let me run this, and now we can take a look at the shape of our training and our testing data. X_train.shape is equal to 4,057 rows and a little over a hundred thousand columns. So we've got 4,057 emails for training, and we've got the rest for testing: X_test.shape shows 1,739 rows. Brilliant. So we're all set to train our model.

Now, training our naive Bayes model could not be easier. The reason being, if we go to the very top once again, from sklearn we can actually import a naive Bayes classifier model. So check it out: from sklearn.naive_bayes import MultinomialNB, multinomial naive Bayes. Shift+Enter on this guy, and coming down here, this will allow us to create our model very quickly. All we need to do is store it somewhere, say classifier is equal to MultinomialNB with parentheses. That's it. That creates our model for us. Now that we've got our model, we can train it, right? So classifier.fit(X_train, y_train) trains our model. This is it. The fit method, supplied with two arguments, our training data and our training labels, will completely train our model. Shift+Enter, and now we've got a trained model.

So how did we do? That's the next question, and I'd really like to throw this over to you as a challenge, because in the previous lessons we've talked a lot about metrics. What I'd like you to do is calculate the following for the test dataset, so X_test and y_test. Can you work out the number of documents that were classified correctly and the number of documents that were classified incorrectly?
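While you think about the challenge, here is the training step from above as a minimal sketch on four invented messages. The messages and labels are assumptions for illustration only; the pattern of creating MultinomialNB() and calling fit with features and labels is the one from the lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Four invented training messages: 1 = spam, 0 = not spam
train_messages = [
    "win free money now",
    "free prize claim now",
    "meeting about the project tomorrow",
    "can we have lunch tomorrow",
]
train_labels = [1, 1, 0, 0]

# Turn the text into a document-term matrix
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_messages)

# Create the model, then train it on features and labels
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

print(classifier.classes_)            # the labels the model has learnt
print(classifier.predict(X_train))    # predictions on the training rows
```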
And finally, I'd like you to work out the accuracy of our naive Bayes model on the test dataset. I'll give you a few seconds to pause the video and give this a go. I'll see you on the other side.

All right, let's tackle these one at a time. The number of correct documents we can work out by comparing the y_test data with what the classifier predicted, right? So y_test == classifier.predict(X_test), our test data fed into the predict method of our classifier, in parentheses and then summed up with .sum(): this will be the number of documents classified correctly. The trick for this challenge was googling for the MultinomialNB documentation on scikit-learn. There, when you scroll down, you can see that under the methods there is a predict method, and this performs the classification on an array of test vectors. And this is exactly what we've done here. We've used our classifier, put a dot after it, called the predict method, supplied our X_test, and this is what we're comparing with our actual values. Because y_test looks like this, and classifier.predict(X_test) looks like this. They are two arrays, where we can check with the double equals sign if the values match, and then all we need to do is sum up the number of Trues in this comparison to get the number of documents that we predicted correctly. So nr_correct is equal to this sum.

Say we want to print this out. Let's use an f-string: print, parentheses, f, single quotes, curly braces, nr_correct, then "documents classified correctly", closing single quote and closing parenthesis. So if I execute this cell here and then execute my print statement, I will get this value here.
This variable gets inserted into my string using these curly braces and the f in front of the quotes. So here you can see we've correctly predicted 1,660 documents. Now what about the number of documents incorrectly predicted? nr_incorrect shall be equal to y_test.size, so the number of documents in the test dataset, the number of emails in y_test, minus 1,660, right, our nr_correct. If we want to print this out using an f-string, I can say: number of documents incorrectly classified is, curly braces, nr_incorrect. Shift+Enter will show us that 79 documents have been classified incorrectly by our classifier. Brilliant.

So what does this mean for accuracy? Now that we've worked out the number of documents classified correctly and the number classified incorrectly, we can calculate the fraction that were classified incorrectly. So fraction_wrong is equal to nr_incorrect divided by nr_correct plus nr_incorrect, with the sum in parentheses. This is the fraction of documents classified incorrectly. So if I wanted to print out the accuracy of the model, I could write something like print, and in parentheses, f, single quotes: the (testing) accuracy, to be specific, of the model is, curly braces, 1 minus fraction_wrong. And here we see that our model is in fact around 95 percent accurate. Now if I wanted to format this as a percentage, all I need to do is put my cursor in front of this closing curly brace, put a colon there, then a dot, then a 2, and then a percent sign, and this will format my value as a percentage with two decimal places. There we go. So that looks quite pretty right now.
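The whole challenge solution can be sketched with two small arrays. The label and prediction values below are made up so the arithmetic is easy to check by hand; in the notebook, predictions would come from classifier.predict(X_test).

```python
import numpy as np

# Made-up true labels and predictions for eight test emails
y_test = np.array([1, 0, 0, 1, 0, 1, 0, 0])
predictions = np.array([1, 0, 0, 0, 0, 1, 1, 0])

# Element-wise comparison gives True/False; summing counts the Trues
nr_correct = (y_test == predictions).sum()
nr_incorrect = y_test.size - nr_correct
fraction_wrong = nr_incorrect / (nr_correct + nr_incorrect)

print(f'{nr_correct} documents classified correctly')
print(f'Number of documents incorrectly classified is {nr_incorrect}')
# The :.2% format spec multiplies by 100 and shows two decimal places
print(f'The (testing) accuracy of the model is {1 - fraction_wrong:.2%}')
```

With the lesson's own counts, 1,660 correct out of 1,739 test emails works out to roughly 95.46 percent.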
If you were studying this documentation a little more closely, then you might have noticed that we didn't even have to go through all that trouble, because there is in fact a score method which will report our accuracy for us. So we could have also done it this way: classifier.score(X_test, y_test) would have gotten us the same result.

Now, as a follow-up challenge: to see how our model is doing, we should really look beyond accuracy, right? We talked about this in the previous lessons. So what I'd like you to do is work out the recall, the precision, and the F1 score for our classifier. Once again, I encourage you to google for the scikit-learn documentation on this topic to work this out. I'll give you a few seconds to pause the video, and then I'll see you on the other side.

So if I go ahead and google scikit-learn recall precision and go to the very first result, then what I see is that there's a brief description of what precision and recall are. But scrolling further down, I can see that this example seems to be talking a little bit more about the precision-recall curve. I'm not after something this fancy. What I'm actually after is just a simple metric, right? The recall score, the precision score, and the F1 score. And these live in sklearn.metrics, followed by a dot and then the name of the score. So let's take recall, for example. Here's the detailed description and how to use it. And here's a very quick example: from sklearn.metrics import recall_score. I'll copy that line, and here's how I can use it. So coming back to our notebook and scrolling to the very top, I want to import the recall score, but I also want to import the precision score while I'm at it, and I'm also going to import the F1 score. So that's three import statements from sklearn.metrics.
That's from sklearn.metrics import recall_score, from sklearn.metrics import precision_score, and from sklearn.metrics import f1_score. And we don't actually have to copy and paste all of these. All three metrics live under sklearn.metrics, meaning we can put a comma here, write precision_score, put another comma, and write f1_score. So now we've got one line: from sklearn.metrics import recall_score, precision_score, f1_score. Let me hit Shift+Enter on this, scroll back down, and now it's time to work it out.

Calculating our recall with recall_score just needs two inputs, right? It needs the correct labels, so y_test, and it needs our predictions. And as we've seen before, we can get our predictions using our classifier's predict method and supplying our test data, X_test. What we see is that our recall is around 86 percent. Our precision_score we can get in a very similar way: y_test, comma, classifier.predict(X_test), and precision is at around 99 percent, very high. And finally our F1 score. Same thing. I can even copy the line above, change the name of the function to f1_score, and work it out. That's right: 92 percent. So these are our metrics, and they're looking really strong, right? Looking very, very strong.
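Here is a small sketch of the three metric calls on hand-made labels, chosen so that precision comes out high and recall lower, a bit like in our results. The eight labels and predictions are invented; in the notebook the second argument would be classifier.predict(X_test).

```python
from sklearn.metrics import recall_score, precision_score, f1_score

# Invented labels: four spam emails (1) and four non-spam emails (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# The model catches two of the four spam emails and raises no false alarms
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

# Each function takes the true labels first, then the predictions
recall = recall_score(y_true, y_pred)        # caught spam / all spam
precision = precision_score(y_true, y_pred)  # real spam / flagged as spam
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f'Recall: {recall:.2%}')
print(f'Precision: {precision:.2%}')
print(f'F1 score: {f1:.2%}')
```

Here two true positives, two false negatives, and no false positives give a recall of 0.5, a precision of 1.0, and an F1 score of two thirds.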
Once we've got our data and we've split it up into our testing and training datasets, training our model, making predictions, and working out our metrics is actually super, super quick. So the last thing I'm going to show you in this lesson is that, now that we've trained our classifier, we can actually do some pretty cool stuff with it, like evaluate some sentences or some emails that we write on the fly. Just like that. We're going to try our own example sentences. Since we've trained our classifier already, we can add some sentences or some emails to a list and then check how spammy they really are.

Let me show you what I mean. So I add a few cells here, and I'll just call this list example, and it's going to contain a couple of strings, right? The classic one is "get viagra for free now". Right. But we can also try "need a mortgage? Reply to arrange a call with a specialist and get a quote". That's pretty spammy, right? For the next one, maybe let's try something that isn't very spammy. Maybe something like "Could you please help me with the project for tomorrow?" Try that one. Then maybe: "Hello Jonathan, how about a game of golf tomorrow?" I imagine this is how the Monopoly man talks to his friends. And then there's the last one.
For that one, I'm going to go to Wikipedia and just search for a favorite Austrian pastime, namely ski jumping, and I'm going to grab the first couple of sentences here. Copy them, come back in here, and in single quotes paste them all in. Then I'm going to have to hunt around for an apostrophe, because you can tell from the syntax highlighting that this rogue apostrophe here needs escaping, meaning it should be treated as part of the string rather than as the end of it. I can do that with a backslash. There we go. So now my string ends here, and I've got my list of example emails, or example sentences, that we can try to make a prediction on using our classifier.

So how do we do this? Well, first up we need our vectorizer, right? The vectorizer is going to process this new piece of data, this list of sentences. So I want to use vectorizer.transform. That's the method to process these emails, and I'll feed in example. This is the code that will process this list and get it ready for our classifier. I'll tell you what it's going to spit out: a document-term matrix, right? So I can store that under doc_term_matrix and set that equal to the output from the vectorizer. That brings us to the next line of code.
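The two steps here, transforming new text with the already fitted vectorizer and then calling predict on the result, can be sketched end to end on toy data. The four training messages and their labels are invented stand-ins for the real dataset, so the predictions only illustrate the mechanics, not the behaviour of the model trained in the lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: 1 = spam, 0 = not spam
train_messages = [
    "get free pills now",
    "free mortgage quote today",
    "please review the project report",
    "golf game tomorrow morning",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer(stop_words="english")
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_messages), train_labels)

# New sentences written on the fly
example = [
    "get a free mortgage quote",
    "could you please help me with the project for tomorrow",
]

# transform (not fit_transform) reuses the vocabulary learnt during training,
# so the columns line up with what the classifier expects
doc_term_matrix = vectorizer.transform(example)
predictions = classifier.predict(doc_term_matrix)

print(predictions)
```

Calling fit_transform on the new sentences instead would learn a fresh vocabulary, and the resulting columns would no longer match the ones the classifier was trained on.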
I'm going to take my classifier, use the predict method, and feed in, you guessed it, the doc_term_matrix, and let's see what we get. The very first sentence was very, very spammy, right? And our classifier actually predicts this sentence to be from a spam email. Same with the second one, and that's because the words mortgage and quote probably tipped it off. But the third, fourth, and fifth entries here are not classified as spam. So far so good. I think at this point you can probably try a couple of your own sentences and see how the classifier behaves.

In any case, I hope this lesson was useful and that it rounded off our naive Bayes module. I really wanted to show you how you might build a naive Bayes classifier and train it with the power of these libraries, in this case sklearn. Now of course there are pros and cons to using libraries. You can't just apply them blindly. You have to understand how they work, because there's so much going on under the hood, and this is why we spent a lot of time in the previous lessons covering many of the mechanics and actually built this naive Bayes classifier from the ground up. That way, these last couple of lines of code don't come across like forbidden magic or something. So where does this leave us?
Well, more and more companies are giving job applicants these case studies to solve as part of their job interviews. If you're in the job market these days, very often you'll be tasked with some sort of data science or machine learning assignment as part of the interview process. And my recommendation is that if you're working on a case study like this, or an assignment, make sure that you can demonstrate to your interviewers that you're not just a copy-paste coder, that you're not just plugging libraries together, but that you truly understand what's going on. This will be an important aspect of both the work that you're submitting to the company and what you want to show your interviewer when they bring you in to talk about things.

So what's coming up next? Well, the coming modules are going to be really exciting, because we're going to be taking this whole classification game up a notch. We're no longer going to classify things into just two categories. We're going to classify amongst many different categories, and to do that we're going to take the opportunity to talk about another incredibly powerful tool, namely a neural network. Neural networks are super exciting. Really looking forward to seeing you in the next lessons. And if you get a chance, go watch some ski jumping. It's pretty cool.