0 1 00:00:00,330 --> 00:00:07,050 Now that we've gathered our data, let's talk about what we're actually going to do in the upcoming lessons. 1 2 00:00:07,050 --> 00:00:11,910 Let's talk about the theory behind the machine learning model that we're gonna use. The machine learning 2 3 00:00:11,910 --> 00:00:17,790 model that I wanna introduce to you in this module is called the Naive Bayes Classifier, 3 4 00:00:18,990 --> 00:00:26,470 and the beauty of this model is its simplicity and its speed. Given that we're looking to classify spam 4 5 00:00:26,500 --> 00:00:32,560 email, speed is actually a really, really nice attribute to have because nobody really wants to wait for 5 6 00:00:32,650 --> 00:00:38,250 ages for some sort of complex neural network to run just to receive an email. 6 7 00:00:38,500 --> 00:00:45,670 The speed of the Naive Bayes model in fact made it one of the most popular machine learning models in 7 8 00:00:45,670 --> 00:00:49,630 spam classification, and it is still heavily used today. 8 9 00:00:49,930 --> 00:00:58,120 Spam classification, along with weather forecasting, is in fact one of the classic applications of the 9 10 00:00:58,210 --> 00:01:01,480 Naive Bayes machine learning model. 10 11 00:01:01,480 --> 00:01:07,960 Now, as I said, the model's speed comes from its simplicity, and the simplicity of this machine learning 11 12 00:01:07,960 --> 00:01:15,580 model will also allow us to build this thing from the ground up in a relatively short amount of time. 12 13 00:01:15,610 --> 00:01:19,200 There's going to be zero magic in how this thing works. 13 14 00:01:19,300 --> 00:01:24,460 Instead of calling a built-in function in scikit-learn, you're actually going to see all the nuts and 14 15 00:01:24,460 --> 00:01:31,870 bolts that go into this machine learning model, because you're gonna be coding it up in Python yourself. 15 16 00:01:32,230 --> 00:01:35,690 But so much for the sales pitch on Naive Bayes. 
16 17 00:01:35,770 --> 00:01:42,730 Let's talk a little bit about how the Naive Bayes Classifier actually works because this will help us 17 18 00:01:43,000 --> 00:01:50,770 plan out the Python code that we're going to write and also put the upcoming coding lessons into context. 18 19 00:01:50,860 --> 00:02:00,640 Now, to decide whether an email is spam or not spam, what the Naive Bayes Classifier does is it will 19 20 00:02:00,640 --> 00:02:10,420 compare two probabilities, and by probability I just mean the chances of something, the likelihood of an 20 21 00:02:10,420 --> 00:02:11,140 event. 21 22 00:02:11,300 --> 00:02:20,650 What our algorithm will do is it will calculate the probability of an email being spam and it will also 22 23 00:02:20,650 --> 00:02:26,710 calculate the probability of an email not being spam, of being legitimate. 23 24 00:02:26,740 --> 00:02:35,770 Now if the probability of an email being spam is higher, then the email will be classified as spam. That's 24 25 00:02:35,770 --> 00:02:43,780 all it is. Our algorithm will basically look at two numbers and then classify an email based on which 25 26 00:02:43,780 --> 00:02:45,210 number's higher. 26 27 00:02:45,310 --> 00:02:48,800 This is what it boils down to behind the scenes. 27 28 00:02:48,950 --> 00:02:55,400 Now one of the things that I find really helpful in thinking about this kind of stuff is to look at 28 29 00:02:55,400 --> 00:02:59,300 these decisions that our algorithm is making visually. 29 30 00:02:59,300 --> 00:03:02,720 So let me illustrate what this would look like in a picture. 30 31 00:03:02,720 --> 00:03:09,380 Suppose we have a chart with an X and a Y axis. On the horizontal, 31 32 00:03:09,380 --> 00:03:13,920 we've got the probability of an email not being spam. 32 33 00:03:13,940 --> 00:03:20,830 And on the vertical we have the probability of an email being spam. In the middle of this chart, 33 34 00:03:20,980 --> 00:03:25,450 we can draw a line. 
This line is where the X equals the Y. 34 35 00:03:25,470 --> 00:03:27,960 This is where the two probabilities are equal. 35 36 00:03:28,350 --> 00:03:33,680 Now in our Naive Bayes model this line in fact has a very special name, 36 37 00:03:33,690 --> 00:03:42,300 this line is called the decision boundary. The model will decide if an email is spam or not spam based 37 38 00:03:42,300 --> 00:03:50,210 on which side of this line an email falls on. Suppose an email comes in. The probability of this email 38 39 00:03:50,210 --> 00:03:54,440 being spam is calculated to be 70 percent. 39 40 00:03:54,460 --> 00:03:57,650 Where would it go on the chart? Right here. 40 41 00:03:57,840 --> 00:04:00,660 It would go far above this dividing line. 41 42 00:04:01,020 --> 00:04:02,280 And, guess what? 42 43 00:04:02,280 --> 00:04:05,700 Because the email is above the decision boundary, 43 44 00:04:05,700 --> 00:04:09,660 the algorithm will classify this email as spam. 44 45 00:04:09,750 --> 00:04:11,850 Makes sense, right? Now, 45 46 00:04:12,240 --> 00:04:19,020 what if an email comes in and it's only got a 40 percent chance of being spam? 46 47 00:04:19,080 --> 00:04:22,230 It would go somewhere around here. 47 48 00:04:22,230 --> 00:04:29,430 Since this email is below the decision boundary, it would be classified as legitimate, not spam, and we 48 49 00:04:29,430 --> 00:04:38,890 can plot every single email on this chart. So this classification step is actually the final step that 49 50 00:04:38,890 --> 00:04:40,400 the algorithm takes. 50 51 00:04:40,540 --> 00:04:48,280 It applies a decision rule based on the probabilities of the email being spam or not being spam. 51 52 00:04:48,550 --> 00:04:56,110 But what we haven't really talked about so far is where these probabilities actually come from. 52 53 00:04:56,110 --> 00:05:00,120 And this is what we're going to cover in the next lesson. 53 54 00:05:00,160 --> 00:05:01,230 I'll see you there.
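As a quick recap of the decision rule from this lesson, here's a minimal sketch in Python. It is not the classifier we'll build in the upcoming lessons, just the final comparison step. The function name `classify` and the variable names are made up for illustration. The lesson only gives the spam probabilities (70 percent and 40 percent), so the matching "not spam" values of 0.30 and 0.60 are assumed complements. Treating a tie as "not spam" is also an assumption, based on the rule that an email is spam only when the spam probability is higher.

```python
def classify(p_spam: float, p_ham: float) -> str:
    """Apply the decision rule: spam only if the spam probability is strictly higher."""
    # Points above the line y = x (p_spam > p_ham) fall on the spam side
    # of the decision boundary; everything else is treated as legitimate.
    return "spam" if p_spam > p_ham else "not spam"


# Each pair is (probability of spam, probability of not spam).
# The second value in each pair is an assumed complement, not from the lesson.
emails = [(0.70, 0.30), (0.40, 0.60)]

labels = [classify(p_spam, p_ham) for p_spam, p_ham in emails]
print(labels)  # ['spam', 'not spam']
```

The first email sits above the decision boundary and is labelled spam, and the second sits below it and is labelled legitimate, which matches the two examples on the chart.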