0
1
00:00:00,330 --> 00:00:07,050
Now that we've gathered our data, let's talk about what we're actually going to do in the upcoming lessons.
1

2
00:00:07,050 --> 00:00:11,910
Let's talk about the theory behind the machine learning model that we're gonna use. The machine learning
2

3
00:00:11,910 --> 00:00:17,790
model that I wanna introduce to you in this module is called the Naive Bayes Classifier"
3

4
00:00:18,990 --> 00:00:26,470
and the beauty of this model is its simplicity and its speed. Given that we're looking to classify spam
4

5
00:00:26,500 --> 00:00:32,560
email, speed is actually a really, really nice attribute to have because nobody really wants to wait for
5

6
00:00:32,650 --> 00:00:38,250
ages for some sort of complex neural network to run just to receive an email.
6

7
00:00:38,500 --> 00:00:45,670
The speed of the Naive Bayes Model in fact made it one of the most popular machine learning models in
7

8
00:00:45,670 --> 00:00:49,630
spam classification and is even heavily used today.
8

9
00:00:49,930 --> 00:00:58,120
Spam classification along with weather forecasting is in fact one of the classic applications of the
9

10
00:00:58,210 --> 00:01:01,480
Naive Bayes machine learning model.
10

11
00:01:01,480 --> 00:01:07,960
Now, as I said, the model's speed comes through its simplicity and the simplicity of this machine learning
11

12
00:01:07,960 --> 00:01:15,580
model will also allow us to build this thing from the ground up in a relatively short amount of time.
12

13
00:01:15,610 --> 00:01:19,200
There's going to be zero magic on how this thing works.
13

14
00:01:19,300 --> 00:01:24,460
Instead of calling a built-in function in scikit-learn, you're actually going to see all the nuts and
14

15
00:01:24,460 --> 00:01:31,870
bolts that go into this machine learning model, because you're gonna be coding it up in Python yourself.
15

16
00:01:32,230 --> 00:01:35,690
But so much for the sales pitch on Naive Bayes.
16

17
00:01:35,770 --> 00:01:42,730
Let's talk a little bit about how the Naive Bayes Classifier actually works because this will help us
17

18
00:01:43,000 --> 00:01:50,770
plan out our Python code that we're going to write and also put the upcoming coding lessons into context.
18

19
00:01:50,860 --> 00:02:00,640
Now to make a decision if an email is spam or not spam what the Naive Bayes Classifier does is it will
19

20
00:02:00,640 --> 00:02:10,420
compare two probabilities, and by probability I just mean the chances of something, the likelihood of an
20

21
00:02:10,420 --> 00:02:11,140
event.
21

22
00:02:11,300 --> 00:02:20,650
What our algorithm will do is it will calculate the probability of an email being spam and it will also
22

23
00:02:20,650 --> 00:02:26,710
calculate the probability of an email not being spam, of being legitimate.
23

24
00:02:26,740 --> 00:02:35,770
Now if the probability of an email being spam is higher, then the email will be classified as spam. That's
24

25
00:02:35,770 --> 00:02:43,780
all it is. Our algorithm will basically look at two numbers and then classify an email based on which
25

26
00:02:43,780 --> 00:02:45,210
number's higher.
26

27
00:02:45,310 --> 00:02:48,800
This is what it boils down to behind the scenes.
27

28
00:02:48,950 --> 00:02:55,400
Now one of the things that I find really helpful in thinking about this kind of stuff is to look at
28

29
00:02:55,400 --> 00:02:59,300
these decisions that our algorithm is making visually.
29

30
00:02:59,300 --> 00:03:02,720
So let me illustrate what this would look like in a picture.
30

31
00:03:02,720 --> 00:03:09,380
Suppose we have a chart with an X and a Y axis. On the horizontal,
31

32
00:03:09,380 --> 00:03:13,920
we've got the probability of an email not being spam.
32

33
00:03:13,940 --> 00:03:20,830
And on the vertical we have the probability of an email being spam. In the middle of this chart,
33

34
00:03:20,980 --> 00:03:25,450
we can draw a line. This line is where the X equals the Y.
34

35
00:03:25,470 --> 00:03:27,960
This is where the two probabilities are equal.
35

36
00:03:28,350 --> 00:03:33,680
Now in our Naive Bayes model this line in fact has a very special name,
36

37
00:03:33,690 --> 00:03:42,300
this line is called the decision boundary. The model will decide if an email is spam or not spam based
37

38
00:03:42,300 --> 00:03:50,210
on which side of this line an email falls on. Suppose an e-mail comes in. The probability of this e-mail
38

39
00:03:50,210 --> 00:03:54,440
being spam is calculated to be 70 percent.
39

40
00:03:54,460 --> 00:03:57,650
Where would it go on the chart? Right here.
40

41
00:03:57,840 --> 00:04:00,660
It would go far above this dividing line.
41

42
00:04:01,020 --> 00:04:02,280
And, guess what?
42

43
00:04:02,280 --> 00:04:05,700
Because the email is above the decision boundary,
43

44
00:04:05,700 --> 00:04:09,660
the algorithm will classify this email as spam.
44

45
00:04:09,750 --> 00:04:11,850
Makes sense, right? Now,
45

46
00:04:12,240 --> 00:04:19,020
what if the e-mail comes in and it's only got a 40 percent chance of being spam?
46

47
00:04:19,080 --> 00:04:22,230
It would go somewhere around here.
47

48
00:04:22,230 --> 00:04:29,430
Since this email is below the decision boundary, it would be classified as legitimate, not spam, and we
48

49
00:04:29,430 --> 00:04:38,890
can plot every single email on this chart. So this classification step is actually the final step that
49

50
00:04:38,890 --> 00:04:40,400
the algorithm takes.
50

51
00:04:40,540 --> 00:04:48,280
It applies a decision rule based on the probabilities of the email being spam or not being spam.
51

52
00:04:48,550 --> 00:04:56,110
But what we haven't really talked about so far is where these probabilities actually come from.
52

53
00:04:56,110 --> 00:05:00,120
And this is what we're going to cover in the next lesson.
53

54
00:05:00,160 --> 00:05:01,230
I'll see you there.