1
00:00:00,350 --> 00:00:00,840
All right.

2
00:00:00,870 --> 00:00:08,700
So, upwards and onwards. In this lesson we're going to make our predictions based on comparing the two

3
00:00:08,700 --> 00:00:12,850
joint probabilities that we calculated in the last lesson.

4
00:00:12,900 --> 00:00:20,350
This really is the meat of the Naive Bayes classifier. As part of the Naive Bayes classifier algorithm,

5
00:00:20,460 --> 00:00:29,010
we will be comparing the two probabilities and classifying each email based on which probability is higher.

6
00:00:29,010 --> 00:00:32,830
Let's add a section heading here to commemorate this.

7
00:00:33,150 --> 00:00:45,410
"Making Predictions", and for the subheading I'll write "Checking for the higher joint probability".

8
00:00:45,420 --> 00:00:51,270
Now I'm also gonna add some LaTeX markdown as a note to ourselves, and maybe our future selves, when

9
00:00:51,270 --> 00:00:56,380
we come back and look at this again to remind us as to what the code that we're going to write is gonna

10
00:00:56,400 --> 00:01:06,300
do. So I'll write two dollar signs, and then I'll put P, parentheses, spam, and then for some spacing I'll add

11
00:01:06,300 --> 00:01:08,540
the backslash and the comma.

12
00:01:08,550 --> 00:01:17,940
Then the pipe symbol, then another backslash and a comma, then an X, closing parenthesis, backslash comma, greater than, backslash

13
00:01:17,940 --> 00:01:32,910
comma, P, parentheses, ham, backslash comma, pipe, backslash comma, X, closing parenthesis, and two dollar signs.

14
00:01:33,390 --> 00:01:38,710
Maybe I'll even capitalize the S here, and I'll capitalize the X here.

15
00:01:39,380 --> 00:01:41,640
All right so that makes it pretty explicit.

16
00:01:41,640 --> 00:01:47,070
We're going to be comparing whether the probability that an email is spam given its tokens is greater

17
00:01:47,070 --> 00:01:52,830
than the probability that it is non-spam given the tokens.

18
00:01:52,830 --> 00:01:55,350
The converse of course is also true.

19
00:01:55,350 --> 00:01:58,680
So I'm gonna copy this line, paste it in,

20
00:01:58,980 --> 00:02:10,800
change this to "less than", and maybe put an "or" in between: star, or, star. It would be quite nice, though,

21
00:02:10,800 --> 00:02:14,800
if this was centered instead of being aligned to the left.

22
00:02:15,000 --> 00:02:17,590
So I'll add some HTML markup here.

23
00:02:17,730 --> 00:02:26,040
Angle bracket, center, closing angle bracket, and then on the other side I'll have an opening angle bracket,

24
00:02:26,420 --> 00:02:35,400
forward slash, center, closing angle bracket. And I'll add a little line break here as well with the

25
00:02:35,400 --> 00:02:39,050
letters b r enclosed by angle brackets.

26
00:02:39,150 --> 00:02:39,660
There we go.

27
00:02:40,350 --> 00:02:43,620
So these two lines of LaTeX notation summarize what we're gonna do.
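As a rough sketch, the two lines dictated above would look something like this inside the notebook's markdown cell. The capitalized "Ham" is an assumption, and the centering tags and the line break described above are left out here to keep the math readable.

```latex
$$P(Spam \, | \, X) \, > \, P(Ham \, | \, X)$$

$$P(Spam \, | \, X) \, < \, P(Ham \, | \, X)$$
```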

28
00:02:44,540 --> 00:02:49,470
And because we've done so much legwork, making these predictions is actually not that hard at this

29
00:02:49,470 --> 00:02:50,490
stage.

30
00:02:50,490 --> 00:02:55,260
And I think it actually makes for a nice mini challenge, so I'd like you to give this a try.

31
00:02:55,260 --> 00:03:01,020
Can you create the vector of predictions, our y hat?

32
00:03:01,360 --> 00:03:08,860
Now remember that the spam emails in this vector should have the value 1 or True, and the non-spam, the

33
00:03:08,860 --> 00:03:17,070
ham emails, should have the value 0 or False. I'd like you to store your results in a variable called

34
00:03:17,280 --> 00:03:22,600
prediction. I'll give you a few seconds to pause the video and have a go.

35
00:03:25,370 --> 00:03:26,380
Ready?

36
00:03:26,390 --> 00:03:27,950
Here's the solution.

37
00:03:27,950 --> 00:03:35,420
We're gonna be making use of this variable here which holds on to the joint probability that an email

38
00:03:35,420 --> 00:03:44,120
is non-spam, or ham, given its tokens. And we're gonna be making use of this variable here, which holds

39
00:03:44,120 --> 00:03:50,570
on to the joint probability that an email is spam given its tokens. And the way we're going to compare

40
00:03:50,570 --> 00:04:00,770
these two is simply by writing prediction equals joint underscore log underscore spam greater

41
00:04:00,770 --> 00:04:04,610
than joint underscore log underscore

42
00:04:04,750 --> 00:04:06,850
ham, and that's it.
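Here is a minimal sketch of that one-line comparison, assuming the two joint log probabilities are NumPy arrays with one entry per email. The numbers are made up for illustration and are not the values from the lesson's notebook.

```python
import numpy as np

# Hypothetical joint log probabilities for four emails (illustrative values only)
joint_log_spam = np.array([-12.3, -40.1, -8.7, -25.0])
joint_log_ham = np.array([-15.0, -35.2, -9.9, -20.4])

# Element-wise comparison: True where the spam score is the higher of the two
prediction = joint_log_spam > joint_log_ham
print(prediction)
```

Because the comparison is element-wise, the result is a boolean array with the same length as the inputs.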

43
00:04:07,540 --> 00:04:11,160
Let's peek inside our prediction vector.

44
00:04:11,410 --> 00:04:13,630
Let's look at the last five results.

45
00:04:13,780 --> 00:04:19,630
So prediction square brackets minus five colon.

46
00:04:19,930 --> 00:04:25,000
Those are equal to False, False, False, False and False.

47
00:04:25,000 --> 00:04:31,890
In other words, the last five emails in our prediction vector are all non-spam emails.

48
00:04:31,900 --> 00:04:36,730
Let's compare this to the last five emails in y underscore test.

49
00:04:37,630 --> 00:04:39,460
Now here are the actual labels.

50
00:04:39,490 --> 00:04:47,860
These are the actual classifications and what we see is five zeros meaning the last five emails are

51
00:04:47,860 --> 00:04:54,340
actually all non-spam emails, and these five predictions that we looked at were in fact correct.
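The spot check above can be sketched like this. Both arrays here are small hypothetical stand-ins, not the real prediction vector or the real y_test from the notebook.

```python
import numpy as np

# Hypothetical stand-ins for the prediction vector and the true labels
prediction = np.array([True, False, False, False, False, False])
y_test = np.array([1, 0, 0, 0, 0, 0])

# Slice off the last five entries of each
last_predictions = prediction[-5:]
last_labels = y_test[-5:]
print(last_predictions)
print(last_labels)

# NumPy treats True as 1 and False as 0 when comparing booleans with integers
matches = last_predictions == last_labels
print(matches.all())
```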

52
00:04:54,340 --> 00:05:02,110
Now if you don't fancy looking at integers here and booleans here, all you have to do to convert a boolean

53
00:05:02,110 --> 00:05:06,480
to an integer is to multiply by 1.
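A small sketch of that conversion, using a made-up boolean array. The astype call at the end is an alternative that is not mentioned in the lesson; it gives the same values.

```python
import numpy as np

# Hypothetical boolean predictions
prediction = np.array([False, True, False])

# Multiplying by 1 turns False into 0 and True into 1
as_int = prediction * 1
print(as_int)

# Alternative (not from the lesson): an explicit cast gives the same values
as_int_cast = prediction.astype(int)
```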

54
00:05:06,520 --> 00:05:08,140
Now I want to show you something else.

55
00:05:08,260 --> 00:05:11,830
It turns out that we can actually simplify the calculation a little bit.

56
00:05:11,830 --> 00:05:18,790
Looking at our equation, we're making a comparison between this quantity here and this quantity here. Looking

57
00:05:18,790 --> 00:05:27,460
at our formulas above, we can see that the probability of ham given X is expressed by this fraction, where

58
00:05:27,460 --> 00:05:35,500
we're dividing by p of x in the denominator and the probability of spam given X is given by this fraction

59
00:05:35,740 --> 00:05:42,550
where we are also dividing by P of X in the denominator. Since we're doing this on both sides of the

60
00:05:42,550 --> 00:05:45,910
equation effectively we're doing this for both quantities.

61
00:05:45,910 --> 00:05:49,710
We can actually remove this bit and still get the same results.

62
00:05:49,990 --> 00:05:56,470
In the case of our code, this P of X, this bottom part where we're dividing by the probability of a token

63
00:05:56,470 --> 00:06:02,530
occurring, is given by this: minus np dot log of the probability of all tokens.

64
00:06:02,650 --> 00:06:08,590
We took the log of the probabilities, remember, so instead of dividing we're actually subtracting. So I

65
00:06:08,590 --> 00:06:12,860
can take this line here and copy it.

66
00:06:12,940 --> 00:06:23,280
Add a little markdown, call it "Simplify", paste it in, and I'll grab this line here and also paste it in, and

67
00:06:23,280 --> 00:06:30,970
then what I'll do is I'll delete the same section of code on both lines.

68
00:06:30,970 --> 00:06:39,010
The part where we're subtracting the probability of a token occurring. And the reason I can do this is because

69
00:06:39,200 --> 00:06:44,950
we're after being able to predict the y. We're interested in being able to predict whether we have

70
00:06:44,950 --> 00:06:47,510
a spam email or a non-spam email.

71
00:06:47,650 --> 00:06:53,140
This prediction actually does not depend on the probability of a token occurring.

72
00:06:53,140 --> 00:06:55,600
This is why I can do this simplification.
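To see why the simplification is safe, here is a minimal sketch with made-up per-email log values. All of the names and numbers below are hypothetical and only loosely mirror the notebook's code; the point is that the same log P(X) term is subtracted from both scores.

```python
import numpy as np

# Hypothetical per-email log likelihoods and log evidence (illustrative values only)
log_lik_spam = np.array([-30.0, -45.5, -12.25])  # log P(X | Spam)
log_lik_ham = np.array([-28.0, -50.0, -12.0])    # log P(X | Ham)
log_p_x = np.array([-27.0, -44.0, -11.5])        # log P(X), identical for both classes
prob_spam = 0.3                                   # hypothetical prior P(Spam)

# Full version: subtract log P(X) from both scores
full_spam = log_lik_spam - log_p_x + np.log(prob_spam)
full_ham = log_lik_ham - log_p_x + np.log(1 - prob_spam)

# Simplified version: drop the shared log P(X) term
simple_spam = log_lik_spam + np.log(prob_spam)
simple_ham = log_lik_ham + np.log(1 - prob_spam)

# The scores differ, but the comparison comes out the same
same_predictions = np.array_equal(full_spam > full_ham, simple_spam > simple_ham)
print(same_predictions)
```

The individual scores change when the term is dropped, yet every email still lands on the same side of the comparison.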

73
00:06:55,600 --> 00:06:58,940
Now let me add some LaTeX code to make this a little bit more clear.

74
00:06:59,140 --> 00:07:03,680
And this is because, you know, the thing is, mathematically, those two lines, right,

75
00:07:03,850 --> 00:07:09,880
the one where we're subtracting the log probability of the tokens and the one where we're not doing that, are actually

76
00:07:09,880 --> 00:07:11,020
not the same.

77
00:07:11,170 --> 00:07:17,050
You won't actually have the same numbers in joint underscore log underscore spam and in joint underscore

78
00:07:17,050 --> 00:07:18,470
log underscore ham.

79
00:07:18,580 --> 00:07:25,120
However, their relationship to each other is unchanged. Even though these two quantities are not equal

80
00:07:25,180 --> 00:07:26,670
to each other mathematically,

81
00:07:27,120 --> 00:07:32,160
the simplification is still valid because it doesn't change our predictions.
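Written out in log form, and assuming the usual statement of Bayes' theorem, the argument looks roughly like this. The same term, log P(X), appears on both sides of the comparison, so it cancels.

```latex
\begin{align*}
\log P(Spam \,|\, X) &= \log P(X \,|\, Spam) + \log P(Spam) - \log P(X) \\
\log P(Ham \,|\, X)  &= \log P(X \,|\, Ham) + \log P(Ham) - \log P(X)
\end{align*}

$$\log P(Spam \,|\, X) > \log P(Ham \,|\, X)
\iff
\log P(X \,|\, Spam) + \log P(Spam) > \log P(X \,|\, Ham) + \log P(Ham)$$
```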

82
00:07:32,170 --> 00:07:38,260
The reason why I'm adding the simplification step here is because I've seen this step in many, many

83
00:07:38,260 --> 00:07:41,690
implementations of the Naive Bayes classifier.

84
00:07:41,950 --> 00:07:48,850
And I remember when first looking at this code it was very confusing, because I couldn't tie it back to

85
00:07:48,880 --> 00:07:51,250
the formula in Bayes' theorem.

86
00:07:51,340 --> 00:07:57,560
Nonetheless this simplification is perfectly valid and will not change our results.

87
00:07:57,960 --> 00:08:02,880
You might even call this the one weird trick that statisticians don't want you to know.

88
00:08:03,140 --> 00:08:08,470
And you can verify this for yourself in the next lessons, where we're gonna be talking about metrics

89
00:08:08,890 --> 00:08:10,270
and evaluation.

90
00:08:10,360 --> 00:08:15,790
We're going to look at our Bayes classifier and actually check how well it's doing.

91
00:08:16,030 --> 00:08:16,850
I'll see you there.