All right. So this is the very final lesson on Naive Bayes. And in this lesson we're going to implement the Naive Bayes classifier the quick and dirty way: we're going to use the scikit-learn module to do all the heavy lifting for us. And this is why I'm calling this lesson "Bayes: Brisk, Brief and Better". You'll find out why; stick with me.

The very first thing we're going to do, in your Jupyter projects folder, is check inside SpamData/01_Processing that you see this JSON file here: email-text-data.json. This is the data file that we're going to be using in this lesson.

So back in your projects folder, I'd like you to create a new Python 3 notebook, and I'd like you to give it the title "08 Naive Bayes with scikit-learn". Click rename, and then you can go ahead and click View > Toggle Header, and you'll have a bit more screen real estate to play with.

Now, in our first cell we're going to add a couple of import statements: we're going to import numpy as np, and we're going to import pandas as pd. In the next cell we add the string for our path to that JSON file that I showed you earlier. So I'll say DATA_JSON_FILE is equal to, in single quotes, 'SpamData' (this is going to be your folder name), forward slash, '01_Processing', forward slash, 'email-text-data.json'. So this string is going to be our relative path to the resources that you downloaded earlier as part of this module and saved in your project folder.

Now let's import this JSON as a DataFrame, so we'll use pandas for this. I'll store all of that information in a variable called data. So I'll set data equal to pd.read_json; read_json is pandas' method for reading that JSON and converting it into a pandas DataFrame. And then I'm going to pass in the relative path to this JSON file. There we go.

Let's see what this looks like. data.tail() shows the last five rows. Here we see our spam messages, our file names, our labels (which are our categories) and an index here. Now, you might be wondering why I'm using a JSON file instead of something like, say, a CSV, which you could open in Microsoft Excel. And the answer is: I've actually tried this.
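Taken together, here's a sketch of what these first notebook cells might look like, assuming your resource folder is laid out as SpamData/01_Processing exactly as described:

    import numpy as np
    import pandas as pd

    # Relative path to the JSON data file downloaded with this module
    DATA_JSON_FILE = 'SpamData/01_Processing/email-text-data.json'

    # Load the JSON file into a pandas DataFrame
    data = pd.read_json(DATA_JSON_FILE)

    # Peek at the last five rows: message text, file name and category label
    data.tail()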
I took our entire dataset and I put it into a CSV file, and then I tried to open it with a 32-bit version of Microsoft Excel, and that completely failed on my Windows machine. And it kind of shows you that when you're working with larger amounts of data, you quickly run into the limitations of these spreadsheet programs. So in order not to tempt you, I've gone for a JSON instead. That way we're not tempted to open it in a spreadsheet.

Now let's take a look at the shape of our DataFrame. I've got three columns and I've got five thousand seven hundred and ninety-six rows. So this is the number of emails that I've got. And even though I'm looking at my last five rows up here, it's got an index of nine hundred and ninety-nine, meaning we can probably sort our DataFrame, right? If we want it to be in the order of our index. So if I type data.sort_index(inplace=True) (the inplace argument means I'm not storing the result in a different variable) and I hit Shift+Enter on this, and now I look at my tail, then I've got all my indices in order, and I can see that the last row here has the index value five thousand seven hundred and ninety-five.

All right. So much for managing data, importing data and doing general setup. Now, we're actually not going to do any data cleaning and data exploration in this lesson. I've given you a dataset here which allows us to take a shortcut. We're going to dive straight into generating our vocabulary for our Naive Bayes classifier, and here's how we're going to do it. We're going to jump up to our import statements at the top, and then we'll say: from sklearn.feature_extraction.text import CountVectorizer. This is the component that will allow us to generate our vocabulary very, very quickly and very efficiently.

Let me add a few rows here, scroll down, and now I'll create my vectorizer. So I'm going to store this vectorizer in a variable called, well, vectorizer. Let's put that equal to CountVectorizer(), and then I can specify some arguments; I can customize the kind of vectorizer I want. And what I'm going to do is say: use the English stop words. Right, stop_words is equal to, in single quotes, 'english'. And this means that when our vectorizer generates the vocabulary from the email dataset, it will remove the English stop words. So let me hit Shift+Enter on this.
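As a sketch, the sorting cell and the vectorizer cell described above might look like this:

    from sklearn.feature_extraction.text import CountVectorizer

    # Sort the DataFrame by its index, in place (no new variable needed)
    data.sort_index(inplace=True)

    # A vectorizer that drops common English stop words when building the vocabulary
    vectorizer = CountVectorizer(stop_words='english')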
And now that I've got my vectorizer there, I can create my vocabulary, and I can also create my document-term matrix. Now, if you remember, the document-term matrix is what we were laboriously building in the previous lessons. We did a lot of data manipulation and counted how many times particular words, or particular tokens, occurred in our corpus. Here's how we can do all of this in one line. Remember, the individual words were our features, right? So I'll say all_features is equal to (this is where we're going to store my document-term matrix) vectorizer.fit_transform(data.MESSAGE). So what I'm doing here is using the fit_transform method from the vectorizer, and I'm supplying this column here, this MESSAGE column from our DataFrame. Let me hit Shift+Enter on this. It's going to run for a little while, a few seconds longer than all the other cells. But what we get at the end, all_features.shape, is a sparse matrix: we get five thousand seven hundred and ninety-six rows and just a little over a hundred thousand columns. These columns correspond to the tokens in our emails; they correspond to the individual words.

Now, in this line of code our vectorizer has actually already learnt our vocabulary as well, and we can pull this up, so we can take a look at the vocabulary that's present in the vectorizer. So vectorizer.vocabulary_ (with an underscore at the end) will pull it up for us. Here we see the individual words: dear, homeowner, rates, lowest, point, help, best, rate, situation, matching, needs, and so on. This is the vocabulary that will help us determine if an email is spam or not spam. And notice that these actually aren't even stemmed words, right?

So now that we've got our features matrix and our vocabulary, it's time to split and shuffle our training and our test data. And the way we're going to do this is, of course, with our tried and trusted train_test_split method from scikit-learn. So, going up to the top, I can import this and say: from sklearn.model_selection import train_test_split. And if it looks like I'm typing this out really quickly, it's because I'm typing a few of the letters and then hitting Tab on my keyboard to insert the rest.
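Here's a minimal sketch of that one-liner and the follow-up checks, assuming the text column in the DataFrame is named MESSAGE as described:

    # Learn the vocabulary and build the (sparse) document-term matrix in one step
    all_features = vectorizer.fit_transform(data.MESSAGE)

    all_features.shape        # (5796, 100k+): one row per email, one column per token
    vectorizer.vocabulary_    # dict mapping each token to its column index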
So, Shift+Enter, and coming down here I'll create four variables, right: X_train, X_test, y_train and y_test. And those will be equal to the result of train_test_split. As arguments to this method from scikit-learn we're going to supply four values. Well, the first thing we need are our features. The second thing we need are our labels, right? So data.CATEGORY is where our labels are stored; these will read 1 for a spam and 0 for a non-spam email. So this is the column that we're going to use. And then we're going to decide on the size of our training and testing datasets. So, test size: test_size is going to be equal to 0.3, so I'm going to go with a 30 percent test size. And then I'll select a random_state, and I'll just say random_state is equal to 88. So 88 is the number that you also should type in, in case you want to get the same results on the shuffle as myself.

So let me run this, and now we can take a look at the shape of our training and our testing data. So X_train.shape is equal to four thousand and fifty-seven rows and a little over a hundred thousand on the columns. So we've got four thousand and fifty-seven emails, and we've got the rest in X_test: its shape is equal to one thousand seven hundred and thirty-nine. Brilliant. So we're all set to go to train our model.

Now, training our Naive Bayes model could not be easier. The reason being: if we go to the very top once again, from scikit-learn we can actually import a Naive Bayes classifier model. So check it out: from sklearn.naive_bayes import MultinomialNB, a multinomial Naive Bayes. Shift+Enter on this guy, and coming down here, this will allow us to create our model very quickly. All we need to do is store it somewhere, say classifier is equal to MultinomialNB(). That's it; that creates our model for us.

Now that we've got our model, we can train it, right? So classifier.fit(X_train, y_train) trains our model. This is it. The fit method, supplied with two arguments, our training data and our training labels, will completely train our model. Shift+Enter.
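Put together, the split and training cells might look like this sketch, assuming the label column is named CATEGORY as described:

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # 70/30 split; random_state=88 reproduces the same shuffle as in the lesson
    X_train, X_test, y_train, y_test = train_test_split(
        all_features, data.CATEGORY, test_size=0.3, random_state=88)

    # Create and train the multinomial Naive Bayes model
    classifier = MultinomialNB()
    classifier.fit(X_train, y_train)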
And now we've got a trained model. So, how did we do? That's the next question, and I'd really like to throw this over to you as a challenge, because in the previous lessons we've talked a lot about metrics. What I'd like you to do is calculate the following for the test dataset, so X_test and y_test: can you work out the number of documents that we classified correctly, and the number of documents that were classified incorrectly? And finally, I'd like you to work out the accuracy of our Naive Bayes model on the test dataset. I'll give you a few seconds to pause the video and give this a go. I'll see you on the other side.

All right. So let's tackle one of these at a time. The number of correct documents we can work out by comparing the y_test data with what the classifier predicted, right? So, y_test double-equals classifier.predict(X_test): our test data, fed into the predict method from our classifier, and then summed up, will be the number of documents classified correctly.

So the trick for this challenge was googling for the MultinomialNB documentation on scikit-learn. And there, when you scroll down, you can see that under the methods there is a predict method, and this performs the classification on an array of test vectors. And this is exactly what we've done here: we've used our classifier, put a dot after it, called the predict method, supplied our X_test, and this is what we're comparing with our actual values. Because y_test looks like this, and classifier.predict(X_test) looks like this: these are two arrays where we can check, with the double equals sign, if the values match. And then all we need to do is sum up the number of Trues in this comparison to get the number of documents that we predicted correctly. So nr_correct is equal to this. Say we want to print this out, so print... let's use an f-string. So print, f, single quotes, curly braces, nr_correct, then 'documents classified correctly', end single quotes and parentheses. So if I execute this cell here and then execute my print statement, I will get this value here,
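As a sketch, the solution cell for this first part of the challenge might look like this:

    # Element-wise comparison of true labels vs. predictions; each True counts as 1 when summed
    nr_correct = (y_test == classifier.predict(X_test)).sum()
    print(f'{nr_correct} documents classified correctly')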
this variable inserted here in my string, using these curly braces and the f in front of the quotes. So here you can see we've correctly predicted one thousand six hundred and sixty documents. Now, what about the number of documents incorrectly predicted? Right, our nr_incorrect shall be equal to y_test.size (so the number of documents in the test dataset, the number of emails in y_test) minus that one thousand six hundred and sixty, right, our nr_correct. So if we want to print this out using an f-string, I can say 'number of documents incorrectly classified is', curly braces, nr_incorrect. Shift+Enter will show us that seventy-nine documents have been classified incorrectly by our classifier. Brilliant.

So, what does this mean for accuracy? Now that we've worked out the number of documents classified correctly and the number of documents classified incorrectly, we can calculate the fraction that were classified incorrectly, right? So fraction_wrong is equal to nr_incorrect divided by nr_correct plus nr_incorrect. This is the fraction of documents classified incorrectly. So if I wanted to print out the accuracy of the model, I could write something like print, parentheses, f, single quotes, 'the testing accuracy (to be specific) of the model is', curly braces, 1 minus fraction_wrong. And here we see that our model is in fact around ninety-five percent accurate.

Now, if I wanted to format this as a percentage, all I need to do is put my cursor in front of this closing curly brace, put a colon there, then put a dot, then a 2 and then a percent sign, and this will format my percentage to two decimal places. There we go. So that looks quite pretty now, right?

Now, if you were studying this documentation a little more closely, then you might have noticed that we didn't even have to go through all that trouble, because there is in fact a score method which will report our accuracy for us. So we could have also done it this way: had we said classifier.score(X_test, y_test), we would have gotten the same result.

Now, as a follow-up challenge, to see how our model is doing we should really look beyond accuracy, right?
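Here's a consolidated sketch of those accuracy cells, including the :.2% format specifier and the built-in score shortcut:

    # Everything that isn't correct is incorrect
    nr_incorrect = y_test.size - nr_correct
    print(f'Number of documents incorrectly classified is {nr_incorrect}')

    # Fraction misclassified, then accuracy formatted to two decimal places
    fraction_wrong = nr_incorrect / (nr_correct + nr_incorrect)
    print(f'The testing accuracy of the model is {1 - fraction_wrong:.2%}')

    # Same result in one call: mean accuracy on the given test data and labels
    classifier.score(X_test, y_test)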
We talked about this in the previous lessons. So what I'd like you to do is work out the recall, the precision and the F1 score for our classifier. Once again, I encourage you to google for the scikit-learn documentation on this topic to work this out. I'll give you a few seconds to pause the video, and I'll see you on the other side.

So if I go ahead and google "scikit-learn recall precision" and I scroll down to the very first result, then what I see is that there's a brief description of what precision and recall are. But scrolling further down, I can see that this example seems to be talking a little bit more about the precision-recall curve. I'm not after something this fancy. What I'm actually after is just the simple metrics, right: the recall score, the precision score and the F1 score. And these live in sklearn.metrics dot, and then the name of the score. So let's take recall, for example. Here's the detailed description and how to use it, and here's a very quick example: from sklearn.metrics import recall_score. I'll copy that line, and here's how I can use it.

So, coming back to our notebook and scrolling to the very top: I want to import the recall score, but also I want to import the precision score while I'm at it, and I'm also going to import the F1 score. So, three import statements: from sklearn.metrics import recall_score, from sklearn.metrics import precision_score, and from sklearn.metrics import f1_score. And we don't actually have to copy-paste all of these; all three metrics live under sklearn.metrics, meaning we can put a comma here, right, write precision_score, put another comma, and write f1_score. So now we've got one line: from sklearn.metrics import recall_score, precision_score and f1_score. Let me hit Shift+Enter on this, scroll back down, and now it's time to work it out.

Calculating our recall_score just needs two inputs, right? It needs the correct labels, so y_test, and it needs our predictions. And as we've seen before, we can get our predictions using our classifier, using the predict method and supplying our test data, X_test. And what we see is that our recall is around 86 percent. Our precision_score we can get in a very, very similar way, right:
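A sketch of the combined import line and the recall calculation:

    from sklearn.metrics import recall_score, precision_score, f1_score

    # Recall: what fraction of the actual spam did we catch? (~0.86 here)
    recall_score(y_test, classifier.predict(X_test))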
y_test, comma, classifier.predict(X_test). And precision is at around ninety-nine percent. Very high. And finally, our F1 score: same thing. I can even copy the line above, change the name of the method (right, f1_score) and work it out. That's right: 92 percent. So these are our metrics, and they're looking really strong, right? Looking very, very strong. Once we've got our data and we've split it up into our testing and training datasets, training our model, making predictions and working out our metrics is actually super, super quick.

So, the last thing I'm going to show you in this lesson is that now that we've trained our classifier, we can actually do some pretty cool stuff with it, like evaluate some sentences or some emails that we're going to write on the fly, just like that. We're going to try our own example sentences. Since we've trained our classifier already, we can add some sentences or some emails to a list and then check how spammy they really are. Let me show you what I mean.

So I'll add a few cells here, and I'll just call this list example. It's going to contain a couple of strings, right? So the classic one is 'get viagra for free now'. Right. But we can also try 'need a mortgage? Reply to arrange a call with a specialist and get a quote'. That's pretty spammy, right? For the next one, maybe let's try something that isn't very spammy. Maybe something like, uh, 'could you please help me with the project for tomorrow'. Try that one. Then maybe, um... 'Hello Jonathan, how about a game of golf tomorrow?'. I imagine this is how the Monopoly man talks to his friends. And for the last one, mm-hmm, I'm going to go to Wikipedia and just search for a favourite Austrian pastime, namely ski jumping, and I'm going to grab the first couple of sentences here, copy them, come back in here and, in single quotes, paste them all in. And then I'm going to have to hunt around for an apostrophe, because you can tell from the syntax highlighting that this rogue apostrophe here needs escaping, meaning it should be treated as part of the string. I can do that with a backslash. There we go.
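The remaining metric calls and the example list might look like this sketch; the sentences are reconstructed as best as they can be made out from the lesson, and the ski-jumping string is a stand-in for whatever Wikipedia sentences you paste in, with its apostrophes escaped:

    # Precision (~0.99) and F1 (~0.92) take the same two inputs as recall
    precision_score(y_test, classifier.predict(X_test))
    f1_score(y_test, classifier.predict(X_test))

    example = [
        'get viagra for free now',
        'need a mortgage? reply to arrange a call with a specialist and get a quote',
        'could you please help me with the project for tomorrow',
        'hello Jonathan, how about a game of golf tomorrow',
        # Stand-in for the pasted Wikipedia text, apostrophe escaped with a backslash
        'Ski jumping is one of Austria\'s favourite winter pastimes.',
    ]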
So now my string ends here, and I've got my list of example emails, or example sentences, that we can try to make a prediction on using our classifier.

So, how do we do this? Well, first up we need our vectorizer, right? The vectorizer is going to process this new piece of data, right, this list of sentences. So I want to use vectorizer.transform; that's the method to process these emails, and I'll feed in example. So this is the code that will process this list and get it ready for our classifier. I'll tell you what it's going to spit out: a document-term matrix, right? So I can maybe store that under doc_term_matrix, and set that equal to the output from the vectorizer. Right, the next line of code: I'm going to take my classifier, use the predict method and feed in, you guessed it, the doc_term_matrix. And let's see what we get.

The very first sentence was very, very spammy, right? And our classifier actually predicts this sentence to be from a spam email. Same with the second one, and that's because the words "mortgage" and "quote" probably tipped it off. But the third, fourth and fifth entries here are not classified as spam. So, so far so good. I think at this point you can probably try a couple of your own sentences and see how the classifier behaves.

In any case, I hope this lesson was useful and that it kind of rounded off our Naive Bayes module. I really wanted to show you how you might build a Naive Bayes classifier and train it with the power of these libraries, in this case scikit-learn. Now, of course, there are pros and cons to using libraries. You can't just apply them blindly; you have to understand how they work, because there's so much going on under the hood. And this is why we spent a lot of time in the previous lessons covering many of the mechanics and actually built this Naive Bayes classifier from the ground up. That way, these last couple of lines of code don't come across like forbidden magic or something.

So where does this leave us? Well, more and more companies are giving job applicants these kinds of case studies to solve as part of their job interviews.
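And the final two cells, as a sketch:

    # Reuse the SAME fitted vectorizer: transform (not fit_transform) the new sentences
    doc_term_matrix = vectorizer.transform(example)

    # 1 = spam, 0 = non-spam; expect something like array([1, 1, 0, 0, 0])
    classifier.predict(doc_term_matrix)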
If you're in the job market these days, very often you'll be tasked with some sort of data science or machine learning assignment as part of the interview process. And my recommendation is that if you're working on a case study like this, or an assignment, make sure that you can demonstrate to your interviewers that you're not just a copy-paste coder, that you're not just plugging libraries together, but that you truly understand what's going on. And this will be an important aspect of both the work that you're submitting to the company, as well as what you want to show your interviewer when they bring you in to talk about things.

So what's coming up next? Well, the coming modules are going to be really exciting, because in the upcoming modules we're going to be taking this whole classification game up a notch. We're no longer going to classify things into just two categories; we're going to classify amongst many different categories. And to do that, we're going to take the opportunity to talk about another incredibly powerful tool, namely a neural network. Neural networks are super exciting. I'm really looking forward to seeing you in the next lessons. And if you get a chance, go watch some ski jumping. It's pretty cool.