All right. Welcome to a new module. Over the course of this module we'll be training our Bayes classifier. But the first step is setting up our notebook and loading our data.

In your projects folder, click on New Python 3 Notebook and let's give this notebook a name: let's call it 07 Bayes Classifier - Training. To set up this notebook, let's add a few markdown cells. The first one will be Notebook Imports and the second one will read Constants. Under the notebook imports, let's import pandas as pd and let's import numpy as np; we always need these two guys. Under the constants, you can probably copy some of them over from the previous notebook. In particular, I'm thinking of our two file paths and our vocabulary size. So I'll copy these over, paste them in here, and also add our VOCAB_SIZE, which was two thousand five hundred.

There you go. The very first thing we're gonna do is load our features from the two text files into a numpy array. I'm going to add a markdown cell here to commemorate this; it'll read "Load Features from .txt Files into NumPy Array". The way we're gonna do this is with the loadtxt function from numpy. Our training data comes as a sparse matrix, so I'll store it in a variable called sparse_train_data, and that's equal to np.loadtxt, parentheses, TRAINING_DATA_FILE, comma, delimiter equals a single space in quotes, comma, data type, or dtype, equals int.

So what I'm doing here is, first of all, giving everybody who hasn't done the data pre-processing a chance to come in at the training part of the project. And I'm doing this by calling a function from numpy called loadtxt and supplying three arguments to it. The first one is the relative path to my data file. We created this data file in the previous module, but you can also download it separately. The second thing is I'm adding a delimiter. Now, if you haven't heard this word before, a delimiter is a character that's used to specify the boundary between independent regions of plain text or data. We are using a single whitespace as our delimiter, and you can see this if you open your text file in a text editor: the single whitespace is what's used to separate the different values.
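Here's a minimal sketch of the cells described so far. The file paths are placeholders on my part; substitute whatever paths you set up in the previous module:

```python
import pandas as pd
import numpy as np

# Constants copied over from the pre-processing notebook.
# These paths are placeholders: point them at your own files.
TRAINING_DATA_FILE = 'SpamData/02_Training/train-data.txt'
TEST_DATA_FILE = 'SpamData/02_Training/test-data.txt'
VOCAB_SIZE = 2500

# Load the whitespace-delimited training data as whole numbers.
sparse_train_data = np.loadtxt(TRAINING_DATA_FILE, delimiter=' ', dtype=int)
```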
The first value here is our document ID. The second value is our word ID. The third value is the category or label: so 1 for spam and 0 for non-spam. And the fourth value is the occurrence, the number of times a particular word, say the word with ID 0, occurs in the email with document ID 0. The whitespace separates these different kinds of data.

Now, we can open our word-by-id .csv file to give you an example of a different kind of delimiter. CSV, of course, stands for comma separated values, so there you can see that the delimiter is, well, surprise surprise, a comma. In this file we have the word ID as the very first value on every single line, and we have the string that the word ID stands for as the second kind of value. If you were to open this kind of file in Microsoft Excel or some other spreadsheet program, you'd see that all of these values get put into separate columns. So that's how a delimiter works.

Lastly, I supplied some information about the kind of data that we're importing here. We're exclusively dealing with whole numbers, right? Our document IDs are whole numbers, our word IDs are whole numbers, our categories are whole numbers, and the number of times a particular word occurs is also a whole number. So I've set the data type as an integer.

Let me shift-enter on this cell. You'll see that it runs for a little while. Let's import our test data as well: sparse_test_data is equal to np.loadtxt, and then it'll be TEST_DATA_FILE, which is our relative path to our test-data.txt. It has the same delimiter, a single space, and it also has the same data type, an integer.

So now we have two numpy arrays. Let's see if we loaded them successfully by just looking at the first five and the last five rows. I'll take sparse_train_data and add square brackets with a colon and a five to look at the first five rows. And to verify that the last five rows are okay, I can use the same square bracket notation, but this time I'll put a minus five and then a colon at the end. There we go. If you do this, what you should see is that there's a match between the values in the text file and the values that are shown here as output in the Jupyter notebook. Now, both of these arrays are actually quite large.
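Continuing the sketch from above, the test-data load and the head/tail check might look like this (the print calls are just so the slices show up outside a notebook cell):

```python
# Load the test data the same way: whitespace-delimited whole numbers.
sparse_test_data = np.loadtxt(TEST_DATA_FILE, delimiter=' ', dtype=int)

# Sanity check: these rows should match the raw .txt file.
print(sparse_train_data[:5])   # first five rows
print(sparse_train_data[-5:])  # last five rows
```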
To give you an idea of the kind of data that we're working with here, let's print out the number of rows in both the training and the test file, and let's print out the number of emails that we're working with in our training and testing files. We've done this kind of thing a couple of times before with DataFrames, but now that we've got a numpy array, let me show you the methods you'd use to accomplish the same thing.

So I'll drop everything in a print statement: print, parentheses, single quotes, "Nr of rows in training file", comma, and then sparse_train_data.shape, square brackets, zero. I'll copy this line, paste it below, and then just swap training for test, and hit shift-enter. There you can see that we've got quite a few entries in both of these files: our training file has about two hundred sixty-five thousand rows, and our testing file has one hundred and ten thousand rows.

The number of unique emails, though, will be determined by the number of unique document IDs that are contained in these matrices. So I can print this out with print, single quotes, "Nr of emails in training file", comma, and then I'll use numpy's unique method: np.unique, parentheses, sparse_train_data, and then square brackets afterwards to select everything in the first column. That way I'm only selecting the document IDs: colon, comma, zero. This is the notation for selecting all the rows in the first column. And then, because this will actually return another array, I have to chain on an attribute here, namely the size attribute.

Let's see if this works. There we go: the number of emails in the training file is four thousand and fourteen. If we check out the number of unique emails in the testing file, we just swap this over for test, hit shift-enter, and we see that we're working with one thousand seven hundred and twenty-three different emails in the testing file.

So there we go. I think that completes our setup and importing the data.
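Pulling those two checks together, a minimal sketch (the exact label strings are my shorthand for the printed labels):

```python
# Total rows: one line per (document, word) pair that actually occurs.
print('Nr of rows in training file', sparse_train_data.shape[0])
print('Nr of rows in test file', sparse_test_data.shape[0])

# Unique document IDs in the first column = number of distinct emails.
print('Nr of emails in training file', np.unique(sparse_train_data[:, 0]).size)
print('Nr of emails in test file', np.unique(sparse_test_data[:, 0]).size)
```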
We've also talked about the format that our data is in. We've got four columns. The first one is a document ID, which identifies a particular email. The second one is a word ID, which identifies a particular token or word. The third one is our label or category: so zero for non-spam and one for spam. And the fourth one is the number of times this particular word occurs in the email.

In the next lessons we'll be transforming this numpy array into a pandas DataFrame, and we're also going to be restructuring it so that it's no longer a sparse matrix but instead a full matrix. If you remember, the full matrix included the zero values; this sparse matrix does not. There's a tiny sketch of that idea below. Looking forward to seeing you in the next lessons.
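As a quick preview of that sparse-to-full idea, here's a tiny made-up example; the numbers, the five-word vocabulary, and the loop are all illustrative assumptions, not the course's actual conversion code:

```python
import numpy as np

# Hypothetical sparse rows: DOC_ID, WORD_ID, LABEL, OCCURRENCE.
# Only non-zero word counts are stored.
sparse = np.array([[0, 0, 1, 2],
                   [0, 3, 1, 1],
                   [1, 1, 0, 4]])

vocab_size = 5  # tiny vocabulary, just for illustration

# The full matrix has one row per document and one column per word ID,
# so every zero count appears explicitly.
full = np.zeros((np.unique(sparse[:, 0]).size, vocab_size), dtype=int)
for doc_id, word_id, _label, occurrence in sparse:
    full[doc_id, word_id] = occurrence  # assumes doc IDs start at 0

print(full)
# [[2 0 0 1 0]
#  [0 4 0 0 0]]
```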