All right. Welcome to a new module. Over the course of this module we'll be training our Bayes classifier. But the first step is setting up our notebook and loading our data.

In your projects folder, click on New Python 3 Notebook and let's give this notebook a name: let's call it 07 Bayes Classifier - Training. To set up this notebook, let's add a few markdown cells. The first one will be Notebook Imports and the second one will read Constants. Under the notebook imports, let's import pandas as pd and let's import numpy as np; we always need these two guys. Under the constants, you can probably copy some of them over from the previous notebook. In particular, I'm thinking of our two file paths and our vocabulary size. So I'll copy these over, paste them in here, and also add our VOCAB_SIZE, which was two thousand five hundred.

There you go. The very first thing we're gonna do is load our features from the two text files into a numpy array. I'm going to add a markdown cell here to commemorate this; it'll read "Load Features from .txt Files into NumPy Array". The way we're gonna do this is with the loadtxt function from numpy. Our training data comes as a sparse matrix, so I'll store it in a variable called sparse_train_data, and that's equal to np.loadtxt, parentheses, TRAINING_DATA_FILE, comma, delimiter equals a single space in quotes, comma, data type, or dtype, equals int.

So what I'm doing here is, first of all, giving everybody who hasn't done the data pre-processing a chance to come in at the training part of the project. And I'm doing this by calling a function from numpy called loadtxt and supplying three arguments to it. The first one is the relative path to my data file. We created this data file in the previous module, but you can also download it separately. The second thing is I'm adding a delimiter. Now, if you haven't heard this word before, a delimiter is a character that's used to specify the boundary between independent regions of plain text or data. We are using a single whitespace as our delimiter, and you can see this if you open your text file in a text editor: the single whitespace is what's used to separate the different values.
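Here's a minimal sketch of the cells described so far. The file paths are placeholders on my part; substitute whatever paths you set up in the previous module:

```python
import pandas as pd
import numpy as np

# Constants copied over from the pre-processing notebook.
# These paths are placeholders: point them at your own files.
TRAINING_DATA_FILE = 'SpamData/02_Training/train-data.txt'
TEST_DATA_FILE = 'SpamData/02_Training/test-data.txt'
VOCAB_SIZE = 2500

# Load the whitespace-delimited training data as whole numbers.
sparse_train_data = np.loadtxt(TRAINING_DATA_FILE, delimiter=' ', dtype=int)
```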
The first value here is our document ID. The second value is our word ID. The third value is the category or label: so 1 for spam and 0 for non-spam. And the fourth value is the occurrence, the number of times a particular word, say the word with ID 0, occurs in the email with document ID 0. The whitespace separates these different kinds of data.

Now, we can open our word-by-id .csv file to give you an example of a different kind of delimiter. CSV, of course, stands for comma separated values, so there you can see that the delimiter is, well, surprise surprise, a comma. In this file we have the word ID as the very first value on every single line, and we have the string that the word ID stands for as the second kind of value. If you were to open this kind of file in Microsoft Excel or some other spreadsheet program, you'd see that all of these values get put into separate columns. So that's how a delimiter works.

Lastly, I supplied some information about the kind of data that we're importing here. We're exclusively dealing with whole numbers, right? Our document IDs are whole numbers, our word IDs are whole numbers, our categories are whole numbers, and the number of times a particular word occurs is also a whole number. So I've set the data type as an integer.

Let me shift-enter on this cell. You'll see that it runs for a little while. Let's import our test data as well: sparse_test_data is equal to np.loadtxt, and then it'll be TEST_DATA_FILE, which is our relative path to our test-data.txt. It has the same delimiter, a single space, and it also has the same data type, an integer.

So now we have two numpy arrays. Let's see if we loaded them successfully by just looking at the first five and the last five rows. I'll take sparse_train_data and add square brackets with a colon and a five to look at the first five rows. And to verify that the last five rows are okay, I can use the same square bracket notation, but this time I'll put a minus five and then a colon at the end. There we go. If you do this, what you should see is that there's a match between the values in the text file and the values that are shown here as output in the Jupyter notebook. Now, both of these arrays are actually quite large.
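Continuing the sketch from above, the test-data load and the head/tail check might look like this (the print calls are just so the slices show up outside a notebook cell):

```python
# Load the test data the same way: whitespace-delimited whole numbers.
sparse_test_data = np.loadtxt(TEST_DATA_FILE, delimiter=' ', dtype=int)

# Sanity check: these rows should match the raw .txt file.
print(sparse_train_data[:5])   # first five rows
print(sparse_train_data[-5:])  # last five rows
```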
To give you an idea of the kind of data that we're working with here, let's print out the number of rows in both the training and the test file, and let's print out the number of emails that we're working with in our training and testing files. We've done this kind of thing a couple of times before with DataFrames, but now that we've got a numpy array, let me show you the methods you'd use to accomplish the same thing.

So I'll drop everything in a print statement: print, parentheses, single quotes, "Nr of rows in training file", comma, and then sparse_train_data.shape, square brackets, zero. I'll copy this line, paste it below, and then just swap training for test, and hit shift-enter. There you can see that we've got quite a few entries in both of these files: our training file has about two hundred sixty-five thousand rows, and our testing file has one hundred and ten thousand rows.

The number of unique emails, though, will be determined by the number of unique document IDs that are contained in these matrices. So I can print this out with print, single quotes, "Nr of emails in training file", comma, and then I'll use numpy's unique method: np.unique, parentheses, sparse_train_data, and then square brackets afterwards to select everything in the first column. That way I'm only selecting the document IDs: colon, comma, zero. This is the notation for selecting all the rows in the first column. And then, because this will actually return another array, I have to chain on an attribute here, namely the size attribute.

Let's see if this works. There we go: the number of emails in the training file is four thousand and fourteen. If we check out the number of unique emails in the testing file, we just swap this over for test, hit shift-enter, and we see that we're working with one thousand seven hundred and twenty-three different emails in the testing file.

So there we go. I think that completes our setup and importing the data.
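Pulling those two checks together, a minimal sketch (the exact label strings are my shorthand for the printed labels):

```python
# Total rows: one line per (document, word) pair that actually occurs.
print('Nr of rows in training file', sparse_train_data.shape[0])
print('Nr of rows in test file', sparse_test_data.shape[0])

# Unique document IDs in the first column = number of distinct emails.
print('Nr of emails in training file', np.unique(sparse_train_data[:, 0]).size)
print('Nr of emails in test file', np.unique(sparse_test_data[:, 0]).size)
```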
We've also talked about the format that our data is in. We've got four columns. The first one is a document ID, which identifies a particular email. The second one is a word ID, which identifies a particular token or word. The third one is our label or category: so zero for non-spam and one for spam. And the fourth one is the number of times this particular word occurs in the email.

In the next lessons we'll be transforming this numpy array into a pandas DataFrame, and we're also going to be restructuring it so that it's no longer a sparse matrix but instead a full matrix. If you remember, the full matrix included the zero values; this sparse matrix does not. There's a tiny sketch of that idea below. Looking forward to seeing you in the next lessons.
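As a quick preview of that sparse-to-full idea, here's a tiny made-up example; the numbers, the five-word vocabulary, and the loop are all illustrative assumptions, not the course's actual conversion code:

```python
import numpy as np

# Hypothetical sparse rows: DOC_ID, WORD_ID, LABEL, OCCURRENCE.
# Only non-zero word counts are stored.
sparse = np.array([[0, 0, 1, 2],
                   [0, 3, 1, 1],
                   [1, 1, 0, 4]])

vocab_size = 5  # tiny vocabulary, just for illustration

# The full matrix has one row per document and one column per word ID,
# so every zero count appears explicitly.
full = np.zeros((np.unique(sparse[:, 0]).size, vocab_size), dtype=int)
for doc_id, word_id, _label, occurrence in sparse:
    full[doc_id, word_id] = occurrence  # assumes doc IDs start at 0

print(full)
# [[2 0 0 1 0]
#  [0 4 0 0 0]]
```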