0 1 00:00:00,270 --> 00:00:06,180 Throughout the next lessons I want to do the final steps of the data preparation for our Bayes' classifier 1 2 00:00:06,630 --> 00:00:13,830 and also show you a common format for data sets that you'll often encounter out there in the wild. 2 3 00:00:14,040 --> 00:00:19,530 Over the next lessons we'll be writing some Python code to create our feature vectors, but not only that, 3 4 00:00:19,980 --> 00:00:23,850 we'll create our features as a sparse matrix. 4 5 00:00:23,850 --> 00:00:25,420 What do I mean by that? 5 6 00:00:25,440 --> 00:00:28,140 Well, let's take it one step at a time. 6 7 00:00:28,320 --> 00:00:31,050 Let's consider a full matrix first. 7 8 00:00:31,080 --> 00:00:38,110 In that case, we have our document IDs in one column and then we'll have our words in another column. 8 9 00:00:39,020 --> 00:00:44,490 Now of course these will be the word IDs, but for illustration, I've written an actual word here on this 9 10 00:00:44,490 --> 00:00:45,830 slide. 10 11 00:00:45,830 --> 00:00:52,490 Then, in the third column, we'll have our Label. Our Label will be equal to 1 if the email is spam 11 12 00:00:52,940 --> 00:00:54,680 and it will be equal to 0 12 13 00:00:54,710 --> 00:00:56,740 if our email is non-spam. 13 14 00:00:56,960 --> 00:01:04,280 So in this case, email number 5795 is a non-spam email and the 14 15 00:01:04,280 --> 00:01:05,840 label is equal to 0. 15 16 00:01:06,110 --> 00:01:13,190 In the last column, where it says Occurrence, we will capture how often the word in the Word column 16 17 00:01:13,400 --> 00:01:15,040 appears in the email. 17 18 00:01:15,080 --> 00:01:20,480 So if the word "free" appears 3 times, then occurrence will be equal to 3. 18 19 00:01:20,510 --> 00:01:22,010 Makes sense, right?
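To make this layout concrete, here's a tiny runnable sketch of such a full-matrix table. The words, counts, and the single document ID are made up for illustration; the real table would use word IDs from the vocabulary rather than the words themselves.

```python
import pandas as pd

# Toy illustration of the "full matrix" layout from the slide: one row per
# (document, word) pair. Email 5795 is non-spam, so LABEL is 0 throughout.
full_matrix = pd.DataFrame({
    'DOC_ID':     [5795, 5795, 5795, 5795],
    'WORD':       ['free', 'mortgage', 'pay', 'meeting'],
    'LABEL':      [0, 0, 0, 0],   # 1 = spam, 0 = non-spam
    'OCCURRENCE': [3, 0, 0, 1],   # how often the word appears in the email
})
print(full_matrix)
```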
19 20 00:01:22,010 --> 00:01:29,330 The reason that this kind of table is called a full matrix is because for each email, there's an entry 20 21 00:01:29,450 --> 00:01:30,890 for each word, 21 22 00:01:30,890 --> 00:01:38,390 even if that word doesn't occur in the email. The full matrix has an entry for each word in the vocabulary 22 23 00:01:38,780 --> 00:01:41,360 for each and every single email. 23 24 00:01:41,430 --> 00:01:45,110 Now we set our vocabulary size at 2500, 24 25 00:01:45,140 --> 00:01:52,430 remember? Therefore for each document ID, there will be 2500 entries, many of 25 26 00:01:52,430 --> 00:01:54,450 which will be 0. 26 27 00:01:54,530 --> 00:01:59,700 And this is where the sparse matrix comes in. With a sparse matrix, 27 28 00:01:59,760 --> 00:02:03,320 we will remove the entries which are 0. 28 29 00:02:03,790 --> 00:02:12,250 And that means we actually only include the rows which have a word that occurs in the email. 29 30 00:02:12,270 --> 00:02:18,750 In this example, the word "mortgage" and the word "pay" do not appear in email number 30 31 00:02:18,750 --> 00:02:20,150 5795, 31 32 00:02:20,190 --> 00:02:28,300 therefore those rows will not be present in the sparse matrix. So you can see how the sparse matrix is 32 33 00:02:28,310 --> 00:02:32,480 just a compressed version of the full matrix with fewer rows. 33 34 00:02:32,480 --> 00:02:36,390 Now let's head over to the Jupyter notebook and write some Python code. 34 35 00:02:36,620 --> 00:02:41,410 First, let me add a markdown cell to document what we're gonna do. 35 36 00:02:41,570 --> 00:02:52,040 We're going to generate features and a sparse matrix and the first part of that is going to be creating 36 37 00:02:52,160 --> 00:02:59,240 a dataframe with one word per column. 37 38 00:02:59,520 --> 00:03:02,340 Now I say creating a dataframe with one word per column, 38 39 00:03:02,540 --> 00:03:08,900 but what are we creating the dataframe from? Where are we at in terms of our data?
39 40 00:03:08,900 --> 00:03:16,980 We're going to be working with our stemmed nested list. Our stemmed nested list looks like this, 40 41 00:03:17,060 --> 00:03:21,230 so we've got a list of words for each document, for each email. 41 42 00:03:21,770 --> 00:03:29,310 The thing about this is that our stemmed nested list is a series, right? 42 43 00:03:29,720 --> 00:03:34,790 It's a pandas series that holds on to individual lists. 43 44 00:03:34,790 --> 00:03:41,360 So if I want to access the email at position 2 then I would see that even though our stemmed nested 44 45 00:03:41,360 --> 00:03:47,240 list as a whole is a pandas series, it contains individual lists. 45 46 00:03:47,450 --> 00:03:53,270 So it's actually a series of lists; each entry is a list. That makes it a little bit of an unwieldy data 46 47 00:03:53,270 --> 00:03:56,420 structure, but it's one we're gonna work with, 47 48 00:03:56,420 --> 00:04:04,130 and what we're gonna do is we're going to convert it from a series containing lists to a list 48 49 00:04:04,370 --> 00:04:05,840 containing lists. 49 50 00:04:05,840 --> 00:04:11,330 And the reason for doing that is that there exists a very, very handy method to convert a list of lists 50 51 00:04:11,450 --> 00:04:13,090 into a dataframe. 51 52 00:04:13,160 --> 00:04:20,810 So let me quickly copy and paste this cell and show you the method we're gonna use to convert our series 52 53 00:04:20,810 --> 00:04:21,850 to a list. 53 54 00:04:22,130 --> 00:04:29,090 And it's simply called "to_list". Our pandas series has a method called "to_list" which converts the whole 54 55 00:04:29,090 --> 00:04:31,460 thing to a Python list. 55 56 00:04:31,490 --> 00:04:32,450 Fair enough.
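As a tiny stand-in for the lesson's stemmed nested list (the real one comes from the earlier preprocessing steps), the "series of lists" structure and the to_list() conversion look like this:

```python
import pandas as pd

# Toy stand-in for the lesson's stemmed nested list: a pandas Series whose
# entries are lists of (already stemmed) words, one list per email.
stemmed_nested_list = pd.Series([
    ['free', 'offer', 'free'],
    ['meet', 'tomorrow'],
    ['claim', 'prize', 'now'],
])

# Each individual entry is a plain Python list...
print(type(stemmed_nested_list[2]))

# ...and to_list() converts the whole Series into a list of lists.
nested = stemmed_nested_list.to_list()
print(type(nested))
```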
56 57 00:04:32,630 --> 00:04:37,640 The way this looks is what you might expect, namely two square brackets, 57 58 00:04:37,670 --> 00:04:41,520 and then in each pair of square brackets you have an individual email, 58 59 00:04:41,520 --> 00:04:42,830 so the first email ends here, 59 60 00:04:42,830 --> 00:04:44,800 the second one starts here. 60 61 00:04:44,840 --> 00:04:45,370 It's... 61 62 00:04:45,850 --> 00:04:51,650 It's kind of messy actually, but we don't have to worry about this so much, because we're going to create 62 63 00:04:51,740 --> 00:04:56,570 a pandas dataframe using this code here. 63 64 00:04:56,720 --> 00:05:04,560 And the way we're gonna do that is with "pd.DataFrame", but instead of just adding the parentheses and enclosing 64 65 00:05:04,560 --> 00:05:13,890 it like so, what we're gonna do is put a dot and call a method called "from_records", 65 66 00:05:14,130 --> 00:05:21,330 "from_records" that is. And this is where we're going to feed in our "stemmed_nested_ 66 67 00:05:21,350 --> 00:05:30,320 list.to_list()". Now before I hit Shift+Enter on this, let's store this whole thing, this whole 67 68 00:05:30,320 --> 00:05:39,470 dataframe, in a variable called "word_columns_df", "df" for dataframe and we'll 68 69 00:05:39,470 --> 00:05:47,730 set that equal to "pd.DataFrame.from_records" and then below, we're going to print out the head 69 70 00:05:47,880 --> 00:05:54,130 of this dataframe, so "word_columns_df.head()". 70 71 00:05:54,210 --> 00:06:00,750 Now let's take a look at the first five rows, and what we see here is as the index, we've got our document 71 72 00:06:00,780 --> 00:06:08,880 IDs, we'll add a label for this shortly, and then we have each word split up as an individual data point 72 73 00:06:09,330 --> 00:06:16,890 in a column and it looks like we have a total of 7661 columns. 73 74 00:06:17,660 --> 00:06:21,940 The overall shape of our dataframe looks like this.
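The from_records step can be sketched with the same toy nested list as before; the real call is identical, just fed with the full stemmed_nested_list.to_list():

```python
import pandas as pd

# Toy nested list standing in for stemmed_nested_list.to_list().
nested = [
    ['free', 'offer', 'free'],
    ['meet', 'tomorrow'],
    ['claim', 'prize', 'now'],
]

# from_records treats each inner list as one row. Shorter emails get padded
# with missing values, which is why the real frame ends up with as many
# columns as the longest email has words.
word_columns_df = pd.DataFrame.from_records(nested)
print(word_columns_df)
print(word_columns_df.shape)
```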
74 75 00:06:22,280 --> 00:06:28,970 We've got 5796 rows and 7661 75 76 00:06:29,540 --> 00:06:30,800 columns. 76 77 00:06:30,830 --> 00:06:32,230 Now here's a question for you. 77 78 00:06:32,480 --> 00:06:40,940 Why does the dataframe have this shape? Well, 5796 is the total 78 79 00:06:40,940 --> 00:06:48,440 number of emails that we have and 7661 is the number of words, stemmed 79 80 00:06:48,440 --> 00:06:56,210 words that is, in the longest email. And we know this to be true because we've worked this out in a previous 80 81 00:06:56,480 --> 00:07:09,990 exercise. Now it's time to take the next step and that is splitting the data into a training and testing 81 82 00:07:10,170 --> 00:07:11,340 dataset. 82 83 00:07:11,790 --> 00:07:14,540 It's time to shuffle and split the data, 83 84 00:07:14,580 --> 00:07:18,450 now that we've got our "word_columns_df". 84 85 00:07:18,860 --> 00:07:20,530 Now, we've actually done this before. 85 86 00:07:20,550 --> 00:07:24,310 So I want to throw this over to you as a challenge. 86 87 00:07:24,390 --> 00:07:31,470 Can you split the data into a training and testing dataset using scikit-learn? As you're doing this, 87 88 00:07:31,950 --> 00:07:35,820 set the test size at 30%. 88 89 00:07:35,820 --> 00:07:44,190 That means that the training data should include around 4057 emails and also as you're shuffling 89 90 00:07:44,430 --> 00:07:47,300 set the seed value to 42. 90 91 00:07:47,400 --> 00:07:53,460 And as you pause this video and try to solve this challenge, have a think about what the target values 91 92 00:07:53,670 --> 00:07:55,170 should be as you're doing this. 92 93 00:07:58,490 --> 00:07:58,890 All right. 93 94 00:07:58,890 --> 00:08:00,470 So here's the solution. 94 95 00:08:00,660 --> 00:08:05,830 We're gonna be using scikit-learn's "train_test_split" function to accomplish this.
95 96 00:08:05,910 --> 00:08:15,200 So we have to import this whole thing into our notebook, so we'll say "from sklearn.model_ 96 97 00:08:15,200 --> 00:08:16,390 selection"; 97 98 00:08:16,410 --> 00:08:18,540 this is where the whole thing lives; 98 99 00:08:18,720 --> 00:08:28,180 "import train_test_split". Hitting Tab on your keyboard after typing a few of the letters should help you 99 100 00:08:28,270 --> 00:08:32,570 avoid any typos on this relatively long import statement. 100 101 00:08:32,620 --> 00:08:40,160 Now hit Shift+Enter on this cell and let's continue where we left off at the bottom of the notebook. The 101 102 00:08:40,160 --> 00:08:41,550 "train_test_split" 102 103 00:08:41,570 --> 00:08:47,800 function will give us four outputs. So let's store them in four separate variables. 103 104 00:08:47,990 --> 00:08:55,620 The first one I'll call "X_train", the next one I'll call "X_test", then lowercase 104 105 00:08:55,620 --> 00:08:58,020 "y_train", 105 106 00:08:58,050 --> 00:08:58,720 comma 106 107 00:08:58,930 --> 00:09:01,250 "y_test". 107 108 00:09:01,250 --> 00:09:10,720 And that's gonna be equal to "train_test_split(word_columns_df)". 108 109 00:09:10,760 --> 00:09:12,830 This is going to be our first argument. 109 110 00:09:13,280 --> 00:09:19,160 Then we have to supply our y-values. The dataframe that we just created after all was just the features, 110 111 00:09:19,240 --> 00:09:25,610 right, the different words. The y-values that we're trying to predict are actually our categories and 111 112 00:09:25,610 --> 00:09:32,800 I'm going to grab those from our "data" dataframe. It has a column called CATEGORY and this is what I'll 112 113 00:09:32,950 --> 00:09:34,530 supply here. 113 114 00:09:35,020 --> 00:09:46,720 Next I'll set the test size to 0.3, so 30%, and then I'll set my "random_ 114 115 00:09:46,960 --> 00:09:51,140 state" to 42. 115 116 00:09:51,200 --> 00:09:54,500 This is where I'm specifying a seed value. 
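Collected into one runnable cell, the split looks like this. The small word_columns_df and the "categories" series below are toy stand-ins for the real features dataframe and the CATEGORY column of the data dataframe, so the sketch runs on its own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for word_columns_df and data.CATEGORY (1 = spam, 0 = non-spam).
word_columns_df = pd.DataFrame.from_records([
    ['free', 'offer'], ['meet', None], ['prize', 'now'], ['lunch', None],
    ['win', 'cash'], ['report', 'due'], ['sale', 'ends'], ['call', 'me'],
    ['cheap', 'meds'], ['hi', None],
])
categories = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], name='CATEGORY')

# 30% held out for testing; random_state=42 is the seed, so the shuffle
# is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    word_columns_df, categories, test_size=0.3, random_state=42)

print(X_train.shape[0], X_test.shape[0])   # 7 training rows, 3 test rows
print(X_train.shape[0] / word_columns_df.shape[0])   # training fraction: 0.7
```

With the real data, the same call with test_size=0.3 leaves roughly 70% of the 5796 emails, about 4057, in the training set.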
116 117 00:09:54,500 --> 00:10:01,460 And if you and I specify the same seed value, we'll get the same results, we'll get exactly the same shuffle. 117 118 00:10:01,460 --> 00:10:08,770 So let me run this and then we'll look at our analytics. First, let me print out the number of training 118 119 00:10:08,770 --> 00:10:17,690 samples and I'll print out "X_train.shape[ 119 120 00:10:18,000 --> 00:10:18,730 0]". 120 121 00:10:18,760 --> 00:10:25,090 This is the number of training samples and we've got 70% of the total, which is 121 122 00:10:25,150 --> 00:10:28,340 4057. Next, 122 123 00:10:28,930 --> 00:10:36,480 let's just verify the fraction of the training set. So the fraction of the training set 123 124 00:10:36,680 --> 00:10:44,690 is gonna be this number, 4057 divided by the total number of entries, say in 124 125 00:10:44,690 --> 00:10:46,840 our features dataframe. 125 126 00:10:46,880 --> 00:10:59,370 So this was "word_columns_df.shape[0]" and that's 70% or very close 126 127 00:10:59,370 --> 00:11:07,440 to that, meaning the test set is going to be one minus this number, which is 30%, which we've specified 127 128 00:11:07,620 --> 00:11:08,850 here. 128 129 00:11:08,850 --> 00:11:11,290 Now let's take a look at what we've actually got. 129 130 00:11:11,370 --> 00:11:16,200 So I'll take "X_train.head()". 130 131 00:11:16,200 --> 00:11:22,330 These are the first five rows of our shuffled "word_columns_df". 131 132 00:11:22,400 --> 00:11:26,230 Now I said that these numbers here would refer to the index. 132 133 00:11:26,240 --> 00:11:26,470 Right. 133 134 00:11:26,480 --> 00:11:30,690 Our document IDs, and we can add this label very, very easily. 134 135 00:11:30,950 --> 00:11:40,610 So we'll say "X_train.index.name = 'DOC_ 135 136 00:11:40,610 --> 00:11:44,290 ID'", Shift+Enter, 136 137 00:11:44,610 --> 00:11:51,210 we'll have our index name show up in the output right here.
But say we wanted to add this index name 137 138 00:11:51,720 --> 00:11:55,830 to both the training dataset and the testing dataset. 138 139 00:11:55,860 --> 00:11:57,860 So "X_test" as well. 139 140 00:11:58,200 --> 00:12:03,900 We can actually do that in the very same line of code by inserting another equal sign before we assign 140 141 00:12:03,900 --> 00:12:14,880 this string value here and write "X_test.index.name". In this case we're setting 141 142 00:12:14,880 --> 00:12:22,950 both of these index names equal to "DOC_ID". Let me hit Shift+Enter and show you that the document 142 143 00:12:22,980 --> 00:12:32,310 IDs actually match up after shuffling. Let's pull up "y_train.head()" and there we see 143 144 00:12:32,310 --> 00:12:38,610 the first five rows of our target values. You can see here that the document IDs match up with what 144 145 00:12:38,610 --> 00:12:44,450 we see in the training dataset. Now of course, that should be true regardless of whether this says 145 146 00:12:44,450 --> 00:12:51,720 X_train or X_test, the order of the features and the target values will be the same, 146 147 00:12:52,770 --> 00:12:54,450 but the proof is in the pudding, right? 147 148 00:12:54,540 --> 00:13:03,930 "X_test" looks like this and "y_test" looks like this. In the next lesson 148 149 00:13:04,140 --> 00:13:11,550 we're going to create a sparse matrix from our training dataset and we're gonna do that by transforming 149 150 00:13:11,550 --> 00:13:14,840 the values in our dataframe. I'll see you there.