0 1 00:00:00,330 --> 00:00:00,830 All right. 1 2 00:00:00,840 --> 00:00:08,370 So in this video I want to show you how to drop certain undesirable rows from a dataframe 2 3 00:00:08,370 --> 00:00:16,620 and I also want to show you how we can change up the index on our dataframe to add document IDs 3 4 00:00:16,770 --> 00:00:18,210 to track these emails, 4 5 00:00:18,300 --> 00:00:25,020 so numbering these emails sequentially rather than with these unintelligible file names that we've got 5 6 00:00:25,020 --> 00:00:26,880 by default. 6 7 00:00:27,000 --> 00:00:35,880 So our first goal is dropping these three rows here with the index by the name of "commands" - "cmds" and 7 8 00:00:35,940 --> 00:00:41,490 dropping the index with the name ".DS_Store". 8 9 00:00:41,580 --> 00:00:52,500 So I'll add a little markdown cell here and that's going to read "Remove System File Entries from Dataframe" 9 10 00:00:53,820 --> 00:00:58,230 and the way we're gonna do this is using the dataframe's drop function. 10 11 00:00:58,260 --> 00:01:02,720 So this is a method that takes a couple of arguments, 11 12 00:01:02,730 --> 00:01:03,470 right. 12 13 00:01:03,480 --> 00:01:06,580 It needs to know which file names to drop. 13 14 00:01:06,630 --> 00:01:08,880 So I'm going to supply that as a list, 14 15 00:01:08,880 --> 00:01:09,870 square brackets, 15 16 00:01:10,050 --> 00:01:19,700 single quotes and then "'cmds', '.DS_Store'". 16 17 00:01:19,740 --> 00:01:25,260 Now remember, no typos here and you'll succeed. 17 18 00:01:25,260 --> 00:01:34,050 So as it is, this will drop the three rows with this index and it will drop the one row with this index. 18 19 00:01:34,200 --> 00:01:42,780 What we could do is we could overwrite our data dataframe with this modified version here, so we can 19 20 00:01:42,780 --> 00:01:49,350 drop the rows and we can overwrite the data that's stored in our dataframe. 
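As a rough sketch of the drop-and-overwrite approach just described: the tiny dataframe below is a hypothetical stand-in for the course's email data (the column names "MESSAGE" and "CATEGORY" and the sample file names are assumptions), with three "cmds" rows and one ".DS_Store" row mixed in.

```python
import pandas as pd

# Hypothetical stand-in for the email dataframe: the index holds file names.
# Column names and sample values are assumptions made for illustration only.
data = pd.DataFrame(
    {
        "MESSAGE": ["hello there", "x", "y", "z", "junk", "buy now"],
        "CATEGORY": [0, 1, 1, 0, 0, 1],
    },
    index=["00001.abc", "cmds", "cmds", "cmds", ".DS_Store", "00002.def"],
)

# drop() matches on index labels and removes every row carrying a listed label.
# By default it returns a new dataframe, so we reassign to keep the result.
data = data.drop(["cmds", ".DS_Store"])

print(data.shape)
```

Because drop() works on labels, all three rows sharing the "cmds" label go in one call, which is why a single list entry is enough for them.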
20 21 00:01:49,500 --> 00:01:57,300 But the alternative to doing it this way is supplying another argument to this method and that's called 21 22 00:01:57,600 --> 00:02:03,150 "inplace". "inplace" is set to False by default. 22 23 00:02:03,300 --> 00:02:10,740 And if we set it to True then we don't have to do this, we just can write the method like so and it will 23 24 00:02:10,800 --> 00:02:13,410 update our dataframe. 24 25 00:02:13,410 --> 00:02:20,610 Now before I hit Shift+Enter on this cell, let me copy this line of code here and paste it here, just to 25 26 00:02:20,610 --> 00:02:25,230 make sure that this row does indeed disappear. 26 27 00:02:25,230 --> 00:02:26,730 Let's take a look. 27 28 00:02:26,760 --> 00:02:27,320 All right. 28 29 00:02:27,360 --> 00:02:33,390 So the entire thing shifted up so we're not seeing the same emails right here. 29 30 00:02:33,450 --> 00:02:36,340 We're having a different number of rows. 30 31 00:02:36,540 --> 00:02:37,590 So it seems to have worked. 31 32 00:02:38,250 --> 00:02:42,490 Let's take a look at what the shape is of our dataframe now, 32 33 00:02:42,660 --> 00:02:48,960 "data.shape" gives us 5796, 33 34 00:02:48,960 --> 00:02:55,290 so 4 entries have been dropped and this is how we've done it. 34 35 00:02:55,350 --> 00:02:56,390 Brilliant. 35 36 00:02:56,400 --> 00:02:58,910 Now let's replace these index names. 36 37 00:02:58,950 --> 00:03:02,460 Let's change these index names to something else. 37 38 00:03:02,460 --> 00:03:03,900 Maybe just some numbers, right. 38 39 00:03:03,900 --> 00:03:10,860 So let's just number our rows from 1 to 5796. 39 40 00:03:10,860 --> 00:03:22,820 I'll quickly add a markdown cell and then I'll say "Add Document IDs to Track Emails in Dataset". 
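The "inplace" variant and the shape check mentioned above could look roughly like this. Again the small dataframe is a hypothetical stand-in (column names and sample values are assumptions), so the row counts here are tiny rather than 5796.

```python
import pandas as pd

# Hypothetical stand-in data; only the 'cmds' and '.DS_Store' labels come from the video.
data = pd.DataFrame(
    {
        "MESSAGE": ["hello there", "x", "y", "z", "junk", "buy now"],
        "CATEGORY": [0, 1, 1, 0, 0, 1],
    },
    index=["00001.abc", "cmds", "cmds", "cmds", ".DS_Store", "00002.def"],
)

rows_before = data.shape[0]

# With inplace=True the dataframe is modified directly and the call returns None,
# so there is no reassignment.
data.drop(["cmds", ".DS_Store"], inplace=True)

rows_after = data.shape[0]
print(rows_before - rows_after)  # number of rows removed
```

Comparing the row count before and after is a quick way to confirm that exactly four entries were removed.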
40 41 00:03:22,880 --> 00:03:25,980 We're going to be doing some manipulation of these emails, 41 42 00:03:26,120 --> 00:03:33,980 so it's going to be quite nice to be able to have a specific ID associated with each specific email 42 43 00:03:34,190 --> 00:03:37,120 so we can pull it up and refer to it later on. 43 44 00:03:38,060 --> 00:03:47,630 Let's generate our document IDs first, so I'll create a variable called "document_ids" and 44 45 00:03:47,870 --> 00:03:56,330 this will be equal to the values, say 0 to 5796. 45 46 00:03:56,330 --> 00:04:04,190 So we can use the in-built range function from Python starting from zero and going through the length 46 47 00:04:04,730 --> 00:04:06,630 of our dataframe, 47 48 00:04:06,650 --> 00:04:13,970 so "len(data.index)" will give us the length of our dataframe. 48 49 00:04:13,970 --> 00:04:21,980 Let's take a look at what this looks like, "document_ids" is now a range from zero to 49 50 00:04:21,980 --> 00:04:23,720 5796. 50 51 00:04:23,720 --> 00:04:31,010 We're not printing out the individual numbers here, but that actually doesn't change how we can use this 51 52 00:04:31,340 --> 00:04:32,050 object. 52 53 00:04:32,240 --> 00:04:43,690 So we can create a new column say, "data['DOC_ID']", so doc ID 53 54 00:04:44,280 --> 00:04:54,370 and set that equal to our document IDs and if we take a look at what this actually looks like, then 54 55 00:04:55,150 --> 00:05:02,290 we would get something like so, we'd still have our file names as the index, but now we have a column 55 56 00:05:02,710 --> 00:05:09,150 with all the document IDs from zero to 5795. 56 57 00:05:09,160 --> 00:05:09,510 Right. 57 58 00:05:09,550 --> 00:05:10,210 Ninety five, 58 59 00:05:10,210 --> 00:05:11,130 why? 59 60 00:05:11,140 --> 00:05:16,780 Because it's this length minus one, right? 60 61 00:05:16,890 --> 00:05:20,010 There's 5796 entries, 61 62 00:05:20,160 --> 00:05:27,060 but since we start counting from zero, the last entry is 5795. 
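A minimal sketch of generating the IDs with range and storing them in a "DOC_ID" column, assuming the length is taken with len(data.index). The three-row dataframe and its "MESSAGE" and "CATEGORY" columns are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned email dataframe (file names as the index).
data = pd.DataFrame(
    {"MESSAGE": ["hello there", "buy now", "meeting at noon"], "CATEGORY": [0, 1, 0]},
    index=["00001.abc", "00002.def", "00003.ghi"],
)

# range(0, n) yields 0, 1, ..., n - 1, so the last ID is the length minus one.
document_ids = range(0, len(data.index))

# A range can be assigned to a column directly; pandas expands it into values.
data["DOC_ID"] = document_ids

print(document_ids)
print(data)
```

The range object prints compactly rather than listing each number, but the column it fills holds the individual IDs, ending at one less than the number of rows.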
62 63 00:05:27,060 --> 00:05:28,340 All right. 63 64 00:05:28,460 --> 00:05:30,740 So this is what our new column looks like. 64 65 00:05:30,740 --> 00:05:32,240 Fair enough. 65 66 00:05:32,240 --> 00:05:36,210 Now let's shift all these file names into another column. 66 67 00:05:36,350 --> 00:05:46,800 I'll say "data['FILE_NAME']", all in caps, is equal to "data. 67 68 00:05:46,920 --> 00:05:51,270 index", the index being these filenames right here. 68 69 00:05:51,300 --> 00:05:59,830 This will create a new column with all these file names. So if I say "data.head()" now what we see is 69 70 00:05:59,890 --> 00:06:06,600 we've got our index, we've got our category here, we've got the message column, we've got the document 70 71 00:06:06,690 --> 00:06:13,810 ID column and we've got the filename column which at the moment is the same as our index. 71 72 00:06:13,860 --> 00:06:21,620 However, what I'm going to do now is I'm going to set my index to be equal to my document IDs and the 72 73 00:06:21,620 --> 00:06:29,440 way we can do this is simply by saying "data = data.set_index()", 73 74 00:06:29,480 --> 00:06:37,880 so this is a method on our dataframe and we simply specify 'DOC_ID' in single quotes. 74 75 00:06:38,030 --> 00:06:46,070 If I hit Shift+Enter now this will update, but similar to our drop method which had this inplace parameter 75 76 00:06:46,070 --> 00:06:47,410 here that we can set to 76 77 00:06:47,420 --> 00:06:55,220 True, we can do the very, very same thing with set_index; Shift+Tab on my keyboard brings up the quick 77 78 00:06:55,220 --> 00:07:02,360 documentation and I can see here that we can change this here to True as well and then get rid of this 78 79 00:07:02,360 --> 00:07:05,860 bit of code and write a comma here 79 80 00:07:06,110 --> 00:07:09,470 and then "inplace = True". 80 81 00:07:09,650 --> 00:07:14,450 This will accomplish the exact same thing. Here's what it looks like. 
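Put together, copying the file names into their own column and then switching the index to the document IDs might be sketched as follows. The sample dataframe is a hypothetical stand-in, and its "MESSAGE" and "CATEGORY" column names are assumptions; "DOC_ID" and "FILE_NAME" follow the names used in the video.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned email dataframe (file names as the index).
data = pd.DataFrame(
    {"MESSAGE": ["hello there", "buy now", "meeting at noon"], "CATEGORY": [0, 1, 0]},
    index=["00001.abc", "00002.def", "00003.ghi"],
)
data["DOC_ID"] = range(0, len(data.index))

# Preserve the file names as a regular column before replacing the index.
data["FILE_NAME"] = data.index

# Make DOC_ID the index. With inplace=True there is no need to write
# data = data.set_index("DOC_ID"); both forms give the same result.
data.set_index("DOC_ID", inplace=True)

print(data.head())
```

Saving the file names first matters because set_index discards the old index; without the "FILE_NAME" column the original names would be lost.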
81 82 00:07:16,810 --> 00:07:19,810 Now we've got our document IDs as our index. 82 83 00:07:19,810 --> 00:07:27,190 We've got our category, 1 for spam, 0 for non-spam; our email bodies in the message column and our file 83 84 00:07:27,190 --> 00:07:27,850 names 84 85 00:07:27,970 --> 00:07:33,800 we've preserved as a separate column in our dataframe. Fantastic. 85 86 00:07:33,980 --> 00:07:38,820 Now let's quickly check what the end of our dataframe looks like, 86 87 00:07:38,820 --> 00:07:45,620 "data.tail()" we can see the last five rows. The last row has the document ID 87 88 00:07:45,620 --> 00:07:47,120 5795 88 89 00:07:47,130 --> 00:07:49,470 that holds a message body starting with the words 89 90 00:07:49,490 --> 00:07:51,410 "If you run". All right, 90 91 00:07:51,440 --> 00:07:53,500 so we've done a lot of data cleaning now. 91 92 00:07:53,570 --> 00:07:59,480 We've extracted our relevant data from the raw text files, namely the email bodies. 92 93 00:07:59,480 --> 00:08:01,940 We've converted them into a dataframe. 93 94 00:08:01,940 --> 00:08:07,030 We've checked for empty emails and we've checked for null or missing values as well. 94 95 00:08:07,400 --> 00:08:13,400 And then we've dropped all the rows that didn't contain an email body from our pandas dataframe.