0
1
00:00:00,330 --> 00:00:00,830
All right.
1

2
00:00:00,840 --> 00:00:08,370
So in this video I want to show you how to drop certain undesirable rows from a dataframe
2

3
00:00:08,370 --> 00:00:16,620
and I also want to show you how we can change up the index on our dataframe to add document IDs
3

4
00:00:16,770 --> 00:00:18,210
to track these emails,
4

5
00:00:18,300 --> 00:00:25,020
so numbering these emails sequentially rather than with these unintelligible file names that we've got
5

6
00:00:25,020 --> 00:00:26,880
by default.
6

7
00:00:27,000 --> 00:00:35,880
So our first goal is dropping these three rows here with the index by the name of "commands" - "cmds" and
7

8
00:00:35,940 --> 00:00:41,490
dropping the index with the name ".DS_Store".
8

9
00:00:41,580 --> 00:00:52,500
So I'll add a little markdown cell here and that's going to read "Remove System File Entries from Dataframe"
9

10
00:00:53,820 --> 00:00:58,230
and the way we're gonna do this is using the dataframe's drop function.
10

11
00:00:58,260 --> 00:01:02,720
So this is a method that takes a couple of arguments,
11

12
00:01:02,730 --> 00:01:03,470
right.
12

13
00:01:03,480 --> 00:01:06,580
It needs to know which file names to drop.
13

14
00:01:06,630 --> 00:01:08,880
So I'm going to supply that as a list,
14

15
00:01:08,880 --> 00:01:09,870
square brackets,
15

16
00:01:10,050 --> 00:01:19,700
single quotes and then "'cmds', '.DS_Store'".
16

17
00:01:19,740 --> 00:01:25,260
Now remember, no typos here and you'll succeed.
17

18
00:01:25,260 --> 00:01:34,050
So as it is, this will drop the three rows with this index and it will drop the one row with this index.
18

19
00:01:34,200 --> 00:01:42,780
What we could do is we could overwrite our data dataframe with this modified version here, so we can
19

20
00:01:42,780 --> 00:01:49,350
drop the rows and we can overwrite the data that's stored in our dataframe.
20

21
00:01:49,500 --> 00:01:57,300
But the alternative to doing it this way is supplying another argument to this method and that's called
21

22
00:01:57,600 --> 00:02:03,150
"inplace". "inplace" is set to False by default.
22

23
00:02:03,300 --> 00:02:10,740
And if we set it to True then we don't have to do this, we can just write the method like so and it will
23

24
00:02:10,800 --> 00:02:13,410
update our dataframe.
24
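
To recap the two options just described, here is a minimal sketch. It assumes a small made-up dataframe standing in for the real email data; the file names, messages and the "cleaned" variable are hypothetical, and only "data", "drop" and "inplace" come from the video.

```python
import pandas as pd

# Toy stand-in for the email dataframe (contents are made up for illustration).
data = pd.DataFrame(
    {
        "MESSAGE": ["hello", "", "", "", "", "bye"],
        "CATEGORY": [1, 1, 0, 0, 1, 0],
    },
    index=["0001.abc", "cmds", "cmds", "cmds", ".DS_Store", "0002.def"],
)

# Option 1: drop() returns a new dataframe and leaves "data" untouched,
# so the result has to be stored somewhere (or assigned back to "data").
cleaned = data.drop(["cmds", ".DS_Store"])

# Option 2: inplace=True modifies "data" directly and returns None.
data.drop(["cmds", ".DS_Store"], inplace=True)

# Every row carrying one of the listed index labels is gone.
print(data.shape)
```

Either way all three "cmds" rows and the one ".DS_Store" row are removed, because drop matches on the index label rather than on row position.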

25
00:02:13,410 --> 00:02:20,610
Now before I hit Shift+Enter on this cell, let me copy this line of code here and paste it here, just to
25

26
00:02:20,610 --> 00:02:25,230
make sure that this row does indeed disappear.
26

27
00:02:25,230 --> 00:02:26,730
Let's take a look.
27

28
00:02:26,760 --> 00:02:27,320
All right.
28

29
00:02:27,360 --> 00:02:33,390
So the entire thing shifted up so we're not seeing the same emails right here.
29

30
00:02:33,450 --> 00:02:36,340
We're having a different number of rows.
30

31
00:02:36,540 --> 00:02:37,590
So it seems to have worked.
31

32
00:02:38,250 --> 00:02:42,490
Let's take a look at what the shape is of our dataframe now,
32

33
00:02:42,660 --> 00:02:48,960
"data.shape" gives us 5796,
33

34
00:02:48,960 --> 00:02:55,290
so 4 entries have been dropped and this is how we've done it.
34

35
00:02:55,350 --> 00:02:56,390
Brilliant.
35

36
00:02:56,400 --> 00:02:58,910
Now let's replace these index names.
36

37
00:02:58,950 --> 00:03:02,460
Let's change these index names to something else.
37

38
00:03:02,460 --> 00:03:03,900
Maybe just some numbers, right?
38

39
00:03:03,900 --> 00:03:10,860
So let's just number our rows from 1 to 5796.
39

40
00:03:10,860 --> 00:03:22,820
I'll quickly add a markdown cell and then I'll say "Add Document IDs to Track Emails in Dataset".
40

41
00:03:22,880 --> 00:03:25,980
We're going to be doing some manipulation of these emails,
41

42
00:03:26,120 --> 00:03:33,980
so it's going to be quite nice to be able to have a specific ID associated with each specific email
42

43
00:03:34,190 --> 00:03:37,120
so we can pull it up and refer to it later on.
43

44
00:03:38,060 --> 00:03:47,630
Let's generate our document IDs first, so I'll create a variable called "document_ids" and
44

45
00:03:47,870 --> 00:03:56,330
this will be equal to the values, say 0 to 5796.
45

46
00:03:56,330 --> 00:04:04,190
So we can use the in-built range function from Python starting from zero and going through the length
46

47
00:04:04,730 --> 00:04:06,630
of our dataframe,
47

48
00:04:06,650 --> 00:04:13,970
so "len(data.index)" will give us the length of our dataframe.
48

49
00:04:13,970 --> 00:04:21,980
Let's take a look at what this looks like, "document_ids" is now a range from zero to 
49

50
00:04:21,980 --> 00:04:23,720
5796.
50

51
00:04:23,720 --> 00:04:31,010
We're not printing out the individual numbers here, but that actually doesn't change how we can use this
51

52
00:04:31,340 --> 00:04:32,050
object.
52

53
00:04:32,240 --> 00:04:43,690
So we can create a new column say, "data['DOC_ID']", so doc ID
53

54
00:04:44,280 --> 00:04:54,370
and set that equal to our document IDs and if we take a look at what this actually looks like, then
54

55
00:04:55,150 --> 00:05:02,290
we would get something like so, we'd still have our file names as the index, but now we have a column
55

56
00:05:02,710 --> 00:05:09,150
with all the document IDs from zero to 5795.
56

57
00:05:09,160 --> 00:05:09,510
Right.
57

58
00:05:09,550 --> 00:05:10,210
Ninety five,
58

59
00:05:10,210 --> 00:05:11,130
why?
59

60
00:05:11,140 --> 00:05:16,780
Because it's this length minus one, right?
60

61
00:05:16,890 --> 00:05:20,010
There's 5796 entries,
61

62
00:05:20,160 --> 00:05:27,060
but since we start counting from zero, the last entry is 5795.
62
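
The counting-from-zero point can be checked on a tiny example. This is only a sketch, assuming a three-row made-up dataframe instead of the real 5796 emails; the file names and messages are hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned email dataframe (three made-up rows).
data = pd.DataFrame(
    {"MESSAGE": ["first", "second", "third"], "CATEGORY": [1, 0, 0]},
    index=["0001.abc", "0002.def", "0003.ghi"],
)

# range() stops one short of its end value, so the IDs run from 0 to length - 1.
document_ids = range(0, len(data.index))

# A range can be assigned to a column directly; pandas expands it row by row.
data["DOC_ID"] = document_ids

print(document_ids)
print(data["DOC_ID"].tolist())
```

With three rows the last ID is 2, in the same way that 5796 rows end at 5795.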

63
00:05:27,060 --> 00:05:28,340
All right.
63

64
00:05:28,460 --> 00:05:30,740
So this is what our new column looks like.
64

65
00:05:30,740 --> 00:05:32,240
Fair enough.
65

66
00:05:32,240 --> 00:05:36,210
Now let's shift all these file names into another column.
66

67
00:05:36,350 --> 00:05:46,800
I'll say "data['FILE_NAME']", all in caps, is equal to
67

68
00:05:46,920 --> 00:05:51,270
"data.index", the index being these filenames right here.
68

69
00:05:51,300 --> 00:05:59,830
This will create a new column with all these file names. So if I say "data.head()" now what we see is
69

70
00:05:59,890 --> 00:06:06,600
we've got our index, we've got our category here, we've got the message column, we've got the document
70

71
00:06:06,690 --> 00:06:13,810
ID column and we've got the filename column which at the moment is the same as our index.
71

72
00:06:13,860 --> 00:06:21,620
However, what I'm going to do now is I'm going to set my index to be equal to my document IDs and the
72

73
00:06:21,620 --> 00:06:29,440
way we can do this is simply by saying "data = data.set_index()",
73

74
00:06:29,480 --> 00:06:37,880
so this is a method on our dataframe and we simply specify 'DOC_ID' in single quotes.
74

75
00:06:38,030 --> 00:06:46,070
If I hit Shift+Enter now this will update, but similar to our drop method which had this inplace parameter
75

76
00:06:46,070 --> 00:06:47,410
here that we can set to
76

77
00:06:47,420 --> 00:06:55,220
True, we can do the very, very same thing with set_index; Shift+Tab on my keyboard brings up the quick
77

78
00:06:55,220 --> 00:07:02,360
documentation and I can see here that we can change this here to True as well and then get rid of this
78

79
00:07:02,360 --> 00:07:05,860
bit of code and write a comma here
79

80
00:07:06,110 --> 00:07:09,470
and then "inplace=True".
80

81
00:07:09,650 --> 00:07:14,450
This will accomplish the exact same thing. Here's what it looks like.
81

82
00:07:16,810 --> 00:07:19,810
Now we've got our document IDs as our index.
82

83
00:07:19,810 --> 00:07:27,190
We've got our category, 1 for spam, 0 for non-spam; our email bodies in the message column and our file
83

84
00:07:27,190 --> 00:07:27,850
names
84

85
00:07:27,970 --> 00:07:33,800
we've preserved as a separate column in our dataframe. Fantastic.
85
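
Putting the last few steps together, a minimal sketch of the re-indexing might look like this. It again assumes a small made-up dataframe; the file names and messages are hypothetical, while the column names 'DOC_ID' and 'FILE_NAME' follow the video.

```python
import pandas as pd

# Toy stand-in for the email dataframe, indexed by file name (made-up values).
data = pd.DataFrame(
    {"MESSAGE": ["first", "second", "third"], "CATEGORY": [1, 0, 0]},
    index=["0001.abc", "0002.def", "0003.ghi"],
)

# Add the sequential document IDs as an ordinary column.
data["DOC_ID"] = range(0, len(data.index))

# Copy the file names out of the index so they are not lost.
data["FILE_NAME"] = data.index

# Promote DOC_ID to the index. This is the same as
# data = data.set_index("DOC_ID"), just without the reassignment.
data.set_index("DOC_ID", inplace=True)

print(data.tail())
```

After set_index, 'DOC_ID' is no longer a regular column: it becomes the index, and 'FILE_NAME' keeps the original file names alongside the category and message.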

86
00:07:33,980 --> 00:07:38,820
Now let's quickly check what the end of our dataframe looks like,
86

87
00:07:38,820 --> 00:07:45,620
"data.tail()" we can see the last five rows. The last row has the document ID 
87

88
00:07:45,620 --> 00:07:47,120
5795
88

89
00:07:47,130 --> 00:07:49,470
that holds a message body starting with the words
89

90
00:07:49,490 --> 00:07:51,410
"If you run". All right,
90

91
00:07:51,440 --> 00:07:53,500
so we've done a lot of data cleaning now.
91

92
00:07:53,570 --> 00:07:59,480
We've extracted our relevant data from the raw text files, namely the email bodies.
92

93
00:07:59,480 --> 00:08:01,940
We've converted them into a dataframe.
93

94
00:08:01,940 --> 00:08:07,030
We've checked for empty emails and we've checked for null or missing values as well.
94

95
00:08:07,400 --> 00:08:13,400
And then we've dropped all the rows that didn't contain an email body from our pandas dataframe.