0
1
00:00:00,780 --> 00:00:07,860
Now it's time to apply the cleaning and tokenization to all the 5800 messages.
1

2
00:00:07,860 --> 00:00:15,540
But before we jump right into that, we have to review and practice a few key skills and that involves
2

3
00:00:15,960 --> 00:00:19,260
slicing dataframes and creating subsets,
3

4
00:00:19,410 --> 00:00:26,310
but not only that. We have to find a way to call our clean messages function on all the e-mail bodies
4

5
00:00:26,490 --> 00:00:27,850
stored in our dataframe.
5

6
00:00:29,050 --> 00:00:37,260
I'm going to change this cell to markdown and quickly commemorate what we're doing - we're going to "Apply Cleaning
6

7
00:00:37,440 --> 00:00:49,260
and Tokenisation to all messages", but before we dive into cleaning 5800 email bodies, it'll be handy
7

8
00:00:49,260 --> 00:00:53,570
to review these techniques for slicing pandas dataframes and series.
8

9
00:00:53,580 --> 00:00:57,540
So I'm going to add a little subheading here that reads
9

10
00:00:59,810 --> 00:01:07,160
"Slicing Dataframes and Series & Creating Subsets".
10

11
00:01:08,260 --> 00:01:08,680
Okay.
11

12
00:01:08,690 --> 00:01:17,390
So recently we've used the "at" attribute to select a particular cell or a scalar from our dataframe.
12

13
00:01:17,390 --> 00:01:23,880
This was a very efficient way of accessing a particular entry in the dataframe and the way this "at"
13

14
00:01:23,960 --> 00:01:31,870
attribute worked was that we had to specify both an index name and a column name.
14

15
00:01:31,880 --> 00:01:38,390
Now I realize the column name here is a string and the index name here is a number, but that's because we've
15

16
00:01:38,390 --> 00:01:42,610
given our index a numerical value here.
16

17
00:01:42,710 --> 00:01:50,840
The important thing to remember is that the "at" attribute works off names. If you'd like to work off a
17

18
00:01:50,840 --> 00:01:56,810
position, say at position 1, position 2, position 3, then there is an alternative attribute that
18

19
00:01:56,900 --> 00:01:58,330
you can use instead.
19

20
00:01:58,370 --> 00:02:00,160
Let me show you what I mean.
20

21
00:02:00,440 --> 00:02:11,790
If we say "data.iat[2,1]", we get the entry at position number
21

22
00:02:11,790 --> 00:02:16,320
2 and column number 1. Column number 1
22

23
00:02:16,320 --> 00:02:22,640
in this case is our "Messages" column. Column number 0 was our "Category" column,
23

24
00:02:22,860 --> 00:02:26,930
Column number 2 is our "Filename" column.
24

25
00:02:27,240 --> 00:02:30,840
So with "at" you're working off a name to get a particular entry
25

26
00:02:30,990 --> 00:02:35,500
and with "iat" you're working off a location or a position.
26

27
00:02:35,670 --> 00:02:42,030
This is very, very useful for selecting a single message or a single entry in the dataframe, especially
27

28
00:02:42,030 --> 00:02:48,030
if you're working in a loop or your code is iterating over a certain part of your dataframe with a
28

29
00:02:48,030 --> 00:02:49,440
numerical value,
29

30
00:02:49,470 --> 00:02:56,130
this "iat" attribute will make your life a lot easier than the "at" attribute which works off the names.
30

31
00:02:56,130 --> 00:03:02,640
Now what if you wanted to select a subset of the dataframe instead?
31

32
00:03:02,910 --> 00:03:06,210
What if you wanted to select more than one value, more than one row
32

33
00:03:06,210 --> 00:03:14,580
for example? In this case, you can use the "iloc", iloc attribute,
33

34
00:03:14,660 --> 00:03:24,380
so this is "data.iloc[0:2]".
34

35
00:03:24,680 --> 00:03:31,490
This code here will select the first two entries in our dataframe, the first two rows if you will. If
35

36
00:03:31,490 --> 00:03:33,980
I hit Shift+Enter we can see what they are.
36

37
00:03:34,460 --> 00:03:41,870
If we wanted to select the first five rows, then we would simply substitute the five for a two and here's
37

38
00:03:41,870 --> 00:03:43,160
the challenge.
38

39
00:03:43,160 --> 00:03:49,070
What would you put in between the square brackets if you wanted to select the entries with "DOC_ID"
39

40
00:03:49,490 --> 00:03:53,470
5, 6, 7, 8, 9 and 10?
40

41
00:03:53,600 --> 00:03:54,770
What would you supply here?
41

42
00:03:57,690 --> 00:04:06,610
In that case, you would go for "5:11" and that's because the first entry is at index 0.
42

43
00:04:06,720 --> 00:04:09,270
Now the nice thing about this "iloc" attribute is
43

44
00:04:09,360 --> 00:04:13,650
it works both on dataframes as well as series.
44

45
00:04:13,650 --> 00:04:23,410
So if I had only a single column selected, say "data.MESSAGE", I could still use this attribute "iloc
45

46
00:04:23,440 --> 00:04:31,140
[0:3]" and select only the first three entries in this series.
46

47
00:04:31,210 --> 00:04:35,770
This is very, very handy for creating or selecting a subset.
47

48
00:04:35,950 --> 00:04:40,750
So now that we've covered this let's move on to the next question.
48

49
00:04:40,870 --> 00:04:47,170
Say we want to try out our "clean_message" function on the first three emails out of the
49

50
00:04:47,320 --> 00:04:48,840
5800 odd emails.
50

51
00:04:48,880 --> 00:04:49,660
How would we do it?
51

52
00:04:51,330 --> 00:04:54,370
Well step 1 is selecting these email bodies of course
52

53
00:04:54,510 --> 00:04:56,840
and we've done that right here.
53

54
00:04:56,850 --> 00:04:59,680
Step two is using the pandas
54

55
00:04:59,790 --> 00:05:01,560
"apply" function.
55

56
00:05:01,600 --> 00:05:02,970
Now I absolutely love this function.
56

57
00:05:02,970 --> 00:05:12,290
This is this genius. Say we store these three entries in a series call it "first_emails", set that equal to
57

58
00:05:12,320 --> 00:05:13,390
"data.MESSAGE.
58

59
00:05:13,400 --> 00:05:18,930
iloc[0:3]" and then we can take this series,
59

60
00:05:21,620 --> 00:05:32,100
put a dot after it, use "apply()" and then supply the name of our function which was "clean_
60

61
00:05:32,100 --> 00:05:40,090
message". Note there is no parentheses after "clean_message". We're supplying just
61

62
00:05:40,090 --> 00:05:48,860
the name of our function to the "apply" method which we're calling on our series object here.
62

63
00:05:49,880 --> 00:05:53,050
Let's see what happens. What we get back
63

64
00:05:53,060 --> 00:05:59,820
is kind of a list of lists. We've got all our words in the list split out for our first e-mail.
64

65
00:05:59,840 --> 00:06:01,610
Same thing with our second e-mail.
65

66
00:06:01,890 --> 00:06:06,220
And again we've cut all the words split out for our third e-mail as well.
66

67
00:06:07,510 --> 00:06:13,090
If you're curious what kind of type of object we're dealing with here, you can take a look by just wrapping
67

68
00:06:13,090 --> 00:06:18,970
this in the type function and you see that you still are working with a series.
68

69
00:06:19,330 --> 00:06:20,140
So that's good.
69

70
00:06:20,140 --> 00:06:22,890
That's really, really interesting.
70

71
00:06:22,890 --> 00:06:24,950
Now I'm gonna give this thing a name.
71

72
00:06:24,970 --> 00:06:33,850
I'm going to call it "nested_list" and I'm going to set it equal to the output from this "apply"
72

73
00:06:33,850 --> 00:06:35,290
function.
73

74
00:06:35,510 --> 00:06:42,110
I'm calling it a nested list because effectively I've almost got like a list of lists, right. Each individual
74

75
00:06:42,110 --> 00:06:52,330
entry in my pandas series here is a list, but here's a question: What if I wanted just one list of words? What
75

76
00:06:52,330 --> 00:06:56,270
if I didn't want a list of lists? Let me hit Shift+Enter on this.
76

77
00:06:59,770 --> 00:07:07,260
I'm going to create an empty list here and then just write a for loop, right. A nested one probably. I have to go over
77

78
00:07:07,260 --> 00:07:15,760
all the items in the list first. So I'll say "sublist in nested_list", that's my first part of my loop
78

79
00:07:16,740 --> 00:07:24,790
and then my second part of my loop, the inner loop, I'll say "for item in sublist:". The outer loop goes
79

80
00:07:24,790 --> 00:07:32,440
over all the lists and the inner loop goes over all the individual words in each sublist. Put a colon there
80

81
00:07:32,910 --> 00:07:45,530
and then say "flat_list.append(item)". That will pretty much do the job. So if I wanted to
81

82
00:07:45,530 --> 00:07:52,670
check what the length of this flat_list is, then I can use "len(flat_list)". Let's see what
82

83
00:07:52,670 --> 00:07:56,120
we get - 390,
83

84
00:07:56,130 --> 00:08:02,570
right. So that's 390 words now in a single list. If we wanted to take a peek what it
84

85
00:08:02,570 --> 00:08:10,310
looks like and it looks like this, it's just all the email words in one big list. And you know this is
85

86
00:08:10,310 --> 00:08:14,280
certainly one way of doing it, but let me show you an alternative.
86

87
00:08:14,390 --> 00:08:20,820
Let me show you how this would look like with something called Python list comprehension.
87

88
00:08:21,050 --> 00:08:29,000
So I'm going to comment all of this out. If you select the text like me and then press Control + /,
88

89
00:08:29,780 --> 00:08:35,330
this is the shortcut for commenting out an entire block of code. If you're on a Mac of course, it'll
89

90
00:08:35,330 --> 00:08:42,940
be Command + /. But for everybody else Control + /. The Python list comprehension
90

91
00:08:42,940 --> 00:08:50,260
syntax for accomplishing exactly the same thing as with these two nested loops looks like this.
91

92
00:08:50,260 --> 00:08:58,610
So I'll still use flat_list as my variable and I'll have square brackets and then I'm basically going to
92

93
00:08:58,610 --> 00:09:03,490
move my loop inside these two square brackets. I'll say "item for
93

94
00:09:07,100 --> 00:09:16,050
sublist in nested_list for item in sublist".
94

95
00:09:17,020 --> 00:09:22,390
This here you'll recognize as the outer loop "for sublist in nested_list"
95

96
00:09:22,510 --> 00:09:29,470
and this here you'll recognize as the inner loop "for item in sublist".
96

97
00:09:29,470 --> 00:09:35,620
This first bit of code is what appends the item to the list.
97

98
00:09:35,680 --> 00:09:38,040
Let's see what happens when I've run this again.
98

99
00:09:38,050 --> 00:09:42,550
I in fact get the exact same result.
99

100
00:09:42,580 --> 00:09:48,330
So now you've got another application of this Python list comprehension syntax.
100

101
00:09:48,550 --> 00:09:54,350
You can often use it instead of nesting these for loops. Now you've probably put two and two together
101

102
00:09:54,800 --> 00:10:03,120
on how to call our "clean_messages" function on all the 5800 odd items. We're gonna
102

103
00:10:03,130 --> 00:10:06,430
do something very, very similar to what we did a minute ago.
103

104
00:10:06,610 --> 00:10:23,020
We'll create a nested list and we'll set it equal to "data.MESSAGE.apply(clean_message
104

105
00:10:23,320 --> 00:10:37,310
no_html)". This here will use "apply" on all the messages in the dataframe. Now before I hit Shift+Enter and
105

106
00:10:37,310 --> 00:10:43,730
execute this code, I'll use a little bit of Jupyter magic. I'll put 2 percent signs here and then use
106

107
00:10:43,730 --> 00:10:52,280
that keyword "time" and what this will do is it'll spit out how long this computation will take.
107

108
00:10:52,310 --> 00:10:58,280
So this is kind of like a little bit of a benchmarking tool for looking at the performance of your Python
108

109
00:10:58,280 --> 00:10:59,760
code.
109

110
00:10:59,810 --> 00:11:06,890
It'll also help you see just how long it takes for my code to execute on my computer and you can compare
110

111
00:11:06,890 --> 00:11:08,560
this to your own.
111

112
00:11:08,600 --> 00:11:15,290
The first thing I'm seeing here is that our Beautiful Soup package seems to think that we're trying
112

113
00:11:15,290 --> 00:11:18,570
to open a Web site.
113

114
00:11:18,600 --> 00:11:20,190
Now this is definitely not the case.
114

115
00:11:20,300 --> 00:11:24,320
So this user warning is a little bit unnecessary.
115

116
00:11:24,320 --> 00:11:31,220
Our code still working, nothing's breaking, but this is something that Beautiful Soup has included in
116

117
00:11:31,220 --> 00:11:36,710
their code as a suggestion for people who are looking to open Web sites.
117

118
00:11:38,070 --> 00:11:44,280
The other thing that you can see is that in order to apply this function to a all my 5800
118

119
00:11:44,280 --> 00:11:49,400
odd messages is that my computer ran for like a good minute.
119

120
00:11:49,560 --> 00:11:52,430
It got there in the end, but it definitely did some work.
120

121
00:11:54,240 --> 00:11:58,080
So let's check out the head of our nested_list.
121

122
00:11:58,080 --> 00:12:02,310
Let's look at the first few lines. Looks good.
122

123
00:12:03,610 --> 00:12:05,180
What about the tail?
123

124
00:12:05,290 --> 00:12:06,740
The last few lines.
124

125
00:12:06,970 --> 00:12:08,100
Brilliant.
125

126
00:12:08,170 --> 00:12:09,010
That also looks good.
126

127
00:12:10,290 --> 00:12:11,490
Nothing unexpected.