1
00:00:00,180 --> 00:00:06,850
In this lesson we're going to discuss how to remove the HTML tags from our emails.

2
00:00:06,930 --> 00:00:14,380
What do I mean by HTML tags? Well, HTML is what makes an email look pretty.

3
00:00:14,460 --> 00:00:18,900
If I look at the newsletter from ProductHunt here, I've got wonderful formatting,

4
00:00:18,900 --> 00:00:26,890
I've got images embedded, I've got animations embedded and it just looks incredibly well put together.

5
00:00:27,390 --> 00:00:33,420
The same email message in plain text will look like what I've got here on the left. On the right,

6
00:00:33,420 --> 00:00:35,400
I have the rich formatting. On the left,

7
00:00:35,430 --> 00:00:37,680
I have the plain text.

8
00:00:37,680 --> 00:00:46,910
So the purpose of HTML in emails and on websites is to add both structure and some formatting.

9
00:00:46,950 --> 00:00:52,830
Let me quickly show you how to add some basic formatting using HTML and this will also show us how HTML

10
00:00:52,830 --> 00:00:55,330
tags basically work.

11
00:00:55,350 --> 00:01:03,030
Say we have a regular plain text that reads "Do not reply", then this text will be displayed in the email

12
00:01:03,090 --> 00:01:10,210
in a standard default way. However, if we wanted to make this text stand out, you might consider using

13
00:01:10,250 --> 00:01:18,100
HTML to make this text bold and you can do this by surrounding this piece of text with a tag, namely

14
00:01:18,280 --> 00:01:27,370
the "b" tag, "b" for bold. The way that HTML works is that there is a beginning and an ending tag and that way

15
00:01:27,640 --> 00:01:33,680
we can mark where the bold text begins and where the bold text should end.
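
To make the opening and closing tag idea concrete, here is a minimal sketch in Python. The helper name `wrap_in_tag` is made up for illustration; only the "b" tag and the "Do not reply" text come from the lesson.

```python
def wrap_in_tag(text: str, tag: str) -> str:
    """Surround text with an opening <tag> and a closing </tag>."""
    return f"<{tag}>{text}</{tag}>"


plain_text = "Do not reply"

# The opening tag marks where the bold text begins,
# the closing tag (with the slash) marks where it ends.
bold_html = wrap_in_tag(plain_text, "b")
print(bold_html)
```

The same helper would work for other paired tags such as "p" or "h1", since they follow the same begin and end pattern.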

16
00:01:33,760 --> 00:01:39,260
One of the best places to see HTML in action of course is on the web.

17
00:01:39,370 --> 00:01:46,330
If you go to a website called "example.com" and then right-click on the page and go to "View Source",

18
00:01:47,940 --> 00:01:50,080
you will get something like this.

19
00:01:50,130 --> 00:01:55,920
This is what the HTML code that's behind the Web site actually looks like. This is what the developer

20
00:01:55,920 --> 00:01:58,460
for the website will have actually written.

21
00:01:58,770 --> 00:02:07,290
And thanks to our browser, this code here is rendered like so. Looking at the HTML behind the Web

22
00:02:07,290 --> 00:02:09,970
site allows us to see a couple of things.

23
00:02:09,990 --> 00:02:18,750
For example, we can see here that there's a title called "Example Domain". "Example Domain" sits in between

24
00:02:18,750 --> 00:02:22,860
two tags, an opening tag and a closing tag.

25
00:02:22,860 --> 00:02:31,560
And what this little bit of code does is it makes "Example Domain" show up here on our tab bar, for example.

26
00:02:31,570 --> 00:02:38,640
Another neat little trick that you can try out with this particular Web site or any Web site is inspecting

27
00:02:38,670 --> 00:02:47,690
particular elements. If I right-click and then select "Inspect Element", I will see something like this.

28
00:02:48,560 --> 00:02:52,470
I'm going to move this over slightly, move this over slightly.

29
00:02:53,700 --> 00:03:01,000
And now what I can do is I can hover over a particular element from the HTML code and my browser will

30
00:03:01,000 --> 00:03:05,220
highlight which bits of the website the piece of code is referring to.

31
00:03:05,230 --> 00:03:12,990
So for example, right now I'm hovering over a heading which is marked with the "h1" tag, "h" for heading.

32
00:03:13,360 --> 00:03:21,130
If I move my mouse down a little bit, then I get to a paragraph. I can expand this here and select the

33
00:03:21,130 --> 00:03:22,240
text inside,

34
00:03:22,510 --> 00:03:26,110
but the paragraph as a whole is this bit here.

35
00:03:26,380 --> 00:03:33,430
And again, the HTML tag for this bit here has a "p" for paragraph at the beginning and at the end.

36
00:03:33,580 --> 00:03:40,270
So this is our heading, this is our paragraph and both the heading and the paragraph are contained inside

37
00:03:40,570 --> 00:03:45,120
the body and the body refers to the whole thing here.
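
The nesting described here (heading and paragraph inside the body) can be sketched with Python's built-in `html.parser` module. The markup below is a simplified, hand-written stand-in for the page, not its actual source, and the paragraph text and the `TagOutline` class are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Simplified markup modelled on the page in the lesson (an assumption,
# not the site's exact source).
page = """<html>
<head><title>Example Domain</title></head>
<body>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples.</p>
</body>
</html>"""


class TagOutline(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagOutline()
parser.feed(page)
print("\n".join(parser.outline))
```

In the printed outline, "h1" and "p" are indented under "body", which mirrors what the browser's element inspector shows.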

38
00:03:45,220 --> 00:03:49,420
So that's a very short introduction to HTML tags.

39
00:03:49,420 --> 00:03:53,870
Let's see how HTML is used inside our corpus of emails.

40
00:03:54,070 --> 00:03:59,380
One of the emails that I've looked at in a bit more detail is the one with document ID number 2.

41
00:03:59,680 --> 00:04:04,440
This is the email with the file name beginning "00214".

42
00:04:04,610 --> 00:04:09,650
Let me show you what the body of this email looks like in Jupyter notebook.

43
00:04:09,650 --> 00:04:16,940
This email is at position number 2 in our message column, so we can access this email with "data.at

44
00:04:17,060 --> 00:04:25,610
[2, 'MESSAGE']", with the string "MESSAGE" in all caps.

45
00:04:25,610 --> 00:04:28,650
And what we get is something like this.
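
As a rough sketch of this lookup, the snippet below builds a tiny stand-in dataframe and reads one value with `.at`. Only the "MESSAGE" column name and position 2 come from the lesson; the other column, the file names and the message text are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's dataframe: the rows here are
# made up, only the "MESSAGE" column name matches the lesson.
data = pd.DataFrame(
    {
        "MESSAGE": ["first body", "second body", "<b>Do not reply</b>"],
        "FILE_NAME": ["a.txt", "b.txt", "c.txt"],
    }
)

# .at looks up a single value by row label and column name.
message = data.at[2, "MESSAGE"]
print(message)
```

Because `.at` takes both the row label and the column name, it returns one value directly rather than a whole row or column.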

46
00:04:28,850 --> 00:04:33,270
Now personally, this isn't the most helpful formatting here that we're getting.

47
00:04:33,290 --> 00:04:39,230
So let me show you what this email would look like in my Atom text editor where the formatting is a bit

48
00:04:39,230 --> 00:04:42,770
more user friendly. At the top

49
00:04:42,770 --> 00:04:45,370
we've got the email header. Scrolling down,

50
00:04:45,440 --> 00:04:50,560
we get to the email body and at the bottom here with the syntax highlighting,

51
00:04:50,750 --> 00:04:57,970
you can easily spot the HTML tags. If I scroll down a bit further I can show you

52
00:04:57,990 --> 00:05:06,730
there is a paragraph in this email and you've got the "Do not reply" part that's going to show up in bold

53
00:05:07,270 --> 00:05:12,370
due to the fact that it is surrounded by these bolding HTML tags.

54
00:05:12,490 --> 00:05:19,660
Now these are the HTML tags that we're going to remove from our message bodies.

55
00:05:19,660 --> 00:05:24,520
We're going to clean our data in such a way that these HTML tags are no longer present.

56
00:05:24,700 --> 00:05:30,430
And the reason we're doing this is because we're interested in only the actual words for now, as well

57
00:05:30,430 --> 00:05:36,430
as the text with our bag-of-words approach in our naive Bayes classifier.

58
00:05:36,520 --> 00:05:42,040
So in a way we're gonna be treating these HTML tags just as we were treating punctuation, namely we're

59
00:05:42,040 --> 00:05:45,740
going to remove them. Back in Jupyter

60
00:05:45,760 --> 00:05:55,390
let me add a markdown cell here and that markdown cell is going to read "Removing HTML tags from

61
00:05:55,630 --> 00:06:00,520
Emails". Using this "at" property of the dataframe,

62
00:06:00,520 --> 00:06:09,130
we've had a very efficient way of looking up and accessing a single value in the pandas dataframe.

63
00:06:09,130 --> 00:06:17,020
This is where we specified the index name and the column name. When it comes to stripping out the HTML

64
00:06:17,100 --> 00:06:19,370
tags from this particular email,

65
00:06:19,590 --> 00:06:28,170
all the heavy lifting will be done for us by a Python module called Beautiful Soup. At the top with our

66
00:06:28,170 --> 00:06:39,100
notebook imports, we're going to write "from bs4 import BeautifulSoup". Then we'll hit Shift+Enter

67
00:06:39,100 --> 00:06:47,350
here, scroll back down and just inside this cell here where we've accessed a cell in our dataframe, we'll

68
00:06:47,350 --> 00:06:56,020
create a variable called "soup" and set it equal to "BeautifulSoup()", then we'll take the code

69
00:06:56,020 --> 00:07:06,900
that we've just written and just cut it, paste it inside here, put a comma after it and then supply a string

70
00:07:06,990 --> 00:07:16,860
called "html.parser". What I've just done is supply two arguments: the first one is the text that

71
00:07:16,860 --> 00:07:24,570
I would like to parse and the second one is the parser that I would like to use. Now, the beautiful thing

72
00:07:24,570 --> 00:07:30,090
about Python is that it comes with an HTML parser that is ready to go.

73
00:07:30,090 --> 00:07:37,470
That is why we can just tell Beautiful Soup to use the built-in parser in Python with this string here.
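
Here is a minimal sketch of that two-argument call. It assumes the `bs4` package is installed, and it uses a short made-up HTML string in place of the real email body from the dataframe.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the email body that the lesson reads from the dataframe.
html_message = "<html><body><p><b>Do not reply</b> to this message.</p></body></html>"

# First argument: the text to parse.
# Second argument: the parser to use, here Python's built-in HTML parser.
soup = BeautifulSoup(html_message, "html.parser")

# The parsed object lets us reach individual tags by name.
print(soup.b)
```

Using "html.parser" avoids installing an extra parser library, since it ships with Python.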

74
00:07:38,170 --> 00:07:44,790
Now what we can do is print out the formatted version of this email, so let's write a print statement

75
00:07:45,930 --> 00:07:48,150
and then supply the following argument.

76
00:07:48,300 --> 00:07:51,090
We're gonna say "soup.prettify

77
00:07:54,020 --> 00:08:03,500
()" and hit Shift+Enter. What we see now is a prettified version of the original text.

78
00:08:03,500 --> 00:08:10,050
So this is closer to what we saw in the Atom text editor that I used earlier. If I make it a bit larger

79
00:08:10,060 --> 00:08:12,490
like so, you can see it a bit better.

80
00:08:12,490 --> 00:08:13,300
There we go.

81
00:08:14,540 --> 00:08:20,390
The only thing Jupyter doesn't do here is the syntax highlighting on our tags, but other than that

82
00:08:20,870 --> 00:08:27,490
with the indentation you can actually tell which parts are HTML and it's a lot more readable.
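
A small sketch of the prettify step, again assuming `bs4` is installed and using a short made-up snippet rather than the real email. The exact whitespace in the output may vary between Beautiful Soup versions, so no expected output is shown.

```python
from bs4 import BeautifulSoup

# Made-up one-line HTML snippet standing in for the email body.
html_message = "<p><b>Do not reply</b></p>"

soup = BeautifulSoup(html_message, "html.parser")

# prettify() returns a string with the tags spread over several
# indented lines, which makes the nesting easier to read.
pretty = soup.prettify()
print(pretty)
```

Note that `prettify()` only changes the layout for reading. The tags themselves are all still there.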

83
00:08:27,670 --> 00:08:28,060
All right,

84
00:08:28,090 --> 00:08:36,890
so we've seen how Beautiful Soup can prettify a piece of text that contains HTML but it can also remove all

85
00:08:36,890 --> 00:08:37,910
our tags.

86
00:08:38,130 --> 00:08:45,380
And this is actually the primary thing we're gonna be using it for. So just below the cell,

87
00:08:45,930 --> 00:08:56,060
let's remove all the HTML and we can do that by calling the "get_text" method on our soup object,

88
00:08:56,180 --> 00:09:06,710
so "soup.get_text()" will remove our HTML. Now our output will no longer have any

89
00:09:06,710 --> 00:09:11,530
HTML in it and you can verify this by looking at the output here.

90
00:09:11,630 --> 00:09:17,750
So for example the bolding tags that used to surround "Do not reply" have disappeared.
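
To see the tag removal on a small scale, here is a minimal sketch, assuming `bs4` is installed. The HTML string is a made-up stand-in for the email body; only the bold "Do not reply" part mirrors the lesson.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the email body, with the bold "Do not reply" part.
html_message = "<p><b>Do not reply</b> to this email.</p>"

soup = BeautifulSoup(html_message, "html.parser")

# get_text() keeps only the text between the tags and drops the tags.
text = soup.get_text()
print(text)
```

The words survive while the "p" and "b" tags are gone, which is the form we want for a bag-of-words approach.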

91
00:09:17,860 --> 00:09:19,510
Alright so that's pretty much it.

92
00:09:19,610 --> 00:09:24,620
All the heavy lifting has been done for us by a Python module. In

93
00:09:24,650 --> 00:09:30,560
the next lesson we're finally gonna start tackling more of these emails and what we're gonna do is we're

94
00:09:30,560 --> 00:09:38,910
gonna put all of our work, all of the cleaning, all the pre-processing into some Python functions.

95
00:09:38,980 --> 00:09:39,670
I'll see you there.