0 1 00:00:00,180 --> 00:00:06,850 In this lesson we're going to discuss how to remove the HTML tags from our emails. 1 2 00:00:06,930 --> 00:00:14,380 What do I mean by HTML tags? Well, HTML is what makes an email look pretty. 2 3 00:00:14,460 --> 00:00:18,900 If I look at the newsletter from ProductHunt here, I've got wonderful formatting, 3 4 00:00:18,900 --> 00:00:26,890 I've got images embedded, I've got animations embedded and it just looks incredibly well put together. 4 5 00:00:27,390 --> 00:00:33,420 The same email message in plain text will look like what I've got here on the left. On the right, 5 6 00:00:33,420 --> 00:00:35,400 I have the rich formatting. On the left, 6 7 00:00:35,430 --> 00:00:37,680 I have the plain text. 7 8 00:00:37,680 --> 00:00:46,910 So the purpose of HTML in emails and in Web sites is to add both structure and some formatting. 8 9 00:00:46,950 --> 00:00:52,830 Let me quickly show you how to add some basic formatting using HTML and this will also show us how HTML 9 10 00:00:52,830 --> 00:00:55,330 tags basically work. 10 11 00:00:55,350 --> 00:01:03,030 Say we have some regular plain text that reads "Do not reply", then this text will be displayed in the email 11 12 00:01:03,090 --> 00:01:10,210 in a standard default way. However, if you wanted to make this text stand out, you might consider using 12 13 00:01:10,250 --> 00:01:18,100 HTML to make this text bold and you can do this by surrounding this piece of text with a tag, namely 13 14 00:01:18,280 --> 00:01:27,370 the "b" tag, "b" for bold. The way that HTML works is that there is a beginning and an ending tag and that way 14 15 00:01:27,640 --> 00:01:33,680 we can mark where the bold text begins and where the bold text should end. 15 16 00:01:33,760 --> 00:01:39,260 One of the best places to see HTML in action of course is on the web.
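As a rough illustration of the beginning and ending tag idea described above, here is a minimal sketch that feeds a bolded "Do not reply" string to the HTML parser in Python's standard library and records what it finds. The class name TagLogger and the sample string are made up for this illustration and are not part of the lesson's notebook.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each piece of markup the parser encounters, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for a beginning tag such as <b>
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for an ending tag such as </b>
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the plain text between tags
        self.events.append(("text", data))


# Illustrative string, not taken from the email dataset
snippet = "<b>Do not reply</b>"

logger = TagLogger()
logger.feed(snippet)
logger.close()
print(logger.events)
# [('start', 'b'), ('text', 'Do not reply'), ('end', 'b')]
```

The text sits between the two tags, which is how the browser or email client knows where the bold formatting starts and stops.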
16 17 00:01:39,370 --> 00:01:46,330 If you go to a website called "example.com" and then right-click on the page and go to "View Source", 17 18 00:01:47,940 --> 00:01:50,080 you will get something like this. 18 19 00:01:50,130 --> 00:01:55,920 This is what the HTML code behind the Web site actually looks like. This is what the developer 19 20 00:01:55,920 --> 00:01:58,460 for the website will have actually written. 20 21 00:01:58,770 --> 00:02:07,290 And thanks to our browser, this code here is rendered like so. Looking at the HTML behind the Web 21 22 00:02:07,290 --> 00:02:09,970 site allows us to see a couple of things. 22 23 00:02:09,990 --> 00:02:18,750 For example, we can see here that there's a title called "Example Domain". "Example Domain" sits in between 23 24 00:02:18,750 --> 00:02:22,860 two tags, an opening tag and a closing tag. 24 25 00:02:22,860 --> 00:02:31,560 And what this little bit of code does is it makes "Example Domain" show up here on our tab bar, for example. 25 26 00:02:31,570 --> 00:02:38,640 Another neat little trick that you can try out with this particular Web site or any Web site is inspecting 26 27 00:02:38,670 --> 00:02:47,690 particular elements. If I right-click and then select "Inspect Element", I will see something like this. 27 28 00:02:48,560 --> 00:02:52,470 I'm going to move this over slightly, move this over slightly. 28 29 00:02:53,700 --> 00:03:01,000 And now what I can do is I can hover over a particular element from the HTML code and my browser will 29 30 00:03:01,000 --> 00:03:05,220 highlight which bits of the website the piece of code is referring to. 30 31 00:03:05,230 --> 00:03:12,990 So for example, right now I'm hovering over a heading which is marked with the "h1" tag, "h" for heading.
31 32 00:03:13,360 --> 00:03:21,130 If I move my mouse down a little bit, then I get to a paragraph, I can expand this here and select the 32 33 00:03:21,130 --> 00:03:22,240 text inside, 33 34 00:03:22,510 --> 00:03:26,110 but the paragraph as a whole is this bit here. 34 35 00:03:26,380 --> 00:03:33,430 And again, the HTML tag for this bit here has a "p" for paragraph at the beginning and at the end. 35 36 00:03:33,580 --> 00:03:40,270 So this is our heading, this is our paragraph and both the heading and the paragraph are contained inside 36 37 00:03:40,570 --> 00:03:45,120 the body and the body refers to the whole thing here. 37 38 00:03:45,220 --> 00:03:49,420 So that's a very short introduction to HTML tags. 38 39 00:03:49,420 --> 00:03:53,870 Let's see how HTML is used inside our corpus of emails. 39 40 00:03:54,070 --> 00:03:59,380 One of the emails that I've looked at in a bit more detail is the one with document ID number 2. 40 41 00:03:59,680 --> 00:04:04,440 This is the email with the file name beginning "00214". 41 42 00:04:04,610 --> 00:04:09,650 Let me show you what the body of this email looks like in the Jupyter notebook. 42 43 00:04:09,650 --> 00:04:16,940 This email is at position number 2 in our message column, so we can access this email with "data.at 43 44 00:04:17,060 --> 00:04:25,610 [2]" and then the string "MESSAGE" in all caps. 44 45 00:04:25,610 --> 00:04:28,650 And what we get is something like this. 45 46 00:04:28,850 --> 00:04:33,270 Now personally, I don't find the formatting that we're getting here the most helpful. 46 47 00:04:33,290 --> 00:04:39,230 So let me show you what this email would look like in my Atom text editor where the formatting is a bit 47 48 00:04:39,230 --> 00:04:42,770 more user friendly. At the top 48 49 00:04:42,770 --> 00:04:45,370 we've got the email header.
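The lookup described above, "data.at" with the row label 2 and the column name "MESSAGE", can be sketched as follows. The dataframe here is a tiny made-up stand-in for the course data, so the row contents are assumptions; only the "MESSAGE" column name and the label 2 come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the course dataframe: three made-up message
# bodies stored in a MESSAGE column, with the default 0, 1, 2 row labels.
data = pd.DataFrame(
    {"MESSAGE": ["first body", "second body", "<b>Do not reply</b>"]}
)

# .at takes a row label and a column label and returns a single value.
message = data.at[2, "MESSAGE"]
print(message)
```

Note that both labels go inside one pair of square brackets, separated by a comma.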
Scrolling down, 49 50 00:04:45,440 --> 00:04:50,560 we get to the email body and at the bottom here with the syntax highlighting, 50 51 00:04:50,750 --> 00:04:57,970 you can easily spot the HTML tags. If I scroll down a bit further I can show you 51 52 00:04:57,990 --> 00:05:06,730 there is a paragraph in this email and you've got the "Do not reply" part that's going to show up in bold 52 53 00:05:07,270 --> 00:05:12,370 due to the fact that it is surrounded by these bolding HTML tags. 53 54 00:05:12,490 --> 00:05:19,660 Now these are the HTML tags that we're going to remove from our message bodies. 54 55 00:05:19,660 --> 00:05:24,520 We're going to clean our data in such a way that these HTML tags are no longer present. 55 56 00:05:24,700 --> 00:05:30,430 And the reason we're doing this is because we're interested in only the actual words for now, as well 56 57 00:05:30,430 --> 00:05:36,430 as the text with our bag of words approach in our naive Bayes classifier. 57 58 00:05:36,520 --> 00:05:42,040 So in a way we're gonna be treating these HTML tags just as we were treating punctuation, namely we're 58 59 00:05:42,040 --> 00:05:45,740 going to remove them. Back in Jupyter 59 60 00:05:45,760 --> 00:05:55,390 let me add a markdown cell here and that markdown cell is going to read "Removing HTML tags from 60 61 00:05:55,630 --> 00:06:00,520 Emails". Using this "at" property of the dataframe, 61 62 00:06:00,520 --> 00:06:09,130 we have a very efficient way of looking up and accessing a single value in the pandas dataframe. 62 63 00:06:09,130 --> 00:06:17,020 This is where we specified the index name and the column name. When it comes to stripping out the HTML 63 64 00:06:17,100 --> 00:06:19,370 tags from this particular email, 64 65 00:06:19,590 --> 00:06:28,170 all the heavy lifting will be done for us by a Python module called Beautiful Soup.
At the top with our 65 66 00:06:28,170 --> 00:06:39,100 notebook imports, we're going to write "from bs4 import BeautifulSoup". Then we'll hit Shift+Enter 66 67 00:06:39,100 --> 00:06:47,350 here, scroll back down and just inside this cell here where we've accessed a cell in our dataframe, we'll 67 68 00:06:47,350 --> 00:06:56,020 create a variable called "soup" and set it equal to "BeautifulSoup()", then we'll take the code 68 69 00:06:56,020 --> 00:07:06,900 that we've just written and just cut it, paste it inside here, put a comma after it and then supply a string 69 70 00:07:06,990 --> 00:07:16,860 called "html.parser". What I've just done is supplied two arguments, the first one is the text that 70 71 00:07:16,860 --> 00:07:24,570 I would like to parse and the second one is the parser that I would like to use. Now, the beautiful thing 71 72 00:07:24,570 --> 00:07:30,090 about Python is that it comes with an HTML parser that is ready to go. 72 73 00:07:30,090 --> 00:07:37,470 That is why we can just tell Beautiful Soup to use the built-in parser in Python with this string here. 73 74 00:07:38,170 --> 00:07:44,790 Now what we can do is print out the formatted version of this email, so let's write a print statement 74 75 00:07:45,930 --> 00:07:48,150 and then supply the following argument. 75 76 00:07:48,300 --> 00:07:51,090 We're gonna say "soup.prettify 76 77 00:07:54,020 --> 00:08:03,500 ()" and hit Shift+Enter. What we see now is a prettified version of the original text. 77 78 00:08:03,500 --> 00:08:10,050 So this is closer to what we saw in the Atom text editor that I used earlier. If I make it a bit larger 78 79 00:08:10,060 --> 00:08:12,490 like so, you can see it a bit better. 79 80 00:08:12,490 --> 00:08:13,300 There we go.
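The steps just described, creating the soup with two arguments and printing the prettified version, might look roughly like this. The HTML string is a short made-up stand-in for the email body; in the notebook the first argument would be the value looked up from the dataframe instead.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for an email body that contains HTML
html = "<html><body><p><b>Do not reply</b> to this message.</p></body></html>"

# First argument: the text to parse.
# Second argument: the parser to use, here the one built into Python.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the markup as a string with one tag per line,
# indented to reflect how the tags are nested.
pretty = soup.prettify()
print(pretty)
```

The tags are all still present at this point; prettify only changes the layout so the nesting is easier to read.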
80 81 00:08:14,540 --> 00:08:20,390 The only thing Jupyter doesn't do here is the syntax highlighting on our tags, but other than that 81 82 00:08:20,870 --> 00:08:27,490 with the indentation you can actually tell which parts are HTML and it's a lot more readable. 82 83 00:08:27,670 --> 00:08:28,060 All right, 83 84 00:08:28,090 --> 00:08:36,890 so we've seen how Beautiful Soup can prettify a piece of text that contains HTML but it can also remove all 84 85 00:08:36,890 --> 00:08:37,910 our tags. 85 86 00:08:38,130 --> 00:08:45,380 And this is actually the primary purpose we're gonna be using it for. So just below the cell, 86 87 00:08:45,930 --> 00:08:56,060 let's remove all the HTML and we can do that by calling the "get_text" method on our soup object, 87 88 00:08:56,180 --> 00:09:06,710 so "soup.get_text()" will remove our HTML. Now our output will no longer have any 88 89 00:09:06,710 --> 00:09:11,530 HTML in it and you can verify this by looking at the output here. 89 90 00:09:11,630 --> 00:09:17,750 So for example the bolding tags that used to surround "Do not reply" have disappeared. 90 91 00:09:17,860 --> 00:09:19,510 Alright so that's pretty much it. 91 92 00:09:19,610 --> 00:09:24,620 All the heavy lifting has been done for us by a Python module. In 92 93 00:09:24,650 --> 00:09:30,560 the next lesson we're finally gonna start tackling more of these emails and what we're gonna do is we're 93 94 00:09:30,560 --> 00:09:38,910 gonna put all of our work, all of the cleaning, all the pre-processing into some Python functions. 94 95 00:09:38,980 --> 00:09:39,670 I'll see you there.
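For reference, the tag removal step from this lesson can be sketched on its own as follows. The HTML string is again a made-up stand-in for an email body, not the actual message from the dataset.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for an email body with bolding tags around "Do not reply"
html = "<p><b>Do not reply</b> to this message.</p>"

soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text between the tags, with the tags dropped.
text = soup.get_text()
print(text)
# Do not reply to this message.
```

The words survive and the tags are gone, which is the form the bag of words approach needs.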