In this lesson we're going to discuss how to remove the HTML tags from our emails. What do I mean by HTML tags? Well, HTML is what makes an email look pretty. If I look at the newsletter from ProductHunt here, I've got wonderful formatting, I've got images embedded, I've got animations embedded and it just looks incredibly well put together. The same email message in plain text will look like what I've got here on the left. On the right, I have the rich formatting. On the left, I have the plain text. So the purpose of HTML in emails and on websites is to add both structure and some formatting.
Let me quickly show you how to add some basic formatting using HTML, and this will also show us how HTML tags basically work. Say we have a regular piece of plain text that reads "Do not reply". This text will be displayed in the email in the standard default way. However, if we wanted to make this text stand out, we might consider using HTML to make it bold, and we can do this by surrounding this piece of text with a tag, namely the "b" tag, "b" for bold. The way that HTML works is that there is a beginning tag and an ending tag, and that way we can mark where the bold text begins and where the bold text should end.
One of the best places to see HTML in action is of course on the web. If you go to a website called "example.com", then right-click on the page and go to "View Source", you will get something like this. This is what the HTML code behind the website actually looks like. This is what the developer of the website will have actually written. And thanks to our browser, this code here is rendered like so. Looking at the HTML behind the website allows us to see a couple of things. For example, we can see here that there's a title called "Example Domain". "Example Domain" sits in between two tags, an opening tag and a closing tag. And what this little bit of code does is make "Example Domain" show up here on our tab bar, for example.
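To make the idea of an opening tag, some text, and a closing tag a bit more concrete, here is a minimal sketch that uses the HTML parser that ships with Python. The class name `TagLogger` is made up for this illustration and is not part of the lesson's notebook. It simply records each piece of markup in the order the parser meets it.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records tags and text in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for an opening tag such as <b>
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for a closing tag such as </b>
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the plain text between tags
        self.events.append(("text", data))


logger = TagLogger()
logger.feed("<b>Do not reply</b>")
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

The three recorded events mirror the structure described above: the tag opens, the text sits in the middle, and the tag closes.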
Another neat little trick that you can try out with this particular website, or any website, is inspecting particular elements. If I right-click and then select "Inspect Element", I will see something like this. I'm going to move this over slightly. And now what I can do is hover over a particular element in the HTML code and my browser will highlight which bit of the website that piece of code is referring to. So for example, right now I'm hovering over a heading, which is marked with the "h1" tag, "h" for heading. If I move my mouse down a little bit, then I get to a paragraph. I can expand this here and select the text inside, but the paragraph as a whole is this bit here. And again, the HTML tag for this bit has a "p" for paragraph at the beginning and at the end. So this is our heading, this is our paragraph, and both the heading and the paragraph are contained inside the body, and the body refers to the whole thing here. So that's a very short introduction to HTML tags.
Let's see how HTML is used inside our corpus of emails. One of the emails that I've looked at in a bit more detail is the one with document ID number 2. This is the email with the file name beginning "00214". Let me show you what the body of this email looks like in Jupyter notebook. This email is at position number 2 in our message column, so we can access it with "data.at[2, 'MESSAGE']", with "MESSAGE" in all caps. And what we get is something like this. Now personally, I don't find the formatting we're getting here the most helpful. So let me show you what this email looks like in my Atom text editor, where the formatting is a bit more user friendly. At the top we've got the email header. Scrolling down, we get to the email body, and at the bottom here, with the syntax highlighting, you can easily spot the HTML tags.
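The dataframe lookup mentioned above can be sketched like this. The dataframe below is a toy stand-in with invented message bodies, not the real email corpus. Only the name `data`, the "MESSAGE" column and the row label 2 come from the lesson.

```python
import pandas as pd

# Toy stand-in for the course dataframe; the message bodies are invented.
data = pd.DataFrame(
    {
        "MESSAGE": [
            "plain text body",
            "another plain text body",
            "<html><body><p><b>Do not reply</b> to this email.</p></body></html>",
        ]
    }
)

# .at takes a row label and a column label and returns that single value.
message = data.at[2, "MESSAGE"]
print(message)
```

Because `at` returns one value rather than a row or a column, the result here is an ordinary Python string.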
If I scroll down a bit further, I can show you there is a paragraph in this email, and you've got the "Do not reply" part that's going to show up in bold because it is surrounded by these bold HTML tags. Now these are the HTML tags that we're going to remove from our message bodies. We're going to clean our data in such a way that these HTML tags are no longer present. And the reason we're doing this is that, for now, we're only interested in the actual words, the text itself, for the bag of words approach in our naive Bayes classifier. So in a way we're gonna be treating these HTML tags just as we were treating punctuation, namely we're going to remove them.
Back in Jupyter, let me add a markdown cell here, and that markdown cell is going to read "Removing HTML tags from Emails". With this "at" property of the dataframe, we have a very efficient way of looking up and accessing a single value in the pandas dataframe. This is where we specified the index name and the column name. When it comes to stripping out the HTML tags from this particular email, all the heavy lifting will be done for us by a Python module called Beautiful Soup. At the top, with our notebook imports, we're going to write "from bs4 import BeautifulSoup". Then we'll hit Shift+Enter here, scroll back down, and just inside this cell here where we've accessed a cell in our dataframe, we'll create a variable called "soup" and set it equal to "BeautifulSoup()". Then we'll take the code that we've just written, cut it, paste it inside the parentheses, put a comma after it and then supply a string called "html.parser". What I've just done is supply two arguments: the first one is the text that I would like to parse and the second one is the parser that I would like to use. Now, the beautiful thing about Python is that it comes with an HTML parser that is ready to go. That is why we can just tell Beautiful Soup to use the built-in parser in Python with this string here.
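The call described above can be sketched as follows. This is a minimal illustration that assumes the `bs4` package is installed, and it uses a short invented string in place of the real email body that the notebook reads from the dataframe.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the email body stored in the dataframe.
message = "<html><body><p><b>Do not reply</b> to this email.</p></body></html>"

# First argument: the markup to parse.
# Second argument: the name of the parser, here Python's built-in one.
soup = BeautifulSoup(message, "html.parser")

# The parsed result can be searched by tag name.
bold = soup.find("b")
print(bold)         # the whole element, tags included
print(bold.string)  # only the text inside the tags
```

Passing the parser name explicitly keeps the behaviour the same from one machine to the next, since Beautiful Soup can work with other parsers if they happen to be installed.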
Now what we can do is print out the formatted version of this email, so let's write a print statement and then supply the following argument. We're gonna say "soup.prettify()" and hit Shift+Enter. What we see now is a prettified version of the original text. So this is closer to what we saw in the Atom text editor that I used earlier. If I make it a bit larger like so, you can see it a bit better. There we go. The only thing Jupyter doesn't do here is the syntax highlighting on our tags, but other than that, with the indentation, you can actually tell which parts are HTML and it's a lot more readable.
All right, so we've seen how Beautiful Soup can prettify a piece of text that contains HTML, but it can also remove all our tags. And this is actually what we're primarily gonna be using it for. So just below the cell, let's remove all the HTML, and we can do that by calling the "get_text" method on our soup object, so "soup.get_text()" will remove our HTML. Now our output will no longer have any HTML in it, and you can verify this by looking at the output here. So for example, the bold tags that used to surround "Do not reply" have disappeared. All right, so that's pretty much it. All the heavy lifting has been done for us by a Python module. In the next lesson we're finally gonna start tackling more of these emails, and what we're gonna do is put all of our work, all of the cleaning, all the pre-processing, into some Python functions. I'll see you there.
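As a short recap of the two Beautiful Soup calls used in this lesson, here is a minimal sketch. It assumes the `bs4` package is installed and uses a short invented snippet in place of the real email body.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the email body.
message = "<p><b>Do not reply</b> to this email.</p>"
soup = BeautifulSoup(message, "html.parser")

# prettify() keeps the tags but puts each one on its own indented line.
pretty = soup.prettify()
print(pretty)

# get_text() drops the tags and keeps only the text between them.
text = soup.get_text()
print(text)
```

The first printout still contains the "b" and "p" tags, only laid out more readably. The second contains just the words, which is the form we want for the bag of words approach.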