0 1 00:00:00,630 --> 00:00:00,950 All right. 1 2 00:00:00,990 --> 00:00:07,310 So in this lesson we're going to practice getting an email body from a single text file with a known 2 3 00:00:07,310 --> 00:00:08,520 structure. 3 4 00:00:08,520 --> 00:00:17,130 I'm going to go ahead and copy this cell right here and then I'm going to paste it down below. 4 5 00:00:17,130 --> 00:00:20,280 Our goal is modifying the code in the cell, 5 6 00:00:20,280 --> 00:00:26,230 so we just print out the message body instead of the header and the message body. 6 7 00:00:26,340 --> 00:00:27,570 Here's what I'm going to do. 7 8 00:00:27,570 --> 00:00:34,870 I'm going to delete this line, as well as this line, as well as this line. 8 9 00:00:35,610 --> 00:00:41,900 And I'm going to add some code between opening the file and closing the file. 9 10 00:00:42,120 --> 00:00:49,230 The first thing I'm going to do is I'm going to create a boolean variable and I'm a call that "is_ 10 11 00:00:49,230 --> 00:00:51,180 body". 11 12 00:00:51,180 --> 00:01:00,780 And then I'm going to set that to False to begin with. Then I'll also create another variable - a list called 12 13 00:01:01,050 --> 00:01:01,950 "lines". 13 14 00:01:01,950 --> 00:01:05,900 So lines is equal to an empty list. 14 15 00:01:05,970 --> 00:01:09,390 So that has the square bracket notation. 15 16 00:01:09,390 --> 00:01:17,310 Now what I'm gonna do is I'm going to read all the lines in our file so I'm going to use a loop and I'm going 16 17 00:01:17,310 --> 00:01:27,570 to loop through our stream, so I'll say "for line in stream:" and then inside the loop I'm 17 18 00:01:27,570 --> 00:01:32,170 going to check if I'm inside the email body. 18 19 00:01:32,310 --> 00:01:46,020 So "if is_body:", "lines.append(line)". What this line of code does is it will 19 20 00:01:46,020 --> 00:01:57,660 take our list - lines; and append a single line from our text file to the list if we are inside the email 20 21 00:01:57,840 --> 00:01:59,310 body. 21 22 00:01:59,450 --> 00:02:05,570 Now the natural question is: How would you know that you're inside the email body? 22 23 00:02:05,580 --> 00:02:15,300 Well, if we look at the structure of this email, we know that there is a space between the header and 23 24 00:02:15,300 --> 00:02:16,040 the body. 24 25 00:02:16,050 --> 00:02:21,790 There is a new line right after the date information in the e-mail header. 25 26 00:02:22,050 --> 00:02:29,580 So what we're gonna do is we're going to check for this blank right here. If we hit this blank line, 26 27 00:02:29,580 --> 00:02:34,390 this new line, we're going to change our boolean variable to True. 27 28 00:02:34,500 --> 00:02:47,610 So "elif line == '\n'" - "\n" is our 28 29 00:02:47,610 --> 00:02:49,080 new line character. 29 30 00:02:49,140 --> 00:02:54,330 So this checks for somebody having hit Enter on their keyboard essentially. 30 31 00:02:54,420 --> 00:03:01,950 Then I'll add a semicolon at the end and then inside this else if statement I'm going to set is_body 31 32 00:03:02,730 --> 00:03:05,180 equal to True. 32 33 00:03:05,610 --> 00:03:13,310 And that's pretty much our loop. I can close my file after we've read all the individual lines in our 33 34 00:03:13,310 --> 00:03:23,090 stream and then all I have to do is maybe print out my email body. So my "email_body" is 34 35 00:03:23,090 --> 00:03:28,880 going to be equal to all of my list items maybe joined by a new line. 35 36 00:03:28,940 --> 00:03:38,490 So I'll take "'\n'.join( 36 37 00:03:38,600 --> 00:03:49,390 lines)". And now let's print out our email_body and hit Shift+Enter and what we see is that our 37 38 00:03:49,390 --> 00:03:50,750 loop is working. 38 39 00:03:50,770 --> 00:03:59,050 We've cut off our hair and we're left with just the email body. If you found this line of code confusing, 39 40 00:03:59,500 --> 00:04:06,310 namely joining all the individual entries in the list with a new line character, 40 41 00:04:06,310 --> 00:04:09,560 let's take it out and see how this formatting changes 41 42 00:04:09,700 --> 00:04:11,070 if we don't have it. 42 43 00:04:11,110 --> 00:04:17,710 So what I'll do is I'll put a hash tag there, comment this out and what I'll also do is comment out this 43 44 00:04:17,710 --> 00:04:18,530 line right here 44 45 00:04:19,720 --> 00:04:25,460 and instead I'm going to print out my list, which was lines. 45 46 00:04:25,460 --> 00:04:28,400 Let's see what this looks like. Here's what we have. 46 47 00:04:29,800 --> 00:04:31,530 We've got a new line character, 47 48 00:04:31,660 --> 00:04:33,500 then we have the first line in our email - 48 49 00:04:33,520 --> 00:04:34,750 "Dear Mr. Still", 49 50 00:04:34,750 --> 00:04:40,720 then another new line character and everything is separated by a comma in this list as the loop 50 51 00:04:40,720 --> 00:04:45,120 read it line by line and appended it to our list. 51 52 00:04:45,130 --> 00:04:51,520 This is the formatting we would get straight out of our loop, and because this is not very readable, what 52 53 00:04:51,520 --> 00:05:00,460 we can do instead is join all the individual entries in the list at the new Line character, and then 53 54 00:05:00,460 --> 00:05:03,550 we get something much, much, much more readable. 54 55 00:05:03,730 --> 00:05:05,260 Fantastic! 55 56 00:05:05,260 --> 00:05:13,240 What we've just done is written the Python code to extract an email body from our raw data. With this 56 57 00:05:13,240 --> 00:05:13,960 in hand, 57 58 00:05:13,960 --> 00:05:21,680 we can apply this to all the messages in our dataset because they all follow the same pattern. 58 59 00:05:21,680 --> 00:05:25,240 And I'm going to show you how to do just that in the next lesson. 59 60 00:05:25,240 --> 00:05:26,350 Stay tuned.