1 00:00:00,920 --> 00:00:01,200 All right. 2 00:00:01,250 --> 00:00:04,960 So in this lesson, we're going to write the pipeline functions for e-mail processing. 3 00:00:05,870 --> 00:00:07,310 Let me add a markdown cell. 4 00:00:07,520 --> 00:00:14,580 Just add a quick section heading here and say Functions for e-mail processing. 5 00:00:16,540 --> 00:00:21,320 I'm going to create a function called clean_message. 6 00:00:22,500 --> 00:00:25,460 And I want this function to eventually return to me 7 00:00:26,090 --> 00:00:28,610 a list of filtered words. 8 00:00:29,450 --> 00:00:35,720 So I'll create an empty list, filtered_words. 9 00:00:37,530 --> 00:00:39,870 And then we're gonna write some code. 10 00:00:40,950 --> 00:00:43,950 And then at the end, we want this function to return 11 00:00:44,310 --> 00:00:46,710 this list of filtered words for us. 12 00:00:47,840 --> 00:00:50,060 Now, what sort of inputs should this function have? 13 00:00:51,140 --> 00:00:53,990 Well, the first one should be an 14 00:00:54,230 --> 00:00:54,980 e-mail message, 15 00:00:55,100 --> 00:00:55,790 the e-mail body. 16 00:00:56,630 --> 00:00:58,550 We can then use this message 17 00:00:59,660 --> 00:01:02,450 and, first of all, convert it to lower case. 18 00:01:02,780 --> 00:01:05,150 So I can say message.lower(). 19 00:01:05,780 --> 00:01:11,060 This will convert all the contents of the message that is being passed to this function to lower case. 20 00:01:11,690 --> 00:01:14,600 And we can also tokenize the message. 21 00:01:15,110 --> 00:01:22,730 So word_tokenize, parentheses, message.lower(). 22 00:01:23,630 --> 00:01:26,240 This will tokenize all the words in our message. 23 00:01:26,630 --> 00:01:30,740 So this is kind of a review of a lot of the previous steps that we've taken.
24 00:01:31,850 --> 00:01:40,700 So let me create a variable called words and set that equal to the result of all this work that's taking 25 00:01:40,700 --> 00:01:41,150 place. 26 00:01:41,240 --> 00:01:47,570 With this line of code, words will, in fact, be a list of all the individual words in the email body. 27 00:01:48,520 --> 00:01:57,610 So this means that we can iterate over this list of words, right? So we can say for word in words, 28 00:01:58,120 --> 00:01:58,930 colon. 29 00:01:59,560 --> 00:02:04,390 And then inside the loop, we can do some similar kind of work that we did before. 30 00:02:05,590 --> 00:02:09,340 We can remove the stop words and we can remove the punctuation. 31 00:02:10,210 --> 00:02:17,680 If the word is not in stop_words and the word 32 00:02:19,950 --> 00:02:21,180 is not punctuation, 33 00:02:21,690 --> 00:02:28,470 in other words, word.isalpha(), then stem the word. 34 00:02:29,130 --> 00:02:34,860 So we'll use our stemmer and stem the word. 35 00:02:35,850 --> 00:02:41,220 Once we're happy with that, we can take our list of filtered words 36 00:02:42,650 --> 00:02:47,590 and append the word from the stemmer. 37 00:02:48,860 --> 00:02:53,360 Now we'll actually want to make this function a little bit more independent from the previous cells, 38 00:02:53,720 --> 00:03:00,020 so we can say the stemmer is going to be equal to the PorterStemmer. 39 00:03:01,310 --> 00:03:10,160 What we're doing here is making stemmer an optional argument, and then our stop_words are going to be 40 00:03:10,190 --> 00:03:10,970 equal to 41 00:03:12,070 --> 00:03:16,660 the set of stopwords.words 42 00:03:17,690 --> 00:03:18,410 for English. 43 00:03:20,390 --> 00:03:26,420 Now, we've created a function where we can swap out the stemmer and swap out the list of stop words 44 00:03:26,930 --> 00:03:27,920 if we wanted to.
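[Editor's note] The function dictated above might look roughly like the sketch below. It assumes NLTK is installed and that its tokenizer and stop word resources have been downloaded. Two details are assumptions of this sketch, not what is typed in the video: the defaults are resolved inside the function body instead of in the signature, and a hypothetical `tokenize` parameter is added so that the tokenizer can be swapped out in the same way as the stemmer and the stop words.

```python
def clean_message(message, stemmer=None, stop_words=None, tokenize=None):
    """Lowercase, tokenize, drop stop words and punctuation, then stem."""
    # Resolve the defaults lazily: the NLTK imports below only run when the
    # caller has not supplied a replacement component.
    if tokenize is None:
        # word_tokenize needs NLTK's tokenizer data to be downloaded.
        from nltk.tokenize import word_tokenize as tokenize
    if stemmer is None:
        from nltk.stem import PorterStemmer
        stemmer = PorterStemmer()
    if stop_words is None:
        # stopwords needs NLTK's stopwords corpus to be downloaded.
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words("english"))

    # Convert to lowercase and split up the individual words.
    words = tokenize(message.lower())

    filtered_words = []
    for word in words:
        # Remove the stop words and punctuation, then stem what is left.
        if word not in stop_words and word.isalpha():
            filtered_words.append(stemmer.stem(word))

    return filtered_words
```

With NLTK set up, `clean_message(email_body)` would return the stemmed tokens of the example e-mail. Any object with a `stem` method can be passed as `stemmer`, and any collection that supports `in` can be passed as `stop_words`.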
45 00:03:29,150 --> 00:03:37,670 This bit of code here converts to lowercase and splits up the individual words. 46 00:03:39,610 --> 00:03:46,390 This bit here removes the stop words and punctuation. 47 00:03:47,900 --> 00:03:54,860 Now, let me hit shift enter and try out this function. A little bit earlier on in the project, when 48 00:03:54,860 --> 00:03:56,930 we were learning about reading files, 49 00:03:57,320 --> 00:04:03,740 we had this example e-mail and we saved this e-mail text in a variable called email_body. 50 00:04:04,790 --> 00:04:05,390 Let's try 51 00:04:05,450 --> 00:04:09,010 our clean_message function on this e-mail body right here. 52 00:04:10,260 --> 00:04:19,750 So coming down, I've got clean_message, parentheses, and then email_body. 53 00:04:21,450 --> 00:04:23,350 I'll hit shift enter and see what we get. 54 00:04:25,310 --> 00:04:26,110 We get an error. 55 00:04:26,840 --> 00:04:30,050 And that's because I've got a typo right here. 56 00:04:31,040 --> 00:04:33,170 Here we're dealing with the filtered_words variable. 57 00:04:34,100 --> 00:04:36,140 Here we're dealing with the filtered_words variable. 58 00:04:36,500 --> 00:04:38,660 But here I've left out the s. 59 00:04:40,410 --> 00:04:41,040 Let's try again. 60 00:04:41,430 --> 00:04:46,560 So I'm going to hit shift enter on this and hit shift enter on this. 61 00:04:47,710 --> 00:04:48,670 And here's our output. 62 00:04:49,330 --> 00:04:54,250 The entire contents of the e-mail are tokenized and also stemmed. 63 00:04:55,090 --> 00:04:56,700 So let me quickly copy this cell, 64 00:04:57,740 --> 00:05:02,480 come down here, paste it in and modify the name of our function. 65 00:05:02,810 --> 00:05:08,130 I'm going to call it clean_message_no_html. 66 00:05:09,280 --> 00:05:10,760 I want to pose a challenge to you. 67 00:05:11,630 --> 00:05:17,380 I'd like you to modify the function that we've just written to also remove the HTML
68 00:05:17,410 --> 00:05:18,230 tags. 69 00:05:18,800 --> 00:05:25,520 And then I'd like you to test this function on the email with document ID number two, 70 00:05:26,560 --> 00:05:28,600 namely, this e-mail right here. 71 00:05:29,200 --> 00:05:31,830 So pause the video and give this a go. 72 00:05:34,160 --> 00:05:34,940 Did you figure it out? 73 00:05:36,780 --> 00:05:45,840 Here's the solution. I'll quickly add a comment, remove HTML tags, and then I'm going to use Beautiful 74 00:05:45,840 --> 00:05:46,200 Soup. 75 00:05:46,530 --> 00:05:48,390 So I'll say soup is equal to 76 00:05:50,400 --> 00:05:52,320 BeautifulSoup, parentheses. 77 00:05:53,440 --> 00:05:53,890 And then what? 78 00:05:53,950 --> 00:05:55,780 I have to provide two arguments, right? 79 00:05:56,620 --> 00:06:00,190 The first one will have to be the e-mail body, which will be our message. 80 00:06:01,650 --> 00:06:07,980 And then I'm going to select html.parser as my default parser. 81 00:06:09,390 --> 00:06:11,860 To remove all the tags, I'll say soup 82 00:06:12,030 --> 00:06:13,840 dot get underscore 83 00:06:13,890 --> 00:06:20,610 text, parentheses, but I'm going to store the output in a variable as well. 84 00:06:20,670 --> 00:06:21,360 I'll say maybe 85 00:06:22,610 --> 00:06:28,040 cleaned_text is equal to soup.get_text(). 86 00:06:29,190 --> 00:06:37,400 And then what I'll do is, instead of saying message.lower(), I'll say cleaned_text.lower(). 87 00:06:38,070 --> 00:06:39,810 And if I hit shift enter now, 88 00:06:41,750 --> 00:06:50,510 I can come down here, click in this cell and hit tab on my keyboard, bring up clean_message_no_html, 89 00:06:50,690 --> 00:07:03,380 and then in the parentheses provide data.at, square brackets, 2, comma, and in single 90 00:07:03,380 --> 00:07:05,480 quotes, message. 91 00:07:06,810 --> 00:07:15,120 And if I hit shift enter now, we should see just a list of stemmed words and no HTML tags.
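[Editor's note] The challenge solution described above could be sketched as follows. It assumes the `bs4` (Beautiful Soup) and NLTK packages are installed and that the NLTK resources are downloaded. As a convenience of this sketch only, the defaults are resolved inside the function, and the `tokenize` and `html_to_text` parameters are hypothetical additions that do not appear in the video; the default HTML step is the `BeautifulSoup(message, 'html.parser').get_text()` call from the lesson.

```python
def clean_message_no_html(message, stemmer=None, stop_words=None,
                          tokenize=None, html_to_text=None):
    """Strip HTML tags, then lowercase, tokenize, filter, and stem."""
    if html_to_text is None:
        from bs4 import BeautifulSoup

        def html_to_text(html):
            # Parse with Python's built-in parser and keep only the text.
            return BeautifulSoup(html, "html.parser").get_text()
    if tokenize is None:
        # word_tokenize needs NLTK's tokenizer data to be downloaded.
        from nltk.tokenize import word_tokenize as tokenize
    if stemmer is None:
        from nltk.stem import PorterStemmer
        stemmer = PorterStemmer()
    if stop_words is None:
        # stopwords needs NLTK's stopwords corpus to be downloaded.
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words("english"))

    # Remove HTML tags.
    cleaned_text = html_to_text(message)

    # Convert to lowercase and split up the individual words.
    words = tokenize(cleaned_text.lower())

    filtered_words = []
    for word in words:
        # Remove the stop words and punctuation, then stem what is left.
        if word not in stop_words and word.isalpha():
            filtered_words.append(stemmer.stem(word))

    return filtered_words
```

In the notebook, the call from the lesson would then be the function applied to the message column of the row with document ID 2 in the `data` DataFrame.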
92 00:07:16,260 --> 00:07:16,830 Wonderful. 93 00:07:18,390 --> 00:07:24,630 Now, we've successfully cleaned and tokenized a single email from our dataset. 94 00:07:25,860 --> 00:07:31,810 Now it's time to apply the cleaning and tokenization to all the 5800 messages. 95 00:07:33,150 --> 00:07:36,690 And that's what we're going to be working up to in the next lesson. 96 00:07:37,170 --> 00:07:38,070 I'll see you there.