1
00:00:00,920 --> 00:00:01,200
All right.

2
00:00:01,250 --> 00:00:04,960
So in this lesson, we're going to write the pipeline functions for e-mail processing.

3
00:00:05,870 --> 00:00:07,310
Let me add a markdown cell.

4
00:00:07,520 --> 00:00:14,580
Just add a quick section heading here and say "Functions for Email Processing".

5
00:00:16,540 --> 00:00:21,320
I'm going to create a function called clean_message.

6
00:00:22,500 --> 00:00:25,460
And I want this function to eventually return to me.

7
00:00:26,090 --> 00:00:28,610
A list of filtered words.

8
00:00:29,450 --> 00:00:35,720
So I'll create an empty list, filtered_words.

9
00:00:37,530 --> 00:00:39,870
And then we're gonna write some code.

10
00:00:40,950 --> 00:00:43,950
And then at the end, we want this function to return.

11
00:00:44,310 --> 00:00:46,710
This list of filtered words for us.

12
00:00:47,840 --> 00:00:50,060
Now, what sort of inputs should this function have?

13
00:00:51,140 --> 00:00:53,990
Well, the first one should be an

14
00:00:54,230 --> 00:00:54,980
E-mail message.

15
00:00:55,100 --> 00:00:55,790
E-mail body.

16
00:00:56,630 --> 00:00:58,550
We can then use this message.

17
00:00:59,660 --> 00:01:02,450
And first of all, convert it to lowercase.

18
00:01:02,780 --> 00:01:05,150
So I can say message.lower().

19
00:01:05,780 --> 00:01:11,060
This will convert all the contents in the message that are being passed to this function to lower case.

20
00:01:11,690 --> 00:01:14,600
And we can also tokenize the message.

21
00:01:15,110 --> 00:01:22,730
So word_tokenize, parentheses, message.lower(),

22
00:01:23,630 --> 00:01:26,240
will tokenize all the words in our message.

23
00:01:26,630 --> 00:01:30,740
So this is kind of a review of a lot of the previous steps that we've taken.

24
00:01:31,850 --> 00:01:40,700
So let me create a variable called words and set that equal to the result of all this work that's taking

25
00:01:40,700 --> 00:01:41,150
place.

26
00:01:41,240 --> 00:01:47,570
And with this line of code, words will, in fact, be a list of all the individual words in the email body.

27
00:01:48,520 --> 00:01:57,610
So this means that we can iterate over this list of words, right? So we can say: for word in words,

28
00:01:58,120 --> 00:01:58,930
colon.

29
00:01:59,560 --> 00:02:04,390
And then inside the loop, we can do some similar kind of work that we did before.

30
00:02:05,590 --> 00:02:09,340
We can remove the stop words and we can remove the punctuation.

31
00:02:10,210 --> 00:02:17,680
If the word is not in stop_words and the word

32
00:02:19,950 --> 00:02:21,180
is not punctuation,

33
00:02:21,690 --> 00:02:28,470
in other words, word.isalpha(), then stem the word.

34
00:02:29,130 --> 00:02:34,860
So we'll use our stemmer and stem the word.

35
00:02:35,850 --> 00:02:41,220
Once we're happy with that, we can take our list of filtered words

36
00:02:42,650 --> 00:02:47,590
and append the word from the stemmer.

37
00:02:48,860 --> 00:02:53,360
Now we'll actually want to make this function a little bit more independent from the previous cells

38
00:02:53,720 --> 00:03:00,020
so we can say the stemmer is going to be equal to the PorterStemmer.

39
00:03:01,310 --> 00:03:10,160
What we're doing here is making stemmer an optional argument, and then our stop_words are going to be

40
00:03:10,190 --> 00:03:10,970
equal to.

41
00:03:12,070 --> 00:03:16,660
the set of stopwords.words

42
00:03:17,690 --> 00:03:18,410
for English.

43
00:03:20,390 --> 00:03:26,420
Now we've created a function where we can swap out the stemmer and swap out the list of stop words

44
00:03:26,930 --> 00:03:27,920
if we wanted to.

45
00:03:29,150 --> 00:03:37,670
This bit of code here converts to lowercase and splits up the individual words.

46
00:03:39,610 --> 00:03:46,390
This bit here removes the stop words and punctuation.
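As a rough sketch, the function described so far might look like the code below. The lesson itself uses NLTK's word_tokenize, PorterStemmer and stopwords.words('english'). To keep this sketch runnable without NLTK or its data files, demo_tokenize, DemoStemmer and DEMO_STOP_WORDS are hypothetical stand-ins, not the tools used in the notebook, so the stems will differ from what the real stemmer gives.

```python
import re

# Hypothetical stand-ins (assumptions, not from the lesson) so that the
# sketch runs with the standard library only. In the notebook these would be
# NLTK's word_tokenize, PorterStemmer() and set(stopwords.words('english')).
DEMO_STOP_WORDS = {"the", "is", "a", "and", "to", "of", "in"}


class DemoStemmer:
    """Crude suffix stripper with the same .stem(word) method as PorterStemmer."""

    def stem(self, word):
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word


def demo_tokenize(text):
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def clean_message(message, stemmer=DemoStemmer(), stop_words=DEMO_STOP_WORDS):
    filtered_words = []

    # Converts to lowercase and splits up the individual words.
    words = demo_tokenize(message.lower())

    for word in words:
        # Removes the stop words and punctuation, then stems what is left.
        if word not in stop_words and word.isalpha():
            filtered_words.append(stemmer.stem(word))

    return filtered_words


result = clean_message("The cat is chasing the mice, and winning!")
print(result)  # ['cat', 'chas', 'mice', 'winn']
```

Because stemmer and stop_words are optional arguments, either one can be swapped out at the call site without touching the function body.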

47
00:03:47,900 --> 00:03:54,860
Now, let me hit Shift+Enter and try out this function. A little bit earlier on in the project, when

48
00:03:54,860 --> 00:03:56,930
we were learning about reading files.

49
00:03:57,320 --> 00:04:03,740
we had this example email, and we saved this email text in a variable called email_body.

50
00:04:04,790 --> 00:04:05,390
Let's try out

51
00:04:05,450 --> 00:04:09,010
our clean_message function on this email body right here.

52
00:04:10,260 --> 00:04:19,750
So coming down, I've got clean_message, parentheses, and then email_body.

53
00:04:21,450 --> 00:04:23,350
I'm going to hit Shift+Enter and see what we get.

54
00:04:25,310 --> 00:04:26,110
We get an error.

55
00:04:26,840 --> 00:04:30,050
And that's because I've got a typo right here.

56
00:04:31,040 --> 00:04:33,170
Here we're dealing with the filtered_words variable.

57
00:04:34,100 --> 00:04:36,140
Here we're dealing with the filtered_words variable.

58
00:04:36,500 --> 00:04:38,660
But here I've left out the "s".

59
00:04:40,410 --> 00:04:41,040
Let's try again.

60
00:04:41,430 --> 00:04:46,560
So I'm going to hit Shift+Enter on this and hit Shift+Enter on this.

61
00:04:47,710 --> 00:04:48,670
And here's our output.

62
00:04:49,330 --> 00:04:54,250
The entire contents of the e-mail are tokenized and also stemmed.

63
00:04:55,090 --> 00:04:56,700
So let me quickly copy this cell.

64
00:04:57,740 --> 00:05:02,480
Come down here, paste it in and modify the name of our function.

65
00:05:02,810 --> 00:05:08,130
I'm going to call it clean_message_no_html.

66
00:05:09,280 --> 00:05:10,760
I want to pose a challenge to you.

67
00:05:11,630 --> 00:05:17,380
I'd like you to modify the function that we've just written to also remove the

68
00:05:17,410 --> 00:05:18,230
HTML tags.

69
00:05:18,800 --> 00:05:25,520
And then I'd like you to test this function on the email with document ID number two.

70
00:05:26,560 --> 00:05:28,600
Namely, this e-mail right here.

71
00:05:29,200 --> 00:05:31,830
So pause the video and give this a go.

72
00:05:34,160 --> 00:05:34,940
Did you figure it out?

73
00:05:36,780 --> 00:05:45,840
Here's the solution. I'll quickly add a comment, "remove HTML tags", and then I'm going to use Beautiful

74
00:05:45,840 --> 00:05:46,200
Soup.

75
00:05:46,530 --> 00:05:48,390
So I'll say soup is equal to

76
00:05:50,400 --> 00:05:52,320
BeautifulSoup, parentheses.

77
00:05:53,440 --> 00:05:53,890
And then what?

78
00:05:53,950 --> 00:05:55,780
I have to provide two arguments, right?

79
00:05:56,620 --> 00:06:00,190
The first one will have to be the e-mail body, which will be our message.

80
00:06:01,650 --> 00:06:07,980
And then I'm going to select the HTML parser as my default parser.

81
00:06:09,390 --> 00:06:11,860
To remove all the tags, I'll say

82
00:06:12,030 --> 00:06:13,840
soup.get_text(),

83
00:06:13,890 --> 00:06:20,610
but I'm going to store the output in a variable as well.

84
00:06:20,670 --> 00:06:21,360
I'll say maybe

85
00:06:22,610 --> 00:06:28,040
cleaned_text is equal to soup.get_text().

86
00:06:29,190 --> 00:06:37,400
And then what I'll do is, instead of saying message.lower(), I'll say cleaned_text.lower().
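A minimal sketch of that change follows: strip the markup first, then lowercase and tokenize the cleaned text instead of the raw message. The lesson does this with Beautiful Soup's get_text(). Here strip_html is a hypothetical stand-in built on the standard library's HTMLParser so the sketch runs without extra packages. The regex tokenizer and DEMO_STOP_WORDS are also assumptions, and stemming is left out for brevity.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the NLTK stop word list used in the lesson.
DEMO_STOP_WORDS = frozenset({"the", "is", "a", "and", "to", "of", "in"})


class _TextExtractor(HTMLParser):
    """Collects only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(message):
    # Stand-in for BeautifulSoup(message, 'html.parser').get_text().
    parser = _TextExtractor()
    parser.feed(message)
    parser.close()
    return "".join(parser.parts)


def clean_message_no_html(message, stop_words=DEMO_STOP_WORDS):
    # Remove HTML tags.
    cleaned_text = strip_html(message)

    # Converts to lowercase and splits up the individual words.
    words = re.findall(r"\w+|[^\w\s]", cleaned_text.lower())

    # Removes the stop words and punctuation (stemming omitted in this sketch).
    return [word for word in words if word not in stop_words and word.isalpha()]


html_email = "<html><body><p>Click <b>HERE</b> to win!</p></body></html>"
cleaned = clean_message_no_html(html_email)
print(cleaned)  # ['click', 'here', 'win']
```

The only difference from the earlier function is the first step: the tokenizer now sees the text content, so tag names such as "html" or "body" never reach the word list.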

87
00:06:38,070 --> 00:06:39,810
And if I hit Shift+Enter now,

88
00:06:41,750 --> 00:06:50,510
I can come down here, click in this cell and hit Tab on my keyboard, bring up clean_message_

89
00:06:50,690 --> 00:07:03,380
no_html, and then in the parentheses provide data.at, square brackets, 2, comma, and in single

90
00:07:03,380 --> 00:07:05,480
quotes, message.

91
00:07:06,810 --> 00:07:15,120
And if I hit Shift+Enter now, we should see just a list of stemmed words and no HTML tags.

92
00:07:16,260 --> 00:07:16,830
Wonderful.

93
00:07:18,390 --> 00:07:24,630
Now, we've successfully cleaned and tokenized a single email from our dataset.

94
00:07:25,860 --> 00:07:31,810
Now it's time to apply the cleaning and tokenization to all the 5800 messages.

95
00:07:33,150 --> 00:07:36,690
And that's what we're going to be working up to in the next lesson.

96
00:07:37,170 --> 00:07:38,070
I'll see you there.