0
1
00:00:00,890 --> 00:00:11,510
For our second exercise, what I'd like you to do is find the email with the largest number of words.
1

2
00:00:11,510 --> 00:00:13,240
So this is after cleaning,
2

3
00:00:13,280 --> 00:00:15,450
right.
3

4
00:00:16,430 --> 00:00:24,020
In this challenge, I'd like you to print out the number of words in the longest email - that is after cleaning
4

5
00:00:24,020 --> 00:00:31,070
and stemming and I'd like you to note the longest email's position in the list of cleaned emails.
5

6
00:00:32,580 --> 00:00:39,510
Also print out the stemmed list of words in the longest e-mail and print out the longest email from
6

7
00:00:39,510 --> 00:00:47,230
the data dataframe. I'll give you a few seconds to pause the video and give this a go.
7

8
00:00:51,370 --> 00:00:58,120
If you want a hint, use the length function, "len" and practice the Python list comprehension.
8

9
00:01:01,440 --> 00:01:01,980
All right.
9

10
00:01:02,010 --> 00:01:03,850
So here's the solution.
10

11
00:01:04,290 --> 00:01:07,340
One way to do this is to use a for loop.
11

12
00:01:07,340 --> 00:01:07,680
Yeah.
12

13
00:01:07,680 --> 00:01:09,360
Classic.
13

14
00:01:09,360 --> 00:01:17,940
In this case, we would create an empty list say "clean_email_lengths" which is going to hold
14

15
00:01:17,940 --> 00:01:21,510
on to the number of words in each email.
15

16
00:01:21,510 --> 00:01:25,370
So an empty list is created with an empty pair of square brackets.
16

17
00:01:25,650 --> 00:01:33,040
And then we write our for loop, so "for sublist in stemmed_nested_list",
17

18
00:01:33,060 --> 00:01:40,250
this is where we've stored all our email bodies, right.
18

19
00:01:40,330 --> 00:01:48,100
This is what we can iterate over to check the number of words, and we can check the number of words
19

20
00:01:48,190 --> 00:01:55,480
with the "len" function, so "len(sublist)" will give us the number of words.
20

21
00:01:55,870 --> 00:02:02,200
But what we actually want to do with these word counts is append them to our empty
21

22
00:02:02,200 --> 00:02:04,810
list up here as the loop runs.
22

23
00:02:04,810 --> 00:02:15,220
So it'd be "clean_email_lengths.append(len(sublist))" and then
23

24
00:02:15,220 --> 00:02:20,290
two closing parentheses. Let's take a look at what this looks like.
24
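
A minimal sketch of the for loop just described. The three short emails below are made-up stand-ins for the course's stemmed_nested_list, so the counts will differ from the ones in the notebook.

```python
# Hypothetical toy data: each inner list holds the stemmed words
# of one cleaned email (the real list comes from earlier lessons).
stemmed_nested_list = [
    ["free", "offer", "click"],
    ["meet", "tomorrow"],
    ["pleas", "find", "attach", "report", "review"],
]

clean_email_lengths = []  # will hold one word count per email

for sublist in stemmed_nested_list:
    # len() of a list of words is the number of words, not characters
    clean_email_lengths.append(len(sublist))

print(clean_email_lengths)
```

Because each sublist is a list of words, "len(sublist)" counts words.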

25
00:02:20,370 --> 00:02:32,160
So I'm going to hit Shift+Enter and then maybe print my clean_email_lengths. What I can see here is that the first
25

26
00:02:32,490 --> 00:02:39,120
email has 50 words after stemming and after removing stop words, that is. The next one has 80 words,
26

27
00:02:39,120 --> 00:02:41,140
the next one has 92.
27

28
00:02:41,220 --> 00:02:42,900
So this seems to work.
28

29
00:02:42,900 --> 00:02:49,400
But one thing we can do is, instead of using this for loop, we can also do it in a very Pythonic way,
29

30
00:02:49,410 --> 00:02:51,420
we can use Python
30

31
00:02:51,420 --> 00:02:52,360
list comprehension,
31

32
00:02:52,380 --> 00:03:00,700
right? If we want to do it this way we can take our clean_email_lengths variable and simply set that
32

33
00:03:00,700 --> 00:03:04,660
equal to the result of the Python list comprehension.
33

34
00:03:04,930 --> 00:03:13,790
The bit that we want to append of course is "len(sublist)" and the for loop would go inside
34

35
00:03:13,790 --> 00:03:14,690
these square brackets.
35

36
00:03:14,690 --> 00:03:24,940
So "for sublist in stemmed_nested_list". This is how we can do it in the Python list comprehension way.
36
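
The same word counts can be sketched as a list comprehension. As before, the toy stemmed_nested_list is an assumption for illustration, not the course data.

```python
# Hypothetical toy data standing in for the real stemmed_nested_list.
stemmed_nested_list = [
    ["free", "offer", "click"],
    ["meet", "tomorrow"],
    ["pleas", "find", "attach", "report", "review"],
]

# The expression we used to append comes first, the for clause follows,
# and the whole thing sits inside square brackets.
clean_email_lengths = [len(sublist) for sublist in stemmed_nested_list]

print(clean_email_lengths)
```

This builds the same list as the loop version in a single line, with no empty list to set up first.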

37
00:03:26,290 --> 00:03:33,040
To print out the number of words in the longest email,
37

38
00:03:33,130 --> 00:03:35,610
how would we do it?
38

39
00:03:35,680 --> 00:03:40,780
Well, there's a Python function for finding the largest value in a list and that's the "max"
39

40
00:03:40,780 --> 00:03:51,050
function - "max(clean_email_lengths)" will give us the largest value in this list.
40

41
00:03:51,590 --> 00:03:58,340
In this case, the largest value is 7661.
41

42
00:03:58,490 --> 00:04:07,940
This is the number of words in the longest email. In terms of where this email is, in terms
42

43
00:04:07,940 --> 00:04:12,530
of its position, you would have had to do a little bit of googling right.
43

44
00:04:12,530 --> 00:04:17,090
You would have had to find the position of this value,
44

45
00:04:17,090 --> 00:04:22,430
7661 in the clean email lengths list.
45

46
00:04:24,120 --> 00:04:36,720
So the email position in the list and also the data dataframe, because they match, right, is going to
46

47
00:04:36,720 --> 00:04:41,790
be found at "np.argmax(
47

48
00:04:41,820 --> 00:04:53,580
clean_email_lengths)", so numpy has a handy function called "argmax"
48

49
00:04:53,620 --> 00:05:01,010
which will give us the location of the largest value in this list.
49
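
Here is a small sketch of "max" and "np.argmax" working together, using made-up word counts rather than the real ones. The plain-Python alternative at the end is one other way to find the position, not the approach shown in the video.

```python
import numpy as np

# Hypothetical word counts for three cleaned emails.
clean_email_lengths = [3, 2, 5]

# Largest value in the list: the word count of the longest email.
longest_length = max(clean_email_lengths)

# Position (index) of that largest value in the list.
longest_position = np.argmax(clean_email_lengths)

# Alternative without numpy: list.index returns the first position
# at which the given value occurs.
longest_position_plain = clean_email_lengths.index(longest_length)

print(longest_length, longest_position, longest_position_plain)
```

If several emails tie for the longest, both approaches report the first one.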

50
00:05:01,010 --> 00:05:06,190
So figuring this out was the second part of the challenge if you will. Now,
50

51
00:05:06,250 --> 00:05:07,720
this isn't the only way.
51

52
00:05:07,760 --> 00:05:13,760
And if you have another favorite way that you solve this problem please share it in the comments below
52

53
00:05:13,760 --> 00:05:22,850
this lesson and I'd be curious to have a read and find out how you solved this problem. So let me hit Shift+
53

54
00:05:22,850 --> 00:05:26,240
Enter to find out where this email is.
54

55
00:05:26,310 --> 00:05:35,950
It's at position 5401. Bringing up the list of words in this email should
55

56
00:05:35,950 --> 00:05:37,080
be fairly simple,
56

57
00:05:37,240 --> 00:05:46,360
because all I have to do is feed this value into the square brackets for my stemmed_nested_list, so "stemmed_nested_list[np.
57

58
00:05:46,830 --> 00:05:47,130
argmax(
58

59
00:05:47,140 --> 00:05:55,570
clean_email_lengths)]" will show me what the words are that are in this list, so
59

60
00:05:55,570 --> 00:06:00,850
this is 5600 odd words long. Now,
60
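
A short sketch of feeding the position into the square brackets of the nested list. The toy data is again an assumption for illustration.

```python
import numpy as np

# Hypothetical toy data standing in for the real stemmed_nested_list.
stemmed_nested_list = [
    ["free", "offer", "click"],
    ["meet", "tomorrow"],
    ["pleas", "find", "attach", "report", "review"],
]
clean_email_lengths = [len(sublist) for sublist in stemmed_nested_list]

# Use the position of the largest count to index the nested list.
longest_position = np.argmax(clean_email_lengths)
longest_email_words = stemmed_nested_list[longest_position]

print(longest_email_words)
```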

61
00:06:00,870 --> 00:06:03,900
what about pulling out the original email from the dataframe?
61

62
00:06:04,200 --> 00:06:11,700
In this case, I would use "data.at" because I know exactly which document ID we're going to supply,
62

63
00:06:12,660 --> 00:06:20,190
namely "np.argmax(clean_email_lengths)".
63

64
00:06:20,190 --> 00:06:25,450
So this is for the row name, right. Row location would be "iat",
64

65
00:06:25,620 --> 00:06:33,920
but row name would be "at", but we've handily named our rows after integers so we can do it this way,
65

66
00:06:35,460 --> 00:06:39,830
but for the column, because we don't want the entire row,
66

67
00:06:39,990 --> 00:06:46,830
we'll just supply the name of the column after the comma, so it'll be 'MESSAGE' and this is it.
67
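
To make the "at" versus "iat" distinction concrete, here is a sketch with a tiny made-up dataframe. The name "data" and the 'MESSAGE' column follow the lesson, but the three messages and the word counts are invented, and the rows are assumed to be labelled 0, 1, 2 as in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's data dataframe.
data = pd.DataFrame({
    "MESSAGE": [
        "Free offer, click now!",
        "Meeting tomorrow.",
        "Please find attached the report for review.",
    ]
})
clean_email_lengths = [3, 2, 5]  # made-up word counts after cleaning

# Convert to a plain int so it can be used as a row label or position.
longest_position = int(np.argmax(clean_email_lengths))

# .at looks up by row label and column label.
original_by_label = data.at[longest_position, "MESSAGE"]

# .iat looks up by integer position for both row and column.
message_col = data.columns.get_loc("MESSAGE")
original_by_position = data.iat[longest_position, message_col]

print(original_by_label)
```

The two lookups agree here only because the row labels are the integers 0, 1, 2, so label and position coincide.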

68
00:06:46,900 --> 00:06:48,350
This is the original email
68

69
00:06:49,060 --> 00:06:55,400
after removing the header. So you can see it's quite long.
69

70
00:06:57,500 --> 00:06:58,350
Brilliant.
70

71
00:06:58,460 --> 00:07:01,730
So I hope you solved this challenge on your own
71

72
00:07:01,940 --> 00:07:09,730
and the solution was helpful in comparing your code with mine. In the next lessons,
72

73
00:07:09,790 --> 00:07:13,880
we're going to go back to working a bit more with our dataframes.
73

74
00:07:13,970 --> 00:07:14,980
I'll see you there.