0 1 00:00:00,890 --> 00:00:11,510 For our second exercise what I'd like you to do is find the email with the largest number of words. 1 2 00:00:11,510 --> 00:00:13,240 So this is after cleaning, 2 3 00:00:13,280 --> 00:00:15,450 right. 3 4 00:00:16,430 --> 00:00:24,020 In this challenge, I'd like you to print out the number of words in the longest email - that is after cleaning 4 5 00:00:24,020 --> 00:00:31,070 and stemming and I'd like you to note the longest email's position in the list of cleaned emails. 5 6 00:00:32,580 --> 00:00:39,510 Also print out the stemmed list of words in the longest email and print out the longest email from 6 7 00:00:39,510 --> 00:00:47,230 the data dataframe. I'll give you a few seconds to pause the video and give this a go. 7 8 00:00:51,370 --> 00:00:58,120 If you want a hint, use the length function, "len", and practice the Python list comprehension. 8 9 00:01:01,440 --> 00:01:01,980 All right. 9 10 00:01:02,010 --> 00:01:03,850 So here's the solution. 10 11 00:01:04,290 --> 00:01:07,340 One way to do this is to use a for loop. 11 12 00:01:07,340 --> 00:01:07,680 Yeah. 12 13 00:01:07,680 --> 00:01:09,360 Classic. 13 14 00:01:09,360 --> 00:01:17,940 In this case, we would create an empty list, say "clean_email_lengths", which is going to hold 14 15 00:01:17,940 --> 00:01:21,510 on to the number of words in each email. 15 16 00:01:21,510 --> 00:01:25,370 So an empty list is created with an empty pair of square brackets. 16 17 00:01:25,650 --> 00:01:33,040 And then we write our for loop, so "for sublist in stemmed_nested_list", 17 18 00:01:33,060 --> 00:01:40,250 this is where we've stored all our email bodies, right. 18 19 00:01:40,330 --> 00:01:48,100 This is what we can iterate over to check the number of words, and we can check the number of words 19 20 00:01:48,190 --> 00:01:55,480 with the "len" function, so "len(sublist)" will check for the number of words.
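Because each sublist is a list of stemmed tokens, "len(sublist)" returns a count of words rather than a count of characters. A minimal sketch of the difference, using a made-up three-word email (the variable names here are illustrative and not from the lesson):

```python
# Hypothetical stemmed email body: a list of word tokens
stemmed_words = ["free", "offer", "click"]

# len() on a list counts its elements, i.e. the words
word_count = len(stemmed_words)

# len() on a string counts characters instead
char_count = len(" ".join(stemmed_words))

print(word_count, char_count)
```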
20 21 00:01:55,870 --> 00:02:02,200 But what we actually want to do with these word counts is append them to our empty 21 22 00:02:02,200 --> 00:02:04,810 list up here as the loop runs. 22 23 00:02:04,810 --> 00:02:15,220 So it'd be "clean_email_lengths.append(len(sublist))" and then 23 24 00:02:15,220 --> 00:02:20,290 two closing parentheses. Let's take a look at what this looks like. 24 25 00:02:20,370 --> 00:02:32,160 So I'm going to hit Shift+Enter and then maybe print my clean_email_lengths. What I can see here is that the first 25 26 00:02:32,490 --> 00:02:39,120 email has 50 words after stemming and after removing stop words, that is. The next one has 80 words, 26 27 00:02:39,120 --> 00:02:41,140 the next one has 92. 27 28 00:02:41,220 --> 00:02:42,900 So this seems to work. 28 29 00:02:42,900 --> 00:02:49,400 But one thing we can do is, instead of using this for loop, we can also do it in a very Pythonic way, 29 30 00:02:49,410 --> 00:02:51,420 we can use a Python 30 31 00:02:51,420 --> 00:02:52,360 list comprehension, 31 32 00:02:52,380 --> 00:03:00,700 right? If we want to do it this way we can take our clean_email_lengths variable and simply set that 32 33 00:03:00,700 --> 00:03:04,660 equal to the result of the Python list comprehension. 33 34 00:03:04,930 --> 00:03:13,790 The bit that we want to append of course is "len(sublist)" and the for loop would go inside 34 35 00:03:13,790 --> 00:03:14,690 these square brackets. 35 36 00:03:14,690 --> 00:03:24,940 So "for sublist in stemmed_nested_list". This is how we can do it in the Python list comprehension way. 36 37 00:03:26,290 --> 00:03:33,040 To print out the number of words in the longest email, 37 38 00:03:33,130 --> 00:03:35,610 how would we do it?
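The two approaches described above, the for loop and the list comprehension, can be sketched side by side. The toy "stemmed_nested_list" below is a hypothetical stand-in for the lesson's real list of stemmed emails, and "lengths_from_comprehension" is a name introduced here only so both results can be compared:

```python
# Hypothetical stand-in for the lesson's stemmed_nested_list:
# each inner list holds the stemmed words of one email.
stemmed_nested_list = [
    ["free", "offer", "click"],
    ["meet", "tomorrow"],
    ["win", "prize", "claim", "now"],
]

# Approach 1: start with an empty list and append inside a for loop
clean_email_lengths = []
for sublist in stemmed_nested_list:
    clean_email_lengths.append(len(sublist))

# Approach 2: build the same list with a list comprehension
lengths_from_comprehension = [len(sublist) for sublist in stemmed_nested_list]

print(clean_email_lengths)
print(lengths_from_comprehension)
```

Both versions produce the same list of word counts; the comprehension simply folds the loop and the append into one expression.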
38 39 00:03:35,680 --> 00:03:40,780 Well, there's a Python function for finding the largest value in a list and that's the "max" 39 40 00:03:40,780 --> 00:03:51,050 function - "max(clean_email_lengths)" will give us the largest value in this list. 40 41 00:03:51,590 --> 00:03:58,340 In this case, the largest value is 7661. 41 42 00:03:58,490 --> 00:04:07,940 This is the number of words in the longest email. In terms of where this email is, in terms 42 43 00:04:07,940 --> 00:04:12,530 of its position, you would have had to do a little bit of googling, right. 43 44 00:04:12,530 --> 00:04:17,090 You would have had to find the position of this value, 44 45 00:04:17,090 --> 00:04:22,430 7661, in the clean_email_lengths list. 45 46 00:04:24,120 --> 00:04:36,720 So the email position in the list and also the data dataframe, because they match, right, is going to 46 47 00:04:36,720 --> 00:04:41,790 be found at "np.argmax( 47 48 00:04:41,820 --> 00:04:53,580 clean_email_lengths)", so numpy has a handy, handy function called "argmax" 48 49 00:04:53,620 --> 00:05:01,010 which will give us the location of the largest value in this list. 49 50 00:05:01,010 --> 00:05:06,190 So figuring this out was the second part of the challenge if you will. Now, 50 51 00:05:06,250 --> 00:05:07,720 now this isn't the only way. 51 52 00:05:07,760 --> 00:05:13,760 And if you have another favorite way that you solve this problem please share it in the comments below 52 53 00:05:13,760 --> 00:05:22,850 this lesson and I'd be curious to have a read and find out how you solved this problem. So let me hit Shift+ 53 54 00:05:22,850 --> 00:05:26,240 Enter to find out where this email is. 54 55 00:05:26,310 --> 00:05:35,950 It's at position 5401. Bringing up the list of words in this email should 55 56 00:05:35,950 --> 00:05:37,080 be fairly simple, 56 57 00:05:37,240 --> 00:05:46,360 because all I have to do is feed this value into the square brackets for my stemmed_nested_list, so "np.
57 58 00:05:46,830 --> 00:05:47,130 argmax( 58 59 00:05:47,140 --> 00:05:55,570 clean_email_lengths)" will show me what the words are that are in this list, so 59 60 00:05:55,570 --> 00:06:00,850 this is 5600 odd words long. Now, 60 61 00:06:00,870 --> 00:06:03,900 what about pulling out the original email from the dataframe? 61 62 00:06:04,200 --> 00:06:11,700 In this case, I would use "data.at" because I know exactly which document ID we're going to supply, 62 63 00:06:12,660 --> 00:06:20,190 namely "np.argmax(clean_email_lengths)". 63 64 00:06:20,190 --> 00:06:25,450 So this is for the row name, right. Row location would be "iat", 64 65 00:06:25,620 --> 00:06:33,920 but row name would be "at", but we've handily named our rows after integers so we can do it this way, 65 66 00:06:35,460 --> 00:06:39,830 but for the column, because we don't want the entire row, 66 67 00:06:39,990 --> 00:06:46,830 we'll just supply the name of the column after the comma, so it'll be 'MESSAGE' and this is it. 67 68 00:06:46,900 --> 00:06:48,350 This is the original email 68 69 00:06:49,060 --> 00:06:55,400 after removing the header. So you can see it's quite long. 69 70 00:06:57,500 --> 00:06:58,350 Brilliant. 70 71 00:06:58,460 --> 00:07:01,730 So I hope you solved this challenge on your own 71 72 00:07:01,940 --> 00:07:09,730 and the solution was helpful in comparing your code with mine. In the next lessons, 72 73 00:07:09,790 --> 00:07:13,880 we're going to go back to working a bit more with our dataframes. 73 74 00:07:13,970 --> 00:07:14,980 I'll see you there.
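The remaining steps of the solution (largest word count, its position, the stemmed words, and the original message) can be sketched end to end as follows. This is a toy reconstruction under stated assumptions, not the lesson's notebook: the three-row "data" dataframe and its contents are made up, its default integer index stands in for the lesson's integer row names, and the names "longest_length", "longest_position", "longest_words" and "original_email" are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the lesson's real dataframe and list are much larger.
# The default integer index plays the role of the integer row names.
data = pd.DataFrame({"MESSAGE": [
    "Free offer, click here",
    "Meet tomorrow",
    "Win a prize, claim it now",
]})
stemmed_nested_list = [
    ["free", "offer", "click"],
    ["meet", "tomorrow"],
    ["win", "prize", "claim", "now"],
]

clean_email_lengths = [len(sublist) for sublist in stemmed_nested_list]

# Largest word count in the list
longest_length = max(clean_email_lengths)

# Position of that largest value (first occurrence if there are ties);
# int() converts the numpy integer to a plain Python int
longest_position = int(np.argmax(clean_email_lengths))

# Stemmed words of the longest email
longest_words = stemmed_nested_list[longest_position]

# Original message: .at looks up one cell by row label and column name
original_email = data.at[longest_position, "MESSAGE"]

print(longest_length, longest_position)
print(longest_words)
print(original_email)
```

The lookup with "at" works here only because the row labels match the list positions, as they do in the lesson; with non-matching labels, "iat" with an integer column position would be the position-based alternative.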