0 1 00:00:00,780 --> 00:00:07,860 Now it's time to apply the cleaning and tokenization to all the 5800 messages. 1 2 00:00:07,860 --> 00:00:15,540 But before we jump right into that, we have to review and practice a few key skills and that involves 2 3 00:00:15,960 --> 00:00:19,260 slicing dataframes and creating subsets, 3 4 00:00:19,410 --> 00:00:26,310 but not only that. We have to find a way to call our clean messages function on all the e-mail bodies 4 5 00:00:26,490 --> 00:00:27,850 stored in our dataframe. 5 6 00:00:29,050 --> 00:00:37,260 I'm going to change this cell to markdown and quickly commemorate what we're doing - we're going to "Apply Cleaning 6 7 00:00:37,440 --> 00:00:49,260 and Tokenisation to all messages", but before we dive into cleaning 5800 email bodies, it'll be handy 7 8 00:00:49,260 --> 00:00:53,570 to review these techniques for slicing pandas dataframes and series. 8 9 00:00:53,580 --> 00:00:57,540 So I'm going to add a little subheading here that reads 9 10 00:00:59,810 --> 00:01:07,160 "Slicing Dataframes and Series & Creating Subsets". 10 11 00:01:08,260 --> 00:01:08,680 Okay. 11 12 00:01:08,690 --> 00:01:17,390 So recently we've used the "at" attribute to select a particular cell or a scalar from our dataframe. 12 13 00:01:17,390 --> 00:01:23,880 This was a very efficient way of accessing a particular entry in the dataframe and the way this "at" 13 14 00:01:23,960 --> 00:01:31,870 attribute worked was that we had to specify both an index name and a column name. 14 15 00:01:31,880 --> 00:01:38,390 Now I realize the column name here is a string and the index name here is a number, but that's because we've 15 16 00:01:38,390 --> 00:01:42,610 given our index a numerical value here. 16 17 00:01:42,710 --> 00:01:50,840 The important thing to remember is that the "at" attribute works off names. 
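As a rough illustration of the label-based lookup described above, here is a minimal sketch on a tiny stand-in dataframe. The column names, the "DOC_ID" index name and the sample values are assumptions made for the example, not the real course data.

```python
import pandas as pd

# Hypothetical stand-in for the course dataframe: the column names, the
# DOC_ID index name and the values are assumptions for illustration only.
data = pd.DataFrame(
    {
        "CATEGORY": [1, 1, 0],
        "MESSAGE": ["Buy now", "Win a prize", "Lunch at noon?"],
        "FILE_NAME": ["a.txt", "b.txt", "c.txt"],
    },
    index=pd.Index([0, 1, 2], name="DOC_ID"),
)

# "at" takes an index label and a column label and returns a single scalar.
entry = data.at[2, "MESSAGE"]
print(entry)
```

Here the index label happens to be a number, but "at" still treats it as a name rather than a position.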
If you'd like to work off a 17 18 00:01:50,840 --> 00:01:56,810 position, say at position 1, position 2, position 3, then there is an alternative attribute that 18 19 00:01:56,900 --> 00:01:58,330 you can use instead. 19 20 00:01:58,370 --> 00:02:00,160 Let me show you what I mean. 20 21 00:02:00,440 --> 00:02:11,790 If we say "data.iat[2,1]", we get the entry at position number 21 22 00:02:11,790 --> 00:02:16,320 2 and column number 1. Column number 1 22 23 00:02:16,320 --> 00:02:22,640 in this case is our "Messages" column. Column number 0 was our "Category" column, 23 24 00:02:22,860 --> 00:02:26,930 Column number 2 is our "Filename" column. 24 25 00:02:27,240 --> 00:02:30,840 So with "at" you're working off a name to get a particular entry 25 26 00:02:30,990 --> 00:02:35,500 and with "iat" you're working off a location or a position. 26 27 00:02:35,670 --> 00:02:42,030 This is very, very useful for selecting a single message or a single entry in the dataframe, especially 27 28 00:02:42,030 --> 00:02:48,030 if you're working in a loop or your code is iterating over a certain part of your dataframe with a 28 29 00:02:48,030 --> 00:02:49,440 numerical value, 29 30 00:02:49,470 --> 00:02:56,130 this "iat" attribute will make your life a lot easier than the "at" attribute which works off the names. 30 31 00:02:56,130 --> 00:03:02,640 Now what if you wanted to select a subset of the dataframe instead? 31 32 00:03:02,910 --> 00:03:06,210 What if you wanted to select more than one value, more than one row 32 33 00:03:06,210 --> 00:03:14,580 for example? In this case, you can use the "iloc", iloc attribute, 33 34 00:03:14,660 --> 00:03:24,380 so this is "data.iloc[0:2]". 34 35 00:03:24,680 --> 00:03:31,490 This code here will select the first two entries in our dataframe, the first two rows if you will. If 35 36 00:03:31,490 --> 00:03:33,980 I hit Shift+Enter we can see what they are. 
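The difference between names and positions is easier to see when the index labels do not start at zero. The sketch below assumes a small made-up dataframe (hypothetical columns and values) whose "DOC_ID" labels are 10, 11 and 12.

```python
import pandas as pd

# Hypothetical data; labels 10, 11, 12 are chosen so that labels and
# positions differ and the two attributes cannot be confused.
data = pd.DataFrame(
    {
        "CATEGORY": [1, 1, 0],
        "MESSAGE": ["Buy now", "Win a prize", "Lunch at noon?"],
        "FILE_NAME": ["a.txt", "b.txt", "c.txt"],
    },
    index=pd.Index([10, 11, 12], name="DOC_ID"),
)

# "iat" works off positions: third row (position 2), second column (position 1).
by_position = data.iat[2, 1]

# "at" works off names: index label 12, column label "MESSAGE".
by_label = data.at[12, "MESSAGE"]

# "iloc" with a slice selects a subset of rows; the end position is excluded.
first_two = data.iloc[0:2]

# The same slice works on a single column, which is a series.
first_two_messages = data.MESSAGE.iloc[0:2]

print(by_position == by_label)
print(first_two)
```

Both lookups return the same cell here; only the way of addressing it differs.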
36 37 00:03:34,460 --> 00:03:41,870 If we wanted to select the first five rows, then we would simply substitute a five for the two and here's 37 38 00:03:41,870 --> 00:03:43,160 the challenge. 38 39 00:03:43,160 --> 00:03:49,070 What would you put in between the square brackets if you wanted to select the entries with "DOC_ID" 39 40 00:03:49,490 --> 00:03:53,470 5, 6, 7, 8, 9 and 10? 40 41 00:03:53,600 --> 00:03:54,770 What would you supply here? 41 42 00:03:57,690 --> 00:04:06,610 In that case, you would go for "5:11" and that's because the first entry is at index 0 and the end of the slice is not included. 42 43 00:04:06,720 --> 00:04:09,270 Now the nice thing about this "iloc" attribute is 43 44 00:04:09,360 --> 00:04:13,650 it works on dataframes as well as series. 44 45 00:04:13,650 --> 00:04:23,410 So if I had only a single column selected, say "data.MESSAGE", I could still use this attribute "iloc 45 46 00:04:23,440 --> 00:04:31,140 [0:3]" and select only the first three entries in this series. 46 47 00:04:31,210 --> 00:04:35,770 This is very, very handy for creating or selecting a subset. 47 48 00:04:35,950 --> 00:04:40,750 So now that we've covered this let's move on to the next question. 48 49 00:04:40,870 --> 00:04:47,170 Say we want to try out our "clean_message" function on the first three emails out of the 49 50 00:04:47,320 --> 00:04:48,840 5800 odd emails. 50 51 00:04:48,880 --> 00:04:49,660 How would we do it? 51 52 00:04:51,330 --> 00:04:54,370 Well step 1 is selecting these email bodies of course 52 53 00:04:54,510 --> 00:04:56,840 and we've done that right here. 53 54 00:04:56,850 --> 00:04:59,680 Step 2 is using the pandas 54 55 00:04:59,790 --> 00:05:01,560 "apply" function. 55 56 00:05:01,600 --> 00:05:02,970 Now I absolutely love this function. 56 57 00:05:02,970 --> 00:05:12,290 This is genius. Say we store these three entries in a series, call it "first_emails", set that equal to 57 58 00:05:12,320 --> 00:05:13,390 "data.MESSAGE.
58 59 00:05:13,400 --> 00:05:18,930 iloc[0:3]" and then we can take this series, 59 60 00:05:21,620 --> 00:05:32,100 put a dot after it, use "apply()" and then supply the name of our function which was "clean_ 60 61 00:05:32,100 --> 00:05:40,090 message". Note there are no parentheses after "clean_message". We're supplying just 61 62 00:05:40,090 --> 00:05:48,860 the name of our function to the "apply" method which we're calling on our series object here. 62 63 00:05:49,880 --> 00:05:53,050 Let's see what happens. What we get back 63 64 00:05:53,060 --> 00:05:59,820 is kind of a list of lists. We've got all our words in the list split out for our first e-mail. 64 65 00:05:59,840 --> 00:06:01,610 Same thing with our second e-mail. 65 66 00:06:01,890 --> 00:06:06,220 And again we've got all the words split out for our third e-mail as well. 66 67 00:06:07,510 --> 00:06:13,090 If you're curious what type of object we're dealing with here, you can take a look by just wrapping 67 68 00:06:13,090 --> 00:06:18,970 this in the type function and you see that you still are working with a series. 68 69 00:06:19,330 --> 00:06:20,140 So that's good. 69 70 00:06:20,140 --> 00:06:22,890 That's really, really interesting. 70 71 00:06:22,890 --> 00:06:24,950 Now I'm gonna give this thing a name. 71 72 00:06:24,970 --> 00:06:33,850 I'm going to call it "nested_list" and I'm going to set it equal to the output from this "apply" 72 73 00:06:33,850 --> 00:06:35,290 function. 73 74 00:06:35,510 --> 00:06:42,110 I'm calling it a nested list because effectively I've almost got like a list of lists, right. Each individual 74 75 00:06:42,110 --> 00:06:52,330 entry in my pandas series here is a list, but here's a question: What if I wanted just one list of words? What 75 76 00:06:52,330 --> 00:06:56,270 if I didn't want a list of lists? Let me hit Shift+Enter on this. 76 77 00:06:59,770 --> 00:07:07,260 I'm going to create an empty list here and then just write a for loop, right.
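The "apply" step described above might look roughly like the sketch below. The "clean_message" defined here is only a hypothetical stand-in (it lowercases and splits on whitespace); the real function from the earlier lessons does more cleaning, and the sample messages are made up.

```python
import pandas as pd


def clean_message(message):
    # Hypothetical stand-in for the course's cleaning function:
    # lowercase the text and keep only purely alphabetic words.
    return [word.lower() for word in message.split() if word.isalpha()]


# Made-up messages standing in for the email bodies.
data = pd.DataFrame({"MESSAGE": ["Buy now", "Win a prize today", "Lunch at noon"]})

# Select the first three email bodies as a series.
first_emails = data.MESSAGE.iloc[0:3]

# Pass the function itself (no parentheses); pandas calls it once per entry.
nested_list = first_emails.apply(clean_message)

print(type(nested_list))
print(nested_list.iloc[1])
```

The result is still a pandas series, and each entry in it is a list of words.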
A nested one probably. I have to go over 77 78 00:07:07,260 --> 00:07:15,760 all the items in the list first. So I'll say "for sublist in nested_list", that's my first part of my loop 78 79 00:07:16,740 --> 00:07:24,790 and then my second part of my loop, the inner loop, I'll say "for item in sublist:". The outer loop goes 79 80 00:07:24,790 --> 00:07:32,440 over all the lists and the inner loop goes over all the individual words in each sublist. Put a colon there 80 81 00:07:32,910 --> 00:07:45,530 and then say "flat_list.append(item)". That will pretty much do the job. So if I wanted to 81 82 00:07:45,530 --> 00:07:52,670 check what the length of this flat_list is, then I can use "len(flat_list)". Let's see what 82 83 00:07:52,670 --> 00:07:56,120 we get - 390, 83 84 00:07:56,130 --> 00:08:02,570 right. So that's 390 words now in a single list. If we wanted to take a peek at what it 84 85 00:08:02,570 --> 00:08:10,310 looks like, it looks like this, it's just all the email words in one big list. And you know this is 85 86 00:08:10,310 --> 00:08:14,280 certainly one way of doing it, but let me show you an alternative. 86 87 00:08:14,390 --> 00:08:20,820 Let me show you what this would look like with something called Python list comprehension. 87 88 00:08:21,050 --> 00:08:29,000 So I'm going to comment all of this out. If you select the text like me and then press Control + /, 88 89 00:08:29,780 --> 00:08:35,330 this is the shortcut for commenting out an entire block of code. If you're on a Mac of course, it'll 89 90 00:08:35,330 --> 00:08:42,940 be Command + /. But for everybody else Control + /. The Python list comprehension 90 91 00:08:42,940 --> 00:08:50,260 syntax for accomplishing exactly the same thing as with these two nested loops looks like this.
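The two nested loops described above can be sketched as follows. The "nested_list" here is a small made-up plain list of lists; in the lesson it is a pandas series of lists, which can be iterated over in the same way.

```python
# Made-up stand-in for the series of word lists returned by "apply".
nested_list = [["buy", "now"], ["win", "a", "prize"], ["lunch", "at", "noon"]]

flat_list = []

# Outer loop: go over each list of words.
for sublist in nested_list:
    # Inner loop: go over each individual word in that list.
    for item in sublist:
        flat_list.append(item)

print(len(flat_list))
print(flat_list)
```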
91 92 00:08:50,260 --> 00:08:58,610 So I'll still use flat_list as my variable and I'll have square brackets and then I'm basically going to 92 93 00:08:58,610 --> 00:09:03,490 move my loop inside these two square brackets. I'll say "item for 93 94 00:09:07,100 --> 00:09:16,050 sublist in nested_list for item in sublist". 94 95 00:09:17,020 --> 00:09:22,390 This here you'll recognize as the outer loop "for sublist in nested_list" 95 96 00:09:22,510 --> 00:09:29,470 and this here you'll recognize as the inner loop "for item in sublist". 96 97 00:09:29,470 --> 00:09:35,620 This first bit of code is what appends the item to the list. 97 98 00:09:35,680 --> 00:09:38,040 Let's see what happens when I run this again. 98 99 00:09:38,050 --> 00:09:42,550 I in fact get the exact same result. 99 100 00:09:42,580 --> 00:09:48,330 So now you've got another application of this Python list comprehension syntax. 100 101 00:09:48,550 --> 00:09:54,350 You can often use it instead of nesting these for loops. Now you've probably put two and two together 101 102 00:09:54,800 --> 00:10:03,120 on how to call our "clean_message" function on all the 5800 odd items. We're gonna 102 103 00:10:03,130 --> 00:10:06,430 do something very, very similar to what we did a minute ago. 103 104 00:10:06,610 --> 00:10:23,020 We'll create a nested list and we'll set it equal to "data.MESSAGE.apply(clean_message 104 105 00:10:23,320 --> 00:10:37,310 no_html)". This here will use "apply" on all the messages in the dataframe. Now before I hit Shift+Enter and 105 106 00:10:37,310 --> 00:10:43,730 execute this code, I'll use a little bit of Jupyter magic. I'll put two percent signs here and then use 106 107 00:10:43,730 --> 00:10:52,280 that keyword "time" and what this will do is it'll spit out how long this computation will take.
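For comparison, the list comprehension version of the flattening step can be sketched like this, again on a small made-up list of lists rather than the real email data.

```python
# Made-up stand-in for the series of word lists returned by "apply".
nested_list = [["buy", "now"], ["win", "a", "prize"], ["lunch", "at", "noon"]]

# The "for" clauses appear in the same order as the nested loops:
# outer loop first, inner loop second; the leading "item" is what gets collected.
flat_list = [item for sublist in nested_list for item in sublist]

print(len(flat_list))
print(flat_list)
```

The order of the words is the same as with the nested loops, so the two versions can be swapped freely.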
107 108 00:10:52,310 --> 00:10:58,280 So this is kind of like a little bit of a benchmarking tool for looking at the performance of your Python 108 109 00:10:58,280 --> 00:10:59,760 code. 109 110 00:10:59,810 --> 00:11:06,890 It'll also help you see just how long it takes for my code to execute on my computer and you can compare 110 111 00:11:06,890 --> 00:11:08,560 this to your own. 111 112 00:11:08,600 --> 00:11:15,290 The first thing I'm seeing here is that our Beautiful Soup package seems to think that we're trying 112 113 00:11:15,290 --> 00:11:18,570 to open a Web site. 113 114 00:11:18,600 --> 00:11:20,190 Now this is definitely not the case. 114 115 00:11:20,300 --> 00:11:24,320 So this user warning is a little bit unnecessary. 115 116 00:11:24,320 --> 00:11:31,220 Our code is still working, nothing's breaking, but this is something that Beautiful Soup has included in 116 117 00:11:31,220 --> 00:11:36,710 their code as a suggestion for people who are looking to open Web sites. 117 118 00:11:38,070 --> 00:11:44,280 The other thing that you can see is that, in order to apply this function to all my 5800 118 119 00:11:44,280 --> 00:11:49,400 odd messages, my computer ran for like a good minute. 119 120 00:11:49,560 --> 00:11:52,430 It got there in the end, but it definitely did some work. 120 121 00:11:54,240 --> 00:11:58,080 So let's check out the head of our nested_list. 121 122 00:11:58,080 --> 00:12:02,310 Let's look at the first few lines. Looks good. 122 123 00:12:03,610 --> 00:12:05,180 What about the tail? 123 124 00:12:05,290 --> 00:12:06,740 The last few lines. 124 125 00:12:06,970 --> 00:12:08,100 Brilliant. 125 126 00:12:08,170 --> 00:12:09,010 That also looks good. 126 127 00:12:10,290 --> 00:12:11,490 Nothing unexpected.
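The "%%time" magic only works inside a notebook cell. As a rough plain-Python alternative (a swapped-in technique, not what the lesson uses), the sketch below times the same kind of "apply" call with "time.perf_counter" and then looks at the head and tail. The cleaning function and the messages are hypothetical stand-ins, so the timing will not resemble the real run.

```python
import time

import pandas as pd


def clean_message(message):
    # Hypothetical stand-in for the course's cleaning function.
    return [word.lower() for word in message.split() if word.isalpha()]


# Made-up messages, repeated to get a few hundred rows.
data = pd.DataFrame({"MESSAGE": ["Buy now", "Win a prize", "Lunch at noon"] * 100})

# Measure the wall-clock time of the apply call.
start = time.perf_counter()
nested_list = data.MESSAGE.apply(clean_message)
elapsed = time.perf_counter() - start

print(f"apply took {elapsed:.4f} seconds")

# Check the first and last few entries, as in the lesson.
print(nested_list.head())
print(nested_list.tail())
```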