All right. So now that we have our nested list of all the words in all the emails, stemmed and cleaned, I want to return to one of our previous topics, namely slicing dataframes and creating subsets.

Previously, we looked at the at attribute, we looked at the iat attribute, and we looked at the iloc attribute. In this lesson, I want to show you how to use logic to create a subset of the rows: I want to create a subset from a dataframe based on a condition. Consider this part two of slicing dataframes. Let me add a quick markdown cell here that's going to read "Using Logic to Slice Dataframes".

Now, we've already seen how we can use the square bracket notation with a dataframe to select a particular column, say the message column. But another thing that we can do in these square brackets is to place a condition, similar to what you would find after an if statement. Say, suppose I want to access a subset of our dataframe, say all the spam messages. Then what I could do in here is to have data.CATEGORY, so selecting a particular column, and then say: only give me the rows where the category is equal to one. This will actually create a subset based on this condition being true.

If I chain on the shape attribute here, then we can see that the size of this subset is 1,896 rows and three columns. This is exactly the number of spam messages that we had. If I copy and paste this code and change that shape at the end to head, we can look at the first few rows in this subset. Now, these look very similar to what we've seen before. But if I change this head to tail, we can see that the last spam message is the document with ID 1,895.

Now, I want to pose a challenge to you to put this subsetting into a bit more practice. I'd like you to create two variables, doc_ids_spam and doc_ids_ham, and I want you to store the indices of the spam and the non-spam emails in these variables. So pause the video and try to figure this out.

Did you have a go? Here's the solution. I'll say doc_ids_spam is equal to a subset from my data dataframe, so it's data, square brackets, data.CATEGORY double-equals one. Then I'll use the index attribute to store the index of the subset inside this variable.
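As a minimal sketch of what this looks like in code: the dataframe below is a tiny made-up stand-in for the real data dataframe built over the previous lessons, and the category values and messages in it are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the course dataframe (values are made up);
# in the notebook, 'data' already exists with its CATEGORY column.
data = pd.DataFrame({
    'CATEGORY': [1, 1, 0, 0, 0],
    'MESSAGE': ['free cash click', 'win big', 'meet at noon',
                'see you soon', 'lunch today?'],
})

# A condition inside the square brackets keeps only rows where it is True.
print(data[data.CATEGORY == 1].shape)   # (2, 2) here; (1896, 3) in the lesson
print(data[data.CATEGORY == 1].tail())  # last few spam rows

# Challenge solution: store just the row indices for spam and for ham.
doc_ids_spam = data[data.CATEGORY == 1].index
doc_ids_ham = data[data.CATEGORY == 0].index
```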
And I'll do something very, very similar for the ham messages, for the non-spam messages, that is. I'll just change the name of the variable to ham and then change the category to zero. And that's it. Here's what these index values look like for our non-spam messages: they start at 1,896 and go all the way up to 5,795, and there's a total of 3,900 of them.

But the thing is, now that we know exactly which indices in the dataframe correspond to our spam messages and which indices correspond to our non-spam messages, I want to show you another trick. I want to show you how you can create a subset using these indices directly. So let me quickly add a markdown cell so we can find this again easily. I'll say "Subsetting a Series with an Index".

Now, if you recall, the type of our doc_ids_ham is, in fact, an Index, and the type of our nested_list is a pandas Series. When you're working with a Series or a dataframe, you can use the loc attribute, right: .loc with the square brackets. And then here you can feed in an index directly. So if, say, we wanted to access a subset of the nested list that just contained the non-spam messages, we could feed the non-spam indices in between the square brackets of the loc attribute, the location attribute.

I'm actually going to store all of this in a variable. I'm going to call this variable nested_list_ham. This nested list has the following shape: it's got 3,900 entries, and the last few entries in this series look like this.

Let's do the same thing for our spam messages. I'm quickly just going to paste this in here, change this to spam, and change this to spam here. So I've got a nested_list_spam variable, and I've got my indices for my spam emails inside the square brackets of the loc attribute for the nested list. And this will give me a nested list of all the spam messages. Fantastic.
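Continuing the sketch from above, this is roughly what that loc-based subsetting looks like; the token lists here are again made-up stand-ins for the real nested_list Series.

```python
# Stand-in for the nested_list Series built in earlier lessons:
# one list of stemmed, lowercased tokens per document.
nested_list = pd.Series([
    ['free', 'cash', 'click'],
    ['win', 'big'],
    ['meet', 'noon'],
    ['see', 'you', 'soon'],
    ['lunch', 'today'],
])

# .loc accepts an Index directly and returns the matching entries.
nested_list_ham = nested_list.loc[doc_ids_ham]
nested_list_spam = nested_list.loc[doc_ids_spam]

print(nested_list_ham.shape)  # (3,) here; (3900,) in the lesson
print(nested_list_ham.tail())
```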
Now we have two separate lists of all the ham messages and all the spam messages, all broken up into the tokenized, stemmed, and lowercase words. At this stage, I want to propose another challenge to you. This is going to be a bit of a bigger challenge. I'd like you to practice some list comprehension, and then I'd like you to find the total number of words in our clean dataset of spam email bodies. I'd also like you to find the total number of words used in the normal emails. And then I'd also like you to tell me what the ten most common words are in all the spam messages, as well as the ten most common words in all the normal email messages.

I hope you're up for the challenge. I'll give you a few seconds to pause the video and figure this out.

Are you ready? Here's the solution. Let's tackle the non-spam emails first. I'll create a flat list from our nested list as follows. So it's flat_list_ham is equal to square brackets, and this is where I'm going to use list comprehension again; this is a little bit of review from what we've done before. So I'll say item for sublist in nested_list_ham for item in sublist. And that's the list comprehension done. Here we are flattening our nested list of all the words in the non-spam messages and putting them into a single place.

Now what I'll do is find the total number of words. So I'll say, maybe, normal_words is equal to pd.Series, parentheses: so create a pandas Series from our flattened list of ham emails. I can then easily find the total number of words used here with the shape attribute, and it's just at index zero. So this is the total number of words: we've got approximately 441,000 words inside our bag of words for our non-spam messages.

But there's one problem here, right? We've got 441,000 words in our bag of words of normal emails, but there's going to be some repetition here. What if we want to find the number of unique words? If we wanted to find the number of unique words, then we'd have to use the value_counts method. So check this out. If I add .value_counts() to my code here, then I'll still be creating a series using the flattened list of ham messages, but instead of storing that in my variable directly, I first call the value_counts method. This will then tell me the total number of unique words in the non-spam messages.
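Here's that solution as a sketch, continuing from the blocks above:

```python
# Flatten the nested list of ham tokens into one flat list of words.
flat_list_ham = [item for sublist in nested_list_ham for item in sublist]

# A Series from the flat list; shape[0] is the total word count.
print(pd.Series(flat_list_ham).shape[0])  # ~441,000 in the lesson's dataset

# value_counts() collapses repeats into one row per unique word,
# with its frequency, sorted from most to least common.
normal_words = pd.Series(flat_list_ham).value_counts()
print(normal_words.shape[0])  # ~21,000 unique words in the lesson's dataset
```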
In this case, we have approximately 21,000 words. So while we've got a total of approximately 441,000 words, we've only got about 21,000 unique words in our dataset of non-spam messages.

To find the most common words, all we have to do now is take our normal_words and look at the first ten values: so square brackets, colon, ten. Here we go. These are the ten most common words in the non-spam messages. This is the number of times that the word "http" occurs, and this is the number of times the word "use" occurred in our non-spam messages.

So I hope you saw how we used a couple of the techniques that we've covered already to answer this question of what the top ten most common words were, and how we were able to do this with very, very few lines of code, thanks to the power of pandas. Oftentimes with these things, it's just a matter of remembering what the right method is, or Googling for it.

But this is just the ham messages, that is, just the non-spam messages. What about the spam messages? Let me copy the code here, scroll down, and then we'll do the spam messages in just the same way. So we'll change our variable name here, we'll change the name of the nested list we're using, and we'll create the series from the flat list of spam messages. We'll call this spammy_words instead of normal_words. And then we'll print out the shape, the total number of unique spammy words, as well. So this is the total number of unique words in the spam messages. Let's see what we get.

In the spam messages, we have a total of around 13,000 unique words, and the most common ones, spammy_words, square brackets, colon, ten, are the top ten spammy words. At number one, we've got "http"; at number two, we've got the word "email", then the word "free", then the word "click".

Now, one thing that you might be wondering about is, of course: wait a minute, these words were all stemmed, right? These things here aren't real words. So we can't be entirely sure whether this word here is the stemmed word form for "please" or for "pleasure" just by looking at it. We might not know what our Porter Stemmer has done behind the scenes. So that's something to think about for the future lessons.
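Before moving on, here's the whole spam-side comparison as a sketch, continuing from the blocks above:

```python
# value_counts() output is sorted by frequency, so slicing with [:10]
# gives the ten most common ham words and their counts.
print(normal_words[:10])

# Same pipeline for the spam messages.
flat_list_spam = [item for sublist in nested_list_spam for item in sublist]
spammy_words = pd.Series(flat_list_spam).value_counts()

print(spammy_words.shape[0])  # ~13,000 unique words in the lesson's dataset
print(spammy_words[:10])      # top ten spammy words
```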
In the next lesson, we're going to be doing some more data exploration in the form of some advanced visualization techniques. I want to show you how to create something called a word cloud, because it's all very nice and good printing out the spammy words or the normal words as a list where you see the frequency and the word, but what's usually a lot nicer is to present this graphically, to present this visually, because oftentimes visuals are just so much more convincing in any meeting or any presentation.

I can't wait to see you again in the next lesson. Take care.