All right. So now that we have our nested list of all the words in all the emails, stemmed and cleaned, I want to return to one of our previous topics, namely slicing dataframes and creating subsets.

Previously, we looked at the at attribute, we looked at the iat attribute, and we looked at the iloc attribute. In this lesson, I want to show you how to use logic to create a subset of the rows: I want to create a subset from a dataframe based on a condition. Consider this part two of slicing dataframes. Let me add a quick markdown cell here that's going to read "Using Logic to Slice Dataframes".

Now, we've already seen how we can use the square bracket notation with a dataframe to select a particular column, say the message column. But another thing that we can do in these square brackets is to place a condition, similar to what you would find after an if statement. Say, suppose I want to access a subset of our dataframe, say all the spam messages. Then what I could do in here is to have data.CATEGORY, so selecting a particular column, and then say: only give me the rows where the category is equal to one. This will actually create a subset based on this condition being true.

If I chain on the shape attribute here, then we can see that the size of this subset is 1,896 rows and three columns. This is exactly the number of spam messages that we had. If I copy and paste this code and change that shape at the end to head, we can look at the first few rows in this subset. Now, these look very similar to what we've seen before. But if I change this head to tail, we can see that the last spam message is the document with ID 1,895.

Now, I want to pose a challenge to you to put this subsetting into a bit more practice. I'd like you to create two variables, doc_ids_spam and doc_ids_ham, and I want you to store the indices of the spam and the non-spam emails in these variables. So pause the video and try to figure this out.

Did you have a go? Here's the solution. I'll say doc_ids_spam is equal to a subset from my data dataframe, so it's data, square brackets, data.CATEGORY double-equals one. Then I'll use the index attribute to store the index of the subset inside this variable.
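As a minimal sketch of what this looks like in code: the dataframe below is a tiny made-up stand-in for the real data dataframe built over the previous lessons, and the category values and messages in it are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the course dataframe (values are made up);
# in the notebook, 'data' already exists with its CATEGORY column.
data = pd.DataFrame({
    'CATEGORY': [1, 1, 0, 0, 0],
    'MESSAGE': ['free cash click', 'win big', 'meet at noon',
                'see you soon', 'lunch today?'],
})

# A condition inside the square brackets keeps only rows where it is True.
print(data[data.CATEGORY == 1].shape)   # (2, 2) here; (1896, 3) in the lesson
print(data[data.CATEGORY == 1].tail())  # last few spam rows

# Challenge solution: store just the row indices for spam and for ham.
doc_ids_spam = data[data.CATEGORY == 1].index
doc_ids_ham = data[data.CATEGORY == 0].index
```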
And I'll do something very, very similar for the ham messages, for the non-spam messages, that is. I'll just change the name of the variable to ham and then change the category to zero. And that's it. Here's what these index values look like for our non-spam messages: they start at 1,896 and go all the way up to 5,795, and there's a total of 3,900 of them.

But the thing is, now that we know exactly which indices in the dataframe correspond to our spam messages and which indices correspond to our non-spam messages, I want to show you another trick. I want to show you how you can create a subset using these indices directly. So let me quickly add a markdown cell so we can find this again easily. I'll say "Subsetting a Series with an Index".

Now, if you recall, the type of our doc_ids_ham is, in fact, an Index, and the type of our nested_list is a pandas Series. When you're working with a Series or a dataframe, you can use the loc attribute, right: .loc with the square brackets. And then here you can feed in an index directly. So if, say, we wanted to access a subset of the nested list that just contained the non-spam messages, we could feed the non-spam indices in between the square brackets of the loc attribute, the location attribute.

I'm actually going to store all of this in a variable. I'm going to call this variable nested_list_ham. This nested list has the following shape: it's got 3,900 entries, and the last few entries in this series look like this.

Let's do the same thing for our spam messages. I'm quickly just going to paste this in here, change this to spam, and change this to spam here. So I've got a nested_list_spam variable, and I've got my indices for my spam emails inside the square brackets of the loc attribute for the nested list. And this will give me a nested list of all the spam messages. Fantastic.
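Continuing the sketch from above, this is roughly what that loc-based subsetting looks like; the token lists here are again made-up stand-ins for the real nested_list Series.

```python
# Stand-in for the nested_list Series built in earlier lessons:
# one list of stemmed, lowercased tokens per document.
nested_list = pd.Series([
    ['free', 'cash', 'click'],
    ['win', 'big'],
    ['meet', 'noon'],
    ['see', 'you', 'soon'],
    ['lunch', 'today'],
])

# .loc accepts an Index directly and returns the matching entries.
nested_list_ham = nested_list.loc[doc_ids_ham]
nested_list_spam = nested_list.loc[doc_ids_spam]

print(nested_list_ham.shape)  # (3,) here; (3900,) in the lesson
print(nested_list_ham.tail())
```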
Now we have two separate lists of all the ham messages and all the spam messages, all broken up into the tokenized, stemmed, and lowercase words. At this stage, I want to propose another challenge to you. This is going to be a bit of a bigger challenge. I'd like you to practice some list comprehension, and then I'd like you to find the total number of words in our clean dataset of spam email bodies. I'd also like you to find the total number of words used in the normal emails. And then I'd also like you to tell me what the ten most common words are in all the spam messages, as well as the ten most common words in all the normal email messages.

I hope you're up for the challenge. I'll give you a few seconds to pause the video and figure this out.

Are you ready? Here's the solution. Let's tackle the non-spam emails first. I'll create a flat list from our nested list as follows. So it's flat_list_ham is equal to square brackets, and this is where I'm going to use list comprehension again; this is a little bit of review from what we've done before. So I'll say item for sublist in nested_list_ham for item in sublist. And that's the list comprehension done. Here we are flattening our nested list of all the words in the non-spam messages and putting them into a single place.

Now what I'll do is find the total number of words. So I'll say, maybe, normal_words is equal to pd.Series, parentheses: so create a pandas Series from our flattened list of ham emails. I can then easily find the total number of words used here with the shape attribute, and it's just at index zero. So this is the total number of words: we've got approximately 441,000 words inside our bag of words for our non-spam messages.

But there's one problem here, right? We've got 441,000 words in our bag of words of normal emails, but there's going to be some repetition here. What if we want to find the number of unique words? If we wanted to find the number of unique words, then we'd have to use the value_counts method. So check this out. If I add .value_counts() to my code here, then I'll still be creating a series using the flattened list of ham messages, but instead of storing that in my variable directly, I first call the value_counts method. This will then tell me the total number of unique words in the non-spam messages.
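Here's that solution as a sketch, continuing from the blocks above:

```python
# Flatten the nested list of ham tokens into one flat list of words.
flat_list_ham = [item for sublist in nested_list_ham for item in sublist]

# A Series from the flat list; shape[0] is the total word count.
print(pd.Series(flat_list_ham).shape[0])  # ~441,000 in the lesson's dataset

# value_counts() collapses repeats into one row per unique word,
# with its frequency, sorted from most to least common.
normal_words = pd.Series(flat_list_ham).value_counts()
print(normal_words.shape[0])  # ~21,000 unique words in the lesson's dataset
```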
In this case, we have approximately 21,000 words. So while we've got a total of approximately 441,000 words, we've only got about 21,000 unique words in our dataset of non-spam messages.

To find the most common words, all we have to do now is take our normal_words and look at the first ten values: so square brackets, colon, ten. Here we go. These are the ten most common words in the non-spam messages. This is the number of times that the word "http" occurs, and this is the number of times the word "use" occurred in our non-spam messages.

So I hope you saw how we used a couple of the techniques that we've covered already to answer this question of what the top ten most common words were, and how we were able to do this with very, very few lines of code, thanks to the power of pandas. Oftentimes with these things, it's just a matter of remembering what the right method is, or Googling for it.

But this is just the ham messages, that is, just the non-spam messages. What about the spam messages? Let me copy the code here, scroll down, and then we'll do the spam messages in just the same way. So we'll change our variable name here, we'll change the name of the nested list we're using, and we'll create the series from the flat list of spam messages. We'll call this spammy_words instead of normal_words. And then we'll print out the shape, the total number of unique spammy words, as well. So this is the total number of unique words in the spam messages. Let's see what we get.

In the spam messages, we have a total of around 13,000 unique words, and the most common ones, spammy_words, square brackets, colon, ten, are the top ten spammy words. At number one, we've got "http"; at number two, we've got the word "email", then the word "free", then the word "click".

Now, one thing that you might be wondering about is, of course: wait a minute, these words were all stemmed, right? These things here aren't real words. So we can't be entirely sure whether this word here is the stemmed word form for "please" or for "pleasure" just by looking at it. We might not know what our Porter Stemmer has done behind the scenes. So that's something to think about for the future lessons.
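Before moving on, here's the whole spam-side comparison as a sketch, continuing from the blocks above:

```python
# value_counts() output is sorted by frequency, so slicing with [:10]
# gives the ten most common ham words and their counts.
print(normal_words[:10])

# Same pipeline for the spam messages.
flat_list_spam = [item for sublist in nested_list_spam for item in sublist]
spammy_words = pd.Series(flat_list_spam).value_counts()

print(spammy_words.shape[0])  # ~13,000 unique words in the lesson's dataset
print(spammy_words[:10])      # top ten spammy words
```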
In the next lesson, we're going to be doing some more data exploration in the form of some advanced visualization techniques. I want to show you how to create something called a word cloud, because it's all very nice and good printing out the spammy words or the normal words as a list where you see the frequency and the word, but what's usually a lot nicer is to present this graphically, to present this visually, because oftentimes visuals are just so much more convincing in any meeting or any presentation.

I can't wait to see you again in the next lesson. Take care.