Welcome back. Having just completed some very beautiful visualizations, let's get back to pre-processing our data for our Bayes classifier. Now, there are lots of individual words among the 5800-odd emails that constitute our dataset. We won't actually use every single word that came up in these email bodies; we're just going to use the 2500 most frequent words. I'm going to add a markdown cell here to commemorate this, and it's going to read "Generate Vocabulary & Dictionary". The 2500 most frequent words in our dataset are going to form our vocabulary, and we will generate this vocabulary from our stemmed list of words. To get our stemmed list of words, we're once again going to call the "clean_msg_no_html" function that we created earlier. Now, I know that for the word cloud we commented out this line. So if you have it commented out, comment it back in, because we actually do want the stemmed words this time round. And if you made this change, make sure that you do two things: if you comment this line back in, make sure you comment this one back out, and also press Shift+Enter on this cell. Otherwise you're going to get some very unexpected results later on. All right, so let's use our apply method and call this function right here.
I'll create a variable called "stemmed_nested_list" and set that equal to "data.MESSAGE.apply()", and then I'll feed in the name of our function, "clean_msg_no_html". And because this is a nested list, I'm going to flatten it and store it under "flat_stemmed_list", setting that equal to the result of some Python list comprehension: "[item for sublist in stemmed_nested_list for item in sublist]". So far, nothing new. Let's run this cell and move on. The next step will be getting a unique set of words; this is going to make up our vocabulary. The easiest way to do this, I think, is to generate a pandas Series and then use the "value_counts" method. Once again, you can ignore this warning here; it's aimed at people who are trying to use Beautiful Soup to open a URL. Now, to create that series of unique words, I'll quickly create a variable called "unique_words" and set that equal to "pd.Series()". I'll provide the "flat_stemmed_list" that we created a minute ago, and then I'm going to call a method by the name of "value_counts". Let me show you what we've just done here.
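The apply-and-flatten pattern above can be sketched as follows. This is a minimal, self-contained version: the two-email DataFrame is toy data, and the "clean_msg_no_html" body here is a simplified stand-in, since the real function from the earlier lessons also strips HTML and stems each word.

```python
import pandas as pd

def clean_msg_no_html(message):
    # Toy stand-in: lowercase and split into words. The real function
    # from the earlier lessons also strips HTML and stems each word.
    return message.lower().split()

# Toy dataset standing in for the ~5800 email bodies
data = pd.DataFrame({'MESSAGE': ['Free money now', 'money back guarantee']})

# apply() returns a Series of word lists, one list per email body
stemmed_nested_list = data.MESSAGE.apply(clean_msg_no_html)

# Flatten the nested lists into one long word list
flat_stemmed_list = [item for sublist in stemmed_nested_list for item in sublist]
print(flat_stemmed_list)
```

The list comprehension reads left to right like two nested for loops: for each sublist in the outer series, take each item.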
I can print out the number of unique words using the shape of my variable, so I'll say "Nr of unique words" inside a print statement, then a comma, and then "unique_words.shape[0]". This will print out the number of unique words in this series. And to look at the first five rows, the first five entries in this series, I'll say "unique_words.head()". Let's see what we get. What we see here is that after cleaning and stemming, we are left with 27320 words in our dataset: some 27000 unique words across all our email bodies. Now, this is an absolutely huge number, and we're actually only going to train our classifier with a subset of it, namely the 2500 most frequent words. Now, you might be wondering why "http" is up here. Well, "http" pretty much precedes every single URL, so this goes to show how many hyperlinks people have included in their emails. Now, to get the 2500 most frequent words, I want to throw this over to you as a challenge, and the reason is that this is another good opportunity to practice subsetting and working with these series. Can you create a subset of this unique words series and store it in a variable called "frequent_words" which will only contain the most frequent 2500 words out of the total? And then afterwards, print out the top 10 words.
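Here is a small sketch of the "value_counts" step, using a six-word toy list in place of the real flat_stemmed_list so the counts are easy to verify by eye:

```python
import pandas as pd

# Toy word list standing in for the real flat_stemmed_list
flat_stemmed_list = ['money', 'free', 'money', 'http', 'money', 'http']

# value_counts() collapses duplicates into a frequency table,
# sorted from most to least frequent
unique_words = pd.Series(flat_stemmed_list).value_counts()

print('Nr of unique words', unique_words.shape[0])
print(unique_words.head())
```

With the toy data, "money" appears three times and tops the series, just as "http" tops the real one.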
These, of course, are going to overlap with the top 5 words that you see above. I'll give you a few seconds to pause the video and give this a go. Here's the solution: "frequent_words = unique_words", and then we're going to use that square bracket notation, "[0:2500]". That's how we create a subset. And to print out the top 10, we'll say "print('Most common words')", and then I'm even going to use an escape character and a new line, a comma, and "frequent_words". Once again I'm going to create a subset, this time from the beginning, so I'm even going to leave out the zero, going up to 10. These are the top 10 words, and here they are. With the first notation, when you're creating a subset, you're setting a starting point and an ending point. With the second notation, you're going from the beginning to an end point. So I hope this was a useful review. One thing that we can do to improve our code, though, is removing some of these magic numbers that we see here. So instead of having 2500 float around in my code, I'd like to define a constant at the very top called "VOCAB_SIZE" and set that equal to the size of the vocabulary that I'm going to use in my code later on.
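The two slicing notations from the solution can be sketched like this, on a four-word toy series with VOCAB_SIZE shrunk to 3 so the effect is visible (the lesson itself uses 2500):

```python
import pandas as pd

# Toy stand-in for the ~27000-word unique_words series
unique_words = pd.Series({'http': 50, 'money': 30, 'free': 20, 'click': 10})

VOCAB_SIZE = 3  # the lesson uses 2500

# Start point and end point: positions 0 up to (but not including) VOCAB_SIZE
frequent_words = unique_words[0:VOCAB_SIZE]

# Leaving out the start defaults to the beginning of the series
print('Most common words\n', frequent_words[:2])
```

Integer slices on a series with a string index are positional, which is why this picks the first VOCAB_SIZE rows, i.e. the most frequent words, since value_counts already sorted them.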
That way, if I ever want to make a change, all I have to do is change this number here, and it will filter through, as long as I replace this number here with my constant "VOCAB_SIZE". Now, you've got to remember: if you've changed a cell up top, you've got to press Shift+Enter, otherwise you're going to get an error. Now, with frequent words we're working with a series, right? So with "type(frequent_words)" we can see that we have a pandas Series. And not only that: if we look at frequent words, we see that this bit here, the actual words, forms our index, and the numbers of occurrences are actually the values in this series. Let's practice how we would go between a series and a dataframe and how to work with these indices. We're also going to take this opportunity to assign a word ID to each word, similar to how we assigned a doc ID in an earlier lesson. I'll add a markdown cell here real quick that's going to read "Create Vocabulary DataFrame with a WORD_ID". Now, our word IDs are just going to be integers ranging from zero to 2499, meaning we're going to work again with this range object, and we can create a range very easily with "range(0,)" and then going up to our "VOCAB_SIZE", right? This is how we create our range. Now, what we can do is store all these numbers in a list.
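The point about the index holding the words and the values holding the counts can be checked directly. A tiny sketch, with a two-word toy series:

```python
import pandas as pd

# Toy stand-in for the top-2500 frequent_words series
frequent_words = pd.Series({'http': 50, 'money': 30})

print(type(frequent_words))          # a pandas Series
print(frequent_words.index.values)   # the words live in the index
print(frequent_words.values)         # the occurrence counts are the values
```

So indexing by position or slicing gives you counts, while the index is where the word strings themselves sit.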
So I'll wrap this call to range in a list, and I'm actually going to store this in a variable called "word_ids". So "word_ids = list(range(0, VOCAB_SIZE))", and then closing the parentheses. Now let's create our dataframe with "pd.DataFrame()", and then what I'm going to do is provide a dictionary. So I'll have those two curly braces; our dictionary is going to consist of a key and a value. The key will be whatever I want that column heading to read, and "VOCAB_WORD" sounds good to me. Now, the values: I want those to be the actual words. Scrolling up a little bit, the words are here in our series. So this means that if I use "frequent_words" like so, I'm actually accessing the frequencies, the numbers. What I need to do to get the words is work with our index, right? So "index.values" will be the way I can store all these different strings in a column for our dataframe. So far so good. Let's see what we've got. Let me hit Shift+Enter on this cell. At the moment, our dataframe looks like this. Fair enough. Let's add our word IDs explicitly to this dataframe. We can do that by setting the dataframe's index, right?
So "index = word_ids", and then we can also give that index a name. But first, let me give our dataframe a name as well, so I'll say "vocab = pd.DataFrame", and on the line below I'll say "vocab.index.name = 'WORD_ID'". Let's look at the first five rows in our dataframe with "vocab.head()" and then Shift+Enter. There we go. Fantastic. We've generated the vocabulary that we're going to train our classifier with. Now, previously we've had a pandas dataframe and used the "to_json" functionality to save a file in the JSON format to our disk. "to_json" is all well and good, but of course pandas can save many different file types. A common one that you're going to be working with a lot is a CSV file: comma-separated values. This is a file format that can easily be opened and nicely formatted with Microsoft Excel or Google Sheets. Now, you've probably already surmised that you're going to need a file path for this "to_csv" function that we're going to call. So let's go back up to our constants and create a constant that will hold on to our file path and file name for the CSV file that we're going to create. I'm going to copy this constant right here, paste it below, and then make a few changes to it, of course.
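Putting the last few steps together, the vocabulary DataFrame can be sketched like this, again with a three-word toy series and VOCAB_SIZE shrunk to match:

```python
import pandas as pd

# Toy frequent_words series; the lesson's holds the top 2500 words
frequent_words = pd.Series({'http': 50, 'money': 30, 'free': 20})
VOCAB_SIZE = 3

# Word IDs are just the integers 0 .. VOCAB_SIZE - 1
word_ids = list(range(0, VOCAB_SIZE))

# The dictionary key becomes the column heading, and index.values
# pulls the word strings out of the series' index
vocab = pd.DataFrame({'VOCAB_WORD': frequent_words.index.values},
                     index=word_ids)
vocab.index.name = 'WORD_ID'

print(vocab.head())
```

Note that the frequencies are deliberately dropped here: the vocabulary only pairs each word with its WORD_ID.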
So I'll call this one, in all caps, "WORD_ID_FILE". We're still going to save it in our Processing folder, but we're going to call it "word-by-id" and then add the ".csv" extension to it. So that's our file path, and I'll hit Shift+Enter to make sure this is saved. Then down here I'm going to add a quick section heading; it's going to read "Save the Vocabulary as a CSV File". As you can guess, we'll access the "to_csv" method from our vocab object, so "vocab.to_csv()", and then we're going to pass in our "WORD_ID_FILE" path and name. Now, if I hit Shift+Tab on this, I can see some of the other parameters that I can specify, and that includes a header and an index. What does this mean? Coming down here, I can see that our header can be a list of strings, and our index label can also be a string. By default, no index label is provided. So let's provide these two additional inputs to our "to_csv" method call. I'll add a comma here, and the first thing I'll do is provide the index label. Now, I could provide it as a string with single quotes and say 'WORD_ID', but instead of typing this out and risking a typo, if I want to make sure it matches what I've got in my dataframe, I can access it directly with "vocab.index.name". And for our header, it's actually the very same thing.
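The "to_csv" call with both extra parameters can be sketched as below. The temp-directory path is a hypothetical stand-in, since the lesson saves into its own Processing folder; the header is passed as a one-element list, matching the "list of strings" form that the docstring describes:

```python
import os
import tempfile
import pandas as pd

# Toy two-word vocabulary DataFrame
vocab = pd.DataFrame({'VOCAB_WORD': ['http', 'money']})
vocab.index.name = 'WORD_ID'

# Hypothetical path; the lesson's WORD_ID_FILE points into its Processing folder
WORD_ID_FILE = os.path.join(tempfile.gettempdir(), 'word-by-id.csv')

# Pulling index_label and header from the DataFrame itself
# avoids re-typing the strings and risking a typo
vocab.to_csv(WORD_ID_FILE,
             index_label=vocab.index.name,
             header=[vocab.VOCAB_WORD.name])

print(open(WORD_ID_FILE).read())
```

The first line of the resulting file is "WORD_ID,VOCAB_WORD", followed by one "id,word" row per vocabulary entry.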
I could either write 'VOCAB_WORD', which is our header right here, or alternatively I can grab our column, so "vocab.VOCAB_WORD.name". This will accomplish the very, very same thing. Now, I'm not going to hit Shift+Enter on this right away. What I want to do instead is bring up my folder here on the right-hand side and then hit Shift+Enter, so you can see the file appearing. Here you go. There it is. Now, you can open this in a text editor, say Atom, and the CSV file will be formatted something like this. It's not particularly impressive. But if you have a spreadsheet program like Microsoft Excel or Google Sheets, or in my case this Numbers program that comes with the Mac, then you'll see the values formatted nicely in these columns. So as you can see, the CSV format is very, very handy. Now, we're covering quite a lot of stuff in these lessons, and with programming the best way of learning is by doing. So the next two lessons will consist of some very quick exercises to review some of the concepts that we've talked about. I'm off to grab some more coffee, and then I'll see you there.