0 1 00:00:00,300 --> 00:00:05,040 To create our word cloud for our ham messages and our spam messages, 1 2 00:00:05,250 --> 00:00:07,650 we're gonna use two different masks. 2 3 00:00:07,680 --> 00:00:15,980 I've got thumbs up for the non-spam messages and I've got thumbs down for the spam messages. 3 4 00:00:16,080 --> 00:00:20,730 These are gonna be our masks for our respective word clouds. 4 5 00:00:20,730 --> 00:00:25,770 The dataset backing our word clouds will come from some of the previous work we've done. So we've got 5 6 00:00:25,800 --> 00:00:34,250 a flattened list of non-spam words and we've got a flattened list of spam words. 6 7 00:00:34,280 --> 00:00:37,970 This is the data that we're going to feed into our word clouds. 7 8 00:00:37,970 --> 00:00:40,910 So let's tackle our ham messages first. 8 9 00:00:40,910 --> 00:00:52,070 I'm going to add a markdown cell here and write "Word Cloud of Ham and Spam Messages". 9 10 00:00:52,070 --> 00:00:54,610 Next I'm going to add some constants. 10 11 00:00:54,710 --> 00:01:01,790 I'm going to add the thumbs up and thumbs down files, so I'll paste the relative path for the skull file 11 12 00:01:02,540 --> 00:01:12,650 and just rename these to "THUMBS_UP_FILE" and "THUMBS_DOWN_FILE". 12 13 00:01:12,650 --> 00:01:15,170 Of course the icon names have to change as well. 13 14 00:01:15,410 --> 00:01:28,240 So this one we'll call "thumbs-up.png", and this one we'll call "thumbs-down.png". 14 15 00:01:28,240 --> 00:01:31,700 Shift+Enter, 15 16 00:01:31,990 --> 00:01:34,190 scroll down and 16 17 00:01:34,360 --> 00:01:43,540 now what I'm gonna do is I'm going to copy this code here for our whale, come down here and paste it in. 17 18 00:01:44,920 --> 00:01:49,380 There are a few modifications that we can make here that will save us a bit of time.
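As a rough sketch of the two constants described above: the file names "thumbs-up.png" and "thumbs-down.png" come from the lesson, but the folder is never spoken aloud, so `ICON_DIR` below is a placeholder you would swap for your own relative path. The `missing_files` helper is hypothetical and only there to surface a typo in a file name before the word cloud cell runs.

```python
from pathlib import Path

# Placeholder folder (an assumption): the lesson only names the files,
# not the directory they live in.
ICON_DIR = Path('path/to/icons')

THUMBS_UP_FILE = ICON_DIR / 'thumbs-up.png'      # mask for the ham word cloud
THUMBS_DOWN_FILE = ICON_DIR / 'thumbs-down.png'  # mask for the spam word cloud


def missing_files(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).is_file()]
```

Calling `missing_files([THUMBS_UP_FILE, THUMBS_DOWN_FILE])` right after defining the constants should give back an empty list once the paths are correct.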
18 19 00:01:49,390 --> 00:01:56,250 So first off we're gonna change the image to say thumbs up, 19 20 00:01:56,530 --> 00:02:03,810 then what we'll have to do is we'll have to generate the text as a string for the word cloud, 20 21 00:02:03,820 --> 00:02:04,110 right. 21 22 00:02:06,050 --> 00:02:13,700 In this case, I'm going to store all of this text in a variable called "ham_str" and that will 22 23 00:02:13,700 --> 00:02:24,030 be equal to "' '.join(flat_list_ham)". 23 24 00:02:24,770 --> 00:02:32,930 So we've got a list of non-spam words and we're going to join them all together into a single string. Before 24 25 00:02:32,930 --> 00:02:33,990 I run this, 25 26 00:02:34,040 --> 00:02:43,910 I have to do one last thing. I have to change this "novel_as_string" to our "ham_str" variable. 26 27 00:02:43,910 --> 00:02:47,120 That way I'm not generating another word cloud for Moby Dick, 27 28 00:02:47,160 --> 00:02:50,570 instead I'm generating it for non-spam emails. 28 29 00:02:52,810 --> 00:02:54,560 And here's our output. 29 30 00:02:54,670 --> 00:03:01,030 We still have the Moby Dick styling, but we can change that very easily with a different colour map and 30 31 00:03:01,030 --> 00:03:05,560 I think otherwise it looks pretty good. In terms of colour map, 31 32 00:03:05,590 --> 00:03:11,830 let's go for 'winter' and, I don't know, maybe a bit more granularity on the words. 32 33 00:03:14,350 --> 00:03:15,100 Now, 33 34 00:03:15,160 --> 00:03:22,660 I think this looks pretty good actually, but there's probably one thing that you'll notice that might 34 35 00:03:22,840 --> 00:03:24,520 bother you a little bit. 35 36 00:03:24,580 --> 00:03:28,330 "People", for example, is spelled without the "e" at the end. 36 37 00:03:28,360 --> 00:03:34,830 "Change" is spelled without the "e" at the end. "Provide" is spelled without the "e". 37 38 00:03:35,170 --> 00:03:37,700 We can only guess what this word "tri" 38 39 00:03:37,720 --> 00:03:43,990 is supposed to be.
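The ham cell described above could look roughly like the sketch below. The join, the 'winter' colour map and the 500-word limit come from the lesson; the shape of the copied whale cell is not shown in this transcript, so the white-canvas-plus-icon approach, the `background_color='white'` setting and both function names are assumptions. It relies on the third-party packages `numpy`, `Pillow` and `wordcloud`, which are imported inside the function so the text helper works without them.

```python
def words_to_text(words):
    """Join a flat list of words into the single string WordCloud.generate() expects."""
    return ' '.join(words)


def build_masked_cloud(text, icon_path, colormap='winter', max_words=500):
    """Draw a word cloud inside the shape of a PNG icon (assumed cell layout)."""
    # Third-party imports are kept local to this function.
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    icon = Image.open(icon_path).convert('RGBA')
    # Start from a white canvas, then paste the icon using its alpha channel.
    image_mask = Image.new(mode='RGB', size=icon.size, color=(255, 255, 255))
    image_mask.paste(icon, mask=icon)
    rgb_array = np.array(image_mask)  # pure-white pixels count as masked out

    word_cloud = WordCloud(mask=rgb_array, background_color='white',
                           max_words=max_words, colormap=colormap)
    return word_cloud.generate(text)
```

In the notebook this would be used as `ham_str = words_to_text(flat_list_ham)` followed by `build_masked_cloud(ham_str, THUMBS_UP_FILE)`, with the result shown through `plt.imshow(...)` and `plt.axis('off')`.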
The reason for this is that the data we fed into our word cloud are the 39 40 00:03:43,990 --> 00:03:46,210 stemmed words. 40 41 00:03:46,210 --> 00:03:50,230 That's why we're seeing our word cloud presented like this. 41 42 00:03:50,710 --> 00:03:58,690 Our flattened list of non-spam keywords is coming from our nested list of non-spam keywords, which is 42 43 00:03:58,690 --> 00:04:06,760 coming from our nested list, which we've created up here when we applied our "clean_msg_no_html" 43 44 00:04:06,910 --> 00:04:09,400 method. 44 45 00:04:09,400 --> 00:04:12,610 This is where we did the stemming, if you recall. 45 46 00:04:12,610 --> 00:04:17,920 If you wanted to see what this word cloud would look like without the stemmed words, then you'd have 46 47 00:04:17,920 --> 00:04:27,580 to copy this code, paste it in, use the original words and comment out 47 48 00:04:27,690 --> 00:04:38,250 this line right here. If you refresh this cell and then go to "Kernel" > "Restart & Run All", then you'll 48 49 00:04:38,250 --> 00:04:46,110 run all the cells in the entire notebook and after waiting a little while, you can scroll down and look 49 50 00:04:46,260 --> 00:04:53,210 at the output for the word cloud without the stemmed words. 50 51 00:04:53,220 --> 00:04:54,300 There we go. 51 52 00:04:54,300 --> 00:04:56,680 This is starting to look a lot better. 52 53 00:04:56,700 --> 00:05:03,150 The stemmed words are going to be very, very useful for our Bayes' classifier but they're actually not so 53 54 00:05:03,150 --> 00:05:05,480 useful for our word cloud. 54 55 00:05:05,580 --> 00:05:11,040 So if you've changed it like I did then it's important to make a mental note to change this back and 55 56 00:05:11,040 --> 00:05:17,680 use the stemmed words before we train our model. Now I've got one final challenge for you on the word 56 57 00:05:17,680 --> 00:05:19,510 cloud front.
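One way to avoid commenting a line in and out is to make stemming a switch, as in the sketch below. The body of "clean_msg_no_html" is not shown in this transcript, so this is not that function: `prepare_words` and its `stem` parameter are hypothetical names for illustration, and using NLTK's `PorterStemmer` in the usage note is an assumption about which stemmer the notebook relies on.

```python
def prepare_words(words, stem=None):
    """Return the words unchanged, or pass each one through `stem` if given.

    `stem` is any function mapping a word to its stem, for example the
    `stem` method of an NLTK stemmer object.
    """
    if stem is None:
        return list(words)           # readable words, better for a word cloud
    return [stem(w) for w in words]  # stemmed words, better for the classifier
```

With NLTK installed, `prepare_words(words, stem=PorterStemmer().stem)` would produce the stemmed list for the model, while `prepare_words(words)` keeps the original spelling for the word cloud.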
57 58 00:05:19,510 --> 00:05:26,500 I'd like you to pull up the word cloud documentation and figure out how to use the custom font that 58 59 00:05:26,530 --> 00:05:33,910 I've included in the lesson resources instead of this default font here and then create a word cloud 59 60 00:05:34,330 --> 00:05:38,430 of all the words in the spam messages. 60 61 00:05:38,480 --> 00:05:40,440 I've included two font files. 61 62 00:05:40,630 --> 00:05:49,520 The "OpenSansCondensed-Bold" font file and I've included the "OpenSansCondensed-Light" font file. 62 63 00:05:49,840 --> 00:05:54,820 If you look at this word cloud here, what kind of font do you think would work best with it? 63 64 00:05:55,570 --> 00:06:02,650 What characteristics should a font have in order to make a word cloud look a lot more attractive? Because 64 65 00:06:02,800 --> 00:06:09,880 in my humble opinion this default font here doesn't work as well as it should. 65 66 00:06:09,880 --> 00:06:15,970 What kind of improvement on the font front would you make to this visualization? 66 67 00:06:15,970 --> 00:06:18,870 I can think of two improvements right off the bat. 67 68 00:06:19,030 --> 00:06:24,780 First off, the words should probably be written in all caps rather than lowercase. 68 69 00:06:25,450 --> 00:06:31,420 And second, the letters of the font should probably be closer together and bold. 69 70 00:06:31,540 --> 00:06:36,740 These are the first two things that I would try to make this look a lot more convincing. 70 71 00:06:37,210 --> 00:06:41,180 So I hope you will give this a shot when solving the challenge. 71 72 00:06:41,230 --> 00:06:46,930 Pause the video and have a go. All good? 72 73 00:06:46,930 --> 00:06:49,540 Here's the solution. 73 74 00:06:49,740 --> 00:06:57,220 We've already added the relative path to the thumbs down file so we can already use it. 74 75 00:06:57,460 --> 00:07:00,610 I'm going to copy this code right here 75 76 00:07:01,830 --> 00:07:05,070 and paste it in below. 
76 77 00:07:05,070 --> 00:07:13,730 I'll swap out thumbs up for thumbs down, this line stays the same, as does this one. Our string of course 77 78 00:07:14,090 --> 00:07:22,280 will need to change, so this will be the spam string and this will come from the flattened spam list. 78 79 00:07:23,170 --> 00:07:29,660 And when it comes to generating the cloud we will use the spam string instead of the ham string. For a 79 80 00:07:29,660 --> 00:07:31,760 bit of visual differentiation, 80 81 00:07:31,760 --> 00:07:35,780 I'm going to go with something reddish from the color map. 81 82 00:07:35,780 --> 00:07:41,220 In fact, I'm going to try this "gist_heat" color map that we've got here. 82 83 00:07:41,330 --> 00:07:48,020 So instead of "winter" I'll go for "gist_heat" and I'll keep some of the other things the 83 84 00:07:48,020 --> 00:07:49,280 same. 84 85 00:07:49,370 --> 00:07:55,220 Now when I said check the word cloud documentation, what I wanted you to pick up on was the fact that 85 86 00:07:55,220 --> 00:08:06,180 you could supply a font here as an argument and you can do so by giving the word cloud a font path. The 86 87 00:08:06,180 --> 00:08:10,680 font path is the path to the font that will be used in the word cloud. 87 88 00:08:10,860 --> 00:08:20,850 And this has to be an OTF or a TTF file, and a TTF file is exactly what we've got. Going back up, 88 89 00:08:20,850 --> 00:08:29,730 I'll add my relative path here for my font file. So I'll call this one "CUSTOM_FONT_FILE" 89 90 00:08:29,730 --> 00:08:43,610 and this will be my "OpenSansCondensed-Bold.ttf" file. 90 91 00:08:43,920 --> 00:08:49,900 You can try both of these and see which one looks better, but I think I prefer the bold one. Just gotta 91 92 00:08:49,900 --> 00:08:57,540 make sure there are no typos here in this file name. Let me hit Shift+Enter on this, scroll back down, 92 93 00:08:58,720 --> 00:09:09,730 and now I can supply a "font_path" and set that equal to the custom font file.
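The font constant and the arguments passed to the word cloud at this point might be sketched as follows. "CUSTOM_FONT_FILE", "OpenSansCondensed-Bold.ttf", "gist_heat" and the `font_path` argument come from the lesson; the folder in `FONT_DIR` is a placeholder, `background_color='white'` is assumed to carry over from the copied cell, and `check_font_path` and `spam_cloud_kwargs` are hypothetical helpers added to catch a mistyped extension.

```python
from pathlib import Path

# Placeholder folder (an assumption): the lesson only names the font file.
FONT_DIR = Path('path/to/fonts')
CUSTOM_FONT_FILE = FONT_DIR / 'OpenSansCondensed-Bold.ttf'


def check_font_path(font_path):
    """Return font_path as a string if it ends in .otf or .ttf, otherwise raise."""
    path = Path(font_path)
    if path.suffix.lower() not in {'.otf', '.ttf'}:
        raise ValueError(f'expected an .otf or .ttf file, got {path.name!r}')
    return str(path)


def spam_cloud_kwargs(mask_array, font_path=CUSTOM_FONT_FILE):
    """Keyword arguments for WordCloud(...) at this stage of the spam cell."""
    return {
        'mask': mask_array,
        'background_color': 'white',  # assumed, not stated in the lesson
        'max_words': 500,
        'colormap': 'gist_heat',      # reddish, to set spam apart from ham
        'font_path': check_font_path(font_path),
    }
```

The result would be unpacked as `WordCloud(**spam_cloud_kwargs(rgb_array))`, where `rgb_array` is the mask array built from the thumbs down icon.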
93 94 00:09:09,800 --> 00:09:17,660 The other thing that you might have had to have a play with is the maximum font size. So in getting it to 94 95 00:09:17,660 --> 00:09:19,670 look the way you want it to, 95 96 00:09:19,670 --> 00:09:26,480 there is a mention of a font size parameter in the quick documentation as well. Despite the rather strange 96 97 00:09:26,480 --> 00:09:28,190 word wrapping at the moment, 97 98 00:09:28,190 --> 00:09:35,840 you can see it right here, "max_font_size" by default is equal to None, but we can give it another value 98 99 00:09:35,840 --> 00:09:46,160 as well, so "max_font_size" equals, say I don't know, I think 300 worked well last 99 100 00:09:46,160 --> 00:09:51,130 time I looked at this. Now let me hit Shift+Enter and see what this looks like. 100 101 00:09:52,460 --> 00:09:54,230 My computer is really struggling. 101 102 00:09:54,230 --> 00:09:56,800 I think I've got too many things open. 102 103 00:09:57,010 --> 00:09:57,380 Okay. 103 104 00:09:57,410 --> 00:09:58,610 So here we go. 104 105 00:09:58,640 --> 00:10:00,080 This is the result. 105 106 00:10:00,260 --> 00:10:07,520 I think the font looks pretty good, but I think we need to up the number of words being displayed in 106 107 00:10:07,520 --> 00:10:09,020 this image. 107 108 00:10:09,080 --> 00:10:14,110 Also we're still in lower case, so let's change these two things. 108 109 00:10:14,270 --> 00:10:21,350 I'm going to be pretty radical and change the maximum number of words to 2000 from 500 and when it 109 110 00:10:21,350 --> 00:10:29,560 comes to generating my word cloud I'm going to convert my string to uppercase letters. 110 111 00:10:29,810 --> 00:10:40,350 So "spam_str.upper()" will convert all of my letters to uppercase. Now I can refresh and 111 112 00:10:40,710 --> 00:10:43,410 see if we're happy with the result. 
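Putting the last tweaks together, a minimal sketch could look like this. The values 300 and 2000 and the call to `upper()` come from the lesson; `final_spam_inputs` and its `base_kwargs` parameter are hypothetical names, standing in for whatever arguments the cell already passes to the word cloud.

```python
def final_spam_inputs(flat_list_spam, base_kwargs=None):
    """Return (text, kwargs) with the font size, word count and case tweaks applied."""
    kwargs = dict(base_kwargs or {})  # copy, so the caller's dict is not modified
    kwargs.update(
        max_font_size=300,  # the default is None
        max_words=2000,     # raised from 500 to fill the mask with more words
    )
    spam_str = ' '.join(flat_list_spam)
    return spam_str.upper(), kwargs   # upper-case text gives an all-caps cloud
```

The two return values would then feed the cloud as `WordCloud(**kwargs).generate(text)`.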
112 113 00:10:43,470 --> 00:10:49,240 I think the red color definitely worked really well but with these last two changes I think the design 113 114 00:10:49,240 --> 00:10:53,400 should really come together. 114 115 00:10:54,040 --> 00:10:55,960 Okay so that's interesting. 115 116 00:10:56,190 --> 00:11:00,480 The largest word has randomly been given a white color. 116 117 00:11:00,540 --> 00:11:06,350 This is probably not what I want, so I'm going to refresh this and see if I'm luckier 117 118 00:11:06,390 --> 00:11:11,080 the second time round. Much better. 118 119 00:11:11,110 --> 00:11:11,740 Brilliant. 119 120 00:11:12,030 --> 00:11:18,800 So I quite like the look of this but I'd be quite curious what sort of designs you've come up with. 120 121 00:11:18,900 --> 00:11:25,920 And speaking of your own designs, a good question at this point is: Where can you find both fonts and 121 122 00:11:25,920 --> 00:11:31,770 masks for your own projects, for your own word clouds? When it comes to fonts, 122 123 00:11:31,780 --> 00:11:39,460 one of my favorite places to go to is Google Fonts, so Google Fonts are free and open source and you 123 124 00:11:39,460 --> 00:11:44,680 can just browse and download the font files that you like from here. 124 125 00:11:44,680 --> 00:11:49,670 This is a brilliant resource. When it comes to masks and icons, 125 126 00:11:49,780 --> 00:11:51,890 you've got a lot of choice as well. 126 127 00:11:51,910 --> 00:11:57,550 One place that you could check out is the icon set on fontawesome.com. 127 128 00:11:57,580 --> 00:12:03,910 Here you can download various icons that you can then use as PNG masks for your word clouds. 128 129 00:12:03,940 --> 00:12:10,050 Just remember to double check the licensing on all of these image files that you're interested in using. 129 130 00:12:10,570 --> 00:12:15,430 In any case, Font Awesome allows you to filter on free icons, which is quite handy.
130 131 00:12:15,430 --> 00:12:22,660 So if we check out this accessible icon here, then you can download this icon as an SVG file, a vector 131 132 00:12:22,990 --> 00:12:28,510 which you can then scale as large as you like with some image processing software. 132 133 00:12:28,510 --> 00:12:32,800 Now with those tools in hand I think you're all set to create something fantastic. 133 134 00:12:32,830 --> 00:12:36,520 If you do, please post it in the comments section below this video, 134 135 00:12:36,520 --> 00:12:40,420 I'd love to see what you've made. In the next lessons, 135 136 00:12:40,500 --> 00:12:47,800 we're going to generate the vocabulary and the dictionary for our Bayes' classifier. It's about time I get 136 137 00:12:47,800 --> 00:12:50,450 some food, so I'll see you in the next lesson. 137 138 00:12:50,470 --> 00:12:51,510 Take care.