In this lesson we're going to create our own word cloud. I'll also show you how to get hold of entire novels through the NLTK resources.

Let's add a quick little section heading here with a markdown cell that reads "Creating a Word Cloud". Now, whenever you're using an unknown package, or a new package that you haven't seen before, it's always a good idea to pull up the documentation of that package. In our case, the package we're going to be using is called wordcloud, and it was originally made by a chap called Andreas Mueller. But since then around 33 people have contributed, improved and added their work to make this package better. As you can see, the entire code base is up on GitHub for anybody to look at, and there are some basic instructions here as well, but the more detailed API documentation can be found here. On this website you can see a bit more information on how to use the word cloud as well as some more examples. I've added the links to both of these sites to the lesson resources.

So without further ado, let's get started with a basic example. Having downloaded and installed the package in the previous lesson, it's time to add it to our notebook imports. So here we can say "from wordcloud import WordCloud". Let's hit Shift+Enter and scroll all the way down.
Now we're going to be using that example email again to generate our first word cloud. The way to do it is to use "WordCloud().generate()", and then we just have to supply the piece of text that we'd like to generate the word cloud from. In our case, that's the email body. I'm actually going to save this in a variable called "word_cloud", so I'll say "word_cloud = WordCloud().generate(email_body)". This is step one.

Now matplotlib comes into play. Matplotlib was stored inside a variable called "plt", and here we're going to use the "imshow" method. We can supply, well, you guessed it, our "word_cloud" to "imshow", and now let's use "plt.show()" and see what we get. We get something like this. Now, I don't know about you, but I can see we can make some improvements right away. The quality isn't that great, the letters look a little bit jagged, and we've got these axes on both sides, on the x and on the y. The axes are actually really easy to remove: if we use "plt.axis()" with the string 'off' in single quotes and hit Shift+Enter, the axes disappear. But we still have these jagged edges on some of the letters; it doesn't look that clean.
So what we can do is come up to the "imshow" method, put a comma after "word_cloud" and add an additional argument. This one is called "interpolation", and we can set it equal to some value. At first we had the default value for interpolation, which was "none", so there was no interpolation going on, but we can do some interpolation to smooth out those edges. I'm going to go with "bilinear", and now we have a much cleaner picture. Of course, the only reason I know to set "interpolation" to "bilinear" is because I've had a look at the documentation, scrolled down to the interpolation parameter and tried out a couple of the options. There is quite a large number to choose from. The default is "none", but I found I could get some improvement in the look of my word cloud by playing around with this, and "bilinear" works well for me. As a minimalist basic example, I think this works really, really well, and you'll actually find something very similar on the documentation page.

So let's take this to the next level and make it a little bit more interesting. To make our next word clouds more interesting, I want to show you how to download an entire novel from the Natural Language Toolkit.
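The "interpolation" argument, by the way, is a plain matplotlib feature rather than anything specific to wordcloud, so you can see its effect on any small image. A quick side-by-side sketch, using a random array in place of the word cloud:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# A tiny image with obvious "pixels" to make the smoothing visible
image = np.random.rand(10, 10)

fig, axes = plt.subplots(1, 2)
axes[0].imshow(image, interpolation='none')      # blocky, jagged edges
axes[1].imshow(image, interpolation='bilinear')  # smoothed
for ax in axes:
    ax.axis('off')
plt.show()
```

Swapping `'bilinear'` for other values from the matplotlib docs (`'bicubic'`, `'lanczos'`, and so on) is an easy way to try the options the lesson mentions.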
NLTK actually has a whole bunch of resources that we can use. If we go back up to where we were downloading our tokenizer and our stop words, you'll recall both of these were downloaded to the "nltk_data" folder on our hard drive. All we need to do is add two additional lines of code here, namely "nltk.download('gutenberg')" and "nltk.download('shakespeare')". If I hit Shift+Enter on this cell, it will download some additional corpora that we can make use of in our word cloud. Check it out: it got the zip files, unzipped them, and here they are in my "corpora" folder. I've got "shakespeare" right here with a number of his plays, including Hamlet, Julius Caesar, Macbeth and so on. I think "r_and_j" stands for Romeo and Juliet, another classic. Gutenberg also has a whole bunch of books, so you can read these for free if you like. This one here looks like Alice in Wonderland by Lewis Carroll. There we go. Opening them up in my Atom text editor, I can see the whole text of the entire novel, nicely formatted. The one we're going to use is Moby Dick by Herman Melville. If you actually wanted to read this book, you'd have to scroll down quite a bit past all the acknowledgements and praises and so on to eventually reach Chapter 1, which is on line 500, with the famous opening sentences. But okay, so far so good.
Now we've got access to the text of a whole bunch of novels; it's up to us to put them into a word cloud. So I'm going to insert a few more cells at the bottom of our notebook and get hold of one of these works. The way I'm going to do that is like this: "nltk.corpus.gutenberg.words()". In this case I'm getting hold of one of the text files in the Gutenberg folder. The one I'm interested in is Moby Dick, so I'll have single quotes in the parentheses and type out the exact spelling of this text file: 'melville-moby_dick.txt'. I'm including the file extension; the string that I'm passing into the "words" method has to match the file name exactly. Now I'm going to create a variable called "example_corpus", and that's where I'm going to store my Moby Dick novel. If you're wondering how many words are in this entire novel, by the way, you can pull that up with the built-in length function: "len(example_corpus)". Let's see how much work was put into Moby Dick. It's not War and Peace, but 260,000 words is still no slouch. Another thing that might be quite interesting is to see what type of variable, or type of object, we're dealing with.
In this case our "example_corpus" is something called a StreamBackedCorpusView. The reason I'm drawing attention to this is that you might be forgiven for thinking that what you're getting back here is just one big long string, but you're actually dealing with a different kind of object. What we're working with is really a whole bunch of tokens. If you want a list of words, you have to join the tokens together. Let me show you what I mean. The first thing I'll do is create a list called "word_list", set equal to square brackets, and inside these square brackets I'm going to use Python's list comprehension to join all the tokens together. So it'll be two single quotes, then dot, "join(word)", and, yes, you guessed it, there's a loop coming: "for word in example_corpus". This will join all of our words together. If you're wondering what this looked like before: "example_corpus", Shift+Enter, looks like so, and the "word_list" that I get after running my list comprehension and joining all the words together looks like so.
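The comprehension itself is plain Python, so we can see exactly what it does with a short stand-in for the corpus view. Here `example_tokens` is a made-up four-token sample, not the real corpus:

```python
# A toy stand-in for the StreamBackedCorpusView's stream of tokens
example_tokens = ['Call', 'me', 'Ishmael', '.']

# The lesson's comprehension: ''.join(word) re-assembles each token's
# characters, so for plain strings it materialises the view as a list
word_list = [''.join(word) for word in example_tokens]

print(word_list)  # ['Call', 'me', 'Ishmael', '.']
```

For string tokens, `''.join(word)` just returns the word unchanged, so the comprehension's real job is walking the lazy corpus view once and collecting everything into an ordinary Python list.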
Very similar looking, I know, but remember we're going from tokens to a list of words, which we can then join together into a single string. So if I say "novel_as_string = ' '.join(word_list)", this will take all the words out of the list and put them into a single string, so "novel_as_string" will look like so. If you're wondering why we're going to this trouble, it's because we have to do some preprocessing to feed our text into our word cloud. Remember, this "generate" function is expecting a very simple string to build the word cloud from; we can't give it a corpus straight out of the NLTK toolkit, and I also don't want to give it a list of tokens or a list of individual words. I'm planning to give my word cloud a simple string. So why don't we try it out? I'm going to copy this, come down here, paste it in, and instead of having the email body inside my "generate" method, I'm going to have "novel_as_string" and see what we get. Fantastic, that's working! We've successfully extracted all the words from a corpus in the NLTK resources and fed them into our word cloud package. Now all we have to do is make this thing look a bit better and style it. I'm planning to arrange all these words into the shape of a whale. Let's see if we can do this.