In this lesson we're going to create our own word cloud. I'll also show you how to get hold of entire novels through the NLTK resources.

Let's add a quick little section heading here with a markdown cell that reads "Creating a Word Cloud". Now, whenever you're using an unknown package, or a new package that you haven't seen before, it's always a good idea to pull up the documentation of that package. In our case, the package we're going to be using is called wordcloud, and it was originally made by a chap called Andreas Mueller. But since then around 33 people have contributed, improved and added their work to make this package better. As you can see, the entire code base is up on GitHub for anybody to look at, and there are some basic instructions here as well, but the more detailed API documentation can be found here. On this website you can see a bit more information on how to use the word cloud as well as some more examples. I've added the links to both of these sites to the lesson resources.

So without further ado, let's get started with a basic example. Having downloaded and installed the package in the previous lesson, it's time to add it to our notebook imports. So here we can say "from wordcloud import WordCloud". Let's hit Shift+Enter and scroll all the way down.
Now we're going to be using that example email again to generate our first word cloud. The way to do it is to use "WordCloud().generate()", and then we just have to supply the piece of text that we'd like to generate the word cloud from. In our case, that's the email body. I'm actually going to save this in a variable called "word_cloud", so I'll say "word_cloud = WordCloud().generate(email_body)". This is step one.

Now matplotlib comes into play. Matplotlib was stored inside a variable called "plt", and here we're going to use the "imshow" method. We can supply, well, you guessed it, our "word_cloud" to "imshow", and now let's use "plt.show()" and see what we get. We get something like this. Now, I don't know about you, but I can see we can make some improvements right away. The quality isn't that great, the letters look a little bit jagged, and we've got these axes on both sides, on the x and on the y. The axes are actually really easy to remove: if we use "plt.axis()" with the string 'off' in single quotes and hit Shift+Enter, the axes disappear. But we still have these jagged edges on some of the letters; it doesn't look that clean.
So what we can do is come up to the "imshow" method, put a comma after "word_cloud" and add an additional argument. This one is called "interpolation", and we can set it equal to some value. At first we had the default value for interpolation, which was "none", so there was no interpolation going on, but we can do some interpolation to smooth out those edges. I'm going to go with "bilinear", and now we have a much cleaner picture. Of course, the only reason I know to set "interpolation" to "bilinear" is because I've had a look at the documentation, scrolled down to the interpolation parameter and tried out a couple of the options. There is quite a large number to choose from. The default is "none", but I found I could get some improvement in the look of my word cloud by playing around with this, and "bilinear" works well for me. As a minimalist basic example, I think this works really, really well, and you'll actually find something very similar on the documentation page.

So let's take this to the next level and make it a little bit more interesting. To make our next word clouds more interesting, I want to show you how to download an entire novel from the Natural Language Toolkit.
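The "interpolation" argument, by the way, is a plain matplotlib feature rather than anything specific to wordcloud, so you can see its effect on any small image. A quick side-by-side sketch, using a random array in place of the word cloud:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# A tiny image with obvious "pixels" to make the smoothing visible
image = np.random.rand(10, 10)

fig, axes = plt.subplots(1, 2)
axes[0].imshow(image, interpolation='none')      # blocky, jagged edges
axes[1].imshow(image, interpolation='bilinear')  # smoothed
for ax in axes:
    ax.axis('off')
plt.show()
```

Swapping `'bilinear'` for other values from the matplotlib docs (`'bicubic'`, `'lanczos'`, and so on) is an easy way to try the options the lesson mentions.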
NLTK actually has a whole bunch of resources that we can use. If we go back up to where we were downloading our tokenizer and our stop words, you'll recall both of these were downloaded to the "nltk_data" folder on our hard drive. All we need to do is add two additional lines of code here, namely "nltk.download('gutenberg')" and "nltk.download('shakespeare')". If I hit Shift+Enter on this cell, it will download some additional corpora that we can make use of in our word cloud. Check it out: it got the zip files, unzipped them, and here they are in my "corpora" folder. I've got "shakespeare" right here with a number of his plays, including Hamlet, Julius Caesar, Macbeth and so on. I think "r_and_j" stands for Romeo and Juliet, another classic. Gutenberg also has a whole bunch of books, so you can read these for free if you like. This one here looks like Alice in Wonderland by Lewis Carroll. There we go. Opening them up in my Atom text editor, I can see the whole text of the entire novel, nicely formatted. The one we're going to use is Moby Dick by Herman Melville. If you actually wanted to read this book, you'd have to scroll down quite a bit past all the acknowledgements and praises and so on to eventually reach Chapter 1, which is on line 500, with the famous opening sentences. But okay, so far so good.
Now we've got access to the text of a whole bunch of novels; it's up to us to put them into a word cloud. So I'm going to insert a few more cells at the bottom of our notebook and get hold of one of these works. The way I'm going to do that is like this: "nltk.corpus.gutenberg.words()". In this case I'm getting hold of one of the text files in the Gutenberg folder. The one I'm interested in is Moby Dick, so I'll have single quotes in the parentheses and type out the exact spelling of this text file: 'melville-moby_dick.txt'. I'm including the file extension; the string that I'm passing into the "words" method has to match the file name exactly. Now I'm going to create a variable called "example_corpus", and that's where I'm going to store my Moby Dick novel. If you're wondering how many words are in this entire novel, by the way, you can pull that up with the built-in length function: "len(example_corpus)". Let's see how much work was put into Moby Dick. It's not War and Peace, but 260,000 words is still no slouch. Another thing that might be quite interesting is to see what type of variable, or type of object, we're dealing with.
In this case our "example_corpus" is something called a StreamBackedCorpusView. The reason I'm drawing attention to this is that you might be forgiven for thinking that what you're getting back here is just one big long string, but you're actually dealing with a different kind of object. What we're working with is really a whole bunch of tokens. If you want a list of words, you have to join the tokens together. Let me show you what I mean. The first thing I'll do is create a list called "word_list", set equal to square brackets, and inside these square brackets I'm going to use Python's list comprehension to join all the tokens together. So it'll be two single quotes, then dot, "join(word)", and, yes, you guessed it, there's a loop coming: "for word in example_corpus". This will join all of our words together. If you're wondering what this looked like before: "example_corpus", Shift+Enter, looks like so, and the "word_list" that I get after running my list comprehension and joining all the words together looks like so.
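The comprehension itself is plain Python, so we can see exactly what it does with a short stand-in for the corpus view. Here `example_tokens` is a made-up four-token sample, not the real corpus:

```python
# A toy stand-in for the StreamBackedCorpusView's stream of tokens
example_tokens = ['Call', 'me', 'Ishmael', '.']

# The lesson's comprehension: ''.join(word) re-assembles each token's
# characters, so for plain strings it materialises the view as a list
word_list = [''.join(word) for word in example_tokens]

print(word_list)  # ['Call', 'me', 'Ishmael', '.']
```

For string tokens, `''.join(word)` just returns the word unchanged, so the comprehension's real job is walking the lazy corpus view once and collecting everything into an ordinary Python list.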
Very similar looking, I know, but remember we're going from tokens to a list of words, which we can then join together into a single string. So if I say "novel_as_string = ' '.join(word_list)", this will take all the words out of the list and put them into a single string, so "novel_as_string" will look like so. If you're wondering why we're going to this trouble, it's because we have to do some preprocessing to feed our text into our word cloud. Remember, this "generate" function is expecting a very simple string to build the word cloud from; we can't give it a corpus straight out of the NLTK toolkit, and I also don't want to give it a list of tokens or a list of individual words. I'm planning to give my word cloud a simple string. So why don't we try it out? I'm going to copy this, come down here, paste it in, and instead of having the email body inside my "generate" method, I'm going to have "novel_as_string" and see what we get. Fantastic, that's working! We've successfully extracted all the words from a corpus in the NLTK resources and fed them into our word cloud package. Now all we have to do is make this thing look a bit better and style it. I'm planning to arrange all these words into the shape of a whale. Let's see if we can do this.