0
1
00:00:00,390 --> 00:00:07,650
As part of the lesson resources that you've downloaded, under "SpamData" > "01_Processing", there's a folder
1

2
00:00:07,770 --> 00:00:18,280
called "wordcloud_resources" and there you should see a "whale-icon.png" file and this is gonna
2

3
00:00:18,300 --> 00:00:27,450
be our mask that we will project the words from the word cloud onto. If you scroll back up in the notebook
3

4
00:00:27,870 --> 00:00:29,990
to where we've put all our constants,
4

5
00:00:30,180 --> 00:00:40,590
we can add another one here. Let's add the relative path to this PNG file. So I'm going to add a new line and
5

6
00:00:40,590 --> 00:00:49,380
here we'll add a constant called "WHALE_FILE" and set that equal to to what it says in the
6

7
00:00:49,380 --> 00:00:50,580
previous string
7

8
00:00:50,910 --> 00:00:58,840
at first, save us some time typing. The first two bits will stay the same, it will still be "SpamData/01_
8

9
00:00:58,840 --> 00:01:09,020
Processing", but after that forward slash, we have to write 
"wordcloud_resources/
9

10
00:01:09,030 --> 00:01:21,040
whale-icon.png". This bit here needs to be the relative path that points to our "whale
10

11
00:01:21,050 --> 00:01:25,670
-icon.png" file. As always with these things,
11

12
00:01:25,670 --> 00:01:30,320
everything's case sensitive. Now before we scroll back down,
12

13
00:01:30,380 --> 00:01:36,660
let's go up to our notebook imports and import another tool.
13

14
00:01:36,740 --> 00:01:40,880
This will be some functionality from the PIL package.
14

15
00:01:40,880 --> 00:01:54,230
So "from PIL", all caps, "import image". PIL is short for "Pillow". Pillow is the name of a package that will help
15

16
00:01:54,230 --> 00:01:57,870
us with image manipulation and processing.
16

17
00:01:58,490 --> 00:02:04,730
Pillow can actually do quite a lot of things like pixel manipulation, image blurring or smoothing, or
17

18
00:02:04,730 --> 00:02:13,160
adding text. We're going to be using it to convert our starting image of the whale to grayscale and then to an
18

19
00:02:13,160 --> 00:02:15,110
array of RGB values.
19

20
00:02:15,470 --> 00:02:22,610
And the reason for that is that wordcloud has some very specific requirements as to what kinds of images
20

21
00:02:22,880 --> 00:02:27,720
can and cannot be used as a mask. And also it's quite particular.
21

22
00:02:27,830 --> 00:02:32,920
what kind of format the mask is in and that's why we need to do a little prep.
22

23
00:02:33,070 --> 00:02:33,890
Also,
23

24
00:02:33,890 --> 00:02:38,860
I suspect that at one point you want to use your own custom masks with wordcloud,
24

25
00:02:38,870 --> 00:02:40,410
at some point in the future.
25

26
00:02:40,640 --> 00:02:47,420
So it's a good idea to have seen this workflow of the image preparation once before. In fact, let's take
26

27
00:02:47,420 --> 00:02:50,900
a look at wordcloud's requirements.
27

28
00:02:50,900 --> 00:02:57,800
If I scroll down here and hit Shift+Tab on my keyboard to bring up the quick documentation, I can see
28

29
00:02:57,800 --> 00:03:00,910
that there is a parameter called "mask". By default
29

30
00:03:01,010 --> 00:03:04,270
it has the value None and scrolling down.
30

31
00:03:04,400 --> 00:03:10,470
we can take a quick look at a more detailed description and this reads:
31

32
00:03:10,750 --> 00:03:16,600
"If not None, gives a binary mask on where to draw words."
32

33
00:03:16,960 --> 00:03:23,320
So the mask will specify the location of where the words are drawn on the canvas.
33

34
00:03:24,560 --> 00:03:31,100
Further down it says "All white entries will be considered 'masked out' and all the other entries will
34

35
00:03:31,100 --> 00:03:40,260
be free to draw on.". And the format that we need to supply this mask in is an nd-array, an n-dimensional
35

36
00:03:40,380 --> 00:03:43,290
array like one from numpy.
36

37
00:03:43,290 --> 00:03:45,300
Now that sounds a little strange, right.
37

38
00:03:45,300 --> 00:03:49,380
Why would we provide an array?
38

39
00:03:49,460 --> 00:03:57,110
And the reason is is that wordcloud actually expects pixel by pixel information on the image in this
39

40
00:03:57,200 --> 00:03:59,020
array format.
40

41
00:03:59,150 --> 00:04:06,430
And when I say pixel by pixel information, what I really mean is pixel color because what the word cloud
41

42
00:04:06,430 --> 00:04:13,810
code cares about is if it's dealing with a white pixel or a pixel in another color. The way that we will
42

43
00:04:13,810 --> 00:04:19,430
be providing this pixel color information is in RGB format.
43

44
00:04:19,630 --> 00:04:23,720
RGB stands for Red Green Blue.
44

45
00:04:23,800 --> 00:04:28,180
We're already acquainted with color hex codes from previous lessons.
45

46
00:04:28,180 --> 00:04:31,790
So this time we'll use our RGB to mix things up a little bit.
46

47
00:04:32,470 --> 00:04:39,460
The RGB format is particularly relevant because it relates very closely to how colors are shown on your
47

48
00:04:39,460 --> 00:04:42,050
TV or your phone screens.
48

49
00:04:42,130 --> 00:04:49,270
If you were to take a magnifying glass and go up really close to your phone, you will actually see tiny
49

50
00:04:49,270 --> 00:04:53,500
little red green and blue LED. With RGB
50

51
00:04:53,600 --> 00:04:59,760
you essentially supply three values, namely how bright each of these little LEDs should shine.
51

52
00:05:00,050 --> 00:05:05,890
The RGB values themselves have a scale between zero and 255.
52

53
00:05:07,130 --> 00:05:13,500
If all the numbers are on 255 then we get pure white.
53

54
00:05:13,610 --> 00:05:22,980
If all the numbers are on 0 then we get pure black. And with any combination between we get all the
54

55
00:05:22,980 --> 00:05:27,540
other colors. Back in our Python code,
55

56
00:05:27,540 --> 00:05:33,610
we now know that our goal is to get an array of RGB values to our word cloud.
56

57
00:05:34,600 --> 00:05:38,440
The first step will be opening the image file.
57

58
00:05:38,440 --> 00:05:45,520
You'll recall that we've added the constant up top that points to the "whale-icon.png" file,
58

59
00:05:46,300 --> 00:05:54,510
so we can use pillows "Image.open()" and then supply the relative file path,
59

60
00:05:54,510 --> 00:06:06,090
our "WHALE_FILE" to the open method. In fact, let's say "icon = Image.open(WHALE_FILE)".
60

61
00:06:06,310 --> 00:06:11,680
Step two will be creating a new blank image object from Pillow.
61

62
00:06:11,770 --> 00:06:19,180
So again we'll use "Image.new()" and then we'll supply three arguments,
62

63
00:06:19,180 --> 00:06:23,810
the first of which is the mode and this will be RGB;
63

64
00:06:24,040 --> 00:06:26,530
the second of which will be the size,
64

65
00:06:26,530 --> 00:06:33,160
so we're kind of setting the size of our canvas here and I'm gonna set that equal to "icon.size",
65

66
00:06:33,190 --> 00:06:42,190
so the same size as the whale image and the third thing is I'll set a base colour and that will be
66

67
00:06:42,190 --> 00:06:50,670
equal to "(255, 255, 255)".
67

68
00:06:50,800 --> 00:06:59,590
So this is a tuple of three integers and they correspond to the red green and blue values. With all of
68

69
00:06:59,590 --> 00:07:01,170
them set to 255,
69

70
00:07:01,270 --> 00:07:02,750
we get white.
70

71
00:07:03,190 --> 00:07:10,700
I'll store our blank canvas in a variable called "image_mask".
71

72
00:07:10,880 --> 00:07:22,070
Now what I'll do, I'll take our image_mask and put a dot after it and say "paste(icon,
72

73
00:07:22,880 --> 00:07:26,570
box = icon)".
73

74
00:07:26,570 --> 00:07:33,410
What this will do is it will paste the picture of our whale onto our blank canvas.
74

75
00:07:33,410 --> 00:07:42,140
And the reason we're doing this is because now we can easily convert our image mask to an array of
75

76
00:07:42,140 --> 00:07:43,850
RGB values.
76

77
00:07:43,850 --> 00:07:51,860
This is why we went through all this trouble, so I'll say "rgb_array = np.
77

78
00:07:51,860 --> 00:08:00,810
array(image_mask)".
78

79
00:08:00,950 --> 00:08:02,440
This is the key.
79

80
00:08:02,570 --> 00:08:14,250
This converts the image object to an array. And with that in hand we can supply the mask argument to
80

81
00:08:14,250 --> 00:08:15,060
our word cloud.
81

82
00:08:15,120 --> 00:08:26,640
So we'll say "mask = rgb_array". Now let's hit Shift+Enter and see what we get - "name 'np'
82

83
00:08:26,670 --> 00:08:28,860
is not defined".
83

84
00:08:28,860 --> 00:08:33,580
Looks like we forgot to import numpy as np.
84

85
00:08:33,660 --> 00:08:39,420
Let's go to the very top on our notebook imports and indeed we're missing it.
85

86
00:08:39,420 --> 00:08:45,310
So this shall not do. Any self respecting Python notebook needs to import numpy, right?
86

87
00:08:45,330 --> 00:08:54,520
So let's add "import numpy as np" and hit Shift+Enter.
87

88
00:08:54,570 --> 00:09:01,450
Now we can refresh this cell and see if it runs. Giving it a little bit of time,
88

89
00:09:01,470 --> 00:09:06,880
we now see our words superimposed on the whale.
89

90
00:09:06,930 --> 00:09:07,990
Brilliant.
90

91
00:09:08,100 --> 00:09:15,900
So this is really cool, but I think we can style this a little better and the way we're gonna do that
91

92
00:09:16,230 --> 00:09:19,910
is we're going to supply some more arguments to our word cloud here.
92

93
00:09:22,190 --> 00:09:29,060
In particular, I'm interested in changing the background colour here and I'm also interested in changing
93

94
00:09:29,060 --> 00:09:36,440
the colour of these words and we can do that by supplying something called a "colormap".
94

95
00:09:36,450 --> 00:09:42,950
Also I want to play with this "max_words" parameter here which is currently set to 200.
95

96
00:09:43,140 --> 00:09:50,680
Let's see what happens when we change this to say 50 or 400, but the very first thing I'm gonna do is change
96

97
00:09:50,680 --> 00:09:52,440
the size of this image.
97

98
00:09:53,050 --> 00:10:00,070
And this is good old's matplotlib's "plt.figure" method that I'm going to call here.
98

99
00:10:00,550 --> 00:10:11,620
So "figure(figsize = [16, 8])". I think that will do nicely and make it a little
99

100
00:10:11,620 --> 00:10:16,020
larger and easier to see what the styling changes are going to look like.
100

101
00:10:17,510 --> 00:10:22,310
So here's our whale now. Let's change the background color to white.
101

102
00:10:22,480 --> 00:10:24,190
I'll add a comma here,
102

103
00:10:24,280 --> 00:10:31,200
change the background color to 'white' and hit Shift+Enter.
103

104
00:10:31,510 --> 00:10:35,110
Now this is starting to look a lot better in my opinion.
104

105
00:10:35,110 --> 00:10:38,460
But let's play with some of the other parameters as well.
105

106
00:10:38,740 --> 00:10:43,880
Before we add those though, I'm noticing that this line is getting a little long.
106

107
00:10:43,900 --> 00:10:50,950
So instead of chaining this method call here I'm going to come down here, hit Enter and say "word_cloud.
107

108
00:10:51,940 --> 00:10:56,250
generate" on a separate line of code. Here
108

109
00:10:56,260 --> 00:10:57,370
after background color,
109

110
00:10:57,500 --> 00:10:58,840
I'll insert a comma,
110

111
00:10:59,110 --> 00:11:09,380
hit Enter and then change the "max_words" to say 50 and update our graphic.
111

112
00:11:10,660 --> 00:11:12,020
So that's a very different look.
112

113
00:11:12,050 --> 00:11:13,490
Right.
113

114
00:11:13,580 --> 00:11:18,400
On the upside this image was generated a lot faster than the other ones.
114

115
00:11:18,590 --> 00:11:24,530
But on the downside I can't really tell that this is a whale anymore because a lot of the edges and
115

116
00:11:24,530 --> 00:11:30,140
so on depend on smaller words being present to flesh out this image.
116

117
00:11:30,170 --> 00:11:36,530
So what I'll do instead is change this "max_words" argument to 400 and refresh this cell.
117

118
00:11:42,040 --> 00:11:42,600
At this point,
118

119
00:11:42,640 --> 00:11:49,870
our code will have taken a little longer but we get a much, much more beautiful word cloud with all the
119

120
00:11:49,870 --> 00:11:53,830
small words on the perimeter of this mask.
120

121
00:11:53,830 --> 00:12:00,490
But if you recall what's actually determining the location of where these words are drawn on the
121

122
00:12:00,490 --> 00:12:03,640
canvas it's this "rgb_array".
122

123
00:12:03,940 --> 00:12:10,570
If we take a closer look at this "rgb_array", and hit Shift+Enter on it,
123

124
00:12:10,570 --> 00:12:14,290
we see that it looks something like this.
124

125
00:12:14,410 --> 00:12:16,590
Now that doesn't tell us a whole lot,
125

126
00:12:16,600 --> 00:12:22,080
maybe. A better way to look at it might be "rgb_array.shape".
126

127
00:12:22,330 --> 00:12:30,370
And there we see it's 1024 by 2048 by 3.
127

128
00:12:30,420 --> 00:12:39,660
The reason we see those numbers here is because the whale icon itself has a width of 2048 pixels by
128

129
00:12:39,780 --> 00:12:42,510
1024 pixels
129

130
00:12:42,510 --> 00:12:50,880
and the 3 is for the red green and blue RGB values. In other words this array has pixel by pixel
130

131
00:12:50,880 --> 00:12:53,590
information on the color.
131

132
00:12:53,700 --> 00:12:59,700
If we take our rgb_array, we can actually pull up a particular pixel.
132

133
00:12:59,700 --> 00:13:08,910
So if I say for example "[1023, 2047]" then I can see what the RGB values
133

134
00:13:08,970 --> 00:13:14,290
are for this particular pixel. Given that this pixel is white,
134

135
00:13:14,360 --> 00:13:25,290
the word cloud will not draw in this area, but let's take the pixel at say 500 and 1000.
135

136
00:13:25,380 --> 00:13:34,570
Here we see the values 0, 0, 0, meaning that at this particular location in the image the pixel is black.
136

137
00:13:34,620 --> 00:13:42,610
So word cloud is free to use this particular pixel to draw on. Now one of the nice things about matplotlib
137

138
00:13:42,610 --> 00:13:49,540
is that we have access to these wonderful color maps and we can play with the styling of our
138

139
00:13:49,540 --> 00:13:54,910
words by say providing one of these color maps as an argument.
139

140
00:13:55,010 --> 00:13:56,310
There's quite a few to choose from.
140

141
00:13:56,380 --> 00:14:01,770
We can go with "plasma", we could go with "blues" or "greens" or "oranges".
141

142
00:14:02,180 --> 00:14:09,080
But one that I found that works particularly well I think for the whale image is a color map called
142

143
00:14:09,860 --> 00:14:13,250
"ocean". Refreshing our cell,
143

144
00:14:13,620 --> 00:14:21,310
our whale will now start looking something like this and I think thematically this color map seems to
144

145
00:14:21,310 --> 00:14:22,700
work rather well.
145

146
00:14:22,870 --> 00:14:28,900
Having worked a little bit now with the NLTK resources and our word cloud code I want to throw it
146

147
00:14:28,900 --> 00:14:30,060
over to you.
147

148
00:14:30,100 --> 00:14:31,970
I want to propose a challenge.
148

149
00:14:32,170 --> 00:14:41,230
I'd like you to use the skull image in the lesson resources as a mask to create a word cloud for Shakespeare's
149

150
00:14:41,230 --> 00:14:43,030
play Hamlet.
150

151
00:14:43,900 --> 00:14:47,320
The skull image that I've included looks like this.
151

152
00:14:47,320 --> 00:14:53,770
So have a go at this and have a play with different color maps and different number of maximum words
152

153
00:14:53,950 --> 00:14:56,180
to make this look good.
153

154
00:14:56,380 --> 00:15:04,750
And you can find Shakespeare's entire play under "gutenberg" and then "shakespeare-hamlet.
154

155
00:15:04,750 --> 00:15:06,200
txt".
155

156
00:15:06,250 --> 00:15:10,780
This is something that we've downloaded as part of the NLTK resources.
156

157
00:15:10,780 --> 00:15:14,980
The file is actually in the same folder as Melville's Moby Dick.
157

158
00:15:14,980 --> 00:15:21,400
If you're particularly brave then you can try and tackle the Hamlet play in the form of XML as opposed
158

159
00:15:21,400 --> 00:15:22,840
to a TXT file.