0 1 00:00:00,390 --> 00:00:07,650 As part of the lesson resources that you've downloaded, under "SpamData" > "01_Processing", there's a folder 1 2 00:00:07,770 --> 00:00:18,280 called "wordcloud_resources" and there you should see a "whale-icon.png" file and this is gonna 2 3 00:00:18,300 --> 00:00:27,450 be our mask that we will project the words from the word cloud onto. If you scroll back up in the notebook 3 4 00:00:27,870 --> 00:00:29,990 to where we've put all our constants, 4 5 00:00:30,180 --> 00:00:40,590 we can add another one here. Let's add the relative path to this PNG file. So I'm going to add a new line and 5 6 00:00:40,590 --> 00:00:49,380 here we'll add a constant called "WHALE_FILE" and set that equal to what it says in the 6 7 00:00:49,380 --> 00:00:50,580 previous string 7 8 00:00:50,910 --> 00:00:58,840 at first, to save us some time typing. The first two bits will stay the same, it will still be "SpamData/01_Processing", 8 9 00:00:58,840 --> 00:01:09,020 but after that forward slash, we have to write "wordcloud_resources/whale-icon.png". 9 10 00:01:09,030 --> 00:01:21,040 This bit here needs to be the relative path that points to our "whale-icon.png" 10 11 00:01:21,050 --> 00:01:25,670 file. As always with these things, 11 12 00:01:25,670 --> 00:01:30,320 everything's case sensitive. Now before we scroll back down, 12 13 00:01:30,380 --> 00:01:36,660 let's go up to our notebook imports and import another tool. 13 14 00:01:36,740 --> 00:01:40,880 This will be some functionality from the PIL package. 14 15 00:01:40,880 --> 00:01:54,230 So "from PIL", all caps, "import Image", with a capital "I". PIL is short for "Python Imaging Library" and Pillow is the name of the package that provides it. Pillow will help 15 16 00:01:54,230 --> 00:01:57,870 us with image manipulation and processing. 16 17 00:01:58,490 --> 00:02:04,730 Pillow can actually do quite a lot of things like pixel manipulation, image blurring or smoothing, or 17 18 00:02:04,730 --> 00:02:13,160 adding text. 
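The constant described above could look roughly like this. It's a minimal sketch that assumes the notebook is started from the folder containing "SpamData"; the existence check at the end is only an illustration and is not part of the lesson's notebook.

```python
from pathlib import Path

# Relative path to the mask image, as spelled out in the lesson.
# Folder and file names are case sensitive.
WHALE_FILE = 'SpamData/01_Processing/wordcloud_resources/whale-icon.png'

# The import that goes with the other notebook imports (capital "I"):
#     from PIL import Image

# Illustrative check (an assumption, not in the lesson): does the path
# resolve from the current working directory?
whale_path = Path(WHALE_FILE)
status = 'found' if whale_path.is_file() else 'not found from this folder'
print(whale_path.name, status)
```

If the check reports that the file is not found, the usual suspects are a typo in the path or a notebook that was launched from a different folder.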
We're going to be using it to convert our starting image of the whale to grayscale and then to an 18 19 00:02:13,160 --> 00:02:15,110 array of RGB values. 19 20 00:02:15,470 --> 00:02:22,610 And the reason for that is that wordcloud has some very specific requirements as to what kinds of images 20 21 00:02:22,880 --> 00:02:27,720 can and cannot be used as a mask. And also it's quite particular about 21 22 00:02:27,830 --> 00:02:32,920 what kind of format the mask is in and that's why we need to do a little prep. 22 23 00:02:33,070 --> 00:02:33,890 Also, 23 24 00:02:33,890 --> 00:02:38,860 I suspect that you'll want to use your own custom masks with wordcloud 24 25 00:02:38,870 --> 00:02:40,410 at some point in the future. 25 26 00:02:40,640 --> 00:02:47,420 So it's a good idea to have seen this workflow of the image preparation once before. In fact, let's take 26 27 00:02:47,420 --> 00:02:50,900 a look at wordcloud's requirements. 27 28 00:02:50,900 --> 00:02:57,800 If I scroll down here and hit Shift+Tab on my keyboard to bring up the quick documentation, I can see 28 29 00:02:57,800 --> 00:03:00,910 that there is a parameter called "mask". By default 29 30 00:03:01,010 --> 00:03:04,270 it has the value None and scrolling down, 30 31 00:03:04,400 --> 00:03:10,470 we can take a quick look at a more detailed description and this reads: 31 32 00:03:10,750 --> 00:03:16,600 "If not None, gives a binary mask on where to draw words." 32 33 00:03:16,960 --> 00:03:23,320 So the mask will specify the location of where the words are drawn on the canvas. 33 34 00:03:24,560 --> 00:03:31,100 Further down it says "All white entries will be considered 'masked out' and all the other entries will 34 35 00:03:31,100 --> 00:03:40,260 be free to draw on." And the format that we need to supply this mask in is an nd-array, an n-dimensional 35 36 00:03:40,380 --> 00:03:43,290 array like one from numpy. 36 37 00:03:43,290 --> 00:03:45,300 Now that sounds a little strange, right? 
37 38 00:03:45,300 --> 00:03:49,380 Why would we provide an array? 38 39 00:03:49,460 --> 00:03:57,110 And the reason is that wordcloud actually expects pixel by pixel information on the image in this 39 40 00:03:57,200 --> 00:03:59,020 array format. 40 41 00:03:59,150 --> 00:04:06,430 And when I say pixel by pixel information, what I really mean is pixel color because what the word cloud 41 42 00:04:06,430 --> 00:04:13,810 code cares about is whether it's dealing with a white pixel or a pixel in another color. The way that we will 42 43 00:04:13,810 --> 00:04:19,430 be providing this pixel color information is in RGB format. 43 44 00:04:19,630 --> 00:04:23,720 RGB stands for Red Green Blue. 44 45 00:04:23,800 --> 00:04:28,180 We're already acquainted with color hex codes from previous lessons. 45 46 00:04:28,180 --> 00:04:31,790 So this time we'll use RGB to mix things up a little bit. 46 47 00:04:32,470 --> 00:04:39,460 The RGB format is particularly relevant because it relates very closely to how colors are shown on your 47 48 00:04:39,460 --> 00:04:42,050 TV or your phone screens. 48 49 00:04:42,130 --> 00:04:49,270 If you were to take a magnifying glass and go up really close to your phone, you would actually see tiny 49 50 00:04:49,270 --> 00:04:53,500 little red, green and blue LEDs. With RGB 50 51 00:04:53,600 --> 00:04:59,760 you essentially supply three values, namely how bright each of these little LEDs should shine. 51 52 00:05:00,050 --> 00:05:05,890 The RGB values themselves have a scale between zero and 255. 52 53 00:05:07,130 --> 00:05:13,500 If all the numbers are at 255 then we get pure white. 53 54 00:05:13,610 --> 00:05:22,980 If all the numbers are at 0 then we get pure black. And with any combination in between we get all the 54 55 00:05:22,980 --> 00:05:27,540 other colors. Back in our Python code, 55 56 00:05:27,540 --> 00:05:33,610 we now know that our goal is to get an array of RGB values to our word cloud. 
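To make the RGB idea concrete, here's a small sketch in plain Python. The helper names "is_masked_out" and "to_hex" are made up for this illustration; they only mirror the rule quoted from the documentation (white means masked out) and the hex codes from earlier lessons.

```python
# RGB colours as (red, green, blue) tuples, each channel from 0 to 255.
WHITE = (255, 255, 255)  # all three channels at full brightness
BLACK = (0, 0, 0)        # all three channels switched off
RED = (255, 0, 0)        # only the red channel is on


def is_masked_out(pixel):
    """Hypothetical helper: True if the pixel is pure white ('masked out')."""
    return tuple(pixel) == WHITE


def to_hex(pixel):
    """Write an RGB tuple as a colour hex code."""
    return '#{:02x}{:02x}{:02x}'.format(*pixel)


print(is_masked_out(WHITE), is_masked_out(BLACK))  # True False
print(to_hex(RED))                                 # #ff0000
```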
56 57 00:05:34,600 --> 00:05:38,440 The first step will be opening the image file. 57 58 00:05:38,440 --> 00:05:45,520 You'll recall that we've added the constant up top that points to the "whale-icon.png" file, 58 59 00:05:46,300 --> 00:05:54,510 so we can use Pillow's "Image.open()" and then supply the relative file path, 59 60 00:05:54,510 --> 00:06:06,090 our "WHALE_FILE", to the open method. In fact, let's say "icon = Image.open(WHALE_FILE)". 60 61 00:06:06,310 --> 00:06:11,680 Step two will be creating a new blank image object from Pillow. 61 62 00:06:11,770 --> 00:06:19,180 So again we'll use "Image.new()" and then we'll supply three arguments, 62 63 00:06:19,180 --> 00:06:23,810 the first of which is the mode and this will be RGB; 63 64 00:06:24,040 --> 00:06:26,530 the second of which will be the size, 64 65 00:06:26,530 --> 00:06:33,160 so we're kind of setting the size of our canvas here and I'm gonna set that equal to "icon.size", 65 66 00:06:33,190 --> 00:06:42,190 so the same size as the whale image and the third thing is I'll set a base colour and that will be 66 67 00:06:42,190 --> 00:06:50,670 equal to "(255, 255, 255)". 67 68 00:06:50,800 --> 00:06:59,590 So this is a tuple of three integers and they correspond to the red, green and blue values. With all of 68 69 00:06:59,590 --> 00:07:01,170 them set to 255, 69 70 00:07:01,270 --> 00:07:02,750 we get white. 70 71 00:07:03,190 --> 00:07:10,700 I'll store our blank canvas in a variable called "image_mask". 71 72 00:07:10,880 --> 00:07:22,070 Now what I'll do, I'll take our image_mask and put a dot after it and say 72 73 00:07:22,880 --> 00:07:26,570 "paste(icon, box = icon)". 73 74 00:07:26,570 --> 00:07:33,410 What this will do is it will paste the picture of our whale onto our blank canvas. 74 75 00:07:33,410 --> 00:07:42,140 And the reason we're doing this is that now we can easily convert our image mask to an array of 75 76 00:07:42,140 --> 00:07:43,850 RGB values. 
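The steps above can be sketched as follows. Two assumptions to flag: the sketch builds a tiny stand-in icon in memory instead of calling "Image.open(WHALE_FILE)", so that it runs without the lesson files, and it passes the icon as "mask=icon" rather than "box = icon". As far as I understand Pillow, an image handed over as the second argument gets treated as the mask anyway, so both spellings should behave the same.

```python
from PIL import Image

# Stand-in for "icon = Image.open(WHALE_FILE)": a small transparent RGBA
# image with an opaque black rectangle playing the role of the whale.
icon = Image.new(mode='RGBA', size=(8, 4), color=(0, 0, 0, 0))
icon.paste((0, 0, 0, 255), box=(2, 1, 6, 3))

# Step two: a blank white canvas with the same size as the icon.
image_mask = Image.new(mode='RGB', size=icon.size, color=(255, 255, 255))

# Paste the icon onto the canvas. The icon's own transparency acts as the
# paste mask, so transparent areas of the icon stay white on the canvas.
image_mask.paste(icon, mask=icon)

print(image_mask.getpixel((0, 0)))  # a corner, outside the shape
print(image_mask.getpixel((3, 1)))  # a point inside the shape
```

With the real whale PNG the first line would simply be the "Image.open" call from the lesson; everything after it stays the same.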
76 77 00:07:43,850 --> 00:07:51,860 This is why we went through all this trouble, so I'll say 77 78 00:07:51,860 --> 00:08:00,810 "rgb_array = np.array(image_mask)". 78 79 00:08:00,950 --> 00:08:02,440 This is the key. 79 80 00:08:02,570 --> 00:08:14,250 This converts the image object to an array. And with that in hand we can supply the mask argument to 80 81 00:08:14,250 --> 00:08:15,060 our word cloud. 81 82 00:08:15,120 --> 00:08:26,640 So we'll say "mask = rgb_array". Now let's hit Shift+Enter and see what we get: 82 83 00:08:26,670 --> 00:08:28,860 "name 'np' is not defined". 83 84 00:08:28,860 --> 00:08:33,580 Looks like we forgot to import numpy as np. 84 85 00:08:33,660 --> 00:08:39,420 Let's go to the very top, to our notebook imports, and indeed we're missing it. 85 86 00:08:39,420 --> 00:08:45,310 So this shall not do. Any self-respecting Python notebook needs to import numpy, right? 86 87 00:08:45,330 --> 00:08:54,520 So let's add "import numpy as np" and hit Shift+Enter. 87 88 00:08:54,570 --> 00:09:01,450 Now we can refresh this cell and see if it runs. Giving it a little bit of time, 88 89 00:09:01,470 --> 00:09:06,880 we now see our words superimposed on the whale. 89 90 00:09:06,930 --> 00:09:07,990 Brilliant. 90 91 00:09:08,100 --> 00:09:15,900 So this is really cool, but I think we can style this a little better and the way we're gonna do that 91 92 00:09:16,230 --> 00:09:19,910 is we're going to supply some more arguments to our word cloud here. 92 93 00:09:22,190 --> 00:09:29,060 In particular, I'm interested in changing the background colour here and I'm also interested in changing 93 94 00:09:29,060 --> 00:09:36,440 the colour of these words and we can do that by supplying something called a "colormap". 94 95 00:09:36,450 --> 00:09:42,950 Also I want to play with this "max_words" parameter here which is currently set to 200. 
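The conversion step and the hand-off to the word cloud can be sketched like this. The canvas is again a small stand-in rather than the whale image, and "make_word_cloud" is a hypothetical wrapper written for this illustration; it imports wordcloud only when called, so the array part runs even where wordcloud isn't installed.

```python
import numpy as np
from PIL import Image

# Stand-in for the lesson's image_mask: a white canvas with a black block.
image_mask = Image.new(mode='RGB', size=(8, 4), color=(255, 255, 255))
image_mask.paste((0, 0, 0), box=(2, 1, 6, 3))

# The key step: convert the Pillow image object to a numpy array.
rgb_array = np.array(image_mask)
print(rgb_array.shape)  # rows (height), columns (width), colour channels


def make_word_cloud(text, mask):
    """Hypothetical wrapper: generate a word cloud drawn inside the mask."""
    # Imported here so the array conversion above runs without wordcloud.
    from wordcloud import WordCloud
    return WordCloud(mask=mask).generate(text)
```

Note that the array comes out as height by width, the reverse of Pillow's "size", which reports width first.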
95 96 00:09:43,140 --> 00:09:50,680 Let's see what happens when we change this to say 50 or 400, but the very first thing I'm gonna do is change 96 97 00:09:50,680 --> 00:09:52,440 the size of this image. 97 98 00:09:53,050 --> 00:10:00,070 And this is good old matplotlib's "plt.figure" function that I'm going to call here. 98 99 00:10:00,550 --> 00:10:11,620 So "figure(figsize = [16, 8])". I think that will do nicely and make it a little 99 100 00:10:11,620 --> 00:10:16,020 larger and easier to see what the styling changes are going to look like. 100 101 00:10:17,510 --> 00:10:22,310 So here's our whale now. Let's change the background color to white. 101 102 00:10:22,480 --> 00:10:24,190 I'll add a comma here, 102 103 00:10:24,280 --> 00:10:31,200 set "background_color" to 'white' and hit Shift+Enter. 103 104 00:10:31,510 --> 00:10:35,110 Now this is starting to look a lot better in my opinion. 104 105 00:10:35,110 --> 00:10:38,460 But let's play with some of the other parameters as well. 105 106 00:10:38,740 --> 00:10:43,880 Before we add those though, I'm noticing that this line is getting a little long. 106 107 00:10:43,900 --> 00:10:50,950 So instead of chaining this method call here I'm going to come down here, hit Enter and say 107 108 00:10:51,940 --> 00:10:56,250 "word_cloud.generate" on a separate line of code. Here 108 109 00:10:56,260 --> 00:10:57,370 after background color, 109 110 00:10:57,500 --> 00:10:58,840 I'll insert a comma, 110 111 00:10:59,110 --> 00:11:09,380 hit Enter and then change the "max_words" to say 50 and update our graphic. 111 112 00:11:10,660 --> 00:11:12,020 So that's a very different look, 112 113 00:11:12,050 --> 00:11:13,490 right? 113 114 00:11:13,580 --> 00:11:18,400 On the upside this image was generated a lot faster than the other ones. 
114 115 00:11:18,590 --> 00:11:24,530 But on the downside I can't really tell that this is a whale anymore because a lot of the edges and 115 116 00:11:24,530 --> 00:11:30,140 so on depend on smaller words being present to flesh out this image. 116 117 00:11:30,170 --> 00:11:36,530 So what I'll do instead is change this "max_words" argument to 400 and refresh this cell. 117 118 00:11:42,040 --> 00:11:42,600 At this point, 118 119 00:11:42,640 --> 00:11:49,870 our code will have taken a little longer but we get a much, much more beautiful word cloud with all the 119 120 00:11:49,870 --> 00:11:53,830 small words on the perimeter of this mask. 120 121 00:11:53,830 --> 00:12:00,490 But if you recall what's actually determining the location of where these words are drawn on the 121 122 00:12:00,490 --> 00:12:03,640 canvas it's this "rgb_array". 122 123 00:12:03,940 --> 00:12:10,570 If we take a closer look at this "rgb_array", and hit Shift+Enter on it, 123 124 00:12:10,570 --> 00:12:14,290 we see that it looks something like this. 124 125 00:12:14,410 --> 00:12:16,590 Now that doesn't tell us a whole lot, 125 126 00:12:16,600 --> 00:12:22,080 maybe. A better way to look at it might be "rgb_array.shape". 126 127 00:12:22,330 --> 00:12:30,370 And there we see it's 1024 by 2048 by 3. 127 128 00:12:30,420 --> 00:12:39,660 The reason we see those numbers here is because the whale icon itself has a width of 2048 pixels by 128 129 00:12:39,780 --> 00:12:42,510 1024 pixels 129 130 00:12:42,510 --> 00:12:50,880 and the 3 is for the red green and blue RGB values. In other words this array has pixel by pixel 130 131 00:12:50,880 --> 00:12:53,590 information on the color. 131 132 00:12:53,700 --> 00:12:59,700 If we take our rgb_array, we can actually pull up a particular pixel. 132 133 00:12:59,700 --> 00:13:08,910 So if I say for example "[1023, 2047]" then I can see what the RGB values 133 134 00:13:08,970 --> 00:13:14,290 are for this particular pixel. 
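The shape and the pixel lookups can be tried out on a synthetic array. This is a sketch under the assumption that a white array with one black block is a fair stand-in for the whale mask; the real "rgb_array" comes from the PNG, and the position of the black block here is invented.

```python
import numpy as np

# Synthetic stand-in with the same shape as the lesson's rgb_array:
# 1024 rows (height) x 2048 columns (width) x 3 colour channels.
rgb_array = np.full((1024, 2048, 3), 255, dtype=np.uint8)  # all white
rgb_array[400:600, 800:1200] = 0                            # a black block

print(rgb_array.shape)        # (1024, 2048, 3)
print(rgb_array[1023, 2047])  # [255 255 255], white, so masked out
print(rgb_array[500, 1000])   # [0 0 0], black, so free to draw on

# True wherever a pixel is anything other than pure white.
drawable = ~np.all(rgb_array == 255, axis=-1)
print(drawable.sum(), 'pixels are free to draw on')
```

Indexing goes row first, then column, which is why the bottom-right pixel is "[1023, 2047]" and not the other way round.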
Given that this pixel is white, 134 135 00:13:14,360 --> 00:13:25,290 the word cloud will not draw in this area, but let's take the pixel at say 500 and 1000. 135 136 00:13:25,380 --> 00:13:34,570 Here we see the values 0, 0, 0, meaning that at this particular location in the image the pixel is black. 136 137 00:13:34,620 --> 00:13:42,610 So word cloud is free to use this particular pixel to draw on. Now one of the nice things about matplotlib 137 138 00:13:42,610 --> 00:13:49,540 is that we have access to these wonderful color maps and we can play with the styling of our 138 139 00:13:49,540 --> 00:13:54,910 words by say providing one of these color maps as an argument. 139 140 00:13:55,010 --> 00:13:56,310 There's quite a few to choose from. 140 141 00:13:56,380 --> 00:14:01,770 We can go with "plasma", we could go with "Blues" or "Greens" or "Oranges". 141 142 00:14:02,180 --> 00:14:09,080 But one that I found that works particularly well I think for the whale image is a color map called 142 143 00:14:09,860 --> 00:14:13,250 "ocean". Refreshing our cell, 143 144 00:14:13,620 --> 00:14:21,310 our whale will now start looking something like this and I think thematically this color map seems to 144 145 00:14:21,310 --> 00:14:22,700 work rather well. 145 146 00:14:22,870 --> 00:14:28,900 Having worked a little bit now with the NLTK resources and our word cloud code I want to throw it 146 147 00:14:28,900 --> 00:14:30,060 over to you. 147 148 00:14:30,100 --> 00:14:31,970 I want to propose a challenge. 148 149 00:14:32,170 --> 00:14:41,230 I'd like you to use the skull image in the lesson resources as a mask to create a word cloud for Shakespeare's 149 150 00:14:41,230 --> 00:14:43,030 play Hamlet. 150 151 00:14:43,900 --> 00:14:47,320 The skull image that I've included looks like this. 
151 152 00:14:47,320 --> 00:14:53,770 So have a go at this and have a play with different color maps and different numbers of maximum words 152 153 00:14:53,950 --> 00:14:56,180 to make this look good. 153 154 00:14:56,380 --> 00:15:04,750 And you can find Shakespeare's entire play under "gutenberg" and then 154 155 00:15:04,750 --> 00:15:06,200 "shakespeare-hamlet.txt". 155 156 00:15:06,250 --> 00:15:10,780 This is something that we've downloaded as part of the NLTK resources. 156 157 00:15:10,780 --> 00:15:14,980 The file is actually in the same folder as Melville's Moby Dick. 157 158 00:15:14,980 --> 00:15:21,400 If you're particularly brave then you can try and tackle the Hamlet play in the form of XML as opposed 158 159 00:15:21,400 --> 00:15:22,840 to a TXT file.