0 1 00:00:00,700 --> 00:00:07,350 All right, so in the upcoming lessons I want to talk a little bit more about data visualization. In particular, 1 2 00:00:07,430 --> 00:00:11,040 I want to show you guys how to create some beautiful pie charts. 2 3 00:00:11,150 --> 00:00:18,690 And afterwards I wanna show you how to create a variation on the pie chart called the doughnut chart. 3 4 00:00:18,690 --> 00:00:23,350 Now, both of these types of charts are very, very common, especially in business reports, 4 5 00:00:23,520 --> 00:00:28,830 so it's very important that we also get these two charts under our belt. 5 6 00:00:28,850 --> 00:00:37,250 The first thing I'll do of course is add another markdown cell and I'll call this "Number of Spam Messages 6 7 00:00:37,820 --> 00:00:41,340 Visualised ( 7 8 00:00:41,340 --> 00:00:44,640 Pie Charts)". There we go. 8 9 00:00:44,650 --> 00:00:50,350 Now before we dive into the visualizations, we should probably check how many spam and how many non-spam 9 10 00:00:50,350 --> 00:00:53,720 messages there are. For this, our dataframe's 10 11 00:00:53,830 --> 00:00:56,810 value_counts method is going to be very, very handy. 11 12 00:00:57,160 --> 00:01:08,710 So if we take our dataframe and select the CATEGORY column, we can use value_counts and then 12 13 00:01:08,710 --> 00:01:15,140 parentheses to get a breakdown of how many messages of each category we've got. 13 14 00:01:15,370 --> 00:01:21,100 Here we can see that we've got 3900 non-spam or ham messages and we've got 14 15 00:01:21,100 --> 00:01:25,680 1896 spam messages. 15 16 00:01:25,780 --> 00:01:30,550 Now, since we're gonna be using these numbers later on, let's store them in some variables. 16 17 00:01:30,730 --> 00:01:33,580 So I'll say "amount_of_ 17 18 00:01:33,580 --> 00:01:37,310 spam = data. 18 19 00:01:37,420 --> 00:01:39,890 CATEGORY.value_counts()[ 19 20 00:01:40,140 --> 00:01:42,770 1]".
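The counting step just described can be sketched on a tiny stand-in dataframe. The toy data below is an assumption for illustration only, and so is the coding of 1 for spam, which is implied by the "[1]" lookup rather than stated outright.

```python
import pandas as pd

# Toy stand-in for the course dataframe (assumption: 1 marks a spam message).
data = pd.DataFrame({"CATEGORY": [0, 0, 0, 1, 0, 1, 0]})

# value_counts() returns a Series indexed by category label, largest count first.
counts = data.CATEGORY.value_counts()
print(counts)

# The index holds the integer labels, so [1] looks up the label 1.
amount_of_spam = counts[1]

# .loc makes the label-based lookup explicit and gives the same number.
same_amount = counts.loc[1]
```

On the real dataset the same lookup would give the spam count read off the breakdown above.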
20 21 00:01:43,410 --> 00:01:52,620 And the amount of ham is gonna be equal to the same thing except it's going to be followed by "[0]". 21 22 00:01:52,620 --> 00:02:02,100 Zero because I access the first value in this series and one to access the second value 22 23 00:02:02,400 --> 00:02:04,310 in this series. 23 24 00:02:04,380 --> 00:02:05,970 Now, to do our visualization, 24 25 00:02:05,990 --> 00:02:09,180 we're gonna be using our old friend matplotlib. 25 26 00:02:09,180 --> 00:02:17,250 So let's import matplotlib's pyplot functionality as plt at the top of our notebook. Going to our 26 27 00:02:17,250 --> 00:02:19,770 notebook imports we'll say "import 27 28 00:02:22,590 --> 00:02:30,940 matplotlib.pyplot as plt". 28 29 00:02:30,940 --> 00:02:33,430 Now, as usual, we're also gonna be adding 29 30 00:02:33,560 --> 00:02:38,210 "%matplotlib inline". 30 31 00:02:38,450 --> 00:02:42,010 This line of code here is Jupyter notebook specific 31 32 00:02:42,230 --> 00:02:50,540 and it allows us to export our images for our charts when we export the notebook. 32 33 00:02:50,570 --> 00:02:58,310 So if you ever go to Download, Download notebook as Notebook, then if you have this line of code in here 33 34 00:02:58,640 --> 00:03:00,530 then you get the graphics as well. 34 35 00:03:01,410 --> 00:03:01,890 All right. 35 36 00:03:01,910 --> 00:03:04,730 So how do you create one of these friendly pie charts? 36 37 00:03:04,760 --> 00:03:12,160 The first thing I'll do is I'll create a list of our category names, so category_names 37 38 00:03:12,200 --> 00:03:15,950 is gonna be equal to a list with the square brackets, 38 39 00:03:15,950 --> 00:03:27,340 "['Spam', 'Legit Mail']" and the next thing I'll do is create 39 40 00:03:27,430 --> 00:03:28,480 another list. 40 41 00:03:28,720 --> 00:03:34,240 I'm going to call this one "sizes" for the size of the different pieces of the pie. 
41 42 00:03:35,430 --> 00:03:45,180 This is going to hold on to the amount_of_spam and it's going to hold on to the amount_of_ham. 42 43 00:03:45,210 --> 00:03:57,450 Now we can use these two to generate our chart, "plt.pie(sizes, " and then the labels 43 44 00:03:57,930 --> 00:04:08,660 of the chart are gonna be equal to the category_names. With "plt.show()" we can see what our basic, 44 45 00:04:08,810 --> 00:04:12,020 very, very basic pie chart is going to look like. Now, 45 46 00:04:12,050 --> 00:04:17,060 I don't recall if I actually pressed Shift+Enter a minute earlier on my notebook cell, but let me hit 46 47 00:04:17,060 --> 00:04:20,510 Shift+Enter now and we'll quickly find out. 47 48 00:04:21,480 --> 00:04:25,490 Nope, get my "NameError: 'plt' is not defined". 48 49 00:04:25,490 --> 00:04:27,140 This means that we have to go back up, 49 50 00:04:27,790 --> 00:04:32,090 or I do at least, and hit Shift+Enter on my imports. 50 51 00:04:32,090 --> 00:04:33,300 There we go. 51 52 00:04:33,300 --> 00:04:41,940 Now I can come down here and actually generate my pie chart and it looks like this. It's pretty ugly actually. 52 53 00:04:42,470 --> 00:04:47,090 So we're gonna have to customize this and make it look presentable. 53 54 00:04:47,090 --> 00:04:48,530 We're going to dress it up. 54 55 00:04:48,630 --> 00:04:54,810 The first thing I'm going to do is actually make those font sizes on this thing look a bit larger, so I can access 55 56 00:04:54,960 --> 00:05:01,360 the font sizes of my labels here with the property called "textprops"; 56 57 00:05:01,620 --> 00:05:06,100 "textprops = { 57 58 00:05:06,180 --> 00:05:12,540 'fontsize': }", say, I don't know, maybe 6, 58 59 00:05:12,600 --> 00:05:13,340 see what happens. 59 60 00:05:14,400 --> 00:05:16,210 OK, they get even smaller, 60 61 00:05:16,210 --> 00:05:17,210 not what I want. 61 62 00:05:17,740 --> 00:05:20,200 Maybe I'll pick 16. 62 63 00:05:20,200 --> 00:05:21,520 There we go. 
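Put together, the basic chart so far might look roughly like the sketch below. The two counts are hard-coded from the numbers read off earlier so that the cell runs on its own, and the "Agg" backend line is an assumption that only matters outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt

# Counts taken from the value_counts() breakdown earlier in the lesson.
amount_of_spam = 1896
amount_of_ham = 3900

category_names = ['Spam', 'Legit Mail']
sizes = [amount_of_spam, amount_of_ham]

# Without autopct, plt.pie returns the wedge patches and the label Text objects.
wedges, texts = plt.pie(sizes, labels=category_names,
                        textprops={'fontsize': 16})

# In a notebook, plt.show() at this point renders the chart.
```

The textprops dictionary is passed on to every label, which is why a single 'fontsize' entry resizes both of them.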
63 64 00:05:21,520 --> 00:05:27,160 That's a bit better, but this whole thing is still very, very unconvincing. 64 65 00:05:27,160 --> 00:05:31,360 For starters I can see that it's not very, very sharp on my screen. 65 66 00:05:31,360 --> 00:05:36,110 So what I can do in this case is manipulate the figure itself, so "plt. 66 67 00:05:36,120 --> 00:05:43,760 figure(figsize = )", 67 68 00:05:43,760 --> 00:05:48,220 I don't know, say, "2,2", right. 68 69 00:05:48,410 --> 00:05:50,540 Changing the figure size. 69 70 00:05:50,660 --> 00:05:51,840 See what this looks like. 70 71 00:05:51,950 --> 00:05:57,740 It gets a lot smaller, but this isn't the only thing I can change on the figure size. 71 72 00:05:58,370 --> 00:06:05,270 If I wanted to make this whole thing look tack sharp on my screen, then I can actually set the density 72 73 00:06:05,900 --> 00:06:11,960 of the pixels per inch or DPI to whatever my monitor supports. 73 74 00:06:11,960 --> 00:06:15,250 So I've got a pretty decent monitor that I'm working with here. 74 75 00:06:15,350 --> 00:06:22,480 It has quite a high resolution, so it actually supports 227 pixels per inch. 75 76 00:06:22,700 --> 00:06:29,030 And this will have an interesting effect, so if I press Shift+Enter now, you can see that the whole thing 76 77 00:06:29,630 --> 00:06:35,780 starts looking first of all a lot larger because it's scaled up and second of all the edges start looking 77 78 00:06:35,780 --> 00:06:37,980 a bit more clear. 78 79 00:06:38,010 --> 00:06:43,980 Only thing is my font size is probably a bit too large now, so if I go back to font size say 6, 79 80 00:06:44,120 --> 00:06:48,210 then it starts looking a bit better like this. 80 81 00:06:48,330 --> 00:06:55,920 Now, when you're working on this project, have a play with the different figure sizes, the DPI values and 81 82 00:06:55,920 --> 00:06:58,910 the font size and see how it affects your scaling. 
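A rough sketch of the figure settings discussed here: the figure size is given in inches and the DPI in pixels per inch, so a 2 by 2 inch figure at 227 DPI comes out at 454 pixels on each side. The value 227 is just my monitor's figure; treat it as a placeholder for your own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt

sizes = [1896, 3900]
category_names = ['Spam', 'Legit Mail']

# figsize is in inches, dpi in pixels per inch.
fig = plt.figure(figsize=(2, 2), dpi=227)
plt.pie(sizes, labels=category_names, textprops={'fontsize': 6})

# The rendered size in pixels is simply inches times dpi.
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)
```

Changing either number changes the pixel size, which is why the font size usually needs adjusting afterwards.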
82 83 00:06:58,950 --> 00:07:04,080 You're probably gonna have to use a bit of trial and error to get this thing looking good or looking 83 84 00:07:04,080 --> 00:07:06,640 the way you want it to on your monitor. 84 85 00:07:07,560 --> 00:07:12,540 If you ever want to export this thing and you know save it to, put it into a report or what have you, 85 86 00:07:12,540 --> 00:07:20,070 it's probably better to create a larger version of this figure and then just right click and save the 86 87 00:07:20,070 --> 00:07:26,670 image as a larger image, because then if you ever need to include it into a report or into a Word document 87 88 00:07:26,730 --> 00:07:33,540 or whatever you're using, then you can downscale it and it'll look a lot better than taking a small image and 88 89 00:07:33,540 --> 00:07:35,350 scaling it up. 89 90 00:07:35,350 --> 00:07:39,980 Now let me pull up the quick documentation on the "pie" functionality from matplotlib. 90 91 00:07:40,800 --> 00:07:46,910 We can actually see that there's quite a few different parameters that we can set. 91 92 00:07:46,920 --> 00:07:48,320 Let me show you two of them. 92 93 00:07:48,510 --> 00:07:56,730 I want to show you the start angle which is currently set to none and I want to show you this auto percent 93 94 00:07:56,820 --> 00:08:00,450 parameter that we've got here and how to use it. 94 95 00:08:00,480 --> 00:08:04,500 Let's have a play with the start angle first. 95 96 00:08:04,500 --> 00:08:08,370 So if I set the start angle to zero, let's see what we have. 96 97 00:08:08,730 --> 00:08:16,110 We can see that the chart looks like so, then we can see that this blue portion just rotated by 10 degrees 97 98 00:08:16,350 --> 00:08:18,380 from this position. 98 99 00:08:18,630 --> 00:08:28,380 You can see this rotation actually very clearly if I start going to 20, 30, 40, 50 and so on. 
The blue piece 99 100 00:08:28,590 --> 00:08:33,390 starts rotating counterclockwise from the x axis, 100 101 00:08:33,390 --> 00:08:39,470 so from a straight line that goes from the center out to here. 101 102 00:08:39,510 --> 00:08:49,380 If I set the start angle to 90, then our chart rotates by exactly one fourth of the way around counterclockwise. 102 103 00:08:49,380 --> 00:08:49,790 All right. 103 104 00:08:49,820 --> 00:08:51,560 So that's the start angle. 104 105 00:08:51,600 --> 00:08:55,840 Now let's have a look at the other property that I wanted to show you - 105 106 00:08:56,090 --> 00:08:58,300 auto percent, 106 107 00:08:58,350 --> 00:09:09,650 and we're gonna set that equal to '%1.1f%%' and 107 108 00:09:09,650 --> 00:09:19,220 what we get now is we get a pie chart where the percent of the chart is displayed on the chart itself. 108 109 00:09:19,220 --> 00:09:24,070 The formatting is to one decimal point, because we've put 1.1 here. 109 110 00:09:24,290 --> 00:09:31,250 If instead I had "1.2f", then it would show me two decimal points. 110 111 00:09:31,250 --> 00:09:36,890 So depending on the amount of precision that you want on your pie chart you can choose the formatting 111 112 00:09:37,040 --> 00:09:40,150 of the percent sign as appropriate. 112 113 00:09:40,450 --> 00:09:40,740 Now, 113 114 00:09:40,740 --> 00:09:46,880 personally, I don't like showing too many digits after the decimal point. In most use cases, 114 115 00:09:46,940 --> 00:09:51,990 you usually want to favor readability over precision when it comes to these visualizations anyhow. 115 116 00:09:53,200 --> 00:09:58,930 If we wanted to round this to the nearest percentage, we can put a zero here and then we'll get 33 and 116 117 00:09:58,930 --> 00:10:00,810 67. 117 118 00:10:00,850 --> 00:10:03,250 Now I find this chart here, 118 119 00:10:03,250 --> 00:10:07,370 size wise and formatting wise, is starting to look pretty decent. 
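Here is a small sketch of both parameters, startangle and autopct, using the counts from earlier. The autopct string is an ordinary printf-style format, so it can be tried out on a plain number first; the variable names for the formatted strings are only there for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt

sizes = [1896, 3900]
category_names = ['Spam', 'Legit Mail']

# Trying the format strings on the spam share; '%%' is a literal percent sign.
spam_share = 100 * sizes[0] / sum(sizes)
one_decimal = '%1.1f%%' % spam_share
two_decimals = '%1.2f%%' % spam_share
no_decimals = '%1.0f%%' % spam_share

plt.figure(figsize=(2, 2), dpi=227)

# startangle=90 rotates the first wedge a quarter turn counterclockwise
# from the x axis; autopct writes each wedge's share onto the chart.
wedges, texts, autotexts = plt.pie(sizes, labels=category_names,
                                   textprops={'fontsize': 6},
                                   startangle=90, autopct='%1.0f%%')

percent_labels = [t.get_text() for t in autotexts]
print(one_decimal, two_decimals, no_decimals, percent_labels)
```

The digit after the dot in the format string is the number of decimal places, so '%1.0f%%' is the one that rounds to whole percentages.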
119 120 00:10:07,390 --> 00:10:09,050 The only thing that I don't like, 120 121 00:10:09,050 --> 00:10:13,410 are these really awful colors that it's showing me here by default. 121 122 00:10:13,450 --> 00:10:18,220 So let's, let's change this and try to improve the design a little bit. 122 123 00:10:19,400 --> 00:10:21,760 Just so we have a record of our old chart, 123 124 00:10:21,830 --> 00:10:25,340 I'm going to copy this cell and I'm going to paste it below. 124 125 00:10:25,340 --> 00:10:32,000 Now I've got two different pie charts and I'm going to spice up the design on this one and leave the 125 126 00:10:32,000 --> 00:10:33,580 other one as it is. 126 127 00:10:33,950 --> 00:10:40,330 One of my go to places for picking a nice palette is a website called Flat UI Colors. 127 128 00:10:40,400 --> 00:10:46,170 Let's check out this American Palette here that they've got. The first color that I'm gonna grab is gonna 128 129 00:10:46,190 --> 00:10:57,230 be maybe Pink Glamour, means I can go back here into my Jupyter notebook and save my colors as a list. 129 130 00:10:57,230 --> 00:11:06,890 I'm going to call it a "custom_colors = ['']", paste. This is gonna be my first 130 131 00:11:06,890 --> 00:11:13,880 color, comma, back to the American Palette and I'm going to select the color that goes well with this 131 132 00:11:13,880 --> 00:11:14,970 other category. 132 133 00:11:15,080 --> 00:11:19,320 So I think something that would be contrast-y would be this blue one here. 133 134 00:11:19,510 --> 00:11:22,660 So I'm gonna take this one, go back here, 134 135 00:11:24,430 --> 00:11:32,860 add that as well. Having added my colors' hex codes to a list, I can now feed them in as an argument to 135 136 00:11:32,890 --> 00:11:34,420 my "pie" method. 
136 137 00:11:34,960 --> 00:11:47,980 So, "colors = custom_colors" and Shift+Enter will color in my pie chart in a much more beautiful way 137 138 00:11:48,660 --> 00:11:57,400 and I've picked some light colors, so that the dark text, this 33 and 67 percent is still very, very readable. 138 139 00:11:57,400 --> 00:12:03,340 Now one thing we can also do to make the design a little bit snappier is to break out this spam section 139 140 00:12:03,340 --> 00:12:10,990 here from the legitimate e-mail section, so we can actually have a little bit of a gap in between the 140 141 00:12:10,990 --> 00:12:21,240 two sections and we can do this by supplying an argument here called "explode", so "explode = " 141 142 00:12:22,170 --> 00:12:25,150 and then we also supply a list. 142 143 00:12:25,230 --> 00:12:31,800 This list will have two values, say 0 and 0.1. 143 144 00:12:32,130 --> 00:12:36,300 These numerical values here set the size of the gap. 144 145 00:12:36,420 --> 00:12:44,230 So let me press Shift+Enter and show you what this would look like. So passing in an argument of 0 145 146 00:12:44,350 --> 00:12:49,150 and 0.1 will get the chart looking like this. 146 147 00:12:49,150 --> 00:12:55,190 If we had 0.5 instead, then the gap would start looking a lot bigger. 147 148 00:12:55,750 --> 00:13:02,860 So I'm going to set this back to 0.1 and I can also meddle with the first one here, so 0.5 148 149 00:13:02,890 --> 00:13:09,760 instead of 0 and we get again something like this. The offset that we're applying here, 149 150 00:13:10,300 --> 00:13:18,220 if we go to the quick documentation essentially specifies the fraction of the radius with which to offset 150 151 00:13:18,520 --> 00:13:27,380 each wedge. Since we have two wedges, the red one and the blue one, we're supplying two numbers. So in my 151 152 00:13:27,380 --> 00:13:31,250 case, I'm offsetting one of them by 0.1.
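The styled chart could be sketched like this. The two hex codes are placeholders I'm assuming for a pink and a light blue, since the exact values pasted from the palette aren't spelled out here; swap in whichever codes you copied.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

sizes = [1896, 3900]
category_names = ['Spam', 'Legit Mail']

# Placeholder hex codes (assumption): one light pink, one light blue.
custom_colors = ['#ff7675', '#74b9ff']

plt.figure(figsize=(2, 2), dpi=227)

# explode takes one offset per wedge, as a fraction of the pie's radius.
wedges, texts, autotexts = plt.pie(sizes, labels=category_names,
                                   textprops={'fontsize': 6},
                                   startangle=90, autopct='%1.0f%%',
                                   colors=custom_colors,
                                   explode=[0, 0.1])

# Each wedge keeps its own center; the exploded one has moved off the origin.
first_color = to_hex(wedges[0].get_facecolor())
cx, cy = wedges[1].center
offset = (cx ** 2 + cy ** 2) ** 0.5
print(first_color, offset)
```

The colors and the explode values are matched to the wedges by position, so the first entry in each list belongs to the first entry in sizes.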
152 153 00:13:31,640 --> 00:13:38,560 If I offset both of them by 0.1, then the gap between the two wedges doubles. 153 154 00:13:38,580 --> 00:13:42,840 Now I quite like the look of a small offset, 0.1 will do for me. 154 155 00:13:43,950 --> 00:13:51,280 In the next lesson I'll show you how to take these designs a little further and make a donut chart. I'll 155 156 00:13:51,280 --> 00:13:52,140 see you there.