0 1 00:00:01,050 --> 00:00:07,260 So far we've had a cursory look at our dataset and checked for missing data. 1 2 00:00:07,260 --> 00:00:11,930 Now we're going to start looking at our data in a bit more detail. 2 3 00:00:12,240 --> 00:00:17,210 One of the most useful things to do as part of exploring it is to visualize it. 3 4 00:00:17,250 --> 00:00:24,510 It's true what they say, you know, a picture is worth a thousand words, and data visualizations don't just come 4 5 00:00:24,510 --> 00:00:31,050 in handy when it comes to, say, presenting a snazzy final report to your boss, your team or your client. 5 6 00:00:31,650 --> 00:00:32,820 Data visualizations 6 7 00:00:32,820 --> 00:00:38,160 also help us make sense of our data at the exploration stage. 7 8 00:00:38,190 --> 00:00:38,700 How so? 8 9 00:00:40,030 --> 00:00:44,860 Well, there's two things that we want to get a sense for right now. 9 10 00:00:44,860 --> 00:00:53,290 The first is the distribution of the data and the second is the outliers in our data. 10 11 00:00:53,290 --> 00:00:59,780 So what kind of visualization could we use at the exploration stage to spot both outliers and see the 11 12 00:00:59,780 --> 00:01:01,750 distribution? 12 13 00:01:01,750 --> 00:01:10,000 Well, enter our friend the humble histogram. The good old histogram is the first data visualization technique 13 14 00:01:10,060 --> 00:01:13,360 that we're going to cover. Histograms are pretty simple, 14 15 00:01:13,390 --> 00:01:20,500 they just show the number of instances in the data that have a certain value. A histogram is just a 15 16 00:01:20,500 --> 00:01:26,650 bar chart that shows us the frequency of a particular value. In this green histogram 16 17 00:01:26,650 --> 00:01:34,150 we've got the values on the x axis and the number of occurrences on the y axis. The taller an individual 17 18 00:01:34,150 --> 00:01:38,280 bar, the more occurrences there are in the dataset.
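To make the counting idea concrete, here's a minimal sketch of what a histogram does under the hood. The values are made up for illustration and are not from the lesson's dataset. It uses NumPy's `np.histogram` to count how many values land in each bin and prints a rough text version of the bars.

```python
import numpy as np

# ASSUMPTION: made-up values for illustration, not the course data
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Count how many values fall into each of 5 equal-width bins
counts, bin_edges = np.histogram(values, bins=5)

# Print one text "bar" per bin: the longer the bar, the more occurrences
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {'#' * int(count)} ({int(count)})")
```

Here the middle bin ends up with the longest bar because the value 3 occurs most often, which is exactly the "taller bar, more occurrences" idea.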
18 19 00:01:38,470 --> 00:01:44,980 And by plotting all the bars next to each other we get a certain shape. That shape is the distribution 19 20 00:01:45,400 --> 00:01:47,490 of our data. 20 21 00:01:47,510 --> 00:01:54,910 Now, I've created this green histogram here to show you what a normal distribution would look like. 21 22 00:01:54,940 --> 00:02:02,500 You can always spot a normal distribution by this very reassuring bell curve, meaning a lot of the observations 22 23 00:02:02,590 --> 00:02:06,260 are around the center or the mean of the distribution 23 24 00:02:06,340 --> 00:02:10,640 and very few observations are at the edges. 24 25 00:02:10,870 --> 00:02:16,420 The reason that we care about distributions in the first place is that they tell us a great deal 25 26 00:02:16,420 --> 00:02:21,290 about our data. For this dataset on Boston house prices, 26 27 00:02:21,470 --> 00:02:29,500 there are 13 independent variables including everything from the average number of rooms to zoning restrictions 27 28 00:02:29,710 --> 00:02:33,010 to the pupil-teacher ratio in the schools. 28 29 00:02:33,010 --> 00:02:41,590 And each of these variables is measured differently and a histogram is a good starting point for understanding 29 30 00:02:41,800 --> 00:02:47,790 what these features are, how they're measured and what the data actually looks like. 30 31 00:02:47,830 --> 00:02:55,300 Another reason why I bring up distributions at this stage is that many statistical tests and estimation 31 32 00:02:55,300 --> 00:03:00,900 techniques make certain assumptions about the kind of distribution the data follows. 32 33 00:03:00,940 --> 00:03:08,200 Now we're going to revisit this concept at the analysis stage when it comes to our regression residuals 33 34 00:03:08,830 --> 00:03:14,570 and it'll be very interesting to see what kind of distributions those have.
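If you'd like to see a bell curve like that for yourself, here's a rough sketch. It is not the code behind the slide: the samples are randomly generated, and the mean of 0, standard deviation of 1, sample size and seed are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a plain script; in Jupyter use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

# ASSUMPTION: synthetic data drawn from a normal distribution (mean 0, std 1)
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Most bars pile up around the mean, with very few at the edges
counts, bin_edges, _ = plt.hist(samples, bins=50, color="green", ec="black")
plt.xlabel("Value")
plt.ylabel("Nr. of occurrences")
plt.show()

# Share of observations within one standard deviation of the mean
within_one_sd = np.mean(np.abs(samples) < 1)
print(f"Share within one standard deviation: {within_one_sd:.2f}")
```

For a normal distribution roughly two thirds of the observations sit within one standard deviation of the mean, which is why the middle of the chart is so much taller than the edges.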
34 35 00:03:14,590 --> 00:03:21,520 So this is something to keep in mind for later, we're gonna see if we get a bell shaped curve like this 35 36 00:03:21,700 --> 00:03:24,420 at the analysis stage or not. 36 37 00:03:24,420 --> 00:03:26,450 Now let me ask you a question. 37 38 00:03:26,650 --> 00:03:34,240 What do you think the distribution of house prices will look like in our dataset? 38 39 00:03:34,270 --> 00:03:42,250 If you had to imagine the distribution of house prices in your head, what would it look like? Spoiler 39 40 00:03:42,250 --> 00:03:42,960 alert, 40 41 00:03:43,120 --> 00:03:48,920 the house prices are going to look nothing like this green histogram right here. 41 42 00:03:49,150 --> 00:03:58,630 In fact, the distribution of prices looks like this. As you can see, it's much more messy, right? 42 43 00:03:58,640 --> 00:04:08,630 Well that's real data for you. And also, our data has outliers - at the right hand edge of this distribution 43 44 00:04:09,600 --> 00:04:17,090 a normal distribution has very, very few observations, but the actual house prices have some pretty high 44 45 00:04:17,090 --> 00:04:18,490 bars right here. 45 46 00:04:18,500 --> 00:04:24,170 Now I'm not sure which houses these are in Boston, but the people living there are pretty well off to 46 47 00:04:24,200 --> 00:04:25,880 say the least. 47 48 00:04:25,880 --> 00:04:26,360 All right. 48 49 00:04:26,360 --> 00:04:34,940 So now it's time to write some Python code and learn how to draw histograms like this because I suspect 49 50 00:04:35,210 --> 00:04:38,960 you're not just going to take my word for it when it comes to these charts. 50 51 00:04:39,110 --> 00:04:42,230 So let's head back to Jupyter notebook. 51 52 00:04:42,230 --> 00:04:52,530 Let's start by inserting a markdown cell and putting the following subheading here "Visualizing Data - 52 53 00:04:54,490 --> 00:05:02,760 Histograms, Distributions and Bar Charts". 
53 54 00:05:02,760 --> 00:05:11,010 Now to draw a histogram in our notebook we're gonna make use of the matplotlib module, so we're gonna 54 55 00:05:11,010 --> 00:05:15,140 have to add some import statements at the very top. 55 56 00:05:15,240 --> 00:05:23,730 So I'm going to scroll back up and I'm going to add the following import statement "import 56 57 00:05:23,910 --> 00:05:33,120 matplotlib.pyplot as plt" and at the end I'm also going to add the "% 57 58 00:05:33,440 --> 00:05:33,900 matplotlib 58 59 00:05:33,900 --> 00:05:39,340 inline". 59 60 00:05:39,340 --> 00:05:41,740 And let me hit Shift+Enter. 60 61 00:05:42,220 --> 00:05:43,630 Now this last line, 61 62 00:05:43,630 --> 00:05:50,050 if you recall, was so that our charts would show up when we export our notebooks. 62 63 00:05:50,050 --> 00:05:54,420 So this line of code is really Jupyter notebook specific. 63 64 00:05:54,760 --> 00:06:02,920 Now to draw our histogram we're going to use our matplotlib module and call the hist function. 64 65 00:06:02,920 --> 00:06:08,920 So we're gonna write "plt.hist()" 65 66 00:06:08,920 --> 00:06:12,910 and now we need to supply some arguments. 66 67 00:06:12,910 --> 00:06:20,610 The first input to this function is what should be plotted on the histogram. In our case, 67 68 00:06:20,610 --> 00:06:28,530 we're gonna start plotting the values from our target, namely the house prices that are given in thousands. 68 69 00:06:28,530 --> 00:06:31,800 This was inside our dataframe's price column, 69 70 00:06:31,830 --> 00:06:34,470 if you remember. So I'm going to write "data[]" 70 71 00:06:37,350 --> 00:06:49,260 and then the string "PRICE" in all caps. On the next line I'm going to put "plt.show()" and 71 72 00:06:49,280 --> 00:06:56,020 hit Shift+Enter and this is the output that we'll get. 
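Put together, the cell from this step might look roughly like the sketch below. One big assumption: the lesson's dataframe isn't available here, so `data` is a plain dict holding made-up prices, just so that `data['PRICE']` reads the same way. The numbers are synthetic and won't match the chart in the video.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a plain script; in Jupyter use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

# ASSUMPTION: synthetic stand-in for the lesson's dataframe.
# Prices are in thousands, clipped to the range 5 to 50.
rng = np.random.default_rng(seed=0)
data = {"PRICE": np.clip(rng.normal(22, 9, size=500), 5, 50)}

# First argument: the values to plot. With no other arguments,
# matplotlib groups them into 10 bins by default.
counts, bin_edges, _ = plt.hist(data["PRICE"])
plt.show()
```

`plt.hist` also returns the bin counts and bin edges, which is handy if you want to check the numbers behind the picture.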
Now, 72 73 00:06:56,090 --> 00:07:02,860 the thing about this histogram function is that we can supply more arguments to customize the look and 73 74 00:07:02,860 --> 00:07:05,750 feel of our histogram. 74 75 00:07:05,980 --> 00:07:16,690 For example, we can supply an argument called "bins" and bins is going to determine how our prices are 75 76 00:07:16,690 --> 00:07:20,460 grouped together to form the individual bars. 76 77 00:07:20,470 --> 00:07:27,600 So I'm going to put "bins = 3" and hit Shift+Enter. 77 78 00:07:28,210 --> 00:07:32,840 If we put "bins = 3" then we only get three bars. 78 79 00:07:33,100 --> 00:07:37,640 All our house prices are grouped into one of these three bars. 79 80 00:07:37,660 --> 00:07:40,350 Now this might be a little difficult for you to see. 80 81 00:07:40,360 --> 00:07:44,250 So what I'm going to do is I'm going to make my chart larger 81 82 00:07:44,260 --> 00:07:56,080 first of all, so we can do this with "plt.figure(figsize = )" and then supply a tuple, 82 83 00:07:56,210 --> 00:07:58,780 I'm going to say 10 and 6. 83 84 00:07:59,800 --> 00:08:02,940 So this is gonna make my chart a lot larger. 84 85 00:08:04,340 --> 00:08:06,590 But I want to make this more explicit still. 85 86 00:08:06,620 --> 00:08:12,320 I'm going to show a black outline of the actual bins in this chart. 86 87 00:08:12,590 --> 00:08:22,290 So there's something called edge color, ec, that I can supply as an argument to this histogram function. 87 88 00:08:22,340 --> 00:08:31,050 So I'm going to say "ec = 'black'" and now my histogram will look like this. 88 89 00:08:31,050 --> 00:08:38,910 So now we can really see that I've only got three bins - all our house prices are either grouped into this 89 90 00:08:38,910 --> 00:08:47,510 first group here, up to 20000, or the second bar here between 20000 and say 35000, or this third 90 91 00:08:47,540 --> 00:08:53,250 bar here, between 35000 and 50000 dollars.
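Here's a sketch of the three-bin version with the larger figure and the black outlines. As before, `data` is an assumed stand-in (a plain dict of synthetic prices, not the lesson's dataframe), so the exact bin edges will differ from the ones read off the chart in the video.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a plain script; in Jupyter use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

# ASSUMPTION: synthetic stand-in for the lesson's dataframe (prices in thousands)
rng = np.random.default_rng(seed=0)
data = {"PRICE": np.clip(rng.normal(22, 9, size=500), 5, 50)}

# figsize is a (width, height) tuple in inches
fig = plt.figure(figsize=(10, 6))

# bins=3 groups every price into one of three bars; ec draws the bar outlines
counts, bin_edges, _ = plt.hist(data["PRICE"], bins=3, ec="black")
plt.show()

# The edges show where one bar stops and the next one starts
print("Bin edges:", np.round(bin_edges, 1))
print("Houses per bin:", counts.astype(int))
```

Three bins need four edges, and the edges are spaced evenly between the smallest and the largest price in the data.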
91 92 00:08:53,310 --> 00:08:56,590 Now of course you can play around with this input here. 92 93 00:08:56,730 --> 00:08:59,540 So we could also go to the other extreme, right? 93 94 00:08:59,550 --> 00:09:08,490 We could have, I don't know, 100 different bins. Hitting Shift+Enter on this will make our chart look like 94 95 00:09:08,490 --> 00:09:18,070 so. In other words by setting the number of bins we can set how granular we want our histogram to look. 95 96 00:09:18,370 --> 00:09:20,830 I tell you what, I'm going to go with 50. 96 97 00:09:24,910 --> 00:09:32,140 I think 50 is a good compromise between 3 and 300 and this conveys the information in 97 98 00:09:32,140 --> 00:09:34,920 the price column quite nicely. 98 99 00:09:36,250 --> 00:09:41,920 Now if you come back to this chart in three months' time you're probably not going to know what it's 99 100 00:09:41,920 --> 00:09:42,430 showing. 100 101 00:09:42,490 --> 00:09:46,500 So let's add some labels on the axes. 101 102 00:09:46,500 --> 00:10:02,250 So I'm going to write "plt.xlabel('Price in 000s')" and "plt.ylabel(' 102 103 00:10:03,510 --> 00:10:11,420 Nr. of houses')" and I'm going to have to format this as a string, I can't put it in there like so. 103 104 00:10:11,420 --> 00:10:17,030 So I need a single quote at the beginning and a single quote at the end. 104 105 00:10:17,040 --> 00:10:21,380 Now let me hit Shift+Enter and voila! 105 106 00:10:21,390 --> 00:10:22,570 It's a lot more clear. 106 107 00:10:22,680 --> 00:10:30,900 We've got the frequency on the left and the dollar price in thousands on the x axis, so far so good 107 108 00:10:30,900 --> 00:10:31,230 right? 108 109 00:10:32,160 --> 00:10:33,990 But wait what was that? 109 110 00:10:34,140 --> 00:10:38,280 Did I just hear you say you want to style this histogram to make it look prettier? 110 111 00:10:38,340 --> 00:10:45,750 Okay, okay, let me try and channel my inner designer and choose a different color.
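With 50 bins and the two axis labels, the cell might look something like this. Same caveat as before: `data` is an assumed stand-in, a plain dict of synthetic prices rather than the lesson's dataframe.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a plain script; in Jupyter use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

# ASSUMPTION: synthetic stand-in for the lesson's dataframe (prices in thousands)
rng = np.random.default_rng(seed=0)
data = {"PRICE": np.clip(rng.normal(22, 9, size=500), 5, 50)}

fig = plt.figure(figsize=(10, 6))

# 50 bins: more granular than 3, less noisy than hundreds
counts, bin_edges, _ = plt.hist(data["PRICE"], bins=50, ec="black")

# The labels are passed as strings, hence the quotes
plt.xlabel("Price in 000s")
plt.ylabel("Nr. of houses")
plt.show()
```

Single or double quotes both work for the label strings in Python, as long as the opening and closing quote match.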
To do that I'm going 111 112 00:10:45,750 --> 00:10:52,290 to add another argument to my function call here and that argument is gonna be called, surprise, surprise, 112 113 00:10:52,560 --> 00:10:57,980 "color" and it's also gonna be set equal to a string. 113 114 00:10:57,990 --> 00:11:05,310 Now I can add a hex code here as an input and now it turns out that my inner designer is actually super 114 115 00:11:05,310 --> 00:11:12,360 lazy and recommended that I use a Web site like Material Palette to pick a color instead. 115 116 00:11:12,360 --> 00:11:18,710 So you can see that this is just one of the many Web sites that curate a color palette for you. 116 117 00:11:18,930 --> 00:11:21,480 And I'm going to pick this blue one here, 117 118 00:11:21,480 --> 00:11:30,220 these two blue colors, and then I'm just going to copy this hex code right here and this is the hex code 118 119 00:11:30,430 --> 00:11:37,570 that I'm then going to paste in as an argument for the color right here. 119 120 00:11:37,590 --> 00:11:38,730 Let's see what this looks like. 120 121 00:11:40,130 --> 00:11:41,270 Voila! 121 122 00:11:41,490 --> 00:11:42,030 Cool. 122 123 00:11:42,030 --> 00:11:46,290 So I think we've got a really nice histogram here from matplotlib. 123 124 00:11:46,410 --> 00:11:53,100 And now you can also see for yourself that the histogram here ties out with what I've shown you earlier 124 125 00:11:53,220 --> 00:11:54,640 on the slide. 125 126 00:11:54,780 --> 00:11:56,880 Trust but verify. 126 127 00:11:56,940 --> 00:11:57,260 All right. 127 128 00:11:59,130 --> 00:12:04,060 You know apparently that saying was Ronald Reagan's favorite Russian proverb - 128 129 00:12:04,150 --> 00:12:09,700 Trust but verify. If you happen to know any other funky Russian proverbs, 129 130 00:12:09,850 --> 00:12:14,920 please do let me know in the Q&A section.
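For reference, here's a sketch of the styled histogram with the color argument added. Two assumptions to flag: `data` is again a plain dict of synthetic prices rather than the lesson's dataframe, and the hex code `#2196f3` is only a placeholder blue, since the exact code copied in the video isn't spelled out.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a plain script; in Jupyter use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

# ASSUMPTION: synthetic stand-in for the lesson's dataframe (prices in thousands)
rng = np.random.default_rng(seed=0)
data = {"PRICE": np.clip(rng.normal(22, 9, size=500), 5, 50)}

# ASSUMPTION: placeholder hex code, not necessarily the one picked in the video
bar_color = "#2196f3"

fig = plt.figure(figsize=(10, 6))

# color takes a string, here a hex code; ec keeps the black outlines
counts, bin_edges, bars = plt.hist(data["PRICE"], bins=50, ec="black", color=bar_color)
plt.xlabel("Price in 000s")
plt.ylabel("Nr. of houses")
plt.show()
```

Any hex code from a palette site can be swapped in for `bar_color`; matplotlib also accepts named colors such as `'green'`.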
But moving on, 130 131 00:12:14,920 --> 00:12:20,960 the thing is - we're actually not stuck with just using matplotlib for data viz. 131 132 00:12:21,250 --> 00:12:29,050 There are quite a few other lovely Python modules out there that do a fantastic job at data visualization 132 133 00:12:29,470 --> 00:12:33,940 and I can't wait to introduce you to an alternative to matplotlib. 133 134 00:12:34,600 --> 00:12:36,790 And that's a module called seaborn.