0 1 00:00:00,300 --> 00:00:07,350 A key part of data exploration that you want to do in conjunction with data visualization is looking 1 2 00:00:07,350 --> 00:00:13,010 at some descriptive statistics of the data that you're working with. 2 3 00:00:13,020 --> 00:00:18,660 In the previous lessons we saw that our data set contained everything from price data to index data 3 4 00:00:18,990 --> 00:00:20,670 to dummy variables, 4 5 00:00:20,670 --> 00:00:24,210 and these were all measured in different ways. 5 6 00:00:24,210 --> 00:00:30,810 So in this lesson I'm going to show you how you can pull up various different statistics on a dataframe 6 7 00:00:31,080 --> 00:00:38,100 which you could have a look at in conjunction with your data visualizations. Now, I personally think 7 8 00:00:38,100 --> 00:00:45,360 this topic of descriptive statistics can be so utterly dull that I want to introduce it to you with 8 9 00:00:45,420 --> 00:00:53,280 a short story. Imagine that it's an election year and the two leading political candidates are having 9 10 00:00:53,280 --> 00:01:00,540 their big debate on television. The very fictional conservative candidate by the name of Ronald Dump 10 11 00:01:00,780 --> 00:01:09,120 starts off the debate and says "Friends, Romans, countrymen, lend me your ears. Under my leadership the economy 11 12 00:01:09,120 --> 00:01:16,660 has been doing splendidly and the average family is reaping the benefits. Over the past four years, 12 13 00:01:16,710 --> 00:01:21,090 average income has increased by over 30,000 dollars. 13 14 00:01:21,150 --> 00:01:28,480 Vote for me." And then it's the opposition candidate's turn. Candidate Artillery Hinton takes the floor 14 15 00:01:28,510 --> 00:01:31,050 and says "Don't listen to Ronald. 15 16 00:01:31,360 --> 00:01:37,740 Today middle-income families are earning 30,000 dollars less than when Ronald took office. 16 17 00:01:37,750 --> 00:01:45,790 My policies will help the typical family. Vote for me." 
So hearing these two statements you might wonder, 17 18 00:01:46,450 --> 00:01:53,100 is one of the politicians lying? Or can both of these statements be true at the same time? 18 19 00:01:53,140 --> 00:01:57,400 How can we reconcile these two seemingly contradictory claims? 19 20 00:01:57,510 --> 00:02:04,080 Now it turns out that even though the two statements sound very similar, these two politicians are not 20 21 00:02:04,080 --> 00:02:05,870 talking about the same thing. 21 22 00:02:06,520 --> 00:02:15,630 The Ronald is talking about the mean, while Artillery is talking about the median. The mean is another 22 23 00:02:15,630 --> 00:02:18,960 word for average and to calculate the mean income, 23 24 00:02:18,960 --> 00:02:26,840 you simply add up all the families' incomes and divide the total by the number of families. The median 24 25 00:02:26,840 --> 00:02:33,650 income, on the other hand, is calculated by arranging all the family incomes from lowest to highest and 25 26 00:02:33,650 --> 00:02:35,810 then picking the one in the middle. 26 27 00:02:35,810 --> 00:02:43,910 So in contrast to the mean, the median is not affected so much by big outliers. This whole discussion 27 28 00:02:43,910 --> 00:02:52,630 in fact goes back to this idea of a distribution. The shape of a distribution determines statistical 28 29 00:02:52,630 --> 00:02:55,700 measures like the mean or the median. 29 30 00:02:55,870 --> 00:02:59,970 Remember this green histogram that I created with imaginary house price data? 30 31 00:03:00,010 --> 00:03:04,800 This is in the shape of our old friend the normal distribution. 31 32 00:03:05,080 --> 00:03:13,790 In this case both the median and the mean would be the same. However, 32 33 00:03:13,850 --> 00:03:17,300 what if this distribution was not normal? 33 34 00:03:17,300 --> 00:03:24,460 What if we didn't have this pretty and imaginary bell-shaped curve for family incomes? 
34 35 00:03:24,680 --> 00:03:28,880 In that case the mean and the median won't be the same. 35 36 00:03:28,910 --> 00:03:37,090 And this is the story of the politicians. So the distribution is the second part of our answer. 36 37 00:03:37,130 --> 00:03:44,270 The thing that happened to reconcile the two politicians' statements is that the income distribution 37 38 00:03:44,420 --> 00:03:46,190 has changed. 38 39 00:03:46,190 --> 00:03:53,660 This is how it is possible for the mean and the median to move in separate directions. 39 40 00:03:53,830 --> 00:04:02,590 You see, if most people got slightly poorer but then a very, very few people became enormously wealthy, going 40 41 00:04:02,680 --> 00:04:09,370 all the way out to the right of this distribution into the tail, then the mean and the median could be 41 42 00:04:09,370 --> 00:04:12,760 trading places like in this slide. 42 43 00:04:12,760 --> 00:04:18,940 So I hope this little story got you a little bit more interested in this topic of descriptive statistics. 43 44 00:04:18,940 --> 00:04:24,640 So at this stage you might be asking: well then, what are a couple of good things to look at to better 44 45 00:04:24,640 --> 00:04:28,960 understand the data? We're gonna be looking at 4 things for now. 45 46 00:04:28,960 --> 00:04:36,490 We're gonna be looking at the smallest value, the largest value, the mean value and the median value in 46 47 00:04:36,580 --> 00:04:38,340 our dataset. 47 48 00:04:38,410 --> 00:04:45,160 Lucky for us, the Python pandas module makes all of this super easy and the pandas dataframe already 48 49 00:04:45,160 --> 00:04:50,170 has a number of handy methods which we can use to instantly pull up this kind of information in our 49 50 00:04:50,170 --> 00:04:51,580 notebook. 50 51 00:04:51,580 --> 00:04:57,600 Let me show you how. The first thing I'm going to do is add a little section heading here that 51 52 00:04:57,600 --> 00:05:00,330 reads "Descriptive Statistics". 
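To put rough numbers on the politicians' story, here is a minimal sketch using only Python's standard library. The five family incomes are invented purely for illustration and are not taken from any real data: four families each end up earning 30,000 dollars less, while one family moves far out into the right tail of the distribution.

```python
from statistics import mean, median

# Hypothetical family incomes in dollars, invented purely for illustration.
incomes_before = [50_000, 60_000, 70_000, 80_000, 90_000]
# Four families each earn 30,000 less; one family becomes enormously wealthy.
incomes_after = [20_000, 30_000, 40_000, 50_000, 400_000]

# The mean adds everything up and divides by the number of families.
mean_change = mean(incomes_after) - mean(incomes_before)
# The median sorts the incomes and picks the one in the middle.
median_change = median(incomes_after) - median(incomes_before)

print(f"Change in mean income:   {mean_change:+,}")
print(f"Change in median income: {median_change:+,}")
```

With these made-up numbers the mean rises by 38,000 dollars while the median falls by 30,000 dollars, so both candidates could be telling the truth about the same set of families.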
52 53 00:05:05,280 --> 00:05:11,880 And now let me show you how we can pull up the smallest value in a particular column of our data 53 54 00:05:11,880 --> 00:05:15,990 frame. Say we want to know the smallest house price. 54 55 00:05:15,990 --> 00:05:22,500 We can select a particular column or a series object with the square bracket notation. 55 56 00:05:22,620 --> 00:05:30,570 If I type "data['PRICE']", with the column name surrounded by single quotes, and then put a dot after 56 57 00:05:30,570 --> 00:05:39,930 it and call the min method, "min()", and hit Shift+Enter, we can see that the smallest house 57 58 00:05:39,930 --> 00:05:44,860 price is 5000 U.S. dollars. 58 59 00:05:44,870 --> 00:05:50,830 Now I don't know about you, but I'd really like to see this house. 59 60 00:05:50,910 --> 00:05:54,060 I mean, for 5000 in Boston 60 61 00:05:54,060 --> 00:05:59,920 I'm imagining some sort of rusty trailer on the outskirts of the city without running water and electricity. 61 62 00:06:01,330 --> 00:06:05,610 Maybe with one of the radial highways on a bridge overhead. 62 63 00:06:05,880 --> 00:06:11,800 But anyhow, let's see what the largest value is using the sister method max(), 63 64 00:06:11,860 --> 00:06:23,910 so "data['PRICE'].max()" will bring up 50. And since this is in thousands, 64 65 00:06:23,910 --> 00:06:26,580 this is fifty thousand dollars. 65 66 00:06:26,580 --> 00:06:30,790 Now I know this doesn't sound like a lot but this is in the 1970s. 66 67 00:06:30,810 --> 00:06:36,860 So things were a bit cheaper back then. Now the cool thing about pandas is that you don't have to do 67 68 00:06:36,860 --> 00:06:40,380 this for every single column in the data frame. 68 69 00:06:40,460 --> 00:06:46,190 You can actually pull up the minimum and maximum values on the dataframe object itself. 69 70 00:06:46,190 --> 00:06:52,950 You can pull it up on the dataframe as a whole. 
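As a rough sketch of the column lookups described above, the snippet below builds a tiny stand-in dataframe rather than loading the lesson's real data. The four rows and the "ROOMS" column name are hypothetical; only the "PRICE" column name and the fact that prices are in thousands of dollars come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's dataframe; the values are invented.
# PRICE is measured in thousands of dollars, as in the lesson.
data = pd.DataFrame({
    "PRICE": [24.0, 5.0, 50.0, 21.6],
    "ROOMS": [6.5, 5.5, 8.5, 6.0],  # hypothetical column name
})

# Square brackets select one column as a Series; min() and max() are Series methods.
lowest = data["PRICE"].min()
highest = data["PRICE"].max()

# Convert from thousands of dollars to dollars for display.
print(f"Cheapest house: {lowest * 1000:,.0f} dollars")
print(f"Most expensive house: {highest * 1000:,.0f} dollars")
```

Column labels are case-sensitive in pandas, so with this dataframe "data['price']" would raise a KeyError while "data['PRICE']" works.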
So if I write "data.min()" 70 71 00:06:53,060 --> 00:06:58,400 I can see the minimum value in every single column at the same time. 71 72 00:06:58,760 --> 00:07:02,210 Of course the same thing goes with "data.max()" 72 73 00:07:02,210 --> 00:07:07,070 which brings up the largest value in every single column. 73 74 00:07:07,280 --> 00:07:12,020 So that's the largest and smallest values covered. 74 75 00:07:12,140 --> 00:07:16,640 The other descriptive statistics that we've talked about that can be pulled up really easily are the 75 76 00:07:16,640 --> 00:07:17,840 mean and the median. 76 77 00:07:18,200 --> 00:07:20,770 So "data.mean()" 77 78 00:07:20,780 --> 00:07:29,850 will bring up the average value of every single feature and "data.median()" will bring up the typical 78 79 00:07:29,850 --> 00:07:36,490 value or the middle value of every single feature in the data frame. 79 80 00:07:36,510 --> 00:07:46,020 Now that's all very well and good, but the thing is, what if you're like me? What if you're lazy? Typing 80 81 00:07:46,020 --> 00:07:52,230 all this stuff in and getting out the above output is not very satisfactory. 81 82 00:07:52,230 --> 00:07:59,190 What I want is I want all my stats at the same time and I want it to be formatted in a way that I can 82 83 00:07:59,400 --> 00:08:01,300 easily read. 83 84 00:08:01,320 --> 00:08:05,630 This is where the describe method comes to the rescue - "data. 84 85 00:08:05,670 --> 00:08:13,620 describe()" will bring up a whole bunch of summary statistics from the data frame all 85 86 00:08:13,620 --> 00:08:15,280 at the same time. 86 87 00:08:15,360 --> 00:08:16,410 I love this method. 87 88 00:08:16,440 --> 00:08:19,680 This is super, super useful. 88 89 00:08:19,690 --> 00:08:25,230 Now you may be looking at this and thinking: Hey, wait a minute, where is the median? 89 90 00:08:25,230 --> 00:08:27,180 Don't cheat me out of the median. 90 91 00:08:27,320 --> 00:08:29,040 Well, not to worry. 
91 92 00:08:29,040 --> 00:08:32,790 It's right here in this 50% row. 92 93 00:08:32,790 --> 00:08:36,020 This is where the median values are hiding. 93 94 00:08:36,120 --> 00:08:36,990 Cool. 94 95 00:08:36,990 --> 00:08:42,390 So this table is something very handy to pull up when you're working with a new data frame that you 95 96 00:08:42,390 --> 00:08:43,790 haven't seen before. 96 97 00:08:43,860 --> 00:08:51,360 You take the data frame and simply call the describe method and this will generate the descriptive statistics 97 98 00:08:51,480 --> 00:08:58,860 that summarize the central tendency, dispersion, and shape of the dataset's distribution. 98 99 00:08:58,860 --> 00:09:05,550 Just note this excludes not-a-number, or NaN, values if there are any in your data frame. 99 100 00:09:06,210 --> 00:09:07,310 So it's quite clever. 100 101 00:09:07,320 --> 00:09:08,600 Good stuff. 101 102 00:09:08,610 --> 00:09:14,250 Now looking at this, one of the things that I found quite interesting and that I'm noting down for later 102 103 00:09:14,700 --> 00:09:21,240 is that there is an outlier in the number of rooms category that might be worth investigating. 103 104 00:09:21,240 --> 00:09:29,420 We can see this in the summary statistics right here. The reason I say it's an outlier is because the 104 105 00:09:29,510 --> 00:09:34,600 average number of rooms and the median number of rooms are both around 6. 105 106 00:09:34,650 --> 00:09:42,720 We can also see from the 25% and 75% rows that the middle half of the properties have between 5.9 and 6.6 rooms. 106 107 00:09:42,770 --> 00:09:51,230 So this property here with almost 9 rooms is gigantic and also quite far from the norm. 107 108 00:09:51,290 --> 00:09:56,900 So yeah, I'm going to make a mental note of this for the analysis stage. In the next lessons we're gonna 108 109 00:09:56,900 --> 00:10:07,190 be looking at if and how our explanatory variables, our 13 features, move together. We're gonna be looking 109 110 00:10:07,190 --> 00:10:10,840 at their correlation. 
I'll see you there.
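For reference, the dataframe-wide calls from this lesson can be sketched on a small made-up dataframe. This is an illustration under stated assumptions, not the lesson's actual notebook: the five rows and the "ROOMS" column name are hypothetical, and one missing value is included on purpose to show how NaN is handled.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lesson's dataframe; values are invented.
data = pd.DataFrame({
    "PRICE": [24.0, 5.0, 50.0, 21.5, 33.0],   # thousands of dollars
    "ROOMS": [6.5, 6.0, 8.75, np.nan, 6.25],  # hypothetical column with one NaN
})

# Each of these returns one value per column.
print(data.min())
print(data.max())
print(data.mean())
print(data.median())

# describe() collects count, mean, std, min, quartiles and max in one table.
summary = data.describe()
print(summary)

# The "50%" row of the summary holds the same values as data.median().
medians_match = bool(np.allclose(summary.loc["50%"], data.median()))

# NaN values are left out: ROOMS has five rows but only four are counted.
rooms_count = summary.loc["count", "ROOMS"]
print(medians_match, rooms_count)
```

Comparing the "max" row against the "75%" row in a table like this is one quick way to spot a value that sits far from the bulk of a column, such as the large ROOMS value in this made-up data.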