0
1
00:00:00,300 --> 00:00:07,350
A key part of data exploration that you want to do in conjunction with data visualization is looking
1

2
00:00:07,350 --> 00:00:13,010
at some descriptive statistics of the data that you're working with.
2

3
00:00:13,020 --> 00:00:18,660
In the previous lessons we saw that our data set contained everything from price data to index data
3

4
00:00:18,990 --> 00:00:20,670
to dummy variables,
4

5
00:00:20,670 --> 00:00:24,210
and these were all measured in different ways.
5

6
00:00:24,210 --> 00:00:30,810
So in this lesson I'm going to show you how you can pull up various different statistics on a dataframe
6

7
00:00:31,080 --> 00:00:38,100
which you could have a look at in conjunction with your data visualizations. Now, I personally think
7

8
00:00:38,100 --> 00:00:45,360
this topic of descriptive statistics can be so utterly dull that I want to introduce it to you with
8

9
00:00:45,420 --> 00:00:53,280
a short story. Imagine that it's an election year and the two leading political candidates are having
9

10
00:00:53,280 --> 00:01:00,540
their big debate on television. The very fictional conservative candidate by the name of Ronald Dump
10

11
00:01:00,780 --> 00:01:09,120
starts off the debate and says "Friends, Romans, countrymen lend me your ears. Under my leadership the economy
11

12
00:01:09,120 --> 00:01:16,660
has been doing splendidly and the average family is reaping the benefits. Over the past four years,
12

13
00:01:16,710 --> 00:01:21,090
average income has increased by over 30000 dollars.
13

14
00:01:21,150 --> 00:01:28,480
Vote for me." And then it's the opposition candidate's turn. Candidate Artillery Hinton takes the floor
14

15
00:01:28,510 --> 00:01:31,050
and says "Don't listen to Ronald,
15

16
00:01:31,360 --> 00:01:37,740
today middle income families are earning 30000 dollars less than when Ronald took office.
16

17
00:01:37,750 --> 00:01:45,790
My policies will help the typical family. Vote for me." So, hearing these two statements, you might wonder,
17

18
00:01:46,450 --> 00:01:53,100
is one of the politicians lying? Or can both of these statements be true at the same time?
18

19
00:01:53,140 --> 00:01:57,400
How can we reconcile these two seemingly contradictory claims?
19

20
00:01:57,510 --> 00:02:04,080
Now it turns out that even though the two statements sound very similar, these two politicians are not
20

21
00:02:04,080 --> 00:02:05,870
talking about the same thing.
21

22
00:02:06,520 --> 00:02:15,630
The Ronald is talking about the mean, while Artillery is talking about the median. The mean is another
22

23
00:02:15,630 --> 00:02:18,960
word for average and to calculate the mean income,
23

24
00:02:18,960 --> 00:02:26,840
you simply add up all the families' incomes and divide the total by the number of families. The median
24

25
00:02:26,840 --> 00:02:33,650
income on the other hand is calculated by arranging all the family incomes from lowest to highest and
25

26
00:02:33,650 --> 00:02:35,810
then picking the one in the middle.
26

27
00:02:35,810 --> 00:02:43,910
So in contrast to the mean the median is not affected so much by big outliers. This whole discussion
27

28
00:02:43,910 --> 00:02:52,630
in fact goes back to this idea of a distribution. The shape of a distribution determines statistical
28

29
00:02:52,630 --> 00:02:55,700
measures like the mean or the median.
29

30
00:02:55,870 --> 00:02:59,970
Remember this green histogram that I created with imaginary house price data?
30

31
00:03:00,010 --> 00:03:04,800
This is in the shape of our old friend the normal distribution.
31

32
00:03:05,080 --> 00:03:13,790
In this case both the median and the mean would be the same. However,
32

33
00:03:13,850 --> 00:03:17,300
what if this distribution was not normal?
33

34
00:03:17,300 --> 00:03:24,460
What if we didn't have this pretty and imaginary bell shaped curve for family incomes?
34

35
00:03:24,680 --> 00:03:28,880
In that case the mean and the median won't be the same.
35

36
00:03:28,910 --> 00:03:37,090
And this is the story of the politicians. So the distribution is the second part of our answer.
36

37
00:03:37,130 --> 00:03:44,270
The thing that happened to reconcile the two politicians' statements is that the income distribution
37

38
00:03:44,420 --> 00:03:46,190
has changed.
38

39
00:03:46,190 --> 00:03:53,660
This is how it is possible for the mean and the median to move in separate directions.
39

40
00:03:53,830 --> 00:04:02,590
You see, if most people got slightly poorer but very, very few people became enormously wealthy, going
40

41
00:04:02,680 --> 00:04:09,370
all the way out to the right of this distribution into the tail then the mean and the median could be
41

42
00:04:09,370 --> 00:04:12,760
trading places like in this slide.
42
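Here is a minimal sketch of the story above in pandas. The incomes are invented for illustration (they are not from the lesson's dataset): four households get slightly poorer and one becomes enormously wealthy, so the mean rises while the median falls.

```python
import pandas as pd

# Hypothetical household incomes in thousands of dollars (made-up numbers).
before = pd.Series([40, 50, 60, 70, 80])

# Four households get slightly poorer; one moves far out into the right tail.
after = pd.Series([38, 48, 57, 67, 300])

print(before.mean(), before.median())  # 60.0 60.0
print(after.mean(), after.median())    # 102.0 57.0
```

Both candidates could point at this same data: the "average" went up by 42, while the "typical" household in the middle is 3 worse off.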

43
00:04:12,760 --> 00:04:18,940
So I hope this little story got you a little bit more interested in this topic of descriptive statistics.
43

44
00:04:18,940 --> 00:04:24,640
So at this stage you might be asking: well then, what are a couple of good things to look at to better
44

45
00:04:24,640 --> 00:04:28,960
understand the data? We're gonna be looking at 4 things for now.
45

46
00:04:28,960 --> 00:04:36,490
We're gonna be looking at the smallest value, the largest value, the mean value and the median value in
46

47
00:04:36,580 --> 00:04:38,340
our dataset.
47

48
00:04:38,410 --> 00:04:45,160
Lucky for us, the Python pandas module makes all of this super easy, and the pandas dataframe already
48

49
00:04:45,160 --> 00:04:50,170
has a number of handy methods which we can use to instantly pull up this kind of information in our
49

50
00:04:50,170 --> 00:04:51,580
notebook.
50

51
00:04:51,580 --> 00:04:57,600
Let me show you how. The first thing I'm going to do is add a little section heading here that
51

52
00:04:57,600 --> 00:05:00,330
reads "Descriptive Statistics".
52

53
00:05:05,280 --> 00:05:11,880
And now let me show you how we can pull up the smallest value in a particular column of our data
53

54
00:05:11,880 --> 00:05:15,990
frame. Say we want to know the smallest house price.
54

55
00:05:15,990 --> 00:05:22,500
We can select a particular column, which is a Series object, with the square bracket notation.
55

56
00:05:22,620 --> 00:05:30,570
If I type "data['PRICE']", with the column name in single quotes, and then put a dot after
56

57
00:05:30,570 --> 00:05:39,930
it and call the min method, "min()", and hit Shift+Enter, we can see that the smallest house
57

58
00:05:39,930 --> 00:05:44,860
price is 5000 U.S. dollars.
58

59
00:05:44,870 --> 00:05:50,830
Now I don't know about you, but I'd really like to see this house.
59

60
00:05:50,910 --> 00:05:54,060
I mean for 5000 in Boston
60

61
00:05:54,060 --> 00:05:59,920
I'm imagining some sort of rusty trailer on the outskirts of the city without running water and electricity.
61

62
00:06:01,330 --> 00:06:05,610
Maybe with one of the radial highways on a bridge overhead.
62

63
00:06:05,880 --> 00:06:11,800
But anyhow, let's see what the largest value is using the sister method max(),
63

64
00:06:11,860 --> 00:06:23,910
so "data['PRICE'].max()" will bring up 50. And since this is in thousands,
64

65
00:06:23,910 --> 00:06:26,580
this is fifty thousand dollars.
65
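As a rough sketch of the two calls just described, here they are on a tiny stand-in dataframe. The values below are invented (only the column name PRICE and the "in thousands" convention come from the lesson), so treat this as an illustration rather than the notebook's actual data.

```python
import pandas as pd

# Tiny stand-in for the lesson's dataframe; values are made up, PRICE is in thousands.
data = pd.DataFrame({
    "PRICE": [5.0, 21.2, 24.0, 50.0],
    "RM": [5.9, 6.2, 6.6, 8.8],
})

smallest = data["PRICE"].min()  # select the column (a Series), then call min()
largest = data["PRICE"].max()   # the sister method max()

print(smallest, largest)  # 5.0 50.0
```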

66
00:06:26,580 --> 00:06:30,790
Now I know this doesn't sound like a lot but this is in the 1970s.
66

67
00:06:30,810 --> 00:06:36,860
So things were a bit cheaper back then. Now the cool thing about pandas is that you don't have to do
67

68
00:06:36,860 --> 00:06:40,380
this for every single column in the data frame.
68

69
00:06:40,460 --> 00:06:46,190
You can actually pull up the minimum and maximum values on the dataframe object itself.
69

70
00:06:46,190 --> 00:06:52,950
You can pull it up on the dataframe as a whole. So if I write "data.min()"
70

71
00:06:53,060 --> 00:06:58,400
I can see the minimum value in every single column at the same time.
71

72
00:06:58,760 --> 00:07:02,210
Of course the same thing goes with "data.max()"
72

73
00:07:02,210 --> 00:07:07,070
which brings up the largest value in every single column.
73

74
00:07:07,280 --> 00:07:12,020
So that's the largest and smallest values covered.
74

75
00:07:12,140 --> 00:07:16,640
The other descriptive statistics that we've talked about that can be pulled up really easily were the
75

76
00:07:16,640 --> 00:07:17,840
mean and the median.
76

77
00:07:18,200 --> 00:07:20,770
So "data.mean()"
77

78
00:07:20,780 --> 00:07:29,850
will bring up the average value of every single feature and "data.median()" will bring up the typical
78

79
00:07:29,850 --> 00:07:36,490
value or the middle value of every single feature in the data frame.
79
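To see the four dataframe-wide calls side by side, here is a small sketch on invented data (the frame and its values are assumptions for illustration). It also uses "agg", a pandas dataframe method that is not shown in the lesson, to collect the same four statistics in one table.

```python
import pandas as pd

# Made-up stand-in data; PRICE is in thousands, RM is the number of rooms.
data = pd.DataFrame({
    "PRICE": [5.0, 21.0, 24.0, 50.0],
    "RM": [5.0, 6.0, 7.0, 9.0],
})

print(data.min())     # smallest value in every column
print(data.max())     # largest value in every column
print(data.mean())    # average value of every column
print(data.median())  # middle value of every column

# The same four statistics in a single table, one row per statistic.
stats = data.agg(["min", "max", "mean", "median"])
print(stats)
```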

80
00:07:36,510 --> 00:07:46,020
Now that's all very well and good, but the thing is, what if you're like me? What if you're lazy? Typing
80

81
00:07:46,020 --> 00:07:52,230
all this stuff in and getting out the above output is not very satisfactory.
81

82
00:07:52,230 --> 00:07:59,190
What I want is I want all my stats at the same time and I want it to be formatted in a way that I can
82

83
00:07:59,400 --> 00:08:01,300
easily read.
83

84
00:08:01,320 --> 00:08:05,630
This is where the describe method comes to the rescue:
84

85
00:08:05,670 --> 00:08:13,620
"data.describe()" will bring up a whole bunch of summary statistics from the data frame all
85

86
00:08:13,620 --> 00:08:15,280
at the same time.
86

87
00:08:15,360 --> 00:08:16,410
I love this method.
87

88
00:08:16,440 --> 00:08:19,680
This is super, super useful.
88

89
00:08:19,690 --> 00:08:25,230
Now you may be looking at this and thinking: Hey, wait a minute, where is the median?
89

90
00:08:25,230 --> 00:08:27,180
Don't cheat me out of the median.
90

91
00:08:27,320 --> 00:08:29,040
Well, not to worry.
91

92
00:08:29,040 --> 00:08:32,790
It's right here in this 50% row.
92

93
00:08:32,790 --> 00:08:36,020
This is where the median values are hiding.
93
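If you want to convince yourself that the 50% row really is the median, a quick check along these lines should do it. The dataframe here is a made-up stand-in, not the lesson's data.

```python
import pandas as pd

# Made-up stand-in data with an odd number of rows, so the median is an actual row value.
data = pd.DataFrame({
    "PRICE": [5.0, 18.0, 21.0, 24.0, 50.0],
    "RM": [5.6, 5.9, 6.2, 6.6, 8.8],
})

summary = data.describe()
print(summary)

# The "50%" row is the 50th percentile, which is the median of each column.
same_as_median = (summary.loc["50%"] == data.median()).all()
print(same_as_median)  # True
```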

94
00:08:36,120 --> 00:08:36,990
Cool.
94

95
00:08:36,990 --> 00:08:42,390
So this table is something very handy to pull up when you're working with a new data frame that you
95

96
00:08:42,390 --> 00:08:43,790
haven't seen before.
96

97
00:08:43,860 --> 00:08:51,360
You take the data frame and simply call the describe method and this will generate the descriptive statistics
97

98
00:08:51,480 --> 00:08:58,860
that summarize the central tendency, dispersion, and shape of the dataset's distribution.
98

99
00:08:58,860 --> 00:09:05,550
Just note that this excludes not-a-number (NaN) values, if there are any in your data frame.
99
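A small sketch of what that exclusion looks like in practice, on an invented column with one missing value (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# One of the four made-up room counts is missing.
data = pd.DataFrame({"RM": [5.0, 6.0, np.nan, 7.0]})

summary = data.describe()
print(summary)

# The NaN is skipped: count is 3, and the mean is taken over the 3 real values.
print(summary.loc["count", "RM"])  # 3.0
print(summary.loc["mean", "RM"])   # 6.0
```

The count row is a handy place to spot missing data: if it is lower than the number of rows in the dataframe, some values in that column are NaN.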

100
00:09:06,210 --> 00:09:07,310
So it's quite clever.
100

101
00:09:07,320 --> 00:09:08,600
Good stuff.
101

102
00:09:08,610 --> 00:09:14,250
Now looking at this, one of the things that I found quite interesting and that I'm noting down for later
102

103
00:09:14,700 --> 00:09:21,240
is that there is an outlier in the number of rooms category that might be worth investigating.
103

104
00:09:21,240 --> 00:09:29,420
We can see this in the summary statistics right here. The reason I say it's an outlier is because the
104

105
00:09:29,510 --> 00:09:34,600
average number of rooms and the median number of rooms are both around 6.
105

106
00:09:34,650 --> 00:09:42,720
We can also see that the middle half of the properties have between 5.9 and 6.6 rooms.
106

107
00:09:42,770 --> 00:09:51,230
So this property here with almost 9 rooms is gigantic and also quite far from the norm.
107
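One way to turn that eyeball judgment into a check is the common 1.5 × IQR rule of thumb. To be clear, this is not the method used in the lesson, just a widely used convention, and the room counts below are invented for illustration.

```python
import pandas as pd

# Made-up room counts: most sit around 6, one property has almost 9 rooms.
rooms = pd.Series([5.6, 5.9, 6.0, 6.2, 6.4, 6.6, 8.8], name="RM")

q1 = rooms.quantile(0.25)  # the "25%" row in describe()
q3 = rooms.quantile(0.75)  # the "75%" row in describe()
iqr = q3 - q1              # spread of the middle half of the data

# Rule of thumb: flag anything more than 1.5 * IQR outside the middle half.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = rooms[(rooms < lower) | (rooms > upper)]

print(outliers.tolist())  # [8.8]
```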

108
00:09:51,290 --> 00:09:56,900
So yeah I'm going to make a mental note of this for the analysis stage. In the next lessons we're gonna
108

109
00:09:56,900 --> 00:10:07,190
be looking at if and how our explanatory variables, our 13 features, move together. We're gonna be looking
109

110
00:10:07,190 --> 00:10:10,840
at their correlation. I'll see you there.