In the previous lessons we imported our Boston house price dataset into our Jupyter notebook. Now that we've successfully gathered our data, it's time to take a good, hard look at it. This is really the next step in our workflow: we've formulated our question, we've gathered our data, and now it's time to explore our dataset in depth. Oftentimes we'll be exploring, visualizing and cleaning this data more or less at the same time, simply because the problems in a dataset only become apparent after you start digging into it.

Okay, so imagine that you're back at your real estate job and the office intern has just plonked a big dataset down on your desk. What are the first things you'd want to understand about a fresh dataset? What's a good starting point for getting to grips with a dataset you've never seen before? What kinds of questions would you want to ask when you're first starting out? Let me show you my own starting point: the first six questions I ask myself whenever I start working with a new dataset I haven't seen before.

The first question I ask myself is: where does the data come from? What's the source of the data?
The second question is: can I find some sort of short description of what's in the dataset? This is important for understanding the all-important context in which the data was collected, and also how it was collected.

Third: how big is the dataset, actually? How many individual data points are there? Am I dealing with an enormous dataset or a small one? This matters from a practical point of view, because working with a dataset of 10 million data points will require very different techniques than working with a dataset of, say, 10 data points. For starters, my aging laptop will totally struggle to crunch a huge dataset, so it's important to figure out what sort of beast you're going to be dealing with. But dataset size isn't just important from a practical point of view; it also matters from a theoretical perspective, because many statistical tests that you'll be using become a lot more powerful as the sample size increases.

The fourth question is: how many features are there in the dataset? What do I mean by features? Well, for each data point, how many aspects were measured? How many entries are there for each row in the table? How many columns are there? That's what I mean by features. You and I are going to be looking at house prices shortly.
Each house is going to be a row in this dataset, and the number of features will tell us how much information we have about each house. It will help us figure out how many characteristics we're going to base our prediction of the house value on.

The next two questions we're going to ask ourselves are: "What are the names of the features?" and "What is the description of each feature?". These questions are crucial because we need to understand what the dataset is actually measuring. Sometimes you'll get datasets with pretty unintuitive measurements, so it's important to dig in and understand exactly what's contained in the data. For starters, you'll probably want to check the units used in each column. For example, is our price given in dollars or in thousands of dollars? These are just some of the basics to get right.

Okay, so now that we've got our to-do list, let's return to the Python code and see if we can answer these initial questions and check them off one by one. There's a handy little Python function called dir() that we can use to look at a Python object's attributes. Check it out: typing dir(boston_dataset) and hitting Shift+Enter brings up the following output.
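As a minimal, self-contained sketch of what dir() is doing here: the real boston_dataset is a scikit-learn Bunch loaded in an earlier lesson (note that load_boston was removed in scikit-learn 1.2), so this tiny stand-in class is an assumption purely for illustration.

```python
# Stand-in for scikit-learn's Bunch: a dict whose keys can also be
# read as attributes, and which reports its keys through dir().
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

    def __dir__(self):
        # Make dir() list the stored keys, as scikit-learn's Bunch does
        return list(self.keys())

boston_dataset = Bunch(
    data=[[0.006, 6.575], [0.027, 6.421]],  # toy rows, not the real data
    DESCR="Boston house prices dataset (toy stand-in)",
)

# dir() reveals which attributes we can explore next
print(dir(boston_dataset))  # ['DESCR', 'data']
```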
What we're looking at here is a list of attributes for this Python object. The first attribute, DESCR, is a shorthand for, I'm guessing, description. So let's pull it out and print it. I'm going to write print(boston_dataset.DESCR), all caps, and let's take a look at what we see.

Using this attribute, we do indeed get a description of the dataset. The description was included in the Python object, and we were able to access it with the attribute we discovered through the dir() function.

Okay, so let's take a look at these notes. We've already seen that there are 506 instances, or rows, in this dataset, and that there are 13 attributes: 13 categories, or columns. These 13 columns are as follows: we've got per capita crime; we've got the concentration of nitric oxides, which is a proxy for pollution; we've got the average number of rooms per dwelling; we've got the pupil-teacher ratio by town, which is a proxy for the quality of the schools; and a whole bunch of other attributes.

Scrolling down a little more, we see that the two researchers who collated this dataset are Harrison and Rubinfeld, and that it's based on a research paper.
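For quick reference, the 13 feature abbreviations can be collected into a dictionary. The descriptions below are paraphrased from the dataset's DESCR text, so treat them as a summary rather than a verbatim quote.

```python
# The 13 feature abbreviations from the Boston dataset's DESCR text,
# paraphrased as a quick-reference dictionary (not a verbatim quote).
BOSTON_FEATURES = {
    "CRIM":    "per capita crime rate by town",
    "ZN":      "proportion of residential land zoned for large lots",
    "INDUS":   "proportion of non-retail business acres per town",
    "CHAS":    "Charles River dummy variable (1 if tract bounds the river)",
    "NOX":     "nitric oxides concentration (a proxy for pollution)",
    "RM":      "average number of rooms per dwelling",
    "AGE":     "proportion of owner-occupied units built before 1940",
    "DIS":     "weighted distances to five Boston employment centres",
    "RAD":     "index of accessibility to radial highways",
    "TAX":     "full-value property-tax rate per $10,000",
    "PTRATIO": "pupil-teacher ratio by town (a proxy for school quality)",
    "B":       "a transformed measure of the town's Black population",
    "LSTAT":   "percentage of the population of lower socioeconomic status",
}

print(len(BOSTON_FEATURES))  # 13, matching the documentation
```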
In fact, we can actually see the title of the original research paper here: "Hedonic prices and the demand for clean air", published in the Journal of Environmental Economics and Management, Vol. 5, in 1978. So this already answers a lot of the initial questions about our dataset.

It's actually fairly interesting that the original purpose of the researchers was to figure out how high the demand for clean air was in Boston. That's what they were trying to accomplish: they were trying to figure out how much more people are willing to pay to be able to breathe clean air in the city. The other important thing is that this housing data dates back to 1978, and that we're working with 506 different entries.

So let's check off the questions on our list. We've figured out the source of the data. We've read a brief description of the dataset. We've also managed to figure out the number of data points in the dataset, which was 506, and the number of features, which was 13. And lucky for us, the descriptions of the features were also given. They're called attributes, if you remember, and their names were given in all caps as abbreviations. Finally, we had a very brief description of each feature as well.
So I think for starters this is pretty good going, but let's crack on. As an aside, if you're curious about how this dataset was originally used, you can actually pull up the original research paper mentioned in the description. I just googled it and was sent to the University of Michigan library website, where I was able to pull up the PDF of the original paper for free. I think that's because Daniel Rubinfeld was actually at the University of Michigan, while his co-author David Harrison was at Harvard at the time.

Now, if you've googled this as well, let me show you how you can embed a link in your Jupyter notebook very, very easily. I can copy the URL here, go back to my Jupyter notebook, and then, in one of the markdown cells, say the gathered-data cell I've got here, I can use some square brackets and some parentheses to insert my URL. The URL goes between the two parentheses, so I'm going to paste in the URL that I copied from the other tab in my browser. Then, in the square brackets, I can include the text that I want to display instead of this long and unwieldy URL. So I'm going to write "Source: Original research paper", and when I hit Shift+Enter it gets displayed like so, and this is now an active link in our Jupyter notebook.
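The markdown link syntax described above looks like this. The URL here is just a placeholder, since the actual address depends on where you found the paper:

```markdown
[Source: Original research paper](https://example.com/path-to-paper.pdf)
```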
Now, you might not always get a nice description along with your dataset like this. So let's have a think about how we might look at the number of data points and the number of features manually, in case it isn't presented to us on a silver platter like this. Going down to the bottom, I'm going to insert another markdown cell and add a subheading that reads "Data points and features".

To look at the number of features in this dataset, I'm going to first access the Bunch object's data attribute; remember, we saw this above when we used the dir() function. So we can write boston_dataset.data, and this is what it looks like. From the output we can see that it's an array, and we can verify this by writing type(boston_dataset.data). Here we can see that it is in fact a numpy n-dimensional array; that's the type of object we're accessing. Now, if we want to see the number of rows and columns, the easiest way is to write boston_dataset.data.shape; shape is an attribute of a numpy array, if you recall. Hitting Shift+Enter, we can see that this array has 506 rows and 13 columns, which is good because it ties out with the documentation we read earlier.
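These checks can be sketched in a self-contained way. Since the real boston_dataset.data was loaded in the earlier lessons, the zero-filled array below is a stand-in with the same dimensions, purely for illustration:

```python
import numpy as np

# Stand-in for boston_dataset.data: a zero-filled array with the
# same dimensions the real dataset has (506 rows, 13 columns).
data = np.zeros((506, 13))

print(type(data))   # confirms it's a numpy n-dimensional array
print(data.shape)   # (506, 13): 506 data points, 13 features

rows, cols = data.shape  # shape is a tuple, so it unpacks cleanly
```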
One thing you'll also notice in this line of Python code is that we are chaining our attributes; I'll just add that as a comment here on the right-hand side. This is quite important to understand, because it's a good example of how objects and data can be nested inside one another. Scrolling to the very top, we see that when we work with boston_dataset, we are in fact working with an object of type Bunch, and this Bunch has a number of attributes, including that data attribute. The data attribute is in turn an object of type ndarray, a numpy n-dimensional array. And the n-dimensional array in turn also has attributes, including a shape attribute. When we call on an ndarray's shape attribute, we get back a tuple.

So when you see this dot notation being used to chain things together, you can think of it almost like a Russian matryoshka doll, where each doll contains another object. And if that's not your kind of thing, then think of it maybe like the movie Inception, where you had dreams within dreams, just with less gunfire. And on that note, I'll see you in the next lesson.
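The matryoshka-doll nesting above can be made explicit, one layer per line. As before, the Bunch here is a minimal stand-in assumed for illustration, since the real object was loaded with scikit-learn in an earlier lesson:

```python
import numpy as np

# A tiny stand-in for scikit-learn's Bunch: a dict whose keys can
# be read as attributes, so the dot-chaining below works.
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

boston_dataset = Bunch(data=np.zeros((506, 13)))

outer = boston_dataset              # layer 1: the Bunch itself
middle = boston_dataset.data        # layer 2: the ndarray inside it
inner = boston_dataset.data.shape   # layer 3: the tuple on the array

print(type(middle).__name__, inner)  # ndarray (506, 13)
```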