After having defined the problem and formulated our questions, the second step is gathering the data that will help us find the answers.

Now, previously I've provided datasets for you, and you've imported that data from a CSV into a Jupyter notebook. But what if you're on your own? Where would you get your data from? Well, typically you'd Google away and search for one of the many datasets available online. After all, what if you wanted to build this valuation tool using house price data for your own city rather than for Boston? In that case, Google is going to be your best friend in every respect.

But there's a third alternative for when you're just getting started and you want to practice your data science and machine learning skills, and that's to use some of the popular practice datasets that come directly from a Python module. In other words, some Python modules actually come with sample datasets that we can use.

One of my favorite Python modules out there is scikit-learn. They've got excellent practice datasets, and you can see a whole list of scikit-learn's toy datasets right on the website. Scrolling down, we can see that they not only provide the Boston house price dataset that we're going to use in this section, but they've also got the famous iris dataset, which is used for flower classification, as well as datasets on diabetes, digits, wine and breast cancer.

So this makes the search for datasets that you can use to practice machine learning and data science much, much easier. And that's because the really neat thing about using one of these datasets is that they tend to be much cleaner and more user-friendly than some random CSV you download from a website. These datasets are meant for testing and for practice, so you'll encounter far fewer missing values, fewer weird data formats, less irrelevant information and fewer other problems. In other words, somebody has already done a first pass on the dataset so that you can crack on with a head start.

Now, since we're going to be examining and predicting house prices in Boston, let's pull up the official documentation for this dataset and have a look at what it has to say. The information provided on the website is fairly basic. We can see the total number of samples, 506, and we can see the dimensionality, which is 13. You can think of this as the number of columns in the dataset.
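As a quick aside, every scikit-learn toy dataset follows the same pattern: a loader function that returns the samples as an array whose shape gives you those two numbers. Here's a minimal sketch using a few of the other toy datasets mentioned above (this assumes scikit-learn is installed; note that load_boston itself was removed in scikit-learn 1.2, so this section assumes an older version):

from sklearn.datasets import load_iris, load_wine, load_breast_cancer

# Each loader returns a Bunch object with the samples stored in .data
for loader in (load_iris, load_wine, load_breast_cancer):
    bunch = loader()
    n_samples, n_features = bunch.data.shape
    print(loader.__name__, "->", n_samples, "samples,", n_features, "features")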
Also, further down the same page, we can see some sample code for how to use this built-in dataset from scikit-learn. We're going to be making use of all of this information in the coming lessons, so let's crack on and get started.

The first thing we have to do is create a new notebook. It's going to be a Python 3 notebook. I'm going to click up here where it says "Untitled" and give this notebook a name. I'm going to call it "04 Multivariate Regression", click "Rename", and then let's add some section headings. For the first section heading, I'm going to change my cell to Markdown, put a pound symbol (#) there, and then write "Notebook Imports". Below this section heading we're going to have all our import statements.

So what's the first import statement we're going to put in? The first one is going to load our Boston house price dataset into our Jupyter notebook. From the scikit-learn documentation we know what this import statement should look like. It should read "from sklearn.datasets import load_boston". So let's write that in here: "from sklearn.datasets import load_boston".

Now I'm going to add another Markdown cell to single out the section of the notebook where we're going to gather our data. So I'm going to put a pound symbol here and then write "Gather Data". We're going to try out a little bit more of the code that we saw in the documentation to get our dataset into our notebook. I'm going to insert some cells as well, so that I'm working more in the middle of the screen. And here we can call the load_boston function, because this is the function that will actually return our dataset.

So I'm going to create a variable, call it "boston_dataset", and set its value equal to the return value of load_boston, with two parentheses at the end. This function will return our dataset, and we'll store it in a variable called boston_dataset. Let me hit Shift+Enter, and let's check out the type of this variable. I'm going to write "type(boston_dataset)" and hit Shift+Enter, and then we can see what kind of object we're actually dealing with. As per the documentation, we can see that we're working with an object of type "Bunch".
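Putting the cells so far together, here's a minimal sketch of what the notebook looks like at this point (again assuming a scikit-learn version older than 1.2, where load_boston still exists):

# Notebook Imports
from sklearn.datasets import load_boston

# Gather Data
boston_dataset = load_boston()   # returns the Boston house price data as a Bunch
type(boston_dataset)             # -> sklearn.utils.Bunch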
Now, usually we want to work with a DataFrame, so let's make a mental note right now to convert our data to a DataFrame later on. But for now, let's take a look at what our data actually looks like in its raw state. To do that, all we have to do is write "boston_dataset" into an empty cell and hit Shift+Enter. What we get to see now is a whole bunch of output. This is really difficult to read, and it's not formatted in a way that we'd want to sift through. But the good news is that our data is there. We've successfully imported our dataset into our Jupyter notebook, and this brings us to our next step: exploring our dataset, cleaning it and visualizing it. I'll see you in the next lesson.
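For reference, here's a minimal sketch of the conversion we made a mental note about: wrapping the Bunch in a pandas DataFrame. It assumes pandas is imported as pd and uses the Bunch's data and feature_names attributes; the actual walkthrough comes later in the course.

import pandas as pd

# Wrap the raw NumPy array in a DataFrame, using the feature names as column labels
data = pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)
# The house prices themselves live separately in boston_dataset.target
data.head()   # first few rows in a readable, tabular format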