After having defined the problem and formulated our questions, the second step is gathering the data that will help us find the answers.

Now, previously I've provided datasets for you, and you've imported that data from a CSV into a Jupyter notebook. But what if you're on your own? Where would you get your data from? Well, typically you'd Google away and search for one of the many datasets available online. After all, what if you wanted to build this valuation tool using house price data for your own city rather than for Boston? In that case, Google is going to be your best friend in every respect.

But there's a third alternative for when you're just getting started and you want to practice your data science and machine learning skills, and that's to use some of the popular practice datasets that come directly from a Python module. In other words, some Python modules actually come with sample datasets that we can use.

One of my favorite Python modules out there is scikit-learn. They've got excellent practice datasets, and you can see a whole list of scikit-learn's toy datasets right on the website. Scrolling down, we can see that they not only provide the Boston house price dataset that we're going to use in this section, but they've also got the famous iris dataset, which is used for flower classification, as well as datasets on diabetes, digits, wine and breast cancer.

So this makes the search for datasets that you can use to practice machine learning and data science much, much easier. And that's because the really neat thing about using one of these datasets is that they tend to be much cleaner and more user-friendly than some random CSV you download from a website. These datasets are meant for testing and for practice, so you'll encounter far fewer missing values, fewer weird data formats, less irrelevant information and fewer other problems. In other words, somebody has already done a first pass on the dataset so that you can crack on with a head start.

Now, since we're going to be examining and predicting house prices in Boston, let's pull up the official documentation for this dataset and have a look at what it has to say. The information provided on the website is fairly basic. We can see the total number of samples, 506, and we can see the dimensionality, which is 13. You can think of this as the number of columns in the dataset.
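As a quick aside, every scikit-learn toy dataset follows the same pattern: a loader function that returns the samples as an array whose shape gives you those two numbers. Here's a minimal sketch using a few of the other toy datasets mentioned above (this assumes scikit-learn is installed; note that load_boston itself was removed in scikit-learn 1.2, so this section assumes an older version):

from sklearn.datasets import load_iris, load_wine, load_breast_cancer

# Each loader returns a Bunch object with the samples stored in .data
for loader in (load_iris, load_wine, load_breast_cancer):
    bunch = loader()
    n_samples, n_features = bunch.data.shape
    print(loader.__name__, "->", n_samples, "samples,", n_features, "features")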
Also, further down the same page, we can see some sample code for how to use this built-in dataset from scikit-learn. We're going to be making use of all of this information in the coming lessons, so let's crack on and get started.

The first thing we have to do is create a new notebook. It's going to be a Python 3 notebook. I'm going to click up here where it says "Untitled" and give this notebook a name. I'm going to call it "04 Multivariate Regression", click "Rename", and then let's add some section headings. For the first section heading, I'm going to change my cell to Markdown, put a pound symbol (#) there, and then write "Notebook Imports". Below this section heading we're going to have all our import statements.

So what's the first import statement we're going to put in? The first one is going to load our Boston house price dataset into our Jupyter notebook. From the scikit-learn documentation we know what this import statement should look like. It should read "from sklearn.datasets import load_boston". So let's write that in here: "from sklearn.datasets import load_boston".

Now I'm going to add another Markdown cell to single out the section of the notebook where we're going to gather our data. So I'm going to put a pound symbol here and then write "Gather Data". We're going to try out a little bit more of the code that we saw in the documentation to get our dataset into our notebook. I'm going to insert some cells as well, so that I'm working more in the middle of the screen. And here we can call the load_boston function, because this is the function that will actually return our dataset.

So I'm going to create a variable, call it "boston_dataset", and set its value equal to the return value of load_boston, with two parentheses at the end. This function will return our dataset, and we'll store it in a variable called boston_dataset. Let me hit Shift+Enter, and let's check out the type of this variable. I'm going to write "type(boston_dataset)" and hit Shift+Enter, and then we can see what kind of object we're actually dealing with. As per the documentation, we can see that we're working with an object of type "Bunch".
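Putting the cells so far together, here's a minimal sketch of what the notebook looks like at this point (again assuming a scikit-learn version older than 1.2, where load_boston still exists):

# Notebook Imports
from sklearn.datasets import load_boston

# Gather Data
boston_dataset = load_boston()   # returns the Boston house price data as a Bunch
type(boston_dataset)             # -> sklearn.utils.Bunch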
Now, usually we want to work with a DataFrame, so let's make a mental note right now to convert our data to a DataFrame later on. But for now, let's take a look at what our data actually looks like in its raw state. To do that, all we have to do is write "boston_dataset" into an empty cell and hit Shift+Enter. What we get to see now is a whole bunch of output. This is really difficult to read, and it's not formatted in a way that we'd want to sift through. But the good news is that our data is there. We've successfully imported our dataset into our Jupyter notebook, and this brings us to our next step: exploring our dataset, cleaning it and visualizing it. I'll see you in the next lesson.
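For reference, here's a minimal sketch of the conversion we made a mental note about: wrapping the Bunch in a pandas DataFrame. It assumes pandas is imported as pd and uses the Bunch's data and feature_names attributes; the actual walkthrough comes later in the course.

import pandas as pd

# Wrap the raw NumPy array in a DataFrame, using the feature names as column labels
data = pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)
# The house prices themselves live separately in boston_dataset.target
data.head()   # first few rows in a readable, tabular format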