All right, so let's take a closer look at what we've got inside our boston_dataset bunch. Earlier we were talking about the attributes of this object. The thing is, in machine learning this word attribute is used in a different context. In machine learning, the features of a dataset are typically represented as the columns in a table, and these columns are often referred to as the attributes of a dataset. In other words, when we use the word attribute in the machine learning context, we're referring to a feature, or an independent variable. And this is what we're going to be using to predict a house price. So yes, the word attribute is used both in Python and in machine learning, but unfortunately it means completely different things. Speaking of features, let's pull up the features of our dataset in the Python notebook. We can do this again using the boston_dataset object, which has an attribute called "feature_names". Let's print this out. Here we can see all the feature names of our dataset in a nice little array. Again, the dir function was really handy in ascertaining that our bunch has an attribute called feature_names. These are the kinds of niceties that you get with a toy dataset like this. Now, taking a look at these feature names, the question you might ask is: where is the house price?
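As a sketch of this attribute lookup, here is a minimal stand-in for sklearn's Bunch object (a dict whose keys double as attributes), since load_boston is no longer available in recent scikit-learn releases; the feature names and targets below are just a small illustrative subset:

```python
# Minimal stand-in for sklearn's Bunch: a dict whose keys are also attributes.
# In the lesson, boston_dataset = load_boston() would produce the real thing.
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __dir__(self):
        # Listing the keys is what makes dir() reveal the bunch's contents
        return self.keys()

boston_dataset = Bunch(
    feature_names=["CRIM", "ZN", "INDUS", "RM", "NOX", "DIS"],  # subset, for illustration
    target=[24.0, 21.6, 34.7],
)

print(dir(boston_dataset))           # shows 'feature_names' and 'target'
print(boston_dataset.feature_names)  # attribute access works like in the lesson
```

The key point mirrors the lesson: dir() told us the attribute existed, and dot notation pulls it out.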
Where is the price of the houses? We've got a bunch of abbreviations here, and none of them seems to suggest anything about the values of the houses that we're looking to estimate. This means the price is hidden somewhere else. The key thing we're trying to predict is found in an attribute called "target". So boston_dataset.target will bring up the actual prices of the houses. This is why we didn't get a separate column for the house prices earlier: the house prices are actually found somewhere else in our bunch object. Now, looking at these house prices, you might be wondering about values like 24, 21 and 34. These look like they're prices for toy houses or something. They don't look high enough to be the dollar values of actual houses, because no house could possibly cost 24 dollars, right? Unless of course you buy it off AliExpress or something. The thing to note is that these units are actually in thousands. So these are the actual prices, in thousands of dollars. I'm going to add this here as a comment, so that if you come back to this notebook in three months' time it will still make sense.
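To make the units explicit, the comment in the notebook might look something like this; the three values below are simply the first few target prices mentioned above:

```python
# House prices in boston_dataset.target are in thousands of dollars,
# so a target value of 24.0 really means $24,000.
prices_in_thousands = [24.0, 21.6, 34.7]  # first few target values

prices_in_dollars = [p * 1000 for p in prices_in_thousands]
print(prices_in_dollars)  # [24000.0, 21600.0, 34700.0]
```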
Now, working with a bunch object in the notebook is all well and good, but one of the most common types of objects that you're actually going to encounter in your work as a machine learning expert or as a data scientist is the pandas dataframe. The pandas dataframe is going to be our main workhorse. So let me add a little section heading here to commemorate this. I'll add a subheading that reads "Data exploration with Pandas dataframes", and then I'm going to create a variable called "data" and have this variable hold on to our pandas dataframe object. The way we're going to create this dataframe is by using the pandas module. So before writing any more code here, I'm going to have to import the pandas module, right? I can't just write pd.DataFrame without importing pandas first. So I'm going to pause here for a second, go back up to the top where I've got all my import statements, write "import pandas as pd" and hit Shift+Enter. Now I can come back down here and actually make use of the module. To construct our dataframe from our boston_dataset, we're going to supply some arguments between the parentheses. The first argument is called "data", and we're going to set that equal to boston_dataset.data.
This is going to be the numpy array contained inside our boston_dataset bunch. The next argument, columns, is the argument for the column names, and we're going to set that equal to boston_dataset.feature_names. What this code will do is create a pandas dataframe. Now, remember how our house prices will not be included in this, so we're going to add those separately. I'm going to add a column with the price, our target, to the dataframe. The way we do this is by using our dataframe variable, which is called data, with some square brackets after it, and in those square brackets I supply a column name. I'm going to call this column "PRICE", all caps, and set it equal to boston_dataset.target. Okay, so let's hit Shift+Enter together and see if we get any errors. All good. Let me add a few more cells here, and then we can continue to explore our data, and I'll show you a couple of tricks using the pandas dataframe. The thing is, oftentimes your dataframe will be huge. This one has just 506 rows, but often you're going to be working with dataframes with many thousands of rows, or tens of thousands.
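Put together, the dataframe construction described above looks like this. Since the real bunch may not be loadable in newer scikit-learn, a small made-up numpy array stands in for boston_dataset.data (the real one is 506 rows by 13 columns); the three columns and their first values are taken from the actual Boston data for flavor:

```python
import numpy as np
import pandas as pd

# Stand-ins for the arrays inside the bunch; in the lesson these would be
# boston_dataset.data, boston_dataset.feature_names and boston_dataset.target.
feature_names = np.array(["RM", "NOX", "DIS"])
features = np.array([[6.575, 0.538, 4.0900],
                     [6.421, 0.469, 4.9671],
                     [7.185, 0.469, 4.9671]])
target = np.array([24.0, 21.6, 34.7])

# Same pattern as the lesson: build the frame from the features first...
data = pd.DataFrame(data=features, columns=feature_names)

# ...then add the target as its own column, named PRICE in all caps
data["PRICE"] = target

print(data)
```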
So the question is: how can you get a glimpse of the data inside a huge dataframe without printing out all of the values? For that, pandas gives us two dataframe methods: the first one is called "head" and the second one is called "tail". Let me show you how you'd use them. If we were to write "data" and hit Shift+Enter, our notebook would output a whole bunch of rows. But if we wanted to just take a gander at the first couple of rows in the dataframe, say rows 0 through 4 for example, we could write "data.head()", and hitting Shift+Enter, what we see instead are rows 0 through 4. This gives us an idea of the kind of values contained in our rows and columns without having to look at an enormous amount of data. So let me add a little comment here that says "The top rows look like this". Now, it follows that "data.tail()" will show us the rows at the bottom of the dataframe, right? "Rows at bottom of dataframe look like this". Scrolling down, we can see that rows 501 through 505 have this kind of data inside them. I personally really like these two methods for looking at the top part and the bottom part of the data, just to get an idea of what we're working with.
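A quick sketch of head() and tail() on a small made-up frame; both default to five rows, and both also accept an explicit row count:

```python
import pandas as pd

# A toy frame with 10 rows, so head() and tail() show different slices
data = pd.DataFrame({"RM": range(10), "PRICE": range(10, 20)})

print(data.head())   # rows 0 through 4 (first 5 rows by default)
print(data.tail())   # last 5 rows
print(data.head(2))  # an explicit count works too: just rows 0 and 1
```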
Now, if we wanted to figure out how many rows our dataframe has, or rather how many entries there are in each column, there's a handy little method called "count". So "data.count()" will show us the number of rows, for each column. Check it out. In the output below you see the number of entries per column: each column has 506 entries. Now, coming back to this topic of language, lingo and jargon: you'll often hear the number of data points, or rows, referred to as the number of instances. So here we've got 506 instances. This is how the word instance is used in the context of machine learning, and the important thing to note here is that this word means something completely different to a programmer. To a Python programmer, an instance is an object. In other words, our data object right here is an instance of a dataframe. A dataframe would be the general category, and a particular dataframe, namely the one we've stored inside our variable here, would then be referred to as an instance. So yes, the word instance again has a different meaning in machine learning and in programming. Again, this is just something to be aware of.
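A small sketch of count() on a toy frame. One detail worth knowing: count() tallies non-null entries per column, so if a value is missing, that column's count comes up short, which also quietly previews the missing-data check coming next:

```python
import numpy as np
import pandas as pd

# A toy frame with one deliberately missing value in PRICE
data = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                     "PRICE": [24.0, np.nan, 34.7]})

# count() returns one number per column: the non-null entries in that column
print(data.count())  # RM: 3, PRICE: 2 (the NaN is not counted)
```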
Moving on to our next topic, let's add a few more cells here and make the first one a markdown cell, where we're going to add a subheading called "Cleaning data - check for missing values". So when you're doing your data exploration, oftentimes you're going to look for problems in your dataset, and dealing with missing data is definitely a kind of problem that you have to address, because I guarantee you that your machine learning algorithm is going to get really confused and give you really terrible answers if you haven't addressed this ahead of time and aren't feeding clean data to your algorithm. You might remember how we addressed missing values when we were analyzing our movie revenues. The problem we're confronted with at the moment is: how do we find the missing values, and how do we find them quickly? Pandas actually has a function called "isnull", and this function will return a table showing whether any of the values are missing. Now let me show you how to use it. Since this function comes from the pandas module, we're going to access it through "pd.isnull()", and as an argument we have to pass in the data that we want the function to check.
We've stored all of this inside our data dataframe, and hitting Shift+Enter on this will now return a whole table where each entry is either False, meaning no missing value, or True, which means a missing value. You can see this is a huge table; the Jupyter notebook is not even showing us the entire thing here. So the question is: how would we know if there are any missing values in this entire table of 506 entries? And the answer is: we can chain another method call onto this one. We've got our table back, and it's a table of True and False entries. If we chain a method called "any()" and hit Shift+Enter, then pandas checks all the columns and tells us if there are any missing values in any of them. Now, I don't know if you've heard the word null before, but null does not mean the value 0. If a variable is equal to null, it contains nothing, which is very, very different from the variable having the value 0. And the Internet has decided that the best way to summarize this is as a picture of a problem you wouldn't wish upon anyone in a public restroom: on the left, we have the dispenser containing the value 0, and on the right we have the dispenser containing the value null.
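The full check chains the two calls together. A sketch on a toy frame with one deliberately missing value, so there is actually something for the check to find:

```python
import numpy as np
import pandas as pd

# A toy frame with one missing value (NaN) in the PRICE column
data = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                     "PRICE": [24.0, np.nan, 34.7]})

print(pd.isnull(data))        # full table of True/False; True marks a missing value
print(pd.isnull(data).any())  # one answer per column: RM False, PRICE True
```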
This isnull() function chained with the any() method is super handy for figuring out if there are any missing values in your dataset, because if there is a missing value in one of the columns, then instead of the word False being printed here you will see the word True, and then you might have to dig into the data and fix the problem. Now, let me show you an alternative way of doing this check, because this first approach, isnull() and any(), belongs to the pandas module. The alternative I'm going to show you next belongs to the dataframe instead. Typing "data.info()", so using our data object and calling the info method on it, will show us not only whether there are any null values, but also a whole bunch of other information, including the number of entries or rows, the number of columns, the names of the columns, whether any of the columns have a null value, and also the type of object that each column contains. In our case, all of the columns contain float64 type objects. Now, if you're new to programming this will look super jargony, so let me explain what it means. A float is programming jargon for a floating point number. What's a floating point number? Nothing special; it just refers to a decimal number. A floating point number has a decimal point, but something like an integer does not.
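A sketch of info() on a toy frame; alongside it, the dtypes attribute is a quick way to see just the per-column types that info() reports at the bottom of its output:

```python
import pandas as pd

# A toy frame with only decimal values, so both columns come out as float64
data = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                     "PRICE": [24.0, 21.6, 34.7]})

# info() prints row count, column names, non-null counts and dtypes in one go
data.info()

# dtypes gives just the per-column types
print(data.dtypes)  # both columns are float64
```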
In other words, Python has different categories for different types of numbers. The number 64 at the end of the word float shows us that the category we're working with here is a large, precise decimal number. In this case, we've got a 64-bit floating point number that takes up 64 bits of memory, and that is in contrast to less precise numbers like a float32 or a float16. A float64 number holds roughly twice as many significant digits and takes up twice as much memory as a float32. So that's what that means. In any case, the good news is that we have no missing values, which is great. One less thing to worry about. So now we can start to explore the features contained in the dataset. It's time to demystify these mysterious-sounding columns like RM, NOX and DIS. Can't wait to see you in the next lesson.