0 1 00:00:00,930 --> 00:00:06,180 In this lesson we're going to talk about two more types of data structures for storing a lot of data 1 2 00:00:06,180 --> 00:00:07,900 at the same time. 2 3 00:00:08,040 --> 00:00:12,060 The first one that we're going to look at is called a data frame. 3 4 00:00:12,480 --> 00:00:15,700 In this lesson's resources you'll find included some fresh data. 4 5 00:00:16,260 --> 00:00:20,220 Let's upload this data to our Jupyter notebook. 5 6 00:00:20,220 --> 00:00:28,590 So change the tabs to your MLProjects folder and then click "Upload" and then pick the 6 7 00:00:28,590 --> 00:00:35,970 LSD_math_score_data.csv file. Again, you'll find the CSV file in the lesson 7 8 00:00:35,970 --> 00:00:45,690 resources. After you've chosen this file, click "Upload" and then head back into your Jupyter notebook. Here 8 9 00:00:45,690 --> 00:00:47,720 we're going to write the following code - we're going to write 9 10 00:00:47,790 --> 00:00:58,300 "import pandas as pd", and then we're gonna use pandas to read our CSV file and we're going to store that 10 11 00:00:58,300 --> 00:01:08,860 information in a variable called data. So we'll write 11 12 00:01:08,950 --> 00:01:19,660 data = pd.read_csv('LSD_math_score_data.csv') 12 13 00:01:20,050 --> 00:01:29,980 As you're writing this make sure you don't have any typos in the file name; all the capitalization 13 14 00:01:30,100 --> 00:01:37,230 and the spaces matter, of course. Let's hit Shift+Enter and see what we get. What we're looking for at this 14 15 00:01:37,230 --> 00:01:44,990 stage are no errors. Now that we've done that, we can take a peek at our data variable. So let's print 15 16 00:01:44,990 --> 00:01:51,500 it out. Let's print out data. And we can do that by writing print and then within the parentheses we'll 16 17 00:01:51,500 --> 00:01:52,700 put data. 17 18 00:01:55,660 --> 00:02:05,190 And what you should see at this stage are seven rows and three columns. 
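The loading step described above can be sketched as follows. This is only an illustration: the real LSD_math_score_data.csv is not reproduced here, so the sketch reads a small in-memory stand-in instead. The first column name (Time_Delay_in_Minutes) and every value are placeholders assumed for the example.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for LSD_math_score_data.csv.
# The first column name and all values are assumptions for illustration only.
csv_text = """Time_Delay_in_Minutes,LSD_ppm,Avg_Math_Test_Score
5,1.0,80.0
15,2.0,60.0
30,3.0,40.0
"""

# pd.read_csv accepts a file path (as in the lesson) or a file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data)
print(data.shape)  # (number of rows, number of columns)
```

With the real file in the notebook's folder, the call would instead take the file name as a string, exactly as dictated in the lesson.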
Now, just like our lists and our 18 19 00:02:05,190 --> 00:02:12,820 arrays, our data variable here is holding onto this collection. However, the super neat thing here is the 19 20 00:02:12,820 --> 00:02:22,080 structure of this data. We've got our data structured in both rows and columns. Rows and columns, boys 20 21 00:02:22,080 --> 00:02:26,160 and girls, are what spreadsheet monkeys dream about at night. 21 22 00:02:26,160 --> 00:02:32,820 Now, as a challenge, can you find out what the type of this data variable is? 22 23 00:02:33,060 --> 00:02:42,450 I'll give you a couple of seconds. Here's the solution. We'll write "type", provide our variable, hit Shift + 23 24 00:02:42,500 --> 00:02:51,600 Enter, and then we get the full name: data is of type pandas.core.frame.DataFrame - so the 24 25 00:02:51,600 --> 00:02:58,680 type of this variable is not int, and it's not float, and it's not an array nor is it a list - it is of type 25 26 00:02:58,920 --> 00:03:05,100 DataFrame and that's how we'll be referring to it. We'll be referring to it by the short name, and the 26 27 00:03:05,140 --> 00:03:13,410 short name is the last bit at the end of this long name. In terms of lingo, no programmer would say it 27 28 00:03:13,410 --> 00:03:20,610 is of type data frame; they'll simply say this variable is a data frame. This is the kind of language 28 29 00:03:20,700 --> 00:03:26,950 that you'll hear people use when they're referring to types. 
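The type check from the challenge above might look like this. The small data frame is built by hand with placeholder values (and an assumed name for the first column) so that the snippet runs on its own.

```python
import pandas as pd

# Hand-built stand-in for the lesson's data; values are placeholders.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

print(type(data))            # the full, dotted name of the type
print(type(data).__name__)   # the short name: the last bit of the full name

# isinstance is a common way to ask "is this variable a data frame?"
is_data_frame = isinstance(data, pd.DataFrame)
print(is_data_frame)
```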
As we've said before, you can think of a 29 30 00:03:26,950 --> 00:03:34,840 data frame as a collection, but with a clearly defined structure - the data inside a data frame is structured 30 31 00:03:34,840 --> 00:03:41,460 in rows and columns just like an Excel spreadsheet. Data frames are super common in Python and you'll 31 32 00:03:41,470 --> 00:03:47,530 see data frames being used in many, many places, so it's a good idea to learn a couple of the tricks and 32 33 00:03:47,530 --> 00:03:55,780 a couple of the things that we can do with data frames. For example, we can grab a single column by providing 33 34 00:03:55,840 --> 00:04:05,160 the column name where before we were providing the index for a list or an array. So if I write data[], 34 35 00:04:05,160 --> 00:04:13,470 I can put the column name between some single quotes inside these square brackets. Say 35 36 00:04:13,470 --> 00:04:22,280 I take the third column and I write Avg_Math_Test_Score. When 36 37 00:04:22,280 --> 00:04:29,900 I hit Shift+Enter, Jupyter notebook will display the data inside that single column. 37 38 00:04:30,190 --> 00:04:35,980 I want to bring your attention to the fact that the Python syntax is very, very similar between 38 39 00:04:35,980 --> 00:04:38,980 lists and arrays and data frames. 39 40 00:04:38,980 --> 00:04:44,680 However, instead of providing the position or the index between the square brackets, here we're specifying 40 41 00:04:44,800 --> 00:04:45,550 a column name. 41 42 00:04:46,480 --> 00:04:48,430 But again typos are 42 43 00:04:48,430 --> 00:04:54,670 something that we have to be very much aware of when we're doing this, because if we have a typo in our 43 44 00:04:54,670 --> 00:04:58,910 column name, we're trying to fetch a column that doesn't exist. 44 45 00:04:58,990 --> 00:05:05,810 Say if I delete the E and I press Shift+Enter, we'll get an error. 
In this case, 45 46 00:05:05,870 --> 00:05:14,770 it is a key error and you can see that this key error brings up a whole bunch of other errors. 46 47 00:05:14,850 --> 00:05:21,390 In short, Python can't find this location in our data frame. 47 48 00:05:21,450 --> 00:05:24,400 This is why we have to pay a lot of attention to our spelling. 48 49 00:05:24,420 --> 00:05:28,960 We get the same error when we try to retrieve data from a data frame 49 50 00:05:28,980 --> 00:05:32,730 if we treat it like a list or like an array. 50 51 00:05:32,730 --> 00:05:41,450 So if we were to put an index here for, say data[1], we also get a key error. 51 52 00:05:41,460 --> 00:05:47,940 So even though Python handles the data types behind the scenes and they're never really explicit in 52 53 00:05:47,940 --> 00:05:53,760 your face with the syntax, this is another example where you want to be aware of what type of data you're 53 54 00:05:53,760 --> 00:06:00,030 working with, what type is my variable, because that will determine what kind of instructions you can 54 55 00:06:00,180 --> 00:06:02,760 give to your code. 55 56 00:06:02,760 --> 00:06:06,190 Let me fix this error now so we can bring up our column. 56 57 00:06:06,360 --> 00:06:12,720 Just gonna hit Control+Z or Command+Z to undo and hit Shift+Enter. 57 58 00:06:12,750 --> 00:06:20,330 Now let me show you how to save this data that we're extracting, the single column in a variable. To store 58 59 00:06:20,360 --> 00:06:22,580 this column in a single variable, 59 60 00:06:22,610 --> 00:06:34,670 all we have to do is provide a variable name, say onlyMathScores, and set it equal to 60 61 00:06:35,120 --> 00:06:36,920 data[], 61 62 00:06:37,130 --> 00:06:43,380 and then the column name. If we hit Shift+Enter at this point, our output will disappear. 62 63 00:06:43,380 --> 00:06:50,040 But, to prove to you guys that this data is indeed stored in this variable, we can print it out. 
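Here is a small sketch of the two ideas above: a misspelled column name and a list-style index both raise a KeyError, while a correct column name can be stored in a variable. The data frame is a hand-built stand-in with placeholder values, and the `failed_keys` list is a helper added only for this illustration.

```python
import pandas as pd

# Hand-built stand-in for the lesson's data; values are placeholders.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# A correct column name returns that column, which we can keep in a variable.
onlyMathScores = data["Avg_Math_Test_Score"]
print(onlyMathScores)

# A typo in the name, or a list-style position like 1, is not a column name here,
# so pandas raises a KeyError in both cases.
failed_keys = []
for key in ["Avg_Math_Test_Scor", 1]:
    try:
        data[key]
    except KeyError:
        failed_keys.append(key)
        print("KeyError for", repr(key))
```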
63 64 00:06:50,250 --> 00:06:56,230 So I'll write print(onlyMathScores) and hit Shift+Enter again. 64 65 00:06:56,250 --> 00:07:04,800 There we go. Extracting data from a data frame is pretty useful and we've seen how to get a single column 65 66 00:07:04,950 --> 00:07:06,390 out of a data frame. 66 67 00:07:06,660 --> 00:07:09,570 But what happens when we want to, say, I don't know, 67 68 00:07:09,570 --> 00:07:11,480 add a column instead? 68 69 00:07:11,730 --> 00:07:16,590 This is a good thing to know since very often you'll be combining different kinds of data frames or 69 70 00:07:16,590 --> 00:07:22,640 different kinds of data in your Python code into a single table, if you will. 70 71 00:07:23,070 --> 00:07:29,670 Remember how we selected a column by its name from the data frame? We used the name of the column between 71 72 00:07:29,670 --> 00:07:31,530 the square brackets. 72 73 00:07:31,800 --> 00:07:35,900 Let me copy this entire line and paste it in the cell below. 73 74 00:07:35,940 --> 00:07:43,530 Now you'll also remember when we tried to grab a column that didn't exist, we got an error. 74 75 00:07:43,770 --> 00:07:51,020 So if I tried to store the values of Test_Subject inside onlyMathScores and hit Shift+Enter, we 75 76 00:07:51,030 --> 00:07:58,650 get our key error. But if we change things around in the cell and we move data['Test_Subject'] 76 77 00:07:58,950 --> 00:08:04,560 to the left hand side of the equals sign and we provide a value on the right, 77 78 00:08:08,120 --> 00:08:16,670 we are giving Python a completely different instruction. If I hit Shift+Enter now, our Python code runs 78 79 00:08:16,730 --> 00:08:18,420 without a problem. 79 80 00:08:18,500 --> 00:08:26,720 And that's because we are saying "Add a new column with the name Test_Subject and set all the rows equal 80 81 00:08:26,720 --> 00:08:28,280 to Jennifer Lopez." 81 82 00:08:30,940 --> 00:08:34,790 Let's print out our data frame and see this in action. 
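As a rough sketch of the step just described, assigning a single value to a column name that does not exist yet creates the column and fills every row with that value. The starting data frame is a hand-built stand-in with placeholder values.

```python
import pandas as pd

# Hand-built stand-in for the lesson's data; values are placeholders.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# Column name on the left of the equals sign, value on the right:
# this adds the column and sets every row to the same value.
data["Test_Subject"] = "Jennifer Lopez"

print(data)
```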
82 83 00:08:35,030 --> 00:08:36,180 Here you go. 83 84 00:08:36,200 --> 00:08:42,800 Now we have four columns and all the rows in the fourth column have been set equal to the value 84 85 00:08:42,830 --> 00:08:43,980 Jennifer Lopez. 85 86 00:08:44,570 --> 00:08:52,080 So this is how you can add a new column to an existing data frame. Let's talk about how to manipulate 86 87 00:08:52,170 --> 00:08:54,090 the values of a column. 87 88 00:08:54,090 --> 00:08:55,490 This is very, very useful 88 89 00:08:55,530 --> 00:09:03,210 if we want to do calculations on all the values in a single column at the same time. For example, let's 89 90 00:09:03,210 --> 00:09:10,600 create a new column called "High_Score" and then set the values of that column equal to 100. 90 91 00:09:10,610 --> 00:09:13,680 So I'll write data, which is the name of our data frame: 91 92 00:09:13,890 --> 00:09:26,430 data['High_Score'], and set it equal to the number 100. I'm going to hit Shift+Enter. 92 93 00:09:26,520 --> 00:09:30,220 I can print out my data frame. 93 94 00:09:30,420 --> 00:09:31,880 Take a look at it now. 94 95 00:09:32,040 --> 00:09:40,650 And here we see that High_Score on my 13 inch screen here shifts down and is displayed a little bit 95 96 00:09:40,650 --> 00:09:41,460 below. 96 97 00:09:41,730 --> 00:09:45,240 But it's still just the fifth column in the data frame. 97 98 00:09:45,240 --> 00:09:51,180 Now, as a challenge, see if you can figure out how to add all the values in the average Test Score column 98 99 00:09:51,300 --> 00:09:54,250 to the values in the High Score column. 99 100 00:09:54,330 --> 00:10:00,900 In other words, overwrite the values that are currently stored in the High_Score column so that they equal 100 101 00:10:00,900 --> 00:10:06,720 100 plus whatever is inside the column for the average test scores. 101 102 00:10:11,710 --> 00:10:15,680 And here's the solution. 
Using the notation that we know so far, 102 103 00:10:15,700 --> 00:10:19,840 we would set the existing High Score column equal to 103 104 00:10:23,440 --> 00:10:26,590 the current value stored in High Score plus 104 105 00:10:29,740 --> 00:10:36,210 the value stored inside the Average Math Test Score. 105 106 00:10:42,030 --> 00:10:45,840 I'm going to add a print statement below this as well so that we can see what it looks like. 106 107 00:10:48,190 --> 00:10:51,330 Hit Shift+Enter and here we go. 107 108 00:10:51,350 --> 00:10:58,280 All the rows inside the High Score column have been updated to be equal to 100 plus whatever was stored 108 109 00:10:58,340 --> 00:11:01,730 inside the Average Math Test Score column. 109 110 00:11:01,880 --> 00:11:08,030 So when we look at this piece of code right here, we can see that this pattern is actually the same one 110 111 00:11:08,030 --> 00:11:12,750 that we've encountered previously in the fourth cell down from the top 111 112 00:11:12,800 --> 00:11:16,610 when we set myAge = myAge + 1. 112 113 00:11:16,670 --> 00:11:24,680 In this case we were also using the current value of myAge, doing a calculation with it, and then overwriting 113 114 00:11:24,890 --> 00:11:31,580 the value stored inside the variable with this new value. And this is exactly what's going on in this 114 115 00:11:31,580 --> 00:11:33,570 line too. 115 116 00:11:33,680 --> 00:11:40,730 So now that we know how to add two columns together, what if we wanted to, say, square the values inside 116 117 00:11:40,970 --> 00:11:44,270 this high score column? As a challenge, 117 118 00:11:44,300 --> 00:11:51,080 can you figure out how to update the data frame so that the values inside the High Score column are 118 119 00:11:51,080 --> 00:11:52,220 squared? 119 120 00:11:52,220 --> 00:11:59,540 In other words, we'll want to multiply 178 by itself and then do the same thing for every other value 120 121 00:11:59,810 --> 00:12:01,210 in each row in this column. 
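The column addition walked through earlier in this passage can be sketched like this. The scores are placeholder values chosen so the arithmetic is easy to check by eye; they are not the lesson's real data.

```python
import pandas as pd

# Hand-built stand-in; the scores are placeholders.
data = pd.DataFrame({
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# Start with a column where every row is 100.
data["High_Score"] = 100

# Same pattern as myAge = myAge + 1: read the current values, do the
# calculation row by row, then overwrite the column with the result.
data["High_Score"] = data["High_Score"] + data["Avg_Math_Test_Score"]

print(data)
```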
121 122 00:12:02,790 --> 00:12:10,150 I'll give you a few seconds to figure this out. And here's the solution. 122 123 00:12:10,240 --> 00:12:18,880 We simply set data["High_Score"] = data["High_Score"] * data["High_Score"]. 123 124 00:12:27,750 --> 00:12:35,850 If we print our data frame out now, we'll see the values updated in this column as follows. Now, there are 124 125 00:12:35,850 --> 00:12:41,040 other ways you can do this calculation, of course; we don't have to stick to this particular syntax. You 125 126 00:12:41,040 --> 00:12:44,190 can also write the Python code in this way - 126 127 00:12:44,190 --> 00:12:50,850 so instead of writing the name of the column at the very end you could have written it with two multiplication 127 128 00:12:50,850 --> 00:12:53,460 signs (**) and then the number 2. 128 129 00:12:53,460 --> 00:12:59,490 And this raises the values inside the rows of this column to the power of 2. 129 130 00:12:59,610 --> 00:13:04,120 If you had a single multiplication sign it would just be multiplying all the values by 2, 130 131 00:13:04,200 --> 00:13:10,350 but if you have two multiplication signs, it would be raising them to the power of 2. 131 132 00:13:10,360 --> 00:13:14,140 So now our data frame has five columns. 132 133 00:13:14,140 --> 00:13:15,850 It's got the time delay in minutes. 133 134 00:13:15,850 --> 00:13:17,370 It's got LSD parts per million. 134 135 00:13:17,380 --> 00:13:19,120 It's got the average math test scores. 135 136 00:13:19,150 --> 00:13:23,000 It's got a test subject and a high score. 136 137 00:13:23,200 --> 00:13:30,470 Previously we extracted a single column and stored it in a variable called onlyMathScores. In 137 138 00:13:30,470 --> 00:13:31,080 these lessons, 138 139 00:13:31,100 --> 00:13:34,080 I've been harping on and on about data types. 139 140 00:13:34,400 --> 00:13:40,110 Would you like to venture a guess what the data type is for onlyMathScores? 
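Going back to the squaring step for a moment, here is a small sketch comparing the two spellings, plus the single multiplication sign for contrast. The values are placeholders, and the three result variables are names made up for this illustration.

```python
import pandas as pd

# Placeholder values, chosen so the squares are easy to verify.
data = pd.DataFrame({"High_Score": [2.0, 3.0, 4.0]})

# Multiply the column by itself, row by row.
squared_by_multiplying = data["High_Score"] * data["High_Score"]

# Two multiplication signs raise each value to a power.
squared_by_power = data["High_Score"] ** 2

# One multiplication sign just multiplies each value by 2.
doubled = data["High_Score"] * 2

print(squared_by_multiplying.tolist())
print(squared_by_power.tolist())
print(doubled.tolist())
```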
140 141 00:13:40,190 --> 00:13:43,310 What category does this variable belong to? 141 142 00:13:44,060 --> 00:13:45,560 Well, let's check it out. 142 143 00:13:45,590 --> 00:13:55,700 Let's write type(onlyMathScores), hit Shift+Enter, and there we see the type of this variable. Going by the full 143 144 00:13:55,700 --> 00:13:56,640 name of the type, 144 145 00:13:56,660 --> 00:14:03,590 this variable is of type pandas.core.series.Series. 145 146 00:14:03,590 --> 00:14:07,080 Now you might look at this and you might think it's a little odd, right? 146 147 00:14:07,100 --> 00:14:14,990 Because the type of our data variable is DataFrame, and previously we were working with lists and even 147 148 00:14:14,990 --> 00:14:23,060 arrays, and yet when we extract a single column from this data frame we end up with something of data 148 149 00:14:23,060 --> 00:14:25,580 type Series. 149 150 00:14:25,630 --> 00:14:27,660 Now there is no need to panic. 150 151 00:14:27,680 --> 00:14:32,020 A series is actually very, very similar to an array. 151 152 00:14:32,270 --> 00:14:39,230 But there are a few differences, which is why a series is a different category from an array. 152 153 00:14:39,230 --> 00:14:46,460 For example, the key difference is that a series is always, always only one column. 153 154 00:14:46,520 --> 00:14:49,140 It only has a single dimension. 154 155 00:14:49,310 --> 00:14:53,660 It cannot be a matrix like an array or a list. 155 156 00:14:53,870 --> 00:14:56,740 A series is much, much more restrictive. 156 157 00:14:56,870 --> 00:15:04,730 Also, a series can have an attribute, like a name. You'll actually see this attribute when we print out 157 158 00:15:04,910 --> 00:15:07,200 onlyMathScores. Down here, 158 159 00:15:07,220 --> 00:15:11,520 you'll see that the name is basically the column heading. 159 160 00:15:11,630 --> 00:15:17,600 Now some of you might be asking yourselves - why are you telling me this? Why is this interesting? 
And 160 161 00:15:17,780 --> 00:15:19,580 why does it matter? 161 162 00:15:19,580 --> 00:15:26,530 Well, by checking all these data types we've actually just made a discovery - we've made a discovery 162 163 00:15:26,620 --> 00:15:34,370 about the nature of data frames. A pandas data frame is essentially made up of a collection of series. 163 164 00:15:34,690 --> 00:15:41,410 Each column in the data frame is a series; Average Math Scores is a series, Test Subject is a series - every 164 165 00:15:41,410 --> 00:15:48,770 single column is a series and together they make up a data frame. And this brings us to a point where 165 166 00:15:48,770 --> 00:15:52,570 we've talked about quite a few different kinds of data structures. 166 167 00:15:52,640 --> 00:16:00,680 We've introduced you to arrays, lists, data frames and series and we know that a data frame is made up 167 168 00:16:00,680 --> 00:16:08,810 of series and we also know that a series can only have one column of data, while a data frame in contrast 168 169 00:16:09,020 --> 00:16:18,980 has two dimensions because it has both rows and columns. Now, say instead of pulling out a single column 169 170 00:16:19,220 --> 00:16:27,670 as a series from our data frame, say we want to extract another data frame from our data frame. 170 171 00:16:27,710 --> 00:16:34,270 Say we want to create a smaller data frame from our existing data frame. 171 172 00:16:34,330 --> 00:16:36,760 How would we do that? At the moment, 172 173 00:16:36,760 --> 00:16:43,420 we've got data inside five columns and we want to create a data frame that only consists of, say, two 173 174 00:16:43,420 --> 00:16:44,540 columns. 174 175 00:16:44,800 --> 00:16:50,730 Say we're only interested in the LSD parts per million and the Average Test Scores. 175 176 00:16:50,890 --> 00:16:53,320 How do we construct this subset? 176 177 00:16:53,320 --> 00:16:58,480 Well first, let's create a list of the columns that we care about. 
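Before building that list, the claim that every column of a data frame is a series can be checked directly. This is a minimal sketch on a hand-built stand-in with placeholder values. It uses the `ndim` attribute, which reports the number of dimensions, and the `name` attribute of a series, and it collects the column types in a helper list (`column_types`) invented for the illustration.

```python
import pandas as pd

# Hand-built stand-in for the lesson's data; values are placeholders.
data = pd.DataFrame({
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
    "Test_Subject": ["Jennifer Lopez"] * 3,
})

column_types = []
for column_name in data.columns:
    column = data[column_name]          # one column, selected by name
    column_types.append(type(column).__name__)
    # Each column is one-dimensional and carries its heading as its name.
    print(column_name, type(column).__name__, column.name, column.ndim)

# The data frame as a whole has two dimensions: rows and columns.
print(data.ndim)
```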
177 178 00:16:58,570 --> 00:17:01,870 Do you remember how to do that? As a challenge, 178 179 00:17:01,880 --> 00:17:07,940 can you create a list called columnList and put two pieces of data inside of it? 179 180 00:17:07,940 --> 00:17:17,720 Put the LSD parts per million header and the Average Math Score Column header inside this list variable. 180 181 00:17:19,780 --> 00:17:21,870 Here is the solution. 181 182 00:17:21,950 --> 00:17:32,960 We'll write columnList = ['LSD_ppm', 182 183 00:17:33,510 --> 00:17:41,520 'Avg_Math_Test_Score']. And that's it. 183 184 00:17:41,600 --> 00:17:45,890 We've just created a list consisting of two strings. 184 185 00:17:45,950 --> 00:17:48,290 Two column heading names. 185 186 00:17:48,290 --> 00:17:52,640 Now we're gonna use this list to create a new data frame. 186 187 00:17:52,700 --> 00:17:54,470 I'm going to call this data frame. 187 188 00:17:54,470 --> 00:17:55,050 I don't know. 188 189 00:17:55,250 --> 00:18:01,190 cleanData and set it equal to data[] 189 190 00:18:01,190 --> 00:18:11,180 and then inside the square brackets I'm going to pass the columnList so instead of writing the name of 190 191 00:18:11,180 --> 00:18:17,990 every single column that I care about inside these square brackets I just provided a list of column 191 192 00:18:17,990 --> 00:18:24,450 names. And if I print out my cleanData data frame I can see what it looks like. 192 193 00:18:25,050 --> 00:18:27,880 It's just a data frame with two columns. 193 194 00:18:27,990 --> 00:18:35,160 Now, we've actually written some Python code in two lines that we could have done in a single line. 194 195 00:18:35,190 --> 00:18:42,900 We've split out the steps where we created a list and then created a data frame using that list. 195 196 00:18:42,900 --> 00:18:47,280 Oftentimes you'll see both of these steps done in a single line. 
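The two-step version described above, first the list and then the smaller data frame, might look like the following. The source data frame is a hand-built stand-in, with placeholder values and an assumed name for the first column.

```python
import pandas as pd

# Hand-built stand-in for the lesson's data; values are placeholders.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# Step 1: a plain Python list holding two column headings (two strings).
columnList = ["LSD_ppm", "Avg_Math_Test_Score"]

# Step 2: pass the list inside the square brackets to get a smaller data frame.
cleanData = data[columnList]

print(cleanData)
```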
196 197 00:18:47,280 --> 00:18:52,410 So we could theoretically copy this piece of code here, 197 198 00:18:52,410 --> 00:19:01,710 our list of column headings, and just put it inside here, put it inside the square brackets of our data 198 199 00:19:01,710 --> 00:19:03,280 frame. 199 200 00:19:03,290 --> 00:19:06,790 Now I can comment out this line because we don't need it anymore. 200 201 00:19:06,840 --> 00:19:13,250 And if I press Shift+Enter, we'll actually get exactly the same result. 201 202 00:19:13,270 --> 00:19:19,380 So what we've done here is simply nested a list inside another piece of code. 202 203 00:19:19,420 --> 00:19:25,390 The reason I'm showing you this is because oftentimes when you see two square brackets just next to 203 204 00:19:25,390 --> 00:19:33,070 each other like this it can look really really scary but all it is is a list inside of something else. 204 205 00:19:34,530 --> 00:19:36,220 When we're writing our code like this, 205 206 00:19:36,240 --> 00:19:38,110 we're not creating an extra variable, 206 207 00:19:38,130 --> 00:19:40,840 we're not creating this column list variable. 207 208 00:19:40,950 --> 00:19:48,360 We've accomplished the same thing by providing the list of strings directly. Now, to prove to you that we 208 209 00:19:48,360 --> 00:19:50,160 have indeed created the data frame, 209 210 00:19:50,160 --> 00:19:54,120 let's print out the type of cleanData. 210 211 00:19:54,810 --> 00:19:58,290 And here we see that it is indeed a data frame. 211 212 00:19:58,320 --> 00:20:04,420 Now, what if we wanted to create a single column as a data frame? A data frame, 212 213 00:20:04,440 --> 00:20:09,600 after all, doesn't need to have many, many columns. It could have a single column just as well. 213 214 00:20:09,600 --> 00:20:14,640 And this is actually something that's very, very useful when running regressions with scikit-learn. 
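The single-line form with the list written directly inside the square brackets can be sketched as below, alongside the two-step form for comparison. The data frame is again a hand-built stand-in with placeholder values, and `twoStep` is a name made up for the comparison.

```python
import pandas as pd

# Hand-built stand-in for the lesson's data; values are placeholders.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# Outer brackets: "select from data". Inner brackets: a list of column names.
cleanData = data[["LSD_ppm", "Avg_Math_Test_Score"]]

# The two-step version, kept here only to show the results match.
columnList = ["LSD_ppm", "Avg_Math_Test_Score"]
twoStep = data[columnList]

print(type(cleanData).__name__)
print(cleanData.equals(twoStep))
```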
214 215 00:20:14,880 --> 00:20:19,380 For that we actually want to work with data frames instead of series. 215 216 00:20:19,380 --> 00:20:23,160 We're gonna be interested in predicting the math test scores. 216 217 00:20:23,160 --> 00:20:34,870 So in this case we can write y = data[[]], 217 218 00:20:35,680 --> 00:20:47,290 because we're going to be supplying a list, and then all we have to do is write the name of the column. 218 219 00:20:47,410 --> 00:20:53,120 We're still passing in a list here, but in this case, it's a list with only one item. 219 220 00:20:53,250 --> 00:21:00,490 And when we check the type of y by writing type(y), we can see that y is indeed a data 220 221 00:21:00,490 --> 00:21:01,940 frame. 221 222 00:21:02,050 --> 00:21:10,500 If we weren't passing in a list and instead only had one pair of square brackets, we would be passing in a 222 223 00:21:10,500 --> 00:21:11,610 string. 223 224 00:21:11,610 --> 00:21:18,030 And if I re-evaluate this cell, then the type for y would be a series. 224 225 00:21:18,030 --> 00:21:23,580 So this is an important point - when we provide a list to our data frame, 225 226 00:21:23,610 --> 00:21:34,530 we get out a data frame and when we provide a string to our data frame we get out a series. 226 227 00:21:34,710 --> 00:21:42,000 So this is another example, when running Python code, of why it's important to keep in mind the data types 227 228 00:21:42,000 --> 00:21:43,180 that you're working with, 228 229 00:21:43,350 --> 00:21:50,790 even though it's happening in the background. As a quick exercise, can you create a variable called capital 229 230 00:21:50,850 --> 00:21:56,800 X and set it equal to the LSD parts per million values? 230 231 00:21:56,820 --> 00:22:04,540 Also make sure that X is indeed a data frame; print the values of X and show the type. 231 232 00:22:04,540 --> 00:22:11,480 Pause the video and I'll give you a few seconds to figure this out. And here's the solution. 
232 233 00:22:11,580 --> 00:22:19,110 You'd write X = data[[]], so that we get 233 234 00:22:19,140 --> 00:22:26,910 a data frame out, and then we provide the column name which was LSD_ppm. To print the 234 235 00:22:26,910 --> 00:22:27,600 value, 235 236 00:22:27,600 --> 00:22:36,240 we simply write print(X) and to show us the type we'd write type(X). 236 237 00:22:36,960 --> 00:22:38,340 Hitting Shift+Enter, 237 238 00:22:38,580 --> 00:22:39,810 you should see this. 238 239 00:22:39,810 --> 00:22:48,530 We should see that X is a data frame that consists of a single column, namely the LSD parts per million. 239 240 00:22:48,570 --> 00:22:49,800 Excellent. 240 241 00:22:49,800 --> 00:22:54,080 So we've done a lot of work with data frames at this point. 241 242 00:22:54,120 --> 00:23:00,850 We've seen how to add columns, extract columns and manipulate the data inside a column. 242 243 00:23:00,870 --> 00:23:07,260 Let's talk now about how to delete a column that we added to a data frame. After all, 243 244 00:23:07,260 --> 00:23:15,450 having read this scientific study from 1968, I discovered that Jennifer Lopez did not in fact sit for 244 245 00:23:15,450 --> 00:23:20,160 any arithmetic tests. To delete a column from a data frame, 245 246 00:23:20,170 --> 00:23:28,150 we use the Python keyword "del", short for delete, and we follow this by the name of the column that we 246 247 00:23:28,150 --> 00:23:29,300 want to get rid of. 247 248 00:23:29,440 --> 00:23:36,430 In this case the column name is Test_Subject, so we'll write 248 249 00:23:36,550 --> 00:23:43,080 "del data['Test_Subject']", and then below, 249 250 00:23:43,310 --> 00:23:51,370 let's print out our data frame just to see if we have indeed gone from five columns to four. 250 251 00:23:53,880 --> 00:23:58,500 And as you can see, our Test_Subject column has been removed. 
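The last two steps, selecting a single column as a data frame and deleting a column with del, can be sketched together. The data frame is a hand-built stand-in with placeholder values, and `as_series` is a name added here only to show the contrast with X.

```python
import pandas as pd

# Hand-built stand-in for the lesson's data; values are placeholders.
data = pd.DataFrame({
    "LSD_ppm": [1.0, 2.0, 3.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
    "Test_Subject": ["Jennifer Lopez"] * 3,
})

# A list with one column name gives back a data frame with one column.
X = data[["LSD_ppm"]]
y = data[["Avg_Math_Test_Score"]]

# A plain string gives back a series instead.
as_series = data["LSD_ppm"]

print(type(X).__name__, type(y).__name__, type(as_series).__name__)

# del removes the named column from the data frame in place.
del data["Test_Subject"]
print(data)
```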
251 252 00:23:58,680 --> 00:24:04,050 So as a quick exercise, can you delete the High_Score column from our 252 253 00:24:04,050 --> 00:24:10,100 data frame? You've probably guessed it - it's the same pattern as in the cell above. We write 253 254 00:24:10,110 --> 00:24:13,710 del data['High_Score'] 254 255 00:24:13,820 --> 00:24:24,630 I'll print out data below as well so we can 255 256 00:24:24,630 --> 00:24:27,420 see that the column has indeed been removed. 256 257 00:24:27,420 --> 00:24:28,890 Good work. 257 258 00:24:28,950 --> 00:24:30,320 I'll see you in the next lesson.