0
1
00:00:00,930 --> 00:00:06,180
In this lesson we're going to talk about two more types of data structures for storing a lot of data
1

2
00:00:06,180 --> 00:00:07,900
at the same time.
2

3
00:00:08,040 --> 00:00:12,060
The first one that we're going to look at is called a data frame.
3

4
00:00:12,480 --> 00:00:15,700
In this lesson's resources you'll find included some fresh data.
4

5
00:00:16,260 --> 00:00:20,220
Let's upload this data to our Jupyter notebook.
5

6
00:00:20,220 --> 00:00:28,590
So change the tabs to your MLProjects folder and then click "Upload" and then pick the 
6

7
00:00:28,590 --> 00:00:35,970
LSD_math_score_data.csv file. Again, you'll find the CSV file in the lesson
7

8
00:00:35,970 --> 00:00:45,690
resources. After you've chosen this file, click "Upload" and then head back into your Jupyter notebook. Here
8

9
00:00:45,690 --> 00:00:47,720
we're going to write the following code - we're going to write
9

10
00:00:47,790 --> 00:00:58,300
"import pandas as pd", and then we're gonna use pandas to read our CSV file and we're going to store that
10

11
00:00:58,300 --> 00:01:08,860
information in a variable called data. So we'll write 
11

12
00:01:08,950 --> 00:01:19,660
data = pd.read_csv('LSD_math_score_data.csv')
12

13
00:01:20,050 --> 00:01:29,980
As you're writing this make sure you don't have any typos in the file name; all the capitalization
13

14
00:01:30,100 --> 00:01:37,230
and the spaces matter, of course. Let's hit Shift+Enter and see what we get. What we're looking for at this
14

15
00:01:37,230 --> 00:01:44,990
stage are no errors. Now that we've done that, we can take a peek at our data variable. So let's print
15

16
00:01:44,990 --> 00:01:51,500
it out. Let's print out data. And we can do that by writing print and then within the parentheses we'll
16

17
00:01:51,500 --> 00:01:52,700
put data.
17
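The load-and-print steps so far can be sketched roughly like this. It is only a sketch: the rows are made-up placeholders, an in-memory string stands in for the real CSV file, and the first column name (Time_Delay_in_Minutes) is an assumed spelling.

```python
import io

import pandas as pd

# Stand-in for LSD_math_score_data.csv so the sketch runs anywhere.
# The values are made-up placeholders, not the lesson's data, and the
# first column name is an assumption.
csv_text = """Time_Delay_in_Minutes,LSD_ppm,Avg_Math_Test_Score
5,1.0,80.0
15,3.0,60.0
30,5.0,40.0
"""

# In the notebook this would be pd.read_csv('LSD_math_score_data.csv').
data = pd.read_csv(io.StringIO(csv_text))

# Take a peek at the rows and columns.
print(data)
```

With the real file uploaded next to the notebook, only the argument to read_csv changes.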

18
00:01:55,660 --> 00:02:05,190
And what you should see at this stage are seven rows and three columns. Now, just like our lists and our
18

19
00:02:05,190 --> 00:02:12,820
arrays, our data variable here is holding onto this collection. However, the super neat thing here is the
19

20
00:02:12,820 --> 00:02:22,080
structure of this data. We've got our data structured in both rows and columns. Rows and columns, boys
20

21
00:02:22,080 --> 00:02:26,160
and girls, are what spreadsheet monkeys dream about at night.
21

22
00:02:26,160 --> 00:02:32,820
Now, as a challenge, can you find out what the type of this data variable is?
22

23
00:02:33,060 --> 00:02:42,450
I'll give you a couple of seconds. Here's the solution. We'll write "type", provide our variable, hit Shift +
23

24
00:02:42,500 --> 00:02:51,600
Enter, and then we get the full name: data is of type pandas.core.frame.DataFrame - so the
24

25
00:02:51,600 --> 00:02:58,680
type of this variable is not int, and it's not float, and it's not an array nor is it a list - it is of type
25

26
00:02:58,920 --> 00:03:05,100
DataFrame and that's how we'll be referring to it. We'll be referring to it by the short name and the
26

27
00:03:05,140 --> 00:03:13,410
shortening is the last bit at the end of this long name. In terms of lingo, no programmer would say it
27

28
00:03:13,410 --> 00:03:20,610
is of type data frame; they'll simply say this variable is a data frame. This is the kind of language
28

29
00:03:20,700 --> 00:03:26,950
that you'll hear people use when they're referring to types. As we've said before you can think of a
29

30
00:03:26,950 --> 00:03:34,840
data frame as a collection, but with a clearly defined structure - the data inside a data frame is structured
30

31
00:03:34,840 --> 00:03:41,460
in rows and columns, just like an Excel spreadsheet. Data frames are super common in Python and you'll
31

32
00:03:41,470 --> 00:03:47,530
see data frames being used in many, many places, so it's a good idea to learn a couple of the tricks and
32

33
00:03:47,530 --> 00:03:55,780
a couple of the things that we can do with data frames. For example we can grab a single column by providing
33

34
00:03:55,840 --> 00:04:05,160
the column name where before we were providing the index for a list or an array. So if I write data[],
34

35
00:04:05,160 --> 00:04:13,470
I can put the column name between some single quotes inside these square brackets. Say
35

36
00:04:13,470 --> 00:04:22,280
I take the third column and I write Avg_Math_Test_Score, when
36

37
00:04:22,280 --> 00:04:29,900
I hit Shift+Enter, Jupyter notebook will display the data inside that single column.
37

38
00:04:30,190 --> 00:04:35,980
I guess I want to bring your attention to the fact that the Python syntax is very very similar between
38

39
00:04:35,980 --> 00:04:38,980
lists and arrays and data frames.
39

40
00:04:38,980 --> 00:04:44,680
However, instead of providing the position or the index between the square brackets, here we're specifying
40

41
00:04:44,800 --> 00:04:45,550
a column name.
41
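Here is a small sketch of checking the type and grabbing one column by name, next to the index lookup we used for lists. The data frame is built inline from made-up placeholder values, and the first column name is an assumption.

```python
import pandas as pd

# Made-up placeholder values standing in for the lesson's CSV data;
# the first column name is an assumed spelling.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# The challenge from earlier: what type is this variable?
print(type(data))

# A list is indexed by position...
my_list = [10, 20, 30]
print(my_list[2])

# ...while a data frame column is selected by its name.
math_scores = data["Avg_Math_Test_Score"]
print(math_scores)
```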

42
00:04:46,480 --> 00:04:48,430
But again typos are 
42

43
00:04:48,430 --> 00:04:54,670
something that we have to be very much aware of when we're doing this, because if we have a typo in our
43

44
00:04:54,670 --> 00:04:58,910
column name, we're trying to fetch a column that doesn't exist.
44

45
00:04:58,990 --> 00:05:05,810
Say I delete the E and press Shift+Enter: we'll get an error. In this case,
45

46
00:05:05,870 --> 00:05:14,770
it is a key error and you can see that this key error brings up a whole bunch of other errors.
46

47
00:05:14,850 --> 00:05:21,390
In short, Python can't find this location in our data frame.
47

48
00:05:21,450 --> 00:05:24,400
This is why we have to pay a lot of attention to our spelling.
48

49
00:05:24,420 --> 00:05:28,960
We get the same error when we try to retrieve data from a data frame
49

50
00:05:28,980 --> 00:05:32,730
if we treat it like a list or like an array.
50

51
00:05:32,730 --> 00:05:41,450
So if we were to put an index here, say data[1], we'd also get a key error.
51
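Both failure cases can be reproduced in a small sketch. The data frame uses made-up placeholder values, the first column name is an assumption, and the try/except is only there so the cell keeps running after each error.

```python
import pandas as pd

# Made-up placeholder values; the first column name is an assumption.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

failed_keys = []

# First a misspelled column name (missing the final "e"), then a
# list-style position. Neither is a column name in this data frame.
for key in ["Avg_Math_Test_Scor", 1]:
    try:
        data[key]
    except KeyError:
        print("KeyError for", repr(key))
        failed_keys.append(key)
```

In the notebook you would normally just see the long error trace; catching the error here only makes the two cases easy to compare.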

52
00:05:41,460 --> 00:05:47,940
So even though Python handles the data types behind the scenes and they're never really explicit in
52

53
00:05:47,940 --> 00:05:53,760
your face with the syntax, this is another example where you want to be aware of what type of data you're
53

54
00:05:53,760 --> 00:06:00,030
working with, what type is my variable, because that will determine what kind of instructions you can
54

55
00:06:00,180 --> 00:06:02,760
give to your code.
55

56
00:06:02,760 --> 00:06:06,190
Let me fix this error now so we can bring up our column.
56

57
00:06:06,360 --> 00:06:12,720
Just gonna hit Control+Z or Command+Z to undo and hit Shift+Enter.
57

58
00:06:12,750 --> 00:06:20,330
Now let me show you how to save this data that we're extracting, the single column, in a variable. To store
58

59
00:06:20,360 --> 00:06:22,580
this column in a single variable,
59

60
00:06:22,610 --> 00:06:34,670
all we have to do is provide a variable name, say onlyMathScores, and set it equal to
60

61
00:06:35,120 --> 00:06:36,920
data[],
61

62
00:06:37,130 --> 00:06:43,380
and then the column name. If we hit Shift+Enter at this point, our output will disappear.
62

63
00:06:43,380 --> 00:06:50,040
But, to prove to you guys that this data is indeed stored in this variable, we can print it out.
63

64
00:06:50,250 --> 00:06:56,230
So I'll write print(onlyMathScores) and hit Shift+Enter again.
64

65
00:06:56,250 --> 00:07:04,800
There we go. Extracting data from a data frame is pretty useful and we've seen how to get a single column
65

66
00:07:04,950 --> 00:07:06,390
out of a data frame.
66

67
00:07:06,660 --> 00:07:09,570
But what happens when we want to, say, I don't know,
67

68
00:07:09,570 --> 00:07:11,480
add a column instead.
68

69
00:07:11,730 --> 00:07:16,590
This is a good thing to know since very often you'll be combining different kinds of data frames or
69

70
00:07:16,590 --> 00:07:22,640
different kinds of data in your Python code into a single table, if you will.
70

71
00:07:23,070 --> 00:07:29,670
Remember how we selected a column by its name from the data frame? We used the name of the column between
71

72
00:07:29,670 --> 00:07:31,530
the square brackets.
72

73
00:07:31,800 --> 00:07:35,900
Let me copy this entire line and paste it in the cell below.
73

74
00:07:35,940 --> 00:07:43,530
Now you'll also remember when we tried to grab a column that didn't exist, we got an error.
74

75
00:07:43,770 --> 00:07:51,020
So if I tried to store the values of Test_Subject inside onlyMathScores and hit Shift+Enter, we'd
75

76
00:07:51,030 --> 00:07:58,650
get our key error. But if we change things around in the cell and we move data['Test_Subject']
76

77
00:07:58,950 --> 00:08:04,560
to the left hand side of the equals sign and we provide a value on the right,
77

78
00:08:08,120 --> 00:08:16,670
we are giving Python a completely different instruction. If I hit Shift+Enter now, our Python code runs
78

79
00:08:16,730 --> 00:08:18,420
without a problem.
79

80
00:08:18,500 --> 00:08:26,720
And that's because we are saying "Add a new column with the name Test_Subject and set all the rows equal
80

81
00:08:26,720 --> 00:08:28,280
to Jennifer Lopez".
81

82
00:08:30,940 --> 00:08:34,790
Let's print out our data frame and see this in action.
82

83
00:08:35,030 --> 00:08:36,180
Here you go.
83

84
00:08:36,200 --> 00:08:42,800
Now we have four columns and all the rows in the fourth column have been set equal to the value
84

85
00:08:42,830 --> 00:08:43,980
Jennifer Lopez.
85
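As a rough sketch, adding the new column looks like this. The data frame is built from made-up placeholder values, and the first column name is an assumption.

```python
import pandas as pd

# Made-up placeholder values; the first column name is an assumption.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# Column name on the left of the equals sign, value on the right:
# pandas creates the column and repeats the value in every row.
data["Test_Subject"] = "Jennifer Lopez"

print(data)
```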

86
00:08:44,570 --> 00:08:52,080
So this is how you can add a new column to an existing data frame. Let's talk about how to manipulate
86

87
00:08:52,170 --> 00:08:54,090
the values of a column.
87

88
00:08:54,090 --> 00:08:55,490
This is very, very useful.
88

89
00:08:55,530 --> 00:09:03,210
It lets us do calculations on all the values in a single column at the same time. For example, let's
89

90
00:09:03,210 --> 00:09:10,600
create a new column called "High_Score" and then set the values of that column equal to 100.
90

91
00:09:10,610 --> 00:09:13,680
So I'll write data which is the name of our data frame.
91

92
00:09:13,890 --> 00:09:26,430
data['High_Score'], and set it equal to the number 100. I'm going to hit Shift+Enter.
92

93
00:09:26,520 --> 00:09:30,220
I can print out my data frame.
93

94
00:09:30,420 --> 00:09:31,880
Take a look at it now.
94

95
00:09:32,040 --> 00:09:40,650
And here we see that High_Score on my 13 inch screen here shifts down and is displayed a little bit
95

96
00:09:40,650 --> 00:09:41,460
below.
96

97
00:09:41,730 --> 00:09:45,240
But it's still just the fifth column in the data frame.
97

98
00:09:45,240 --> 00:09:51,180
Now, as a challenge, see if you can figure out how to add all the values in the average Test Score column
98

99
00:09:51,300 --> 00:09:54,250
to the values in the High Score column.
99

100
00:09:54,330 --> 00:10:00,900
In other words, overwrite the values that are currently stored in the High_Score column so that they equal
100

101
00:10:00,900 --> 00:10:06,720
100 plus whatever is inside the column for the average test scores.
101

102
00:10:11,710 --> 00:10:15,680
And here's the solution. Using the notation that we know so far,
102

103
00:10:15,700 --> 00:10:19,840
we would set the existing High Score column equal to
103

104
00:10:23,440 --> 00:10:26,590
the current value stored in High Score plus
104

105
00:10:29,740 --> 00:10:36,210
the value stored inside the Average Math Test Score.
105

106
00:10:42,030 --> 00:10:45,840
I'm going to add a print statement below this as well so that we can see what it looks like.
106

107
00:10:48,190 --> 00:10:51,330
Hit Shift+Enter and here we go.
107

108
00:10:51,350 --> 00:10:58,280
All the rows inside the High Score column have been updated to be equal to 100 plus whatever was stored
108

109
00:10:58,340 --> 00:11:01,730
inside the Average Math Test Score column.
109
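The two steps, creating High_Score and then overwriting it with a sum, might look like this in a sketch. The scores are made-up placeholders rather than the lesson's data.

```python
import pandas as pd

# Made-up placeholder values standing in for the lesson's data.
data = pd.DataFrame({
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# Create the column with the same value in every row.
data["High_Score"] = 100

# Same pattern as myAge = myAge + 1: read the current values,
# do the calculation row by row, and overwrite the column.
data["High_Score"] = data["High_Score"] + data["Avg_Math_Test_Score"]

print(data)
```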

110
00:11:01,880 --> 00:11:08,030
So when we look at this piece of code right here, we can see that this pattern is actually the same one
110

111
00:11:08,030 --> 00:11:12,750
that we've encountered previously in the fourth cell down from the top
111

112
00:11:12,800 --> 00:11:16,610
when we set myAge = myAge + 1.
112

113
00:11:16,670 --> 00:11:24,680
In this case we were also using the current value of myAge, doing a calculation with it, and then overwriting
113

114
00:11:24,890 --> 00:11:31,580
the value stored inside the variable with this new value. And this is exactly what's going on in this
114

115
00:11:31,580 --> 00:11:33,570
line too.
115

116
00:11:33,680 --> 00:11:40,730
So now that we know how to add two columns together, what if we wanted to, say, square the values inside
116

117
00:11:40,970 --> 00:11:44,270
this high score column? As a challenge,
117

118
00:11:44,300 --> 00:11:51,080
can you figure out how to update the data frame so that the values inside the High Score column are
118

119
00:11:51,080 --> 00:11:52,220
squared?
119

120
00:11:52,220 --> 00:11:59,540
In other words, we'll want to multiply 178 by itself and then do the same thing for every other value
120

121
00:11:59,810 --> 00:12:01,210
in each row in this column.
121

122
00:12:02,790 --> 00:12:10,150
I'll give you a few seconds to figure this out. And here's the solution.
122

123
00:12:10,240 --> 00:12:18,880
We simply set data["High_Score"] = data["High_Score"] * data["High_Score"].
123

124
00:12:27,750 --> 00:12:35,850
If we print our data frame out now, we'll see the values updated in this column as follows. Now, there's
124

125
00:12:35,850 --> 00:12:41,040
other ways you can do this calculation, of course, we don't have to stick to this particular syntax. You
125

126
00:12:41,040 --> 00:12:44,190
can also write the Python code in this way -
126

127
00:12:44,190 --> 00:12:50,850
so instead of writing the name of the column at the very end you could have written it with two times
127

128
00:12:50,850 --> 00:12:53,460
signs and then the number 2.
128

129
00:12:53,460 --> 00:12:59,490
And this raises the values inside the rows of this column to the power of 2.
129

130
00:12:59,610 --> 00:13:04,120
If you had a single multiplication sign it would just be multiplying all the values by 2,
130

131
00:13:04,200 --> 00:13:10,350
but if you have two multiplication signs, it would be raising them to the power of 2.
131
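To compare one multiplication sign with two, here is a small sketch on a toy High_Score column. The numbers are made up so the arithmetic is easy to check by eye, and the three result names are only for illustration.

```python
import pandas as pd

# Toy values chosen so the results are easy to verify by hand.
data = pd.DataFrame({"High_Score": [2.0, 3.0, 4.0]})

# Column times itself: squares every row.
squared_by_product = data["High_Score"] * data["High_Score"]

# Two multiplication signs: raises every row to the power of 2.
squared_by_power = data["High_Score"] ** 2

# One multiplication sign and a 2: only doubles every row.
doubled = data["High_Score"] * 2

print(squared_by_product.tolist())
print(squared_by_power.tolist())
print(doubled.tolist())
```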

132
00:13:10,360 --> 00:13:14,140
So now our data frame has five columns.
132

133
00:13:14,140 --> 00:13:15,850
It's got the time delay in minutes.
133

134
00:13:15,850 --> 00:13:17,370
It's got LSD parts per million.
134

135
00:13:17,380 --> 00:13:19,120
It's got the average math test scores.
135

136
00:13:19,150 --> 00:13:23,000
It's got a test subject and a high score.
136

137
00:13:23,200 --> 00:13:30,470
Previously we extracted a single column and stored it in a variable called onlyMathScores. In
137

138
00:13:30,470 --> 00:13:31,080
these lessons,
138

139
00:13:31,100 --> 00:13:34,080
I've been harping on and on about data types.
139

140
00:13:34,400 --> 00:13:40,110
Would you like to venture a guess what the data type is for onlyMathScores?
140

141
00:13:40,190 --> 00:13:43,310
What category does this variable belong to?
141

142
00:13:44,060 --> 00:13:45,560
Well, let's check it out.
142

143
00:13:45,590 --> 00:13:55,700
Let's write type(onlyMathScores), hit Shift+Enter, and there we see the type of this variable. The full
143

144
00:13:55,700 --> 00:13:56,640
name of the type,
144

145
00:13:56,660 --> 00:14:03,590
for this variable, is pandas.core.series.Series.
145

146
00:14:03,590 --> 00:14:07,080
Now you might look at this and you might think it's a little odd, right?
146

147
00:14:07,100 --> 00:14:14,990
Because the type of our data variable is DataFrame, and previously we were working with lists and even
147

148
00:14:14,990 --> 00:14:23,060
arrays and yet when we extract a single column from this data frame we end up with something of data
148

149
00:14:23,060 --> 00:14:25,580
type Series.
149

150
00:14:25,630 --> 00:14:27,660
Now there is no need to panic.
150

151
00:14:27,680 --> 00:14:32,020
A series is actually very, very similar to an array.
151

152
00:14:32,270 --> 00:14:39,230
But there are a few differences which is why a series is a different category from an array.
152

153
00:14:39,230 --> 00:14:46,460
For example, the key difference is that a series is always only one column.
153

154
00:14:46,520 --> 00:14:49,140
It only has a single dimension.
154

155
00:14:49,310 --> 00:14:53,660
It cannot be a matrix like an array or a list.
155

156
00:14:53,870 --> 00:14:56,740
A series is much, much more restrictive.
156

157
00:14:56,870 --> 00:15:04,730
Also, a series can have an attribute, like a name. You'll actually see this attribute when we print out
157

158
00:15:04,910 --> 00:15:07,200
onlyMathScores. Down here,
158

159
00:15:07,220 --> 00:15:11,520
you'll see that the name is basically the column heading.
159
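A short sketch of what we just checked: the type of the extracted column, its name, and how many dimensions it has compared with the data frame. The values are made-up placeholders.

```python
import pandas as pd

# Made-up placeholder values standing in for the lesson's data.
data = pd.DataFrame({
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

onlyMathScores = data["Avg_Math_Test_Score"]

# A single extracted column is a Series, not a DataFrame.
print(type(onlyMathScores))

# The name attribute carries the column heading.
print(onlyMathScores.name)

# One dimension for the series, two for the data frame.
print(onlyMathScores.ndim, data.ndim)
```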

160
00:15:11,630 --> 00:15:17,600
Now some of you might be asking yourselves - why are you telling me this? Why is this interesting? And
160

161
00:15:17,780 --> 00:15:19,580
why does it matter?
161

162
00:15:19,580 --> 00:15:26,530
Well, by checking all these data types we've actually just made a discovery - we've made a discovery
162

163
00:15:26,620 --> 00:15:34,370
about the nature of data frames. A pandas data frame is essentially made up of a collection of series.
163

164
00:15:34,690 --> 00:15:41,410
Each column in the data frame is a series; Average Math Scores is a series, Test Subject is a series - every
164

165
00:15:41,410 --> 00:15:48,770
single column is a series and together they make up a data frame. And this brings us to a point where
165

166
00:15:48,770 --> 00:15:52,570
we've talked about quite a few different kinds of data structures.
166

167
00:15:52,640 --> 00:16:00,680
We've introduced you to arrays, lists, data frames and series and we know that a data frame is made up
167

168
00:16:00,680 --> 00:16:08,810
of series and we also know that a series can only have one column of data, while a data frame in contrast
168

169
00:16:09,020 --> 00:16:18,980
has two dimensions because it has both rows and columns. Now, say instead of pulling out a single column
169

170
00:16:19,220 --> 00:16:27,670
as a series from our data frame, say we want to extract another data frame from our data frame.
170

171
00:16:27,710 --> 00:16:34,270
Say we want to create a smaller data frame from our existing data frame.
171

172
00:16:34,330 --> 00:16:36,760
How would we do that? At the moment,
172

173
00:16:36,760 --> 00:16:43,420
we've got data inside five columns and we want to create a data frame that only consists of, say, two
173

174
00:16:43,420 --> 00:16:44,540
columns.
174

175
00:16:44,800 --> 00:16:50,730
Say we're only interested in the LSD parts per million and the Average Test Scores.
175

176
00:16:50,890 --> 00:16:53,320
How do we construct this subset?
176

177
00:16:53,320 --> 00:16:58,480
Well first, let's create a list of the columns that we care about.
177

178
00:16:58,570 --> 00:17:01,870
Do you remember how to do that? As a challenge,
178

179
00:17:01,880 --> 00:17:07,940
can you create a list called columnList and put two pieces of data inside of it?
179

180
00:17:07,940 --> 00:17:17,720
Put the LSD parts per million header and the Average Math Score Column header inside this list variable.
180

181
00:17:19,780 --> 00:17:21,870
Here is the solution.
181

182
00:17:21,950 --> 00:17:32,960
We'll write columnList = ['LSD_ppm',
182

183
00:17:33,510 --> 00:17:41,520
'Avg_Math_Test_Score']. And that's it.
183

184
00:17:41,600 --> 00:17:45,890
We've just created a list consisting of two strings.
184

185
00:17:45,950 --> 00:17:48,290
Two column heading names.
185

186
00:17:48,290 --> 00:17:52,640
Now we're gonna use this list to create a new data frame.
186

187
00:17:52,700 --> 00:17:54,470
I'm going to call this data frame,
187

188
00:17:54,470 --> 00:17:55,050
I don't know,
188

189
00:17:55,250 --> 00:18:01,190
cleanData and set it equal to data[]
189

190
00:18:01,190 --> 00:18:11,180
and then inside the square brackets I'm going to pass the columnList so instead of writing the name of
190

191
00:18:11,180 --> 00:18:17,990
every single column that I care about inside these square brackets I just provided a list of column
191

192
00:18:17,990 --> 00:18:24,450
names. And if I print out my cleanData data frame I can see what it looks like.
192

193
00:18:25,050 --> 00:18:27,880
It's just a data frame with two columns.
193

194
00:18:27,990 --> 00:18:35,160
Now, we've actually written some Python code in two lines that we could have done in a single line.
194

195
00:18:35,190 --> 00:18:42,900
We've split out the steps where we created a list and then created a data frame using that list.
195

196
00:18:42,900 --> 00:18:47,280
Oftentimes you'll see both of these steps done in a single line.
196

197
00:18:47,280 --> 00:18:52,410
So we could theoretically copy this piece of code here,
197

198
00:18:52,410 --> 00:19:01,710
our list of column headings, and just put it inside here, put it inside the square brackets of our data
198

199
00:19:01,710 --> 00:19:03,280
frame.
199

200
00:19:03,290 --> 00:19:06,790
Now I can comment out this line because we don't need it anymore.
200

201
00:19:06,840 --> 00:19:13,250
And if I press Shift+Enter, we'll actually get exactly the same result.
201

202
00:19:13,270 --> 00:19:19,380
So what we've done here is simply nested a list inside another piece of code.
202

203
00:19:19,420 --> 00:19:25,390
The reason I'm showing you this is because oftentimes when you see two square brackets just next to
203

204
00:19:25,390 --> 00:19:33,070
each other like this it can look really really scary but all it is is a list inside of something else.
204
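Here is a sketch of both versions side by side: the two-line version with columnList, and the one-line version with the list nested inside the square brackets. The values are made-up placeholders, the first column name is an assumption, and sameData is just an illustrative name for the second result.

```python
import pandas as pd

# Made-up placeholder values; the first column name is an assumption.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# Two steps: build the list of column names, then pass it in.
columnList = ["LSD_ppm", "Avg_Math_Test_Score"]
cleanData = data[columnList]

# One step: the inner brackets are the list, the outer brackets
# are the selection on the data frame.
sameData = data[["LSD_ppm", "Avg_Math_Test_Score"]]

print(cleanData)
print(type(cleanData))
```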

205
00:19:34,530 --> 00:19:36,220
When we're writing our code like this,
205

206
00:19:36,240 --> 00:19:38,110
we're not creating an extra variable,
206

207
00:19:38,130 --> 00:19:40,840
we're not creating this columnList variable.
207

208
00:19:40,950 --> 00:19:48,360
We've accomplished the same thing by providing the list of strings directly. Now, to prove to you that we
208

209
00:19:48,360 --> 00:19:50,160
have indeed created the data frame,
209

210
00:19:50,160 --> 00:19:54,120
let's print out the type of cleanData.
210

211
00:19:54,810 --> 00:19:58,290
And here we see that it is indeed a data frame.
211

212
00:19:58,320 --> 00:20:04,420
Now, what if we wanted to create a single column as a data frame? A data frame,
212

213
00:20:04,440 --> 00:20:09,600
after all, doesn't need to have many, many columns. It could have a single column just as well.
213

214
00:20:09,600 --> 00:20:14,640
And this is actually something that's very, very useful when running regressions with scikit-learn.
214

215
00:20:14,880 --> 00:20:19,380
For that we actually want to work with data frames instead of series.
215

216
00:20:19,380 --> 00:20:23,160
We're gonna be interested in predicting the math test scores.
216

217
00:20:23,160 --> 00:20:34,870
So in this case we can write y = data[[]],
217

218
00:20:35,680 --> 00:20:47,290
because we're going to be supplying that list and then all we have to do is write the name of the column.
218

219
00:20:47,410 --> 00:20:53,120
We're still passing in a list here, but in this case, it's a list with only one item.
219

220
00:20:53,250 --> 00:21:00,490
And when we check the type of y by writing type(y), we can see that y is indeed a data
220

221
00:21:00,490 --> 00:21:01,940
frame.
221

222
00:21:02,050 --> 00:21:10,500
If we weren't passing in a list and instead only had one pair of square brackets, we would be passing in a
222

223
00:21:10,500 --> 00:21:11,610
string.
223

224
00:21:11,610 --> 00:21:18,030
And if I re-evaluate this cell, then the type for y would be a series.
224

225
00:21:18,030 --> 00:21:23,580
So this is an important point - when we provide a list to our data frame,
225

226
00:21:23,610 --> 00:21:34,530
we get out a data frame and when we provide a string to our data frame we get out a series.
226
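The list-versus-string rule can be checked with a small sketch. The values are made-up placeholders, and y_frame and y_series are illustrative names for the two results.

```python
import pandas as pd

# Made-up placeholder values standing in for the lesson's data.
data = pd.DataFrame({
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
})

# A list with one column name gives back a data frame with one column.
y_frame = data[["Avg_Math_Test_Score"]]

# A plain string gives back a series.
y_series = data["Avg_Math_Test_Score"]

print(type(y_frame), y_frame.shape)
print(type(y_series), y_series.shape)
```

The shapes show the difference: the one-column data frame still has rows and columns, while the series has a single dimension.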

227
00:21:34,710 --> 00:21:42,000
So this is another example when running Python code, that it's important to keep in mind the data types
227

228
00:21:42,000 --> 00:21:43,180
that you're working with,
228

229
00:21:43,350 --> 00:21:50,790
even though it's happening in the background. As a quick exercise, can you create a variable called capital
229

230
00:21:50,850 --> 00:21:56,800
X and set it equal to the LSD parts per million values?
230

231
00:21:56,820 --> 00:22:04,540
Also make sure that X is indeed a data frame; print the values of X and show the type.
231

232
00:22:04,540 --> 00:22:11,480
I'll give you a few seconds to figure this out and pause the video. And here's the solution.
232

233
00:22:11,580 --> 00:22:19,110
You'd write X = data[[]], so that we get
233

234
00:22:19,140 --> 00:22:26,910
a data frame out, and then we provide the column name which was LSD_ppm. To print the
234

235
00:22:26,910 --> 00:22:27,600
value,
235

236
00:22:27,600 --> 00:22:36,240
we simply write print(X) and to show us the type we'd write type(X).
236

237
00:22:36,960 --> 00:22:38,340
Hitting Shift+Enter,
237

238
00:22:38,580 --> 00:22:39,810
you should see this.
238

239
00:22:39,810 --> 00:22:48,530
We should see that X is a data frame that consists of a single column, namely the LSD parts per million.
239

240
00:22:48,570 --> 00:22:49,800
Excellent.
240

241
00:22:49,800 --> 00:22:54,080
So we've done a lot of work with data frames at this point.
241

242
00:22:54,120 --> 00:23:00,850
We've seen how to add columns, extract columns and manipulate the data inside a column.
242

243
00:23:00,870 --> 00:23:07,260
Let's talk now about how to delete a column that we added to a data frame. After all,
243

244
00:23:07,260 --> 00:23:15,450
having read this scientific study from 1968, I discovered that Jennifer Lopez did not in fact sit for
244

245
00:23:15,450 --> 00:23:20,160
any arithmetic tests. To delete a column from a data frame,
245

246
00:23:20,170 --> 00:23:28,150
we use the Python keyword "del", short for delete, and we follow this by the name of the column that we
246

247
00:23:28,150 --> 00:23:29,300
want to get rid of.
247

248
00:23:29,440 --> 00:23:36,430
In this case the column name is Test_Subject, so we'll write 
248

249
00:23:36,550 --> 00:23:43,080
"del data['Test_Subject']", and then below,
249

250
00:23:43,310 --> 00:23:51,370
let's print out our data frame just to see if we have indeed gone from five columns to four.
250

251
00:23:53,880 --> 00:23:58,500
And as you can see our Test_Subject column has been removed.
251

252
00:23:58,680 --> 00:24:04,050
So as a quick exercise can you delete the High_Score column from our data
252

253
00:24:04,050 --> 00:24:10,100
data frame? You've probably guessed it - it's the same pattern as in the cell above. We write
253

254
00:24:10,110 --> 00:24:13,710
del data['High_Score']
254

255
00:24:13,820 --> 00:24:24,630
Let's print out data below as well so we can
255

256
00:24:24,630 --> 00:24:27,420
see that the column has indeed been removed.
256
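Both deletions can be sketched together like this. The data frame is rebuilt inline with made-up placeholder values and an assumed name for the first column, so the cell runs on its own.

```python
import pandas as pd

# Made-up placeholder values; the first column name is an assumption.
data = pd.DataFrame({
    "Time_Delay_in_Minutes": [5, 15, 30],
    "LSD_ppm": [1.0, 3.0, 5.0],
    "Avg_Math_Test_Score": [80.0, 60.0, 40.0],
    "Test_Subject": ["Jennifer Lopez"] * 3,
    "High_Score": [180.0, 160.0, 140.0],
})

# del removes the named column from the data frame in place.
del data["Test_Subject"]
del data["High_Score"]

print(data)
```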

257
00:24:27,420 --> 00:24:28,890
Good work.
257

258
00:24:28,950 --> 00:24:30,320
I'll see you in the next lesson.