0 1 00:00:00,630 --> 00:00:06,980 The next feature I want you to investigate is called RAD. 1 2 00:00:07,030 --> 00:00:16,390 This is a measure of the accessibility to highways for the property, and I want to challenge you to use 2 3 00:00:16,480 --> 00:00:22,690 matplotlib to generate a meaningful histogram for this RAD feature. 3 4 00:00:22,720 --> 00:00:28,900 This might be a little tricky and require some thought, so pause the video, play with the Python code 4 5 00:00:29,410 --> 00:00:33,170 and have a think about what this feature is actually telling us. 5 6 00:00:33,430 --> 00:00:39,650 Oh and, for the histogram pick a beautiful royal purple color while you're at it. 6 7 00:00:39,950 --> 00:00:42,400 I'll give you a few seconds to pause the video. 7 8 00:00:44,300 --> 00:00:45,980 Here's the solution. 8 9 00:00:46,160 --> 00:00:51,200 Let's check what would happen if we took this code here, 9 10 00:00:51,680 --> 00:00:53,060 pasted it in, 10 11 00:00:53,060 --> 00:01:06,600 changed RM to RAD, changed the x label to "Accessibility to Highways" and changed the hex code to 11 12 00:01:06,660 --> 00:01:12,440 a nice purple from materialpalette.com, change that here, 12 13 00:01:12,460 --> 00:01:15,780 paste it in and hit Shift+Enter. 13 14 00:01:16,160 --> 00:01:19,800 We get something like this. Now, 14 15 00:01:20,000 --> 00:01:23,090 this looks a little strange to me. 15 16 00:01:23,450 --> 00:01:31,610 It seems like the histogram's bins are hiding some information from us, because the bins for this histogram 16 17 00:01:31,880 --> 00:01:36,960 seem a little too broad. If I look at the Python code, 17 18 00:01:37,070 --> 00:01:38,610 "plt.hist()", 18 19 00:01:39,120 --> 00:01:47,610 we haven't supplied any bins as an argument to this function call and this means we're using automatic 19 20 00:01:47,870 --> 00:01:48,850 binning. 20 21 00:01:48,870 --> 00:01:54,610 We're letting matplotlib decide on how to show us the histogram. 
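A minimal sketch of that first attempt, with matplotlib left to choose the bins. The notebook's real `data` DataFrame is not shown in this lesson, so the `RAD` column below is a stand-in: the counts for the values 7 and 24 are the ones quoted later in the lesson, and every other count is assumed for illustration. The figure size, edge colour, y label and purple hex code are assumptions as well.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the notebook's data: RAD value -> number of properties.
# Only the counts for 7 and 24 are quoted in the lesson; the rest are assumed.
rad_counts = {1: 20, 2: 24, 3: 38, 4: 110, 5: 115, 6: 26, 7: 17, 8: 24, 24: 132}
data = pd.DataFrame(
    {"RAD": [value for value, n in rad_counts.items() for _ in range(n)]}
)

plt.figure(figsize=(10, 6))
# No bins argument, so matplotlib falls back to its default binning.
counts, bin_edges, _ = plt.hist(data["RAD"], ec="black", color="#7b1fa2")
plt.xlabel("Accessibility to Highways")
plt.ylabel("Nr. of Houses")
# In a notebook, finish the cell with plt.show()

print(len(counts), "bins, each", bin_edges[1] - bin_edges[0], "wide")
```

With the default settings, each bin is wider than one index step, so neighbouring index values end up sharing a bar. That is the "too broad" look described above.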
21 22 00:01:54,760 --> 00:02:03,000 Maybe what we need to do is investigate what RAD actually is and how accessibility to highways is actually 22 23 00:02:03,000 --> 00:02:03,930 measured. 23 24 00:02:03,930 --> 00:02:06,670 For example, what are the units in RAD? 24 25 00:02:06,720 --> 00:02:13,200 Perhaps we should try to understand this before creating our visualization, so let's output RAD to our 25 26 00:02:13,200 --> 00:02:22,410 Jupyter notebook, so I'm going to say "data['RAD']", all caps, and hit Shift+Enter. 26 27 00:02:23,310 --> 00:02:25,740 And I'm going to scroll down and just take a look at this. 27 28 00:02:29,570 --> 00:02:37,870 So I've got 506 entries and all of these seem to be whole numbers. 28 29 00:02:38,140 --> 00:02:42,090 So starts out with 1, 2, 3, 29 30 00:02:42,160 --> 00:02:44,740 some of them have 5, 30 31 00:02:44,770 --> 00:02:46,950 some of them have 24. 31 32 00:02:47,090 --> 00:02:47,470 Hmm, 32 33 00:02:48,130 --> 00:02:56,930 okay, so this is a contrast to the house prices, RAD is a bunch of distinct integer values and all the 33 34 00:02:56,930 --> 00:02:59,000 values seem to be whole numbers. 34 35 00:02:59,000 --> 00:03:06,080 A better way that we can see this and just look at how many unique values there are in the series is to 35 36 00:03:06,080 --> 00:03:11,100 use the value_counts method on this series. 36 37 00:03:11,120 --> 00:03:21,500 So I'm going to put a dot after "data['RAD']" and write "value_counts()" 37 38 00:03:22,030 --> 00:03:24,730 and hit Shift+Enter. 38 39 00:03:24,740 --> 00:03:33,770 So this gives me a beautiful summary of how many observations in this column, in RAD, have a particular 39 40 00:03:33,770 --> 00:03:34,870 value. 
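The inspection step above can be sketched as follows. Again, `data` is a stand-in built from assumed counts (only the figures for 7 and 24 come from the lesson), not the real notebook DataFrame.

```python
import pandas as pd

# Stand-in for the notebook's data: RAD value -> number of properties.
# Only the counts for 7 and 24 are quoted in the lesson; the rest are assumed.
rad_counts = {1: 20, 2: 24, 3: 38, 4: 110, 5: 115, 6: 26, 7: 17, 8: 24, 24: 132}
data = pd.DataFrame(
    {"RAD": [value for value, n in rad_counts.items() for _ in range(n)]}
)

print(data["RAD"])                 # the raw column: 506 whole numbers
print(data["RAD"].value_counts())  # one row per distinct value, most frequent first
print(data["RAD"].nunique())       # how many distinct values there are
```

`value_counts()` returns the distinct values as the index and the number of observations as the values, sorted from the most to the least frequent.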
40 41 00:03:34,940 --> 00:03:39,760 So, for example, we can see that 17 observations, 41 42 00:03:39,780 --> 00:03:48,680 yeah 17 properties, in the dataset have a RAD value of 7 and there's 132 dwellings 42 43 00:03:48,950 --> 00:03:53,780 that have the highway accessibility value of 24. 43 44 00:03:53,840 --> 00:04:01,940 So keeping this in mind and scrolling back up to the description, RAD actually refers to an index 44 45 00:04:02,210 --> 00:04:05,070 of accessibility to radial highways. 45 46 00:04:05,240 --> 00:04:06,500 So that's what we're looking at. 46 47 00:04:06,620 --> 00:04:08,930 We're looking at an index. 47 48 00:04:09,290 --> 00:04:16,780 In other words, accessibility to highways is ranked from 1 to 24. 48 49 00:04:17,060 --> 00:04:23,900 1 is the value for low accessibility and 24 is the value for high accessibility. 49 50 00:04:23,960 --> 00:04:33,620 In other words, a property with poor accessibility to transport scores low on this index; and a property 50 51 00:04:33,800 --> 00:04:39,420 that has good accessibility to transport has a high value on this index. 51 52 00:04:39,500 --> 00:04:41,450 So looking at our histogram again. 52 53 00:04:41,660 --> 00:04:49,700 So what we probably want is we want this histogram to reflect these index values instead of this automatic 53 54 00:04:49,870 --> 00:04:50,930 binning. 54 55 00:04:51,410 --> 00:04:56,300 We want to show these index values and we don't want to bin several of the indexed values together, 55 56 00:04:57,270 --> 00:05:03,470 and that's because the data in this RAD feature already has pretty much our bins mapped out for us. 56 57 00:05:03,590 --> 00:05:05,960 So we're gonna use these. 57 58 00:05:05,960 --> 00:05:13,760 I can modify the histogram code right here to take this into account simply by adding the bins argument 58 59 00:05:14,840 --> 00:05:18,660 and setting it equal to the value 24. 59 60 00:05:18,710 --> 00:05:24,380 Now let me refresh my histogram. Voila! All right. 
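A sketch of the fix just described: the same histogram call with `bins=24`. The stand-in `RAD` data and the styling arguments are assumptions, as before.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the notebook's data (only the counts for 7 and 24 come from the lesson).
rad_counts = {1: 20, 2: 24, 3: 38, 4: 110, 5: 115, 6: 26, 7: 17, 8: 24, 24: 132}
data = pd.DataFrame(
    {"RAD": [value for value, n in rad_counts.items() for _ in range(n)]}
)

plt.figure(figsize=(10, 6))
# 24 equal-width bins between the smallest (1) and largest (24) index value.
counts, bin_edges, _ = plt.hist(data["RAD"], bins=24, ec="black", color="#7b1fa2")
plt.xlabel("Accessibility to Highways")
plt.ylabel("Nr. of Houses")

# Each distinct index value now lands in a bin of its own.
print(int((counts > 0).sum()), "non-empty bins out of", len(counts))
```

This works here because the index runs from 1 to 24, so 24 bins are narrow enough that no two index values share a bar. The bins between 8 and 24 simply stay empty.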
60 61 00:05:24,410 --> 00:05:31,650 So that completes the challenge. We plotted our histogram for the RAD feature and what we can see is 61 62 00:05:31,650 --> 00:05:37,010 that there's quite a few properties in the 1 to 7 range on the index. 62 63 00:05:37,180 --> 00:05:43,970 And there's also a whole bunch of properties for the value 24 on the index. But, you know what this 63 64 00:05:43,970 --> 00:05:54,600 histogram kind of looks like? It looks like a bar chart, and a bar chart is a histogram's cousin. Histograms 64 65 00:05:54,740 --> 00:05:58,440 and bar charts can be used to pretty much show the same information. 65 66 00:05:58,460 --> 00:06:06,020 So let me show you the Python code for creating a bar chart using matplotlib as well. 66 67 00:06:06,020 --> 00:06:12,770 This is another data visualization technique that's really handy to have in your tool belt. So I'm going to 67 68 00:06:12,770 --> 00:06:20,800 come down here, add a few more cells and I'm gonna make use of this value_counts method. So I'm going to copy 68 69 00:06:20,890 --> 00:06:31,450 this line of code and I'm going to store the output, the result from this code, in a variable called frequency. 69 70 00:06:34,790 --> 00:06:39,550 "frequency = data[ 70 71 00:06:39,550 --> 00:06:44,010 'RAD'].value_counts()". 71 72 00:06:44,050 --> 00:06:48,390 Now, frequency is also a pandas series. 72 73 00:06:48,460 --> 00:06:58,440 You can see this if I write the code "type(frequency)" and hit Shift+Enter. So data['RAD'] 73 74 00:06:58,770 --> 00:07:06,190 is a series, but the return value of this value_counts method is also a series. 74 75 00:07:07,080 --> 00:07:12,840 And the reason I'm showing you this is because I want to draw your attention to something. I'm going to comment 75 76 00:07:12,840 --> 00:07:19,650 this out, and what I want to do is I want to access these values right here. 76 77 00:07:19,710 --> 00:07:25,310 I just want to access the labels for these unique index values.
77 78 00:07:25,750 --> 00:07:27,790 I can do this in one of two ways. 78 79 00:07:27,820 --> 00:07:33,760 Check it out. If I say frequency.index, 79 80 00:07:36,500 --> 00:07:41,210 then I'll get a collection of all these index values in my series. 80 81 00:07:41,540 --> 00:07:47,020 So this is one way of doing it. I'm going to comment this out and I'll show you the second way. 81 82 00:07:48,910 --> 00:07:52,500 "frequency.axes[ 82 83 00:07:52,540 --> 00:08:00,470 0]". If I hit Shift+Enter, then I get exactly the same result. 83 84 00:08:01,910 --> 00:08:11,350 The axes attribute of the series can also be used to retrieve the row axes labels. And the reason I'm 84 85 00:08:11,350 --> 00:08:18,430 interested in these in the first place is because we're going to use these to label the x axis on the 85 86 00:08:18,430 --> 00:08:20,950 bar chart that we're going to create. 86 87 00:08:20,950 --> 00:08:23,100 So check it out. I'm going to comment this out 87 88 00:08:23,560 --> 00:08:31,660 and then to create the bar chart I'm going to take my matplotlib object, "plt.bar()", 88 89 00:08:32,260 --> 00:08:34,900 and then I have to supply two things. 89 90 00:08:34,930 --> 00:08:37,720 The first is what I want on the x axis. 90 91 00:08:37,720 --> 00:08:43,860 And this is gonna be "frequency.index". 91 92 00:08:44,090 --> 00:08:50,020 And the second thing I have to supply for the bar chart is the height of the individual bars. 92 93 00:08:50,050 --> 00:08:54,210 So this will be an argument called height and I'm going to set that equal to, 93 94 00:08:54,730 --> 00:09:00,040 well this would just be the values inside my frequency variable. 94 95 00:09:00,190 --> 00:09:01,890 That'll be these values here. 95 96 00:09:02,780 --> 00:09:03,690 So I'm going to say "height= 96 97 00:09:03,710 --> 00:09:12,670 frequency" and then we put "plt.show()" afterwards and scroll down and hit Shift+Enter. 97 98 00:09:12,700 --> 00:09:22,060 And this is what we get. 
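The steps so far can be sketched in one cell: store the counts, read the labels in the two ways shown, and pass both to `plt.bar()`. The `data` DataFrame is a stand-in with assumed counts (only those for 7 and 24 are from the lesson).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the notebook's data (only the counts for 7 and 24 come from the lesson).
rad_counts = {1: 20, 2: 24, 3: 38, 4: 110, 5: 115, 6: 26, 7: 17, 8: 24, 24: 132}
data = pd.DataFrame(
    {"RAD": [value for value, n in rad_counts.items() for _ in range(n)]}
)

frequency = data["RAD"].value_counts()
print(type(frequency))    # a pandas Series, just like data["RAD"] itself
print(frequency.index)    # the labels: the distinct RAD values
print(frequency.axes[0])  # the same labels, reached through the axes attribute

plt.figure()
# x positions come from the labels, bar heights from the counts.
bars = plt.bar(frequency.index, height=frequency)
# In a notebook, finish the cell with plt.show()
```

Because the x positions are the index values themselves, the bar for 24 sits far to the right of the others, just as it did in the histogram.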
As it is, there's no labels on the axes and there's also the default color 98 99 00:09:22,070 --> 00:09:23,360 being used. 99 100 00:09:23,360 --> 00:09:29,210 So what I'm going to do is I'm going to make this bar chart a little larger. 100 101 00:09:29,230 --> 00:09:34,470 Let me grab this code up here that we have, come down here, 101 102 00:09:34,480 --> 00:09:36,260 paste it in. 102 103 00:09:36,420 --> 00:09:43,830 I'm going to delete this line here and then I'm going to leave my x label and y label as they are and 103 104 00:09:43,830 --> 00:09:46,680 hit Shift+Enter. 104 105 00:09:46,850 --> 00:09:48,190 There we go. 105 106 00:09:48,200 --> 00:09:54,470 So this is a bar chart, but I want to draw your attention to one thing. The neat thing about the code 106 107 00:09:54,470 --> 00:10:00,560 we've just written is that we haven't had to specify the number of bins ahead of time, 107 108 00:10:00,560 --> 00:10:03,840 we haven't had to write "bins=24". 108 109 00:10:04,070 --> 00:10:08,280 We haven't had to hard code the number 24 for the number of bins. 109 110 00:10:08,330 --> 00:10:17,310 Instead we wrote some Python code using value_counts which figured out the best way to draw the x and 110 111 00:10:17,310 --> 00:10:20,960 y axes for our bar chart for us. 111 112 00:10:21,220 --> 00:10:26,260 So this is a technique that you can apply to other types of indexed data as well. 112 113 00:10:26,320 --> 00:10:33,060 It makes the code that we've just written a lot more flexible than hard coding particular integer values. 113 114 00:10:33,160 --> 00:10:35,100 And that's a good thing. 114 115 00:10:35,380 --> 00:10:41,860 You're also gonna be looking at this chart here and you might be thinking: Hmmm this looks a lot better 115 116 00:10:41,980 --> 00:10:45,990 than the histogram just because it's got these spaces in between the bars. 
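One way to make the "no hard-coded bins" point concrete is to wrap the bar chart code in a small function and reuse it on a different column. The helper name `plot_value_counts`, its default colour, the labels and the made-up second series are all hypothetical; they are not part of the lesson's notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_value_counts(series, xlabel, color="#7b1fa2"):
    """Draw one bar per distinct value in `series` (hypothetical helper)."""
    frequency = series.value_counts()
    plt.figure(figsize=(10, 6))
    bars = plt.bar(frequency.index, height=frequency, color=color)
    plt.xlabel(xlabel)
    plt.ylabel("Nr. of Houses")
    return bars


# Stand-in RAD column (only the counts for 7 and 24 come from the lesson).
rad_counts = {1: 20, 2: 24, 3: 38, 4: 110, 5: 115, 6: 26, 7: 17, 8: 24, 24: 132}
rad = pd.Series([value for value, n in rad_counts.items() for _ in range(n)])
rad_bars = plot_value_counts(rad, "Accessibility to Highways")

# The same code on a made-up index with three levels: nothing to re-tune.
other = pd.Series([0, 1, 1, 2, 2, 2])
other_bars = plot_value_counts(other, "Made-up index")

print(len(rad_bars), len(other_bars))
```

The number of bars follows the data in both calls, which is the flexibility the lesson is pointing at.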
116 117 00:10:46,300 --> 00:10:51,520 Because if we look at our histogram, it kind of looks like this at the moment - all the bins, all the bars 117 118 00:10:51,850 --> 00:10:54,100 are jam packed together. 118 119 00:10:54,130 --> 00:10:59,000 So let me give you a little challenge so you can familiarize yourself with the histogram function 119 120 00:10:59,020 --> 00:11:01,180 a little better as well. 120 121 00:11:01,180 --> 00:11:06,970 I want you to modify this histogram so that it's also got some spaces between the bars. 121 122 00:11:06,970 --> 00:11:13,180 The trick will be to look at the documentation by, say, pulling up the quick documentation in the notebook 122 123 00:11:13,540 --> 00:11:21,690 and looking for the right argument to supply to the function call. You can pull up the quick documentation 123 124 00:11:21,690 --> 00:11:27,780 by pressing Shift and then Tab on your keyboard and hitting this little plus sign and scrolling down 124 125 00:11:27,870 --> 00:11:29,200 and taking a look at this 125 126 00:11:29,290 --> 00:11:35,100 here. I'll give you a few seconds to pause the video so you can find the parameter that you have to modify 126 127 00:11:35,490 --> 00:11:40,340 to give the bars a little bit more breathing room. 127 128 00:11:40,350 --> 00:11:44,920 How did you get on? Did you solve it? Here's the solution. 128 129 00:11:44,920 --> 00:11:53,260 The argument that we need to specify in this function call is "rwidth". By default, 129 130 00:11:53,260 --> 00:11:55,480 this has the value None. 130 131 00:11:55,480 --> 00:11:59,290 But let's check out what the description says for rwidth. 131 132 00:12:03,080 --> 00:12:09,570 If I scroll down in the quick documentation I can see that rwidth is an optional argument and that 132 133 00:12:09,570 --> 00:12:18,500 it is a number that specifies the relative width of the bars as a fraction of the total bin width.
And 133 134 00:12:18,510 --> 00:12:19,990 the first time I read this, 134 135 00:12:20,490 --> 00:12:23,600 that didn't make a whole lot of sense to me. 135 136 00:12:23,610 --> 00:12:31,030 So what I had to do is try out a couple of different numbers and see how the chart turned out. 136 137 00:12:31,080 --> 00:12:42,030 So if we write "rwidth = 1" and hit Shift+Enter and see what we get, no change. But if we change 137 138 00:12:42,030 --> 00:12:50,590 that to say 0.5 and hit Shift+Enter, our histogram starts looking like this. 138 139 00:12:51,690 --> 00:12:59,310 So what this rwidth argument is doing if it's set to 0.5, our bar width will be approximately 139 140 00:12:59,490 --> 00:13:09,420 0.5 and on either side of the bar we'll have a space of 0.25. If we make this 140 141 00:13:09,570 --> 00:13:19,780 0.7, the gaps will get smaller and if we make this 0.3 then the gaps will 141 142 00:13:19,780 --> 00:13:29,760 get wider. So, in essence, you can add a value between 0 and 1 to this rwidth argument and you'll get 142 143 00:13:29,760 --> 00:13:30,660 different results. 143 144 00:13:31,120 --> 00:13:40,670 If I put in the value 10 then I get exactly the same as if I put in the value 1. All good? I'm going to 144 145 00:13:40,670 --> 00:13:44,110 leave it at 0.5. Cool. 145 146 00:13:44,660 --> 00:13:51,470 So we've looked at the average number of rooms per dwelling, we've looked at access to radial highways 146 147 00:13:52,130 --> 00:13:58,400 and we've looked at the property prices in our visualizations. So both the number of rooms and the house 147 148 00:13:58,400 --> 00:14:01,540 prices were quite easy to understand, right? 148 149 00:14:01,670 --> 00:14:07,820 Measuring how good the transport links were on the other hand was a little bit more complex given that 149 150 00:14:08,090 --> 00:14:15,800 it was measured as an index value with accessibility to radial highways. 
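The `rwidth` experiment can also be checked numerically instead of by eye. This sketch measures the drawn bar width against the bin width for a few `rwidth` values. The helper `bar_width` and the stand-in `RAD` data are assumptions, and the behaviour for values above 1 is simply what the lesson observed, not something to rely on as a guarantee.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the notebook's data (only the counts for 7 and 24 come from the lesson).
rad_counts = {1: 20, 2: 24, 3: 38, 4: 110, 5: 115, 6: 26, 7: 17, 8: 24, 24: 132}
data = pd.DataFrame(
    {"RAD": [value for value, n in rad_counts.items() for _ in range(n)]}
)


def bar_width(rwidth):
    """Return (drawn bar width, bin width) for a 24-bin histogram of RAD (hypothetical helper)."""
    plt.figure()
    _, bin_edges, patches = plt.hist(data["RAD"], bins=24, rwidth=rwidth)
    plt.close()
    return patches[0].get_width(), bin_edges[1] - bin_edges[0]


for rwidth in (1, 0.7, 0.5, 0.3, 10):
    width, bin_width = bar_width(rwidth)
    print("rwidth =", rwidth, "-> bar covers", round(width / bin_width, 2), "of its bin")
```

Note that the widths are fractions of the bin width, not absolute units on the x axis: with `rwidth=0.5` the bar fills half of its bin and the remaining half is split into a gap on each side.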
But there's actually another 150 151 00:14:15,950 --> 00:14:24,470 very nifty technique that the researchers are using to capture some information about these Boston Properties. 151 152 00:14:24,800 --> 00:14:31,340 You see, there's a river running through Boston and this river is called the Charles River and it looks 152 153 00:14:31,340 --> 00:14:39,700 something like this. Imagine for a second that you were conducting the original research and collating 153 154 00:14:39,880 --> 00:14:41,790 the Boston housing data. 154 155 00:14:42,100 --> 00:14:47,920 You want to be able to differentiate between the houses that are located right on the river and those 155 156 00:14:47,920 --> 00:14:50,210 that are located elsewhere. 156 157 00:14:50,230 --> 00:14:52,110 How would you go about doing this? 157 158 00:14:53,510 --> 00:14:57,900 And this brings us to our next challenge. And for this challenge, 158 159 00:14:57,900 --> 00:15:01,530 I want you to answer a very, very simple question. 159 160 00:15:01,800 --> 00:15:10,320 Tell me, out of the 506 properties in the dataset, how many properties are located on the Charles River? 160 161 00:15:10,410 --> 00:15:13,820 This challenge isn't going to be about data visualization. 161 162 00:15:13,920 --> 00:15:18,930 I just need a cold hard number from you. To solve this challenge, 162 163 00:15:18,930 --> 00:15:25,830 take a close look at the description of the features and then write a single line of code that will 163 164 00:15:25,830 --> 00:15:28,150 spit out the answer for you. 164 165 00:15:28,380 --> 00:15:30,810 And also while you're at it, have a think 165 166 00:15:30,840 --> 00:15:38,400 if you expect that the properties on the river will be worth more or less than properties that are away 166 167 00:15:38,400 --> 00:15:41,820 from the river. Is living next to the Charles River 167 168 00:15:41,880 --> 00:15:45,950 a good thing for house prices? Because we'll find out later. 
168 169 00:15:46,200 --> 00:15:54,370 In the meantime, I'll give you a few seconds to pause the video so you can solve this challenge. 169 170 00:15:54,520 --> 00:15:58,390 Did you get it? Here is the solution. 170 171 00:15:58,470 --> 00:16:06,650 So the trick was looking for the feature description that would likely contain the answer, and you may have 171 172 00:16:06,660 --> 00:16:15,030 discovered that there is a feature called CHAS and this is the Charles River dummy variable 172 173 00:16:15,540 --> 00:16:21,080 which equals 1 if the tract bounds the river and 0 otherwise. 173 174 00:16:21,240 --> 00:16:27,410 In other words, CHAS captures whether the property is on the river or not. 174 175 00:16:27,410 --> 00:16:30,840 Now let's scroll back down and write the Python code. 175 176 00:16:31,020 --> 00:16:36,370 We're going to be using our old friend value_counts to solve this. 176 177 00:16:36,400 --> 00:16:46,890 If I write "data['CHAS'].value_counts()" and hit 177 178 00:16:46,890 --> 00:16:50,570 Shift+Enter I'm going to get the following output. 178 179 00:16:50,570 --> 00:17:00,050 I can see here that CHAS only has one of two values, 0 or 1, which ties out exactly with what they've 179 180 00:17:00,050 --> 00:17:01,550 said in the description. 180 181 00:17:01,550 --> 00:17:07,440 0 means not on the river and 1 means located on the Charles River. 181 182 00:17:07,460 --> 00:17:16,020 So the answer to the challenge's question is there are 35 properties on the river. This type 182 183 00:17:16,020 --> 00:17:25,590 of feature is called a dummy variable and you'll find researchers using dummy variables to capture binary 183 184 00:17:25,770 --> 00:17:27,000 information. 184 185 00:17:27,060 --> 00:17:31,770 So this is a good example - is the property on the river or not on the river? 185 186 00:17:31,770 --> 00:17:33,890 Are we dealing with a man or a woman? 186 187 00:17:33,900 --> 00:17:36,030 Are they unemployed or employed?
187 188 00:17:36,030 --> 00:17:37,140 Is it a homeowner 188 189 00:17:37,140 --> 00:17:38,530 or are they renting? 189 190 00:17:38,550 --> 00:17:42,940 This is the kind of information that you can capture with dummy variables. 190 191 00:17:43,000 --> 00:17:48,150 In other words, working with dummy variables is actually very similar to working with an index, except 191 192 00:17:48,150 --> 00:17:51,780 that a dummy variable can only have one of two values. 192 193 00:17:51,780 --> 00:17:52,620 Good stuff. 193 194 00:17:52,620 --> 00:17:56,120 So we're really getting into the nitty gritty. In the next lessons 194 195 00:17:56,250 --> 00:18:02,250 we're gonna be looking at descriptive statistics, outliers and scatter plots. 195 196 00:18:02,250 --> 00:18:03,150 I'll see you there.
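The CHAS challenge from this lesson can be sketched as follows. The `CHAS` column here is a stand-in built from the two figures quoted in the lesson (35 riverside properties out of 506), not the real dataset.

```python
import pandas as pd

# Stand-in CHAS column: 1 = tract bounds the Charles River, 0 = it does not.
data = pd.DataFrame({"CHAS": [1] * 35 + [0] * (506 - 35)})

# The one-liner from the lesson: how many rows carry each value.
print(data["CHAS"].value_counts())

# Because a dummy variable only holds 0 or 1, its sum is the number of 1s
# and its mean is the share of rows that are 1.
on_river = data["CHAS"].value_counts().loc[1]
print(on_river, int(data["CHAS"].sum()), round(data["CHAS"].mean(), 3))
```

The sum and mean shortcuts only work because the column is coded as 0 and 1. For an index such as RAD, `value_counts()` is still the way to count each level.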