0 1 00:00:00,620 --> 00:00:08,480 Seaborn's website looks like this. Seaborn is a Python module that's actually based on matplotlib 1 2 00:00:08,730 --> 00:00:15,970 but it has some nice extra features. You can think of Seaborn like matplotlib on steroids. 2 3 00:00:16,230 --> 00:00:24,300 If matplotlib was Lance Armstrong then Seaborn is still Lance Armstrong. 3 4 00:00:24,450 --> 00:00:27,150 Lance Armstrong on the Tour de France. 4 5 00:00:27,900 --> 00:00:35,160 So let's import the Seaborn module at the top of the notebook as sns. I'm going to write 5 6 00:00:35,190 --> 00:00:45,060 "import seaborn as sns" and hit Shift+Enter. In the cell below the histogram we're gonna make use of one 6 7 00:00:45,060 --> 00:00:46,600 of Seaborn's functions. 7 8 00:00:46,830 --> 00:00:59,760 We're gonna make use of the distplot function, so "sns.distplot(data[ 8 9 00:01:01,270 --> 00:01:13,280 'PRICE'])" and then Enter, "plt.show()". Seaborn's distplot function 9 10 00:01:13,690 --> 00:01:21,320 only needs a single argument, namely the data it's meant to plot, and we're gonna plot the target values and 10 11 00:01:21,410 --> 00:01:23,720 we're gonna see how this compares with 11 12 00:01:23,720 --> 00:01:27,990 matplotlib's histogram function. So let me hit Shift+Enter 12 13 00:01:28,820 --> 00:01:38,210 and here we can see that straight out of the box we get a very beautiful graph. We get both a histogram 13 14 00:01:38,840 --> 00:01:43,730 and an estimate of the probability density function. 14 15 00:01:44,120 --> 00:01:53,230 The probability density function is the darker squiggly line that's superimposed on top of the histogram. 15 16 00:01:53,300 --> 00:02:00,200 This is the line that estimates the distribution of the data and I think this is a neat little addition 16 17 00:02:00,260 --> 00:02:06,190 that Seaborn provides as an add-on to the histogram. 17 18 00:02:06,230 --> 00:02:11,950 Now of course we can make the Seaborn plot look a lot more like matplotlib. 
18 19 00:02:12,140 --> 00:02:14,810 So for starters I could size it differently. 19 20 00:02:14,810 --> 00:02:18,580 So I'm going to take this line of code here. 20 21 00:02:18,650 --> 00:02:20,270 Copy it. 21 22 00:02:20,360 --> 00:02:22,640 Go down here. 22 23 00:02:22,640 --> 00:02:33,440 Paste it, press Shift+Enter, and we can see that it's a bit larger and easier to see. But we can also modify the 23 24 00:02:33,440 --> 00:02:34,510 arguments 24 25 00:02:34,520 --> 00:02:44,390 that distplot takes. So it also has the bins argument that we can give it and I can give it 50. 25 26 00:02:45,770 --> 00:02:56,270 So now this ties out with what we see up here from matplotlib, and then we can also show and hide the two 26 27 00:02:56,300 --> 00:02:57,610 bits of the graph here. 27 28 00:02:57,680 --> 00:03:03,200 The PDF, or probability density function, and the histogram. 28 29 00:03:03,200 --> 00:03:14,790 So if I come up here to the arguments and I put "hist=False", with a capital F, and hit Shift+Enter 29 30 00:03:14,960 --> 00:03:22,610 it hides the histogram and we're just left with the probability density function which is the estimate 30 31 00:03:22,610 --> 00:03:25,700 for the distribution. 31 32 00:03:25,700 --> 00:03:27,470 And of course we can also do it the other way around. 32 33 00:03:27,470 --> 00:03:41,480 So if I write "kde=False" and then change the hist argument to True, then we're hiding the probability 33 34 00:03:41,480 --> 00:03:48,980 density function and just showing the histogram. And in terms of styling you can also grab a different 34 35 00:03:48,980 --> 00:03:49,690 color here. 35 36 00:03:50,130 --> 00:03:56,660 So if I want to go with a yellow palette, for example, I pick this hex code, go back here, 36 37 00:03:58,380 --> 00:04:07,080 and then add the color argument as a keyword and paste the hex code in, hit Shift+Enter and then we'll see 37 38 00:04:07,080 --> 00:04:13,100 it like so. 
Now in retrospect I don't think this was actually a wise choice. 38 39 00:04:13,110 --> 00:04:15,330 This is very, very low contrast. 39 40 00:04:15,420 --> 00:04:23,730 I might have to go for something like this which is a bit more amber and substitute that for the yellow. 40 41 00:04:24,930 --> 00:04:28,180 And this makes the histogram a lot clearer. 41 42 00:04:28,230 --> 00:04:33,230 All right so I think this is a really good time for a challenge. 42 43 00:04:33,300 --> 00:04:42,360 See if you can investigate and visualize two other features of the dataset and I also have a question 43 44 00:04:42,360 --> 00:04:43,350 for you. 44 45 00:04:43,440 --> 00:04:49,350 How many rooms do you think that the average property in Boston has? 45 46 00:04:49,410 --> 00:04:50,030 What do you think? 46 47 00:04:50,580 --> 00:04:54,210 What's the total number of rooms for a property in Boston? 47 48 00:04:54,210 --> 00:04:58,290 So including bedrooms, living rooms, bathrooms, etc. 48 49 00:04:58,400 --> 00:05:00,050 Have you made your guess? 49 50 00:05:00,360 --> 00:05:04,280 Okay, so for this challenge we're gonna be checking the answer. 50 51 00:05:04,280 --> 00:05:07,580 We're gonna have a look at the RM feature. 51 52 00:05:07,730 --> 00:05:15,500 Can you write the Python code to plot this on a histogram and visualize the number of rooms per dwelling 52 53 00:05:16,310 --> 00:05:23,120 and then figure out what the average number of rooms is for a property in Boston, Massachusetts? 53 54 00:05:23,120 --> 00:05:25,280 I think this would be fairly straightforward. 54 55 00:05:25,300 --> 00:05:31,240 You can use either matplotlib or seaborn to generate your histogram but 55 56 00:05:31,430 --> 00:05:35,190 maybe have a think about what you would expect beforehand. 56 57 00:05:35,300 --> 00:05:36,080 Ready? 57 58 00:05:36,690 --> 00:05:38,710 Here's the solution. 
58 59 00:05:39,480 --> 00:05:48,030 So what I'm gonna do is I'm going to take this cell here and I'm going to copy the cell, come down here 59 60 00:05:48,630 --> 00:05:56,870 and "Edit", paste the cell and then what I'm going to do is I'm going to change the code here, so I'm gonna 60 61 00:05:56,970 --> 00:06:00,240 substitute RM for PRICE. 61 62 00:06:00,240 --> 00:06:02,690 I'm going to delete the bins. 62 63 00:06:02,700 --> 00:06:06,590 I think this is no longer what we want. 63 64 00:06:06,600 --> 00:06:17,130 I'm also gonna pick a different color so maybe I'm gonna go for teal. There's a dark teal here, just to differentiate 64 65 00:06:17,250 --> 00:06:25,350 the chart a little bit and then I'm going to change the x label here, because we're no longer looking 65 66 00:06:25,350 --> 00:06:27,550 at the price, we're looking at 66 67 00:06:27,570 --> 00:06:38,910 the number of rooms. In fact this is the average number of rooms. And I'm going to hit Shift+Enter 67 68 00:06:41,700 --> 00:06:51,000 and the chart that I get back looks like this. We can see that the highest bar is around the number 6. 68 69 00:06:51,040 --> 00:06:57,640 In other words, most of the properties in Boston have around 6 rooms. 69 70 00:06:57,710 --> 00:07:04,000 I don't know about you but that actually seems pretty big, because if you think about it, 70 71 00:07:04,000 --> 00:07:06,700 bedroom, living room, kitchen, bathroom, 71 72 00:07:06,700 --> 00:07:09,910 that adds up to 4. If you add another bedroom, 72 73 00:07:09,910 --> 00:07:13,820 then you have 5, but the average is 6. 73 74 00:07:13,840 --> 00:07:20,740 So I don't know if they have a linen closet that they're counting or they have two bathrooms or three bedrooms. 74 75 00:07:20,740 --> 00:07:26,080 Now of course we don't know anything about the size of the rooms, but still I'm going to venture a guess 75 76 00:07:26,080 --> 00:07:30,810 that six rooms on average doesn't seem to be too shabby. 
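The challenge solution described above (swap the column to RM, drop the bins argument, change the color and the x label) might look roughly like this with plain matplotlib. Assumptions: the RM values are synthetic, the teal hex code is a hypothetical pick, and the axis label wording is illustrative rather than copied from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: synthetic values standing in for the notebook's data['RM'].
rng = np.random.default_rng(2)
data = pd.DataFrame({"RM": rng.normal(loc=6.3, scale=0.7, size=506)})

DARK_TEAL = "#00796b"  # hypothetical hex code for illustration

fig, ax = plt.subplots(figsize=(10, 6))
# No bins argument, so matplotlib falls back to its default bin count.
counts, edges, _ = ax.hist(data["RM"], color=DARK_TEAL, edgecolor="black")
ax.set_xlabel("Average number of rooms")
ax.set_ylabel("Number of properties")

# Locate the tallest bar instead of eyeballing it.
tallest = counts.argmax()
print(f"Tallest bar spans {edges[tallest]:.2f} to {edges[tallest + 1]:.2f} rooms")
```

Reading the tallest bar off the returned counts is a small extra on top of what the video does, which is to judge the peak by eye.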
76 77 00:07:31,200 --> 00:07:34,330 I certainly wouldn't mind living somewhere with 6 rooms. 77 78 00:07:34,330 --> 00:07:38,280 That's palatial by London standards. 78 79 00:07:38,360 --> 00:07:44,320 Now we just eyeballed this graph to figure out the average number of rooms. 79 80 00:07:44,440 --> 00:07:50,880 What if we wanted to know an exact number for the average? Well pandas 80 81 00:07:50,890 --> 00:07:53,110 also has you covered for that. 81 82 00:07:53,560 --> 00:08:04,270 So what we can do is come down here and access our RM data series, so "data[' 82 83 00:08:04,360 --> 00:08:17,830 RM']", put a dot after it, and then call the mean() method. So "data['RM'].mean()". 83 84 00:08:18,470 --> 00:08:24,960 So here we're calling the mean method on our Series object and hitting Shift+Enter, 84 85 00:08:25,190 --> 00:08:29,630 we see that the true average is 6.28.
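As a tiny illustration of the mean() call, here is a sketch on a hand-made DataFrame. The four RM values are invented so the arithmetic is easy to check by hand; they are not taken from the Boston dataset, so the result differs from the 6.28 quoted above.

```python
import pandas as pd

# Assumption: four made-up room counts, chosen so the mean is easy to verify.
data = pd.DataFrame({"RM": [5.0, 6.0, 6.5, 7.5]})

# Selecting one column gives a Series; mean() returns its arithmetic average.
average_rooms = data["RM"].mean()
print(average_rooms)  # (5.0 + 6.0 + 6.5 + 7.5) / 4 = 6.25
```

The same call on the notebook's full RM column is what yields the exact average mentioned in the video.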