1 00:00:00,270 --> 00:00:01,530 Welcome back. 2 00:00:01,540 --> 00:00:04,350 And so we've done a whole bunch of plotting so far. 3 00:00:04,560 --> 00:00:09,170 And don't worry if you're a little bit confused at the moment, like, "Here's Daniel showing 4 00:00:09,170 --> 00:00:12,960 me all these plots, I'm looking at different data sets. 5 00:00:13,020 --> 00:00:14,370 How am I going to remember all this?" 6 00:00:14,700 --> 00:00:15,320 It's all right. 7 00:00:15,330 --> 00:00:20,370 It's perfectly fine to feel like that. It's going to take a little bit of practice to start to understand 8 00:00:20,640 --> 00:00:25,760 different kinds of plots and which plot you should use for what, but that's all part of the process, right? 9 00:00:25,800 --> 00:00:30,950 It's just having a little practice, having a go, trying it out and seeing what it looks like. 10 00:00:30,960 --> 00:00:32,130 Does this make sense? 11 00:00:32,610 --> 00:00:37,590 So that's why in this video we're going to do a similar plot to what we've just seen on the car 12 00:00:37,590 --> 00:00:39,980 sales data set, on another data set. 13 00:00:41,370 --> 00:00:44,350 Let's try on another data set. 14 00:00:44,370 --> 00:00:52,430 And if we come up here we've got heart disease dot CSV, which we've seen before, which is a data set that 15 00:00:52,430 --> 00:00:57,440 has different parameters of patients and whether or not they have heart disease. 16 00:00:57,440 --> 00:01:06,370 So let's import that with pd.read_csv, and we're going to call it just heart_disease, and so we'll go: 17 00:01:06,380 --> 00:01:10,680 heart disease dot CSV. Beautiful. 18 00:01:10,760 --> 00:01:16,340 Now we'll have a look at it just to make sure we know what's happening. We'll only view the first five rows 19 00:01:16,340 --> 00:01:19,890 because there are about 300 in this one. Excellent. 20 00:01:19,980 --> 00:01:22,450 We have these parameters, we've seen them before.
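A minimal sketch of the loading step just described, not the exact cell from the video. The file name `heart-disease.csv` is an assumption based on how it is spoken, and a tiny in-memory CSV with made-up rows and a hypothetical subset of columns stands in for it so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course file: made-up rows, not real patient data.
# With the real file you would call pd.read_csv("heart-disease.csv") instead
# (that file name is an assumption, it is only spoken aloud in the video).
csv_text = """age,sex,chol,target
63,1,233,1
37,1,250,1
41,0,204,1
56,1,236,1
57,0,354,1
67,1,286,0
"""

heart_disease = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
first_rows = heart_disease.head()
print(first_rows)
```

Calling `head()` before plotting is a cheap way to confirm the column names and that the file parsed as expected.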
21 00:01:22,450 --> 00:01:24,650 We've got a target column right on the end. 22 00:01:24,690 --> 00:01:32,500 So what we might do is create a histogram just to see the distribution of the age column. 23 00:01:33,270 --> 00:01:43,760 So heart_disease["age"].plot.hist, and we might just see the default number of bins. 24 00:01:43,760 --> 00:01:46,820 Can you remember how many that will be? Shift tab. 25 00:01:46,820 --> 00:01:52,430 Bins is going to be 10. Beautiful, so we'll just leave it at that. OK. 26 00:01:52,660 --> 00:01:55,250 So this is where we're starting to see that curve, right? 27 00:01:55,270 --> 00:01:57,270 This is that normal distribution. 28 00:01:57,280 --> 00:02:02,740 It's very common, so we're kind of getting that curve here. 29 00:02:02,750 --> 00:02:10,150 We would say that most of the patients are around this line here, so maybe late 50s, early 60s, thereabouts. 30 00:02:10,160 --> 00:02:12,500 And what if we changed the number of bins? So let's go: 31 00:02:12,500 --> 00:02:16,280 bins equals 20. OK. 32 00:02:16,280 --> 00:02:23,300 We're starting to lose that curve shape a little bit. With 30, it's still relatively there. We're definitely starting 33 00:02:23,300 --> 00:02:30,920 to see this more pronounced line here, right at the late 50s. Let's jack it up to 50 and see what happens. 34 00:02:30,930 --> 00:02:37,290 All right, so that shape is still relatively there. Our initial shape, what we got from just having 35 00:02:37,290 --> 00:02:42,930 10 bins, is kind of consistent the whole way through, and maybe we'd look at this and go, "Okay, this one's 36 00:02:42,930 --> 00:02:44,300 a bit of an outlier. 37 00:02:44,310 --> 00:02:45,930 These are a bit of an outlier." 38 00:02:45,960 --> 00:02:47,010 Why are they outliers? 39 00:02:47,010 --> 00:02:49,640 Well, because they're so far from the middle. 40 00:02:49,710 --> 00:02:54,750 When you hear the word outlier, that's kind of what it's talking about in statistical terms.
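The bin experiments above can be sketched roughly as follows. This is an illustration under assumptions: the ages are synthetic (drawn from a normal distribution so the snippet runs without the course file), and the `Agg` backend and the `plt.figure()` call per histogram are choices made here so the plots render off-screen and do not overlay each other.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the age column (not the real data set)
rng = np.random.default_rng(0)
heart_disease = pd.DataFrame(
    {"age": rng.normal(loc=55, scale=9, size=300).round()}
)

bar_counts = {}
for bins in (10, 20, 30, 50):  # 10 is the default for plot.hist
    plt.figure()  # fresh figure so each histogram is drawn on its own axes
    ax = heart_disease["age"].plot.hist(bins=bins)
    # Each bin is drawn as one rectangle on the axes
    bar_counts[bins] = len(ax.patches)

plt.close("all")
print(bar_counts)
```

More bins give a finer but noisier picture of the same distribution, which is why the overall shape stays recognisable while individual spikes become more pronounced.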
41 00:02:54,750 --> 00:02:59,270 Roughly, an outlier is a value more than three standard deviations away from the mean. 42 00:02:59,280 --> 00:03:03,900 Now of course that definition will change depending on what data you're working with. 43 00:03:03,990 --> 00:03:06,480 But this is kind of a way that you can visualize it. 44 00:03:07,360 --> 00:03:09,340 So let's take that back to 10. 45 00:03:09,390 --> 00:03:10,830 That looks pretty good. 46 00:03:10,980 --> 00:03:11,780 All right. 47 00:03:11,850 --> 00:03:17,220 Now these kinds of visualizations are the things you'll do whenever you deal with a new data set. 48 00:03:17,220 --> 00:03:18,000 Let's finish. 49 00:03:18,000 --> 00:03:24,760 There's one more type of plot we haven't looked at directly from a data frame, and that's subplots. 50 00:03:24,810 --> 00:03:26,490 So we'll finish that off. 51 00:03:26,490 --> 00:03:31,480 We'll go heart_disease.head(). OK. 52 00:03:31,480 --> 00:03:32,500 This is our data frame. 53 00:03:32,500 --> 00:03:36,850 Now let's plot every single column's histogram in one plot. 54 00:03:36,850 --> 00:03:38,400 Let's see how we would do that. 55 00:03:38,650 --> 00:03:40,260 heart_disease. 56 00:03:40,260 --> 00:03:42,090 So we've got the age histogram here. 57 00:03:42,120 --> 00:03:48,700 But let's just make one for all of these and check the distribution: dot plot dot hist. 58 00:03:48,730 --> 00:03:53,950 So remember, this is just going straight from the entire data frame; up here we selected the age column. 59 00:03:55,370 --> 00:04:04,660 Dot hist, and maybe we want subplots equals True. That makes sense because we want to put it all on the 60 00:04:04,660 --> 00:04:07,530 one plot. Keyword args, 61 00:04:07,540 --> 00:04:08,590 so that's part of that. 62 00:04:08,620 --> 00:04:13,400 Maybe if we could read down a little bit: examples.
63 00:04:13,550 --> 00:04:19,430 So this is one of those scenarios where if you wanted to use this .hist method and have it all on 64 00:04:19,430 --> 00:04:26,090 one plot, because the docstring doesn't really have an example of putting it all together with subplots, 65 00:04:26,120 --> 00:04:32,120 this would take a little bit of research, so you might look up a question like "how to plot a histogram 66 00:04:32,150 --> 00:04:36,970 on subplots from a pandas data frame" and then you'd find this little parameter here. 67 00:04:37,010 --> 00:04:41,450 The only reason I know it is because I've had a little bit of experience plotting subplots from pandas 68 00:04:41,450 --> 00:04:43,050 data frames. 69 00:04:43,050 --> 00:04:49,100 So let's see what this looks like. It might take a little while because we're plotting 10 columns or 15 70 00:04:49,100 --> 00:04:50,940 columns, thereabouts. 71 00:04:51,000 --> 00:04:53,660 Oh, that doesn't look very good, does it? 72 00:04:53,730 --> 00:04:58,400 We've got overlapping plots, we've got legends everywhere, and these things aren't even to 73 00:04:58,400 --> 00:05:04,740 scale, so maybe to fix it up we'll change the figsize. 74 00:05:04,930 --> 00:05:08,490 So what do we want? We want width, height. 75 00:05:08,500 --> 00:05:13,660 We want it to be a bit higher so it stretches it out, because there's hardly any space here. 76 00:05:16,960 --> 00:05:17,270 OK. 77 00:05:17,290 --> 00:05:23,780 That's looking a little bit better, but still you've got columns like this which are just all clumped 78 00:05:23,780 --> 00:05:29,900 up here, because what it's doing is using the same scale for all of them, but in reality these all 79 00:05:29,900 --> 00:05:31,010 have different measures. 80 00:05:31,010 --> 00:05:32,480 That's not very ideal, right?
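A rough sketch of the subplots call discussed above, under stated assumptions: the data frame is synthetic, the column names are hypothetical stand-ins, and the `figsize=(10, 30)` value (width, height in inches) is illustrative since the exact numbers are not spoken in the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in with columns on very different scales (not the real data)
rng = np.random.default_rng(0)
n = 300
heart_disease = pd.DataFrame({
    "age": rng.normal(55, 9, n).round(),
    "chol": rng.normal(245, 50, n).round(),
    "slope": rng.integers(0, 3, n),
})

# One histogram per column, stacked vertically in a single tall figure
axes = heart_disease.plot.hist(subplots=True, figsize=(10, 30))
fig = axes[0].get_figure()

# The stacked subplots share one x-axis, which is why small-valued
# columns end up clumped at the left edge
x_limits = [ax.get_xlim() for ax in axes]

plt.close(fig)
```

Because every subplot uses the same x range, a column with values between 0 and 2 is squeezed into a sliver next to one that runs into the hundreds, which matches the clumping seen in the video.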
81 00:05:32,480 --> 00:05:38,390 It might work out for some, like this one, cholesterol. It might work out for that column, but it doesn't work 82 00:05:38,510 --> 00:05:40,010 at all for the other ones, right? 83 00:05:40,010 --> 00:05:41,740 So slope, oldpeak. 84 00:05:41,750 --> 00:05:43,240 Exactly. 85 00:05:43,340 --> 00:05:44,830 So what can we do? 86 00:05:44,840 --> 00:05:49,030 Well, this code runs but it's not ideal. 87 00:05:49,060 --> 00:05:54,900 There are different scales and it's probably a bit too much going on for this plot to even be meaningful. 88 00:05:55,250 --> 00:05:56,630 But this is a good segue though. 89 00:05:56,960 --> 00:06:00,790 Now we've covered a few simple plots directly on a pandas data frame. 90 00:06:01,340 --> 00:06:05,780 Let's have a look at how we'd make some visualizations using the OO method. 91 00:06:05,800 --> 00:06:11,230 So remember, back in the previous video we talked about the OO method, which is subplots. 92 00:06:11,390 --> 00:06:18,110 Don't worry if it's confusing, we'll have a look in a second, but what we've been using so far is plotting 93 00:06:18,110 --> 00:06:21,830 directly from pandas using this stateless method. 94 00:06:21,830 --> 00:06:23,840 So in the meantime, have a practice. 95 00:06:23,840 --> 00:06:25,040 Keep using the plot method. 96 00:06:25,190 --> 00:06:26,270 Make another plot of your own 97 00:06:26,270 --> 00:06:32,090 before we come into the next video, and I'll see you there. We'll practice using the more flexible object 98 00:06:32,150 --> 00:06:35,690 orientated matplotlib API directly with pandas.