1 00:00:00,210 --> 00:00:06,240 Now for our next comparison, what we might do is try to combine a couple of independent variables, such 2 00:00:06,240 --> 00:00:13,200 as age and thalach, which is maximum heart rate, and then compare them to our target variable, heart disease. 3 00:00:13,200 --> 00:00:23,540 So if we have a look back at our DataFrame, what we might compare is age, thalach and target, all in 4 00:00:23,540 --> 00:00:24,500 one. 5 00:00:24,530 --> 00:00:31,700 And now if we go to check our different values for thalach (now, the reason why I know this is max heart 6 00:00:31,700 --> 00:00:34,650 rate is something we'll come back to in a second). 7 00:00:35,000 --> 00:00:37,250 So we can see there's a lot of different values here. 8 00:00:37,310 --> 00:00:43,880 So the length is 91. If you call value_counts() on any column and the length comes up as some number, that means 9 00:00:43,880 --> 00:00:46,660 that there's that many different values in that column. 10 00:00:46,670 --> 00:00:51,470 So what that might be telling us is, because there's so many different values, looking at it on a bar 11 00:00:51,470 --> 00:00:57,710 graph may not be the best type of graph. If we look at this plot, there's only four columns here 12 00:00:57,740 --> 00:00:59,240 and that's pretty interpretable. 13 00:00:59,240 --> 00:01:02,360 But when there's 91, that might be a bit harder to look at. 14 00:01:02,360 --> 00:01:04,800 So that's where we might bring in something like a scatter graph. 15 00:01:05,300 --> 00:01:10,980 Let's go back to our data dictionary and have a look at what thalach is. Come up here. 16 00:01:12,730 --> 00:01:20,300 thalach: maximum heart rate achieved. And again, we're really just diving in, reading our data dictionary 17 00:01:20,540 --> 00:01:20,870 here.
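As a rough sketch of the value_counts() check described above: the tiny DataFrame here is made-up stand-in data (an assumption, not the real heart disease dataset), and it only shows that the length of the result equals the number of distinct values in the column.

```python
import pandas as pd

# Made-up stand-in for the heart disease DataFrame (not the real data).
df = pd.DataFrame({"thalach": [150, 162, 150, 171, 162, 150, 143]})

# How many times each distinct max heart rate value appears.
counts = df["thalach"].value_counts()
print(counts)

# The length of the value_counts() result is the number of distinct values.
n_unique = len(counts)
print(n_unique)
```

With many distinct values (91 in the video), a bar per value gets hard to read, which is the argument for a scatter plot instead.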
18 00:01:20,870 --> 00:01:25,850 You might just go through the different columns, read them, see which ones stand out to you and try 19 00:01:25,850 --> 00:01:32,660 to compare a few of them with the target variable. And remember, if we come back to our DataFrame, these 20 00:01:32,660 --> 00:01:36,230 are often referred to as features or independent variables. 21 00:01:36,230 --> 00:01:39,980 And this is often referred to as a target or dependent variable. 22 00:01:39,980 --> 00:01:45,160 So that's what we're doing here: we're comparing independent variables to our target variable. 23 00:01:45,170 --> 00:01:49,670 This might seem a little as if we're jumping all over the place. I know I've said this before; it's because 24 00:01:49,670 --> 00:01:55,580 we are. We are literally just picking columns here that spark our interest. 25 00:01:55,580 --> 00:01:59,750 This one sparked my interest and we're trying to figure out how they relate to the target. 26 00:02:00,140 --> 00:02:01,710 That's what we're doing. 27 00:02:01,730 --> 00:02:03,480 You may choose a couple of other columns. 28 00:02:03,560 --> 00:02:06,770 I'm just going through the ones that look most interesting to me. 29 00:02:07,610 --> 00:02:17,420 So let's make a little heading, Age vs. Max Heart Rate for Heart Disease, and then we'll turn that into 30 00:02:17,420 --> 00:02:18,080 markdown. 31 00:02:18,230 --> 00:02:19,340 Beautiful. 32 00:02:19,340 --> 00:02:27,590 So to compare three variables such as age, thalach and heart disease, what we might have to do is create a plot with 33 00:02:27,590 --> 00:02:37,860 two plots on it. So let's start off first by creating another figure, plt.figure, and we'll pass it a 34 00:02:37,860 --> 00:02:42,610 figsize of (10, 6). 35 00:02:42,750 --> 00:02:48,200 Then we might do a scatter with the positive examples first.
36 00:02:48,270 --> 00:02:53,520 So just bear with me for a second and we'll talk through what's going on while we're going 37 00:02:53,520 --> 00:02:53,990 through it. 38 00:02:54,020 --> 00:02:56,130 So plt.scatter. 39 00:02:56,340 --> 00:03:05,770 We want to see on this scatter the positive examples, so we can go df.age, or df with age in brackets. 40 00:03:05,850 --> 00:03:07,550 They really mean the same thing. 41 00:03:07,740 --> 00:03:14,790 But I like to do it this way: df.age, and then in here we're going to do df.target 42 00:03:15,010 --> 00:03:15,920 equals 1. 43 00:03:16,140 --> 00:03:20,250 Of course you couldn't do this if your column names had spaces in them. 44 00:03:20,400 --> 00:03:23,940 But in our case none of our column names have spaces. 45 00:03:23,940 --> 00:03:26,110 What this is doing is taking a subset of the dataset. 46 00:03:26,190 --> 00:03:32,190 We're taking the age column from our DataFrame where this condition equals True. 47 00:03:32,190 --> 00:03:34,360 So where the target equals 1. 48 00:03:34,380 --> 00:03:41,130 So if we did that in a cell below here: df.age[df.target == 1]. 49 00:03:41,640 --> 00:03:45,950 So this is going to show us all the values in the age column where target is equal to 1. 50 00:03:45,990 --> 00:03:47,650 So that's what we want. 51 00:03:47,880 --> 00:03:48,530 Beautiful. 52 00:03:48,530 --> 00:03:56,510 And we also want df.thalach, which is maximum heart rate, where df.target equals 1. 53 00:03:57,000 --> 00:03:57,500 Beautiful. 54 00:03:57,510 --> 00:04:00,470 And we're gonna give it a color of salmon. 55 00:04:00,510 --> 00:04:04,980 What would this look like if we called that? Wonderful. 56 00:04:04,980 --> 00:04:08,810 So along here we've got maximum heart rate and we've got age. 57 00:04:08,910 --> 00:04:13,990 So what could you infer from this pretty quickly by just looking at it here?
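Here is a minimal sketch of the subsetting and the first scatter call described above. The eight rows of data are invented for illustration, and the Agg backend line is only there so the sketch runs without a display; neither comes from the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; column names follow the ones used in the video.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 67, 62, 45],
    "thalach": [150, 187, 172, 178, 163, 129, 160, 132],
    "target": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Boolean mask: True on the rows where target equals 1 (heart disease).
has_disease = df.target == 1

# Indexing a column with the mask keeps only the matching rows.
positive_age = df.age[has_disease]
positive_thalach = df.thalach[has_disease]

# Scatter of the positive examples only: age on x, max heart rate on y.
plt.figure(figsize=(10, 6))
plt.scatter(positive_age, positive_thalach, c="salmon")
```

df.age and df["age"] select the same column; the attribute form only works when the column name has no spaces and does not clash with an existing DataFrame attribute.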
58 00:04:14,020 --> 00:04:17,890 Well, you might infer that there's kind of a downward trend. 59 00:04:18,010 --> 00:04:22,720 So if you were to draw a line through there, just a straight line... it's all over the place, right? 60 00:04:23,170 --> 00:04:27,080 But you can kind of see the pattern going down here. 61 00:04:27,400 --> 00:04:29,800 The younger someone is, the higher their max heart rate. 62 00:04:29,800 --> 00:04:31,800 That's the kind of trend that you'd be seeing there. 63 00:04:31,810 --> 00:04:33,360 That makes a little bit of sense, right? Now, 64 00:04:33,360 --> 00:04:37,840 these are positive examples, so because we've got target equals 1, 65 00:04:37,840 --> 00:04:40,490 these are patients with heart disease. 66 00:04:40,490 --> 00:04:43,370 And so now we want the negative examples on the same plot. 67 00:04:43,870 --> 00:04:54,940 So, scatter with negative examples: plt.scatter, then df.age. All we're going to do is just copy 68 00:04:54,940 --> 00:04:59,370 what we've got above, except target equals 0, because they're negative examples. 69 00:04:59,380 --> 00:05:09,610 This is without heart disease, so df.target equals 0. Wonderful. c equals... I'm going to go lightblue. 70 00:05:10,080 --> 00:05:14,610 It's the color scheme we've set up for now. Beautiful. 71 00:05:14,650 --> 00:05:19,710 So now we've got positive and negative examples on there and again they're kind of all over the place. 72 00:05:19,720 --> 00:05:23,980 But if you were to look at the trend, the line kind of goes down for both. 73 00:05:23,980 --> 00:05:27,190 But again you can't really split them. 74 00:05:27,190 --> 00:05:31,180 This is where our machine learning model has to take over from our ability to have a look at this. 75 00:05:31,180 --> 00:05:36,910 So if you were to look at this now... I just put a semicolon there to stop that little matplotlib 76 00:05:36,910 --> 00:05:37,960 output printing out.
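Putting both groups on one figure could look something like the sketch below. Again the data is made up and the Agg backend is an assumption for running outside a notebook; the trailing semicolon only matters in a notebook cell, where it suppresses the printed matplotlib object.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: lets the sketch run without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows, not the real dataset.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 67, 62, 45],
    "thalach": [150, 187, 172, 178, 163, 129, 160, 132],
    "target": [1, 1, 1, 1, 0, 0, 0, 0],
})

plt.figure(figsize=(10, 6))

# Positive examples (target == 1, heart disease) in salmon.
plt.scatter(df.age[df.target == 1], df.thalach[df.target == 1], c="salmon")

# Negative examples (target == 0, no heart disease) in light blue, same axes.
plt.scatter(df.age[df.target == 0], df.thalach[df.target == 0], c="lightblue");

ax = plt.gca()  # handle to the axes both scatters were drawn on
```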
77 00:05:38,710 --> 00:05:43,660 If you were to look at this and try to decipher what's going on here, what was the maximum heart 78 00:05:43,660 --> 00:05:50,530 rate of someone who had or didn't have heart disease, the salmon versus blue dots (blue dots are without 79 00:05:50,530 --> 00:05:52,690 heart disease, salmon are with heart disease), 80 00:05:52,930 --> 00:05:59,230 if you had to decipher this it'd be pretty hard, right? Because these are all so mixed up together. Maybe 81 00:05:59,230 --> 00:06:02,900 you might infer that there's some sort of pattern. 82 00:06:03,130 --> 00:06:09,210 There's a fair bit of a cluster here, there's a big cluster here, but really I can't tell a pattern. 83 00:06:09,220 --> 00:06:13,990 Maybe if I had some more time to look at it I could find something, but this is where machine learning 84 00:06:13,990 --> 00:06:14,870 is going to come into play. 85 00:06:14,920 --> 00:06:18,760 The patterns that you can't necessarily see straight away, 86 00:06:18,760 --> 00:06:20,910 machine learning is going to dive into the data, 87 00:06:20,920 --> 00:06:25,560 it's going to perform some calculations and figure out these patterns that we can't really see. 88 00:06:25,690 --> 00:06:28,830 Again, if we had enough time maybe we'd find something. 89 00:06:28,930 --> 00:06:34,930 But as machine learning engineers or data scientists, we prefer that the algorithm does the job for 90 00:06:34,930 --> 00:06:35,770 us. 91 00:06:35,830 --> 00:06:41,040 So what we might do is add some helpful info just to make this plot a little bit more complete. 92 00:06:41,140 --> 00:06:55,540 So plt.title, Heart Disease in function of Age and Max Heart Rate, then plt.xlabel, down the 93 00:06:55,540 --> 00:07:00,370 bottom is the age, and then plt.ylabel.
94 00:07:00,370 --> 00:07:08,920 This is gonna be max heart rate, and then we'll add a legend so people can understand which dot is which 95 00:07:08,980 --> 00:07:14,350 and they're not just thinking, well, what's going on here with these salmon and light blue dots? Amazing 96 00:07:14,350 --> 00:07:18,150 color scheme, but not sure what's happening. 97 00:07:18,630 --> 00:07:21,780 And we'll put the semicolon there instead of there. 98 00:07:22,000 --> 00:07:22,750 Wonderful. 99 00:07:22,750 --> 00:07:27,360 So that's looking a little bit better, but again, you let me know if you can see some sort of pattern. 100 00:07:27,370 --> 00:07:33,190 But to me all I can really see is a bit of a downward trend, so as someone gets older their maximum 101 00:07:33,190 --> 00:07:38,200 heart rate decreases, which kind of makes sense. Again, when you're exploring a dataset you won't necessarily 102 00:07:38,200 --> 00:07:40,330 always find patterns to begin with. 103 00:07:40,330 --> 00:07:48,030 We're just familiarizing ourselves with what's going on. And so because we've just used age, what we might 104 00:07:48,030 --> 00:07:51,670 check out is what's the distribution of the age. 105 00:07:51,880 --> 00:07:59,160 So to do so: check the distribution of the age column with a histogram. 106 00:08:01,800 --> 00:08:05,980 Now, the distribution is another word for the spread of the data. 107 00:08:06,170 --> 00:08:09,550 So we can kind of see here that there's almost an even spread either side of the middle. 108 00:08:09,750 --> 00:08:14,840 So you may recall what kind of distribution a spread of data like that has. 109 00:08:14,850 --> 00:08:16,580 And if you don't, that's perfectly fine. 110 00:08:16,590 --> 00:08:20,370 It took me a while to understand this as well. It took me a long time, actually. 111 00:08:20,400 --> 00:08:22,630 And I still have to research it.
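The finished scatter plot with its title, axis labels and legend might be sketched as follows. The data is invented, the Agg backend is an assumption, and the legend here is built from label= arguments on each scatter call, which is one common way to do it rather than necessarily what was typed in the video.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: lets the sketch run without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows, not the real dataset.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 67, 62, 45],
    "thalach": [150, 187, 172, 178, 163, 129, 160, 132],
    "target": [1, 1, 1, 1, 0, 0, 0, 0],
})

plt.figure(figsize=(10, 6))
plt.scatter(df.age[df.target == 1], df.thalach[df.target == 1],
            c="salmon", label="Disease")
plt.scatter(df.age[df.target == 0], df.thalach[df.target == 0],
            c="lightblue", label="No Disease")

# Helpful info so the plot can be read on its own.
plt.title("Heart Disease in function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend();

ax = plt.gca()  # handle used only to inspect the result
```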
112 00:08:22,760 --> 00:08:28,920 So this kind of distribution here is otherwise known as a normal distribution, and if we were to really 113 00:08:29,100 --> 00:08:33,330 see what a normal distribution is, we'd look at this perfect normal distribution. 114 00:08:33,330 --> 00:08:33,960 Let's have a look. 115 00:08:34,380 --> 00:08:42,200 Let's search normal distribution. It's gonna look like a bell curve. See, something like that. 116 00:08:42,890 --> 00:08:48,310 So if we come to ours, ours is very close to that shape but it's kind of swaying towards the right. 117 00:08:48,470 --> 00:08:54,680 So we can see that for most of our population, or most of our samples, their age is around this big middle section 118 00:08:54,680 --> 00:08:55,310 here. 119 00:08:55,310 --> 00:08:59,290 And we don't have that many around the 30 year old age or past, 120 00:08:59,300 --> 00:09:02,410 this might be 80 or something like that. We don't have many past there. 121 00:09:02,510 --> 00:09:08,450 The majority of our dataset are within the 50 to 60 range. 122 00:09:08,560 --> 00:09:12,470 And so this is what you might want to do for a bunch of different columns: check the distributions, 123 00:09:12,470 --> 00:09:19,820 check the spreads. If we did have someone out here that was like 150, that would maybe be some type of 124 00:09:19,940 --> 00:09:25,820 data that we'd have to clean up, because I'm not sure if anyone's ever lived to 150. Or if we had someone 125 00:09:25,820 --> 00:09:29,350 down here that was maybe five or something like that, 126 00:09:29,360 --> 00:09:33,200 we'd also have to think about, okay, is that something we want to include in our dataset? 127 00:09:33,620 --> 00:09:39,260 But this is going to be different column to column, so we're just checking that age here is a normal 128 00:09:39,260 --> 00:09:44,840 distribution. We might do the same for other columns. But the way you sort of think about samples out 129 00:09:44,840 --> 00:09:47,770 here is they're referred to as outliers.
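A small sketch of the distribution check and the outlier idea: the ages below are made up, with a 150 and a 5 added on purpose to mirror the examples mentioned above, and the cut-offs of 10 and 120 are arbitrary assumptions for illustration, not a rule from the video.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: lets the sketch run without a display
import pandas as pd

# Made-up ages; 150 and 5 are planted as obviously suspicious values.
df = pd.DataFrame({"age": [29, 41, 52, 54, 55, 57, 58, 60, 63, 71, 150, 5]})

# Histogram of the age column to look at the spread of the data.
ax = df.age.plot.hist(bins=10)

# Arbitrary plausibility limits (an assumption) to flag possible outliers.
suspicious = df[(df.age < 10) | (df.age > 120)]
print(suspicious)
```

Whether flagged rows get removed, corrected or kept depends on the column and the problem, so a check like this only raises candidates to look at.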
130 00:09:47,990 --> 00:09:51,320 And now how would you tell if there are any outliers? 131 00:09:51,320 --> 00:09:55,860 Well, a distribution plot, this histogram, is one of the best ways to do it. 132 00:09:55,910 --> 00:10:00,200 So if you're getting some weird results in terms of your machine learning models 133 00:10:00,200 --> 00:10:05,060 later on, it may be because there are outliers in the data, and that's where you'll have to check the 134 00:10:05,060 --> 00:10:07,550 distributions of each column. 135 00:10:07,550 --> 00:10:09,310 So that's one way to do it. 136 00:10:09,430 --> 00:10:17,060 Now what we might do is compare another two columns, so if we come up here to our data dictionary, I saw 137 00:10:17,060 --> 00:10:25,040 this one before, the chest pain type, and I thought that would be interesting to see. So we've got here 138 00:10:25,040 --> 00:10:30,950 four different types of chest pain: chest pain related to decreased blood supply to the heart, chest pain 139 00:10:30,950 --> 00:10:34,520 not related to the heart, typically esophageal spasms. 140 00:10:34,520 --> 00:10:40,010 So I guess that's like your esophagus, which is I think that tube from your mouth to your stomach or 141 00:10:40,010 --> 00:10:43,850 something like that. Asymptomatic: chest pain not showing signs of disease. 142 00:10:43,880 --> 00:10:46,100 Okay, so this will be pretty interesting, right, 143 00:10:46,100 --> 00:10:49,970 if we compare this to the target column. 144 00:10:49,970 --> 00:10:57,170 So, chest pain versus target: does chest pain relate to whether or not someone has heart disease? 145 00:10:57,170 --> 00:11:03,230 So let's come down here and make a little heading. 146 00:11:03,350 --> 00:11:07,980 Heart Disease Frequency per 147 00:11:08,240 --> 00:11:17,610 Chest Pain Type. And what we might do actually is just remind ourselves, copy our data dictionary for 148 00:11:17,610 --> 00:11:22,470 chest pain so we can remember what each different value is.
149 00:11:22,470 --> 00:11:32,400 So if we come in here and we'll copy this, Shift and Enter, we'll bring it down and we'll paste it here. 150 00:11:32,420 --> 00:11:38,930 Wonderful, so this is what we're going to compare. We can do that fairly quickly with a pd.crosstab: 151 00:11:38,930 --> 00:11:50,350 df.cp, df.target. Okay, so if we look at this, what would we decipher in here? 152 00:11:50,400 --> 00:11:50,630 Mm hmm. 153 00:11:50,730 --> 00:11:59,870 It seems as the chest pain type goes up, so does whether they have heart disease. But if we get 0 chest pain, 154 00:12:01,530 --> 00:12:04,840 there's a lot more that don't have heart disease than do. 155 00:12:04,890 --> 00:12:12,570 But if we get to 2, there's only 18 with 0, so don't have heart disease, but 69, so over three times 156 00:12:12,570 --> 00:12:14,940 that amount, that do have heart disease. 157 00:12:14,940 --> 00:12:22,850 So this is non-anginal pain, typically non heart related. Well, that's interesting that it's 158 00:12:22,860 --> 00:12:27,060 non heart related. And so these are the type of patterns you'll start to find in your data. Some make 159 00:12:27,060 --> 00:12:27,480 sense. 160 00:12:27,480 --> 00:12:34,170 That doesn't really make sense to me: non heart related pain, yet more people have heart disease with 161 00:12:34,170 --> 00:12:35,880 2 than not. 162 00:12:35,880 --> 00:12:40,350 And so these are the type of things you might be wanting to discuss with a subject matter expert if 163 00:12:40,350 --> 00:12:45,900 you're looking through a dataset, such as us looking through this heart disease dataset, and I'm not a 164 00:12:45,900 --> 00:12:51,030 doctor, I'm not trained medically. I've researched some things about health, but I've got no 165 00:12:51,030 --> 00:12:53,220 idea about chest pain.
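The crosstab comparison can be sketched like this. The eight rows are made up, so the counts will not match the 18 and 69 quoted above; the point is only the shape of the call and how to read the table.

```python
import pandas as pd

# Made-up stand-in rows: cp is chest pain type (0-3), target is 1 for disease.
df = pd.DataFrame({
    "cp": [0, 0, 0, 1, 2, 2, 2, 3],
    "target": [0, 0, 1, 1, 1, 1, 0, 1],
})

# Rows are chest pain types, columns are target values, cells are counts.
ct = pd.crosstab(df.cp, df.target)
print(ct)

# Example lookup: how many rows have chest pain type 2 and target 1.
print(ct.loc[2, 1])
```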
166 00:12:53,220 --> 00:12:57,180 So if I got this dataset off someone from the medical profession and I'm trying to use machine learning to 167 00:12:57,180 --> 00:12:59,920 figure out whether someone has heart disease or not, 168 00:13:00,150 --> 00:13:03,810 and I'm coming across these patterns that I don't really understand, that's where I'd probably reach out 169 00:13:03,810 --> 00:13:09,750 and go 'hey' to a medical professional, or I'd do my own research, such as looking up what these actually 170 00:13:09,750 --> 00:13:17,700 mean (we did that before, defining angina), going, 'Hey, I'm seeing this in the data, but can you shed some 171 00:13:17,700 --> 00:13:19,620 light? It's kind of confusing to me. 172 00:13:19,710 --> 00:13:25,230 Is this correct? Is this not?' That's the kind of insights that we're looking for: both patterns that 173 00:13:25,230 --> 00:13:32,130 make sense to us, such as the normal distribution of age, such as the declining max heart rate as someone 174 00:13:32,130 --> 00:13:38,730 gets older, and ones like this that maybe don't make as much sense. Is heart disease more prevalent in women than in 175 00:13:38,730 --> 00:13:39,810 males? 176 00:13:39,960 --> 00:13:44,400 So this is the type of questions that we're trying to get. We're trying to formulate an idea of the data 177 00:13:44,460 --> 00:13:49,380 by not finding answers but finding questions that we can ask. 178 00:13:49,640 --> 00:13:54,990 And so what we might do is make this a little bit more visual, as we've done before. 179 00:13:54,990 --> 00:14:09,350 Make the crosstab more visual: pd.crosstab, df.cp, df.target. We're just writing what we've got 180 00:14:09,350 --> 00:14:15,560 above there. Then .plot, same plot again. We might do a bar, because there's four different values here versus 181 00:14:15,560 --> 00:14:16,640 two different values there. 182 00:14:16,640 --> 00:14:20,210 So we should be able to look at it fine on a bar graph. 183 00:14:20,810 --> 00:14:23,870 figsize equals (10, 6).
184 00:14:23,870 --> 00:14:25,040 Wonderful. 185 00:14:25,040 --> 00:14:27,410 I'm going to use our favorite colors. 186 00:14:27,410 --> 00:14:36,340 color equals, what are we using, lightblue, salmon. Actually I think the order was the other way round. 187 00:14:36,350 --> 00:14:39,090 We did salmon first, then lightblue. 188 00:14:39,230 --> 00:14:42,360 Doesn't really matter. Beautiful. 189 00:14:42,450 --> 00:14:48,660 Now we're gonna add some communication: plt.title, 190 00:14:53,330 --> 00:14:58,310 Heart Disease Frequency Per Chest Pain Type, 191 00:15:02,470 --> 00:15:14,780 plt.xlabel, and do Chest Pain Type, then on the y is plt.ylabel, Amount. Beautiful. 192 00:15:14,870 --> 00:15:17,030 plt.legend. 193 00:15:17,030 --> 00:15:21,060 We're gonna put on here No Disease, 194 00:15:23,630 --> 00:15:28,310 Disease, and then we'll finish it off with... 195 00:15:28,310 --> 00:15:35,240 Again, we're going to make sure that the x ticks are vertical so it's a bit easier to read. Shift and Enter. 196 00:15:35,510 --> 00:15:39,930 What do we got? pyplot has no attribute... ylabel, we've got a little typo. Beautiful. 197 00:15:40,250 --> 00:15:40,910 There we go. 198 00:15:40,910 --> 00:15:42,010 So that's a bit more visual. 199 00:15:42,010 --> 00:15:45,280 So we're just turning our crosstab into a graph here. 200 00:15:45,290 --> 00:15:49,720 Now this is something that we could take pretty quickly to someone and go, 'Hey, what's going on here?' 201 00:15:49,730 --> 00:15:58,430 Chest pain type 2 has way more counts with the disease than without, and it's supposed 202 00:15:58,430 --> 00:16:03,010 to be non-anginal pain, so typically esophageal... 203 00:16:03,080 --> 00:16:07,600 That word is very hard for me to pronounce. Esophageal spasms, non heart related. 204 00:16:07,600 --> 00:16:10,510 So this is again raising questions from the data.
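Turning the crosstab into a labelled bar chart could look roughly like the sketch below. The data is made up, the Agg backend is an assumption, and rotation=0 is my reading of the tick adjustment mentioned above (labels printed upright rather than on their side).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: lets the sketch run without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows: cp is chest pain type (0-3), target is 1 for disease.
df = pd.DataFrame({
    "cp": [0, 0, 0, 1, 2, 2, 2, 3],
    "target": [0, 0, 1, 1, 1, 1, 0, 1],
})

# One group of bars per chest pain type, one bar per target value.
ax = pd.crosstab(df.cp, df.target).plot(kind="bar",
                                        figsize=(10, 6),
                                        color=["salmon", "lightblue"])

# Add some communication.
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease", "Disease"])  # order follows target columns 0, 1
plt.xticks(rotation=0);
```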
205 00:16:10,520 --> 00:16:14,270 That's what we're trying to do in this exploratory data analysis. 206 00:16:14,270 --> 00:16:14,970 All right. 207 00:16:15,090 --> 00:16:23,090 So we've got a pretty visualization there. What we might do next is check out the correlation between independent 208 00:16:23,090 --> 00:16:26,450 variables and our dependent variable. 209 00:16:26,450 --> 00:16:31,250 So again if we have a look at our DataFrame, this is what you want to be jumping in and out of, right, just 210 00:16:31,250 --> 00:16:34,480 using df.head() to quickly have a snapshot of your DataFrame. 211 00:16:34,490 --> 00:16:39,080 These are our independent variables, to remind ourselves, and we're trying to use them to predict target. 212 00:16:39,080 --> 00:16:45,830 So we'll see how we can compare those, and we're gonna use a correlation matrix, but we'll see that in 213 00:16:45,830 --> 00:16:46,460 the next video.