1 00:00:00,330 --> 00:00:01,270 Beautiful. 2 00:00:01,320 --> 00:00:06,510 Now we've found out a little bit more about our data frame, what we're going to do now is compare different 3 00:00:06,510 --> 00:00:07,970 columns to each other. 4 00:00:08,250 --> 00:00:16,200 And the reason is that this helps us start gaining an intuition about how 5 00:00:16,200 --> 00:00:17,120 the features, 6 00:00:17,130 --> 00:00:21,750 so these columns here, relate to the target variable. 7 00:00:21,960 --> 00:00:24,590 So that's what our machine learning model is eventually going to do. 8 00:00:24,590 --> 00:00:30,360 Right, it's going to search through the different columns of our data frame and then figure out the patterns 9 00:00:31,200 --> 00:00:34,880 as to what target value is associated with the values here. 10 00:00:34,890 --> 00:00:40,590 Now we could do much the same: if you just went through this row by row you would start to gain an intuition 11 00:00:40,920 --> 00:00:45,600 about what values associate with what target value. 12 00:00:45,990 --> 00:00:47,530 But let's stop talking about it. 13 00:00:47,550 --> 00:00:49,390 Let's visualize it. 14 00:00:49,410 --> 00:00:55,830 So what we'll do: the first two columns we might compare are sex to target. 15 00:00:55,830 --> 00:01:02,130 So we go here, frequency according to sex. 16 00:01:02,490 --> 00:01:10,410 So this is what we want to do: we want to compare the sex attribute to the target attribute. 17 00:01:10,410 --> 00:01:16,050 So what we might do is go df.sex.value_counts() here. 18 00:01:16,090 --> 00:01:17,940 There's no real pattern to how we're doing this. 19 00:01:17,940 --> 00:01:21,720 We're just kind of exploring the data. 20 00:01:21,740 --> 00:01:22,340 All right. 21 00:01:22,460 --> 00:01:26,600 So we see here there's a lot more male than female. 22 00:01:26,600 --> 00:01:28,190 So female is 0 here. 
23 00:01:28,250 --> 00:01:34,830 That number, and male is 1. The reason why I know that is if we go back up to our data dictionary we look 24 00:01:34,830 --> 00:01:39,100 here: sex, 1 equals male, 0 equals female. 25 00:01:39,150 --> 00:01:39,980 Wonderful. 26 00:01:39,990 --> 00:01:41,150 So let's come down. 27 00:01:41,460 --> 00:01:45,900 So we know our dataset is a little bit tilted towards having more males than females. 28 00:01:46,650 --> 00:01:55,740 If we wanted to compare the sex column to the target column, a handy function in pandas is 29 00:01:55,740 --> 00:01:56,940 pd.crosstab. 30 00:01:56,940 --> 00:01:58,010 So let's have a look at that. 31 00:01:58,230 --> 00:02:09,010 Compare target column with sex column: pd.crosstab, and we're going to pass it df.target with 32 00:02:09,010 --> 00:02:13,380 df.sex. Beautiful. 33 00:02:13,770 --> 00:02:16,600 So this is gonna give us some information here. 34 00:02:16,710 --> 00:02:18,720 What can we infer from this? 35 00:02:18,720 --> 00:02:24,870 Well, what we could do to begin with, before we even build a single machine learning model, is make a simple 36 00:02:24,870 --> 00:02:25,900 heuristic. 37 00:02:25,920 --> 00:02:29,130 So since there are about 100 women, right. 38 00:02:29,700 --> 00:02:32,720 If we add these up, so we see sex is zero. 39 00:02:32,940 --> 00:02:37,370 And this is comparing the target column to the sex column. 40 00:02:37,440 --> 00:02:45,120 So since there are about 100 women and 72 of them, out of the entire number of women, 41 00:02:45,750 --> 00:02:49,200 have a positive value of heart disease being present. 42 00:02:49,200 --> 00:02:51,510 So see here the target is 1. 43 00:02:51,510 --> 00:02:58,110 So that's an indication that they do have heart disease, and we might infer something based on this one variable 44 00:02:58,620 --> 00:03:00,900 if the participant is a woman. 
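The value counts and cross tab described above can be sketched as follows. This is a minimal sketch, not the lesson's notebook: the data frame is a stand-in rebuilt from the counts quoted in this lesson (72 of 96 women and 93 of 207 men with target 1), and only the column names sex and target come from the video.

```python
import pandas as pd

# Stand-in data frame (assumption): rebuilt from the counts quoted in the lesson,
# keyed as (sex, target) -> number of rows. 24 = 96 - 72 and 114 = 207 - 93.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [(sex, target) for (sex, target), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "target"])

# How many samples of each sex (1 = male, 0 = female)?
sex_counts = df.sex.value_counts()
print(sex_counts)

# Compare the target column with the sex column.
table = pd.crosstab(df.target, df.sex)
print(table)
```

The rows of the resulting table are target values and the columns are sex values, so the cell at row 1, column 0 is the number of women with heart disease.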
45 00:03:00,900 --> 00:03:11,370 So if the sample, if we come back up here, if the sample in our data is a woman, there'd be roughly a 75 46 00:03:11,370 --> 00:03:12,380 percent chance 47 00:03:12,420 --> 00:03:19,890 she has heart disease. The reason being is because here we've taken about 100 or so, again we're just rounding 48 00:03:19,890 --> 00:03:20,670 here. 49 00:03:21,030 --> 00:03:23,930 And if we add these up that's going to equal 96. 50 00:03:24,300 --> 00:03:31,230 But if we see here, just looking at this comparison between sex and target, 72 out 51 00:03:31,230 --> 00:03:32,040 of 96. 52 00:03:32,040 --> 00:03:35,600 So basically 75 out of 100. 53 00:03:35,700 --> 00:03:39,920 So this is what we're inferring before we even build a single machine learning model. 54 00:03:40,170 --> 00:03:45,570 If a woman comes and we're trying to figure out whether she has heart disease or not, based on our existing 55 00:03:45,570 --> 00:03:50,920 data, based on our existing dataset, there's a 75 percent chance that she has heart disease. 56 00:03:51,030 --> 00:03:56,320 Again, remember, that's based on our existing dataset, it might be different in the real world. 57 00:03:56,460 --> 00:04:04,170 And so if we look at male, there's about 200 in total with around half indicating a presence of heart 58 00:04:04,170 --> 00:04:05,120 disease. 59 00:04:05,130 --> 00:04:16,290 So see there, 93, target equals 1, sex equals 1: when the sample is a male, 93 out of 207 indicate 60 00:04:16,290 --> 00:04:18,460 that there is heart disease. 61 00:04:18,540 --> 00:04:27,780 So if we looked at this, if the participant is male we might predict around half the time that participant 62 00:04:27,900 --> 00:04:29,530 will have heart disease. 63 00:04:29,760 --> 00:04:36,320 And then if we average these out we'd get 75 percent plus 50 percent over 2 and you get about a 62.5 64 00:04:36,360 --> 00:04:37,980 percent chance for anyone. 
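The arithmetic above can be checked with a short sketch. It assumes the cross tab counts quoted in the lesson (72 of 96 women and 93 of 207 men with target 1); the variable names are illustrative. It also prints the pooled rate over all samples for comparison, since a plain average of the two rates treats both groups as equally sized even though there are roughly twice as many men.

```python
import pandas as pd

# Cross tab counts quoted in the lesson (rows: target, columns: sex).
table = pd.DataFrame({0: [24, 72], 1: [114, 93]}, index=[0, 1])
table.index.name = "target"
table.columns.name = "sex"

# Share of each sex with target == 1 (heart disease present).
rate_by_sex = table.loc[1] / table.sum()
print(rate_by_sex.round(3))  # female 0.75, male about 0.449

# The video's heuristic: round to 75% and 50%, then take a plain average.
rounded_average = (0.75 + 0.50) / 2
print(rounded_average)  # 0.625

# For comparison: the share of all samples with heart disease.
pooled_rate = table.loc[1].sum() / table.values.sum()
print(round(pooled_rate, 3))  # about 0.545
```

Either number can serve as a rough baseline to beat; the point of the exercise is the intuition, not the exact percentage.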
65 00:04:38,040 --> 00:04:43,260 Of course this is always based on our existing data, because the only patterns that we can find 66 00:04:43,260 --> 00:04:45,650 are in the data that we have. 67 00:04:45,660 --> 00:04:55,080 So based on our existing dataset, this up here, if we were to see a random patient, we're making our decisions 68 00:04:55,080 --> 00:04:58,090 about that random patient, a new patient we haven't seen before. 69 00:04:58,350 --> 00:05:05,370 We're making our decisions based on our existing dataset, and based off this comparison alone we might 70 00:05:05,370 --> 00:05:08,450 infer that there's a 62.5. 71 00:05:08,450 --> 00:05:12,510 Remember, about 75 percent if they're women and 50 percent if they're male. 72 00:05:12,510 --> 00:05:17,190 So we're just adding them together and averaging them, so a 62.5 percent chance that they 73 00:05:17,190 --> 00:05:18,410 have heart disease. 74 00:05:18,420 --> 00:05:20,910 Now this is our very simple baseline. 75 00:05:20,910 --> 00:05:26,430 And what we're trying to do here is just form an intuition in our head about the dataset, about how 76 00:05:26,430 --> 00:05:32,860 different features, in this case we're doing the sex feature, the sex column, compare to the target. 77 00:05:33,000 --> 00:05:38,830 So if we were to just use that one feature alone, when a patient comes to us we'll go, okay, 78 00:05:39,680 --> 00:05:40,770 any patient at all. 79 00:05:40,820 --> 00:05:46,510 There's a 62.5 percent chance that they have heart disease, and now with that baseline what we're going 80 00:05:46,510 --> 00:05:50,120 to try and do is beat it using machine learning. 81 00:05:50,300 --> 00:05:54,930 So again, if this is a little bit confusing, don't worry, it was a little bit confusing when I first started 82 00:05:54,930 --> 00:05:55,760 figuring out patterns. 
83 00:05:55,770 --> 00:06:01,720 But the main thing to remember is all we're doing is just creating an intuition. 84 00:06:01,920 --> 00:06:07,390 We're becoming subject matter experts on the data or at least trying to. 85 00:06:07,820 --> 00:06:17,740 So what we might do is make this a bit more visual, create a plot of crosstab: pd.crosstab, 86 00:06:17,860 --> 00:06:28,320 df.target, df.sex, and then we can go .plot and we'll do it as kind equals bar, we'll give it a 87 00:06:28,320 --> 00:06:34,060 figsize just because we want it to come out, and go 10, 6. 88 00:06:34,320 --> 00:06:39,600 So that's width and height there, and then we'll give it our famous colors that we're going to work with, 89 00:06:39,990 --> 00:06:48,950 which are salmon and lightblue. Wonderful, beautiful. 90 00:06:48,950 --> 00:06:52,660 So this is another visualization that we can use to start to get an idea. 91 00:06:52,890 --> 00:06:56,900 And it shows it a little bit more intuitively than just a cross tab. 92 00:06:56,940 --> 00:07:01,550 So if we look here you've got target, which is zero, no heart disease. 93 00:07:01,570 --> 00:07:03,170 And this is sex, 0, 1. 94 00:07:03,240 --> 00:07:08,310 So we can see that among the people who don't have heart disease there's far more males. 95 00:07:08,520 --> 00:07:13,140 And we can see here that among the people who do have heart disease, there are also more males that have 96 00:07:13,140 --> 00:07:14,010 heart disease. 97 00:07:14,010 --> 00:07:17,520 But let's look at the ratios within each color. 98 00:07:17,520 --> 00:07:18,530 So this one is male. 99 00:07:18,570 --> 00:07:19,700 Blue is male. 100 00:07:19,800 --> 00:07:22,240 The salmon color is female. 101 00:07:22,320 --> 00:07:27,720 If we compare these columns we can see that the ratio of females who do have heart disease to those who don't is about 3 to 1, 102 00:07:27,720 --> 00:07:31,770 if you just compare those visually, that's 3 to 1. 103 00:07:31,800 --> 00:07:32,040 Right. 
104 00:07:32,040 --> 00:07:37,440 So that's where we're getting a three in four chance of a female at random having heart disease, but 105 00:07:37,440 --> 00:07:41,650 for males the ratio is kind of, it's definitely not completely even, 106 00:07:41,730 --> 00:07:45,090 but it's a lot closer than what the females are. 107 00:07:45,240 --> 00:07:45,870 Wonderful. 108 00:07:46,680 --> 00:07:51,480 So if we wanted to add some titles to this we could, we could add some communication here, such as maybe 109 00:07:51,480 --> 00:07:55,450 we go plt.title, 110 00:07:55,830 --> 00:07:56,820 Heart Disease 111 00:07:59,300 --> 00:08:12,290 Frequency for Sex, and then we go plt dot, maybe we add an xlabel, and we go 0 equals no disease, 112 00:08:12,950 --> 00:08:25,820 1 equals disease, and then we might go plt.ylabel, might put amount here, then maybe a legend, might change 113 00:08:25,850 --> 00:08:26,580 this legend. 114 00:08:26,650 --> 00:08:28,100 So we can update that too. 115 00:08:28,160 --> 00:08:32,840 Rather than being 0, 1, if we want to communicate this to someone when we're doing some data analysis 116 00:08:32,840 --> 00:08:40,060 and they don't know what 0 and 1 mean, we want it to say female, male. Actually I'll show you what this does 117 00:08:40,130 --> 00:08:43,420 before we even do it, so let's do that. 118 00:08:43,450 --> 00:08:46,420 We'll add a little semicolon here so we get rid of that. 119 00:08:46,640 --> 00:08:47,760 What are we missing out here? 120 00:08:48,290 --> 00:08:48,950 We got legend, 121 00:08:52,280 --> 00:08:53,480 that should work. 122 00:08:53,480 --> 00:08:53,900 Wonderful. 123 00:08:53,900 --> 00:08:55,840 So that's a little bit more intuitive. 124 00:08:55,870 --> 00:08:59,330 And so this is what I wanted to change here, I wanted to get them vertical. 125 00:08:59,450 --> 00:09:01,980 So what we do is plt.xticks. 126 00:09:02,090 --> 00:09:05,450 These are xticks here, the little ticks that are labeling. 
127 00:09:05,440 --> 00:09:11,540 So we go plt.xticks and then rotation equals zero. 128 00:09:12,590 --> 00:09:16,910 This took us a little while to get set up, but really you might breeze through 129 00:09:16,910 --> 00:09:18,770 it if you're going through it by yourself. 130 00:09:18,830 --> 00:09:23,960 But if we want to communicate it to someone else we'd better design it this way, because we may know 131 00:09:23,960 --> 00:09:30,080 the data ourselves, but if we just want to share this image so someone else has an intuition over the 132 00:09:30,080 --> 00:09:36,920 comparison of who has heart disease depending on what sex they are, they won't know what we know about 133 00:09:36,920 --> 00:09:37,560 the data. 134 00:09:37,610 --> 00:09:41,920 So that's why we're making our visuals as communicative as possible. 135 00:09:41,930 --> 00:09:42,530 All right. 136 00:09:42,700 --> 00:09:47,210 Well, now that we've compared two columns and we've seen how to do it, what we're going to do in the next few 137 00:09:47,210 --> 00:09:49,250 videos is compare a few more. 138 00:09:49,250 --> 00:09:51,180 So take a little break. 139 00:09:51,290 --> 00:09:55,720 Reflect back on what we've gone through here and maybe try to compare two columns of your own. 140 00:09:55,760 --> 00:10:01,430 Usually you compare the target with a single column here and you start to work out the patterns.
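Putting the plotting steps from this video together, the finished chart might look like the sketch below. It is a sketch under assumptions rather than the lesson's exact notebook cell: the data frame is a stand-in rebuilt from the counts quoted in the lesson, and the Agg backend line is only there so the script runs without a display (it can be dropped in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data frame (assumption): (sex, target) -> number of rows, from the lesson's counts.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [(sex, target) for (sex, target), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "target"])

# Bar plot of the cross tab: one group per target value, one bar per sex.
ax = pd.crosstab(df.target, df.sex).plot(kind="bar",
                                         figsize=(10, 6),
                                         color=["salmon", "lightblue"])

# Labels so someone who hasn't seen the data dictionary can read the chart.
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])  # sex 0 = female, sex 1 = male
plt.xticks(rotation=0)          # keep the 0 / 1 tick labels upright
```

In a notebook, ending the last line with a semicolon suppresses the printed return value, which is what the semicolon mentioned in the video is for.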