1
00:00:00,330 --> 00:00:01,270
Beautiful.

2
00:00:01,320 --> 00:00:06,510
Now we've found out a little bit more about our DataFrame, what we're going to do now is compare different

3
00:00:06,510 --> 00:00:07,970
columns to each other.

4
00:00:08,250 --> 00:00:16,200
And the reason is that this helps us start gaining an intuition about how

5
00:00:16,200 --> 00:00:17,120
the features,

6
00:00:17,130 --> 00:00:21,750
so these columns here, relate to the target variable.

7
00:00:21,960 --> 00:00:24,590
So that's what our machine learning model is eventually going to do.

8
00:00:24,590 --> 00:00:30,360
Right, it's going to search through the different columns of our DataFrame and then figure out the patterns

9
00:00:31,200 --> 00:00:34,880
as to what target value is associated with values here.

10
00:00:34,890 --> 00:00:40,590
Now we could do much the same: if you just went through this row by row, you would start to gain an intuition

11
00:00:40,920 --> 00:00:45,600
about what values associate with what target value.

12
00:00:45,990 --> 00:00:47,530
But let's stop talking about it.

13
00:00:47,550 --> 00:00:49,390
Let's visualize it.

14
00:00:49,410 --> 00:00:55,830
So what we'll do is pick the first two columns; we might compare age to target.

15
00:00:55,830 --> 00:01:02,130
So we go here frequency according to sex.

16
00:01:02,490 --> 00:01:10,410
So this is what we want to do: we want to compare the sex attribute to the target attribute.

17
00:01:10,410 --> 00:01:16,050
So what we might do is go df.sex.value_counts() here.
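
As a rough sketch of that call: the real DataFrame isn't available here, so this assumes a hypothetical stand-in `df` with only the sex and target columns, rebuilt from the counts quoted later in this video (96 women with 72 positive, 207 men with 93 positive).

```python
import pandas as pd

# Hypothetical stand-in for the course's DataFrame: only the sex and target
# columns, rebuilt from the counts quoted in this video (sex: 0 = female,
# 1 = male; target: 1 = heart disease). The 24 and 114 are derived as
# 96 - 72 and 207 - 93.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [(sex, target) for (sex, target), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "target"])

# Count how many samples fall into each value of the sex column.
sex_counts = df.sex.value_counts()
print(sex_counts)
```

With these counts the result shows 207 rows for sex 1 and 96 rows for sex 0.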

18
00:01:16,090 --> 00:01:17,940
There's no real pattern of how we're doing this.

19
00:01:17,940 --> 00:01:21,720
We're just kind of exploring the data.

20
00:01:21,740 --> 00:01:22,340
All right.

21
00:01:22,460 --> 00:01:26,600
So we see here there's a lot more male than female.

22
00:01:26,600 --> 00:01:28,190
So female is 0 here.

23
00:01:28,250 --> 00:01:34,830
That number, and male is 1. The reason I know that is, if we go back up to our data dictionary, we look

24
00:01:34,830 --> 00:01:39,100
here: sex, 1 equals male, 0 equals female.

25
00:01:39,150 --> 00:01:39,980
Wonderful.

26
00:01:39,990 --> 00:01:41,150
So let's come down.

27
00:01:41,460 --> 00:01:45,900
So we know our dataset is a little bit tilted towards having more males than females.

28
00:01:46,650 --> 00:01:55,740
If we wanted to compare the sex column to the target column, a handy function in pandas is

29
00:01:55,740 --> 00:01:56,940
pd.crosstab.

30
00:01:56,940 --> 00:01:58,010
So let's have a look at that.

31
00:01:58,230 --> 00:02:09,010
Compare target column with sex column: pd.crosstab, and we're going to pass it df.target with

32
00:02:09,010 --> 00:02:13,380
df.sex. Beautiful.
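
A minimal sketch of that crosstab call, again assuming a hypothetical stand-in `df` rebuilt from the counts quoted in this video rather than the real dataset:

```python
import pandas as pd

# Hypothetical stand-in DataFrame built from the counts quoted in the video
# (sex: 0 = female, 1 = male; target: 1 = heart disease). The 24 and 114
# are derived as 96 - 72 and 207 - 93.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [(sex, target) for (sex, target), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "target"])

# Rows are target values, columns are sex values, cells are sample counts.
table = pd.crosstab(df.target, df.sex)
print(table)
```

Reading the table: row 1 (target equals 1) under column 0 (female) is the number of women with heart disease, and so on for the other three cells.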

33
00:02:13,770 --> 00:02:16,600
So this is gonna give us some information here.

34
00:02:16,710 --> 00:02:18,720
What can we infer from this?

35
00:02:18,720 --> 00:02:24,870
Well, what we could do to begin with, before we even build a single machine learning model, is make a simple

36
00:02:24,870 --> 00:02:25,900
heuristic.

37
00:02:25,920 --> 00:02:29,130
So since there are about 100 women, right?

38
00:02:29,700 --> 00:02:32,720
If we add those up, so we see sex is zero.

39
00:02:32,940 --> 00:02:37,370
And this is comparing the target column to the sex column.

40
00:02:37,440 --> 00:02:45,120
So since there are about 100 women and 72 of them, that is 72 out of the entire number of women,

41
00:02:45,750 --> 00:02:49,200
have a positive value of heart disease being present.

42
00:02:49,200 --> 00:02:51,510
So see here the target is 1.

43
00:02:51,510 --> 00:02:58,110
So that's an indication that they do have heart disease. We might infer, based on this one variable,

44
00:02:58,620 --> 00:03:00,900
if the participant is a woman.

45
00:03:00,900 --> 00:03:11,370
So if the sample, if we come back up here, if the sample in our data is a woman, there'd be roughly a 75

46
00:03:11,370 --> 00:03:12,380
percent chance

47
00:03:12,420 --> 00:03:19,890
she has heart disease. The reason being, here we've taken about 100 or so; again, we're just rounding

48
00:03:19,890 --> 00:03:20,670
here.

49
00:03:21,030 --> 00:03:23,930
And if we add these up, that's going to equal 96.

50
00:03:24,300 --> 00:03:31,230
But if we see here, just looking at this comparison between sex and target, it's 72 out

51
00:03:31,230 --> 00:03:32,040
of 96.

52
00:03:32,040 --> 00:03:35,600
So basically 75 out of 100.

53
00:03:35,700 --> 00:03:39,920
So what we're inferring from this, before we even build a single machine learning model:

54
00:03:40,170 --> 00:03:45,570
If a woman comes and we're trying to figure out whether she has heart disease or not based on our existing

55
00:03:45,570 --> 00:03:50,920
dataset, there's a 75 percent chance that she has heart disease.
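
One way to get those per-sex rates straight out of pandas is the `normalize` argument of `pd.crosstab`. This is only a sketch, assuming a hypothetical stand-in `df` rebuilt from the counts quoted in this video:

```python
import pandas as pd

# Hypothetical stand-in DataFrame built from the counts quoted in the video
# (sex: 0 = female, 1 = male; target: 1 = heart disease). The 24 and 114
# are derived as 96 - 72 and 207 - 93.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [(sex, target) for (sex, target), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "target"])

# normalize="columns" divides each cell by its column total, so each sex
# column sums to 1 and row 1 holds the share with heart disease.
rates = pd.crosstab(df.target, df.sex, normalize="columns")
female_rate = rates.loc[1, 0]  # 72 / 96
male_rate = rates.loc[1, 1]    # 93 / 207
print(rates)
```

Here the female rate comes out at exactly 0.75, which is the 72 out of 96 being rounded to "75 out of 100" in the video.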

56
00:03:51,030 --> 00:03:56,320
Again, remember, that's based on our existing dataset; it might be different in the real world.

57
00:03:56,460 --> 00:04:04,170
And so if we look at male, there's about 200 in total, with around half indicating a presence of heart

58
00:04:04,170 --> 00:04:05,120
disease.

59
00:04:05,130 --> 00:04:16,290
So see there, 93: target equals 1, sex equals 1. When the sample is male, 93 out of 207 indicate

60
00:04:16,290 --> 00:04:18,460
that there is heart disease.

61
00:04:18,540 --> 00:04:27,780
So if we looked at this, if the participant is male, we might predict around half the time that participant

62
00:04:27,900 --> 00:04:29,530
will have heart disease.

63
00:04:29,760 --> 00:04:36,320
And then if we average these out, we'd get 75 percent plus 50 percent, divided by 2, and you get about a 62.5

64
00:04:36,360 --> 00:04:37,980
percent chance for anyone.

65
00:04:38,040 --> 00:04:43,260
Of course, this is always based on our existing data, because the only patterns that we can find

66
00:04:43,260 --> 00:04:45,650
are in the data that we have.

67
00:04:45,660 --> 00:04:55,080
So based on our existing dataset, this up here, if we were to see a random patient, we're making our decisions

68
00:04:55,080 --> 00:04:58,090
about that random patient, a new patient we haven't seen before.

69
00:04:58,350 --> 00:05:05,370
We're making our decisions based on our existing dataset, and based off this comparison alone we might

70
00:05:05,370 --> 00:05:08,450
infer that there's a 62.5...

71
00:05:08,450 --> 00:05:12,510
Remember, about 75 percent if they're women and 50 percent if they're male.

72
00:05:12,510 --> 00:05:17,190
So we're just adding them together and averaging them, so a 62.5 percent chance that they

73
00:05:17,190 --> 00:05:18,410
have heart disease.
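
The arithmetic behind that 62.5 percent can be sketched like this, using the rounded rates from the video. The pooled rate at the end is an extra comparison, not something computed in the video: a plain average treats both groups as equally sized, whereas this dataset has roughly twice as many men as women.

```python
# Rounded per-sex rates as stated in the video.
female_rate = 0.75  # 72 out of 96
male_rate = 0.50    # "around half" of 207 (the exact figure is 93 / 207)

# The baseline from the video: a plain, unweighted average of the two rates.
simple_average = (female_rate + male_rate) / 2

# For comparison: the pooled rate over everyone in the table, which weights
# each group by its size.
pooled_rate = (72 + 93) / (96 + 207)

print(f"simple average: {simple_average:.3f}")
print(f"pooled rate:    {pooled_rate:.3f}")
```

Either number works as a rough baseline to beat; they just answer slightly different questions.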

74
00:05:18,420 --> 00:05:20,910
Now this is our very simple baseline.

75
00:05:20,910 --> 00:05:26,430
And what we're trying to do here is just form an intuition in our head about the dataset, about how

76
00:05:26,430 --> 00:05:32,860
different features (in this case the sex feature, the sex column) compare to the target.

77
00:05:33,000 --> 00:05:38,830
So if we were to use that one feature alone and a patient were to come to us, we'd go, okay,

78
00:05:39,680 --> 00:05:40,770
any patient at all.

79
00:05:40,820 --> 00:05:46,510
there's a 62.5 percent chance that they have heart disease. And now, with that baseline, what we're going

80
00:05:46,510 --> 00:05:50,120
to try and do is beat it using machine learning.

81
00:05:50,300 --> 00:05:54,930
So again, if this is a little bit confusing, don't worry, it was a little bit confusing when I first started

82
00:05:54,930 --> 00:05:55,760
figuring out patterns.

83
00:05:55,770 --> 00:06:01,720
But the main thing to remember is all we're doing is just creating an intuition.

84
00:06:01,920 --> 00:06:07,390
We're becoming subject matter experts on the data or at least trying to.

85
00:06:07,820 --> 00:06:17,740
So what we might do is make this a bit more visual: create a plot of crosstab. pd.crosstab,

86
00:06:17,860 --> 00:06:28,320
df.target, df.sex, and then we can go .plot, and we'll do it as kind equals bar. We'll give it a

87
00:06:28,320 --> 00:06:34,060
figsize, just because we want it to come out, and go 10, 6.

88
00:06:34,320 --> 00:06:39,600
So that's width and height there and then we'll give it our famous color that we're going to work with

89
00:06:39,990 --> 00:06:48,950
which is salmon and lightblue. Wonderful, beautiful.
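
Put together, the plotting call described here might look like the sketch below. It assumes a hypothetical stand-in `df` rebuilt from the counts quoted in this video, and it selects a non-interactive matplotlib backend so it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical stand-in DataFrame built from the counts quoted in the video
# (sex: 0 = female, 1 = male; target: 1 = heart disease). The 24 and 114
# are derived as 96 - 72 and 207 - 93.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [(sex, target) for (sex, target), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "target"])

# One group of bars per target value, one bar per sex within each group.
# figsize is (width, height); colors are applied in column order, so
# sex 0 (female) is salmon and sex 1 (male) is lightblue.
ax = pd.crosstab(df.target, df.sex).plot(
    kind="bar",
    figsize=(10, 6),
    color=["salmon", "lightblue"],
)
```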

90
00:06:48,950 --> 00:06:52,660
So this is another visualization from which we can start to get an idea.

91
00:06:52,890 --> 00:06:56,900
And it shows it a little bit more intuitively than just a cross tab.

92
00:06:56,940 --> 00:07:01,550
So if we look here, you've got target, which is zero: no heart disease.

93
00:07:01,570 --> 00:07:03,170
And this is sex, 0 and 1.

94
00:07:03,240 --> 00:07:08,310
So we can see that among the people who don't have heart disease, there are far more males.

95
00:07:08,520 --> 00:07:13,140
And we can see here that the people who do have heart disease now there are more males that do have

96
00:07:13,140 --> 00:07:14,010
heart disease.

97
00:07:14,010 --> 00:07:17,520
But if we look at the ratios compared to each column.

98
00:07:17,520 --> 00:07:18,530
So this one is male.

99
00:07:18,570 --> 00:07:19,700
Blue is male.

100
00:07:19,800 --> 00:07:22,240
Then the salmon color is female.

101
00:07:22,320 --> 00:07:27,720
If we compare these columns, we can see that for females, those who do have heart disease versus those who don't is about 3 to 1,

102
00:07:27,720 --> 00:07:31,770
if you just compare those visually, that's 3 to 1.

103
00:07:31,800 --> 00:07:32,040
Right.

104
00:07:32,040 --> 00:07:37,440
So that's where we're getting a three in four chance of a female at random having heart disease, but

105
00:07:37,440 --> 00:07:41,650
for males the ratio is kind of... it's definitely not completely even,

106
00:07:41,730 --> 00:07:45,090
but it's a lot closer than what the females are.

107
00:07:45,240 --> 00:07:45,870
Wonderful.

108
00:07:46,680 --> 00:07:51,480
So if we wanted to add some titles to this, we could add some communication here, such as maybe

109
00:07:51,480 --> 00:07:55,450
we go plt.title,

110
00:07:55,830 --> 00:07:56,820
Heart disease

111
00:07:59,300 --> 00:08:12,290
frequency for sex. And then we go plt. maybe we add an xlabel, and we go 0 equals no disease,

112
00:08:12,950 --> 00:08:25,820
1 equals disease. And then we might go plt.ylabel, might put amount here. Then maybe a legend; might change

113
00:08:25,850 --> 00:08:26,580
this legend.

114
00:08:26,650 --> 00:08:28,100
So if we can update that too.

115
00:08:28,160 --> 00:08:32,840
Rather than being 0 and 1. So if we want to communicate this to someone, we're doing some data analysis,

116
00:08:32,840 --> 00:08:40,060
they don't know what 0 and 1 mean; we want it to be female, male. Actually, I'll show you what this does

117
00:08:40,130 --> 00:08:43,420
before we even do it. So let's do that.

118
00:08:43,450 --> 00:08:46,420
We'll add a little semicolon here so we get rid of that.

119
00:08:46,640 --> 00:08:47,760
What are we missing out here?

120
00:08:48,290 --> 00:08:48,950
We got legend

121
00:08:52,280 --> 00:08:53,480
that should work.

122
00:08:53,480 --> 00:08:53,900
Wonderful.

123
00:08:53,900 --> 00:08:55,840
So that's a little bit more intuitive.

124
00:08:55,870 --> 00:08:59,330
And so this is what I wanted to change here: I wanted to get them vertical.

125
00:08:59,450 --> 00:09:01,980
So what we do is plt.xticks.

126
00:09:02,090 --> 00:09:05,450
These are the xticks here, the little ticks that are labeling.

127
00:09:05,440 --> 00:09:11,540
So we go plt.xticks and then rotation equals zero.
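
Here is a sketch of the finished, labeled plot, with the same hypothetical stand-in `df` as before. The exact label strings are an approximation of what is read out in the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in DataFrame built from the counts quoted in the video
# (sex: 0 = female, 1 = male; target: 1 = heart disease). The 24 and 114
# are derived as 96 - 72 and 207 - 93.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}
rows = [(sex, target) for (sex, target), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["sex", "target"])

ax = pd.crosstab(df.target, df.sex).plot(
    kind="bar", figsize=(10, 6), color=["salmon", "lightblue"]
)

# Add the communication: title, axis labels and a readable legend.
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])  # replaces the raw 0 / 1 legend entries
plt.xticks(rotation=0)          # make the tick labels read upright
```

In a notebook, ending the last line with a semicolon suppresses the text output that the final call would otherwise print above the figure.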

128
00:09:12,590 --> 00:09:16,910
This took us a little while to get set up but really if you're going through this you might breeze through

129
00:09:16,910 --> 00:09:18,770
it if you're going through it by yourself.

130
00:09:18,830 --> 00:09:23,960
But if we want to communicate it with someone else, it's better designed in this way, because we may know

131
00:09:23,960 --> 00:09:30,080
the data ourselves but if we just want to share this image so someone else has an intuition over the

132
00:09:30,080 --> 00:09:36,920
comparison of who has heart disease depending on what sex they are, they won't know what we know with

133
00:09:36,920 --> 00:09:37,560
the data.

134
00:09:37,610 --> 00:09:41,920
So that's why we're making our visuals as communicative as possible.

135
00:09:41,930 --> 00:09:42,530
All right.

136
00:09:42,700 --> 00:09:47,210
Well, now that we've compared two columns and seen how to do it, what we're going to do in the next few

137
00:09:47,210 --> 00:09:49,250
videos is compare a few more.

138
00:09:49,250 --> 00:09:51,180
So take a little break.

139
00:09:51,290 --> 00:09:55,720
Reflect back on what we've gone through here and maybe try to compare two columns of your own.

140
00:09:55,760 --> 00:10:01,430
Usually you can compare the target with a single column here and you start to work out the patterns.