1
00:00:00,210 --> 00:00:06,240
Now for our next comparison what we might do is try to combine a couple of independent variables such

2
00:00:06,240 --> 00:00:13,200
as age and thalach, which is maximum heart rate, and then compare them to our target variable, heart disease.

3
00:00:13,200 --> 00:00:23,540
So if we have a look back at our data frame, what we might compare is age, thalach and target, all in

4
00:00:23,540 --> 00:00:24,500
one.

5
00:00:24,530 --> 00:00:31,700
And now, if we go to check our different values for thalach, and the reason why I know this is max heart
 
6
00:00:31,700 --> 00:00:34,650
rate is something we'll come back to in a second.

7
00:00:35,000 --> 00:00:37,250
So we can see there's a lot of different values here.

8
00:00:37,310 --> 00:00:43,880
So the length is 91. If you call value_counts() on any column and a length comes up, that means

9
00:00:43,880 --> 00:00:46,660
that there are that many different values in that column.
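A minimal sketch of that length check, assuming a tiny made-up data frame (only the column name thalach comes from the video; the numbers are invented):
```python
import pandas as pd
# Made-up stand-in for the heart disease data frame; the values are invented.
df = pd.DataFrame({"thalach": [150, 162, 150, 171, 140, 162, 150]})
counts = df["thalach"].value_counts()  # one row per distinct value
n_distinct = len(counts)               # same number as df["thalach"].nunique()
print(n_distinct)
```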

10
00:00:46,670 --> 00:00:51,470
So what that might be telling us is that, because there are so many different values, looking at it on a bar

11
00:00:51,470 --> 00:00:57,710
graph may not be the best type of graph. If we look at this plot, there are only four columns here,

12
00:00:57,740 --> 00:00:59,240
and that's pretty interpretable.

13
00:00:59,240 --> 00:01:02,360
But when there are 91, that might be a bit harder to look at.

14
00:01:02,360 --> 00:01:04,800
So that's where we might bring in something like a scatter graph.

15
00:01:05,300 --> 00:01:10,980
Let's go back to our data dictionary and have a look at what thalach is. Come up here.

16
00:01:12,730 --> 00:01:20,300
thalach: maximum heart rate achieved. And again, we're really just diving in, reading our data dictionary

17
00:01:20,540 --> 00:01:20,870
here.

18
00:01:20,870 --> 00:01:25,850
You might just go through the different columns, read them, see which ones stand out to you, and try

19
00:01:25,850 --> 00:01:32,660
to compare a few of them with the target variable. And remember, if we come back to our data frame, these

20
00:01:32,660 --> 00:01:36,230
are often referred to as features or independent variables.

21
00:01:36,230 --> 00:01:39,980
And this is often referred to as a target or dependent variable.

22
00:01:39,980 --> 00:01:45,160
So that's what we're doing here we're comparing independent variables to our target variable.

23
00:01:45,170 --> 00:01:49,670
This might seem a little as if we're jumping all over the place. I know I've said this before. It's because

24
00:01:49,670 --> 00:01:55,580
we are. We are literally just picking columns here that spark our interest.

25
00:01:55,580 --> 00:01:59,750
This one sparked my interest and we're trying to figure out how they relate to the target.

26
00:02:00,140 --> 00:02:01,710
That's what we're doing.

27
00:02:01,730 --> 00:02:03,480
You may choose a couple of other columns.

28
00:02:03,560 --> 00:02:06,770
I'm just going through the ones that look most interesting to me.

29
00:02:07,610 --> 00:02:17,420
So let's make a little heading, age versus max heart rate for heart disease, and then we'll turn that into

30
00:02:17,420 --> 00:02:18,080
markdown.

31
00:02:18,230 --> 00:02:19,340
Beautiful.

32
00:02:19,340 --> 00:02:27,590
So to compare three, such as age, thalach and heart disease, what we might have to do is create a plot with

33
00:02:27,590 --> 00:02:37,860
two plots on it. So let's start off by creating another figure, plt.figure, and we'll pass it a figsize
 
34
00:02:37,860 --> 00:02:42,610
of (10, 6).

35
00:02:42,750 --> 00:02:48,200
Then we might do a scatter with the positive examples first.

36
00:02:48,270 --> 00:02:53,520
So just bear with me for a second and we'll talk through what's going on while we're going

37
00:02:53,520 --> 00:02:53,990
through it.

38
00:02:54,020 --> 00:02:56,130
So plt.scatter.

39
00:02:56,340 --> 00:03:05,770
We want to see the positive examples on this scatter, so we can go df.age, or df with age in square brackets.

40
00:03:05,850 --> 00:03:07,550
They really mean the same thing.

41
00:03:07,740 --> 00:03:14,790
But I like to do it this way, df.age, and then in here we're going to do df.target
 
42
00:03:15,010 --> 00:03:15,920
equals 1.

43
00:03:16,140 --> 00:03:20,250
Of course you couldn't do this if your column names had spaces in them.

44
00:03:20,400 --> 00:03:23,940
But in our case none of our column names have spaces.

45
00:03:23,940 --> 00:03:26,110
What this is doing is taking a subset of the dataset.

46
00:03:26,190 --> 00:03:32,190
We're taking the age column from our data frame where this condition equals true.

47
00:03:32,190 --> 00:03:34,360
So where the target equals 1.

48
00:03:34,380 --> 00:03:41,130
So if we did that in a cell below here: df.age where df.target equals 1.

49
00:03:41,640 --> 00:03:45,950
So this is going to show us all the age values where target is equal to 1.

50
00:03:45,990 --> 00:03:47,650
So that's what we want.
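A small sketch of that subsetting step, assuming a handful of invented rows (the column names age, thalach and target are the ones used in the video; the values are made up):
```python
import pandas as pd
# Toy rows, invented for illustration; the real data frame has many more rows and columns.
df = pd.DataFrame({"age": [63, 37, 41, 56, 57], "thalach": [150, 187, 172, 178, 163], "target": [1, 1, 0, 1, 0]})
mask = df.target == 1          # boolean Series: True where the patient has heart disease
positive_ages = df.age[mask]   # same result as df["age"][df["target"] == 1]
print(positive_ages.tolist())
```
The dot notation only works because the column names have no spaces; the bracket form works either way.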

51
00:03:47,880 --> 00:03:48,530
Beautiful.

52
00:03:48,530 --> 00:03:56,510
And we also want df.thalach, which is maximum heart rate, where df.target equals 1.

53
00:03:57,000 --> 00:03:57,500
Beautiful.

54
00:03:57,510 --> 00:04:00,470
And we're gonna give it a color of salmon.

55
00:04:00,510 --> 00:04:04,980
What would this look like if we called that? Wonderful.

56
00:04:04,980 --> 00:04:08,810
So along here we've got maximum heart rate and we got age.

57
00:04:08,910 --> 00:04:13,990
So what could you infer from this pretty quickly, just by looking at it here?

58
00:04:14,020 --> 00:04:17,890
Well you might infer that there's kind of a downward trend.

59
00:04:18,010 --> 00:04:22,720
So if you were to draw a line through there, just a straight line, it's all over the place, right?

60
00:04:23,170 --> 00:04:27,080
But you can kind of see the pattern going down here.

61
00:04:27,400 --> 00:04:29,800
The younger someone is, the higher their max heart rate.

62
00:04:29,800 --> 00:04:31,800
That's the kind of trend that you'd be seeing there.

63
00:04:31,810 --> 00:04:33,360
That makes a little bit of sense, right? Now,
 
64
00:04:33,360 --> 00:04:37,840
these are positive examples, because we've got target equals 1.

65
00:04:37,840 --> 00:04:40,490
These are patients with heart disease.

66
00:04:40,490 --> 00:04:43,370
And so now we want the negative examples on the same plot.

67
00:04:43,870 --> 00:04:54,940
So, scatter with negative examples: plt.scatter, df.age. All we're going to do is just copy

68
00:04:54,940 --> 00:04:59,370
what we've got above, except target equals 0, because they're negative examples.

69
00:04:59,380 --> 00:05:09,610
This is without heart disease, so df.target equals 0. Wonderful. c equals, and I'll go lightblue.

70
00:05:10,080 --> 00:05:14,610
It's the color scheme we've set up for now. Beautiful.

71
00:05:14,650 --> 00:05:19,710
So now we've got positive and negative examples on there and again they're kind of all over the place.

72
00:05:19,720 --> 00:05:23,980
But if you were to look at the trend, the line kind of goes down for both.

73
00:05:23,980 --> 00:05:27,190
But again you can't really split them.

74
00:05:27,190 --> 00:05:31,180
This is where our machine learning model will have to take over from our ability to have a look at this.

75
00:05:31,180 --> 00:05:36,910
So if you were to look at this... now, I just put a semicolon there to stop that little matplotlib
 
76
00:05:36,910 --> 00:05:37,960
output from popping up.

77
00:05:38,710 --> 00:05:43,660
If you were to look at this and try to decipher what's going on here, what was the maximum heart

78
00:05:43,660 --> 00:05:50,530
rate of someone who had or didn't have heart disease? Salmon versus blue dots: blue dots are without

79
00:05:50,530 --> 00:05:52,690
heart disease, salmon are with heart disease.

80
00:05:52,930 --> 00:05:59,230
If you had to decipher this, it'd be pretty hard, right? Because these are all so mixed up together. Maybe

81
00:05:59,230 --> 00:06:02,900
you might infer that there's some sort of pattern.

82
00:06:03,130 --> 00:06:09,210
There's a fairly big cluster here, there's a big cluster here, but really I can't tell a pattern.

83
00:06:09,220 --> 00:06:13,990
Maybe if I had some more time to look at it I could find something but this is where machine learning

84
00:06:13,990 --> 00:06:14,870
is going to come into play.

85
00:06:14,920 --> 00:06:18,760
The patterns that you can't necessarily see straight away.

86
00:06:18,760 --> 00:06:20,910
Machine learning is going to dive into the data.

87
00:06:20,920 --> 00:06:25,560
It's going to perform some calculations and figure out these patterns that we can't really see.

88
00:06:25,690 --> 00:06:28,830
Again if we had enough time maybe we'd find something.

89
00:06:28,930 --> 00:06:34,930
But as machine learning engineers or data scientists, we prefer that the algorithm does the job for

90
00:06:34,930 --> 00:06:35,770
us.

91
00:06:35,830 --> 00:06:41,040
So what we might do is add some helpful info just to make this plot a little bit more complete.

92
00:06:41,140 --> 00:06:55,540
So plt.title, heart disease in function of age and max heart rate, then plt.xlabel, down the

93
00:06:55,540 --> 00:07:00,370
bottom is the age, and then plt.ylabel.

94
00:07:00,370 --> 00:07:08,920
This is gonna be max heart rate and then we'll add a legend so people can understand which dot is which

95
00:07:08,980 --> 00:07:14,350
and they're not just thinking, well, what's going on here with these salmon and light blue dots? Amazing

96
00:07:14,350 --> 00:07:18,150
color scheme, but not sure what's happening.

97
00:07:18,630 --> 00:07:21,780
And we'll put the semicolon there instead of there.

98
00:07:22,000 --> 00:07:22,750
Wonderful.
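Putting the cell together, a rough sketch under assumptions: the rows below are invented, and the legend labels are a guess at what was typed (the transcript only says a legend is added):
```python
import pandas as pd
import matplotlib.pyplot as plt
# Invented sample rows; swap in the real heart disease data frame.
df = pd.DataFrame({"age": [63, 37, 41, 56, 57, 67, 44, 52], "thalach": [150, 187, 172, 178, 163, 108, 173, 162], "target": [1, 1, 1, 1, 0, 0, 1, 0]})
plt.figure(figsize=(10, 6))
# Positive examples (target == 1) in salmon
plt.scatter(df.age[df.target == 1], df.thalach[df.target == 1], c="salmon")
# Negative examples (target == 0) in light blue, drawn on the same axes
plt.scatter(df.age[df.target == 0], df.thalach[df.target == 0], c="lightblue")
plt.title("Heart Disease in function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
# Labels follow the order the scatters were drawn in (assumed wording)
plt.legend(["Disease", "No Disease"]);
```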

99
00:07:22,750 --> 00:07:27,360
So that's looking a little bit better, but again, you let me know if you can see some sort of pattern.

100
00:07:27,370 --> 00:07:33,190
But to me, all I can really see is a bit of a downward trend, so as someone gets older their maximum

101
00:07:33,190 --> 00:07:38,200
heart rate decreases, which kind of makes sense. Again, when you're exploring a dataset, you won't necessarily

102
00:07:38,200 --> 00:07:40,330
always find patterns to begin with.

103
00:07:40,330 --> 00:07:48,030
We're just familiarizing ourselves with what's going on. And so, because we've just used age, what we might

104
00:07:48,030 --> 00:07:51,670
check out is what's the distribution of the age.

105
00:07:51,880 --> 00:07:59,160
So to do so: check the distribution of the age column with a histogram.

106
00:08:01,800 --> 00:08:05,980
Now, distribution is another word for the spread of the data.
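One way that histogram cell might look, as a sketch with invented ages (the exact call isn't audible in the transcript, so Series.plot.hist is an assumption):
```python
import pandas as pd
import matplotlib.pyplot as plt
# Invented ages, for illustration only.
df = pd.DataFrame({"age": [29, 41, 45, 52, 54, 55, 57, 58, 60, 62, 67, 77]})
ax = df.age.plot.hist()  # pandas uses 10 bins unless told otherwise
```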

107
00:08:06,170 --> 00:08:09,550
So we can kind of see here that there's almost an even spread.

108
00:08:09,750 --> 00:08:14,840
So you may recall what kind of distribution an even spread of data has.

109
00:08:14,850 --> 00:08:16,580
And if you don't that's perfectly fine.

110
00:08:16,590 --> 00:08:20,370
It took me a while to understand this as well, took me a long time actually.

111
00:08:20,400 --> 00:08:22,630
And I still have to research.

112
00:08:22,760 --> 00:08:28,920
So this kind of distribution here is otherwise known as a normal distribution and if we were to really

113
00:08:29,100 --> 00:08:33,330
say what a normal distribution is, we'd look at a perfect normal distribution.

114
00:08:33,330 --> 00:08:33,960
Let's have a look.

115
00:08:34,380 --> 00:08:42,200
Let's go "normal distribution". It's gonna look like a bell curve, so something like that.

116
00:08:42,890 --> 00:08:48,310
So if we come to ours, ours is very close to that shape, but it's kind of swaying towards the right.

117
00:08:48,470 --> 00:08:54,680
So we can see that most of our population or most of our samples their age is around this big mid gap

118
00:08:54,680 --> 00:08:55,310
here.

119
00:08:55,310 --> 00:08:59,290
And we don't have that many around the 30-year-old age, or past,
 
120
00:08:59,300 --> 00:09:02,410
this might be 80 or something like that. We don't have many past there.

121
00:09:02,510 --> 00:09:08,450
The majority of our dataset is within the 50 to 60 range.

122
00:09:08,560 --> 00:09:12,470
And so this is what you might want to do for a bunch of different columns is check the distributions

123
00:09:12,470 --> 00:09:19,820
check the spreads. If we did have someone out here that was like 150, that would maybe be some type of

124
00:09:19,940 --> 00:09:25,820
data that we'd have to clean up, because I'm not sure if anyone's ever lived to 150. Or if we had someone

125
00:09:25,820 --> 00:09:29,350
down here that was maybe five or something like that.

126
00:09:29,360 --> 00:09:33,200
We'd also have to think about okay is that something we want to include in our dataset.

127
00:09:33,620 --> 00:09:39,260
But this is going to be different column to column. So we're just checking the age here, it's a normal

128
00:09:39,260 --> 00:09:44,840
distribution. We might do the same for other columns. But the way you sort of think about samples out

129
00:09:44,840 --> 00:09:47,770
here is they're referred to as outliers.

130
00:09:47,990 --> 00:09:51,320
And now, how would you tell if there are any outliers?

131
00:09:51,320 --> 00:09:55,860
Well, a distribution plot, this histogram, is one of the best ways to do it.

132
00:09:55,910 --> 00:10:00,200
So if you're getting some weird results from your machine learning models

133
00:10:00,200 --> 00:10:05,060
later on it may be because there's outliers in the data and that's where you'll have to check different

134
00:10:05,060 --> 00:10:07,550
distributions of each column.

135
00:10:07,550 --> 00:10:09,310
So that's one way to do it.

136
00:10:09,430 --> 00:10:17,060
Now what we might do is compare another two columns. So if we come up here to our data dictionary, I saw

137
00:10:17,060 --> 00:10:25,040
this one before, the chest pain type, and I thought that would be interesting to see. So we've got here

138
00:10:25,040 --> 00:10:30,950
four different types of chest pain: chest pain related to decreased blood supply to the heart; chest pain

139
00:10:30,950 --> 00:10:34,520
not related to the heart, typically esophageal spasms.

140
00:10:34,520 --> 00:10:40,010
So I guess that's like your esophagus, which is, I think, that tube from your mouth to your stomach or

141
00:10:40,010 --> 00:10:43,850
something like that. Asymptomatic: chest pain not showing signs of disease.

142
00:10:43,880 --> 00:10:46,100
Okay, so this will be pretty interesting, right?

143
00:10:46,100 --> 00:10:49,970
If we compare this to the target column.

144
00:10:49,970 --> 00:10:57,170
So chest pain versus target: does chest pain relate to whether or not someone has heart disease?

145
00:10:57,170 --> 00:11:03,230
So let's come down here and make a little heading.

146
00:11:03,350 --> 00:11:07,980
Heart disease frequency per
 
147
00:11:08,240 --> 00:11:17,610
chest pain type. And what we might do, actually, is just remind ourselves: copy our data dictionary for

148
00:11:17,610 --> 00:11:22,470
chest pain so we can remember what each different value is.

149
00:11:22,470 --> 00:11:32,400
So if we come in here, we'll copy this, Shift and Enter, we'll bring it down and we'll paste it here.

150
00:11:32,420 --> 00:11:38,930
Wonderful. So this is what we're going to compare. We can do that fairly quickly with pd.crosstab,
 
151
00:11:38,930 --> 00:11:50,350
df.cp, df.target. Okay, so if we look at this, what would we deduce from here?
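A sketch of that cross-tabulation on a few invented rows, assuming cp is the chest pain column as in the data dictionary (the counts below are not the real ones):
```python
import pandas as pd
# Invented rows; cp is chest pain type (0-3), target is 1 for heart disease.
df = pd.DataFrame({"cp": [0, 0, 0, 1, 2, 2, 2, 3], "target": [0, 0, 1, 1, 1, 1, 0, 1]})
table = pd.crosstab(df.cp, df.target)  # rows: chest pain type, columns: target value
print(table)
```
Each cell counts how many patients have that chest pain type and that target value.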

152
00:11:50,400 --> 00:11:50,630
Mm hmm.

153
00:11:50,730 --> 00:11:59,870
It seems as chest pain type goes up, so does whether they have heart disease. If we've got chest pain type 0,

154
00:12:01,530 --> 00:12:04,840
there's a lot more that don't have heart disease than do.

155
00:12:04,890 --> 00:12:12,570
But if we get to type 2, there's only 18 with 0, so they don't have heart disease, but 69, so over three times
 
156
00:12:12,570 --> 00:12:14,940
that amount, do have heart disease.

157
00:12:14,940 --> 00:12:22,850
So this is non-anginal pain, typically non-heart-related. Well, that's interesting that it's

158
00:12:22,860 --> 00:12:27,060
non-heart-related. And so these are the type of patterns you'll start to find in new data. Some make

159
00:12:27,060 --> 00:12:27,480
sense.

160
00:12:27,480 --> 00:12:34,170
This doesn't really make sense to me: non-heart-related pain, yet more people have heart disease with

161
00:12:34,170 --> 00:12:35,880
type 2 than not.

162
00:12:35,880 --> 00:12:40,350
And so these are the type of things you might be wanting to discuss with a subject matter expert if

163
00:12:40,350 --> 00:12:45,900
you're looking through a dataset, such as us looking through this heart disease dataset. I'm not a

164
00:12:45,900 --> 00:12:51,030
doctor, I'm not trained medically. I've researched some things about health, but I know nothing, I've no

165
00:12:51,030 --> 00:12:53,220
idea about chest pain.

166
00:12:53,220 --> 00:12:57,180
So if I got this dataset from someone in the medical profession, and I'm trying to use machine learning to

167
00:12:57,180 --> 00:12:59,920
figure out whether someone has heart disease or not.

168
00:13:00,150 --> 00:13:03,810
and I'm coming across these patterns that I don't really understand, that's where I'd probably reach out

169
00:13:03,810 --> 00:13:09,750
to a medical professional, or do my own research, such as looking up what these actually

170
00:13:09,750 --> 00:13:17,700
mean, as we did before when defining angina, and go: hey, I'm seeing this in the data, can you shed some

171
00:13:17,700 --> 00:13:19,620
light? It's kind of confusing to me.

172
00:13:19,710 --> 00:13:25,230
Is this correct? Is this not? Those are the kind of insights we're looking for: both patterns that

173
00:13:25,230 --> 00:13:32,130
make sense to us such as the normal distribution of age such as the declining max heart rate as someone

174
00:13:32,130 --> 00:13:38,730
gets older, and ones that maybe don't make as much sense. Is heart disease more prevalent in women than in

175
00:13:38,730 --> 00:13:39,810
men?

176
00:13:39,960 --> 00:13:44,400
So these are the type of questions we're trying to get to. We're trying to formulate an idea of the data,

177
00:13:44,460 --> 00:13:49,380
not by finding answers but by finding questions that we can ask.

178
00:13:49,640 --> 00:13:54,990
And so what we might do is make this a little bit more visual as we've done before.

179
00:13:54,990 --> 00:14:09,350
Make the crosstab more visual: pd.crosstab, df.cp, df.target. We're just writing what we've got

180
00:14:09,350 --> 00:14:15,560
above there, then .plot. Same plot again, we might do a bar, because there are four different values here versus

181
00:14:15,560 --> 00:14:16,640
two different values there.

182
00:14:16,640 --> 00:14:20,210
So we should be able to look at it fine on a bar graph.

183
00:14:20,810 --> 00:14:23,870
figsize equals (10, 6).

184
00:14:23,870 --> 00:14:25,040
Wonderful.

185
00:14:25,040 --> 00:14:27,410
I'm going to use our favorite colors.

186
00:14:27,410 --> 00:14:36,340
color equals, what are we using? lightblue, salmon. Actually, I think the order was the other way round.

187
00:14:36,350 --> 00:14:39,090
We did salmon first, then lightblue.

188
00:14:39,230 --> 00:14:42,360
Doesn't really matter. Beautiful.

189
00:14:42,450 --> 00:14:48,660
Now we're gonna add some communication: plt.title,

190
00:14:53,330 --> 00:14:58,310
heart disease frequency per chest pain type,

191
00:15:02,470 --> 00:15:14,780
plt.xlabel, and do chest pain type. Then on the y it's plt.ylabel, amount. Beautiful.

192
00:15:14,870 --> 00:15:17,030
plt.legend.

193
00:15:17,030 --> 00:15:21,060
We're gonna put on here no disease

194
00:15:23,630 --> 00:15:28,310
disease, and then we'll finish it off with plt.xticks.

195
00:15:28,310 --> 00:15:35,240
Again we're going to make sure that the x ticks are vertical so it's a bit easier to read. Shift and Enter.
 
196
00:15:35,510 --> 00:15:39,930
What have we got? pyplot has no attribute ylabel. We've got a little typo. Beautiful.

197
00:15:40,250 --> 00:15:40,910
There we go.

198
00:15:40,910 --> 00:15:42,010
So that's a bit more visual.

199
00:15:42,010 --> 00:15:45,280
So we're just turning our crosstab into a graph here.
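A rough sketch of the whole plotting cell on invented rows; the rotation value for the x ticks isn't audible in the transcript, so rotation=0 here is an assumption:
```python
import pandas as pd
import matplotlib.pyplot as plt
# Invented rows; cp is chest pain type (0-3), target is 1 for heart disease.
df = pd.DataFrame({"cp": [0, 0, 0, 1, 2, 2, 2, 3], "target": [0, 0, 1, 1, 1, 1, 0, 1]})
# One group of bars per chest pain type, one bar per target value
ax = pd.crosstab(df.cp, df.target).plot(kind="bar", figsize=(10, 6), color=["salmon", "lightblue"])
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease", "Disease"])  # column order is target 0, then target 1
plt.xticks(rotation=0);                # assumed value; sets the tick label angle
```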

200
00:15:45,290 --> 00:15:49,720
Now this is something that we could take pretty quickly to someone and go: hey, what's going on here?

201
00:15:49,730 --> 00:15:58,430
Chest pain type 2 has way more counts with the disease than without, and it's supposed
 
202
00:15:58,430 --> 00:16:03,010
to be non-anginal pain, so typically esophageal.

203
00:16:03,080 --> 00:16:07,600
That word is very hard for me to pronounce: esophageal spasms, non-heart-related.

204
00:16:07,600 --> 00:16:10,510
So this is again raising questions from the data.

205
00:16:10,520 --> 00:16:14,270
That's what we're trying to do in this exploratory data analysis.

206
00:16:14,270 --> 00:16:14,970
All right.

207
00:16:15,090 --> 00:16:23,090
So we've got a pretty visualization there. What we might do next is check out the correlation between independent

208
00:16:23,090 --> 00:16:26,450
variables and our dependent variable.

209
00:16:26,450 --> 00:16:31,250
So again, if we have a look at our data frame, this is what you want to be jumping in and out of, right? Just
 
210
00:16:31,250 --> 00:16:34,480
using df.head() to quickly have a snapshot of your data frame.

211
00:16:34,490 --> 00:16:39,080
These are our independent variables, to remind ourselves, and we're trying to use them to predict target.

212
00:16:39,080 --> 00:16:45,830
So we'll see how we can compare those, and we're gonna use a correlation matrix, but we'll see that in

213
00:16:45,830 --> 00:16:46,460
the next video.