1
00:00:00,360 --> 00:00:06,750
Now that we've got all our tools ready to go, the first step is to load the data in, because without data

2
00:00:06,780 --> 00:00:08,880
how are we supposed to use our tools, right?

3
00:00:08,910 --> 00:00:15,990
So we go here, Load Data, make a little heading, then we're going to set it as df. df is usually short

4
00:00:15,990 --> 00:00:25,020
for DataFrame, so pd.read_csv, because we've got our heart disease CSV file in our current working

5
00:00:25,020 --> 00:00:32,610
directory, we can just go pd.read_csv. We're gonna pass it, can we do tab autocomplete? heart-disease

6
00:00:32,700 --> 00:00:39,420
.csv. Beautiful. And then we're gonna have a look at it. Wonderful, that seems like it's imported. Great,

7
00:00:39,450 --> 00:00:45,360
we've already had a sneak peek at this before, but we might go df.shape. This is going to tell us how

8
00:00:45,360 --> 00:00:51,860
many rows and columns there are. So essentially we have 303 examples, which is rows, and this is columns,

9
00:00:51,870 --> 00:00:57,330
so this comes in the form of (rows, columns).
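
A minimal sketch of the loading step described above. The file name `heart-disease.csv` is an assumption based on what is said in the video, and the rows below are made-up toy values (not the real dataset) so the snippet runs on its own.

```python
import io

import pandas as pd

# Toy stand-in for the course file (made-up rows, NOT the real data).
# In the notebook you would instead call something like
# pd.read_csv("heart-disease.csv"), assuming that is the file name
# sitting in your current working directory.
toy_csv = io.StringIO(
    "age,sex,cp,target\n"
    "63,1,3,1\n"
    "37,1,2,1\n"
    "57,0,0,0\n"
)

df = pd.read_csv(toy_csv)

# .shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the real file the video reports 303 rows; the toy frame here only has 3.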

10
00:00:57,330 --> 00:01:08,460
Now, once we've got data loaded in, what's next? Well, next comes data exploration, also known as exploratory

11
00:01:09,090 --> 00:01:12,230
data analysis, or EDA.

12
00:01:13,410 --> 00:01:17,840
Once you've imported the dataset, this is where you want to go to start to explore it.

13
00:01:17,850 --> 00:01:23,880
Now, there's no set way of doing this, but what you should be trying to do is become more and more familiar

14
00:01:24,510 --> 00:01:28,760
with the data, so you can compare different columns to each other.

15
00:01:28,980 --> 00:01:35,070
Compare them to the target value, refer back to the data dictionary, try and figure things out. What your

16
00:01:35,070 --> 00:01:41,210
goal is, is to become a subject matter expert on the dataset that you're working with.

17
00:01:41,430 --> 00:01:55,280
So the goal here is to find out more about the data and become a subject matter expert on the dataset

18
00:01:55,600 --> 00:01:58,570
you're working with now.

19
00:01:58,600 --> 00:02:03,360
This is because if someone asks you a question about it, you can give them an explanation.

20
00:02:03,370 --> 00:02:08,620
And when you start to build models, you can sanity-check your models to make sure they're not performing

21
00:02:08,620 --> 00:02:15,620
too well, such as overfitting, or they might be performing poorly, such as underfitting. Now, since

22
00:02:15,710 --> 00:02:19,220
EDA has no real set methodology,

23
00:02:19,460 --> 00:02:27,740
what we're gonna do is write down a short checklist. So we can go: what questions are you trying to

24
00:02:27,740 --> 00:02:28,340
solve?

25
00:02:28,460 --> 00:02:35,350
So this is back to our problem definition. And then, what kind of data do we have?

26
00:02:36,140 --> 00:02:41,310
And how do we treat different types?

27
00:02:41,330 --> 00:02:43,950
So this is like numerical versus categorical.

28
00:02:43,970 --> 00:02:49,690
In our case, all the data in our DataFrame is numerical.

29
00:02:49,930 --> 00:02:53,800
Now remember these steps aren't just associated with the heart disease data set.

30
00:02:53,800 --> 00:02:56,780
They can be applied to almost any dataset.

31
00:02:56,890 --> 00:02:58,230
What's missing from the data?

32
00:02:58,870 --> 00:03:02,860
And how do you deal with it?

33
00:03:02,860 --> 00:03:10,330
Step four: where are the outliers, and why should you care about them?

34
00:03:10,570 --> 00:03:18,700
Now, outliers are values or samples which are so far away from the other samples, like, in terms of if

35
00:03:18,700 --> 00:03:20,380
you were comparing one sample to another.

36
00:03:20,440 --> 00:03:25,330
If one sample was just completely outlandish, it might be an incorrect sample, like someone may have

37
00:03:25,330 --> 00:03:30,700
made a mistake when they were logging it, or it may be an incorrect sample, full stop.

38
00:03:30,700 --> 00:03:36,960
Or it may actually be that that's just an outlandish example and it should be included in your dataset.

39
00:03:37,120 --> 00:03:40,360
Then finally, and again, this is a non-exhaustive list:

40
00:03:40,420 --> 00:03:47,230
How can you add, change or remove features to get more out of your data?

41
00:03:48,070 --> 00:03:53,020
So it's kind of like an EDA checklist that you might want to start with as a bare minimum of what we're

42
00:03:53,020 --> 00:03:53,800
trying to figure out.

43
00:03:54,400 --> 00:03:55,630
How about we do that?

44
00:03:55,660 --> 00:04:01,450
So the first thing, we might check df.head(), the traditional way of looking at a DataFrame. If we wanted to look

45
00:04:01,450 --> 00:04:02,410
at the bottom,

46
00:04:02,470 --> 00:04:11,080
we might do df.tail(). Wonderful. And then the next thing, because we're trying to predict this target variable,

47
00:04:11,140 --> 00:04:14,320
we want to figure out some information about that.

48
00:04:14,380 --> 00:04:18,390
So we go df["target"], maybe we use this one.

49
00:04:18,430 --> 00:04:24,610
Remember, there's two different ways to access columns in DataFrames: you can go df.target or df["target"]

50
00:04:24,640 --> 00:04:28,250
as a string. Call value_counts() on that.

51
00:04:28,300 --> 00:04:30,360
So we'll go here.

52
00:04:31,000 --> 00:04:35,880
Let's find out how many of each class there are.

53
00:04:36,310 --> 00:04:41,900
So in our case, if we go back up to our data dictionary and try to figure out what target is, we can

54
00:04:41,900 --> 00:04:42,170
see

55
00:04:42,170 --> 00:04:44,360
target is "have disease or not".

56
00:04:44,360 --> 00:04:49,110
So 1 equals yes, 0 equals no. So there we go.

57
00:04:49,110 --> 00:04:57,420
So we have 165 examples where someone has heart disease, based on their health parameters, and 138 examples

58
00:04:57,420 --> 00:04:59,590
where someone doesn't have heart disease.

59
00:04:59,700 --> 00:05:06,450
And so what we would think here is that this is a relatively balanced problem, meaning that we have quite

60
00:05:06,450 --> 00:05:10,800
a similar number of examples in both classes.

61
00:05:10,830 --> 00:05:14,050
So that's a balanced classification problem.
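
A small sketch of the class-count check just described, using hypothetical toy labels rather than the real column. The `normalize=True` variant is an addition not shown in the video; it gives the share of each class, which makes balance easier to judge (on the real data, 165 of 303 is roughly 54%).

```python
import pandas as pd

# Hypothetical toy labels: 1 = has heart disease, 0 = does not
df = pd.DataFrame({"target": [1, 1, 1, 0, 0]})

# Two equivalent ways to access the column
same_column = df.target.equals(df["target"])

# Raw count of examples per class
counts = df["target"].value_counts()

# Share of examples per class (values sum to 1.0)
ratios = df["target"].value_counts(normalize=True)

print(counts)
print(ratios)
```

The bracket form `df["target"]` is the safer habit, since it also works for column names that contain spaces or clash with DataFrame attributes.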

62
00:05:14,110 --> 00:05:16,540
And so, if we go here, maybe we want to visualize this.

63
00:05:16,540 --> 00:05:28,650
So if we go df["target"].value_counts(), because we're always trying to be explanatory: we want to explain

64
00:05:29,100 --> 00:05:35,070
the data in a way that both we and other people can understand, and visualizations are one of the best

65
00:05:35,070 --> 00:05:35,930
ways to do that.

66
00:05:36,510 --> 00:05:43,920
So kind equals "bar", maybe a bar graph, and then we'll pass it some fancy colors so that we know this is our

67
00:05:43,920 --> 00:05:44,240
graph.

68
00:05:44,250 --> 00:05:48,480
We'll trademark these colors: salmon and lightblue.

69
00:05:48,480 --> 00:05:51,980
Let's see this. I'll put a little semicolon here

70
00:05:52,000 --> 00:05:54,730
so that the output doesn't come up.

71
00:05:54,730 --> 00:05:55,710
There we go.
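
A sketch of the bar chart built above, again on hypothetical toy labels. The `Agg` backend line and the axis labels are additions so the snippet runs outside a notebook and the bars are readable; in Jupyter, ending the plotting line with a semicolon just hides the printed text representation of the returned object.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display

import pandas as pd

# Hypothetical toy labels: 1 = has heart disease, 0 = does not
df = pd.DataFrame({"target": [1, 1, 1, 0, 0]})

# One bar per class, coloured with the two named colours from the video
ax = df["target"].value_counts().plot(kind="bar", color=["salmon", "lightblue"])

# Labels added for illustration, so the chart can be shared without explanation
ax.set_xlabel("target (1 = heart disease, 0 = no heart disease)")
ax.set_ylabel("count")
```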

72
00:05:55,750 --> 00:06:01,020
Maybe we could relabel these as something. Maybe that might involve changing the values in the DataFrame;

73
00:06:01,030 --> 00:06:02,150
we won't do that for now.

74
00:06:02,200 --> 00:06:07,480
If we wanted to share this with someone, we could relabel that as "heart disease" and "no heart disease".

75
00:06:08,680 --> 00:06:13,420
The next thing we might want to look at is different information about our DataFrame, so we can do that

76
00:06:13,420 --> 00:06:18,250
with df.info(). What have the other columns got in them?

77
00:06:18,290 --> 00:06:19,630
Are there any missing values?

78
00:06:19,640 --> 00:06:21,430
So this is what this is going to tell us.

79
00:06:21,470 --> 00:06:25,370
So we have age, which is non-null int64.

80
00:06:25,370 --> 00:06:29,790
There's 14 columns, 303 entries. sex is int64.

81
00:06:29,810 --> 00:06:35,120
cp, what's cp? Remember, we can check our data dictionary. Come up here.

82
00:06:35,270 --> 00:06:36,580
Chest pain type.

83
00:06:36,580 --> 00:06:37,370
Okay.

84
00:06:37,490 --> 00:06:43,250
Typical angina, atypical angina. Not actually sure what angina means.

85
00:06:43,850 --> 00:06:48,530
So this is where we're trying to figure out what our data is all about.

86
00:06:48,530 --> 00:06:52,810
So, "define angina": a condition marked by severe pain in the chest.

87
00:06:52,820 --> 00:06:56,940
That makes sense if we're dealing with heart disease data.

88
00:06:56,970 --> 00:07:00,090
OK, so if we come back.

89
00:07:00,170 --> 00:07:03,880
See, there's no real structure to this, like, we're just kind of jumping back and forth.

90
00:07:03,890 --> 00:07:07,400
If it seems like we're bouncing around, it's because we actually are.

91
00:07:07,400 --> 00:07:10,630
And another way to see if there are any missing values:

92
00:07:10,670 --> 00:07:13,510
df.isna().sum().

93
00:07:13,510 --> 00:07:20,270
Because remember, one of our questions in exploratory data analysis is: what's missing from the data, and

94
00:07:20,270 --> 00:07:22,090
how do we deal with it?

95
00:07:22,100 --> 00:07:28,310
So are there any missing values? In our case,

96
00:07:28,320 --> 00:07:28,850
there are not.

97
00:07:28,860 --> 00:07:32,290
So we don't have to do anything about the missing values there.

98
00:07:32,460 --> 00:07:38,080
Then, if we're still trying to find out more about our DataFrame, we might do df.describe(), which is gonna

99
00:07:38,100 --> 00:07:41,490
give us some numerical values about all of our columns.

100
00:07:41,490 --> 00:07:45,600
So we've got count here, which is 303; that's how many rows there are.

101
00:07:45,750 --> 00:07:47,690
We get the mean value of age.

102
00:07:47,700 --> 00:07:51,990
So that means the mean age of all of our patients here is about 54.
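
A compact sketch of the three inspection calls used above: `info()`, `isna().sum()` and `describe()`. The frame is a hypothetical toy example with one deliberately missing age, so the missing-value count has something to show; the real dataset in the video has no missing values.

```python
import pandas as pd

# Hypothetical toy frame with one deliberately missing age value
df = pd.DataFrame({"age": [63, 37, None, 57], "target": [1, 1, 0, 0]})

# Column dtypes and non-null counts (prints directly)
df.info()

# Number of missing values in each column
missing = df.isna().sum()

# Summary statistics: count, mean, std, min, quartiles, max
summary = df.describe()
mean_age = summary.loc["mean", "age"]

print(missing)
print(summary)
```

Note that `describe()` skips missing values, so the `count` row for `age` is 3 here even though the frame has 4 rows.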

103
00:07:52,330 --> 00:07:53,620
We come through here.

104
00:07:53,740 --> 00:07:54,230
Okay.

105
00:07:54,330 --> 00:08:00,360
There's not too much there that we can really make use of for now. What we might start to do is compare

106
00:08:00,360 --> 00:08:01,800
different columns.

107
00:08:01,800 --> 00:08:03,630
So we'll probably save that for the next video,

108
00:08:03,630 --> 00:08:05,400
to save this one getting too long.

109
00:08:05,880 --> 00:08:09,150
So now we've got some quick insights about the data,

110
00:08:09,150 --> 00:08:09,420
right?

111
00:08:09,420 --> 00:08:14,400
No real structure here, we're just trying to figure out what's going on. We've read the column headings

112
00:08:14,400 --> 00:08:17,820
here, we've compared them to our data dictionary.

113
00:08:18,030 --> 00:08:23,310
We've seen how many examples of each target there are in this graph here,

114
00:08:23,310 --> 00:08:24,960
this visualization.

115
00:08:25,080 --> 00:08:29,310
Now we're going to start to compare different columns and try and get a bit more of an idea of where

116
00:08:29,310 --> 00:08:31,810
the patterns are within our data.

117
00:08:31,920 --> 00:08:33,270
So let's do that in the next video.