1 00:00:00,360 --> 00:00:06,750 Now that we've got all our tools ready to go, the first step is to load the data in, because without data, 2 00:00:06,780 --> 00:00:08,880 how are we supposed to use our tools, right? 3 00:00:08,910 --> 00:00:15,990 So we go here, "Load data", make a little heading, then we're going to set it as df; df is usually short 4 00:00:15,990 --> 00:00:25,020 for DataFrame. So, pd.read_csv. Because we've got our heart disease CSV file in our current working 5 00:00:25,020 --> 00:00:32,610 directory, we can just go pd.read_csv, and we're gonna pass it, can we do tab autocomplete, heart disease 6 00:00:32,700 --> 00:00:39,420 .csv. Beautiful. And then we're gonna have a look at it. Wonderful, that seems like it's imported. Great, 7 00:00:39,450 --> 00:00:45,360 we've already had a sneak peek at this before, but we might go df.shape. This is going to tell us how 8 00:00:45,360 --> 00:00:51,860 many rows and columns there are. So essentially we have 303 examples, which is rows, and this is columns, 9 00:00:51,870 --> 00:00:57,330 so this comes in the form of (rows, columns). 10 00:00:57,330 --> 00:01:08,460 Now, once we've got data loaded in, what's next? Well, next comes data exploration, also known as exploratory 11 00:01:09,090 --> 00:01:12,230 data analysis, or EDA. 12 00:01:13,410 --> 00:01:17,840 Once you've imported the dataset, this is where you want to go to start to explore it. 13 00:01:17,850 --> 00:01:23,880 Now, there's no set way of doing this, but what you should be trying to do is become more and more familiar 14 00:01:24,510 --> 00:01:28,760 with the data. So you can compare different columns to each other, 15 00:01:28,980 --> 00:01:35,070 compare them to the target value, refer back to the data dictionary, try and figure things out. What your 16 00:01:35,070 --> 00:01:41,210 goal is, is to become a subject matter expert on the dataset that you're working with.
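The loading step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook from the video: the rows are made up, and an in-memory buffer stands in for the real CSV file so the example runs anywhere. Only the column names age, sex, cp and target come from the transcript.

```python
import io

import pandas as pd

# Hypothetical stand-in for the heart disease CSV: a few made-up rows so the
# sketch runs without the real file.
csv_text = """age,sex,cp,target
63,1,3,1
37,1,2,1
41,0,1,1
57,1,0,0
"""

# With the real file in the current working directory, you would pass its
# path (a string) to pd.read_csv instead of a StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

# shape always comes back as (rows, columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the real dataset the transcript reports 303 rows; the four rows here are only placeholders.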
17 00:01:41,430 --> 00:01:55,280 So the goal here is to find out more about the data and become a subject matter expert on the dataset 18 00:01:55,600 --> 00:01:58,570 you're working with. Now, 19 00:01:58,600 --> 00:02:03,360 this is because if someone asks you a question about it, you can give them an explanation. 20 00:02:03,370 --> 00:02:08,620 And when you start to build models, you can sound-check your models to make sure they're not performing 21 00:02:08,620 --> 00:02:15,620 too well, such as overfitting, or performing poorly, such as underfitting. Now, since 22 00:02:15,710 --> 00:02:19,220 EDA has no real set methodology, 23 00:02:19,460 --> 00:02:27,740 what we're gonna do is write down a short checklist. So we can go: what questions are you trying to 24 00:02:27,740 --> 00:02:28,340 solve? 25 00:02:28,460 --> 00:02:35,350 So this is back to our problem definition. And then, what kind of data do we have, 26 00:02:36,140 --> 00:02:41,310 and how do we treat different types? 27 00:02:41,330 --> 00:02:43,950 So this is like numerical versus categorical. 28 00:02:43,970 --> 00:02:49,690 In our case, all the data in our DataFrame is numerical. 29 00:02:49,930 --> 00:02:53,800 Now remember, these steps aren't just associated with the heart disease dataset. 30 00:02:53,800 --> 00:02:56,780 They can be applied to almost any dataset. 31 00:02:56,890 --> 00:02:58,230 What's missing from the data, 32 00:02:58,870 --> 00:03:02,860 and how do you deal with it? 33 00:03:02,860 --> 00:03:10,330 Step four: where are the outliers, and why should you care about them? 34 00:03:10,570 --> 00:03:18,700 Now, outliers are values or samples which are far away from the other samples, for example when 35 00:03:18,700 --> 00:03:20,380 you compare one sample to another.
36 00:03:20,440 --> 00:03:25,330 If one sample was just completely outlandish, it might be an incorrect sample, where someone may have 37 00:03:25,330 --> 00:03:30,700 made a mistake when they were logging it, or it may be an incorrect sample, full stop. 38 00:03:30,700 --> 00:03:36,960 Or it may actually be that that's just an outlandish example and it should be included in your dataset. 39 00:03:37,120 --> 00:03:40,360 Then finally, and again this is a non-exhaustive list: 40 00:03:40,420 --> 00:03:47,230 how can you add, change or remove features to get more out of your data? 41 00:03:48,070 --> 00:03:53,020 So it's kind of like an EDA checklist that you might want to start with, as a bare minimum of what we're 42 00:03:53,020 --> 00:03:53,800 trying to figure out. 43 00:03:54,400 --> 00:03:55,630 How about we do that? 44 00:03:55,660 --> 00:04:01,450 So the first thing, we might check df.head(), the traditional way of looking at a DataFrame. If we wanted to look 45 00:04:01,450 --> 00:04:02,410 at the bottom, 46 00:04:02,470 --> 00:04:11,080 we might do df.tail(). Wonderful. And then the next thing: because we're trying to predict this target variable, 47 00:04:11,140 --> 00:04:14,320 we want to find out some information about it. 48 00:04:14,380 --> 00:04:18,390 So we go df.target, or maybe we use this one. 49 00:04:18,430 --> 00:04:24,610 Remember, there are two different ways to access columns in DataFrames: you can go df.target, or df["target"] 50 00:04:24,640 --> 00:04:28,250 as a string. We call value_counts() on that. 51 00:04:28,300 --> 00:04:30,360 So we'll go here. 52 00:04:31,000 --> 00:04:35,880 Let's find out how many of each class there are. 53 00:04:36,310 --> 00:04:41,900 So in our case, if we go back to our data dictionary and try to figure out what target is, we can 54 00:04:41,900 --> 00:04:42,170 see: 55 00:04:42,170 --> 00:04:44,360 target is "have disease or not". 56 00:04:44,360 --> 00:04:49,110 So 1 equals yes, 0 equals no. So there we go.
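As a rough sketch of the calls just mentioned (head, tail, the two column-access styles and value_counts), here is a small self-contained example. The DataFrame is made up for illustration and is not the heart disease data; only the column names come from the transcript.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "age": [63, 37, 41, 57, 62, 45],
    "target": [1, 1, 1, 0, 0, 1],
})

top = df.head()      # first 5 rows by default
bottom = df.tail(2)  # last 2 rows

# Attribute access and bracket access return the same column.
same_column = df.target.equals(df["target"])

# How many examples of each class are there?
class_counts = df["target"].value_counts()
print(class_counts)
```

Bracket access is the safer habit of the two, since attribute access breaks for column names that contain spaces or clash with existing DataFrame attributes.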
57 00:04:49,110 --> 00:04:57,420 So we have 165 examples where someone has heart disease, based on their health parameters, and 138 examples 58 00:04:57,420 --> 00:04:59,590 where someone doesn't have heart disease. 59 00:04:59,700 --> 00:05:06,450 And so what we would think here is that this is a relatively balanced problem, meaning that we have quite 60 00:05:06,450 --> 00:05:10,800 a similar number of examples in both classes. 61 00:05:10,830 --> 00:05:14,050 So that's a balanced classification problem. 62 00:05:14,110 --> 00:05:16,540 And so if we go here, maybe we want to visualize this. 63 00:05:16,540 --> 00:05:28,650 So if we go df["target"].value_counts(), because we're always trying to be as explanatory as possible, we want to explain 64 00:05:29,100 --> 00:05:35,070 the data in a way that both we and other people can understand, and visualizations are one of the best 65 00:05:35,070 --> 00:05:35,930 ways to do that. 66 00:05:36,510 --> 00:05:43,920 So kind="bar", maybe a bar graph, and then we'll pass it some fancy colors so that we know this is our 67 00:05:43,920 --> 00:05:44,240 graph. 68 00:05:44,250 --> 00:05:48,480 We'll trademark these colors: salmon and light blue. 69 00:05:48,480 --> 00:05:51,980 Let's see this. I'll put a little semicolon here 70 00:05:52,000 --> 00:05:54,730 so that the text output doesn't come up. 71 00:05:54,730 --> 00:05:55,710 There we go. 72 00:05:55,750 --> 00:06:01,020 Maybe we could relabel these as something; that might involve changing the values in the DataFrame, 73 00:06:01,030 --> 00:06:02,150 so we won't do that for now. 74 00:06:02,200 --> 00:06:07,480 If we wanted to share this with someone, we could relabel them as "heart disease" and "no heart disease". 75 00:06:08,680 --> 00:06:13,420 The next thing we might want to look at is different information about our DataFrame, so we can do that 76 00:06:13,420 --> 00:06:18,250 with df.info(). What have the other columns got in them?
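The bar chart of class counts could look something like the sketch below. The data is made up, the Agg backend is chosen only so the example runs without a display, and relabelling the tick labels on the axes is an assumption of this sketch: it is one possible way to get readable labels without changing the values in the DataFrame, not something shown in the video.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd

# Made-up target column for illustration only.
df = pd.DataFrame({"target": [1, 1, 1, 0, 0, 1]})
counts = df["target"].value_counts()

# One colour per bar, in the same order as counts.index
ax = counts.plot(kind="bar", color=["salmon", "lightblue"])

# Relabel the ticks on the plot only; the DataFrame values stay 1 and 0.
labels = ["heart disease" if value == 1 else "no heart disease"
          for value in counts.index]
ax.set_xticklabels(labels, rotation=0)
ax.set_ylabel("number of examples")
```

In a notebook, ending the plotting line with a semicolon suppresses the text representation of the returned object, which is the trick mentioned in the transcript.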
77 00:06:18,290 --> 00:06:19,630 Are there any missing values? 78 00:06:19,640 --> 00:06:21,430 So this is what this is going to tell us. 79 00:06:21,470 --> 00:06:25,370 So we have age, which is non-null int64. 80 00:06:25,370 --> 00:06:29,790 There are 14 columns and 303 entries; sex is int64. 81 00:06:29,810 --> 00:06:35,120 cp, what's cp? Remember, we can check our data dictionary. Come up here. 82 00:06:35,270 --> 00:06:36,580 Chest pain type. 83 00:06:36,580 --> 00:06:37,370 Okay. 84 00:06:37,490 --> 00:06:43,250 Typical angina, atypical angina. I'm not actually sure what angina means. 85 00:06:43,850 --> 00:06:48,530 So this is where we're trying to figure out what our data is all about. 86 00:06:48,530 --> 00:06:52,810 So, "define angina": a condition marked by severe pain in the chest. 87 00:06:52,820 --> 00:06:56,940 That makes sense if we're dealing with heart disease data. 88 00:06:56,970 --> 00:07:00,090 OK, so if we come back. 89 00:07:00,170 --> 00:07:03,880 See, there's no real structure to this; we're just kind of jumping back and forth. 90 00:07:03,890 --> 00:07:07,400 If it seems like we're bouncing around, it's because we actually are. 91 00:07:07,400 --> 00:07:10,630 And another way to see whether there are any missing values: 92 00:07:10,670 --> 00:07:13,510 df.isna().sum(). 93 00:07:13,510 --> 00:07:20,270 Because remember, one of our questions in exploratory data analysis is what's missing from the data and 94 00:07:20,270 --> 00:07:22,090 how do we deal with it. 95 00:07:22,100 --> 00:07:28,310 So are there any missing values? In our case, 96 00:07:28,320 --> 00:07:28,850 there are not. 97 00:07:28,860 --> 00:07:32,290 So we don't have to do anything about the missing values there. 98 00:07:32,460 --> 00:07:38,080 Then, if we're still trying to find out more about our DataFrame, we might do df.describe(), which is gonna 99 00:07:38,100 --> 00:07:41,490 give us some numerical values about all of our columns.
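The two missing-value checks mentioned above, df.info() and df.isna().sum(), can be sketched like this. The data is made up, and one value is deliberately left missing so the counts have something to show; the heart disease dataset in the video has no missing values.

```python
import pandas as pd

# Made-up rows; the None in "age" is a deliberate gap for illustration.
df = pd.DataFrame({
    "age": [63, 37, None, 57],
    "cp": [3, 2, 1, 0],
    "target": [1, 1, 1, 0],
})

# Prints each column's dtype and non-null count
df.info()

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

A column whose non-null count in info() is lower than the number of entries is the one to look at more closely.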
100 00:07:41,490 --> 00:07:45,600 So we've got count here, which is 303; that's how many rows there are. 101 00:07:45,750 --> 00:07:47,690 We get the mean value of age. 102 00:07:47,700 --> 00:07:51,990 So that means the mean age of all of our patients here is 54. 103 00:07:52,330 --> 00:07:53,620 We come through here. 104 00:07:53,740 --> 00:07:54,230 Okay. 105 00:07:54,330 --> 00:08:00,360 There's not too much there that we can really make use of for now. What we might start to do is compare 106 00:08:00,360 --> 00:08:01,800 different columns. 107 00:08:01,800 --> 00:08:03,630 So we'll probably save that for the next video, 108 00:08:03,630 --> 00:08:05,400 to save this one getting too long. 109 00:08:05,880 --> 00:08:09,150 So now we've got some quick insights about the data. 110 00:08:09,150 --> 00:08:09,420 Right? 111 00:08:09,420 --> 00:08:14,400 No real structure here; we're just trying to figure out what's going on. We've read the column headings 112 00:08:14,400 --> 00:08:17,820 here, and we've compared them to our data dictionary. 113 00:08:18,030 --> 00:08:23,310 We've seen how many examples of each target there are in this graph here, 114 00:08:23,310 --> 00:08:24,960 this visualization. 115 00:08:25,080 --> 00:08:29,310 Now we're going to start to compare different columns and try and get a bit more of an idea of where 116 00:08:29,310 --> 00:08:31,810 the patterns are within our data. 117 00:08:31,920 --> 00:08:33,270 So let's do that in the next video.
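For reference, the describe() step might be sketched as follows. The rows are made up for illustration, so the numbers differ from the 303 rows and mean age of about 54 reported in the transcript.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "age": [63, 37, 41, 57],
    "target": [1, 1, 1, 0],
})

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column
summary = df.describe()
print(summary.loc[["count", "mean"]])

# Individual statistics can be pulled out by row and column label
mean_age = summary.loc["mean", "age"]
```

The count row is a quick cross-check against df.shape: if it is lower than the number of rows for some column, that column has missing values.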