We've gone through Step 1, problem definition, we've gone through Step 2, data, and Step 3, where we defined what success means for us. Now let's get on to Step 4, which is features. Here, the question we're trying to answer is: what do we already know about the data?

Now, if you haven't worked with data before, you might hear this word "features" and be wondering what features actually means. Well, you'll hear this word come up a lot in machine learning, maybe in the form of feature learning or feature variables, or when someone asks how many features there are or what kind of features there are.

Features is another word for different forms of data. We've already discussed different kinds of data, such as structured and unstructured, but features refers to the different forms of data within structured or unstructured data.

For example, let's go back to our predicting heart disease problem. We might want to see if things such as a person's body weight, their sex, their average resting heart rate and their chest pain rating can be used to predict whether they have heart disease or not. These four things, a patient's body weight, sex, average resting heart rate and chest pain rating, are features of the data. They could also be referred to as feature variables. In other words, we want to use the feature variables to predict the target variable, which is whether or not a person has heart disease.

Now, when it comes to feature variables, again, there are different kinds. You've got numerical, which means a number, like body weight. There's categorical, which means one thing or another, like sex, or whether a patient is a smoker or not. And then there's derived, which is when someone like yourself looks at the data and creates a new feature using the existing ones. For example, you might look at someone's hospital visit history timestamps, and if they've had a visit in the last year, you could make a categorical feature called "visited in last year". If someone had visited in the last year, they would get true, or in our case, yes. If not, they would get false, or in this case, no. The process of deriving features like this out of data is often referred to as feature engineering.
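To make this concrete, here's a minimal sketch of that kind of feature engineering in Python with pandas. The column names (last_visit, heart_disease and so on) and the values are hypothetical, made up for this illustration rather than taken from an actual course dataset.

```python
import pandas as pd

# Hypothetical patient records: feature variables plus a target variable.
patients = pd.DataFrame({
    "body_weight": [81.5, 62.0, 95.3],          # numerical feature
    "sex": ["M", "F", "F"],                     # categorical feature
    "resting_heart_rate": [72, 65, 80],         # numerical feature
    "last_visit": pd.to_datetime(["2023-11-02", "2021-06-15", "2024-01-20"]),
    "heart_disease": ["yes", "no", "yes"],      # target variable
})

# Derived (engineered) feature: has the patient visited in the last year?
one_year_ago = pd.Timestamp("2024-03-01") - pd.DateOffset(years=1)
patients["visited_in_last_year"] = (
    patients["last_visit"] >= one_year_ago
).map({True: "yes", False: "no"})

# Splitting the feature variables (X) from the target variable (y).
X = patients.drop(columns=["heart_disease"])
y = patients["heart_disease"]
```

The hard-coded date stands in for "today"; in practice you'd use something like pd.Timestamp.now() so the derived feature stays current as new data arrives.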
Our heart disease example is structured data, but unstructured data has features too; they're just a little less obvious. If you looked at enough images of dogs, you'd start to figure out: OK, most of these creatures have four shapes coming out of their body, their legs, and a couple of circles up the front, their eyes. As a machine learning algorithm looks at different images, it would start to learn these different shapes and much more, and figure out how different pictures are similar or different to each other. And don't worry, when it comes to figuring out the different patterns between features, such as the four rectangle-like shapes coming out of a dog's body or the circles at the front of the dog's head, you don't have to tell the machine learning algorithm what they are. The beautiful thing is, it figures them out on its own.

The final thing to remember is that a feature works best within a machine learning algorithm if many of the samples have it. For our predicting heart disease problem, say we had a feature called "most eaten foods", which had a list of the foods the patient ate most often, but only 10 per cent, or 10 out of 100, patient records had it. So this one here, for patient ID 4328, has a most eaten food, which is fries (not ideal), and these other patients don't have it, because remember, only 10 out of 100 examples have the most eaten food filled in; the rest are just missing, and that will be the same throughout. So if you can imagine there are 100 patients here, only 10 of them will have this most eaten foods column filled.

Since a machine learning algorithm learns best when all samples have similar information, we'd have to leave this one out, or try to collect more information before using it. The process of ensuring all samples have similar information is called feature coverage. In an ideal dataset, you've got complete feature coverage. So for us to be able to use this most eaten foods feature, ideally we'd want all values here, or at least more than 10 per cent coverage, which means that more than 10 out of 100 examples have some sort of value in this column. There's a small sketch of how you might check coverage like this at the end of this lesson.

We'll have plenty of practice looking at different features in coming lectures, projects and lessons. In the meantime, think about a problem you had to solve recently. What features went into it? Were they numerical or categorical, or did you combine them into your own derived feature?
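And here's that minimal sketch, again in Python with pandas, of how you might check feature coverage. The data is made up: 10 hypothetical patients, only one of whom has a value in the most_eaten_foods column, matching the 10 per cent figure from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical records: 10 patients, only 1 has "most_eaten_foods" filled.
records = pd.DataFrame({
    "body_weight": [81.5, 62.0, 95.3, 70.1, 88.2,
                    59.4, 77.7, 66.3, 90.0, 73.8],
    "most_eaten_foods": ["fries"] + [np.nan] * 9,
})

# Feature coverage: the fraction of samples with a value in each column.
coverage = records.notna().mean()
print(coverage)
# body_weight         1.0
# most_eaten_foods    0.1

# Keep only features with more than 10% coverage; "most_eaten_foods"
# gets left out until more information can be collected.
usable = records.loc[:, coverage > 0.10]
print(usable.columns.tolist())  # ['body_weight']
```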