We've gone through Step 1, problem definition, we've gone through Step 2, data, and Step 3, where we defined what success means for us. Now let's get on to Step 4, which is features. Here, the question we're trying to answer is: what do we already know about the data?

Now, if you haven't worked with data before, you might hear this word "features" and be wondering what features actually means. Well, you'll hear this word come up a lot in machine learning, maybe in the form of feature learning or feature variables, or when someone asks how many features there are or what kind of features there are.

Features is another word for different forms of data. We've already discussed different kinds of data, such as structured and unstructured, but features refers to the different forms of data within structured or unstructured data.

For example, let's go back to our predicting heart disease problem. We might want to see if things such as a person's body weight, their sex, their average resting heart rate and their chest pain rating can be used to predict whether they have heart disease or not. These four things, a patient's body weight, sex, average resting heart rate and chest pain rating, are features of the data. They could also be referred to as feature variables. In other words, we want to use the feature variables to predict the target variable, which is whether or not a person has heart disease.

Now, when it comes to feature variables, again, there are different kinds. You've got numerical, which means a number, like body weight. There's categorical, which means one thing or another, like sex, or whether a patient is a smoker or not. And then there's derived, which is when someone like yourself looks at the data and creates a new feature using the existing ones. For example, you might look at someone's hospital visit history timestamps, and if they've had a visit in the last year, you could make a categorical feature called "visited in last year". If someone had visited in the last year, they would get true, or in our case, yes. If not, they would get false, or in this case, no. The process of deriving features like this out of data is often referred to as feature engineering.
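To make this concrete, here's a minimal sketch of that kind of feature engineering in Python with pandas. The column names (last_visit, heart_disease and so on) and the values are hypothetical, made up for this illustration rather than taken from an actual course dataset.

```python
import pandas as pd

# Hypothetical patient records: feature variables plus a target variable.
patients = pd.DataFrame({
    "body_weight": [81.5, 62.0, 95.3],          # numerical feature
    "sex": ["M", "F", "F"],                     # categorical feature
    "resting_heart_rate": [72, 65, 80],         # numerical feature
    "last_visit": pd.to_datetime(["2023-11-02", "2021-06-15", "2024-01-20"]),
    "heart_disease": ["yes", "no", "yes"],      # target variable
})

# Derived (engineered) feature: has the patient visited in the last year?
one_year_ago = pd.Timestamp("2024-03-01") - pd.DateOffset(years=1)
patients["visited_in_last_year"] = (
    patients["last_visit"] >= one_year_ago
).map({True: "yes", False: "no"})

# Splitting the feature variables (X) from the target variable (y).
X = patients.drop(columns=["heart_disease"])
y = patients["heart_disease"]
```

The hard-coded date stands in for "today"; in practice you'd use something like pd.Timestamp.now() so the derived feature stays current as new data arrives.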
Our heart disease example is structured data, but unstructured data has features too; they're just a little less obvious. If you looked at enough images of dogs, you'd start to figure out: OK, most of these creatures have four shapes coming out of their body, their legs, and a couple of circles up the front, their eyes. As a machine learning algorithm looks at different images, it would start to learn these different shapes and much more, and figure out how different pictures are similar or different to each other. And don't worry, when it comes to figuring out the different patterns between features, such as the four rectangle-like shapes coming out of a dog's body or the circles at the front of the dog's head, you don't have to tell the machine learning algorithm what they are. The beautiful thing is, it figures them out on its own.

The final thing to remember is that a feature works best within a machine learning algorithm if many of the samples have it. For our predicting heart disease problem, say we had a feature called "most eaten foods", which had a list of the foods the patient ate most often, but only 10 per cent, or 10 out of 100, patient records had it. So this one here, for patient ID 4328, has a most eaten food, which is fries (not ideal), and these other patients don't have it, because remember, only 10 out of 100 examples have the most eaten food filled in; the rest are just missing, and that will be the same throughout. So if you can imagine there are 100 patients here, only 10 of them will have this most eaten foods column filled.

Since a machine learning algorithm learns best when all samples have similar information, we'd have to leave this one out, or try to collect more information before using it. The process of ensuring all samples have similar information is called feature coverage. In an ideal dataset, you've got complete feature coverage. So for us to be able to use this most eaten foods feature, ideally we'd want all values here, or at least more than 10 per cent coverage, which means that more than 10 out of 100 examples have some sort of value in this column. There's a small sketch of how you might check coverage like this at the end of this lesson.

We'll have plenty of practice looking at different features in coming lectures, projects and lessons. In the meantime, think about a problem you had to solve recently. What features went into it? Were they numerical or categorical, or did you combine them into your own derived feature?
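And here's that minimal sketch, again in Python with pandas, of how you might check feature coverage. The data is made up: 10 hypothetical patients, only one of whom has a value in the most_eaten_foods column, matching the 10 per cent figure from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical records: 10 patients, only 1 has "most_eaten_foods" filled.
records = pd.DataFrame({
    "body_weight": [81.5, 62.0, 95.3, 70.1, 88.2,
                    59.4, 77.7, 66.3, 90.0, 73.8],
    "most_eaten_foods": ["fries"] + [np.nan] * 9,
})

# Feature coverage: the fraction of samples with a value in each column.
coverage = records.notna().mean()
print(coverage)
# body_weight         1.0
# most_eaten_foods    0.1

# Keep only features with more than 10% coverage; "most_eaten_foods"
# gets left out until more information can be collected.
usable = records.loc[:, coverage > 0.10]
print(usable.columns.tolist())  # ['body_weight']
```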