Welcome back. In the last video we finished off by checking out the correlation matrix, and we looked at our workflow and saw that we're now up to step five. So we've covered the problem definition, the data, the evaluation metric, we've looked at the features and we've done a little bit of data analysis. Now it's time to start bringing in machine learning. So let's see how we do that. Let's go in here, we'll create a little heading, we might do it as "5.0 Modelling", and we're going to turn this into markdown. And so, what's our problem? Let's scroll back up. We defined it up here, right back at the start, so problem definition is step 1. So in a statement: given clinical parameters about a patient, can we predict whether or not they have heart disease? So we're working with a classification problem. Okay, that's what we're trying to answer with machine learning. And evaluation, so this is our evaluation metric, because right now we've got a dataset and we're just exploring it, we're experimenting here. So if we wanted to actually get this into production, into a real-life setting, we'd probably want it to be fairly good, right? Because it's predicting something as important as whether or not someone has heart disease.
So we want a minimum of 95 per cent accuracy at predicting whether or not a patient has heart disease during our proof of concept, which is kind of what we're working through here. And if we can make it to that, we'll keep pursuing it, we'll keep going, see if we can improve it, make it better, or see what we need to improve. So this is what we're going to be trying to solve with machine learning, we're trying to answer this problem statement, and this is our evaluation metric, something that we'll be working towards. And of course, each of these could change, but they just give us a little roadmap, something we can head towards. So let's go down and see what we would do. The first thing we might do is have another look at our dataset and remind ourselves of what we're doing. So we're going to be using these columns to try and predict the target column, a.k.a. using the independent variables to predict the dependent variable here. And the first thing we might do is split our data into X and y, so we'll split it into features and labels, and then what we might do is create a training and test split, so modelling what we did in the scikit-learn section when we worked through the scikit-learn machine learning workflow. So let's do that. So let's go: split data into X and y.
X is going to be equal to every column except the target column. That's what we want, so df.drop, we're going to drop the target, and we're going to set the axis equal to 1. Beautiful. And then y is going to be the labels, so we're going to go df, and we can just go "target". There we go. Now let's visualise X. Wonderful, no target column. And we'll visualise y to remind ourselves it's just 0 or 1, whether or not someone has heart disease, a binary classification, because there are only two options, binary meaning 0 or 1. And now what we have to do is split our data into a training and test split. Now we've seen this before, but it's worth reiterating: why not use all the data to train a model? What is a train and test split? Well, let's say you wanted to take your model into the hospital and start using it on patients. How would you know how well your model goes on a new patient not included in the original full dataset? So if we built a model on every single sample in this DataFrame, how would we have an insight into how it would go on someone our model has never seen before? This is where the test set comes in. It's used to mimic taking our model into a real environment, or at least it tries to as much as possible. It's important to never let our model learn from the test set; it should only be evaluated on it.
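As a quick sketch of the X/y split described above (using a tiny made-up DataFrame in place of the real 303-row heart disease one, which has the same `target` column):

```python
import pandas as pd

# Small synthetic stand-in for the heart disease DataFrame
# (the real df has 303 rows and many more clinical columns).
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

# Features (X): every column except target. axis=1 means drop a column, not a row.
X = df.drop("target", axis=1)

# Labels (y): just the target column, 0 or 1.
y = df["target"]
```

Note `df.drop` returns a new DataFrame; the original `df` still keeps its `target` column.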
So we're going to have a training set, which is like when you're studying for an exam, it's like the course materials. So the machine learning model is going to learn the patterns in the course materials, and then it's going to be evaluated on the final exam, which is the test set. So what we're going to do is split our data into train and test sets. To do so, we can use NumPy's random seed first, so we can reproduce our results, and we can split our data into train and test sets using scikit-learn's train_test_split. So let's go: split into train and test set, X_train, X_test... We saw a lot of this in the scikit-learn section, but it's paramount to reiterate that whenever you evaluate your model, it should be on data the model has never seen before. For the test size we'll do the standard 0.2. So this X is what we've created up here, our independent variables, and this is y, which is our labels. So now we're going to press Shift and Enter, wonderful, and we're going to have a look at the training data. So we've got X_train, we can see that it's shuffled it, and it's got 242 rows out of the total 303 rows, so it's got about 80 per cent of the entire rows.
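The seeding and splitting step above might look something like this (again with synthetic stand-in data of the same 303-row shape, since the real heart disease DataFrame isn't reproduced here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the heart disease dataset
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(303, 3)), columns=["age", "chol", "thalach"])
df["target"] = rng.integers(0, 2, size=303)

X = df.drop("target", axis=1)   # features
y = df["target"]                # labels

np.random.seed(42)  # seed NumPy so the shuffle (and later results) are reproducible

# 20% of the rows go to the test set, the rest to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

With 303 rows and `test_size=0.2`, scikit-learn rounds the test set up to 61 rows, leaving 242 training rows, which matches the roughly 80/20 split mentioned above.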
Now we can do the same with y_train, and we might find the len of y_train as well. y_train is the same length as X_train, but it's only got the labels. And so if we compared the indexes here, it's got the labels for each of these samples: 1, 3, 2, 1, 3, 2. Wonderful. Now, what do we do? Well, now we've got our data split into X and y, and we've got train and test as well, so we've got training data and we've got testing data. The next thing is to build a machine learning model, and we're working with classification. This comes to the point where it's like, what model should we use? Now we've got our data split into training and test sets, it's time to build a machine learning model. We'll train it, so find the patterns, on the training set, and we'll test it, so use the patterns it's found, on the test set. But what machine learning model should we use? Well, if you remember, in the scikit-learn section we checked out the scikit-learn machine learning map, so we're going to go in here. Wonderful. So this is our start. So we follow this through, we've seen this before: we've got a classification problem, we're trying to classify whether or not someone has heart disease based on their health parameters, based on their medical parameters.
So what we might do is only try a few of these. You could try all of them, but what we're going to do is use KNeighborsClassifier, we'll use this one, so we'll open that in a new tab. And then we're also going to use our trusty random forest, so we'll open that as well. So, ensemble methods, is a random forest in here? Random forest... RandomForestClassifier, that's what we're after. Yes, beautiful. And here's how we got here: the nearest neighbors. Wonderful. So if we looked through that, we'd see how we could use it, but we're going to see it in practice. But there's one more. If we go back here, back to our machine learning map, it's not listed on here. I wonder why that is. Well, we're going to see in the next video what it is. I'll give you a little clue: it's called logistic regression. And you might be thinking, hey, we're working on a classification problem, logistic regression has the word regression in it, so why isn't it in this classification part? Well, that's a great question, because I'm not really sure either, but we're going to have a look at those three models in the next video. So we'll create them and we'll see how each of them goes, because remember, we're experimenting now.
We're now in the modelling slash experimentation phase, and one part of the experimentation phase is trying different machine learning models, so that's what we're going to do. We're going to try three different machine learning models and see how they compare with their results on the test set, and then of course we want the best one. So let's do that.
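A minimal sketch of how comparing the three models might look, putting them in a dictionary and looping over it (the model names and synthetic data here are assumptions; the actual comparison happens in the next video with the real heart disease data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart disease features and labels
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(303, 5)),
                 columns=[f"feature_{i}" for i in range(5)])
y = pd.Series(rng.integers(0, 2, size=303))

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# The three candidate models in a dict so we can try them all the same way
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
}

model_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # find patterns in the training set
    model_scores[name] = model.score(X_test, y_test)  # mean accuracy on unseen test data
```

The dict-plus-loop pattern keeps the experiment fair: every model is fit on exactly the same training set and scored on exactly the same held-out test set.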