Welcome back. In the last video we saw an end-to-end scikit-learn workflow. Now we're going to break that down and jump into each of those steps individually. The first one is getting the data ready. You probably noticed I've defined the steps as a list, just because I'm a bit of a nerd, so that way we don't have to scroll right back up to the top to see what we're covering. We can use this cool little trick wherever we want in the Jupyter notebook, because we've run that cell and instantiated it as a list variable. So we're going to tick off getting the data ready in this section.

Another thing we probably should have done right at the top, and which we can usually do in any machine learning notebook we're working on, is our standard imports, just so we have them ready in our arsenal. We've already seen some of these: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt. And then we'll do %matplotlib inline so our plots appear in the notebook. Those are three or four lines of code you can run at the top of most notebooks, and as we start to use different scikit-learn functions throughout, we could put those up the top too, but we'll leave it at those for now.

So let's go down to where we were. In this section we're getting our data ready. The reason we have to do so is because most of the time, data doesn't come ready to be used with a scikit-learn machine learning model, so we have to get it ready. We'll give it a little heading: getting data ready to be used with machine learning, and then the three main things that we'll have to do. Number one is... let's communicate a bit better, Daniel. The three main things we have to do are: one, split the data into features and labels, usually called X and y. That's generally what you'll find them called if you're on the internet somewhere, and the scikit-learn library usually calls features X and labels y. Number two is filling, also called imputing, or disregarding missing values. So if any of the rows in our dataset have missing values, if there are any incomplete fields, we may have to fill them or we may have to get rid of those samples completely, because a machine learning model can't learn when there's nothing there — you'll see it throws an error. Then the final one is converting non-numerical values to numerical values, also called feature encoding.
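Before we move on, here's a minimal sketch of the standard imports mentioned above — assuming we're working in a Jupyter notebook, since %matplotlib inline is a notebook-only magic command:

```python
# Standard imports for most machine learning notebooks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Jupyter-only magic so plots render inside the notebook
%matplotlib inline
```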
As an example of that last one: say we have our car sales data — we need to turn that cell into markdown, beautiful. So if we have our car sales data and we have Toyota, we have the colour, et cetera, a machine learning model can't understand "Toyota". What have I done there? It can't understand "Honda", it can't understand "Red". We have to turn these into numbers, and we'll see how to do that in this section.

All right. Well, we need a dataset to begin with first, and I think we still have our heart disease dataset imported, so let's check that. Go ahead. Wonderful. So we're on step one first. We've seen this before in our workflow, but we'll just get a succinct version of how to actually do it. In this case we want to use the feature columns to predict y. So what are we doing here? We're keeping it nice and simple, using pandas: X equals heart_disease.drop, because we want to remove the target column along axis=1. And remember, in a pandas DataFrame, axis=1 means this axis here, the columns axis, and axis=0 is the rows axis. So X is now going to be every single column except for target. Wonderful. And you know what our y is going to be, because that's the labels of our machine learning problem: our y is going to be the target column of heart_disease, and we'll select it like that. We'll look at y.head() — there are our first five samples. Wonderful. If we go back up here — 1, 1, 1, 1 — you can see that's our y. Beautiful.

Now the next thing we have to do is split the data into training and test sets. In machine learning, one of the most fundamental principles is to never evaluate or test your models on data they have learned from, which is why we split it into training and test sets, and scikit-learn has a convenient function for allowing us to do that. So: split the data into training and test sets. Remember right back at the start, when we were going over concepts: looking at the test data is like looking at the final exam before you've looked at the practice exam — not what you want to be doing, right? If the professor accidentally leaked the final exam, everyone would be getting perfect marks and no one would actually be learning anything.
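Here's a minimal sketch of the feature/label split just described — assuming the heart disease data lives in a CSV with a "target" column, as in the earlier videos (the file name here is an assumption; point it at your own copy):

```python
import pandas as pd

# Assumed file name from earlier in the course; adjust to wherever your CSV lives
heart_disease = pd.read_csv("heart-disease.csv")

# Features: every column except the label we're trying to predict
X = heart_disease.drop("target", axis=1)  # axis=1 drops a column, axis=0 would drop a row

# Labels: just the target column
y = heart_disease["target"]

X.head(), y.head()  # peek at the first five samples of each
```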
So we want to type from sklearn.model_selection — scikit-learn has a fair few different modules, but we'll see some of the most useful ones — import train_test_split. Beautiful. Now, when we call train_test_split, it's going to return four different values: one is X_train, one is X_test, one is y_train and one is y_test. So let's see what happens. train_test_split — and if we check what it actually does, it says "split arrays or matrices into random train and test subsets". Beautiful, that's what we want. We want some data to train on, and we want some data to evaluate our machine learning models on. So we'll pass it our features, we'll pass it our labels, and we'll define the test_size as being 0.2.

So let's see what happens. Wonderful, that runs smoothly. Let's check out the shapes of our new matrices, because remember, our data here is really just a matrix in a DataFrame — a NumPy array in a DataFrame. So we'll look at X_train.shape, X_test.shape, y_train.shape and, finally, y_test.shape. Let's see this. Okay, so 242 rows with 13 columns — so 242 by 13, and 61 by 13. Okay, this makes sense. If we look up here: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen — thirteen different features. That's why our X_train variable is 242 by 13. But where did this 242 come from? Well, that's because we've decided here that we want our test dataset to be 20 per cent of the overall data. So let's have a look at this. If we go X.shape — because X is from before we split it, right, we've got X up here — X.shape is 303, so we have 303 samples in total. Let's just check that: 303 samples total, and 80 per cent of them are going to be training data for the machine learning model. If we take that and multiply it by 0.8 we get 242.4, and if we add these together, 242 plus 61 is 303. All it's done is rounded it down and automatically carved the other 20 per cent off into the test set, and it's done the same for y_train and y_test. Beautiful.
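As a quick sketch, the split and the shape checks might look like this, using the X and y from above (test_size=0.2 holds out 20 per cent of the samples for testing):

```python
from sklearn.model_selection import train_test_split

# Split features and labels into training (~80%) and test (~20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# With 303 samples this comes out to roughly 242 training rows and 61 test rows
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

One thing to keep in mind: train_test_split shuffles the data randomly, so the exact rows that end up in each split will differ between runs unless you pass a random_state value.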
Now we've split our data into training and test sets, the next step — we'll go back up here — is to fill the data or convert it, making sure it's all numerical. So which should we do first? Maybe we fill it first. Yeah, that's a good idea. We'll come back and do that in the next video. So maybe have a go at playing around with one of our CSV files and splitting it into training and test data, or splitting it into X and y first, and then change around this test_size parameter here to see what happens when we change it to 0.3. What do you think will happen if I press shift and enter here? The numbers will change. Maybe you want to play around with a different fraction. Otherwise, I'll see you in the next video and we'll look at how to fill some missing data.
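If you want to try that experiment before the next video, here's a small sketch, again assuming the X and y from above — changing test_size just changes how many rows land in each split:

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing instead of 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# With 303 samples, expect roughly 212 training rows and 91 test rows
X_train.shape, X_test.shape
```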