1
00:00:00,100 --> 00:00:00,900
Okay, my friends.

2
00:00:00,900 --> 00:00:04,833
Are you ready for the final tool
of our data preprocessing toolkit:

3
00:00:05,100 --> 00:00:10,066
feature scaling, which will allow us
to put all our features on the same scale.

4
00:00:10,066 --> 00:00:11,000
So that's the what.

5
00:00:11,000 --> 00:00:13,266
And let me quickly remind you of the why.

6
00:00:13,266 --> 00:00:14,733
Why do we need to do this?

7
00:00:14,733 --> 00:00:18,166
Well, for some
machine learning models,

8
00:00:18,500 --> 00:00:23,333
we need it in order to keep some features
from being dominated by other features

9
00:00:23,333 --> 00:00:27,800
in such a way that the dominated features
are not even considered by the machine

10
00:00:28,000 --> 00:00:28,800
learning model.

11
00:00:28,800 --> 00:00:32,400
Now, you also need to be aware
that we will not have to apply feature

12
00:00:32,400 --> 00:00:35,966
scaling to all machine learning models,
just to some of them.

13
00:00:36,100 --> 00:00:39,733
Therefore, we won't include this
in our data preprocessing template,

14
00:00:39,733 --> 00:00:42,600
which I will show you by the way,
at the end of this tutorial.

15
00:00:42,600 --> 00:00:45,466
So we will just add this tool
to the toolkit

16
00:00:45,466 --> 00:00:49,233
because indeed, for a lot of machine learning
models, we won't even have to apply

17
00:00:49,233 --> 00:00:53,300
feature scaling even if we have features
taking very different values.

18
00:00:53,633 --> 00:00:57,500
For example, if you already know a bit
about the multiple linear

19
00:00:57,500 --> 00:01:01,800
regression model, you know that
each variable is actually multiplied

20
00:01:01,800 --> 00:01:05,400
by a coefficient, you know, in the linear
regression equation.

21
00:01:05,800 --> 00:01:06,733
And so well, you know,

22
00:01:06,733 --> 00:01:09,900
if you have variables
that take much higher values than others,

23
00:01:10,066 --> 00:01:14,300
Well, when learning the coefficients,
the coefficients will just compensate

24
00:01:14,400 --> 00:01:18,000
by taking small values for the variables
that take high values.
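
To make that compensation concrete, here is a minimal sketch in plain Python. It assumes a single made-up feature rather than a full multiple linear regression, and the helper name `fit_line` is hypothetical, not something from the course. It fits the same line twice, once with the feature in small units and once with the feature multiplied by 1000.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (one feature only)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data (assumption: purely illustrative values).
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [3.1, 4.9, 7.2, 8.8, 11.0]

# The same feature expressed in units that are 1000 times larger.
feature_big = [1000 * x for x in feature]

b0, b1 = fit_line(feature, target)
b0_big, b1_big = fit_line(feature_big, target)

# Predictions for the same observation, expressed in each unit.
pred = b0 + b1 * 3.5
pred_big = b0_big + b1_big * 3500.0
```

In this sketch the learned slope for the large-valued feature comes out about 1000 times smaller, and the predictions stay the same, which is the compensation described above.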

25
00:01:18,000 --> 00:01:18,333
Right.

26
00:01:18,333 --> 00:01:21,333
We will explain
this more in Part 2: Regression.

27
00:01:21,500 --> 00:01:25,200
But for now, just know that this is a tool
that will be applied

28
00:01:25,200 --> 00:01:29,000
from time to time for certain machine
learning models, but not all the time.

29
00:01:29,000 --> 00:01:31,600
As you will see in this course. All right.

30
00:01:31,600 --> 00:01:35,366
And also in the previous tutorial,
I actually told you that

31
00:01:35,366 --> 00:01:38,866
I was about to answer
one of the most important questions

32
00:01:38,866 --> 00:01:42,466
or most frequently asked questions
by the data science community.

33
00:01:42,800 --> 00:01:44,700
And I will keep, of course, my promise.

34
00:01:44,700 --> 00:01:47,000
I will answer this when it's time.

35
00:01:47,000 --> 00:01:49,766
Meaning,
at around the middle of this tutorial.

36
00:01:49,766 --> 00:01:52,933
But no worries,
all questions about feature scaling

37
00:01:52,966 --> 00:01:56,400
will be answered
so that you have absolutely no confusion.

38
00:01:57,000 --> 00:01:58,866
All right,
so we have the what and the why.

39
00:01:58,866 --> 00:02:00,600
And now let's proceed to the how.

40
00:02:00,600 --> 00:02:03,900
Meaning
how are we going to apply feature scaling.

41
00:02:04,200 --> 00:02:08,233
And to answer this question
I'm going to show you the following slides

42
00:02:08,433 --> 00:02:12,300
which present the two main
feature scaling techniques

43
00:02:12,300 --> 00:02:16,333
that indeed
put all your features on the same scale.

44
00:02:16,800 --> 00:02:20,100
And these two techniques are first
standardization,

45
00:02:20,366 --> 00:02:25,333
which consists of subtracting,
from each value of your feature,

46
00:02:25,533 --> 00:02:29,333
the mean of all the values
of the feature, and then dividing

47
00:02:29,333 --> 00:02:32,833
by the standard deviation,
which is the square root of the variance.

48
00:02:32,833 --> 00:02:36,400
And this will put most of the values
of the feature

49
00:02:36,600 --> 00:02:39,200
roughly between minus three
and plus three, right?

50
00:02:39,200 --> 00:02:40,400
All the different features.

51
00:02:40,400 --> 00:02:45,200
When you apply this transformation
on all the features of your data set,

52
00:02:45,333 --> 00:02:49,200
well, all your features will take values
roughly between minus three and plus three.

53
00:02:49,266 --> 00:02:50,966
So that's standardization.

54
00:02:50,966 --> 00:02:56,300
And then you have normalization
which consists of subtracting, from each value

55
00:02:56,300 --> 00:02:59,300
of your feature,
the minimum value of the feature,

56
00:02:59,466 --> 00:03:03,133
and then dividing by the difference
between the maximum value of the feature

57
00:03:03,133 --> 00:03:04,966
and the minimum value of the feature.

58
00:03:04,966 --> 00:03:08,633
And so since the numerator is non-negative,
the denominator is positive,

59
00:03:08,633 --> 00:03:11,333
and the denominator is never smaller than the numerator,

60
00:03:11,333 --> 00:03:14,433
well, that means that
all the values of your features

61
00:03:14,633 --> 00:03:17,466
will end up between 0 and 1.

62
00:03:17,466 --> 00:03:17,900
All right.

63
00:03:17,900 --> 00:03:19,633
So standardization will result in having

64
00:03:19,633 --> 00:03:22,633
values of features between minus three
and plus three more or less.

65
00:03:22,733 --> 00:03:27,300
And normalization will result in having all the
values of your features between 0 and 1.
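
As a rough illustration of the two formulas just described, here is a minimal sketch in plain Python. The function names `standardize` and `normalize` and the sample values are assumptions made for this example. The sketch also assumes the population standard deviation, since the tutorial only says "the square root of the variance".

```python
from statistics import mean, pstdev


def standardize(values):
    """Standardization: (x - mean) / standard deviation, for each x."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in values]


def normalize(values):
    """Normalization: (x - min) / (max - min), for each x."""
    lowest, highest = min(values), max(values)
    return [(x - lowest) / (highest - lowest) for x in values]


# Made-up feature values (assumption: not data from the course).
salaries = [48000, 54000, 61000, 72000, 83000]

standardized = standardize(salaries)  # centered on 0, standard deviation 1
normalized = normalize(salaries)      # smallest value maps to 0, largest to 1
```

Both functions would fail with a division by zero on a constant feature, where the standard deviation and the max-min range are both zero, so a real implementation needs to handle that case.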

66
00:03:27,700 --> 00:03:31,466
Now, the question that is also often asked
by the data science community.

67
00:03:31,600 --> 00:03:35,100
Should we go for standardization
or normalization?

68
00:03:35,600 --> 00:03:38,600
Well, we're going to be
very pragmatic here.

69
00:03:38,766 --> 00:03:41,400
Normalization is recommended

70
00:03:41,400 --> 00:03:44,800
when you have a normal distribution
in most of your features.

71
00:03:45,000 --> 00:03:47,333
This will be a great feature
scaling technique

72
00:03:47,333 --> 00:03:51,700
in that case. Standardization
actually works well all the time.

73
00:03:51,700 --> 00:03:53,900
It will do the job all the time.

74
00:03:53,900 --> 00:03:57,866
Therefore, since standardization is a technique
that will work all the time,

75
00:03:57,866 --> 00:04:02,433
and normalization is a technique that is more
recommended for some specific situations

76
00:04:02,433 --> 00:04:05,700
where you have most of your features
following a normal distribution,

77
00:04:06,000 --> 00:04:09,033
then my ultimate recommendation for sure

78
00:04:09,033 --> 00:04:13,200
is to go for standardization,
because indeed this will always work.

79
00:04:13,200 --> 00:04:15,900
You will always do
some relevant feature scaling,

80
00:04:15,900 --> 00:04:18,900
and this will always
improve the training process.

81
00:04:18,966 --> 00:04:21,333
So I'm going to teach you this technique.

82
00:04:21,333 --> 00:04:25,066
I'm going to teach you how to apply it
to our matrices of features.

83
00:04:25,066 --> 00:04:27,000
And I'm saying matrices of features
because now

84
00:04:27,000 --> 00:04:30,700
we have two matrices of features
which are X_train and X_test.

85
00:04:30,933 --> 00:04:33,666
And since we understood
in the previous tutorial

86
00:04:33,666 --> 00:04:37,600
that feature scaling must be applied
after the split, well, you understand

87
00:04:37,600 --> 00:04:41,433
that we won't apply feature
scaling to the whole matrix of features X,

88
00:04:41,666 --> 00:04:45,666
but of course
to both X_train and X_test separately,

89
00:04:45,900 --> 00:04:50,100
and actually
the scaler will be fitted only to X_train.

90
00:04:50,433 --> 00:04:52,533
And then we'll transform X_test.

91
00:04:52,533 --> 00:04:55,166
You know,
we'll apply feature scaling to X_test,

92
00:04:55,166 --> 00:04:58,666
because indeed, since X_test is something
that we're not supposed to have

93
00:04:58,666 --> 00:05:01,800
during training, but only afterwards, like
when going into production,

94
00:05:01,933 --> 00:05:07,266
well, we're not allowed to fit our feature
scaling tool on the test set, right?

95
00:05:07,266 --> 00:05:09,533
By fitting the feature
scaling tool on the test set,

96
00:05:09,533 --> 00:05:10,600
that means that we're going

97
00:05:10,600 --> 00:05:14,500
to get the mean of the test set, and then
the standard deviation of the feature.

98
00:05:14,766 --> 00:05:17,233
No, we don't have the right to do this
because X_test

99
00:05:17,233 --> 00:05:18,966
is supposed to be something new.

100
00:05:18,966 --> 00:05:22,533
And therefore we'll
just get the mean of the values in X_train,

101
00:05:22,533 --> 00:05:25,133
then get the standard deviation
of the values in X_train,

102
00:05:25,133 --> 00:05:28,133
then apply this formula
to transform all the values in X_train,

103
00:05:28,333 --> 00:05:32,100
and then apply that same formula, but
with the same mean and standard deviation

104
00:05:32,333 --> 00:05:36,033
of the values
in X_train to scale the values of X_test.

105
00:05:36,033 --> 00:05:38,633
It's really,
really important that you understand this.
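
Here is a minimal sketch of that fit-on-train, transform-both idea in plain Python. The class `SimpleStandardScaler` is hypothetical and written only for illustration; it handles a single feature column, uses made-up values, and assumes the population standard deviation.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Hypothetical one-column scaler, for illustration only."""

    def fit(self, values):
        # Learn the mean and standard deviation from the given values.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)  # assumption: population standard deviation
        return self

    def transform(self, values):
        # Apply the standardization formula with the stored statistics.
        return [(x - self.mean_) / self.std_ for x in values]


# Made-up feature column, already split (assumption: illustrative values).
X_train = [25.0, 32.0, 47.0, 51.0, 38.0, 29.0]
X_test = [44.0, 27.0]

# Fit on the training set only...
scaler = SimpleStandardScaler().fit(X_train)

# ...then scale both sets with the training mean and standard deviation.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The scaler never sees the test values while fitting, so the statistics of X_test do not leak into the scaling. As a result, the scaled test values are not expected to have a mean of exactly zero.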

106
00:05:38,633 --> 00:05:42,366
And this is once again,
to give some further elements of an answer

107
00:05:42,600 --> 00:05:44,466
to that previous question.

108
00:05:44,466 --> 00:05:46,966
Should we scale before or after the split?

109
00:05:46,966 --> 00:05:47,733
All right.

110
00:05:47,733 --> 00:05:52,433
So now that we are all clear on this,
let's proceed to the implementation

111
00:05:52,433 --> 00:05:56,400
of the how, meaning the implementation
of feature scaling.