1
00:00:00,400 --> 00:00:00,900
All right.

2
00:00:00,900 --> 00:00:04,300
So, as usual, you know,
in order to be as efficient

3
00:00:04,300 --> 00:00:07,600
as we can, we're going to do this
with scikit-learn, right.

4
00:00:07,600 --> 00:00:10,166
This data science library
that has all the tools

5
00:00:10,166 --> 00:00:15,133
and the tool that we're about to use
is a class called StandardScaler,

6
00:00:15,133 --> 00:00:19,066
which will perform exactly this
standardization on

7
00:00:19,200 --> 00:00:21,566
both your matrix of features
of the training set

8
00:00:21,566 --> 00:00:24,033
and the matrix of features
of the test set.

9
00:00:24,033 --> 00:00:24,933
So let's do this.

10
00:00:24,933 --> 00:00:28,800
Let's start by importing this class,
which we first have to take from,

11
00:00:29,200 --> 00:00:32,700
well, scikit-learn of course: sklearn.

12
00:00:32,966 --> 00:00:37,566
And then from which we're going
to get access to the preprocessing

13
00:00:37,566 --> 00:00:42,200
module. Perfect. It's a module
that contains that StandardScaler class.

14
00:00:42,200 --> 00:00:43,000
All right.

15
00:00:43,000 --> 00:00:48,766
So then we're ready to import what we want
which is the StandardScaler class.

16
00:00:48,900 --> 00:00:51,333
Perfect. So now we have the class.

17
00:00:51,333 --> 00:00:54,666
And then the natural next step here
is of course to create

18
00:00:54,666 --> 00:00:57,666
an object of the class.
I'm about to reveal, soon,

19
00:00:58,000 --> 00:01:01,500
the answer to one of the most frequent
questions in the data science community.

20
00:01:02,100 --> 00:01:03,633
So let's create this object.

21
00:01:03,633 --> 00:01:06,633
We're going to call it
sc, for StandardScaler.

22
00:01:06,666 --> 00:01:11,200
And then, well, this object will be created
as an instance of the

23
00:01:11,200 --> 00:01:11,900
StandardScaler class.

24
00:01:11,900 --> 00:01:15,433
So I'm taking it here, pasting that
right here, adding some parentheses.

25
00:01:16,033 --> 00:01:17,366
And good news here.

26
00:01:17,366 --> 00:01:21,900
We don't have any arguments to input
because what we simply want to do

27
00:01:21,900 --> 00:01:24,833
is get that mean, get
that standard deviation,

28
00:01:24,833 --> 00:01:27,866
and then apply this formula
to all the values in the feature.

29
00:01:27,866 --> 00:01:30,166
And for this
we don't actually need any parameters.

30
00:01:30,166 --> 00:01:32,733
This will automatically do the job.
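
As a rough sketch of what this object does once it is given data: the snippet below creates `sc` with no arguments and checks that `fit_transform` agrees with the formula (value minus mean, divided by standard deviation) computed by hand. The `ages` values are made up for illustration and are not taken from the course dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Create the scaler; no arguments are needed for plain standardization.
sc = StandardScaler()

# Made-up single-feature column (illustrative values only).
ages = np.array([[27.0], [35.0], [44.0], [50.0]])

# fit_transform learns the mean and standard deviation, then applies the formula.
scaled = sc.fit_transform(ages)

# The same formula by hand: subtract the mean, divide by the standard deviation.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(sc.mean_, sc.scale_)          # learned mean and standard deviation
print(np.allclose(scaled, manual))  # whether both computations agree
```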

31
00:01:32,733 --> 00:01:35,333
All right. So then next step.

32
00:01:35,333 --> 00:01:38,300
And now
well now is the time for me to reveal

33
00:01:38,300 --> 00:01:43,233
the answer to that question,
which is one of the most frequently asked

34
00:01:43,500 --> 00:01:46,433
questions in the data science community.

35
00:01:46,433 --> 00:01:50,566
And that question is do we have to apply

36
00:01:51,033 --> 00:01:54,033
feature scaling, you know, standardization

37
00:01:54,200 --> 00:01:58,500
to the dummy variables
in the matrix of features?

38
00:01:58,800 --> 00:02:01,000
This is one of the
most frequently asked questions.

39
00:02:01,000 --> 00:02:02,800
You will find it everywhere
online as well.

40
00:02:02,800 --> 00:02:04,666
And once again, actually

41
00:02:04,666 --> 00:02:08,466
the answer is pretty obvious,
but only after you get the explanation.

42
00:02:08,833 --> 00:02:10,266
So let me tell you the answer.

43
00:02:10,266 --> 00:02:12,466
The answer is no.

44
00:02:12,466 --> 00:02:15,000
The answer is no because simply

45
00:02:15,000 --> 00:02:19,166
well remember the goal of standardization
or feature scaling in general,

46
00:02:19,500 --> 00:02:23,566
it is to have all the values
of the features in the same range.

47
00:02:23,800 --> 00:02:29,000
And since I told you that standardization
actually transforms your features

48
00:02:29,000 --> 00:02:32,933
so that they take values between more or
less minus three and plus three.

49
00:02:33,200 --> 00:02:37,533
Well, since here our dummy variables
already take values

50
00:02:37,533 --> 00:02:41,266
between minus three and plus three
because they're equal to either 1 or 0.

51
00:02:41,466 --> 00:02:46,133
Well, there is nothing extra
to be done here with standardization

52
00:02:46,333 --> 00:02:50,566
and actually standardization
will only make it worse because indeed

53
00:02:50,566 --> 00:02:54,333
it will still transform these values
to between minus three and plus three.

54
00:02:54,333 --> 00:02:58,066
But then you will totally lose
the interpretation of these variables.

55
00:02:58,066 --> 00:02:58,900
In other words,

56
00:02:58,900 --> 00:03:03,533
you will lose the information of which
country corresponds to the observation.

57
00:03:03,766 --> 00:03:06,800
Now we perfectly know that you know,
remember one,

58
00:03:06,800 --> 00:03:10,200
zero and zero corresponds to France
because that's how it was encoded.

59
00:03:10,200 --> 00:03:13,066
And then zero, zero
and one corresponds to Spain.

60
00:03:13,066 --> 00:03:13,866
But you know, after

61
00:03:13,866 --> 00:03:17,800
we apply feature scaling,
if we apply it on the dummy variables,

62
00:03:18,000 --> 00:03:22,200
we will get nonsense numerical values
and we will be absolutely unable

63
00:03:22,366 --> 00:03:26,633
to say which tuple of three values here
corresponds to which country.

64
00:03:26,733 --> 00:03:28,800
So we will totally lose interpretation.
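
To make this concrete, here is a small sketch of what happens if the dummy columns are standardized anyway. The rows are made up, and the column order (France, Germany, Spain) is an assumption chosen to match the encoding mentioned above; the point is only that the 0s and 1s turn into values that are much harder to read.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up one-hot columns, assumed order: France, Germany, Spain.
dummies = np.array([
    [1.0, 0.0, 0.0],  # France
    [0.0, 0.0, 1.0],  # Spain
    [0.0, 1.0, 0.0],  # Germany
    [1.0, 0.0, 0.0],  # France
    [0.0, 0.0, 1.0],  # Spain
])

# Standardize every dummy column (the thing the lecture advises against).
scaled_dummies = StandardScaler().fit_transform(dummies)

print(dummies[0])         # readable: a 1 marks the country
print(scaled_dummies[0])  # same row after scaling: no longer 0s and 1s
```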

65
00:03:28,800 --> 00:03:32,400
And besides, this won't improve
the training performance at all

66
00:03:32,466 --> 00:03:35,933
because indeed our dummy
variables are anyway already

67
00:03:36,133 --> 00:03:39,300
in the same range
as your other variables.

68
00:03:39,566 --> 00:03:43,133
You will see online that applying
standardization to your dummy

69
00:03:43,133 --> 00:03:46,200
variables might still slightly
increase the performance,

70
00:03:46,200 --> 00:03:48,200
you know, the final accuracy of your model.

71
00:03:48,200 --> 00:03:51,166
But I've experimented many times
and I've never seen,

72
00:03:51,166 --> 00:03:54,900
you know, a considerable difference
that would justify applying feature

73
00:03:54,900 --> 00:03:56,400
scaling on the dummy variables.

74
00:03:56,400 --> 00:03:58,166
So really don't do this.

75
00:03:58,166 --> 00:04:01,333
Only apply feature
scaling to your numerical values.

76
00:04:01,333 --> 00:04:06,400
Right, here we clearly have some variables
taking values in a very different range.

77
00:04:06,500 --> 00:04:06,866
Right.

78
00:04:06,866 --> 00:04:12,000
The age goes between 0 and 100
and the salary goes between 0 and 100,000.

79
00:04:12,166 --> 00:04:14,600
So clearly here
if it's better for the machine

80
00:04:14,600 --> 00:04:16,933
learning model
we have to apply feature scaling.

81
00:04:16,933 --> 00:04:20,133
But let's leave these dummy
variables alone

82
00:04:20,300 --> 00:04:23,433
so that we can keep
the interpretability of the model.
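
One way to leave the dummy variables alone is to scale only a slice of the columns. The sketch below assumes a layout with three dummy columns first, followed by age and salary; the values and the names `X_train` and `X_test` are illustrative. It also assumes the common practice of fitting the scaler on the training set and reusing that fit on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: three dummy columns, then age and salary (assumed layout).
X_train = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [1.0, 0.0, 0.0, 35.0, 58000.0],
])
X_test = np.array([
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])

sc = StandardScaler()
# Fit on the training set's numerical columns only, then reuse that fit on the test set.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

print(X_train)  # first three columns are still 0s and 1s
print(X_test)
```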

83
00:04:23,666 --> 00:04:24,333
All right.

84
00:04:24,333 --> 00:04:26,733
So that was the other
very important question.

85
00:04:26,733 --> 00:04:28,200
Now I've covered everything.

86
00:04:28,200 --> 00:04:32,366
There should not be any confusion
left in data preprocessing.

87
00:04:32,500 --> 00:04:34,000
I'm really glad that you know this.

88
00:04:34,000 --> 00:04:38,000
And therefore I encourage you now
to press pause on this video.

89
00:04:38,233 --> 00:04:41,100
And guess what will be the next step
to apply feature

90
00:04:41,100 --> 00:04:44,800
scaling to our matrices
of features X_train and X_test.