1
00:00:00,200 --> 00:00:00,666
Hello my

2
00:00:00,666 --> 00:00:04,500
friends, I hope you digested well
the previous tutorial, where we tackled

3
00:00:04,500 --> 00:00:08,400
this big yet important
tool of our data preprocessing toolkit.

4
00:00:08,666 --> 00:00:09,166
Indeed.

5
00:00:09,166 --> 00:00:13,433
Now you know how to handle the case
where you have some categorical data

6
00:00:13,433 --> 00:00:17,466
in your data set, which is a situation
you will encounter many times

7
00:00:17,666 --> 00:00:19,700
in your future machine learning career.

8
00:00:19,700 --> 00:00:24,100
And now we have two tools to cover,
the first one being splitting

9
00:00:24,100 --> 00:00:26,800
the data set into the training set
and the test set,

10
00:00:26,800 --> 00:00:29,433
and the second one being feature scaling.

11
00:00:29,433 --> 00:00:32,566
So before we start, I'm about to answer

12
00:00:32,600 --> 00:00:35,666
one of the most frequently asked questions

13
00:00:35,833 --> 00:00:39,533
in the data science community,
which is, be ready for it:

14
00:00:39,966 --> 00:00:44,400
Do we have to apply feature scaling
before splitting

15
00:00:44,400 --> 00:00:47,900
the data set into the training set
and the test set, or after?

16
00:00:48,300 --> 00:00:51,500
I've seen this question many times
and you will find

17
00:00:51,500 --> 00:00:54,600
that question in many forums
of the data science community.

18
00:00:54,700 --> 00:00:58,666
Some people will say that we have to apply
feature scaling before the split.

19
00:00:58,800 --> 00:01:04,000
Some people will say after the split, and
now I'm about to reveal the right answer.

20
00:01:04,000 --> 00:01:08,533
There is only one right answer,
which is, by the way, totally obvious

21
00:01:08,566 --> 00:01:10,466
after you get the explanation.

22
00:01:10,466 --> 00:01:15,000
So the answer is
we have to apply feature scaling

23
00:01:15,700 --> 00:01:20,166
after splitting the data set
into the training set and the test set.

24
00:01:20,300 --> 00:01:22,366
And now let me explain.

25
00:01:22,366 --> 00:01:24,600
So first,
just to make sure everybody understands,

26
00:01:24,600 --> 00:01:27,300
let me explain the what first.
And then I'll explain the why.

27
00:01:27,300 --> 00:01:30,666
So of course, splitting the data
set into the training set and the test set

28
00:01:30,666 --> 00:01:33,666
consists of making two separate sets:

29
00:01:33,666 --> 00:01:35,700
One training set
where you're going to train

30
00:01:35,700 --> 00:01:40,066
your machine learning model on existing
observations, and one test set where

31
00:01:40,066 --> 00:01:44,433
you're going to evaluate the performance
of your model on new observations.

32
00:01:44,633 --> 00:01:48,900
And it's important to understand that
these new observations are exactly like,

33
00:01:48,900 --> 00:01:52,100
you know, some future data
that you're going to get and on

34
00:01:52,100 --> 00:01:54,600
which you're going to deploy your machine
learning model.

35
00:01:54,600 --> 00:01:55,100
All right.

36
00:01:55,100 --> 00:01:56,866
So that's this first tool.

37
00:01:56,866 --> 00:02:02,333
And now feature scaling simply consists
of scaling all your variables,

38
00:02:02,333 --> 00:02:07,766
all your features actually, to make sure
they all take values on the same scale.

39
00:02:07,766 --> 00:02:11,666
And we do this so as to prevent
one feature from dominating the others,

40
00:02:11,700 --> 00:02:15,000
which therefore would be neglected
by the machine learning model.

41
00:02:15,366 --> 00:02:15,800
All right.

42
00:02:15,800 --> 00:02:18,133
So that's the what
for both of these tools.

43
00:02:18,133 --> 00:02:22,733
Now let me explain why
we have to apply feature scaling

44
00:02:23,000 --> 00:02:26,366
after splitting the data
set into the training set and the test set.

45
00:02:26,366 --> 00:02:27,633
It's really obvious.

46
00:02:27,633 --> 00:02:30,866
It is for the simple reason
that the test set

47
00:02:31,200 --> 00:02:33,900
is supposed to be a brand new set

48
00:02:33,900 --> 00:02:37,466
on which you are going to evaluate
your machine learning model.

49
00:02:37,700 --> 00:02:41,100
So it's exactly like, you know,
you're training your machine learning model

50
00:02:41,100 --> 00:02:45,100
on your training set, and then later
on, you know, after it is trained,

51
00:02:45,133 --> 00:02:48,133
you're going to deploy it
on new observations.

52
00:02:48,300 --> 00:02:51,366
So what this means is that the test
set is something

53
00:02:51,366 --> 00:02:54,366
you're not supposed to work with
for the training.

54
00:02:54,600 --> 00:02:58,133
And feature scaling
is, as you will see, a technique

55
00:02:58,133 --> 00:03:01,533
that will get the mean
and the standard deviation

56
00:03:01,533 --> 00:03:05,033
of your feature,
you know, in order to perform the scaling.

57
00:03:05,466 --> 00:03:09,566
So if we apply feature scaling
before the split,

58
00:03:09,900 --> 00:03:13,733
then it will actually get the mean
and the standard deviation

59
00:03:13,733 --> 00:03:17,033
of all the values,
including the ones in the test set.

60
00:03:17,166 --> 00:03:20,333
And since the test set is something
you're not supposed to have,

61
00:03:20,333 --> 00:03:24,300
you know, like some future data
in production, well, you know, applying

62
00:03:24,300 --> 00:03:28,700
feature scaling on the original data
set before the split would cause

63
00:03:28,700 --> 00:03:31,933
what we call information
leakage from the test set.

64
00:03:31,933 --> 00:03:32,733
You know, we would

65
00:03:32,733 --> 00:03:36,933
grab some information from the test set,
which we're not supposed to get

66
00:03:37,033 --> 00:03:40,900
because it is supposed to be new data
with new observations.

67
00:03:41,166 --> 00:03:46,233
So remember this: the essential reason
why you should not apply feature scaling

68
00:03:46,233 --> 00:03:50,133
before the split is to prevent information
leakage

69
00:03:50,366 --> 00:03:54,900
from the test set, which you're not supposed
to have until the training is done.