1
00:00:00,466 --> 00:00:00,900
All right.

2
00:00:00,900 --> 00:00:04,000
And now the last step that you had to do
was to figure out what

3
00:00:04,000 --> 00:00:07,000
to replace here in this feature
scaling implementation.

4
00:00:07,200 --> 00:00:08,833
Well, here that's super easy.

5
00:00:08,833 --> 00:00:12,733
We simply want to,
you know, feature scale all the features.

6
00:00:12,733 --> 00:00:14,033
We want to scale all the features.

7
00:00:14,033 --> 00:00:15,800
We want to scale age and salary.

8
00:00:15,800 --> 00:00:19,433
And of course we don't have to scale
the dependent variable purchased

9
00:00:19,566 --> 00:00:23,666
because its values are zero and one
and therefore are already

10
00:00:23,666 --> 00:00:26,400
in the range of values
we want. So all good with this.

11
00:00:26,400 --> 00:00:29,900
So basically we're only going to scale
these two features age

12
00:00:29,900 --> 00:00:31,433
and estimated salary.

13
00:00:31,433 --> 00:00:34,933
And therefore here, what we just had to
do, you know, as a replacement,

14
00:00:35,200 --> 00:00:38,833
was just to remove
that selection of the indexes here.

15
00:00:38,866 --> 00:00:41,866
Here we selected all indexes
starting from three.

16
00:00:42,066 --> 00:00:44,266
But this time
we don't have to do anything.

17
00:00:44,266 --> 00:00:47,700
We can just scale
the whole matrix of features.

18
00:00:47,700 --> 00:00:50,833
So I'm just removing here
the index selections.

19
00:00:51,233 --> 00:00:52,033
And there you go.

20
00:00:52,033 --> 00:00:56,166
We'll be ready to feature scale
both our training set and our test set.

21
00:00:56,166 --> 00:01:00,300
And I remind you that
it is absolutely compulsory to do this

22
00:01:00,566 --> 00:01:03,566
after splitting the data
set into the training set and test set,

23
00:01:03,600 --> 00:01:07,400
in order to avoid information
leakage from the test set.
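
To make that concrete, here is a minimal sketch of scaling after the split. It assumes the scaling is standardization (subtract the mean, divide by the standard deviation), which this video does not name explicitly. The toy rows and the helper names fit_scaler and transform are hypothetical. The means and standard deviations are learned from the training set only and then reused on the test set, so nothing about the test set leaks into training.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Learn per-column mean and population standard deviation from rows."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]

def transform(rows, means, stds):
    """Standardize rows using statistics that were learned beforehand."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy (age, estimated salary) rows, made up for illustration.
X_train = [[25, 30000], [35, 60000], [45, 90000]]
X_test = [[55, 120000]]

# Fit on the training set only, then apply the same statistics to both sets.
means, stds = fit_scaler(X_train)
X_train_scaled = transform(X_train, means, stds)
X_test_scaled = transform(X_test, means, stds)

print(X_train_scaled)
print(X_test_scaled)
```

With a library scaler such as scikit-learn's StandardScaler, the same idea is usually written as fit_transform on the training set and transform on the test set.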

24
00:01:08,200 --> 00:01:08,700
All right.

25
00:01:08,700 --> 00:01:09,900
So there you go, my friends.

26
00:01:09,900 --> 00:01:12,966
That was what you needed to do for data
preprocessing.

27
00:01:13,000 --> 00:01:14,066
Congratulations

28
00:01:14,066 --> 00:01:17,566
if you've got the same thing.
Don't worry about that test size.

29
00:01:17,566 --> 00:01:20,200
This is just for the form.
But there you go.

30
00:01:20,200 --> 00:01:22,600
This was simply what you had to do.

31
00:01:22,600 --> 00:01:22,966
All right.

32
00:01:22,966 --> 00:01:25,500
So now we're going to do a few prints

33
00:01:25,500 --> 00:01:29,300
to actually see the before
and after feature scaling.

34
00:01:29,333 --> 00:01:32,466
So what I'm going to do is right
after this

35
00:01:32,466 --> 00:01:35,666
code cell splitting the data
set into the training set and test set.

36
00:01:36,000 --> 00:01:40,233
I'm going to make four prints
just to show you. If you don't need to

37
00:01:40,233 --> 00:01:43,233
look at that, well, feel free
not to include these new code cells.

38
00:01:43,400 --> 00:01:47,100
But what I want to do is print
first X train.

39
00:01:47,600 --> 00:01:51,166
Then I would like to print y train.

40
00:01:51,166 --> 00:01:53,100
So here y train.

41
00:01:53,100 --> 00:01:57,266
Then next one
I would like to print X test.

42
00:01:57,700 --> 00:01:59,933
And finally I would like to print

43
00:02:01,200 --> 00:02:04,200
well, y test. Okay.

44
00:02:04,266 --> 00:02:06,500
So that's it, and after feature scaling,

45
00:02:06,500 --> 00:02:10,366
however, we will just,
you know, print two cells

46
00:02:10,366 --> 00:02:13,366
because we actually don't apply feature
scaling to Y.

47
00:02:13,500 --> 00:02:15,500
And therefore we'll just print first,

48
00:02:15,500 --> 00:02:17,433
well, X train again.

49
00:02:17,433 --> 00:02:20,633
And second X test. Right.

50
00:02:20,633 --> 00:02:24,400
Because these will be the only sets
of data that will be changed.

51
00:02:24,566 --> 00:02:25,666
All right. Perfect.

52
00:02:25,666 --> 00:02:28,566
So now let's execute everything. We have
the data set.

53
00:02:28,566 --> 00:02:29,166
All good.

54
00:02:29,166 --> 00:02:32,400
So let's do this
starting by importing the libraries.

55
00:02:32,633 --> 00:02:33,366
Good.

56
00:02:33,366 --> 00:02:36,666
Now importing the data set. Great.

57
00:02:37,033 --> 00:02:40,533
Now splitting the data
set into the training set and test set.

58
00:02:40,733 --> 00:02:42,033
There we go.

59
00:02:42,033 --> 00:02:42,633
All right.

60
00:02:42,633 --> 00:02:45,533
Now let's print X train
and see what it looks like.

61
00:02:45,533 --> 00:02:47,466
All right. So let's scroll down a bit.

62
00:02:47,466 --> 00:02:47,900
All right.

63
00:02:47,900 --> 00:02:51,766
That's X train, with first the column age,
the age feature.

64
00:02:52,000 --> 00:02:56,100
And second column, the estimated salary,
the estimated salary feature.

65
00:02:56,300 --> 00:02:56,733
All right.

66
00:02:56,733 --> 00:03:00,800
And of course we have 300 observations
in this training set.

67
00:03:00,800 --> 00:03:02,166
You don't have to count them.

68
00:03:02,166 --> 00:03:04,566
But there you go. We have many of them.

69
00:03:04,566 --> 00:03:05,133
All right.

70
00:03:05,133 --> 00:03:10,433
So now let's print y train,
just to have a look at what we created.

71
00:03:10,433 --> 00:03:12,433
You know this is not compulsory.

72
00:03:12,433 --> 00:03:17,100
So that's y train, with all the purchase
decisions on the previous SUVs.

73
00:03:17,100 --> 00:03:20,366
Zero means
that the customer did not buy any SUV.

74
00:03:20,533 --> 00:03:25,966
And one means yes, the customer
bought a previous SUV. And now X test.

75
00:03:26,533 --> 00:03:29,700
Scroll down a bit. Right, X test, so same.

76
00:03:29,700 --> 00:03:33,566
It contains 100 observations, corresponding
to 100 customers,

77
00:03:33,866 --> 00:03:36,866
and for each of them
their age and the estimated salary.

78
00:03:37,200 --> 00:03:40,033
And you know, since X test
is actually supposed to be

79
00:03:40,033 --> 00:03:44,100
some new data in production, well,
we're actually going to suppose that

80
00:03:44,100 --> 00:03:49,533
X test is actually the set of customers
who purchased yes or no, the new SUV.

81
00:03:49,533 --> 00:03:52,666
You know, we're going to pretend
that X test is actually some data

82
00:03:52,800 --> 00:03:56,700
when we deploy our model in production,
so that we can evaluate it

83
00:03:56,966 --> 00:04:00,300
on the new observations,
meaning on the new customers buying,

84
00:04:00,300 --> 00:04:02,400
yes or no, that new SUV.

85
00:04:02,400 --> 00:04:02,866
All right.

86
00:04:02,866 --> 00:04:05,066
It's more fun
if we imagine X test this way

87
00:04:05,066 --> 00:04:09,666
because it is indeed supposed to be
some new observations. And now y test.

88
00:04:09,933 --> 00:04:11,133
Let's see.

89
00:04:11,133 --> 00:04:11,533
All right.

90
00:04:11,533 --> 00:04:12,966
And y test contains, of course,

91
00:04:12,966 --> 00:04:16,666
all the purchase decisions
of the customers in this test set.

92
00:04:16,900 --> 00:04:20,233
Meaning
whether or not they bought the new SUV.

93
00:04:20,466 --> 00:04:21,233
All right.

94
00:04:21,233 --> 00:04:24,100
Perfect.
So now let's apply feature scaling.

95
00:04:24,100 --> 00:04:28,066
And let's see how X train and X test
are transformed.

96
00:04:28,433 --> 00:04:28,800
All right.

97
00:04:28,800 --> 00:04:31,100
So let's do this. Let's play.

98
00:04:31,100 --> 00:04:33,300
And now let's print X train.

99
00:04:33,300 --> 00:04:33,700
All right.

100
00:04:33,700 --> 00:04:37,666
So now we have some scaled values between
well you know

101
00:04:37,766 --> 00:04:41,400
minus two and plus three.

102
00:04:41,400 --> 00:04:45,366
You know, this one is 2.06, minus 1.7.

103
00:04:45,600 --> 00:04:49,466
Anyway, they should be somewhere
between minus three and plus three, okay.

104
00:04:49,466 --> 00:04:54,400
But now we can clearly see that we have
the two features in the same range.

105
00:04:54,400 --> 00:04:57,900
The transformed age
and the transformed estimated salary

106
00:04:58,233 --> 00:05:00,300
are now indeed in the same range.

107
00:05:00,300 --> 00:05:04,300
And that's exactly what we're supposed
to get with feature scaling.
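
As a quick check of that "same range" claim, here is a small sketch, again assuming the scaling is standardization. The rows are invented for illustration and are not taken from the course dataset. After standardization each column has mean 0 and standard deviation 1, which is why most values typically land somewhere around minus three to plus three, although that range is not a hard guarantee.

```python
from statistics import mean, pstdev

# Toy (age, estimated salary) rows, invented for illustration only.
rows = [[19, 19000], [35, 20000], [26, 43000], [27, 57000], [47, 150000]]

# Standardize each column with its own mean and standard deviation.
columns = list(zip(*rows))
scaled_columns = []
for col in columns:
    m, s = mean(col), pstdev(col)
    scaled_columns.append([(v - m) / s for v in col])

# Compare the raw range with the scaled range for each feature.
for name, raw, scaled in zip(["age", "salary"], columns, scaled_columns):
    print(name, "raw range:", min(raw), "to", max(raw))
    print(name, "scaled range:", round(min(scaled), 2), "to", round(max(scaled), 2))
```

The raw ranges differ by several orders of magnitude, while the scaled ranges are directly comparable.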

108
00:05:04,300 --> 00:05:05,066
All right.

109
00:05:05,066 --> 00:05:08,400
So now let's scroll down to print X test.

110
00:05:09,133 --> 00:05:14,766
And the same: we get the two features, age
and salary, taking values in the same range

111
00:05:14,766 --> 00:05:17,300
between somewhere around minus three
and plus three.

112
00:05:17,300 --> 00:05:21,633
So this will improve the training
performance of the logistic regression

113
00:05:21,633 --> 00:05:22,100
model.

114
00:05:22,100 --> 00:05:25,166
Well you know for the training set
of course only for the training set.

115
00:05:25,166 --> 00:05:28,733
But then when we deploy our model
to predict

116
00:05:28,900 --> 00:05:30,666
whether the customers of the test set

117
00:05:30,666 --> 00:05:33,833
bought, yes or no, the new SUV, well,

118
00:05:33,833 --> 00:05:37,200
we will indeed have to apply the predict
method on these scaled values.

119
00:05:37,500 --> 00:05:39,966
Otherwise predictions will be nonsense,
right?

120
00:05:39,966 --> 00:05:43,166
The predict method has to be called
on a set of features

121
00:05:43,166 --> 00:05:47,100
with the same scale as the one
that was applied during the training.
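
To see why unscaled inputs give nonsense, here is a small sketch. Everything in it is hypothetical: the training means and standard deviations, the weights and bias of the logistic model, and the customer are made-up numbers, not values from this course's dataset or trained model.

```python
import math

# Hypothetical statistics learned from the training set (age, salary).
train_means = [38.0, 70000.0]
train_stds = [10.0, 34000.0]

# Hypothetical logistic regression parameters, learned on SCALED features.
weights = [2.0, 1.2]
bias = -1.0

def predict_proba(features):
    """Probability of class 1 for one observation under the toy model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def scale(features):
    """Apply the training-set scaling to one new observation."""
    return [(x - m) / s for x, m, s in zip(features, train_means, train_stds)]

new_customer = [30, 40000]  # made-up age and estimated salary

p_scaled = predict_proba(scale(new_customer))  # same scale as training
p_raw = predict_proba(new_customer)            # wrong: raw, unscaled inputs

print("scaled input:", round(p_scaled, 3))
print("raw input:   ", p_raw)
```

With raw inputs the salary term swamps everything and the toy model is certain the customer buys, whatever the customer looks like. With the training-set scaling applied first, the probability is driven by how the customer compares with the training data.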

122
00:05:47,333 --> 00:05:49,000
Okay, perfect.

123
00:05:49,000 --> 00:05:52,233
So now we can move on to the next
exciting step

124
00:05:52,366 --> 00:05:56,866
where we build and train our logistic
regression model on the training set.