1
00:00:00,166 --> 00:00:00,933
All right.

2
00:00:00,933 --> 00:00:02,133
And that's it. Right.

3
00:00:02,133 --> 00:00:05,900
This is the code to split the dataset
into the training set and the test set.

4
00:00:06,133 --> 00:00:09,666
Let me zoom out a bit
so that you can see it.

5
00:00:09,866 --> 00:00:11,700
All right. So that's the full code.

6
00:00:11,700 --> 00:00:15,733
This will return indeed
these four new sets

7
00:00:15,733 --> 00:00:18,800
composed of the training set
in X_train and y_train.

8
00:00:18,900 --> 00:00:21,433
And the test set in X_test and y_test.

9
00:00:21,433 --> 00:00:22,766
Let me show you this right away.

10
00:00:22,766 --> 00:00:26,700
So we're going to add four new code cells
here.

11
00:00:27,300 --> 00:00:27,733
Right.

12
00:00:27,733 --> 00:00:30,966
And we're going to print
each of these created sets.

13
00:00:31,266 --> 00:00:34,266
So first we're going to print X train.

14
00:00:35,033 --> 00:00:36,466
Let me copy this.

15
00:00:36,466 --> 00:00:38,833
Then we're going to print X

16
00:00:39,833 --> 00:00:41,300
test.

17
00:00:41,300 --> 00:00:45,600
Then we're going to print Y train.

18
00:00:45,733 --> 00:00:50,133
And finally we're going to print Y test.

19
00:00:50,766 --> 00:00:51,433
Perfect.

20
00:00:51,433 --> 00:00:53,833
All right.
So now let's execute everything.

21
00:00:53,833 --> 00:00:58,100
Starting with this cell here, splitting
the dataset into the training and test sets.

22
00:00:58,133 --> 00:01:00,766
Done. Perfect. It ran successfully.

23
00:01:00,766 --> 00:01:03,900
Now let's run the cell to print X train.

24
00:01:04,166 --> 00:01:08,533
And as you can see, indeed we have now
eight observations in this training set.

25
00:01:08,533 --> 00:01:11,733
Right: 1, 2, 3, 4, 5, 6, 7, 8,

26
00:01:11,966 --> 00:01:16,400
which correspond to eight customers
taken randomly from this data set.

27
00:01:16,800 --> 00:01:22,066
And we clearly recognize the features here
with first the three columns being the

28
00:01:22,066 --> 00:01:28,033
one-hot encoded variables that encode
the country categorical variable.

29
00:01:28,066 --> 00:01:30,533
We also call these dummy variables.

30
00:01:30,533 --> 00:01:33,300
Then we clearly have here
the age as the second

31
00:01:33,300 --> 00:01:36,700
variable as a second feature, you know,
and then the salary.

32
00:01:36,733 --> 00:01:41,400
So we clearly have a great matrix
of features for the training set.

33
00:01:42,000 --> 00:01:42,866
All right. Perfect.

34
00:01:42,866 --> 00:01:44,666
Now let's print X test.

35
00:01:44,666 --> 00:01:48,766
So we'll get here two observations
containing the same features

36
00:01:48,766 --> 00:01:49,700
here as here right.

37
00:01:49,700 --> 00:01:51,833
This is the matrix of features still.

38
00:01:51,833 --> 00:01:54,900
So we have the dummy variables here
in the first three columns.

39
00:01:55,133 --> 00:01:59,166
Then the ages
and the salaries of our two customers

40
00:01:59,166 --> 00:02:02,166
taken randomly from the data
set into this test set.

41
00:02:02,633 --> 00:02:03,866
Then Y train.

42
00:02:03,866 --> 00:02:08,433
So here we'll get eight
purchase decisions, right, with the zeros

43
00:02:08,433 --> 00:02:11,800
and ones here that were encoded before
with LabelEncoder.

44
00:02:12,300 --> 00:02:14,666
And of course make sure to understand
this.

45
00:02:14,666 --> 00:02:19,566
These eight purchase decisions correspond
of course to the eight

46
00:02:19,566 --> 00:02:24,300
same customers of this matrix of features
X train of the training set right.

47
00:02:24,333 --> 00:02:27,333
These features
correspond to these purchase decisions.

48
00:02:27,366 --> 00:02:29,633
These are the same customers here.

49
00:02:29,633 --> 00:02:33,500
And finally Y test
which will output two results

50
00:02:33,633 --> 00:02:37,266
meaning two purchase decisions
right zero and one corresponding

51
00:02:37,266 --> 00:02:42,033
of course to the same customers as in this
matrix of features of the test set.
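
As a rough sketch of what these four prints show, assuming scikit-learn's train_test_split with test_size=0.2 on 10 customers: the array values, the random_state, and the extra index array used to check row alignment are illustrative assumptions, not the cell from the video.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 customers, 5 feature columns
# (3 one-hot country columns, then age, then salary). Values are made up.
X = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
    [0.0, 1.0, 0.0, 40.0, 63000.0],
    [1.0, 0.0, 0.0, 35.0, 58000.0],
    [0.0, 0.0, 1.0, 39.0, 52000.0],
    [1.0, 0.0, 0.0, 48.0, 79000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
    [1.0, 0.0, 0.0, 37.0, 67000.0],
])
# Purchase decisions already encoded as 0/1 (made-up values).
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
# Row indices, split alongside X and y only to verify the alignment.
idx = np.arange(len(X))

X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, idx, test_size=0.2, random_state=1  # random_state is an assumption
)

print(X_train)
print(X_test)
print(y_train)
print(y_test)

# Each row of X_train and each entry of y_train belong to the same customer.
print(np.array_equal(X[idx_train], X_train))
print(np.array_equal(y[idx_train], y_train))
```

With 10 rows and test_size=0.2, this gives 8 rows for training and 2 for testing, and the shuffling is applied to X and y together.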

52
00:02:42,766 --> 00:02:45,300
All right.
So there you go. Congratulations.

53
00:02:45,300 --> 00:02:48,466
Now you have a new tool in your data
preprocessing

54
00:02:48,466 --> 00:02:51,633
toolkit: splitting the dataset
into the training set and the test set.

55
00:02:51,966 --> 00:02:53,200
Not only do you have this tool,

56
00:02:53,200 --> 00:02:57,400
but also you have the final answer
to the ultimate question.

57
00:02:57,600 --> 00:03:01,333
Do we have to apply feature scaling
before or after the split?

58
00:03:01,500 --> 00:03:05,466
And it's clearly after the split
to avoid indeed information leakage

59
00:03:05,700 --> 00:03:08,566
because simply the test set
is supposed to be something

60
00:03:08,566 --> 00:03:13,366
new, right? Something
on which we evaluate our model, with new

61
00:03:13,366 --> 00:03:15,700
observations. All right. Great.
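
To make the leakage point concrete, here is a minimal sketch in plain Python, assuming standardization as the scaling method and a made-up salary column that has already been split. The helper name standardize and all values are hypothetical. The mean and standard deviation come from the training set only and are then reused on the test set.

```python
from statistics import mean, pstdev

# Hypothetical salary column, already split into train and test (made-up values).
train_salaries = [72000.0, 48000.0, 54000.0, 61000.0,
                  58000.0, 52000.0, 79000.0, 83000.0]
test_salaries = [67000.0, 63000.0]

# Learn the scaling parameters from the training set only.
mu = mean(train_salaries)
sigma = pstdev(train_salaries)


def standardize(values, mu, sigma):
    """Apply (value - mu) / sigma to every value (hypothetical helper)."""
    return [(v - mu) / sigma for v in values]


# The test set is scaled with the training statistics, so nothing
# about the test observations influences the scaling itself.
train_scaled = standardize(train_salaries, mu, sigma)
test_scaled = standardize(test_salaries, mu, sigma)

print(train_scaled)
print(test_scaled)
```

If the mean and standard deviation were computed on all ten salaries before the split, the test values would already have shaped the training data, which is the leakage to avoid.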

62
00:03:15,700 --> 00:03:19,866
So I'm glad that you are really
making progress here with new tools

63
00:03:19,866 --> 00:03:23,533
and new knowledge that actually reduce
any kind of confusion.

64
00:03:23,833 --> 00:03:27,333
So now we're going to move on
to our final tool right,

65
00:03:27,333 --> 00:03:31,233
feature scaling, which, as you now
know, must be applied after the split.

66
00:03:31,500 --> 00:03:32,766
And you will see what

67
00:03:32,766 --> 00:03:37,433
we'll get with some other prints
after we deploy this tool on our data set.

68
00:03:37,500 --> 00:03:39,733
So I can't wait to show this to you.

69
00:03:39,733 --> 00:03:43,400
And I can't wait to give you this last
final tool in your toolkit,

70
00:03:43,666 --> 00:03:45,000
because then what does it mean?

71
00:03:45,000 --> 00:03:47,700
That means that we will be 100% ready

72
00:03:47,700 --> 00:03:51,266
to start building our future
machine learning models.