1
00:00:00,300 --> 00:00:02,600
All right,
so I think I've explained enough.

2
00:00:02,600 --> 00:00:06,400
Now I'm relieved that at least it's 100%
clear for everyone.

3
00:00:06,600 --> 00:00:08,066
And so there you go, my friends.

4
00:00:08,066 --> 00:00:12,066
Let's implement one of the last tools
of this data preprocessing toolkit,

5
00:00:12,233 --> 00:00:15,400
which is indeed the split of the data

6
00:00:15,400 --> 00:00:18,400
set into the training set
and the test set.

7
00:00:18,466 --> 00:00:20,500
All right. So how are we going to do this.

8
00:00:20,500 --> 00:00:24,700
Well, we're going to do it with a function,
a function from scikit-learn.

9
00:00:24,700 --> 00:00:28,400
You know, the most popular
and useful data science library.

10
00:00:28,633 --> 00:00:33,000
Because once again, this library contains
a module that is called model_selection,

11
00:00:33,200 --> 00:00:37,100
which itself contains
a function called train_test_split.

12
00:00:37,333 --> 00:00:39,633
And this function
will exactly do what we want,

13
00:00:39,633 --> 00:00:43,633
which is to create four separate sets,
actually not two, but four,

14
00:00:43,633 --> 00:00:45,300
because we will actually create

15
00:00:45,300 --> 00:00:48,600
a pair of matrix of features
and dependent variable for the training set.

16
00:00:48,766 --> 00:00:52,533
And another pair of matrix of features
and dependent variable for the test set.

17
00:00:52,900 --> 00:00:53,233
All right.

18
00:00:53,233 --> 00:00:56,400
So we're basically going to get four
sets: X_train,

19
00:00:56,400 --> 00:00:57,166
which is the matrix

20
00:00:57,166 --> 00:00:58,033
of features of the training

21
00:00:58,033 --> 00:01:02,266
set; X_test, which is the matrix of features
of the test set; y_train,

22
00:01:02,266 --> 00:01:04,366
which is the dependent variable
of the training set;

23
00:01:04,366 --> 00:01:07,466
and y_test, which is the dependent
variable of the test set.

24
00:01:07,700 --> 00:01:09,166
That's exactly what we want.

25
00:01:09,166 --> 00:01:10,933
And now why do we want this?

26
00:01:10,933 --> 00:01:12,000
Well, it's not us.

27
00:01:12,000 --> 00:01:13,500
It's actually the future

28
00:01:13,500 --> 00:01:16,900
machine learning models
that we will build in the next part,

29
00:01:17,100 --> 00:01:22,400
which will, all of them,
be expecting this format as inputs.

30
00:01:22,666 --> 00:01:25,633
You know, for the training,
they will expect X_train and

31
00:01:25,633 --> 00:01:29,000
y_train as inputs in the method
actually called the fit method.

32
00:01:29,233 --> 00:01:32,100
And for the predictions,
also called inference,

33
00:01:32,100 --> 00:01:34,966
these models will make predictions on X_test.
All right.
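
A minimal sketch of the input format described here, assuming placeholder data and a stand-in classifier (LogisticRegression is only an illustration; the transcript does not say which model comes next). The split uses scikit-learn's default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 10 observations, 2 numeric features, binary target.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Split with default settings into the four sets described above.
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LogisticRegression(max_iter=1000)  # stand-in model for illustration only
model.fit(X_train, y_train)                # training takes X_train and y_train
y_pred = model.predict(X_test)             # inference takes X_test only
```

y_test is held back so the predictions can be compared against it afterwards.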

34
00:01:34,966 --> 00:01:36,500
So that's the reason.

35
00:01:36,500 --> 00:01:40,500
It is simply the format expected
by the future machine learning models.

36
00:01:40,500 --> 00:01:42,833
And now let's get these four sets.

37
00:01:42,833 --> 00:01:46,800
So we're going to get them
from scikit-learn, of course.

38
00:01:48,566 --> 00:01:49,600
There you go.

39
00:01:49,600 --> 00:01:53,866
From which we're going to get access
to model_selection.

40
00:01:53,866 --> 00:01:55,933
I really like Google Colab.

41
00:01:55,933 --> 00:01:59,800
And then from which we're going to import
that train

42
00:02:00,300 --> 00:02:03,200
underscore test underscore split function.

43
00:02:03,200 --> 00:02:04,033
Perfect.

44
00:02:04,033 --> 00:02:08,433
You see how we can be so efficient
thanks to the assistance of Google Colab.

45
00:02:08,433 --> 00:02:11,133
I hope you really like it as well.

46
00:02:11,133 --> 00:02:14,133
All right, so now that we have
this function, well we're going to use it.

47
00:02:14,133 --> 00:02:18,500
And since we already know what this
function will return, as I just explained,

48
00:02:18,600 --> 00:02:23,433
well, let's create these four variables
returned by this train_test_split function.

49
00:02:23,700 --> 00:02:28,833
And as we said, they are first X_train,
the matrix of features

50
00:02:28,833 --> 00:02:33,000
of the training set,
therefore containing all the countries,

51
00:02:33,333 --> 00:02:36,966
one-hot encoded, the ages,
and the salaries of the training set.

52
00:02:37,200 --> 00:02:38,366
So X_train.

53
00:02:38,366 --> 00:02:43,200
Then X_test,
the matrix of features of the test set.

54
00:02:43,566 --> 00:02:47,666
Then y_train,
which is the dependent variable

55
00:02:47,666 --> 00:02:50,800
of the training set,
meaning all the purchase decisions

56
00:02:50,866 --> 00:02:54,200
of the customers
in the training set, y_train,

57
00:02:54,366 --> 00:02:57,900
and then y_test, which similarly contains

58
00:02:57,900 --> 00:03:01,400
all the purchase decisions
of the customers in the test set.

59
00:03:01,566 --> 00:03:02,433
All right.

60
00:03:02,433 --> 00:03:06,966
So those are the four variables returned
by this train_test_split function.

61
00:03:06,966 --> 00:03:08,500
And since it is the function

62
00:03:08,500 --> 00:03:11,766
that returns these variables,
well, let's take that function right away.

63
00:03:12,066 --> 00:03:14,800
And let's add here an equals sign

64
00:03:14,800 --> 00:03:18,233
and train_test_split,
and then some parentheses.

65
00:03:18,433 --> 00:03:22,966
And now the question is what do we have
to input inside this function.

66
00:03:23,633 --> 00:03:24,133
All right.

67
00:03:24,133 --> 00:03:28,266
So actually there are some parameters
that we can guess, right?

68
00:03:28,633 --> 00:03:32,200
Because indeed this train_test_split
is supposed to split something.

69
00:03:32,200 --> 00:03:34,833
So one of the inputs will be
that something

70
00:03:34,833 --> 00:03:38,400
which we're about to split
and which is of course our data set.

71
00:03:38,633 --> 00:03:42,233
However, of course, this function
does not expect the data set as a whole.

72
00:03:42,400 --> 00:03:43,300
It expects,

73
00:03:43,300 --> 00:03:43,866
well, the

74
00:03:43,866 --> 00:03:48,200
combination of the matrix of features X
and the dependent variable vector y.

75
00:03:48,200 --> 00:03:51,100
And those are the first two inputs
of this function.

76
00:03:51,100 --> 00:03:53,533
So let's input them here: X

77
00:03:53,533 --> 00:03:57,666
first, the matrix of features,
and y, the dependent variable vector.

78
00:03:58,600 --> 00:04:01,000
Great, y. Yes.

79
00:04:01,000 --> 00:04:03,300
Then a comma, and then the next arguments.

80
00:04:03,300 --> 00:04:07,533
So we still have to input
two more arguments which are going to be

81
00:04:07,866 --> 00:04:10,866
first, the split size.

82
00:04:10,933 --> 00:04:15,533
You know, because we're not going to split
this data set into a training set

83
00:04:15,533 --> 00:04:19,566
and a test set of the same size.
Actually, we need a lot of observations

84
00:04:19,566 --> 00:04:22,000
in the training set
and a few in the test set.

85
00:04:22,000 --> 00:04:23,500
But we need a lot of them
in the training set.

86
00:04:23,500 --> 00:04:26,666
So that's to give the future machine
learning model more chance

87
00:04:26,666 --> 00:04:30,000
to understand and learn the correlations
in the data set.

88
00:04:30,300 --> 00:04:34,333
So let me just tell you
the recommended size of the split.

89
00:04:34,533 --> 00:04:37,766
Well, I recommend having 80% of the observations

90
00:04:37,833 --> 00:04:40,833
in the training set
and 20% in the test set.

91
00:04:41,333 --> 00:04:43,200
All right. This is a very good split.

92
00:04:43,200 --> 00:04:46,833
And therefore here
we're going to input a new parameter

93
00:04:46,833 --> 00:04:49,833
which is test_size.

94
00:04:49,833 --> 00:04:53,066
And we'll set that equal to 0.2.

95
00:04:53,066 --> 00:04:57,000
Right, 20% of the
observations will go into the test set.

96
00:04:57,266 --> 00:05:01,033
And therefore here since
we have ten observations in this data set,

97
00:05:01,200 --> 00:05:05,000
that means that eight observations
will go into the training set, meaning

98
00:05:05,033 --> 00:05:07,133
eight customers
will go into the training set.

99
00:05:07,133 --> 00:05:08,566
And two in the test set.

100
00:05:08,566 --> 00:05:10,733
And these are not necessarily the last two.

101
00:05:10,733 --> 00:05:12,633
You know, they will be taken randomly,

102
00:05:12,633 --> 00:05:15,900
but eight of them will
go into the training set and two in the test set.
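
A small sketch of the 80/20 split just described. The arrays below are placeholders (an assumption, not the course's data set); only the count of ten observations is taken from the transcript.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 10 observations with 3 features each.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# test_size=0.2 sends 20% of the observations to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

The rows are shuffled before splitting, so which two observations land in the test set changes from run to run unless a seed is fixed.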

103
00:05:16,033 --> 00:05:16,900
All right.

104
00:05:16,900 --> 00:05:22,000
And now we'll add one final argument
just for teaching purposes so that we can

105
00:05:22,000 --> 00:05:26,533
have the same results displayed in here,
you know, in the notebook.

106
00:05:26,533 --> 00:05:28,800
Because then I'm going to run some prints

107
00:05:28,800 --> 00:05:32,533
to show you these four elements returned
by this train_test_split function.

108
00:05:32,533 --> 00:05:34,300
You know, the training set
and the test set.

109
00:05:34,300 --> 00:05:37,533
And since there are some random factors
that are going to happen

110
00:05:37,533 --> 00:05:40,200
during the split, right,
because the observations

111
00:05:40,200 --> 00:05:43,200
will be randomly split
into the training set and the test set.

112
00:05:43,466 --> 00:05:46,200
Well, to make sure we have the same random
factors, we'll

113
00:05:46,200 --> 00:05:49,233
just add here random_state

114
00:05:50,633 --> 00:05:51,766
equals one. Right.

115
00:05:51,766 --> 00:05:56,233
We're just fixing the seed here
so that we'll get the same split

116
00:05:56,233 --> 00:05:59,233
and therefore the same training set
and same test set.
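
A short sketch of what fixing the seed does, again with placeholder arrays (an assumption). Calling the function twice with random_state=1 should produce identical training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 10 observations with 3 features each.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# Same seed on both calls, so the shuffling is identical.
split_a = train_test_split(X, y, test_size=0.2, random_state=1)
split_b = train_test_split(X, y, test_size=0.2, random_state=1)

# Compare X_train, X_test, y_train, y_test pairwise across the two calls.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```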