1
00:00:00,133 --> 00:00:01,566
All right, my friends, let's do this.

2
00:00:01,566 --> 00:00:03,800
Let's proceed to the next tool in our data

3
00:00:03,800 --> 00:00:07,733
preprocessing toolkit,
which is about encoding categorical data.

4
00:00:07,866 --> 00:00:10,800
So first
let me explain why we have to do this.

5
00:00:10,800 --> 00:00:13,466
Let's open the data set again.

6
00:00:13,466 --> 00:00:16,800
And as we can see this data set contains
one column

7
00:00:16,800 --> 00:00:20,200
with categories,
you know France, Spain or Germany.

8
00:00:20,633 --> 00:00:24,866
First you might guess that it will be
difficult for a machine learning model

9
00:00:24,900 --> 00:00:27,900
to compute some correlations
between these columns.

10
00:00:27,900 --> 00:00:32,100
You know, the features and the outcome,
which is the dependent variable.

11
00:00:32,233 --> 00:00:36,033
And therefore of course
we will have to turn these strings,

12
00:00:36,033 --> 00:00:39,000
you know, these categories into numbers.

13
00:00:39,000 --> 00:00:42,366
So one idea would be to encode France

14
00:00:42,366 --> 00:00:45,733
into zero, Spain
into one and Germany into two.

15
00:00:46,000 --> 00:00:50,233
However, if we do this, our future machine
learning model could understand

16
00:00:50,333 --> 00:00:53,700
that because France is zero,
Spain is one and Germany is two,

17
00:00:54,000 --> 00:00:57,533
there is a numerical order
between these three countries,

18
00:00:57,533 --> 00:01:00,600
and mostly
it could interpret that this order

19
00:01:00,600 --> 00:01:03,666
matters, whereas
of course it is absolutely not the case.

20
00:01:03,866 --> 00:01:04,200
Right?

21
00:01:04,200 --> 00:01:06,366
There is no inherent order

22
00:01:06,366 --> 00:01:09,366
between these three countries,
France, Germany and Spain.

23
00:01:09,366 --> 00:01:13,366
So we want to prevent the model
from having such an interpretation,

24
00:01:13,666 --> 00:01:17,666
because that could cause
some misinterpreted correlations

25
00:01:17,666 --> 00:01:21,033
between the features and the outcome,
which we want to predict.
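A minimal sketch of the integer encoding just described, assuming a hand-written mapping (the names `country_codes` and `countries` are illustrative and not taken from the course notebook). It suggests why the approach is risky: once the countries are integers, comparisons and differences between them can be computed, even though they mean nothing.

```python
# Hypothetical integer codes matching the spoken example.
country_codes = {"France": 0, "Spain": 1, "Germany": 2}

# A made-up country column.
countries = ["France", "Spain", "Germany", "Spain"]
encoded = [country_codes[c] for c in countries]
print(encoded)  # [0, 1, 2, 1]

# A model only sees the numbers, so these look like meaningful facts
# even though they say nothing real about the countries.
print(country_codes["Germany"] > country_codes["Spain"])   # True
print(country_codes["Germany"] - country_codes["France"])  # 2
```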

26
00:01:21,433 --> 00:01:24,600
Therefore,
we can actually do much better than just

27
00:01:24,800 --> 00:01:28,566
encode these three countries
into zero, one, and two.

28
00:01:28,833 --> 00:01:32,933
And this thing that we can do better
is actually one-hot encoding.

29
00:01:33,166 --> 00:01:36,733
And one-hot encoding consists
of turning this

30
00:01:36,933 --> 00:01:41,033
country column into three columns
Why three columns?

31
00:01:41,100 --> 00:01:42,766
Because there are actually three

32
00:01:42,766 --> 00:01:46,433
different classes in this country column,
you know, three different categories.

33
00:01:46,633 --> 00:01:50,300
If there were, for example, five countries
here, we would turn this column

34
00:01:50,300 --> 00:01:51,700
into five columns.

35
00:01:51,700 --> 00:01:55,000
And one-hot encoding consists of creating

36
00:01:55,000 --> 00:01:58,000
binary vectors for each of the countries.

37
00:01:58,066 --> 00:02:00,100
Let me explain this right away.

38
00:02:00,100 --> 00:02:04,900
So very simply, France would, for example,
have the vector 1 0 0,

39
00:02:05,133 --> 00:02:08,433
Spain would have the vector 0 1 0

40
00:02:08,600 --> 00:02:12,900
and Germany would have the vector 0 0 1,
so that then

41
00:02:12,900 --> 00:02:16,666
there is not a numerical order
between the three countries,

42
00:02:16,866 --> 00:02:19,233
because instead of having zero,
one and two,

43
00:02:19,233 --> 00:02:23,400
we would only have zeros and ones
and therefore three new columns.

44
00:02:23,700 --> 00:02:26,100
I'm going to show you, of course,
what we're going to create.

45
00:02:26,100 --> 00:02:30,266
We're basically going to replace this
country column by three new columns

46
00:02:30,266 --> 00:02:33,900
containing the zeros
and ones encoding each of the countries.

47
00:02:34,166 --> 00:02:36,300
That is called one-hot encoding.
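The binary vectors described here can be sketched by hand with NumPy. This is only an illustration under assumptions: the category order (France, Spain, Germany) follows the spoken example, and the helper `one_hot` is hypothetical rather than part of any library.

```python
import numpy as np

# Category order taken from the spoken example.
categories = ["France", "Spain", "Germany"]
index = {name: i for i, name in enumerate(categories)}

# Each row of the identity matrix is the vector for one category.
identity = np.eye(len(categories), dtype=int)


def one_hot(name):
    """Return the binary vector for a country (hypothetical helper)."""
    return identity[index[name]]


print(one_hot("France"))   # [1 0 0]
print(one_hot("Spain"))    # [0 1 0]
print(one_hot("Germany"))  # [0 0 1]
```

Each vector contains a single one, so no country ends up numerically larger or smaller than another.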

48
00:02:36,300 --> 00:02:39,266
And that is a very useful and popular
method to use

49
00:02:39,266 --> 00:02:43,266
when pre-processing your data
sets containing categorical variables.

50
00:02:43,633 --> 00:02:46,633
So that's the first thing
we'll do here for this country column.

51
00:02:46,633 --> 00:02:50,100
And then remember that
there is also this purchased column

52
00:02:50,100 --> 00:02:54,466
that has labels, you know,
non-numerical values, yes and no.

53
00:02:54,700 --> 00:02:58,633
And we will actually have to replace them
by zeros and ones.

54
00:02:58,833 --> 00:03:01,333
And that's totally fine
for the dependent variable.

55
00:03:01,333 --> 00:03:04,400
As long as it is a binary outcome,
it is super fine.

56
00:03:04,633 --> 00:03:08,433
It will actually not compromise
the future accuracy of the model

57
00:03:08,666 --> 00:03:11,100
if you just replace no and yes
by zero and one.
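One possible way to replace no and yes with zero and one is scikit-learn's `LabelEncoder`. The transcript has not named a tool for this step yet, so the choice of class is an assumption, and the `y` values below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical purchased column with yes/no labels.
y = ["No", "Yes", "No", "No", "Yes"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Classes are sorted alphabetically, so "No" becomes 0 and "Yes" becomes 1.
print(le.classes_)  # ['No' 'Yes']
print(y_encoded)    # [0 1 0 0 1]
```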

58
00:03:11,100 --> 00:03:14,100
Okay, so I will teach you
how to do these two things.

59
00:03:14,100 --> 00:03:18,266
And first let's start by one-hot
encoding the country column here.

60
00:03:18,566 --> 00:03:19,666
And there we go.

61
00:03:19,666 --> 00:03:22,566
Let's create a new code
cell for this new step.

62
00:03:22,566 --> 00:03:24,766
Encoding the independent variable.

63
00:03:26,033 --> 00:03:26,366
All right.

64
00:03:26,366 --> 00:03:29,400
So to
do this we're going to use two classes.

65
00:03:29,400 --> 00:03:32,000
The first one is the
ColumnTransformer class

66
00:03:32,000 --> 00:03:36,000
from the compose module of once again
the scikit-learn library.

67
00:03:36,300 --> 00:03:38,833
And the second class
is the OneHotEncoder class

68
00:03:38,833 --> 00:03:42,366
from the preprocessing module
of the same scikit-learn library.

69
00:03:42,700 --> 00:03:45,600
So first let's import these two classes.

70
00:03:45,600 --> 00:03:48,600
So we have to take them from scikit-learn

71
00:03:49,033 --> 00:03:53,100
from which we're going to call first
the compose module.

72
00:03:53,100 --> 00:03:56,100
There we go
from which we're going to import

73
00:03:56,133 --> 00:04:00,400
the class we're interested
in, which is, as Google Colab

74
00:04:00,400 --> 00:04:03,300
guesses it perfectly, ColumnTransformer.

75
00:04:03,300 --> 00:04:06,366
And then from scikit-learn, once again

76
00:04:06,800 --> 00:04:11,600
we're going to get access
to the preprocessing module, perfect,

77
00:04:11,766 --> 00:04:17,733
from which we're going to import
the OneHotEncoder class.

78
00:04:18,133 --> 00:04:21,633
And now we're going to mix
these two classes in order to do this

79
00:04:21,633 --> 00:04:24,133
one-hot encoding on the country column.
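A sketch of how the two imported classes might be combined, under stated assumptions: the feature matrix `X` below is made up, the country is assumed to sit in column 0, and the transformer name `"encoder"` is arbitrary. The transcript has not shown the actual call yet, so this is one plausible arrangement rather than the course's exact code.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: country, age, salary (values are made up).
X = np.array(
    [
        ["France", 44, 72000],
        ["Spain", 27, 48000],
        ["Germany", 30, 54000],
    ],
    dtype=object,
)

# One-hot encode column 0 and pass the other columns through unchanged.
# sparse_threshold=0 asks for a dense array rather than a sparse matrix.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.array(ct.fit_transform(X), dtype=float)

# The encoder sorts categories alphabetically: France, Germany, Spain.
print(ct.named_transformers_["encoder"].categories_[0])
print(X_encoded)
```

Because scikit-learn orders the categories alphabetically, Germany gets the second column and Spain the third, which differs from the order used in the spoken example. The three one-hot columns come first in the result, followed by the untouched age and salary columns.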