1
00:00:00,166 --> 00:00:00,966
Okay, my friends.

2
00:00:00,966 --> 00:00:04,966
So now let's add a new tool to our data
preprocessing toolkit,

3
00:00:05,166 --> 00:00:08,166
which is taking care of missing data.

4
00:00:08,333 --> 00:00:14,133
So indeed, if we have a look again
at our data set, data.csv, we notice

5
00:00:14,166 --> 00:00:19,166
that there is a missing salary here
for this specific customer from Germany.

6
00:00:19,466 --> 00:00:22,700
who is 40 years old
and who purchased the product.

7
00:00:23,033 --> 00:00:26,466
So generally you don't want to have
any missing data in your data

8
00:00:26,466 --> 00:00:29,633
set for the simple reason
that it can cause some errors

9
00:00:29,633 --> 00:00:33,766
when training your machine learning model,
and therefore you must handle them.

10
00:00:34,100 --> 00:00:36,333
There are actually several ways
to handle them.

11
00:00:36,333 --> 00:00:40,500
A first way is to just ignore
the observation by deleting it.

12
00:00:40,700 --> 00:00:44,900
That's one method, and this actually works
if you have a large data set

13
00:00:44,900 --> 00:00:49,133
and you know if you have only 1% missing
data, you know, removing 1%

14
00:00:49,300 --> 00:00:54,133
of the observations won't much change
the learning quality of your model.

15
00:00:54,133 --> 00:00:58,066
So 1% is fine, but sometimes you can have
a lot of missing data

16
00:00:58,066 --> 00:01:01,066
and therefore you must handle them
the right way.

17
00:01:01,100 --> 00:01:02,800
So that was the first way: to ignore them,

18
00:01:02,800 --> 00:01:03,766
to remove them.

19
00:01:03,766 --> 00:01:04,900
And now a second way.

20
00:01:04,900 --> 00:01:06,900
And this is what we're adding right now in

21
00:01:06,900 --> 00:01:10,900
the toolkit
is to actually replace the missing data.

22
00:01:10,933 --> 00:01:16,066
You know, the missing value by the average
of all the values in the column

23
00:01:16,233 --> 00:01:18,133
in which the data is missing.

24
00:01:18,133 --> 00:01:19,933
So here we have a missing salary.

25
00:01:19,933 --> 00:01:23,566
What we want to do is to replace this
missing salary

26
00:01:23,666 --> 00:01:26,566
by the average of all these salaries.

27
00:01:26,566 --> 00:01:29,266
This is a classic way of handling
missing data.

28
00:01:29,266 --> 00:01:31,533
And I'm going to teach it to you
right away.

29
00:01:31,533 --> 00:01:34,000
So here we go. Taking care of missing
data.

30
00:01:34,000 --> 00:01:38,933
Let's create a new code cell
and let's replace that missing salary

31
00:01:39,066 --> 00:01:41,933
by the average of all the salaries here.

32
00:01:41,933 --> 00:01:42,600
All right.

33
00:01:42,600 --> 00:01:45,300
So to do this
we're going to use the libraries.

34
00:01:45,300 --> 00:01:49,633
And actually I'm about to introduce you
to one of the best data science libraries.

35
00:01:49,833 --> 00:01:52,033
I'm talking about scikit-learn.

36
00:01:52,033 --> 00:01:56,733
Scikit-learn is an amazing data science
library containing a lot of tools,

37
00:01:56,966 --> 00:01:59,966
including a lot of data
preprocessing tools.

38
00:01:59,966 --> 00:02:03,900
You will see that we will actually use
scikit-learn a lot in this course.

39
00:02:04,166 --> 00:02:07,433
You know, more than half of the machine
learning models we will build

40
00:02:07,433 --> 00:02:10,433
in this
course will be built with scikit-learn.

41
00:02:10,600 --> 00:02:12,600
So if you don't know scikit-learn yet,

42
00:02:12,600 --> 00:02:15,600
I'm telling you
you're going to absolutely love it.

43
00:02:15,733 --> 00:02:19,166
And so for the first time here
we're going to use scikit-learn to handle

44
00:02:19,233 --> 00:02:20,466
missing data.

45
00:02:20,466 --> 00:02:23,200
And to do this
the class that we're going to use

46
00:02:23,200 --> 00:02:26,200
from scikit-learn is called SimpleImputer.

47
00:02:26,633 --> 00:02:30,966
We're actually going to first import
that SimpleImputer class.

48
00:02:31,166 --> 00:02:36,266
Then we will create an instance, you know,
an object of the SimpleImputer class.

49
00:02:36,566 --> 00:02:39,933
This object will allow us
to exactly replace

50
00:02:39,933 --> 00:02:43,100
this missing salary
here by the average of the salaries.

51
00:02:43,300 --> 00:02:46,200
And then we will have an updated data set.

52
00:02:46,200 --> 00:02:48,833
You know, actually an updated
matrix of features,

53
00:02:48,833 --> 00:02:52,266
because we will apply this
imputer on the matrix of features only.

54
00:02:52,500 --> 00:02:55,500
So we will have a new matrix of features
with no missing data

55
00:02:55,500 --> 00:02:59,366
because the missing salary will have been
replaced by the average salary.

56
00:02:59,866 --> 00:03:01,366
All right let's do this.

57
00:03:01,366 --> 00:03:01,866
Perfect.

58
00:03:01,866 --> 00:03:05,566
So first, since this class belongs
to scikit-learn, well,

59
00:03:05,566 --> 00:03:11,800
we're going to start here by typing from
and then scikit-learn, which has the name sklearn.

60
00:03:11,800 --> 00:03:12,966
So sklearn.

61
00:03:12,966 --> 00:03:16,366
Then remember in order to access a module
we have to add a dot.

62
00:03:16,666 --> 00:03:20,466
Because actually this SimpleImputer
class, which we want to import,

63
00:03:20,766 --> 00:03:24,800
belongs to a certain module of
scikit-learn called impute.

64
00:03:24,933 --> 00:03:26,633
This one impute.

65
00:03:26,633 --> 00:03:30,000
And from this impute module,
well, we're going to import,

66
00:03:30,000 --> 00:03:32,066
there we go, the

67
00:03:33,433 --> 00:03:34,400
SimpleImputer class.

68
00:03:34,400 --> 00:03:36,166
Google Colab really assists you

69
00:03:36,166 --> 00:03:38,900
with the SimpleImputer class. Perfect.

70
00:03:38,900 --> 00:03:40,133
Then next step.

71
00:03:40,133 --> 00:03:44,100
As I said the next step
is to create an instance of this class

72
00:03:44,100 --> 00:03:48,300
which you can exactly see
as the tool itself.

73
00:03:48,300 --> 00:03:49,133
You know, the tool

74
00:03:49,133 --> 00:03:53,100
that you'll use to replace that
missing salary by the average of salaries.

75
00:03:53,433 --> 00:03:54,766
So since we're about to

76
00:03:54,766 --> 00:03:58,566
create a new object, well,
we have to introduce here a new variable.

77
00:03:58,800 --> 00:04:02,400
And we're going to call this variable
imputer, okay?

78
00:04:02,400 --> 00:04:06,433
imputer, which will be exactly this object
of the SimpleImputer class.

79
00:04:06,866 --> 00:04:10,300
And therefore since it will be the object
of this SimpleImputer class, well,

80
00:04:10,600 --> 00:04:14,866
we naturally have to call this class,
SimpleImputer.

81
00:04:15,200 --> 00:04:18,633
So I'm going to copy this and paste
that here.

82
00:04:18,866 --> 00:04:20,933
That's
how you create an object of the class.

83
00:04:20,933 --> 00:04:22,433
You simply call the class.

84
00:04:22,433 --> 00:04:25,000
Then you add some parentheses
and there you go.

85
00:04:25,000 --> 00:04:28,466
Now you're going to enter
the right arguments in order to replace

86
00:04:28,466 --> 00:04:32,200
indeed this missing salary
by the average of salaries, because note

87
00:04:32,466 --> 00:04:35,500
that there are actually many replacements
that you could do.

88
00:04:35,500 --> 00:04:38,633
You could instead of replacing it
by the average salary,

89
00:04:38,633 --> 00:04:41,100
you could replace it by the median salary

90
00:04:41,100 --> 00:04:43,566
you know, there is a difference
between the average and the median.

91
00:04:43,566 --> 00:04:47,700
You could also replace a missing value
by the most frequent value.

92
00:04:47,833 --> 00:04:48,166
Right.

93
00:04:48,166 --> 00:04:51,300
That would be, for example,
relevant for categories okay.

94
00:04:51,300 --> 00:04:52,266
So we have many options.

95
00:04:52,266 --> 00:04:55,466
But the most classic one
and the one option that I recommend

96
00:04:55,600 --> 00:04:58,566
is the average salary.
The mean salary okay.

97
00:04:58,566 --> 00:05:01,566
And so that's exactly
what we have to enter here.

98
00:05:01,733 --> 00:05:06,600
First we have to specify
which missing values we have to replace.

99
00:05:06,866 --> 00:05:09,300
And so that's why we have to enter here

100
00:05:09,300 --> 00:05:14,400
a first argument called missing_values,
which has to be equal to np,

101
00:05:14,433 --> 00:05:17,100
you know, the NumPy library, dot nan: np.nan.

102
00:05:17,100 --> 00:05:20,000
And that's just to say
that we want to replace

103
00:05:20,000 --> 00:05:23,233
all the missing values in the data
set, like this one.

104
00:05:23,233 --> 00:05:24,900
This is like an empty value.

105
00:05:24,900 --> 00:05:27,366
This is what this means an empty value.

106
00:05:27,366 --> 00:05:31,000
And then the second argument
we have to input here is exactly

107
00:05:31,000 --> 00:05:34,266
the one saying
that indeed the missing values here,

108
00:05:34,266 --> 00:05:37,766
you know, the empty values of the data set
will be replaced by the mean.

109
00:05:37,766 --> 00:05:42,733
And to do this we have to add
the next argument here, which is strategy.

110
00:05:43,300 --> 00:05:47,966
And this argument will be equal to
in quotes, 'mean', okay?

111
00:05:47,966 --> 00:05:52,400
And that's just to say that we want
indeed to replace all the missing values

112
00:05:52,400 --> 00:05:55,766
in the matrix of features
by the mean of the feature itself.
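
The steps described so far (importing the class, creating the imputer object with missing_values and strategy, then applying it to the matrix of features) could look roughly like the sketch below. The small array X is a hypothetical stand-in for the numeric columns of the data set, not the actual values in data.csv, and calling fit_transform is just one way of applying the imputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary); NOT the real data.csv values.
X = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [40.0, np.nan],   # the missing salary
    [38.0, 61000.0],
])

# Create the imputer: replace every np.nan by the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Learn the column means and fill in the missing entries in one call.
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The missing salary ends up as the mean of the three salaries that are present in the same column, while the other values are unchanged.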