1
00:00:01,780 --> 00:00:06,070
Very often we find that some values in our data are missing.

2
00:00:07,560 --> 00:00:13,050
Now, this may happen when you are collecting data by asking people to fill in a physical form

3
00:00:13,530 --> 00:00:15,390
and they do not answer all the questions.

4
00:00:16,710 --> 00:00:19,590
It may also happen because some of the stored data got corrupted.

5
00:00:20,940 --> 00:00:22,320
There may be many other reasons.

6
00:00:24,030 --> 00:00:30,800
But this results in a serious problem, as we cannot run a machine learning algorithm on missing values.

7
00:00:32,620 --> 00:00:34,810
Now, this leaves us with only two options.

8
00:00:35,590 --> 00:00:42,190
Either we remove the entire row that has a missing value: when we have a large dataset and only

9
00:00:42,190 --> 00:00:47,800
a few of the observations have missing values, we can delete those few observations and still have

10
00:00:48,840 --> 00:00:52,770
a substantial number of observations for training and testing the model.

11
00:00:53,670 --> 00:01:01,680
But if the number of missing values is high, then removing observations is not advisable in such

12
00:01:01,680 --> 00:01:02,250
a case.

13
00:01:02,400 --> 00:01:09,510
We should replace these blank places with some harmless values, harmless meaning that they should not

14
00:01:09,510 --> 00:01:11,130
impact the model much.

15
00:01:12,780 --> 00:01:16,350
Such neutral values are values close to the center of that variable's distribution.

16
00:01:17,370 --> 00:01:21,990
So we can choose the mean or median value to put in these blanks.

17
00:01:24,160 --> 00:01:28,000
Here are some of the most common ways to do missing value imputation.

18
00:01:29,920 --> 00:01:31,550
The first is to impute with

19
00:01:31,810 --> 00:01:32,290
zero.

20
00:01:33,670 --> 00:01:36,130
Note that a missing value is not zero.

21
00:01:36,430 --> 00:01:39,430
You can simply put a zero where you have a missing value.

22
00:01:40,980 --> 00:01:42,900
But this rarely makes business sense.

23
00:01:43,680 --> 00:01:47,070
So if zero makes sense in any particular scenario, we can do that.

24
00:01:49,020 --> 00:01:53,700
The second option is to impute with a central value, as this will not have much impact on the model.

25
00:01:55,050 --> 00:02:00,660
For example, if we have heights of people as a variable and some of the values are missing in the

26
00:02:00,660 --> 00:02:03,990
data, putting zero does not make any sense.

27
00:02:05,110 --> 00:02:09,190
So we can put the mean or median height of people in the missing fields.

28
00:02:11,110 --> 00:02:17,260
If you have a categorical variable with some values missing, you can assign the category that has

29
00:02:17,260 --> 00:02:19,900
the maximum frequency to the missing fields.

30
00:02:22,510 --> 00:02:26,710
So suppose we have gender data and 90 percent of them are males.

31
00:02:27,070 --> 00:02:30,820
Then in the missing fields of that variable, we can safely put male.

32
00:02:33,010 --> 00:02:39,490
The third option is, again, putting in a mean value, but instead of using the population mean, if we

33
00:02:39,490 --> 00:02:45,150
can make out relevant segments from the data, we can use the segment mean for that segment's observations.

34
00:02:46,340 --> 00:02:52,850
For example, if we have rainfall data of different cities, but the data is missing for some cities,

35
00:02:53,180 --> 00:03:00,950
instead of using the population mean, we can take the mean of neighboring cities, or cities belonging to a similar

36
00:03:00,950 --> 00:03:05,330
region, or cities belonging to the same state, to fill the blanks.

37
00:03:06,600 --> 00:03:10,590
This makes sense since rainfall does not vary a lot among neighboring cities.

38
00:03:12,760 --> 00:03:19,960
Therefore, we should use business knowledge so that we can select one of these methods wisely.

39
00:03:21,880 --> 00:03:29,980
When we run the EDD in any software, the software gives us the number of blank values in each variable.

40
00:03:31,240 --> 00:03:35,050
We will learn how to fill the blank values in the software

41
00:03:35,110 --> 00:03:36,040
we are going to use.
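
The three imputation options described in this lecture can be sketched in plain Python, independent of whichever software the course goes on to use. This is only a minimal illustration: the function names (`impute_constant`, `impute_center`, `impute_segment_mean`), the use of `None` to mark a blank, and the sample data are all assumptions made for the example, not something taken from the lecture.

```python
from collections import defaultdict
from statistics import mean, median, mode


def impute_constant(values, fill=0):
    """Option 1: replace every blank (None) with a constant such as zero."""
    return [fill if v is None else v for v in values]


def impute_center(values, center=median):
    """Option 2: replace blanks with a central value of the observed data.

    Pass center=mean or center=median for numeric variables,
    or center=mode for a categorical variable (most frequent category).
    """
    observed = [v for v in values if v is not None]
    fill = center(observed)
    return [fill if v is None else v for v in values]


def impute_segment_mean(rows, segment_key, value_key):
    """Option 3: replace blanks with the mean of the row's own segment.

    Falls back to the overall mean when a segment has no observed values.
    """
    by_segment = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            by_segment[row[segment_key]].append(row[value_key])
    overall = mean([v for vals in by_segment.values() for v in vals])

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[value_key] is None:
            segment_values = by_segment.get(new_row[segment_key])
            new_row[value_key] = mean(segment_values) if segment_values else overall
        filled.append(new_row)
    return filled


# Hypothetical sample data, with None marking a missing value.
heights = [150.0, None, 160.0, 200.0]
genders = ["male", "male", None, "female", "male"]
rainfall = [
    {"city": "A", "state": "X", "rain": 100.0},
    {"city": "B", "state": "X", "rain": None},
    {"city": "C", "state": "X", "rain": 120.0},
    {"city": "D", "state": "Y", "rain": 20.0},
    {"city": "E", "state": "Y", "rain": None},
]

print(impute_constant(heights))               # zero rarely makes sense for heights
print(impute_center(heights, center=mean))    # blank filled with the mean height
print(impute_center(heights, center=median))  # blank filled with the median height
print(impute_center(genders, center=mode))    # blank filled with the most frequent category
print(impute_segment_mean(rainfall, "state", "rain"))  # blanks filled per state
```

Which option is appropriate still depends on business knowledge of the variable, as the lecture stresses; the code only shows the mechanics of each choice.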