1 00:00:01,780 --> 00:00:06,070 Very often we find that some values in our data are missing. 2 00:00:07,560 --> 00:00:13,050 Now, this may happen when you are collecting data from people by asking them to fill in a physical form 3 00:00:13,530 --> 00:00:15,390 and they do not answer all the questions. 4 00:00:16,710 --> 00:00:19,590 This may happen because some of the stored data got corrupted. 5 00:00:20,940 --> 00:00:22,320 There may be many other reasons. 6 00:00:24,030 --> 00:00:30,800 But this results in a serious problem, as we cannot run a machine learning algorithm on missing values. 7 00:00:32,620 --> 00:00:34,810 Now, this leaves us with only two options. 8 00:00:35,590 --> 00:00:42,190 We can remove the entire row that has a missing value. When we have a large dataset and only 9 00:00:42,190 --> 00:00:47,800 a few of the observations have missing values, then we can delete those few observations and still have 10 00:00:48,840 --> 00:00:52,770 a substantial number of observations for training and testing the model. 11 00:00:53,670 --> 00:01:01,680 But if the number of missing values is high, then removing observations is not advisable. In such 12 00:01:01,680 --> 00:01:02,250 a case, 13 00:01:02,400 --> 00:01:09,510 we should instead replace these blank places with some harmless values, harmless meaning that they should not 14 00:01:09,510 --> 00:01:11,130 impact the model much. 15 00:01:12,780 --> 00:01:16,350 Such neutral values are values close to the center of that variable. 16 00:01:17,370 --> 00:01:21,990 So we can choose the mean or median value to put in these blanks. 17 00:01:24,160 --> 00:01:28,000 Here are some of the most common ways to do missing value imputation. 18 00:01:29,920 --> 00:01:31,550 The first is to impute with 19 00:01:31,810 --> 00:01:32,290 zero. 20 00:01:33,670 --> 00:01:36,130 Note that a missing value is not zero. 21 00:01:36,430 --> 00:01:39,430 You can simply put a zero where you have a missing value. 
22 00:01:40,980 --> 00:01:42,900 But this rarely makes business sense. 23 00:01:43,680 --> 00:01:47,070 So if zero makes sense in a particular scenario, we can do that. 24 00:01:49,020 --> 00:01:53,700 The second option is to impute with central values, as this will not have much impact on the model. 25 00:01:55,050 --> 00:02:00,660 For example, if we have heights of people as a variable and some of the values are missing in the 26 00:02:00,660 --> 00:02:03,990 data, putting zero does not make any sense. 27 00:02:05,110 --> 00:02:09,190 So we can put the mean or median height of people in the missing fields. 28 00:02:11,110 --> 00:02:17,260 If you have a categorical variable with some values missing, you can assign the category which has 29 00:02:17,260 --> 00:02:19,900 the maximum frequency to the missing fields. 30 00:02:22,510 --> 00:02:26,710 So suppose we have gender data and 90 percent of them are males. 31 00:02:27,070 --> 00:02:30,820 Then in the missing fields of the variable we can safely put male. 32 00:02:33,010 --> 00:02:39,490 The third option is, again, putting in a mean value, but instead of using the population mean, if we 33 00:02:39,490 --> 00:02:45,150 can make out relevant segments from the data, we can use the segment mean for that segment's observations. 34 00:02:46,340 --> 00:02:52,850 For example, if we have rainfall data of different cities, but the data is missing for some cities, 35 00:02:53,180 --> 00:03:00,950 instead of using the population mean we can take the mean of neighboring cities, or cities belonging to a similar 36 00:03:00,950 --> 00:03:05,330 region, or cities belonging to the same state, to fill the blanks. 37 00:03:06,600 --> 00:03:10,590 This makes sense since rainfall does not vary a lot among neighboring cities. 38 00:03:12,760 --> 00:03:19,960 Therefore, we should use business knowledge so that we can select one of these methods wisely. 
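The three imputation options described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: it assumes a missing value is represented as `None`, it uses only the standard library, and the function names (`impute_constant`, `impute_center`, `impute_mode`, `impute_segment_mean`) and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def impute_constant(values, fill=0):
    """Option 1: replace every missing entry (None) with a constant such as zero."""
    return [fill if v is None else v for v in values]


def impute_center(values, center=mean):
    """Option 2: replace missing entries with the mean (or median) of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = center(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Option 2 for a categorical variable: use the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def impute_segment_mean(values, segments):
    """Option 3: replace a missing entry with the mean of its own segment.

    Assumption: if a segment has no observed value at all, fall back to the
    overall (population) mean.
    """
    observed = [v for v in values if v is not None]
    overall = mean(observed)
    by_segment = defaultdict(list)
    for v, s in zip(values, segments):
        if v is not None:
            by_segment[s].append(v)
    segment_means = {s: mean(vs) for s, vs in by_segment.items()}
    return [
        segment_means.get(s, overall) if v is None else v
        for v, s in zip(values, segments)
    ]


# Hypothetical sample data, for illustration only.
heights = [160, None, 170, 180]
gender = ["male", "male", None, "female", "male"]
rainfall = [100, None, 120, 30, None, 50]
state = ["A", "A", "A", "B", "B", "B"]

print(impute_center(heights))                 # mean height fills the blank
print(impute_center(heights, center=median))  # median height fills the blank
print(impute_mode(gender))                    # most frequent category fills the blank
print(impute_segment_mean(rainfall, state))   # each blank gets its own state's mean
```

Which function is appropriate depends on the variable, which is where the business knowledge mentioned above comes in: zero only where zero is a meaningful value, the mean or median for a numeric variable such as height, the most frequent category for a categorical variable, and a segment mean when the data splits into groups that behave alike.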
39 00:03:21,880 --> 00:03:29,980 When we run EDD in any software, the software does give us the number of blank values in each variable. 40 00:03:31,240 --> 00:03:35,050 We will learn how to fill the blank values in the software 41 00:03:35,110 --> 00:03:36,040 we are going to use.
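As a rough illustration of the count of blank values per variable, and of the earlier choice between deleting rows and imputing, here is a small Python sketch. It is not tied to the software used in the course, which is not named here. It assumes each observation is a dictionary and a blank is stored as `None`; the helper names `count_missing` and `drop_incomplete` and the records are hypothetical.

```python
def count_missing(rows):
    """Return {variable: number of missing (None) values} for a list of dict records."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def drop_incomplete(rows):
    """Keep only the observations in which no value is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical records, for illustration only.
records = [
    {"height": 160, "gender": "male"},
    {"height": None, "gender": "female"},
    {"height": 175, "gender": None},
    {"height": 180, "gender": "male"},
]

print(count_missing(records))    # blanks per variable
print(drop_incomplete(records))  # rows that survive deletion
```

Comparing the counts with the total number of observations indicates which route is reasonable: if only a few rows are affected, dropping them still leaves enough data, and if many are affected, imputation is the safer choice.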