1 00:00:01,750 --> 00:00:06,040 Very often we find that some values in our data are missing. 2 00:00:07,470 --> 00:00:13,710 Now, this may happen when you are collecting data from people by asking them to fill in a physical form and 3 00:00:13,710 --> 00:00:15,390 they do not answer all the questions. 4 00:00:16,680 --> 00:00:19,540 It may happen because some of the stored data got corrupted. 5 00:00:20,850 --> 00:00:22,290 There may be many other reasons. 6 00:00:23,970 --> 00:00:30,810 But this results in a serious problem, as we cannot run a machine learning algorithm on missing values. 7 00:00:32,590 --> 00:00:39,040 Now, this leaves us with only two options. Either we remove the entire row which has some missing 8 00:00:39,040 --> 00:00:45,160 value: when we have a large dataset and only a few of the observations have missing values, we can 9 00:00:45,160 --> 00:00:47,770 delete those few observations and still have 10 00:00:48,780 --> 00:00:55,560 a substantial number of observations to train and test the model. But if the number of missing values 11 00:00:55,680 --> 00:00:59,790 is high, then removing observations is not advisable. 12 00:01:01,200 --> 00:01:08,790 In such a case, we should replace the blank places with some harmless values, harmless meaning that 13 00:01:08,790 --> 00:01:10,740 they should not impact the model 14 00:01:10,740 --> 00:01:19,770 much. Such neutral values are values close to the center of that variable, so we can choose the mean or median 15 00:01:19,770 --> 00:01:21,990 value to fill in these blanks. 16 00:01:24,070 --> 00:01:28,000 Here are some of the most common ways to do missing value imputation. 17 00:01:29,860 --> 00:01:36,070 First is to impute with zero. Note that a missing value is not the same as zero. 18 00:01:36,310 --> 00:01:39,400 Still, you can simply put a zero where you have a missing value.
19 00:01:40,890 --> 00:01:46,830 But this rarely makes business sense. If zero does make sense in a particular scenario, we can do 20 00:01:46,830 --> 00:01:47,100 that. 21 00:01:48,960 --> 00:01:53,660 The second option is to impute with central values, as this will not have much impact on the model. 22 00:01:54,990 --> 00:02:00,690 For example, if we have heights of people as a variable and some of the values are missing in the 23 00:02:00,690 --> 00:02:03,990 data, putting zero does not make any sense. 24 00:02:05,050 --> 00:02:09,160 So we can put the mean or median height of people in the missing fields. 25 00:02:10,990 --> 00:02:17,260 If you have a categorical variable with some values missing, you can assign the category which has 26 00:02:17,260 --> 00:02:19,890 the maximum frequency to the missing fields. 27 00:02:22,420 --> 00:02:28,860 So suppose we have gender data and 90 percent of the observations are male; then in the blank fields of the variable 28 00:02:29,020 --> 00:02:30,820 we can safely put male. 29 00:02:32,950 --> 00:02:39,340 The third option is, again, putting in a mean value, but instead of using the population mean, if 30 00:02:39,340 --> 00:02:45,160 we can make out relevant segments in the data, we can use the segment mean for that segment's observations. 31 00:02:46,280 --> 00:02:52,820 For example, if we have rainfall data for different cities, but the data is missing for some cities, 32 00:02:53,060 --> 00:03:00,950 instead of using the population mean we can take the mean of neighboring cities, or cities belonging to similar 33 00:03:00,950 --> 00:03:05,330 regions, or cities belonging to the same state, to fill the blanks. 34 00:03:06,540 --> 00:03:10,580 This makes sense since rainfall does not vary a lot among neighboring cities. 35 00:03:12,700 --> 00:03:19,960 Therefore, we should use business knowledge to select one of these methods wisely.
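The three imputation options described here can be sketched in a few lines. This is only an illustration, assuming Python with pandas (the lecture does not name the software at this point); the toy data and the column names `state`, `rainfall`, `height`, and `gender` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; all column names and values are illustrative assumptions.
df = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "rainfall": [10.0, np.nan, 14.0, 30.0, 34.0, np.nan],
    "height": [160.0, 170.0, np.nan, 180.0, 165.0, 175.0],
    "gender": ["male", "male", None, "male", "female", "male"],
})

# Option 1: impute with zero (only sensible when zero is a meaningful value).
rain_zero = df["rainfall"].fillna(0)

# Option 2: impute with a central value of the whole variable.
height_mean = df["height"].fillna(df["height"].mean())
height_median = df["height"].fillna(df["height"].median())

# Option 2 for a categorical variable: use the most frequent category.
gender_mode = df["gender"].fillna(df["gender"].mode()[0])

# Option 3: impute with the mean of the relevant segment (here, the state).
segment_mean = df.groupby("state")["rainfall"].transform("mean")
rain_segment = df["rainfall"].fillna(segment_mean)

print(rain_segment.tolist())
```

Note that the segment version fills the blank in state A from state A's cities only, which is the point of the rainfall example: the segment mean is usually closer to the true value than the overall mean.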
36 00:03:21,820 --> 00:03:29,930 When we examine the data in any software, the software gives us the number of blank values in each variable. 37 00:03:31,180 --> 00:03:35,090 We will learn how to fill the blank values in the software 38 00:03:35,110 --> 00:03:36,040 we are going to use.
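As a rough illustration of counting blanks per variable, here is a minimal sketch assuming Python with pandas, which may not be the software this course goes on to use. The data frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some blanks; names and values are illustrative only.
df = pd.DataFrame({
    "height": [160.0, np.nan, 172.0, 181.0],
    "weight": [55.0, 70.0, np.nan, np.nan],
    "city": ["X", "Y", None, "Z"],
})

# Number of blank values in each variable.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Share of rows that contain at least one blank value.
rows_with_missing = df.isna().any(axis=1).mean()
print(rows_with_missing)

# Dropping incomplete rows is only reasonable when that share is small.
complete_rows = df.dropna()
print(len(complete_rows))
```

In this toy example most rows have a blank, so deleting them would leave too little data; this is the situation where filling the blanks is the better choice.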