1 00:00:01,490 --> 00:00:04,970 In our data, we need to do outlier treatment also. 2 00:00:06,350 --> 00:00:07,260 What is an outlier? 3 00:00:07,280 --> 00:00:07,850 First of all. 4 00:00:08,810 --> 00:00:14,670 Outliers are the values that appear to be away from, or diverge from, the overall pattern of a variable. 5 00:00:16,050 --> 00:00:23,220 Outliers usually occur because some error occurred in the measurement, data entry, or sampling process. 6 00:00:24,830 --> 00:00:30,650 For example, the error was during data entry and you wanted to enter 1.3, but instead 7 00:00:30,650 --> 00:00:33,250 entered 13 because you missed the decimal point. 8 00:00:34,450 --> 00:00:39,910 Or it was a measurement error, and while measuring, one of the sensors malfunctioned and gave an erroneous 9 00:00:39,910 --> 00:00:40,330 value. 10 00:00:41,800 --> 00:00:48,910 The reason may be anything, but outliers must be handled carefully before feeding the data to our model, because 11 00:00:48,910 --> 00:00:54,400 these outlying values will otherwise lead to high error variance and lower prediction accuracy. 12 00:00:55,350 --> 00:01:03,900 To find outliers, we should use EDD and visualization tools such as box plots, scatter plots, and histograms. 13 00:01:05,600 --> 00:01:08,900 We will see this when we work on the software tools. 14 00:01:10,600 --> 00:01:15,920 Once we have identified the outliers, we replace these values with some other values. 15 00:01:16,870 --> 00:01:21,040 There are several options for what value we should use to impute outliers. 16 00:01:22,930 --> 00:01:30,070 First, let me show you how one single outlier value impacts the data and how we can identify these 17 00:01:30,070 --> 00:01:30,700 outliers. 18 00:01:33,770 --> 00:01:39,980 These two datasets are the same, except the one on the right has one outlier value of a hundred. 19 00:01:41,170 --> 00:01:45,210 This leads the mean of the right one to be shifted towards this outlier. 20 00:01:46,580 --> 00:01:48,260 But the median is not moving much.
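The effect of a single outlier on the mean and the median can be checked with a minimal sketch. The values below are hypothetical (the datasets shown on screen are not part of this transcript), and `summarize` is an illustrative helper name, not something used in the lecture. It also reports the standard deviation, which grows when an outlier is present.

```python
import statistics


def summarize(values):
    """Return the mean, median, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


# Hypothetical values; the datasets shown in the lecture are not reproduced here.
clean = [2, 3, 4, 5, 6]
with_outlier = clean + [100]  # same data plus a single outlier of 100

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name, summarize(data))
```

In this sketch the mean moves from 4 to 20, while the median only moves from 4 to 4.5.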
21 00:01:49,610 --> 00:01:54,740 So, when the data has outliers, there is a larger difference between the mean and the median. 22 00:01:56,100 --> 00:01:59,250 Also, the standard deviation in such a case is higher. 23 00:02:00,480 --> 00:02:09,210 So once we have identified the outliers, we have the following options. First, we can do capping and flooring. 24 00:02:10,240 --> 00:02:15,880 What this means is we find the upper limit and lower limit beyond which we will change the values. 25 00:02:17,180 --> 00:02:24,530 So suppose we select the 99th percentile as the upper limit, and all values larger than this value will be 26 00:02:24,530 --> 00:02:25,880 assigned a particular value. 27 00:02:27,160 --> 00:02:29,470 I hope you remember what percentile values are. 28 00:02:31,690 --> 00:02:38,740 If you order all the values that are available with you in ascending order, then the 99th percentile value 29 00:02:38,740 --> 00:02:44,010 is the value which is larger than 99 percent of the other values. 30 00:02:45,590 --> 00:02:53,210 So only the one percent of values which are larger than this value will be considered as outliers, and they 31 00:02:53,210 --> 00:02:54,790 will be assigned a different value. 32 00:02:56,320 --> 00:03:01,150 Usually we assign a value such as three times the 99th percentile value. 33 00:03:02,220 --> 00:03:04,560 This multiplier of three is a personal choice. 34 00:03:05,450 --> 00:03:10,940 Similarly, for the lower limit, we can use the first percentile value and a multiplier of 0.3. 35 00:03:13,930 --> 00:03:16,270 The second method is exponential smoothing. 36 00:03:17,930 --> 00:03:27,080 In this method, we extrapolate the curve between the 95th percentile and the 99th percentile beyond the 99th percentile 37 00:03:27,080 --> 00:03:27,530 point. 38 00:03:28,520 --> 00:03:34,070 And the values which are going beyond the 99th percentile are made to fall on this curve. 39 00:03:36,300 --> 00:03:39,550 This may seem difficult, but software packages do it straightaway.
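The capping and flooring method (method one) can be sketched roughly as follows. This is an illustration under assumptions, not the lecture's own code: `cap_and_floor` and its parameter names are hypothetical, and the sketch reads the description as "use the multiplied percentile as the limit and pull anything beyond it back to that limit". Setting both multipliers to 1.0 clips at the percentiles themselves. A lower multiplier of 0.3 only makes sense for positive-valued data.

```python
import numpy as np


def cap_and_floor(values, upper_mult=3.0, lower_mult=0.3):
    """Clip values to [lower_mult * P1, upper_mult * P99].

    Assumption: the multiplied percentile is treated as the limit itself,
    which is one reading of the lecture's description.
    """
    values = np.asarray(values, dtype=float)
    p1, p99 = np.percentile(values, [1, 99])
    lower = lower_mult * p1   # floor: 0.3 x 1st percentile by default
    upper = upper_mult * p99  # cap: 3 x 99th percentile by default
    return np.clip(values, lower, upper)


# Hypothetical data: the integers 1..100 plus one extreme value.
data = np.append(np.arange(1, 101), 10000)
capped = cap_and_floor(data)
print(data.max(), "->", capped.max())
```

The multipliers are kept as parameters because, as the lecture says, the choice of three is a personal one.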
40 00:03:39,750 --> 00:03:41,790 However, I always prefer method one. 41 00:03:43,910 --> 00:03:50,000 Third is the sigma approach, which is popular in the manufacturing industry. In this approach, all 42 00:03:50,000 --> 00:03:56,480 values beyond some sigma, for example, all values which are far from the mean by more than three times 43 00:03:56,490 --> 00:03:57,440 the standard deviation, 44 00:03:58,570 --> 00:04:02,290 are replaced with the value mean plus three standard deviations. 45 00:04:03,730 --> 00:04:05,860 This gives results similar to the first one. 46 00:04:06,910 --> 00:04:09,070 So one and three are actually equivalent. 47 00:04:10,890 --> 00:04:14,970 With this, we now know how to identify outliers and how to treat them.
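The sigma approach can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: `sigma_cap` and the parameter `k` are hypothetical names, and handling the low side symmetrically (mean minus three standard deviations) is an assumption, since the lecture only mentions the upper replacement value. Note that the mean and standard deviation here are computed from the data including the outlier.

```python
import numpy as np


def sigma_cap(values, k=3.0):
    """Replace values farther than k standard deviations from the mean
    with mean + k*std (or mean - k*std on the low side, an assumption)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    return np.clip(values, mean - k * std, mean + k * std)


# Hypothetical data: 100 ordinary readings plus one extreme value.
readings = np.array([10.0] * 50 + [12.0] * 50 + [1000.0])
treated = sigma_cap(readings)
print(readings.max(), "->", treated.max())
```

With very small samples a single extreme value may never exceed three standard deviations, because it inflates the standard deviation itself, so this sketch uses about a hundred points.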