1 00:00:01,610 --> 00:00:04,490 In our data, we need to do outlier treatment 2 00:00:04,550 --> 00:00:04,970 also. 3 00:00:06,410 --> 00:00:07,230 What is an outlier, 4 00:00:07,280 --> 00:00:07,820 first of all? 5 00:00:08,930 --> 00:00:14,720 Outliers are values that appear far away from, or diverge from, the overall pattern of a variable. 6 00:00:16,170 --> 00:00:23,220 Outliers usually arise because some error occurred in the measurement, data entry, or sampling process. 7 00:00:24,860 --> 00:00:30,620 For example, the error was during data entry: you wanted to enter 1.3, but instead 8 00:00:30,650 --> 00:00:33,260 entered 13 because you missed the decimal point. 9 00:00:34,420 --> 00:00:35,790 Or it was a measurement error: 10 00:00:35,950 --> 00:00:40,350 while measuring, one of the sensors malfunctioned and gave an erroneous value. 11 00:00:41,890 --> 00:00:48,910 The reason may be anything, but outliers must be handled carefully before feeding the data to our model, because 12 00:00:48,910 --> 00:00:54,430 these outlying values will otherwise lead to high error variance and lower prediction accuracy. 13 00:00:55,410 --> 00:01:03,900 To find outliers, we should use EDA and visualization tools such as box plots, scatter plots, and histograms. 14 00:01:05,690 --> 00:01:08,900 We will see this when we work on the software tools. 15 00:01:10,660 --> 00:01:15,910 Once we have identified the outliers, we replace these values with some other values. 16 00:01:16,900 --> 00:01:21,040 There are several options for what value we should use to impute outliers. 17 00:01:22,990 --> 00:01:27,460 First, let me show you how one single outlier value impacts the data 18 00:01:28,210 --> 00:01:30,700 and how we can identify these outliers. 19 00:01:33,830 --> 00:01:39,210 These two datasets are the same, except the one on the right has one outlier value. 20 00:01:39,260 --> 00:01:39,980 Three, 300. 
21 00:01:41,170 --> 00:01:45,210 This led the mean of the right one to be shifted towards this outlier. 22 00:01:46,640 --> 00:01:48,290 But the median is not moving much. 23 00:01:49,640 --> 00:01:51,570 So, when the data has outliers, 24 00:01:52,130 --> 00:01:54,770 there is a larger difference between the mean and the median. 25 00:01:56,190 --> 00:01:59,250 Also, the standard deviation in such a case is higher. 26 00:02:00,510 --> 00:02:05,040 So once we have identified the outliers, we have the following options. 27 00:02:07,230 --> 00:02:09,270 We can do capping and flooring. 28 00:02:10,330 --> 00:02:15,880 What this means is we find the upper limit and lower limit beyond which we will change the values. 29 00:02:17,210 --> 00:02:24,140 So suppose we selected the 99th percentile as the upper limit, and all values larger than this value 30 00:02:24,200 --> 00:02:25,900 will be assigned a particular value. 31 00:02:27,280 --> 00:02:29,500 I hope you remember what a percentile value is. 32 00:02:31,780 --> 00:02:38,660 If you order all the values that are available with you in ascending order, then the 99th percentile value 33 00:02:38,660 --> 00:02:44,020 is the value which is larger than 99 percent of the other values. 34 00:02:45,650 --> 00:02:52,730 So only the one percent of values which are larger than this value will be considered outliers, 35 00:02:52,880 --> 00:02:54,860 and they will be assigned a different value. 36 00:02:56,410 --> 00:03:01,180 Usually we assign a value such as three times the 99th percentile value. 37 00:03:02,280 --> 00:03:04,560 This multiplier, three, is a personal choice. 38 00:03:05,510 --> 00:03:10,250 Similarly, for the lower limit, we can use the first percentile value and multiply it by 39 00:03:10,280 --> 00:03:10,940 0.3. 40 00:03:13,990 --> 00:03:15,690 The second method is exponential 41 00:03:15,850 --> 00:03:16,370 smoothing. 
42 00:03:18,010 --> 00:03:25,820 In this method, we extrapolate the curve from the 95th percentile to the 99th percentile beyond 43 00:03:25,820 --> 00:03:27,520 the 99th percentile point, 44 00:03:28,580 --> 00:03:34,070 and the values which go beyond the 99th percentile point are made to fall on this curve. 45 00:03:36,360 --> 00:03:39,620 This may seem difficult, but software packages do it straight away. 46 00:03:39,900 --> 00:03:41,820 However, I always prefer method one. 47 00:03:43,970 --> 00:03:49,610 The third is the sigma approach, which is popular in the manufacturing industry. In this approach, 48 00:03:49,760 --> 00:03:51,620 all values beyond some sigma, 49 00:03:52,100 --> 00:03:57,500 for example, all values which are farther from the mean by more than three times the standard deviation, 50 00:03:58,690 --> 00:04:02,310 are replaced with the value mean plus three times the standard deviation. 51 00:04:03,760 --> 00:04:05,890 This gives results similar to the first one. 52 00:04:06,970 --> 00:04:09,070 So one and three are actually equivalent. 53 00:04:10,950 --> 00:04:15,000 With this, we now know how to identify outliers and how to treat them.
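
The effect of a single outlier on the mean, median, and standard deviation, and two of the treatments described in this lecture (method one, capping and flooring, and method three, the sigma approach), can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's own code: the dataset, the value 5000, and the helper names `percentile`, `cap_and_floor`, and `sigma_cap` are all hypothetical. It reads the multiplier rule as "use 3 times the 99th percentile as the upper limit and 0.3 times the 1st percentile as the lower limit", which only makes sense for positive data.

```python
import statistics


def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def cap_and_floor(values, upper_mult=3.0, lower_mult=0.3):
    """Method one (assumed reading): clip to [lower_mult * P1, upper_mult * P99].

    Assumes positive data; the multipliers are a personal choice.
    """
    upper_limit = upper_mult * percentile(values, 99)
    lower_limit = lower_mult * percentile(values, 1)
    return [min(max(v, lower_limit), upper_limit) for v in values]


def sigma_cap(values, n_sigma=3.0):
    """Method three: clip to mean +/- n_sigma * sample standard deviation."""
    mean = statistics.mean(values)
    spread = n_sigma * statistics.stdev(values)
    return [min(max(v, mean - spread), mean + spread) for v in values]


# Hypothetical data: 200 ordinary values plus one extreme value.
clean = [float(v) for v in range(50, 250)]
with_outlier = clean + [5000.0]

# The outlier pulls the mean and inflates the standard deviation,
# while the median barely moves.
for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name, statistics.mean(data), statistics.median(data),
          statistics.stdev(data))

capped = cap_and_floor(with_outlier)
sigma_capped = sigma_cap(with_outlier)
print("largest value after capping and flooring:", max(capped))
print("largest value after 3-sigma capping:", max(sigma_capped))
```

Both treatments leave the ordinary values untouched and pull only the extreme value back to a limit. The two limits generally differ, because the sigma limit is computed from a mean and standard deviation that the outlier itself has already inflated.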