1 00:00:01,620 --> 00:00:09,600 In this session, you're going to see how to handle outliers. Just like null values, we will have outliers 2 00:00:09,600 --> 00:00:10,440 in our data set. 3 00:00:11,500 --> 00:00:16,830 In fact, the chances for outliers in the data set are much, much higher compared to null values. 4 00:00:17,350 --> 00:00:26,470 So it is important that as machine learning data scientists, you learn how to identify outliers and 5 00:00:26,470 --> 00:00:27,550 manage outliers. 6 00:00:28,220 --> 00:00:28,430 Right. 7 00:00:29,050 --> 00:00:33,780 So let's start this session with an understanding of what is an outlier. 8 00:00:34,830 --> 00:00:41,580 An outlier is an observation that is quite different from the rest of the data points in your dataset. 9 00:00:43,680 --> 00:00:51,900 It appears far away from other data points and is significantly different from the other data points, 10 00:00:52,290 --> 00:00:56,150 significantly diverging from the overall pattern. 11 00:00:57,570 --> 00:01:02,000 If you see the image that is there on your screen, this is an outlier. 12 00:01:02,730 --> 00:01:04,400 You see all the other fishes. 13 00:01:04,590 --> 00:01:06,350 They're going from right to left. 14 00:01:07,200 --> 00:01:15,060 They are either blue, light blue, dark blue in color, whereas the red fish, which is the odd man out, 15 00:01:15,630 --> 00:01:18,810 is red in color and is going from left to right. 16 00:01:20,230 --> 00:01:23,530 So definitely this red fish is an outlier. 17 00:01:23,830 --> 00:01:31,460 In fact, it's an outlier on two counts, the direction in which it is going and the color, right? 18 00:01:31,900 --> 00:01:37,060 So a data point can be an outlier on a single 19 00:01:38,730 --> 00:01:45,600 variable, on a single parameter, or it can be an outlier when you combine multiple parameters together, 20 00:01:45,660 --> 00:01:48,230 as you're seeing in this fish example, right?
21 00:01:49,300 --> 00:01:56,740 Even if that red fish were blue in color, it would still be an outlier because it is going from 22 00:01:56,920 --> 00:01:57,610 left to right. 23 00:01:59,610 --> 00:02:00,310 Are you getting it? 24 00:02:02,320 --> 00:02:04,820 So now you understand what is an outlier. 25 00:02:05,260 --> 00:02:08,820 Let's understand what will be the impact of outliers. 26 00:02:09,730 --> 00:02:12,430 Outliers typically swing the results, 27 00:02:12,640 --> 00:02:18,900 OK, one way or the other, right? They can change the results significantly. 28 00:02:20,260 --> 00:02:22,380 It basically means there can be bias. 29 00:02:22,390 --> 00:02:23,980 The variation can be high. 30 00:02:24,280 --> 00:02:27,060 The accuracy in classification will take a hit. 31 00:02:27,970 --> 00:02:36,140 In addition to that, the mere presence of outliers actually impacts the basic assumptions behind many 32 00:02:36,160 --> 00:02:40,650 statistical tests and machine learning models. 33 00:02:41,110 --> 00:02:41,390 Right. 34 00:02:42,130 --> 00:02:50,020 So if the outliers are randomly distributed, they decrease normality. In fact, normality is a 35 00:02:50,020 --> 00:02:50,840 requirement 36 00:02:51,460 --> 00:02:55,680 for, again, many of the statistical tests and machine learning models. 37 00:02:56,320 --> 00:02:56,640 Right. 38 00:02:56,950 --> 00:03:03,970 So it is important that we identify outliers and statistically manage the outliers, which is what we 39 00:03:03,970 --> 00:03:05,620 are going to see in the rest of the session. 40 00:03:06,910 --> 00:03:07,130 Right. 41 00:03:08,020 --> 00:03:10,820 So how do we identify outliers? 42 00:03:11,860 --> 00:03:13,930 You can identify outliers, 43 00:03:13,930 --> 00:03:21,820 you know, using histogram, scatter plot, box plot. Box plot is the best visualization method to identify outliers.
44 00:03:22,240 --> 00:03:29,080 It clearly shows visually how many outliers are there and what is the extent of the outliers compared 45 00:03:29,080 --> 00:03:30,300 to the rest of the population. 46 00:03:31,500 --> 00:03:36,590 Right, so box plot is the best way to identify outliers, and that's what we are going to be using. 47 00:03:36,990 --> 00:03:42,510 In fact, Python comes with a preloaded package, right, using which you can generate box plots 48 00:03:42,510 --> 00:03:43,180 very easily. 49 00:03:44,160 --> 00:03:44,540 OK. 50 00:03:45,050 --> 00:03:49,680 In addition to box plot, there are other thumb rules. 51 00:03:50,010 --> 00:03:57,800 These can be used to identify outliers, like, one, any value which is beyond 52 00:03:57,870 --> 00:04:00,110 1.5 times the interquartile range. 53 00:04:00,270 --> 00:04:00,630 Right. 54 00:04:01,410 --> 00:04:02,110 What is 55 00:04:02,160 --> 00:04:04,580 interquartile range, you're going to see very shortly. 56 00:04:05,310 --> 00:04:11,340 You can also use capping methods: any value that is outside of the 90th percentile 57 00:04:11,820 --> 00:04:18,750 I will consider as an outlier. Likewise, any data point that is beyond three or more standard deviations 58 00:04:19,080 --> 00:04:20,360 on both sides of the mean, 59 00:04:21,000 --> 00:04:24,460 OK, that is also considered as an outlier, right? 60 00:04:25,960 --> 00:04:33,370 In addition to this, if you have multi-factor outliers, that is, a variable identified as an outlier 61 00:04:33,380 --> 00:04:36,850 based on more than one factor, right? 62 00:04:37,360 --> 00:04:43,230 You can use Cook's distance and other distance measures to identify outliers. 63 00:04:44,050 --> 00:04:44,320 Right. 64 00:04:44,890 --> 00:04:48,220 So these rules are also used in addition to box plot. 65 00:04:48,850 --> 00:04:49,080 Right. 66 00:04:50,080 --> 00:04:57,310 We want to identify outliers because outliers are a very, very important aspect in exploratory
67 00:04:57,310 --> 00:05:02,020 data analysis. We want to either remove outliers or transform the data. 68 00:05:02,260 --> 00:05:03,610 We are going to see that shortly. 69 00:05:03,790 --> 00:05:11,430 OK, so that outliers are reduced and hence you manage the effect of outliers, right? 70 00:05:12,880 --> 00:05:22,710 So let's see what is this box plot. So box plot, right, helps us to identify outliers. How are 71 00:05:22,720 --> 00:05:30,280 outliers identified? Any observation outside this maximum and minimum, right, is normally considered as an 72 00:05:30,280 --> 00:05:31,930 outlier, as per box plot. 73 00:05:32,200 --> 00:05:34,750 So what is this maximum and minimum? 74 00:05:35,200 --> 00:05:45,360 It is the 75th percentile plus, or the 25th percentile minus, 1.5 times the difference between the 75th percentile and the 25th percentile. 75 00:05:45,360 --> 00:05:54,220 In this plus or minus 1.5 times, the difference between the 75th and 25th percentile is called as 76 00:05:54,220 --> 00:05:55,340 the interquartile range. 77 00:05:56,140 --> 00:06:03,430 In fact, I just mentioned, right, as per the thumb rule, one of the thumb rules, any point outside plus or 78 00:06:03,430 --> 00:06:12,400 minus 1.5 times the difference between the 75th and 25th percentile by itself can be considered as an 79 00:06:12,400 --> 00:06:12,880 outlier. 80 00:06:13,780 --> 00:06:18,280 But what box plot does is it adds one more data point, right. 81 00:06:18,520 --> 00:06:21,820 For maximum, it adds to the 75th percentile. 82 00:06:22,210 --> 00:06:24,550 And for minimum, it subtracts from the 83 00:06:26,710 --> 00:06:32,920 25th percentile, right? So any point outside of this maximum or minimum is called as an 84 00:06:32,920 --> 00:06:33,950 outlier, right? 85 00:06:34,180 --> 00:06:36,680 So this is what we are going to be primarily using. 86 00:06:36,730 --> 00:06:42,480 As I said, there are other thumb rules which you can use depending on the appropriateness of the situation. 87 00:06:43,510 --> 00:06:43,900 Right.
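
Editor's sketch, not part of the recording: the box plot rule described above can be written in a few lines of Python. This is a minimal illustration under stated assumptions. The function names `iqr_fences` and `iqr_outliers` and the income figures are made up for this example, and the quartiles use the standard library's "inclusive" interpolation, which may differ slightly from the method used by the plotting package in the lecture.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (minimum, maximum) box plot fences.

    minimum = 25th percentile - k * IQR
    maximum = 75th percentile + k * IQR
    where IQR is the 75th percentile minus the 25th percentile.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the observations that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical incomes, invented purely for illustration.
incomes = [2500, 3000, 3200, 3500, 3800, 4000, 4200, 4500, 5000, 20000]

print(iqr_fences(incomes))    # (1550.0, 6150.0)
print(iqr_outliers(incomes))  # [20000]
```

Changing `k` from 1.5 to a larger value widens the fences, so fewer points are flagged.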
88 00:06:45,670 --> 00:06:56,410 So what is this percentile, right? Because I used the terms 25th percentile and 75th percentile, right? 89 00:06:56,740 --> 00:07:00,280 So this percentile, for computing percentile, 90 00:07:00,310 --> 00:07:02,470 first we arrange the data in an ascending order. 91 00:07:03,630 --> 00:07:11,820 Small to big, low to high, you arrange the data in an ascending order. 92 00:07:11,850 --> 00:07:19,170 Right. Here I am trying to find what will be the 80th percentile height for this dataset. 93 00:07:19,200 --> 00:07:30,900 So this dataset contains the heights of 20 people, right? So suppose, let's say, the 17th person, right, has 94 00:07:30,900 --> 00:07:33,300 a height of five feet, 11 inches. 95 00:07:34,020 --> 00:07:36,420 So if you take the 17th person, 96 00:07:36,990 --> 00:07:39,690 right, I arranged the data in an ascending order. 97 00:07:40,020 --> 00:07:47,960 So 16 people are less than that particular individual, right? Like, 16 people 98 00:07:48,750 --> 00:07:53,910 have heights lower than the height of the 17th person. 99 00:07:54,200 --> 00:07:54,490 Right. 100 00:07:54,780 --> 00:07:56,680 80 percent of 20 is 16. 101 00:07:57,180 --> 00:08:02,180 And so I'm saying 16 individuals are below this height. 102 00:08:03,150 --> 00:08:11,100 So if this individual has got five feet, 11 inches, all these people, right, 80 percent of the population 103 00:08:11,100 --> 00:08:14,400 in this, 80 percent of the people in this set, 104 00:08:14,410 --> 00:08:14,820 right, 105 00:08:15,240 --> 00:08:18,550 have a height less than five feet, 11 inches. 106 00:08:19,230 --> 00:08:22,750 So five feet, 11 inches is the 80th percentile height. 107 00:08:23,100 --> 00:08:27,240 That means 80 percent of the people are below this particular height. 108 00:08:29,140 --> 00:08:35,920 That's the concept of percentile. Percentile is a ranking mechanism. It is used in many competitive 109 00:08:35,920 --> 00:08:43,750 exams like the GRE, GMAT and, in India, GATE and CAT examinations.
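
Editor's sketch, not part of the recording: the height example above can be checked with a short Python function. The helper name `percent_below` and the 20 heights (in inches) are hypothetical, chosen only so that the 17th person stands at 71 inches (five feet, 11 inches). The function follows the counting definition used in the lecture, the share of people strictly below a given height; statistical libraries often interpolate between data points instead, so their results can differ slightly.

```python
def percent_below(values, x):
    """Percentage of observations that are strictly smaller than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical heights in inches for 20 people, already in ascending order.
heights = [
    60, 61, 62, 63, 63.5, 64, 64.5, 65, 65.5, 66,
    66.5, 67, 68, 69, 69.5, 70, 71, 72, 73, 74,
]

seventeenth = sorted(heights)[16]           # the 17th person, 71 inches
print(percent_below(heights, seventeenth))  # 80.0, since 16 of 20 are shorter
```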
110 00:08:44,620 --> 00:08:49,530 So in those examinations, the ranking is determined based on percentile calculations. 111 00:08:49,840 --> 00:08:58,480 If someone says, you know, I have secured the 98th percentile, then that means his score is more than 112 00:08:58,480 --> 00:09:02,590 98 percent of the people that took that particular examination. 113 00:09:03,340 --> 00:09:03,600 Right. 114 00:09:03,640 --> 00:09:04,570 So it's a ranking. 115 00:09:05,960 --> 00:09:06,650 Are you getting it? 116 00:09:06,910 --> 00:09:10,810 So it is the percentile concept that is used in box plot. 117 00:09:12,660 --> 00:09:15,420 Right. Now that we have understood 118 00:09:16,950 --> 00:09:18,150 what is an outlier, 119 00:09:18,180 --> 00:09:19,840 how do I identify an outlier, 120 00:09:20,550 --> 00:09:24,360 we are going to see how to manage outliers, right? 121 00:09:24,720 --> 00:09:28,310 So because, as I said, outliers can swing the results one way or the other. 122 00:09:29,600 --> 00:09:35,990 One of the strategies that is adopted is to remove outliers, right? Many people do it. 123 00:09:37,500 --> 00:09:41,940 The other way is to transform the values, right? 124 00:09:42,870 --> 00:09:49,860 I can convert the data for that particular factor into a logarithmic 125 00:09:52,050 --> 00:10:00,270 transformation. I'm transforming the variable, right, from numerical to a logarithmic one, and then if I identify 126 00:10:00,270 --> 00:10:04,420 outliers, I invariably find that the number of outliers has come 127 00:10:04,500 --> 00:10:08,280 down. So converting data, that is, numeric data, into 128 00:10:08,280 --> 00:10:14,240 logarithmic data is one of the ways I manage outliers, right? The other is to use what is called as binning. 129 00:10:14,700 --> 00:10:20,970 Binning is the process of transforming numerical variables into categorical. In logarithmic, 130 00:10:20,970 --> 00:10:27,900 I convert numeric into log values, right? Whereas in binning, I convert numerical into categories.
131 00:10:28,480 --> 00:10:32,460 Suppose, let's say, the age of people 132 00:10:32,460 --> 00:10:33,070 is there, right? 133 00:10:33,330 --> 00:10:38,460 If we want to convert that into a categorical, you know, I can say that young, old, middle aged, 134 00:10:38,470 --> 00:10:45,150 you know, very young and stuff. Right, this helps us to manage outliers very well. 135 00:10:46,360 --> 00:10:46,780 What did. 136 00:10:48,570 --> 00:10:55,410 Now, let's see some examples. If you see this example that is there, now, for the case study that 137 00:10:55,410 --> 00:10:56,320 we are looking at, 138 00:10:57,270 --> 00:11:02,250 I have now developed the box plot for applicant income. So you can see that. 139 00:11:03,390 --> 00:11:07,300 So you can see that there are many outliers here, right? 140 00:11:08,590 --> 00:11:17,650 In applicant income, I see outliers beyond the 7,500 here, it is beyond 10,000, and beyond 141 00:11:18,160 --> 00:11:24,190 20,000, fewer outliers, right? Most of the data points are clustered around this. 142 00:11:24,490 --> 00:11:24,800 Right. 143 00:11:25,210 --> 00:11:27,640 The same for coapplicant income. 144 00:11:28,240 --> 00:11:36,040 And a similar trend is seen for loan amount also. Beyond this loan amount value, outliers, and beyond this 145 00:11:36,040 --> 00:11:38,290 point, some outliers are there. 146 00:11:40,030 --> 00:11:48,670 So as you can see, this plot helps us to get a very good view of outliers. The box plot is really 147 00:11:48,670 --> 00:11:49,810 useful and very powerful. 148 00:11:51,010 --> 00:11:58,800 So now, having seen the outliers in the dataset, I want to transform. 149 00:11:59,530 --> 00:11:59,810 Right. 150 00:12:00,160 --> 00:12:05,340 So, as I indicated, here I am converting the applicant income, 151 00:12:06,900 --> 00:12:14,940 OK, into a logarithmic transformation, right, and I am doing the same thing for the loan amount also, because 152 00:12:14,940 --> 00:12:15,930 there are outliers.
153 00:12:16,880 --> 00:12:22,790 Right, and if you really see, the number of outliers has come down. 154 00:12:24,320 --> 00:12:24,640 Right. 155 00:12:25,280 --> 00:12:32,340 The number of outliers has come down after I transformed the data and redid the box plot. 156 00:12:34,540 --> 00:12:40,660 So I recreated the box plot for the transformed data, and obviously the number of outliers has 157 00:12:40,810 --> 00:12:41,330 come down. 158 00:12:41,740 --> 00:12:45,580 So I use this transformed data in my machine learning. 159 00:12:46,510 --> 00:12:50,230 So even here you have outliers. Fine, we will live with it. 160 00:12:50,980 --> 00:12:57,820 Right, if you find the forecast accuracy is not up to the mark, maybe remove these outliers in the transformed 161 00:12:57,820 --> 00:13:03,510 data, rerun the model, see if there is any significant improvement. 162 00:13:04,030 --> 00:13:06,670 If there is a significant improvement, use it. 163 00:13:07,870 --> 00:13:14,260 But I go with the removal of the outliers in the transformed data. 164 00:13:20,230 --> 00:13:25,030 So this completes the session on managing outliers. 165 00:13:26,310 --> 00:13:32,790 Right, so outliers are a very, very important concept. You must first identify them and either remove 166 00:13:33,150 --> 00:13:39,750 or transform the data so that the number of outliers comes down and you can go ahead and continue with 167 00:13:39,750 --> 00:13:41,730 your model building activities. 168 00:13:42,720 --> 00:13:43,220 OK.
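
Editor's sketch, not part of the recording: the two transformations mentioned in the session, the logarithmic transformation and binning, can be illustrated with the Python standard library. Everything named here is an assumption made for the example: the helpers `count_iqr_outliers` and `age_group`, the income figures, and the age cut points are invented, and the case study's actual data and code are not shown in the transcript.

```python
import math
import statistics


def count_iqr_outliers(values, k=1.5):
    """Count observations outside the box plot fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return sum(1 for v in values if v < q1 - k * iqr or v > q3 + k * iqr)


def age_group(age):
    """Bin a numeric age into a category. The cut points are hypothetical."""
    if age < 18:
        return "very young"
    if age < 35:
        return "young"
    if age < 60:
        return "middle aged"
    return "old"


# Hypothetical incomes, invented purely for illustration.
incomes = [2500, 3000, 3200, 3500, 3800, 4000, 4200, 4500, 6500, 20000]
log_incomes = [math.log(v) for v in incomes]

print(count_iqr_outliers(incomes))      # outliers on the raw scale
print(count_iqr_outliers(log_incomes))  # outliers after the log transformation

# An extreme age simply lands in the last category after binning.
print([age_group(a) for a in (12, 25, 47, 120)])
```

With these made-up numbers the raw data has two points outside the fences and the log-transformed data has one, which matches the pattern described in the session: fewer outliers after the transformation, though not necessarily none.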