1 00:00:01,760 --> 00:00:06,110 So let's identify and treat outliers in Python. 2 00:00:07,730 --> 00:00:11,390 If you remember, we saved our data as df. 3 00:00:12,380 --> 00:00:19,670 So let's first try another function, that is, the info function. We'll write df.info(). 4 00:00:27,070 --> 00:00:28,660 You can see, using this function, 5 00:00:28,780 --> 00:00:30,670 I can get the non-null counts 6 00:00:32,510 --> 00:00:38,210 and the data types of the data. We missed this while discussing the EDD; 7 00:00:38,300 --> 00:00:39,710 that's why we are discussing it now. 8 00:00:40,490 --> 00:00:44,120 Now let's move on to outlier treatment and identification. 9 00:00:44,740 --> 00:00:48,980 So from our EDD, we identified three variables 10 00:00:50,190 --> 00:00:53,550 that we want to look at. The first one is the crime rate. 11 00:00:54,450 --> 00:00:58,830 The second one is n_hot_rooms and the third one is rainfall. 12 00:01:00,090 --> 00:01:03,830 If you remember, we plotted scatterplots for rainfall 13 00:01:04,530 --> 00:01:10,740 and n_hot_rooms, and we confirmed that both of these contain outliers. 14 00:01:11,250 --> 00:01:11,850 However, 15 00:01:13,000 --> 00:01:20,380 we were not able to judge whether it was outliers or skewness in the case of crime rate. 16 00:01:24,130 --> 00:01:29,950 Since we definitely know that rainfall and n_hot_rooms contain outliers, 17 00:01:31,880 --> 00:01:34,550 we will directly write a function to treat them. 18 00:01:35,710 --> 00:01:41,440 Before that, if you remember, in our theory lecture we discussed capping and flooring. 19 00:01:43,800 --> 00:01:52,440 So first, we need to identify the 99th and the 1st percentile values of these two variables. To do that, 20 00:01:53,250 --> 00:01:57,060 there is a function in NumPy called percentile. 21 00:01:58,890 --> 00:02:03,210 We'll first look at the function. We'll write 22 00:02:03,350 --> 00:02:04,080 np.percentile. 
23 00:02:06,520 --> 00:02:10,630 Then, as the first argument, we have to pass the column name. 24 00:02:10,780 --> 00:02:14,590 So we write df.n_hot_rooms. 25 00:02:19,890 --> 00:02:24,540 And then in square brackets, we will write the percentile value 26 00:02:24,660 --> 00:02:25,460 we want to see. 27 00:02:25,620 --> 00:02:27,510 So we want the 99th percentile. 28 00:02:28,220 --> 00:02:29,340 We'll write 99. 29 00:02:30,200 --> 00:02:31,020 Then run this code. 30 00:02:33,840 --> 00:02:43,380 You can see this is an array, and the 99th percentile of n_hot_rooms is fifteen point three nine nine 31 00:02:43,650 --> 00:02:44,310 five two. 32 00:02:45,930 --> 00:02:48,450 But the output here is an array. 33 00:02:48,930 --> 00:02:55,530 So remember, if we want to get the first number of this array, we have to specify the location of 34 00:02:55,530 --> 00:02:57,900 that value in square brackets. 35 00:02:57,960 --> 00:03:00,170 So we'll write np.percentile, 36 00:03:04,730 --> 00:03:05,390 df dot 37 00:03:05,970 --> 00:03:06,880 n_hot_rooms, 38 00:03:10,560 --> 00:03:12,620 99 for the 99th percentile. 39 00:03:15,110 --> 00:03:17,690 And then after that, in square brackets, we'll write zero. 40 00:03:17,960 --> 00:03:23,660 So we are actually getting the first element of this array. 41 00:03:27,120 --> 00:03:31,170 We will save the value of this 99th percentile 42 00:03:31,240 --> 00:03:34,380 in another variable, which we'll call the upper limit. 43 00:03:34,530 --> 00:03:35,730 So we will write 44 00:03:35,850 --> 00:03:36,270 uv. 45 00:03:42,050 --> 00:03:50,000 So we are just saving this 15.399 value in another variable, uv. 46 00:03:57,160 --> 00:04:04,330 Now, how do we identify the rows where the n_hot_rooms value is more than this number? To do that, 47 00:04:04,580 --> 00:04:10,050 we'll write df, and in square brackets we'll write 48 00:04:10,060 --> 00:04:12,130 the condition where 49 00:04:16,430 --> 00:04:18,470 df 
dot n_hot_rooms 50 00:04:25,240 --> 00:04:26,720 is more than uv. 51 00:04:28,220 --> 00:04:29,190 So we'll write greater 52 00:04:29,310 --> 00:04:30,090 than uv. 53 00:04:32,280 --> 00:04:36,630 Greater than uv. When we run this code, 54 00:04:42,470 --> 00:04:51,710 you can see we are getting all the rows where the n_hot_rooms value is greater than this 99th 55 00:04:51,710 --> 00:04:53,810 percentile value of n_hot_rooms. 56 00:04:58,210 --> 00:05:01,540 If you remember, the uv value was 15.399. 57 00:05:02,940 --> 00:05:09,300 So if you see, we are getting all the rows where this value is greater than this uv 58 00:05:09,930 --> 00:05:13,890 value. Now we want to limit these values. 59 00:05:14,160 --> 00:05:20,540 If you remember, in capping and flooring, we discussed that we can multiply this value by any factor 60 00:05:20,560 --> 00:05:21,570 or any value. 61 00:05:22,170 --> 00:05:24,000 We can replace those values. 62 00:05:25,590 --> 00:05:29,020 Now we know how to identify the outliers in our data. 63 00:05:29,920 --> 00:05:32,770 Now let's just cap these values. 64 00:05:34,130 --> 00:05:39,490 Now for our case, we are taking a multiplication factor equal to three, 65 00:05:40,070 --> 00:05:47,720 since we only want to treat the genuine outliers, which are 101 and 81, and we don't 66 00:05:47,720 --> 00:05:50,320 want to touch these three outliers, 67 00:05:50,390 --> 00:05:54,860 that is, 15.40, which are very close to our uv value. 68 00:05:55,400 --> 00:06:02,590 That's why we are taking a factor equal to three. We'll write df.n_hot_rooms. 69 00:06:03,560 --> 00:06:06,210 We want to change the values of this variable. 70 00:06:06,440 --> 00:06:14,540 That's why we are selecting this variable only, and in square brackets we'll specify the condition where df dot 71 00:06:14,630 --> 00:06:15,620 n_hot_rooms 
72 00:06:19,130 --> 00:06:21,220 is greater than 3 times uv. 73 00:06:25,670 --> 00:06:28,640 So 3 times uv is approximately 46. 74 00:06:29,030 --> 00:06:31,770 So you can see, for these two values, 75 00:06:31,910 --> 00:06:33,680 for 101 and 81, 76 00:06:34,070 --> 00:06:36,980 we want to limit and cap these values. 77 00:06:37,850 --> 00:06:41,350 And we want to limit them to a value equal to 3 times 78 00:06:41,360 --> 00:06:41,630 uv. 79 00:06:47,040 --> 00:06:47,850 We'll run this. 80 00:06:50,970 --> 00:06:53,590 This is just a warning, not an error. 81 00:06:53,980 --> 00:06:58,030 So we can continue. Now if we rerun this statement, 82 00:06:59,030 --> 00:06:59,680 we'll see 83 00:07:01,960 --> 00:07:04,780 we have limited the values to 46. 84 00:07:06,430 --> 00:07:08,030 This was earlier 101, 85 00:07:08,080 --> 00:07:09,360 and this was 81. 86 00:07:09,860 --> 00:07:15,130 Now we are getting a constant value of 46. 87 00:07:16,680 --> 00:07:19,590 This is how we treat outliers using Python. 88 00:07:20,810 --> 00:07:25,980 Similarly, in rainfall, there are values which are outliers on the lower side. 89 00:07:28,230 --> 00:07:32,380 Let's identify the outliers in rainfall and treat them as well. 90 00:07:35,570 --> 00:07:37,870 We will write np.percentile, 91 00:07:41,500 --> 00:07:44,880 df.rainfall, 92 00:07:51,930 --> 00:07:54,550 and we want the first percentile value. 93 00:07:58,430 --> 00:08:04,610 Since the outliers are on the lower end and we want the first value of this array, that's why we have put zero. 94 00:08:05,640 --> 00:08:06,900 Let's run this. 95 00:08:08,430 --> 00:08:11,740 Now, we will save this value in another variable called lv, 96 00:08:12,250 --> 00:08:13,480 that is, the lower value. 97 00:08:16,950 --> 00:08:18,640 Let me run this. 98 00:08:21,200 --> 00:08:25,790 Our lv is a variable which contains this first percentile value. 
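The capping steps described above can be sketched roughly as follows. This is only an illustrative sketch: the sample values are made up, and the column name `n_hot_rooms` and the variable name `uv` are assumed from how they are spoken in the lecture. It also assigns through `df.loc` rather than the attribute-style assignment dictated in the video, which is a deliberate swap to avoid the pandas chained-assignment warning mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (made up for illustration): 198 ordinary values
# between 10 and 15, plus two extreme values similar to those in the lecture.
values = np.append(np.linspace(10, 15, 198), [81.0, 101.0])
df = pd.DataFrame({"n_hot_rooms": values})

# Passing [99] returns a one-element array, so [0] picks out the number itself.
uv = np.percentile(df.n_hot_rooms, [99])[0]

# Rows whose value lies above the 99th percentile (taken before capping).
above_uv = df[df.n_hot_rooms > uv]

# Cap anything above 3 * uv at 3 * uv. Using .loc assigns in place and
# avoids the chained-assignment warning.
cap = 3 * uv
df.loc[df.n_hot_rooms > cap, "n_hot_rooms"] = cap

print(uv, cap, df.n_hot_rooms.max())
```

With this made-up data only the two extreme rows exceed `3 * uv`, so only they are changed; values just above `uv` are left alone, which is the reason the lecture gives for choosing a factor of three.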
99 00:08:26,630 --> 00:08:34,940 Now we'll compare this lv with our rainfall values, and we'll try to identify all the values which are lower than 100 00:08:34,940 --> 00:08:35,930 this lv value. 101 00:08:38,490 --> 00:08:38,760 We'll write 102 00:08:38,910 --> 00:08:39,270 df, 103 00:08:44,660 --> 00:08:51,060 where df.rainfall is less than lv. 104 00:08:54,920 --> 00:08:55,610 Run this. 105 00:08:56,510 --> 00:08:59,320 You can see we are only getting one single value, 106 00:08:59,390 --> 00:09:01,790 where the rainfall value is three. 107 00:09:02,480 --> 00:09:06,200 So this is definitely an outlier and we should treat it. 108 00:09:08,710 --> 00:09:14,410 As mentioned in the theory lecture, for the lower values we'll multiply by a decimal fraction. 109 00:09:15,250 --> 00:09:18,360 So in our case, we will write 0.3. 110 00:09:27,800 --> 00:09:29,450 We will select all the values 111 00:09:33,470 --> 00:09:40,250 where df.rainfall is less than 0.3 times lv, 112 00:09:41,540 --> 00:09:45,720 and we will equate them to 0.3 times 113 00:09:46,330 --> 00:09:46,860 lv. 114 00:09:51,320 --> 00:09:52,850 We'll run this, and we'll 115 00:09:55,180 --> 00:09:57,720 run this statement again. 116 00:09:57,770 --> 00:09:59,330 You can see that the 117 00:10:01,070 --> 00:10:04,250 rainfall value is now six instead of three. 118 00:10:06,720 --> 00:10:09,620 That is how we treat outliers using Python. 119 00:10:11,530 --> 00:10:16,330 Now, let's look at the next variable, which is the crime rate variable. 120 00:10:19,570 --> 00:10:22,870 Since we don't know exactly whether crime rate contains outliers 121 00:10:23,960 --> 00:10:26,190 or the distribution is skewed, 122 00:10:26,500 --> 00:10:29,990 we'll first draw a jointplot of crime rate 123 00:10:30,030 --> 00:10:31,460 versus our dependent 124 00:10:31,530 --> 00:10:31,700 variable. 125 00:10:35,110 --> 00:10:35,500 We'll write 
126 00:10:35,620 --> 00:10:39,220 jointplot, where x is crime rate 127 00:10:39,640 --> 00:10:41,710 and y is our price variable, 128 00:10:52,000 --> 00:10:55,390 and data is df. Run this. 129 00:11:02,350 --> 00:11:05,080 First, if you see the histogram of crime rate, 130 00:11:06,120 --> 00:11:14,850 there is a large concentration of values at the lower end of crime rate, but as we move along, 131 00:11:15,090 --> 00:11:20,910 as we move along to the higher end of the crime rate, the density of the distribution reduces. 132 00:11:21,260 --> 00:11:27,330 So most of our points are concentrated towards low crime rate, whereas there are only a few values which 133 00:11:27,330 --> 00:11:28,350 have a high crime rate. 134 00:11:30,010 --> 00:11:38,230 And if you see the relationship with y, you can see that there is somewhat of a polynomial relationship 135 00:11:38,230 --> 00:11:39,120 here with y. 136 00:11:40,220 --> 00:11:42,390 For low crime rate, the price is high. 137 00:11:42,920 --> 00:11:46,490 But as the crime rate increases, the price decreases. 138 00:11:47,720 --> 00:11:54,230 And when you view this as a scatterplot, there is no linear relationship between price and crime 139 00:11:54,230 --> 00:11:54,530 rate. 140 00:11:54,740 --> 00:11:57,340 There is somewhat of a polynomial relationship. 141 00:11:58,370 --> 00:12:01,430 So there is a way to treat this relationship. 142 00:12:01,910 --> 00:12:08,690 We can take the log, the exponential or the square root of crime rate to make it more linear. 143 00:12:09,590 --> 00:12:14,870 And when we do that, our outliers will automatically be gone. 144 00:12:14,960 --> 00:12:19,400 So, for example, if I have the values one, ten and a hundred, 145 00:12:20,770 --> 00:12:25,690 if we apply our normal rules, we would treat hundred as an outlier. 146 00:12:26,230 --> 00:12:27,640 But if we take the log 
147 00:12:28,850 --> 00:12:35,570 with base 10 of all these values, one will become zero, ten will become one and 148 00:12:35,580 --> 00:12:37,090 hundred will become two. 149 00:12:37,610 --> 00:12:39,110 So there are ways 150 00:12:40,270 --> 00:12:47,050 to remove outliers without actually removing them, just by transforming the variable, by taking the log or 151 00:12:47,210 --> 00:12:51,730 by taking the exponential or the square root of those values. 152 00:12:52,790 --> 00:13:00,320 Since we are getting this kind of relationship here between our x and y, we first want to transform 153 00:13:00,350 --> 00:13:01,640 our variable crime rate. 154 00:13:02,060 --> 00:13:07,260 And then after transforming, we'll look at whether the outliers are present or not. 155 00:13:09,650 --> 00:13:10,170 Therefore, 156 00:13:10,680 --> 00:13:13,050 we will not treat outliers here. 157 00:13:13,560 --> 00:13:19,860 First, we'll transform this variable, and after that, we'll look out for outliers. 158 00:13:21,510 --> 00:13:25,590 We will look at variable transformation while creating dummy variables. 159 00:13:27,190 --> 00:13:35,530 So now, since we have treated our outliers, let's take a look at our EDD once more. We'll write df 160 00:13:36,730 --> 00:13:37,570 dot describe. 161 00:13:44,760 --> 00:13:47,860 Let's look at n_hot_rooms and rainfall. 162 00:13:52,240 --> 00:13:55,090 As you can see, now the maximum value is 46. 163 00:13:56,790 --> 00:14:00,630 And the mean and median values are a lot closer than before. 164 00:14:01,950 --> 00:14:06,490 Similarly for rainfall, the lower value is now six instead of three. 165 00:14:07,560 --> 00:14:10,920 And again, the median value is closer to the mean value. 166 00:14:13,760 --> 00:14:17,330 That's all for outlier treatment and identification in Python.
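For reference, the lower-side treatment of rainfall and the base-10 log example from this lecture can be sketched along these lines. Again this is only a sketch under assumptions: the rainfall values are made up, and the names `rainfall`, `lv` and `floor` are illustrative. As in any pandas code of this kind, `df.loc` is used here in place of the attribute-style assignment spoken in the video.

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall data (made up): 199 ordinary values between 20 and 60
# plus a single low value of 3, similar to the outlier found in the lecture.
df = pd.DataFrame({"rainfall": np.append(np.linspace(20, 60, 199), 3.0)})

# 1st percentile; [1] returns a one-element array, [0] extracts the number.
lv = np.percentile(df.rainfall, [1])[0]

# Flooring: anything below 0.3 * lv is raised to 0.3 * lv.
floor = 0.3 * lv
df.loc[df.rainfall < floor, "rainfall"] = floor

# Log example: on a base-10 log scale, 1, 10 and 100 become evenly spaced,
# so 100 no longer stands far apart from the other two values.
logged = np.log10(np.array([1.0, 10.0, 100.0]))

print(lv, floor, df.rainfall.min())
print(logged)
print(df.describe())
```

Calling `df.describe()` afterwards is a quick way to confirm that the minimum has moved up to the floor, in the same way the lecture rechecks the summary once the treatment is done.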