Hello all, before diving deep into this session, let's have a quick recap of what we have done so far in this project. From the start, we have done a lot of analysis on the data, some statistical analysis and some amazing analysis where we found certain trends in our data. From the machine learning aspect, we found which are exactly my most important features, and we established this using the correlation concept. After that, we used the derived-features concept, where the year, month, and day play a very important role at the time of model creation. So that's what we have done. In the previous session, we performed feature encoding on our data. In this session, this is exactly the assignment we have to cover: our first problem statement is that we have to handle outliers. What exactly are outliers? Let's say there are some data points that are too far away from the normal ones; those are exactly my outliers. So let me explain it in very layman's terms. Let's say you have data on a hundred persons whose ages are between, let's say, one and a hundred, but one person has an age of, let's say, seven hundred years.
So that person will be considered my outlier. Whenever I build a model, that model will be impacted badly by this outlier, so we have to handle this outlier condition in our data as well. Let me show you a trick for how you can find outliers in your data. Whenever you have to find whether there are outliers or not, there is a basic approach you can just follow: it is exactly your distribution. So let me first see what exactly our distribution is over here. Let me very first give you a quick overview of what exactly my data is; I'm just going to call head over there. If I execute it, all the stuff gets executed over here for you. Here you have multiple numerical features; let's say I'm going to talk about this lead_time feature. So let's say I'm just going to plot the distribution of this feature, so I'm going to say sns and then distplot, and very first I have to access my data, so I'm going to say data frame of this feature. Just execute this cell and you will get this amazing distribution pattern here. And you will see over here that the distribution of this feature is a little bit right-skewed, and you can clearly observe this range over here: those are exactly my outliers.
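As a rough sketch of this distribution check outside the notebook (the data here is a synthetic right-skewed sample standing in for a column like lead_time, not the course dataset):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a column like lead_time
# (illustrative data only, not the course dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({"lead_time": rng.exponential(scale=100, size=1000)})

# A strongly positive skew hints at a long right tail, i.e. potential outliers.
print(df["lead_time"].skew())

# In the notebook the same shape is drawn with seaborn, e.g.:
#   import seaborn as sns
#   sns.histplot(df["lead_time"], kde=True)
```

Note that `sns.distplot`, as used in the video, has been deprecated in newer seaborn releases in favor of `histplot`/`displot`; either draws the same right-skewed shape.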
You will see that most of the data points belonging to this feature always lie within almost 200 to 300, but there is a small number of data points that sit at almost six hundred to eight hundred. So this is exactly my outlier situation, and we have to handle this situation. How do we handle it? What we are basically going to do is calculate the log of this lead_time, or you can say we are going to replace this original feature with its log, because once we take the log of this feature, its skewness will be resolved to a great extent. That's our approach. So what I'm going to do over here is define a function, let's say handle_outlier. This function will be called whenever we have to handle an outlier in my data. What this function receives, very first, is whatever column name I'm going to pass to the function; that's it. After that, I basically have to call my log on whatever feature I pass over here, and for that I have to import my NumPy module, because the log function is exactly available in NumPy. So I'm going to say import numpy as np; that's it.
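To see why the log helps, here is the arithmetic on two values from the ranges mentioned above:

```python
import numpy as np

# log1p computes log(1 + x); the 4x gap between 200 and 800
# shrinks to under 1.4 units on the log scale.
typical = np.log1p(200)   # ≈ 5.30
extreme = np.log1p(800)   # ≈ 6.69
print(extreme - typical)  # ≈ 1.38
```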
Now what we have to do is perform this log1p. If I press Shift+Tab, you will get all the documentation of this function and you will see all its parameters. So what do we have to pass here? First I have to access the frame, so I'm going to say this is exactly my data frame, and whatever column I pass over here, this will exactly convert that feature into its log. That's what I am going to do. Now we have to update this feature as well, so I'm going to assign it back; it's nothing but just this one. Now we just have to execute this. It's that simple: we have to simply call this function. I would say handle_outlier, and I simply have to just pass my lead_time feature. And let me show you one more thing. Once I execute this function, if I execute the cell over here again, and if I again call the distribution plot on this data frame, let me access my data frame very quickly. I'm going to say data frame of, let's say, lead_time, and execute the cell. You will visualize that this time you don't have that much skewness in your data; it means you don't have as many outliers as you had earlier. You will clearly observe it here. In a similar way,
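A minimal sketch of the function described above (the function and column names follow the transcript, but the sample values are illustrative, and the DataFrame is passed in explicitly rather than read from a global as in the notebook):

```python
import numpy as np
import pandas as pd

def handle_outlier(df, col):
    """Replace a right-skewed column with its log1p, in place, to damp outliers."""
    df[col] = np.log1p(df[col])

# Illustrative values: mostly moderate lead times plus one extreme outlier.
df = pd.DataFrame({"lead_time": [10, 50, 200, 300, 800]})
handle_outlier(df, "lead_time")
print(df["lead_time"].max())  # the 800 is now ≈ 6.69
```

Using `log1p` rather than plain `log` keeps the transform safe for zero values, since `log1p(0)` is 0 while `log(0)` is undefined.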
we can also handle the outlier in my price feature, which is exactly adr over here. For this, let's first just check what exactly the distribution of this adr is: I'm going to say data frame of adr and execute the cell. Now you will see it has certain kinds of outliers. You will see the maximum data point in this area is exactly somewhere close to five thousand, but the thing you have to notice is that most of the data points lie between zero and almost 200 to 300. So the data points that are approximately close to five thousand are exactly my outliers. It means, again, we have to handle this outlier situation. So for this, I'm just going to call this function and I have to just pass this feature; that's it. The outlier gets handled over here. And again, if I copy all this stuff, let me just paste it and execute it again. But before executing it, you also have to take care of whether any missing values are available in this feature or not. If there are any missing values, in such scenarios what you can do is just call dropna; that's it, that simple. Just execute the cell, and this time you will see it is very close to a normal distribution.
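The missing-value caveat above can be sketched like this (the adr column name follows the transcript; the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"adr": [50.0, 120.0, None, 4500.0]})

# Drop missing values first: np.log1p(NaN) just propagates NaN,
# which would distort the distribution plot.
adr = df["adr"].dropna()
adr_log = np.log1p(adr)
print(adr_log.max())  # the 4500 outlier is compressed to ≈ 8.41
```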
So now this is exactly the data that you really want for your machine learning purpose. That's all for the session: this is how exactly I have handled the outlier condition in my data. I hope you loved the session very much. Thank you, have a nice day. Keep learning, keep going, keep practicing.