1 00:00:00,450 --> 00:00:02,970 Before going deeper into this session, 2 00:00:03,180 --> 00:00:08,250 let's have a quick recap of what we have done in all our previous sessions. 3 00:00:08,460 --> 00:00:15,240 We have basically performed lots of analysis on the data to extract some meaningful 4 00:00:15,240 --> 00:00:20,690 insights from it, because we have to present them to our clients, to our stakeholders. 5 00:00:20,970 --> 00:00:24,270 These are exactly your insights from data. 6 00:00:24,510 --> 00:00:28,530 And we have performed several feature encoding techniques on our data. 7 00:00:28,560 --> 00:00:33,510 So in this session, what we have to do is basically deal with the outliers. 8 00:00:33,690 --> 00:00:40,440 The very first problem statement for this session is to detect the outliers and to impute 9 00:00:40,440 --> 00:00:43,990 them if any are present in the data. 10 00:00:44,250 --> 00:00:49,550 Outliers will definitely show up in, you know, a numerical column. 11 00:00:49,740 --> 00:00:54,490 So I'm going to check whether I have some outliers in my price column or not. 12 00:00:54,780 --> 00:01:01,950 The two main plots, the two main visuals that will be very handy whenever you have to deal 13 00:01:01,950 --> 00:01:07,590 with outliers are the distribution plot and the box plot. 14 00:01:07,890 --> 00:01:10,920 So what I am going to do, I'm just going to define a function over here. 15 00:01:10,930 --> 00:01:13,860 Let's say its name is plot, and what will it receive? 16 00:01:13,860 --> 00:01:19,790 First the data frame, and second the column on which you have to perform this operation. 17 00:01:20,130 --> 00:01:25,230 And as the very first operation, I need my distribution plot. 18 00:01:25,380 --> 00:01:30,360 Here I just need a distribution of this df[col].
19 00:01:30,360 --> 00:01:34,290 Similarly, I'm going to call the box plot, and here as well 20 00:01:34,290 --> 00:01:41,340 I just need a box plot of this one. Let's say on the very first axis 21 00:01:41,340 --> 00:01:45,980 I need the distribution, and on the second axis I need the box 22 00:01:45,990 --> 00:01:48,450 plot. So for this, what can we do? 23 00:01:48,450 --> 00:01:54,990 We can create a subplot for this, and we say plt.subplots. 24 00:01:54,990 --> 00:02:00,930 Here, if you press Shift + Tab, you can check the documentation and see what 25 00:02:00,930 --> 00:02:04,060 parameters this function accepts. 26 00:02:04,080 --> 00:02:06,960 You will see number of rows, number of columns. 27 00:02:07,140 --> 00:02:11,930 So let's say I need a layout of two by one. 28 00:02:11,940 --> 00:02:19,800 I can say here I have two rows and one column, and it will return me one figure. 29 00:02:20,040 --> 00:02:24,720 And inside that one figure, I have two axes. 30 00:02:24,930 --> 00:02:31,220 Here I'm going to say this is my ax1, and I have ax2 in my figure. 31 00:02:31,500 --> 00:02:39,960 So on this axis one, I'm going to say I have to represent my distribution, and on 32 00:02:39,960 --> 00:02:43,130 axis two I have to represent my box plot. 33 00:02:43,230 --> 00:02:44,010 That's it. 34 00:02:44,020 --> 00:02:47,670 So if I'm going to execute it, I have to execute all of this. 35 00:02:47,670 --> 00:02:50,100 And after that, what do we have to do here? 36 00:02:50,100 --> 00:02:56,160 First, I have to pass the data frame name and the column on which I have to perform this 37 00:02:56,160 --> 00:02:56,610 operation. 38 00:02:56,610 --> 00:03:03,540 Just execute it, and it will return this beautiful distribution plot as well as this beautiful box plot.
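The plot helper described above could be sketched roughly as follows. The transcript does not name the exact plotting calls, so this version assumes plain Matplotlib (`hist` and `boxplot`) rather than whatever library the session used; the demo data frame and its `Price` values are made up for illustration and are not the session's dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot(df, col):
    """Draw a histogram (top) and a box plot (bottom) for one column."""
    # Two rows, one column: a single figure holding two stacked axes.
    fig, (ax1, ax2) = plt.subplots(2, 1)
    values = df[col].dropna().to_numpy()
    ax1.hist(values, bins=30)   # distribution of the column
    ax2.boxplot(values)         # points drawn beyond the whiskers are outlier candidates
    ax1.set_title(f"Distribution of {col}")
    ax2.set_title(f"Box plot of {col}")
    fig.tight_layout()
    return fig, (ax1, ax2)


# Hypothetical, right-skewed demo data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Price": rng.lognormal(mean=9, sigma=0.5, size=500)})
fig, (ax1, ax2) = plot(demo, "Price")
```

Returning the figure and axes is a convenience added here so the result can be inspected or saved; in a notebook the figure is simply displayed when the cell runs.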
39 00:03:03,540 --> 00:03:05,940 You will see here in this distribution 40 00:03:06,150 --> 00:03:11,880 you have some outliers, because it is a positively skewed distribution. 41 00:03:11,880 --> 00:03:18,480 Similarly, in this box plot, you will see these data points 42 00:03:18,480 --> 00:03:19,440 over here. 43 00:03:19,740 --> 00:03:22,470 These are exactly your outliers. 44 00:03:22,710 --> 00:03:30,960 So let's say, after having a conversation with my domain expert, he or she said, yes, whatever data 45 00:03:30,960 --> 00:03:34,990 point is greater than forty thousand is definitely an outlier. 46 00:03:35,010 --> 00:03:39,030 So whatever data points I have greater than forty thousand, 47 00:03:39,240 --> 00:03:46,080 I'm just going to replace them with the median, because whenever you have outliers, the median comes in very handy 48 00:03:46,270 --> 00:03:46,860 over there. 49 00:03:47,160 --> 00:03:52,380 For this, I'm going to use a very handy function of NumPy, and here I'm 50 00:03:52,380 --> 00:03:58,170 going to say np.where. If you press Shift + Tab, the very first argument is the 51 00:03:58,170 --> 00:04:02,960 condition, and on the basis of that condition we have to perform some operation. 52 00:04:03,180 --> 00:04:07,970 So here I'm going to state my condition. 53 00:04:07,980 --> 00:04:09,510 I'm going to mention my condition. 54 00:04:09,520 --> 00:04:13,710 Whatever my condition is, on the basis of that condition it will take some action. 55 00:04:14,010 --> 00:04:21,870 So my condition is the data frame's price greater than or equal to forty thousand. 56 00:04:21,880 --> 00:04:28,740 So wherever this price is forty thousand or more, in such a case 57 00:04:28,740 --> 00:04:31,290 I have to replace it with the median.
58 00:04:31,500 --> 00:04:40,500 So I'm going to say data_train of price, and on this I have to compute the median. 59 00:04:40,620 --> 00:04:48,210 So I'm just going to compute the median, and wherever this condition is not satisfied, meaning wherever my 60 00:04:48,210 --> 00:04:51,330 price is less than forty thousand, 61 00:04:51,630 --> 00:04:55,960 in such a case we simply have to skip it. 62 00:04:56,040 --> 00:04:59,820 So wherever it is less than forty thousand, I don't have 63 00:04:59,850 --> 00:05:02,340 to perform any operation on the data. 64 00:05:02,400 --> 00:05:05,670 So here I am going to say data_train of price. 65 00:05:05,740 --> 00:05:08,560 It means I have to keep it as it was. 66 00:05:08,880 --> 00:05:14,350 After that, I have to update this price column as well. 67 00:05:14,490 --> 00:05:17,710 So I have to store the result back here. 68 00:05:17,910 --> 00:05:20,850 This way I can store it; just execute it. 69 00:05:21,000 --> 00:05:28,070 And now I'm just going to copy this plot call again and paste it over here. 70 00:05:28,470 --> 00:05:33,080 Now you'll see here that your distribution also gets changed. 71 00:05:33,270 --> 00:05:37,910 And this box plot also gets changed. You still have some outliers, 72 00:05:37,920 --> 00:05:38,430 you will see. 73 00:05:38,640 --> 00:05:39,800 But that's okay, 74 00:05:39,810 --> 00:05:46,890 because you will see here that you don't have outliers as extreme 75 00:05:46,920 --> 00:05:48,310 as you had earlier. 76 00:05:48,600 --> 00:05:53,490 So it means that, to a great extent, your data is more or less ready. 77 00:05:53,700 --> 00:06:00,540 So let's go ahead with our next problem statement, in which we have to separate our independent features 78 00:06:00,690 --> 00:06:03,020 from the dependent feature.
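The median replacement described above can be sketched with `np.where`, which takes a condition, the value to use where the condition holds, and the value to use elsewhere. The forty-thousand cutoff comes from the session; the small data frame, the capitalized column name `Price`, and the helper names `threshold` and `median_price` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the session's data frame.
data_train = pd.DataFrame({"Price": [3500, 7200, 12000, 41000, 79512, 5600]})

threshold = 40000  # cutoff suggested by the domain expert in the session
median_price = data_train["Price"].median()  # computed before any replacement

# Where the price is at or above the cutoff, use the median;
# everywhere else, keep the original value.
data_train["Price"] = np.where(
    data_train["Price"] >= threshold,
    median_price,
    data_train["Price"],
)
```

The median is computed once, up front, so the replaced values do not feed back into it. Because the result is assigned back to the column, the data frame itself is updated, which is why re-running the plot afterwards shows a changed distribution.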
79 00:06:03,210 --> 00:06:05,130 So for this, what are we going to do? 80 00:06:05,130 --> 00:06:08,840 I'm just going to separate my data. 81 00:06:08,850 --> 00:06:15,870 In the X variable, I'm going to keep all the independent features, and in the y variable I'm going to 82 00:06:15,870 --> 00:06:18,600 keep the dependent feature. 83 00:06:18,960 --> 00:06:27,270 So for this X, I'm going to say data_train.drop, and what do I have to 84 00:06:27,270 --> 00:06:28,500 drop? 85 00:06:28,710 --> 00:06:35,550 Simply pass price, because price is my dependent feature; this is the feature that we have 86 00:06:35,550 --> 00:06:36,150 to study. 87 00:06:36,510 --> 00:06:40,500 So I have to exclude this feature 88 00:06:40,650 --> 00:06:47,560 using drop, and since I'm not going to mention the inplace parameter, it means the column is not 89 00:06:47,560 --> 00:06:48,920 dropped from the actual data frame. 90 00:06:49,320 --> 00:06:57,210 If on this X I call head, you will see over here that all of this gets executed, 91 00:06:57,210 --> 00:07:00,660 and in this X you don't have a price column. 92 00:07:00,840 --> 00:07:07,050 And on this X, if I call shape to see the shape of this, 93 00:07:07,260 --> 00:07:11,550 you will see it has just thirty-four columns. 94 00:07:11,550 --> 00:07:14,590 And similarly, on the original 95 00:07:14,700 --> 00:07:15,320 data frame, 96 00:07:15,540 --> 00:07:21,360 if I call shape, you will see it has thirty-five features, while X has 97 00:07:21,360 --> 00:07:25,090 thirty-four features, because it doesn't have your price feature. 98 00:07:25,110 --> 00:07:30,570 So in this y variable, I have to just access my price. 99 00:07:30,690 --> 00:07:33,230 So I'm going to say just access this price.
100 00:07:33,240 --> 00:07:36,180 And if you want to get a rough idea, you can print it. 101 00:07:36,420 --> 00:07:39,990 You will see in this y variable, this is the index, 102 00:07:40,200 --> 00:07:44,090 and with respect to each index, you have a price entry. 103 00:07:44,100 --> 00:07:48,840 So that's all about the second problem statement of this session. 104 00:07:48,840 --> 00:07:52,650 And I think he's going to get the same.
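The feature and target split described above can be sketched as follows. The three-column data frame and its column names (`Total_Stops`, `Duration_hours`, `Price`) are hypothetical placeholders; the session's data frame has thirty-five columns. Passing `axis=1` is one way to tell `drop` that a column is meant; the transcript does not say which form was used.

```python
import pandas as pd

# Hypothetical stand-in for the session's data frame.
data_train = pd.DataFrame({
    "Total_Stops": [0, 2, 1],
    "Duration_hours": [2, 7, 5],
    "Price": [3897, 7662, 13882],
})

# Without inplace=True, drop returns a new data frame and leaves
# data_train untouched, so X loses the column but data_train keeps it.
X = data_train.drop("Price", axis=1)

# The dependent feature is simply the Price column (a Series with an index).
y = data_train["Price"]

print(X.shape, data_train.shape)
```

Here `X` has one column fewer than `data_train`, mirroring the thirty-four versus thirty-five columns seen in the session.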