1 00:00:01,990 --> 00:00:06,100 Let's look at an important concept: outliers.
2 00:00:07,700 --> 00:00:18,620 Outliers are basically extreme points in our dataset, and they impact the accuracy of our models quite
3 00:00:18,620 --> 00:00:19,190 adversely.
4 00:00:20,240 --> 00:00:29,240 You will remember this graph from the central tendency and dispersion session. I used it to explain
5 00:00:29,240 --> 00:00:31,020 the concept of standard deviation.
6 00:00:31,590 --> 00:00:35,510 I'm now going to use the same graph to explain outliers.
7 00:00:36,230 --> 00:00:38,630 So what are the outliers in this example?
8 00:00:39,910 --> 00:00:47,650 If you look closely, the outliers are these points: 69, 71 and 96.
9 00:00:49,040 --> 00:00:53,510 So the question is, can we have outliers in our dataset?
10 00:00:54,550 --> 00:00:55,420 You can,
11 00:00:56,290 --> 00:01:00,340 provided the number of such outliers is negligible.
12 00:01:02,570 --> 00:01:09,150 But a better way to handle outliers is to remove them from your dataset.
13 00:01:09,680 --> 00:01:11,300 Why would you want to keep outliers?
14 00:01:12,050 --> 00:01:16,130 Outliers can lead to low accuracy.
15 00:01:17,860 --> 00:01:24,100 Unwanted inferences can result if outliers are present in our dataset.
16 00:01:24,700 --> 00:01:31,150 OK, we're going to use a box plot to understand outliers.
17 00:01:31,720 --> 00:01:33,760 Your box plot will look like this.
18 00:01:34,210 --> 00:01:38,050 The outliers will be shown as a dot or a star mark.
19 00:01:38,320 --> 00:01:42,550 OK, and the limits for the box plot are
20 00:01:44,530 --> 00:01:51,550 Q3 + 1.5 × (Q3 − Q1) and Q1 − 1.5 × (Q3 − Q1).
21 00:01:51,550 --> 00:01:59,790 This Q3 − Q1 is also known as the interquartile range.
22 00:02:00,670 --> 00:02:05,560 OK, and what are Q1 and Q3? They are percentiles.
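The box plot limits described above can be sketched in a few lines of Python. This is only an illustration: the function name `box_plot_limits` and the sample values are made up, and the quartile method ("inclusive") is an assumption, so other tools may report slightly different limits for the same data.

```python
import statistics

def box_plot_limits(values):
    """Return (lower, upper) limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # Quartiles via the standard library; "inclusive" is an assumed choice.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up sample values, for illustration only (not the data from the graph).
sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 60]
lower, upper = box_plot_limits(sample)

# Any point outside the limits is treated as an outlier.
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)
```

Points that fall below the lower limit or above the upper limit are the ones a box plot draws as separate marks.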
23 00:02:06,310 --> 00:02:10,600 I'm sure you will remember percentiles from the previous session.
24 00:02:11,440 --> 00:02:17,980 I'm going to use Excel to demonstrate the concept of outliers and box plots.
25 00:02:19,910 --> 00:02:25,760 You can use Python or any other programming language to do the same.
26 00:02:26,390 --> 00:02:28,760 OK, now let's see
27 00:02:29,810 --> 00:02:31,150 a box plot in action.
28 00:02:31,160 --> 00:02:34,330 I have the invoice amounts here. For these,
29 00:02:34,340 --> 00:02:36,740 I'm going to create a box plot.
30 00:02:36,830 --> 00:02:39,050 OK, how many data points are there?
31 00:02:39,530 --> 00:02:46,490 There are 156 data points in our list. Let me come to Insert,
32 00:02:47,760 --> 00:02:49,410 click Recommended Charts,
33 00:02:51,100 --> 00:02:58,840 come to All Charts, and click Box and Whisker. OK, so this is your box plot.
34 00:03:00,340 --> 00:03:05,880 So as you can see here, the outliers are clearly shown as dots, right?
35 00:03:06,760 --> 00:03:08,980 It is also telling us what those values are.
36 00:03:10,720 --> 00:03:11,670 So we can remove them.
37 00:03:12,710 --> 00:03:15,200 Remove the outliers identified by the box plot.
38 00:03:16,550 --> 00:03:20,210 OK, and as I mentioned earlier, you can use
39 00:03:21,360 --> 00:03:28,480 Python or any other programming language to create a box plot and to remove the outliers as well.
40 00:03:29,290 --> 00:03:31,500 OK, you can do that programmatically.
41 00:03:33,150 --> 00:03:33,520 Right.
42 00:03:33,930 --> 00:03:41,440 So, again, a very powerful concept. As I said, outliers come in the way of our accuracy,
43 00:03:41,460 --> 00:03:44,820 so let's remove them to the extent possible.
44 00:03:45,870 --> 00:03:46,290 OK.
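As a rough sketch of doing the removal programmatically, the Python below splits a list into kept values and removed outliers using the same 1.5 × IQR rule. The helper name `split_outliers`, the parameter `k`, and the invoice amounts are hypothetical, and the "inclusive" quartile method is an assumption that may not match what Excel uses.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (kept, removed) using the k * IQR box plot rule."""
    # Assumed quartile method; other tools may compute quartiles differently.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    removed = [v for v in values if v < lower or v > upper]
    return kept, removed

# Hypothetical invoice amounts, not the 156 values from the session.
invoice_amounts = [120, 135, 150, 142, 128, 131, 900, 138, 145, 5, 133, 140]
kept, removed = split_outliers(invoice_amounts)
print("removed:", removed)
print("kept:", kept)
```

For the chart itself, plotting libraries such as matplotlib provide a `boxplot` function; it is left out here so the sketch runs with the standard library alone.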