1 00:00:00,600 --> 00:00:06,120 So let us see how to do univariate analysis and what we will do. 2 00:00:06,150 --> 00:00:13,140 First is the EDD, that is, we will be finding out the extended data dictionary. By looking at the EDD, 3 00:00:13,230 --> 00:00:19,230 if we have doubt on the distribution of some of the variables, we'll be plotting histograms or box plots 4 00:00:19,830 --> 00:00:22,680 to find out whether there is skewness or outliers. 5 00:00:24,120 --> 00:00:30,210 And lastly, we'll be looking at the categorical variables, making their bar plots and looking at their 6 00:00:30,210 --> 00:00:30,810 distribution. 7 00:00:33,840 --> 00:00:40,680 When we have looked at the EDD, histograms, box plots and bar plots, we will summarize our observations 8 00:00:41,040 --> 00:00:41,550 in the end. 9 00:00:43,120 --> 00:00:51,060 So let us look at the EDD first. It can be found out by writing summary, and within brackets, df. 10 00:00:56,680 --> 00:00:59,190 So when we press Ctrl+Enter, this command is run. 11 00:00:59,990 --> 00:01:06,160 And we get the EDD below. Here you can see that for each variable, 12 00:01:06,790 --> 00:01:08,620 I'm getting some information. 13 00:01:09,220 --> 00:01:13,570 The information includes the minimum value, maximum value, 14 00:01:14,770 --> 00:01:20,910 mean value, and the quartiles, that is first quartile, second quartile and third quartile. 15 00:01:21,010 --> 00:01:27,860 The second quartile is also the median value, since the second quartile represents the 16 00:01:28,030 --> 00:01:32,680 50th percentile value, which is also the middle value, that is, the median. 17 00:01:34,260 --> 00:01:40,660 Now, one of the things we have to do when we have the EDD is compare the median and mean values. 18 00:01:41,590 --> 00:01:48,880 So whichever variable has skewness or outliers will have a large difference between its median and mean values.
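The mean-versus-median check described above can be sketched in R as follows. This is a minimal illustration, not the video's own script: the data frame here is a made-up stand-in for `df`, and the column names `price` and `n_hot_rooms` are assumed from the audio.

```r
# Toy stand-in for the course data frame `df`; all values are invented.
df <- data.frame(
  price       = c(20, 22, 24, 21, 23, 25),
  n_hot_rooms = c(10.1, 10.8, 11.2, 12.5, 14.0, 101.0)
)

# The EDD: minimum, quartiles, median, mean and maximum for each column.
print(summary(df))

# Difference between mean and median for every numeric column.
# A large gap hints at skewness or outliers in that column.
num_cols <- sapply(df, is.numeric)
gap <- sapply(df[num_cols], function(x) {
  mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)
})
print(gap)
```

In this toy data the gap for `price` is zero, while the single extreme value in `n_hot_rooms` pulls its mean well above its median.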
19 00:01:50,350 --> 00:01:56,230 If we look at the mean and median values of price, they look approximately in the same range. 20 00:01:58,530 --> 00:02:04,570 Another indicator of skewness or outliers is the distribution in different quartiles. 21 00:02:05,080 --> 00:02:13,270 So if we look at this particular variable called n_hot_rooms, in the first quartile, which is 22 00:02:13,270 --> 00:02:21,730 from the minimum to the first quartile value, 10.06 to 11.19, in this small range 23 00:02:21,730 --> 00:02:23,470 of 1.13, 24 00:02:23,890 --> 00:02:25,930 I have 25 percent of the values. 25 00:02:28,820 --> 00:02:36,320 But if I look at the third quartile and the maximum value, from 14 to 101, 25 percent of the values 26 00:02:36,600 --> 00:02:40,820 are having a range from 14 to 101, which is a huge range. 27 00:02:42,560 --> 00:02:49,760 So it is clear to me that there is skewness or some outliers present in the last quartile, which is 28 00:02:49,820 --> 00:02:50,960 giving us this result. 29 00:02:51,920 --> 00:02:58,910 Similarly, if we look at the rainfall variable, the first quartile is from 3 to 28, 30 00:02:59,870 --> 00:03:02,600 whereas the last quartile is from 50 to 60. 31 00:03:03,770 --> 00:03:11,900 So it is suggesting that probably in the first quartile we either have outliers or the distribution is 32 00:03:11,900 --> 00:03:12,290 skewed. 33 00:03:13,820 --> 00:03:20,450 So by looking at the distribution of quartiles, we can estimate whether there are outliers or skewness 34 00:03:20,910 --> 00:03:21,710 in the variables or not. 35 00:03:22,820 --> 00:03:24,620 You can go through each of these variables. 36 00:03:25,430 --> 00:03:31,370 I have identified these two variables, and we will be plotting box plots to identify whether there is skewness 37 00:03:31,520 --> 00:03:32,780 or outliers.
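The quartile comparison described above (width of the first quarter of the data versus width of the last quarter) can be sketched as follows. The vector `x` is invented for illustration and only loosely mimics the n_hot_rooms values read out in the video.

```r
# Invented values: most lie between 10 and 15, two are far larger.
x <- c(10.1, 10.5, 10.8, 11.0, 11.2, 12.0, 12.5, 13.0, 14.0, 15.0, 81.0, 101.0)

# Minimum, first quartile, median, third quartile, maximum.
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)

# Each of these ranges holds roughly 25 percent of the values.
first_quarter_width <- q[2] - q[1]  # minimum to first quartile
last_quarter_width  <- q[5] - q[4]  # third quartile to maximum

print(c(first = first_quarter_width, last = last_quarter_width))
```

When one of the two widths is many times the other, as here, the wide end is where skewness or outliers are likely to sit.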
38 00:03:34,640 --> 00:03:43,460 The second thing we should note is the presence of NAs. In all of these variables, almost the same features 39 00:03:43,460 --> 00:03:43,910 are present. 40 00:03:44,150 --> 00:03:52,160 But in n_hos_beds you can see that there is an added value of NA's. NAs are representing missing values 41 00:03:52,160 --> 00:03:52,690 in the data. 42 00:03:53,600 --> 00:04:00,380 So when we imported our data, whenever there was a blank space, R automatically converted it into NA, 43 00:04:00,830 --> 00:04:02,810 and the count of NAs is given here. 44 00:04:03,830 --> 00:04:07,600 So in the n_hos_beds variable, we do not have eight values. 45 00:04:09,170 --> 00:04:11,720 So we need to handle these eight missing values, 46 00:04:11,870 --> 00:04:16,850 since analysis cannot be done if the dataset has missing values. 47 00:04:18,770 --> 00:04:24,740 The third thing to be noticed in the EDD is the distribution of categorical variables. 48 00:04:25,430 --> 00:04:30,110 As you know, we have three categorical variables: airport, waterbody and bus terminal. 49 00:04:31,040 --> 00:04:36,790 If you look at airport, we have 227 NO and 279 YES. 50 00:04:37,880 --> 00:04:40,190 So there is nothing suspicious about this variable. 51 00:04:41,360 --> 00:04:46,400 Similarly, if I look at the distribution of the waterbody variable, it is also not suspicious. 52 00:04:47,090 --> 00:04:51,290 But if you look at bus terminal, clearly there is something wrong with this variable. 53 00:04:51,620 --> 00:04:55,010 We have all the values as YES; there is no other category 54 00:04:55,190 --> 00:05:00,410 in this bus terminal variable. We will plot these into a bar plot 55 00:05:00,410 --> 00:05:06,200 also, just to visualize how the distribution is for these variables. 56 00:05:07,220 --> 00:05:13,460 If we have a lot of categories, using visual cues can help identify any problems in the categorical 57 00:05:13,460 --> 00:05:13,970 variables.
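The two checks described above, counting missing values and looking at category counts, can also be done directly rather than read off the summary. The sketch below uses an invented data frame; the column names `n_hos_beds`, `airport` and `bus_ter` are assumptions based on the audio.

```r
# Toy stand-in for `df`; values are made up for illustration.
df <- data.frame(
  n_hos_beds = c(5.5, NA, 7.3, 6.1, NA, 8.0),
  airport    = factor(c("YES", "NO", "YES", "NO", "YES", "YES")),
  bus_ter    = factor(rep("YES", 6))
)

# Number of NA entries in each column.
na_counts <- colSums(is.na(df))
print(na_counts)

# Count of rows in each category of one categorical variable.
print(table(df$airport))

# Categorical columns that contain only a single category.
is_cat <- sapply(df, is.factor)
n_levels <- sapply(df[is_cat], nlevels)
single_level <- names(n_levels)[n_levels == 1]
print(single_level)
```

A column that turns up in `single_level` carries no variation, which is the situation described for the bus terminal variable.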
58 00:05:15,590 --> 00:05:22,370 So let's create box plots for the two variables for which we suspect that there are outliers or skewness 59 00:05:22,370 --> 00:05:22,780 present. 60 00:05:29,000 --> 00:05:36,630 So to create a box plot, we need to write a single line of code, which is boxplot, and within brackets 61 00:05:37,530 --> 00:05:37,940 we'll write 62 00:05:38,120 --> 00:05:38,940 df$ and 63 00:05:40,840 --> 00:05:44,970 the variable for which we want to create the box plot, which is 64 00:05:45,030 --> 00:05:45,840 n_hot_rooms. 65 00:05:52,050 --> 00:05:53,880 Let us zoom in on this graph. 66 00:05:57,050 --> 00:05:59,480 So you can see that it has several parts. 67 00:06:01,260 --> 00:06:02,450 So it has a box. 68 00:06:03,080 --> 00:06:08,450 This is a box and it has two lines, one above and one below. 69 00:06:10,400 --> 00:06:17,420 And there is one dark line in the middle. The dark line in the middle is representing the median value. 70 00:06:19,070 --> 00:06:22,880 This line of the box is for the first quartile. 71 00:06:23,540 --> 00:06:25,380 This line is for the third quartile. 72 00:06:25,820 --> 00:06:30,710 So you can notice that a lot of values are concentrated in this small region only. 73 00:06:31,340 --> 00:06:35,600 Only these two points are extremely far away from it. 74 00:06:36,390 --> 00:06:43,640 And we can easily see that these are outliers or outlying values which we need to rectify or modify 75 00:06:43,650 --> 00:06:54,380 so that these do not impact our analysis. Now, to see the outliers in the rainfall variable, we look 76 00:06:54,380 --> 00:06:54,530 at 77 00:06:54,590 --> 00:07:01,760 another method which is usually used in regression analysis. We'll create a scatter plot of the rainfall 78 00:07:01,760 --> 00:07:14,600 variable and the response variable, which is Sold. To create a scatter plot, we write pairs, and within brackets 79 00:07:14,920 --> 00:07:15,300 tilde. 80 00:07:17,000 --> 00:07:17,630 This is the tilde sign.
81 00:07:17,640 --> 00:07:20,110 It is above the Tab key. 82 00:07:22,300 --> 00:07:22,670 Now 83 00:07:22,950 --> 00:07:23,260 write 84 00:07:24,340 --> 00:07:25,360 df$Sold. 85 00:07:26,640 --> 00:07:28,150 Remember, S is capital in Sold. 86 00:07:32,880 --> 00:07:36,660 Plus df$rainfall. 87 00:07:45,230 --> 00:07:52,680 So you can see that most of the points are in this group, 20 to 60 only. 88 00:07:53,970 --> 00:07:56,020 Only one point is near zero, 89 00:07:56,430 --> 00:07:59,980 and it is very far away from all the other points. 90 00:08:00,360 --> 00:08:03,900 That is why we can also classify this point as an outlier, 91 00:08:04,430 --> 00:08:06,180 and we'll be handling this point 92 00:08:06,300 --> 00:08:06,720 also. 93 00:08:09,630 --> 00:08:13,360 And now we'll be drawing bar plots of the categorical variables. 94 00:08:16,060 --> 00:08:18,760 To see the distribution of a categorical variable, 95 00:08:18,880 --> 00:08:28,540 we can write barplot, and within brackets we'll write table, and 96 00:08:32,090 --> 00:08:33,790 within brackets, df$airport. 97 00:08:47,870 --> 00:08:52,180 So you can see this is the bar plot of the airport variable. 98 00:08:53,600 --> 00:08:56,780 It has two categories, 99 00:08:56,900 --> 00:08:57,570 YES and NO, 100 00:08:57,920 --> 00:09:03,110 and the height of these bars is giving us the number of values in these two categories. 101 00:09:04,230 --> 00:09:07,760 We can similarly create bar plots for other variables also. 102 00:09:09,350 --> 00:09:11,540 So let's create the bar plot for bus terminal. 103 00:09:19,180 --> 00:09:28,930 We already saw in the EDD that it has only one category, which we are seeing in the bar plot. Now, the point 104 00:09:28,930 --> 00:09:30,400 of drawing a bar plot is this.
105 00:09:30,610 --> 00:09:38,080 Usually, when we have a lot of categories, if we draw a bar plot, we can easily identify such categories 106 00:09:38,230 --> 00:09:41,050 which may not be very useful for the analysis purpose. 107 00:09:42,070 --> 00:09:48,640 We can identify such categories and club them or aggregate them with other categories for better analysis. 108 00:09:50,200 --> 00:09:55,690 So here it is clear that this bus terminal variable has only one category. 109 00:09:57,070 --> 00:10:00,880 Having one category makes this variable useless. 110 00:10:01,360 --> 00:10:08,010 That is, since it does not have any other value, we cannot determine its impact on the response variable, 111 00:10:08,150 --> 00:10:13,960 and therefore we do not need to keep this bus terminal variable in our analysis. 112 00:10:14,650 --> 00:10:18,310 So now let us summarize these observations from this univariate analysis. 113 00:10:19,900 --> 00:10:27,670 The first observation is we have identified two variables which have outliers. 114 00:10:28,140 --> 00:10:33,550 So n_hot_rooms and rainfall have outliers. 115 00:10:41,940 --> 00:10:48,850 Our second observation is, n_hos_beds has missing values. 116 00:10:58,550 --> 00:11:07,760 And the third observation is that there is a categorical variable called bus terminal, which is useless. 117 00:11:08,970 --> 00:11:10,760 bus_ter is useless. 118 00:11:18,320 --> 00:11:23,960 So we need to take action on these three observations, which we will see in the coming videos.
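The three plotting commands dictated in this video (boxplot, pairs and barplot) can be collected into one runnable sketch. The data frame below is invented so the code runs on its own, and the column names `n_hot_rooms`, `rainfall`, `Sold`, `airport` and `bus_ter` are assumptions reconstructed from the audio. In RStudio you would leave out the `pdf(NULL)` line so the plots appear in the Plots pane.

```r
# Toy stand-in for `df`; all values are made up for illustration.
df <- data.frame(
  n_hot_rooms = c(10.1, 10.8, 11.2, 12.5, 14.0, 11.5, 12.0, 13.1, 81.0, 101.0),
  rainfall    = c(25, 30, 42, 38, 55, 60, 3, 47, 33, 51),
  Sold        = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0),
  airport     = factor(c("YES", "NO", "YES", "NO", "YES",
                         "YES", "NO", "YES", "NO", "YES")),
  bus_ter     = factor(rep("YES", 10))
)

# Draw to a null device so the script also runs without a display.
pdf(NULL)

# Box plot of one numeric variable. The returned list holds the points
# drawn beyond the whiskers in its `out` element.
b <- boxplot(df$n_hot_rooms)
print(b$out)

# Scatter plot of the response variable against rainfall.
pairs(~ df$Sold + df$rainfall)

# Bar plots of category counts for two categorical variables.
airport_counts <- table(df$airport)
bus_counts <- table(df$bus_ter)
barplot(airport_counts)
barplot(bus_counts)

dev.off()
```

Here `b$out` lists the two extreme n_hot_rooms values, and the bar plot of `bus_counts` shows a single bar, matching the first and third observations in the summary.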