1
00:00:00,600 --> 00:00:06,120
So let us see how to do univariate analysis and what we will do.

2
00:00:06,150 --> 00:00:13,140
First, the EDD: we will be finding out the extended data dictionary. By looking at the EDD,

3
00:00:13,230 --> 00:00:19,230
if we have doubt on the distribution of some of the variables, we'll be plotting histograms or box plots

4
00:00:19,830 --> 00:00:22,680
to find out whether there is skewness or outliers.

5
00:00:24,120 --> 00:00:30,210
And lastly, we'll be looking at the categorical variables, making their bar plots and looking at their

6
00:00:30,210 --> 00:00:30,810
distribution.

7
00:00:33,840 --> 00:00:40,680
When we have looked at the EDD, histograms, box plots and bar plots, we will summarize our observations

8
00:00:41,040 --> 00:00:41,550
in the end.

9
00:00:43,120 --> 00:00:51,060
So let us look at the EDD first. The EDD can be found out by writing summary(df).

10
00:00:56,680 --> 00:00:59,190
So when we press Ctrl+Enter, this command is run.
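A minimal sketch of this step, assuming a small made-up data frame in place of the course data (the column names and values below are illustrative assumptions, not the real dataset):
```r
# Toy stand-in for df; the course data is not reproduced here
df <- data.frame(
  n_hot_rooms = c(10.1, 10.8, 11.2, 12.5, 13.0, 14.2, 15.3, 101.1),
  n_hos_beds  = c(5.5, 7.3, NA, 9.3, 8.0, 6.1, NA, 7.9),
  airport     = factor(c("YES", "NO", "YES", "YES", "NO", "YES", "NO", "YES"))
)
# summary() reports Min, 1st Qu., Median, Mean, 3rd Qu. and Max for numeric
# columns, an NA count where values are missing, and level counts for factors
edd <- summary(df)
print(edd)
```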

11
00:00:59,990 --> 00:01:06,160
And we get the EDD below. Here you can see that for each variable

12
00:01:06,790 --> 00:01:08,620
I'm getting some information.

13
00:01:09,220 --> 00:01:13,570
The information includes the minimum value, maximum value,

14
00:01:14,770 --> 00:01:20,910
mean value and the quartiles, that is, first quartile, second quartile and third quartile.

15
00:01:21,010 --> 00:01:27,860
The second quartile is also the median value, since the second quartile represents the

16
00:01:28,030 --> 00:01:32,680
50th percentile value, which is also the middle value, the median.

17
00:01:34,260 --> 00:01:40,660
Now, one of the things we have to do when we have the EDD is compare the median and mean values.

18
00:01:41,590 --> 00:01:48,880
So whichever variable has skewness or outliers will have a large difference between its median and mean values.
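To illustrate the mean versus median check, here is a small sketch on made-up numbers (the vector below is an assumption for illustration, with one deliberately extreme value):
```r
# Seven values close together and one extreme value
x <- c(10.1, 10.8, 11.2, 12.5, 13.0, 14.2, 15.3, 101.1)
x_mean   <- mean(x)    # pulled upward by the single large value
x_median <- median(x)  # stays near the bulk of the data
gap <- x_mean - x_median
print(c(mean = x_mean, median = x_median, gap = gap))
```
A large gap like this is only a hint; a box plot or histogram is still needed to confirm whether it comes from outliers or a skewed distribution.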

19
00:01:50,350 --> 00:01:56,230
If we look at the mean and median values of price, they look approximately in the same range.

20
00:01:58,530 --> 00:02:04,570
Another indicator of skewness or outliers is the distribution in the different quartiles.

21
00:02:05,080 --> 00:02:13,270
So if we look at this particular variable called n_hot_rooms, in the first quartile, which is

22
00:02:13,270 --> 00:02:21,730
from the minimum to the first quartile value, 10.06 to 11.19, in this small range

23
00:02:21,730 --> 00:02:23,470
of 1.13,

24
00:02:23,890 --> 00:02:25,930
I have 25 percent of the values.

25
00:02:28,820 --> 00:02:36,320
But if I look at the third quartile to the maximum value, from 14 to 101, 25 percent of values

26
00:02:36,600 --> 00:02:40,820
are having a range from 14 to 101, which is a huge range.

27
00:02:42,560 --> 00:02:49,760
So it is clear to me that there is skewness or some outliers present in the last quartile, which is

28
00:02:49,820 --> 00:02:50,960
giving us this result.

29
00:02:51,920 --> 00:02:58,910
Similarly, if we look at the rainfall variable, the first quartile is from 3 to 28,

30
00:02:59,870 --> 00:03:02,600
whereas the last quartile is from 50 to 60.

31
00:03:03,770 --> 00:03:11,900
So it is suggesting that probably in the first quartile we either have outliers or the distribution is

32
00:03:11,900 --> 00:03:12,290
skewed.

33
00:03:13,820 --> 00:03:20,450
So by looking at the distribution of quartiles, we can estimate whether there are outliers or skewness

34
00:03:20,910 --> 00:03:21,710
in the variables or not.

35
00:03:22,820 --> 00:03:24,620
You can go through each of these variables.

36
00:03:25,430 --> 00:03:31,370
I have identified these two variables, and we will be plotting box plots to identify whether there is skewness

37
00:03:31,520 --> 00:03:32,780
or outliers.

38
00:03:34,640 --> 00:03:43,460
The second thing we should note is the presence of NAs. In all of these variables, almost the same features

39
00:03:43,460 --> 00:03:43,910
are present.

40
00:03:44,150 --> 00:03:52,160
But in n_hos_beds you can see that there is an added value of NA's, and NA's here mean missing values

41
00:03:52,160 --> 00:03:52,690
in the data.

42
00:03:53,600 --> 00:04:00,380
So when we imported our data, whenever there was a blank space, R automatically converted it into NA,

43
00:04:00,830 --> 00:04:02,810
and the count of NA's is given here.

44
00:04:03,830 --> 00:04:07,600
So in the n_hos_beds variable, we do not have eight values.

45
00:04:09,170 --> 00:04:11,720
So we need to handle these eight missing values,

46
00:04:11,870 --> 00:04:16,850
since analysis cannot be done if the dataset has missing values.
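As a rough sketch of how the missing values could be counted, assuming a made-up data frame (names and values are illustrative, not the course data):
```r
# Toy data with two missing entries in n_hos_beds
df <- data.frame(
  n_hos_beds = c(5.5, 7.3, NA, 9.3, 8.0, 6.1, NA, 7.9),
  rainfall   = c(3, 28, 35, 39, 42, 50, 55, 60)
)
# is.na() marks missing cells; colSums() counts them per column
na_counts <- colSums(is.na(df))
print(na_counts)
# Many base R functions return NA when the input contains NA
mean_with_na <- mean(df$n_hos_beds)
# na.rm = TRUE skips the missing entries instead
mean_without_na <- mean(df$n_hos_beds, na.rm = TRUE)
```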

47
00:04:18,770 --> 00:04:24,740
The third thing to be noticed in the EDD is the distribution of categorical variables.

48
00:04:25,430 --> 00:04:30,110
As you know, we have three categorical variables: airport, waterbody and bus terminal.

49
00:04:31,040 --> 00:04:36,790
If you look at airport, we have 227 no and 279 yes.

50
00:04:37,880 --> 00:04:40,190
So there is nothing suspicious about this variable.

51
00:04:41,360 --> 00:04:46,400
Similarly, if I look at the distribution of the waterbody variable, it is also not suspicious.

52
00:04:47,090 --> 00:04:51,290
But if you look at bus terminal, clearly there is something wrong with this variable.

53
00:04:51,620 --> 00:04:55,010
We have all the values as yes; there is no other category

54
00:04:55,190 --> 00:05:00,410
in this bus terminal variable. We will plot these into a bar plot

55
00:05:00,410 --> 00:05:06,200
also, just to visualize how the distribution is for these variables.

56
00:05:07,220 --> 00:05:13,460
If we have a lot of categories, using visual cues can help identify any problems in the categorical

57
00:05:13,460 --> 00:05:13,970
variables.

58
00:05:15,590 --> 00:05:22,370
So let's create box plots for the two variables for which we suspect that there are outliers or skewness

59
00:05:22,370 --> 00:05:22,780
present.

60
00:05:29,000 --> 00:05:36,630
So to create a box plot, we need to write a single line of code, which is boxplot, and within brackets

61
00:05:37,530 --> 00:05:37,940
we write

62
00:05:38,120 --> 00:05:38,940
df$

63
00:05:40,840 --> 00:05:44,970
and the variable for which we want to create the box plot, which is

64
00:05:45,030 --> 00:05:45,840
n_hot_rooms.
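A minimal sketch of the box plot step on made-up values (the vector is an illustrative assumption standing in for df$n_hot_rooms):
```r
# Toy values: seven close together and one far away
n_hot_rooms <- c(10.1, 10.8, 11.2, 12.5, 13.0, 14.2, 15.3, 101.1)
# Draw the box plot; points beyond the whiskers are plotted individually
boxplot(n_hot_rooms, main = "n_hot_rooms")
# boxplot.stats() returns those same points in $out
# (by default, values more than 1.5 box lengths beyond the box)
outliers <- boxplot.stats(n_hot_rooms)$out
print(outliers)
```
Listing the flagged points alongside the plot makes it easier to decide later how to treat them.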

65
00:05:52,050 --> 00:05:53,880
Let us zoom in on this graph.

66
00:05:57,050 --> 00:05:59,480
So you can see that it has several parts.

67
00:06:01,260 --> 00:06:02,450
So it has a box.

68
00:06:03,080 --> 00:06:08,450
This is a box and it has two lines, one above and one below.

69
00:06:10,400 --> 00:06:17,420
And there is one dark line in the middle. The dark line in the middle is representing the median value.

70
00:06:19,070 --> 00:06:22,880
This line of the box is for the first quartile.

71
00:06:23,540 --> 00:06:25,380
This line is for the third quartile.

72
00:06:25,820 --> 00:06:30,710
So you can notice that a lot of values are concentrated in this small region only.

73
00:06:31,340 --> 00:06:35,600
Only these two points are extremely far away from it.

74
00:06:36,390 --> 00:06:43,640
And we can easily see that these are outliers, or outlying values, which we need to rectify or modify

75
00:06:43,650 --> 00:06:54,380
so that these do not impact our analysis. Now, to see the outliers in the rainfall variable, we will look

76
00:06:54,380 --> 00:06:54,530
at

77
00:06:54,590 --> 00:07:01,760
another method, which is usually used in regression analysis. We will create a scatter plot of the rainfall

78
00:07:01,760 --> 00:07:14,600
variable with the response variable, which is Sold. To create a scatter plot, we write pairs, and within brackets

79
00:07:14,920 --> 00:07:15,300
tilde.

80
00:07:17,000 --> 00:07:17,630
This is the tilde sign,

81
00:07:17,640 --> 00:07:20,110
which is above the Tab key.

82
00:07:22,300 --> 00:07:22,670
Now

83
00:07:22,950 --> 00:07:23,260
write

84
00:07:24,340 --> 00:07:25,360
df$Sold.

85
00:07:26,640 --> 00:07:28,150
Remember, S is capital in Sold.

86
00:07:32,880 --> 00:07:36,660
Plus df$rainfall.
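A hedged sketch of the scatter plot call, assuming a toy data frame with columns named Sold and rainfall (the values are made up for illustration):
```r
# Toy data: one rainfall value sits far below the rest
df <- data.frame(
  Sold     = c(0, 1, 0, 1, 1, 0, 1, 0),
  rainfall = c(3, 28, 35, 39, 42, 50, 55, 60)
)
# Formula interface of pairs(): variables listed after the tilde
# (pairs(~ Sold + rainfall, data = df) is an equivalent way to write it)
pairs(~ df$Sold + df$rainfall)
# The isolated point can also be located numerically
lowest_rainfall <- min(df$rainfall)
lowest_row <- which.min(df$rainfall)
```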

87
00:07:45,230 --> 00:07:52,680
So you can see that most of the points are in this region, from 20 to 60.

88
00:07:53,970 --> 00:07:56,020
Only one point is near zero.

89
00:07:56,430 --> 00:07:59,980
And it is very far away from all the other points.

90
00:08:00,360 --> 00:08:03,900
That is why we can also classify this point as an outlier.

91
00:08:04,430 --> 00:08:06,180
And we'll be handling this point

92
00:08:06,300 --> 00:08:06,720
also.

93
00:08:09,630 --> 00:08:13,360
And now we'll be drawing bar plots of the categorical variables

94
00:08:16,060 --> 00:08:18,760
to see the distribution of a categorical variable.

95
00:08:18,880 --> 00:08:28,540
We can write barplot, and within brackets we'll write table,

96
00:08:32,090 --> 00:08:33,790
bracket, df$airport.
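A small sketch of the bar plot step, assuming a made-up airport vector rather than the course data:
```r
# Toy categorical variable with two levels
airport <- factor(c("YES", "NO", "YES", "YES", "NO", "YES", "NO", "YES"))
# table() counts how many values fall in each category
counts <- table(airport)
print(counts)
# barplot() draws one bar per category, with height equal to the count
barplot(counts, main = "airport")
```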

97
00:08:47,870 --> 00:08:52,180
So you can see this is the bar plot of the airport variable.

98
00:08:53,600 --> 00:08:56,780
It has two categories.

99
00:08:56,900 --> 00:08:57,570
Yes and no.

100
00:08:57,920 --> 00:09:03,110
And the height of these bars is giving us the number of values in these two categories.

101
00:09:04,230 --> 00:09:07,760
We can similarly create bar plots for other variables also.

102
00:09:09,350 --> 00:09:11,540
So let's create the bar plot for bus terminal.

103
00:09:19,180 --> 00:09:28,930
We already saw in the EDD that it has only one category, which we are seeing in the bar plot. Now, the point

104
00:09:28,930 --> 00:09:30,400
of drawing a bar plot is this:

105
00:09:30,610 --> 00:09:38,080
usually, when we have a lot of categories, if we draw a bar plot, we can easily identify such categories,

106
00:09:38,230 --> 00:09:41,050
which may not be very useful for the analysis purpose.

107
00:09:42,070 --> 00:09:48,640
We can identify such categories and club them or aggregate them with other categories for better analysis.

108
00:09:50,200 --> 00:09:55,690
So here it is clear that this bus terminal variable has only one category.

109
00:09:57,070 --> 00:10:00,880
Having one category makes this variable useless.

110
00:10:01,360 --> 00:10:08,010
That is, since it is not having any other value, we cannot determine its impact on the response variable.

111
00:10:08,150 --> 00:10:13,960
Therefore, we do not need to keep this variable, bus terminal, in our analysis.
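One possible way to spot and drop such a single-category variable, sketched on a toy data frame (the column name bus_ter and the values are assumptions for illustration):
```r
# Toy data: bus_ter has only one distinct value
df <- data.frame(
  airport = c("YES", "NO", "YES", "YES", "NO", "YES", "NO", "YES"),
  bus_ter = rep("YES", 8)
)
# Count distinct values in every column
n_distinct <- sapply(df, function(col) length(unique(col)))
# Columns with a single distinct value carry no information for modelling
single_valued <- names(n_distinct)[n_distinct == 1]
# Keep only the remaining columns
df_reduced <- df[, !(names(df) %in% single_valued), drop = FALSE]
print(single_valued)
```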

112
00:10:14,650 --> 00:10:18,310
So now let us summarize the observations from this univariate analysis.

113
00:10:19,900 --> 00:10:27,670
The first observation is we have identified two variables which have outliers.

114
00:10:28,140 --> 00:10:33,550
So n_hot_rooms and rainfall have outliers.

115
00:10:41,940 --> 00:10:48,850
Our second observation is, n_hos_beds has missing values.

116
00:10:58,550 --> 00:11:07,760
And the third observation is that there is a categorical variable called bus terminal, which is useless.

117
00:11:08,970 --> 00:11:10,760
bus_ter is useless.

118
00:11:18,320 --> 00:11:23,960
So we need to take action on these three observations, which we will see in the coming videos.