Now we have imported the data and the raw data is ready. We will first take a look at a sample of this data. To take the sample, we will write our variable name, which is df, and then dot head: df.head().

We are getting the top five rows from the raw data. This is a sample of our data. We have 19 columns, and from the sample you can see the type of each variable and what values it is taking.

Next, to look at the count of each variable and at its data type, we will write df.info().

Here you can see we have our variable names on the left. Then we have a column for the number of non-null values. If you notice, for n_hos_beds the number of values is 498. That means eight values are missing from this variable. We need to handle these missing values before running logistic regression, or any other kind of machine learning algorithm, on this data.

And on the right-hand side, you can see the type of each variable: float64 stands for numerical data. We also have object, which is a distinct type of data; that is categorical data.

To get the number of rows and columns in your data, you can also use dot shape. So if I write df.shape, I will get the number of rows and columns.
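The steps above can be sketched as follows. The lecture's actual CSV isn't reproduced here, so this uses a small made-up frame; the column names (n_hos_beds, airport, bus_ter) follow the lecture, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lecture's data: numeric columns
# (one with missing values) plus categorical ones.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4, 36.2],
    "n_hos_beds": [5.48, np.nan, 7.39, 9.34, np.nan],
    "airport": ["YES", "NO", "NO", "YES", "NO"],
    "bus_ter": ["YES", "YES", "YES", "YES", "YES"],
})

print(df.head())   # first five rows of the data
df.info()          # non-null count and dtype for every column
print(df.shape)    # (number of rows, number of columns)

# df.info() shows non-null counts; the missing count per column is
# also available directly:
missing = df.isnull().sum()
print(missing["n_hos_beds"])
```

Here df.info() would report 3 non-null values out of 5 for n_hos_beds, i.e. two missing, which is the same kind of gap the lecture finds in the real data (498 of 506).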
So as you can see, we have five hundred and six rows in total, and there are 19 columns, of which eighteen columns are our independent variables, and the last column is our dependent variable.

Now let's run an EDD on this data. By default, you will get the EDD only for your numerical variables, so you will not get it for the airport, waterbody and bus_ter variables. To run the EDD, we just have to write dot describe. So we'll write df, which is our variable name, and then .describe().

If we run this, you can see all our numerical variables are listed along the top, and we have values such as count, mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile and 75th percentile for each of these variables.

Count stands for the total number of values in that variable. Mean stands for the mean, or average, of that variable. Standard deviation is the standard deviation of that variable. Min and max are the minimum and maximum values that variable is taking in our dataset. Then we have the 25th, 50th and 75th percentile values.
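A minimal sketch of the describe step, using an invented rainfall-like column rather than the lecture's full dataset:

```python
import pandas as pd

# Hypothetical numerical column; describe() summarises each numeric
# column with count, mean, std, min, the three quartiles, and max.
df = pd.DataFrame({"rainfall": [3, 28, 39, 42, 60]})
summary = df.describe()
print(summary)
# The index of `summary` is: count, mean, std, min, 25%, 50%, 75%, max.
# The 50% row is the median of the column.
```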
Percentile, if you remember, just means that if you arrange all the values in ascending order, your 25th percentile value will be the value that occurs at the 25 per cent position of that arranged data. Similarly, the 50th percentile is the value that occurs at the 50 per cent position of that data; this is the same as the median of the data. And similarly, the 75th percentile stands for the value that occurs at the 75 per cent position of the arranged data.

Now we will look at all of these variables one by one. What we want from this EDD is the number of missing values, and the variables which have outliers. Outliers means values that are not following the pattern of that variable. So, for example, if the values of some variable are between one and ten, and then there is just one value which is in the range of a thousand or ten thousand, then we call that value an outlier.

So first we want to identify missing values. Second, we want to identify the outliers. Third, we will look at the distribution of the categorical variables.

We have already identified the missing values with df.info(); you can also look at the count row of this EDD.

To identify the outliers, there are two methods. First, you have to look at the difference between the mean and the median.
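The mean-versus-median check can be demonstrated on a tiny made-up sample: a single extreme value drags the mean upward while barely moving the median.

```python
import pandas as pd

# Ten well-behaved values, then the same values plus one outlier.
clean = pd.Series([11, 12, 12, 13, 13, 14, 14, 15, 15, 16])
with_outlier = pd.concat([clean, pd.Series([101])], ignore_index=True)

# Without the outlier, mean and median agree.
print(clean.mean(), clean.median())            # 13.5 13.5

# With the outlier, the mean jumps while the median stays close,
# so a large mean-median gap is a cheap outlier signal.
print(with_outlier.mean(), with_outlier.median())
```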
The median is the 50th percentile value. So if there is any outlier, there will be a huge difference between the mean and the median, because outliers only affect the mean, not the median value. That is, if there is an outlier in one of our variables, we will see a huge difference between the mean and the median.

Second, we can look at the distribution of the minimum, the 25th percentile, the 50th percentile, the 75th percentile and the maximum to notice any outlier. If you see a major difference between any two consecutive values, that means there may be an outlier.

For example, in our first column, which is for price, you can see the minimum value is five and the 25th percentile value is seventeen. Then again, the 50th percentile value is 21. If you notice, there is not a great difference between any two of these categories; the differences between consecutive categories fall between four and 25. So you can see there is not a huge difference, or we can say that there is no outlier in our price data.

Now, similarly, you can look at each of these individual variables. I will directly jump to the variables in which there are defects. So if you look at the variable n_hot_rooms: the minimum value is ten and the 25th percentile value is eleven.
Again, the 50th and the 75th percentile values are twelve and fourteen, and the maximum value is one zero one: 101. You can see there is a huge difference between the 75th percentile value and the maximum value. So we can say that there is something wrong with this data: either there is a skewed distribution, or there is an outlier in this data.

Similarly, if you look at the rainfall data, the minimum value is just three, while the 25th percentile value is 28, and then the 50th percentile value is 39, and so on. You can see that there is an outlier on the lower end of this data.

To confirm our assumptions, there are several methods. We will first draw a box plot for our n_hot_rooms, and then we will use a scatter plot to find outliers for our rainfall.

One more thing: there is no hard and fast rule for identifying outliers or for tidying your data. These are iterative processes, and you have to come back if you are finding any difficulty in handling this data later on in the process.

Now, before drawing our box plot for n_hot_rooms, I will first explain what a box plot is, using an example for n_hos_beds. So I will write sns.boxplot, and then y equal to 'n_hos_beds', and then data equal to df.
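The consecutive-percentile check can be automated. This sketch uses an invented series shaped like the n_hot_rooms column described above: quartiles close together, then a huge jump from the 75th percentile to the maximum.

```python
import pandas as pd

# Hypothetical column: the bulk sits between 10 and 15, with one
# extreme value of 101, mimicking the lecture's n_hot_rooms pattern.
n_hot_rooms = pd.Series([10, 11, 11, 12, 12, 13, 14, 14, 15, 101])

# The five summary points from describe(): min, 25%, 50%, 75%, max.
q = n_hot_rooms.quantile([0.0, 0.25, 0.50, 0.75, 1.0])
print(q)

# Difference between each pair of consecutive summary points;
# one spike dwarfing the others flags a possible outlier or skew.
gaps = q.diff().dropna()
print(gaps.idxmax())   # the largest gap is between 75% and max
```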
So this is our box plot. In the middle you can see a rectangle with a line in between. The upper line of this box is the third quartile value, which is the same as the 75th percentile value. The bottom line of this box is the first quartile value, which is the 25th percentile value. And the middle line is the 50th percentile value, also known as the median.

Now, the difference between the upper line of this box and the lower line of this box, which is the 75th percentile value minus the 25th percentile value, is known as the interquartile range, also known as the IQR. And these whiskers, which are present at the top and at the bottom of the box, are calculated using this IQR. By default, they extend 1.5 times the IQR from the upper and the lower end of the box. We are not going into detail about this, but whatever points lie outside these whiskers are known as outliers.

If you want to know more about boxplot, you can write this statement: you can use a question mark before any function to open the help for that function. So if we execute it, we will get the default help about this function. So you can use a question mark and run the line to open the help section.

So now let's draw a box plot for n_hot_rooms. We will write sns.boxplot.
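The 1.5 × IQR rule the box plot uses can be computed directly, which is handy when you want the outlier rows themselves rather than a picture. The values below are invented stand-ins for a column like n_hos_beds.

```python
import pandas as pd

# Hypothetical series: eight ordinary values plus one extreme one.
s = pd.Series([5.0, 6.0, 7.0, 7.5, 8.0, 8.5, 9.0, 10.0, 21.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Box-plot convention: points beyond 1.5 * IQR from the box edges
# are drawn individually as outliers ("fliers").
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(lower, upper)
print(outliers.tolist())
```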
Then our y variable is n_hot_rooms, and the data is df. If we execute this, you can see the box and the whiskers are lying at values below 20. There are two points present outside this box plot, which are at values around 80 and 100. Clearly, these two points are outliers for this variable: there are no values between 20 and 80, and these two outliers are the only values lying above 80. So from this box plot we have identified two outliers in our n_hot_rooms.

Now let's discuss the second method of identifying outliers, which is the scatter plot. We will draw a scatter plot between our X and Y variables. Here, our X variable will be rainfall, and our Y variable will be our dependent variable, which is Sold. To plot the scatter plot, we'll write sns.jointplot, then our x is 'rainfall', y is 'Sold', and data is df.

You can see on the top we have the histogram of our rainfall data, and on the right-hand side we have the histogram of our Sold, which is the dependent variable. Our dependent variable is only taking the values zero and one; that's why we are only getting two bars in its histogram. And in between, in the plot area, we have the scatter plot. As you can see, all the points are lying between 20 and 60, except one point, which is near zero.
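What the scatter plot shows visually on the lower end can also be read off the numbers: the minimum sits far below the 25th percentile. A sketch with invented rainfall values in the same shape as the lecture's (bulk between roughly 20 and 60, one stray reading near zero):

```python
import pandas as pd

# Hypothetical rainfall column: one isolated low value, then a bulk.
rainfall = pd.Series([3, 25, 28, 33, 39, 41, 47, 52, 58, 60])

q = rainfall.quantile([0.0, 0.25, 0.50])
print(q)

# A large jump from min to the 25th percentile points at a
# lower-end outlier, matching the lone point in the scatter plot.
gap = q[0.25] - q[0.0]
print(gap)
```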
We can clearly see this is a kind of outlier for this rainfall data, and we will treat this outlier in a later part of our course.

Now we have two observations from our EDD of the numerical data. First, there are missing values in n_hos_beds. Second, there are outliers on the higher end in n_hot_rooms, and there is an outlier on the lower end in the rainfall variable.

Now let's look at our categorical variables. For our categorical variables, we will plot bar graphs. Our first categorical variable is airport. To plot a graph for it, we will write sns.countplot, with x equal to 'airport' and data equal to df.

You can see our airport variable is taking two values, yes and no, and the height of each bar represents the number of occurrences of that value in our data. As you can see, there is nothing wrong with this data.

Now let's move on to our next variable, which is waterbody. We will copy this code and write 'waterbody' instead of 'airport'. So our waterbody variable is taking four values: River, Lake, None, and River & Lake. Again, there is nothing wrong with this data.
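The information a countplot draws as bar heights is exactly what value_counts computes, so you can also inspect categorical distributions without plotting. The frame below is invented; the category labels follow the lecture.

```python
import pandas as pd

# Hypothetical categorical columns like airport and waterbody.
df = pd.DataFrame({
    "airport": ["YES", "NO", "YES", "YES", "NO", "NO", "YES"],
    "waterbody": ["River", "Lake", "None", "River & Lake",
                  "Lake", "River", "None"],
})

# One count per category: these are the countplot's bar heights.
counts = df["airport"].value_counts()
print(counts)

# Number of distinct categories a column takes.
print(df["waterbody"].nunique())
```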
Next, to get a similar plot for the bus_ter variable, we will just write 'bus_ter' instead of 'waterbody' and run this.

Here you can see bus_ter is only taking one value, which is yes. So in a way, this is not a variable; this is a constant. It will not provide any additional information to our model; it will just act as a constant. Therefore, we can delete this variable, or drop it from our data, as it will not provide any additional information to the model.

Now let's note down our observations. Our first observation was missing values in n_hos_beds. Our second observation was that bus_ter has only yes values, so it will not provide any additional information for our model, and we can drop it. And our third observation was about outliers in n_hot_rooms and rainfall.

So always perform this EDD and univariate analysis before training your model. This will help you clean your data and make your data usable for your machine learning algorithms.
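The "constant column" check and drop from the observations above can be sketched like this, again on an invented two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7],
    "bus_ter": ["YES", "YES", "YES"],   # constant: one distinct value
})

# A column with a single unique value carries no information for the
# model, so find such columns and drop them.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)     # -> ['bus_ter']
print(list(df.columns))  # -> ['price']
```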