1 00:00:00,090 --> 00:00:02,830 Hello all. Before going deep down into this session, 2 00:00:02,910 --> 00:0007,350 let's have a quick recap of what we have done in all our previous sessions. 3 00:00:07,590 --> 00:00:13,140 So we have basically performed several feature engineering techniques, namely the preprocessing: we have processed 4 00:00:13,140 --> 00:00:19,500 our Duration column, Arrival_Time feature, Departure_Time feature and lots of features as such. So 5 00:00:19,500 --> 00:00:20,340 in this session, 6 00:00:20,340 --> 00:00:27,360 what we have to do is handle our categorical data, and we have to basically perform feature 7 00:00:27,360 --> 00:00:29,850 encoding techniques on our data. 8 00:00:30,240 --> 00:00:35,990 Now, this categorical data is basically of exactly two types. 9 00:00:36,000 --> 00:00:38,310 The very first one is your nominal data. 10 00:00:38,310 --> 00:00:40,660 The second one is your ordinal data. 11 00:00:40,890 --> 00:00:44,420 So what exactly is this nominal data? 12 00:00:44,430 --> 00:00:46,220 What is this nominal data? 13 00:00:46,260 --> 00:00:53,010 Nominal data are basically those data that are not in any order, like the names of countries; the 14 00:00:53,010 --> 00:01:01,590 name of a country doesn't have any hierarchy. Ordinal data, on the other hand, are those data that have 15 00:01:01,590 --> 00:01:05,500 some kind of hierarchy, like, say, good, better, best. 16 00:01:05,520 --> 00:01:07,590 So they have some kind of hierarchy. 17 00:01:07,860 --> 00:01:15,810 So whenever you have nominal data, in that case you have to perform one-hot encoding. 18 00:01:15,990 --> 00:01:24,960 And whenever you have ordinal data, in that case you 19 00:01:24,960 --> 00:01:27,450 have to perform label encoding. 20 00:01:27,700 --> 00:01:34,440 So basically you have to deal with the categorical data either by one-hot encoding or by label encoding.
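Before moving on, here is a minimal sketch of the two encoding styles just described; the tiny data frame and its column names are invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["India", "USA", "India", "Japan"],   # nominal: no order
    "Quality": ["good", "better", "best", "good"],   # ordinal: has a hierarchy
})

# Nominal data -> one-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["Country"])

# Ordinal data -> label encoding that respects the hierarchy
order = {"good": 0, "better": 1, "best": 2}
df["Quality_encoded"] = df["Quality"].map(order)
```

Note that the ordinal mapping is written by hand here so the numbers follow the good < better < best hierarchy, rather than whatever order a generic encoder would pick.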
21 00:01:34,440 --> 00:01:39,140 And there are several techniques to deal with this categorical data. 22 00:01:39,600 --> 00:01:45,600 So the very first thing I'm going to do: here in this train_data, I'm going to pass my list, 23 00:01:45,780 --> 00:01:52,140 which contains the categorical column names, to get all of your categorical data. Let's say I'm going to store 24 00:01:52,140 --> 00:01:55,730 it in a new data frame, which is exactly categorical. 25 00:01:55,740 --> 00:01:57,030 So just execute it. 26 00:01:57,030 --> 00:02:01,910 And if I'm going to call head on my data frame, you will get a preview 27 00:02:02,190 --> 00:02:05,070 of how exactly this data frame looks. 28 00:02:05,250 --> 00:02:09,620 All the features are exactly of categorical data type. 29 00:02:09,630 --> 00:02:10,960 Let's say, very first, 30 00:02:11,100 --> 00:02:13,800 we have to deal with this Airline feature. 31 00:02:13,980 --> 00:02:22,040 So what I am going to do: let's say, on this categorical data frame, I have to access my Airline 32 00:02:22,040 --> 00:02:28,590 column, and on this, if I'm going to call value_counts to get a count of each and every category available 33 00:02:28,590 --> 00:02:34,830 in this column, you will see this airline has this much count, this airline 34 00:02:34,830 --> 00:02:35,970 has that much count. 35 00:02:36,300 --> 00:02:43,710 Let's say you have to perform an analysis between this Airline feature and the Price feature. 36 00:02:43,710 --> 00:02:51,240 Let's say you have to find what exactly is the distribution of each and every airline with respect to 37 00:02:51,240 --> 00:02:52,510 your Price column. 38 00:02:52,740 --> 00:02:58,760 So in such a case, you can use a very handy plot, the boxplot, or you can use some other plots like the 39 00:02:58,770 --> 00:03:01,110 distribution plot or a swarm plot. 40 00:03:01,110 --> 00:03:03,700 But I'm going to show you the most popular one.
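The selecting-and-counting steps just described might look roughly like this; the toy train_data below is a made-up stand-in for the real flight data frame, and select_dtypes is one possible way to gather the categorical columns instead of typing the list of names by hand:

```python
import pandas as pd

# Toy stand-in for the video's train_data (the real dataset is not shown here)
train_data = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "Jet Airways"],
    "Source": ["Delhi", "Kolkata", "Delhi", "Bangalore"],
    "Price": [3897, 7662, 4462, 13882],
})

# Keep only the object-dtype (categorical) columns, mirroring the
# hand-written column list used on screen
categorical = train_data.select_dtypes(include="object")
print(categorical.head())

# Count how many rows each airline has
counts = categorical["Airline"].value_counts()
print(counts)
```

Calling head gives the same preview as in the video, and value_counts returns one count per airline, sorted from most to least frequent.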
41 00:03:03,720 --> 00:03:07,170 The boxplot is a very popular plot used in the industry. 42 00:03:07,410 --> 00:03:13,050 So to draw that, you have to use your seaborn library, and from this, 43 00:03:13,050 --> 00:03:15,630 you have to use its boxplot. 44 00:03:15,720 --> 00:03:21,800 And here, if I'm going to press Shift plus Tab, you will get all the documentation of this function: 45 00:03:21,810 --> 00:03:22,950 what is your x, 46 00:03:23,250 --> 00:03:23,760 what is your 47 00:03:23,770 --> 00:03:26,700 y, what is your data frame, and all these things. 48 00:03:27,030 --> 00:03:34,740 Let's say on my x I basically have all the different categories of airline, 49 00:03:34,830 --> 00:03:38,400 and on y I need my Price. 50 00:03:38,790 --> 00:03:46,560 And what exactly is my data frame? My data frame is nothing but train_data, and we have to sort 51 00:03:46,560 --> 00:03:47,480 this data as well. 52 00:03:47,790 --> 00:03:52,070 Let's say I'm going to sort this data in descending order. 53 00:03:52,320 --> 00:03:57,780 So for this, you have to call sort_values, and here, very first, you have to say on 54 00:03:57,780 --> 00:03:59,710 what basis you have to sort it. 55 00:03:59,760 --> 00:04:03,240 So I'm going to say I have to sort it on the basis of Price. 56 00:04:03,300 --> 00:04:10,590 Then you have to set your ascending parameter as False, because you have to sort it in descending 57 00:04:10,590 --> 00:04:11,010 order. 58 00:04:11,160 --> 00:04:16,710 Let's say I'm going to set my own figure size for this plot. 59 00:04:17,040 --> 00:04:20,250 For this, you guys can use plt.figure. 60 00:04:20,430 --> 00:04:25,540 And in this plt.figure, you have a parameter which is exactly figsize. 61 00:04:25,560 --> 00:04:32,080 And here you guys can set your own window size; I'm going to set my own window of 15 62 00:04:32,180 --> 00:04:32,690 comma 5.
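Putting those pieces together, and assuming seaborn and matplotlib are installed, the boxplot call might be sketched like this; the toy train_data again stands in for the real one, and the headless backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")          # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for train_data; the real frame comes from the earlier sessions
train_data = pd.DataFrame({
    "Airline": ["IndiGo", "Jet Airways", "Air India", "IndiGo", "Jet Airways"],
    "Price": [3897, 13882, 7662, 4462, 11087],
})

plt.figure(figsize=(15, 5))    # set your own window size: 15 comma 5
ax = sns.boxplot(
    x="Airline",
    y="Price",
    data=train_data.sort_values("Price", ascending=False),  # descending order
)
```

The same call works for the Total_Stops and Source analyses later in the session: only the column passed to x changes.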
63 00:04:32,760 --> 00:04:34,260 So just execute it. 64 00:04:34,660 --> 00:04:38,060 It will take a couple of seconds, and here is a beautiful boxplot. 65 00:04:38,070 --> 00:04:45,180 You will see, with respect to Jet Airways, this is exactly the distribution; this lower line 66 00:04:45,270 --> 00:04:48,720 marks the twenty-fifth percentile of data points, 67 00:04:48,810 --> 00:04:54,750 this is your median, and this one shows the seventy-fifth percentile of data. 68 00:04:54,780 --> 00:04:58,980 And from this boxplot we can come to the conclusion: 69 00:04:59,010 --> 00:04:59,490 yeah, 70 00:05:00,040 --> 00:05:08,890 Jet Airways Business has the highest price, whereas all the other airlines have almost similar 71 00:05:08,890 --> 00:05:15,880 medians; there is not much fluctuation in all other airlines apart from this Jet Airways Business. 72 00:05:15,880 --> 00:05:20,860 In a similar way, you can also perform this analysis with respect to your 73 00:05:20,860 --> 00:05:25,810 Total_Stops feature. Let's say I am going to call train_data 74 00:05:25,810 --> 00:05:33,370 dot head over there; you will see a column named over here as Total_Stops. Let's say you have to extract, 75 00:05:33,520 --> 00:05:33,890 yeah, 76 00:05:34,330 --> 00:05:38,090 what exactly is the distribution of this Total_Stops. 77 00:05:38,110 --> 00:05:47,410 So for this I want to use this boxplot, and here, this time, on this x axis you just need this 78 00:05:47,450 --> 00:05:49,680 Total_Stops feature. 79 00:05:49,720 --> 00:05:52,120 So what you have to do: just execute it. 80 00:05:52,120 --> 00:05:55,110 And this is a beautiful boxplot. 81 00:05:55,120 --> 00:06:01,870 And you will see here, with respect to one stop, you have some outliers in the data. 82 00:06:01,910 --> 00:06:03,390 It means:
83 00:06:03,430 --> 00:06:10,390 flights that have just a single stop may have a higher fare than others, and you 84 00:06:10,390 --> 00:06:16,670 will see for flights with the rest of the stop counts, the price isn't fluctuating very much. 85 00:06:16,670 --> 00:06:23,560 So that's the inference you can extract from this data, or how you 86 00:06:23,560 --> 00:06:25,450 can understand your data. 87 00:06:25,990 --> 00:06:30,910 So now what we have to do, we have to basically convert this. 88 00:06:30,910 --> 00:06:37,360 We have to convert this Airline feature into some integer format, because my machine learning model isn't able 89 00:06:37,360 --> 00:06:38,170 to understand, 90 00:06:38,200 --> 00:06:43,930 yeah, what exactly is the meaning of this IndiGo, what exactly is the meaning of this Air India, 91 00:06:43,930 --> 00:06:48,570 because obviously a machine learning algorithm just works on mathematical equations. 92 00:06:48,580 --> 00:06:51,460 It just works on some kind of integer data. 93 00:06:51,470 --> 00:06:53,860 So we have to convert this into some integer format. 94 00:06:54,190 --> 00:06:59,620 So for this, what I'm going to do: for this Airline, we are going to use one-hot encoding. 95 00:07:00,040 --> 00:07:06,580 So to use one-hot encoding, you guys can call a function, which is exactly pd.get_dummies, 96 00:07:06,760 --> 00:07:12,010 which is exactly in your pandas module. So just use this function. 97 00:07:12,010 --> 00:07:16,920 And here you have to say on what column you have to perform it. 98 00:07:17,110 --> 00:07:23,780 So I have to perform it on the Airline column of categorical, and then I'm going to pass the parameter drop_first, and it's 99 00:07:24,340 --> 00:07:25,200 equal to True. 100 00:07:25,420 --> 00:07:29,190 Otherwise it will provide you some repetition of columns. 101 00:07:29,240 --> 00:07:32,260 So just pass drop_first equals True.
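A minimal sketch of the pd.get_dummies call with drop_first just described, on a made-up Airline column:

```python
import pandas as pd

# Toy Airline column standing in for categorical["Airline"]
categorical = pd.DataFrame(
    {"Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo"]}
)

# One-hot encode; drop_first=True removes the first category's column,
# since the remaining 0/1 columns already determine it (a row of all
# zeros can only mean the dropped category)
Airline = pd.get_dummies(categorical["Airline"], drop_first=True)
print(Airline.head())
```

With three airlines, you get only two dummy columns back: the "Air India" column is the redundant one that gets dropped here, because pandas orders the categories alphabetically.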
102 00:07:32,500 --> 00:07:36,610 After doing this dummification, it will return my data frame. 103 00:07:36,610 --> 00:07:40,870 So I'm going to store it in Airline and just execute it. 104 00:07:40,870 --> 00:07:47,910 And if on this Airline I'm going to call head over there, we will see all these features get dummified, 105 00:07:47,920 --> 00:07:52,220 and all these features have some integer data. 106 00:07:52,340 --> 00:07:57,120 Now, what we have to do, we have to dummify this Source column. 107 00:07:57,130 --> 00:07:59,680 Let's see what I'm going to do. 108 00:07:59,740 --> 00:08:03,430 I have to access this Source column, so I'm going to access it. 109 00:08:03,430 --> 00:08:10,860 And on this Source, if I'm going to call value_counts, you will see over here Delhi has that much 110 00:08:10,870 --> 00:08:12,050 number of data points, 111 00:08:12,190 --> 00:08:13,660 Kolkata has that much, 112 00:08:13,660 --> 00:08:14,950 Bangalore has that much. 113 00:08:15,190 --> 00:08:21,200 Let's say I have to extract the distribution of this Source with respect to Price. 114 00:08:21,340 --> 00:08:24,070 So in such a case, I can again use this boxplot. 115 00:08:24,070 --> 00:08:32,800 So this time I'm going to say, on the x axis I have to just pass this Source. Now I have to just execute 116 00:08:32,800 --> 00:08:37,920 it, and it will return this beautiful boxplot, and we will see over here, 117 00:08:38,440 --> 00:08:46,180 Bangalore has the highest fluctuation in it, and we will see that Delhi definitely has the highest 118 00:08:46,180 --> 00:08:51,120 median compared to all other metropolitan cities of India. 119 00:08:51,130 --> 00:08:52,390 And we will see over here, 120 00:08:52,780 --> 00:08:56,460 this is exactly our distribution of Mumbai, 121 00:08:56,470 --> 00:08:59,560 this is of Chennai, with respect to all other cities as well. 122 00:09:00,160 --> 00:09:04,550 So now we have to dummify this Source column as well.
123 00:09:04,810 --> 00:09:06,760 So for this, this is what I'm going to do. 124 00:09:06,910 --> 00:09:11,820 I'm just going to copy this, and I have to just do some modification over here. 125 00:09:12,250 --> 00:09:16,060 So this time I have to do it for Source. 126 00:09:16,390 --> 00:09:21,460 And here I'm going to say I'm going to create a new data frame; 127 00:09:21,460 --> 00:09:24,070 let's say its name is Source. And simple, 128 00:09:24,190 --> 00:09:30,970 I'm going to call head on my data to get a preview of how exactly this new data frame looks. You will 129 00:09:30,970 --> 00:09:37,120 see Chennai for this, Delhi for this, Kolkata for this, Mumbai for this. 130 00:09:37,230 --> 00:09:38,990 So we can clearly understand that, 131 00:09:39,010 --> 00:09:39,320 yeah, 132 00:09:39,410 --> 00:09:42,760 here my Bangalore is available. 133 00:09:42,760 --> 00:09:50,380 It means wherever you have a one, it means in that particular row, that particular category is present. 134 00:09:50,770 --> 00:09:58,010 So now, after we dummify the Source, what we have to do is dummify our Destination as well. 135 00:09:58,240 --> 00:09:59,650 So I'm going to say I have to access 136 00:09:59,990 --> 00:10:06,710 the Destination as well, so I'm going to say this is nothing but this one. And if I am going to call 137 00:10:06,740 --> 00:10:13,610 our value_counts over there, we will observe over here it has that much number of count, 138 00:10:13,610 --> 00:10:15,170 this has that much. 139 00:10:15,680 --> 00:10:24,050 Let's say we have to extract what exactly is its distribution with respect to the Price. So for 140 00:10:24,050 --> 00:10:25,610 this, what we guys can do, 141 00:10:25,610 --> 00:10:28,730 we guys can copy the same thing, and simply 142 00:10:28,940 --> 00:10:35,120 we have to just paste it. What you can do, you can create a function for it as well, if you have to work 143 00:10:35,120 --> 00:10:36,720 in a much smarter way.
144 00:10:37,280 --> 00:10:45,280 So this time, on the x axis, basically I have my Destination feature, so just execute it. 145 00:10:45,530 --> 00:10:52,160 And this is a beautiful boxplot. You will see over here, in the case of New Delhi, wherever my New Delhi 146 00:10:52,370 --> 00:10:53,170 destination is, 147 00:10:53,480 --> 00:11:01,070 this is exactly our distribution of it. It means flights that are going to New Delhi have the highest 148 00:11:01,100 --> 00:11:07,550 fare, whereas the flights that are going towards Kolkata have the lowest price; you will observe it over 149 00:11:07,550 --> 00:11:09,500 here in the distribution of this Kolkata. 150 00:11:09,800 --> 00:11:11,550 Now, what do you have to do? 151 00:11:11,570 --> 00:11:12,920 You have to dummify 152 00:11:12,950 --> 00:11:16,730 this Destination. So for this, what I'm going to do: 153 00:11:16,730 --> 00:11:19,640 either you can copy, or you can manually type it. 154 00:11:20,030 --> 00:11:25,870 So just copy paste, and this time I have to dummify basically my Destination. 155 00:11:26,060 --> 00:11:28,190 So I have to access this Destination 156 00:11:28,190 --> 00:11:34,110 instead, and this time I have to store all this stuff, let's say, in Destination. 157 00:11:34,160 --> 00:11:36,200 So I'm just going to store it in Destination. 158 00:11:36,470 --> 00:11:41,150 And this time I have to just press Enter and just execute it. 159 00:11:41,150 --> 00:11:45,700 You will see, with respect to Destination, you have this dummified data frame. 160 00:11:45,710 --> 00:11:51,650 So in the upcoming session, we are basically going to deal with this Route feature, because you will 161 00:11:51,650 --> 00:11:54,590 also see over here, this column is a little bit messy. 162 00:11:54,800 --> 00:12:00,000 So you have to do a lot of preprocessing on this column to encode this column. 163 00:12:00,380 --> 00:12:01,820 So that's all about this session, 164 00:12:01,830 --> 00:12:03,860 and hopefully you loved the session very much.
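The session leaves the Airline, Source and Destination dummies in three separate frames; one way to stitch them together afterwards (a step only hinted at here, so treat this as an assumption rather than the video's own code) is pd.concat:

```python
import pandas as pd

# Toy categorical frame; the real one was built from train_data earlier
categorical = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Source": ["Delhi", "Kolkata", "Bangalore"],
    "Destination": ["Cochin", "Hyderabad", "New Delhi"],
})

# Dummify each column exactly as in the session
Airline = pd.get_dummies(categorical["Airline"], drop_first=True)
Source = pd.get_dummies(categorical["Source"], drop_first=True)
Destination = pd.get_dummies(categorical["Destination"], drop_first=True)

# Stitch the three dummified frames side by side (same row order, axis=1)
encoded = pd.concat([Airline, Source, Destination], axis=1)
```

Each toy column has three categories, so each dummy frame keeps two columns after drop_first, giving a combined frame of six integer-style features ready for the model.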
165 00:12:04,340 --> 00:12:05,100 Thank you, guys. 166 00:12:05,120 --> 00:12:06,230 Have a nice day. 167 00:12:06,290 --> 00:12:07,160 Keep learning. 168 00:12:07,160 --> 00:12:08,060 Keep growing. 169 00:12:08,330 --> 00:12:09,320 Keep practicing.