1 00:00:00,090 --> 00:00:02,830 Hello all. Before going deep down into this session, 2 00:00:02,910 --> 00:0007,350 let's have a quick recap of what we have done in all our previous sessions. 3 00:00:07,590 --> 00:00:13,140 So we have basically performed several feature engineering techniques, namely the preprocessing: we have processed 4 00:00:13,140 --> 00:00:19,500 our Duration column, Arrival_Time feature, Departure_Time feature and lots of features as such. So 5 00:00:19,500 --> 00:00:20,340 in this session, 6 00:00:20,340 --> 00:00:27,360 what we have to do is handle our categorical data, and we have to basically perform feature 7 00:00:27,360 --> 00:00:29,850 encoding techniques on our data. 8 00:00:30,240 --> 00:00:35,990 Now, this categorical data is basically of exactly two types. 9 00:00:36,000 --> 00:00:38,310 The very first one is your nominal data. 10 00:00:38,310 --> 00:00:40,660 The second one is your ordinal data. 11 00:00:40,890 --> 00:00:44,420 So what exactly is this nominal data? 12 00:00:44,430 --> 00:00:46,220 What is this nominal data? 13 00:00:46,260 --> 00:00:53,010 Nominal data are basically those data that are not in any order, like the names of countries; the 14 00:00:53,010 --> 00:01:01,590 name of a country doesn't have any hierarchy. Ordinal data, on the other hand, are those data that have 15 00:01:01,590 --> 00:01:05,500 some kind of hierarchy, like, say, good, better, best. 16 00:01:05,520 --> 00:01:07,590 So they have some kind of hierarchy. 17 00:01:07,860 --> 00:01:15,810 So whenever you have nominal data, in that case you have to perform one-hot encoding. 18 00:01:15,990 --> 00:01:24,960 And whenever you have ordinal data, in that case you 19 00:01:24,960 --> 00:01:27,450 have to perform label encoding. 20 00:01:27,700 --> 00:01:34,440 So basically you have to deal with the categorical data either by one-hot encoding or by label encoding.
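Before moving on, here is a minimal sketch of the two encoding styles just described; the tiny data frame and its column names are invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["India", "USA", "India", "Japan"],   # nominal: no order
    "Quality": ["good", "better", "best", "good"],   # ordinal: has a hierarchy
})

# Nominal data -> one-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["Country"])

# Ordinal data -> label encoding that respects the hierarchy
order = {"good": 0, "better": 1, "best": 2}
df["Quality_encoded"] = df["Quality"].map(order)
```

Note that the ordinal mapping is written by hand here so the numbers follow the good < better < best hierarchy, rather than whatever order a generic encoder would pick.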
21 00:01:34,440 --> 00:01:39,140 And there are several techniques to deal with this categorical data. 22 00:01:39,600 --> 00:01:45,600 So the very first thing I'm going to do: here in this train_data, I'm going to pass my list, 23 00:01:45,780 --> 00:01:52,140 which contains the categorical column names, to get all of your categorical data. Let's say I'm going to store 24 00:01:52,140 --> 00:01:55,730 it in a new data frame, which is exactly categorical. 25 00:01:55,740 --> 00:01:57,030 So just execute it. 26 00:01:57,030 --> 00:02:01,910 And if I'm going to call head on my data frame, you will get a preview 27 00:02:02,190 --> 00:02:05,070 of how exactly this data frame looks. 28 00:02:05,250 --> 00:02:09,620 All the features are exactly of categorical data type. 29 00:02:09,630 --> 00:02:10,960 Let's say, very first, 30 00:02:11,100 --> 00:02:13,800 we have to deal with this Airline feature. 31 00:02:13,980 --> 00:02:22,040 So what I am going to do: let's say, on this categorical data frame, I have to access my Airline 32 00:02:22,040 --> 00:02:28,590 column, and on this, if I'm going to call value_counts to get a count of each and every category available 33 00:02:28,590 --> 00:02:34,830 in this column, you will see this airline has this much count, this airline 34 00:02:34,830 --> 00:02:35,970 has that much count. 35 00:02:36,300 --> 00:02:43,710 Let's say you have to perform an analysis between this Airline feature and the Price feature. 36 00:02:43,710 --> 00:02:51,240 Let's say you have to find what exactly is the distribution of each and every airline with respect to 37 00:02:51,240 --> 00:02:52,510 your Price column. 38 00:02:52,740 --> 00:02:58,760 So in such a case, you can use a very handy plot, the boxplot, or you can use some other plots like the 39 00:02:58,770 --> 00:03:01,110 distribution plot or a swarm plot. 40 00:03:01,110 --> 00:03:03,700 But I'm going to show you the most popular one.
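The selecting-and-counting steps just described might look roughly like this; the toy train_data below is a made-up stand-in for the real flight data frame, and select_dtypes is one possible way to gather the categorical columns instead of typing the list of names by hand:

```python
import pandas as pd

# Toy stand-in for the video's train_data (the real dataset is not shown here)
train_data = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "Jet Airways"],
    "Source": ["Delhi", "Kolkata", "Delhi", "Bangalore"],
    "Price": [3897, 7662, 4462, 13882],
})

# Keep only the object-dtype (categorical) columns, mirroring the
# hand-written column list used on screen
categorical = train_data.select_dtypes(include="object")
print(categorical.head())

# Count how many rows each airline has
counts = categorical["Airline"].value_counts()
print(counts)
```

Calling head gives the same preview as in the video, and value_counts returns one count per airline, sorted from most to least frequent.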
41 00:03:03,720 --> 00:03:07,170 The boxplot is a very popular plot used in the industry. 42 00:03:07,410 --> 00:03:13,050 So to draw that, you have to use your seaborn library, and from this, 43 00:03:13,050 --> 00:03:15,630 you have to use its boxplot. 44 00:03:15,720 --> 00:03:21,800 And here, if I'm going to press Shift plus Tab, you will get all the documentation of this function: 45 00:03:21,810 --> 00:03:22,950 what is your x, 46 00:03:23,250 --> 00:03:23,760 what is your 47 00:03:23,770 --> 00:03:26,700 y, what is your data frame, and all these things. 48 00:03:27,030 --> 00:03:34,740 Let's say on my x I basically have all the different categories of airline, 49 00:03:34,830 --> 00:03:38,400 and on y I need my Price. 50 00:03:38,790 --> 00:03:46,560 And what exactly is my data frame? My data frame is nothing but train_data, and we have to sort 51 00:03:46,560 --> 00:03:47,480 this data as well. 52 00:03:47,790 --> 00:03:52,070 Let's say I'm going to sort this data in descending order. 53 00:03:52,320 --> 00:03:57,780 So for this, you have to call sort_values, and here, very first, you have to say on 54 00:03:57,780 --> 00:03:59,710 what basis you have to sort it. 55 00:03:59,760 --> 00:04:03,240 So I'm going to say I have to sort it on the basis of Price. 56 00:04:03,300 --> 00:04:10,590 Then you have to set your ascending parameter as False, because you have to sort it in descending 57 00:04:10,590 --> 00:04:11,010 order. 58 00:04:11,160 --> 00:04:16,710 Let's say I'm going to set my own figure size for this plot. 59 00:04:17,040 --> 00:04:20,250 For this, you guys can use plt.figure. 60 00:04:20,430 --> 00:04:25,540 And in this plt.figure, you have a parameter which is exactly figsize. 61 00:04:25,560 --> 00:04:32,080 And here you guys can set your own window size; I'm going to set my own window of 15 62 00:04:32,180 --> 00:04:32,690 comma 5.
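Putting those pieces together, and assuming seaborn and matplotlib are installed, the boxplot call might be sketched like this; the toy train_data again stands in for the real one, and the headless backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")          # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for train_data; the real frame comes from the earlier sessions
train_data = pd.DataFrame({
    "Airline": ["IndiGo", "Jet Airways", "Air India", "IndiGo", "Jet Airways"],
    "Price": [3897, 13882, 7662, 4462, 11087],
})

plt.figure(figsize=(15, 5))    # set your own window size: 15 comma 5
ax = sns.boxplot(
    x="Airline",
    y="Price",
    data=train_data.sort_values("Price", ascending=False),  # descending order
)
```

The same call works for the Total_Stops and Source analyses later in the session: only the column passed to x changes.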
63 00:04:32,760 --> 00:04:34,260 So just execute it. 64 00:04:34,660 --> 00:04:38,060 It will take a couple of seconds, and here is a beautiful boxplot. 65 00:04:38,070 --> 00:04:45,180 You will see, with respect to Jet Airways, this is exactly the distribution; this lower line 66 00:04:45,270 --> 00:04:48,720 marks the twenty-fifth percentile of data points, 67 00:04:48,810 --> 00:04:54,750 this is your median, and this one shows the seventy-fifth percentile of data. 68 00:04:54,780 --> 00:04:58,980 And from this boxplot we can come to the conclusion: 69 00:04:59,010 --> 00:04:59,490 yeah, 70 00:05:00,040 --> 00:05:08,890 Jet Airways Business has the highest price, whereas all the other airlines have almost similar 71 00:05:08,890 --> 00:05:15,880 medians; there is not much fluctuation in all other airlines apart from this Jet Airways Business. 72 00:05:15,880 --> 00:05:20,860 In a similar way, you can also perform this analysis with respect to your 73 00:05:20,860 --> 00:05:25,810 Total_Stops feature. Let's say I am going to call train_data 74 00:05:25,810 --> 00:05:33,370 dot head over there; you will see a column named over here as Total_Stops. Let's say you have to extract, 75 00:05:33,520 --> 00:05:33,890 yeah, 76 00:05:34,330 --> 00:05:38,090 what exactly is the distribution of this Total_Stops. 77 00:05:38,110 --> 00:05:47,410 So for this I want to use this boxplot, and here, this time, on this x axis you just need this 78 00:05:47,450 --> 00:05:49,680 Total_Stops feature. 79 00:05:49,720 --> 00:05:52,120 So what you have to do: just execute it. 80 00:05:52,120 --> 00:05:55,110 And this is a beautiful boxplot. 81 00:05:55,120 --> 00:06:01,870 And you will see here, with respect to one stop, you have some outliers in the data. 82 00:06:01,910 --> 00:06:03,390 It means:
83 00:06:03,430 --> 00:06:10,390 flights that have just a single stop may have a higher fare than others, and you 84 00:06:10,390 --> 00:06:16,670 will see for flights with the rest of the stop counts, the price isn't fluctuating very much. 85 00:06:16,670 --> 00:06:23,560 So that's the inference you can extract from this data, or how you 86 00:06:23,560 --> 00:06:25,450 can understand your data. 87 00:06:25,990 --> 00:06:30,910 So now what we have to do, we have to basically convert this. 88 00:06:30,910 --> 00:06:37,360 We have to convert this Airline feature into some integer format, because my machine learning model isn't able 89 00:06:37,360 --> 00:06:38,170 to understand, 90 00:06:38,200 --> 00:06:43,930 yeah, what exactly is the meaning of this IndiGo, what exactly is the meaning of this Air India, 91 00:06:43,930 --> 00:06:48,570 because obviously a machine learning algorithm just works on mathematical equations. 92 00:06:48,580 --> 00:06:51,460 It just works on some kind of integer data. 93 00:06:51,470 --> 00:06:53,860 So we have to convert this into some integer format. 94 00:06:54,190 --> 00:06:59,620 So for this, what I'm going to do: for this Airline, we are going to use one-hot encoding. 95 00:07:00,040 --> 00:07:06,580 So to use one-hot encoding, you guys can call a function, which is exactly pd.get_dummies, 96 00:07:06,760 --> 00:07:12,010 which is exactly in your pandas module. So just use this function. 97 00:07:12,010 --> 00:07:16,920 And here you have to say on what column you have to perform it. 98 00:07:17,110 --> 00:07:23,780 So I have to perform it on the Airline column of categorical, and then I'm going to pass the parameter drop_first, and it's 99 00:07:24,340 --> 00:07:25,200 equal to True. 100 00:07:25,420 --> 00:07:29,190 Otherwise it will provide you some repetition of columns. 101 00:07:29,240 --> 00:07:32,260 So just pass drop_first equals True.
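A minimal sketch of the pd.get_dummies call with drop_first just described, on a made-up Airline column:

```python
import pandas as pd

# Toy Airline column standing in for categorical["Airline"]
categorical = pd.DataFrame(
    {"Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo"]}
)

# One-hot encode; drop_first=True removes the first category's column,
# since the remaining 0/1 columns already determine it (a row of all
# zeros can only mean the dropped category)
Airline = pd.get_dummies(categorical["Airline"], drop_first=True)
print(Airline.head())
```

With three airlines, you get only two dummy columns back: the "Air India" column is the redundant one that gets dropped here, because pandas orders the categories alphabetically.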
102 00:07:32,500 --> 00:07:36,610 After doing this dummification, it will return my data frame. 103 00:07:36,610 --> 00:07:40,870 So I'm going to store it in Airline and just execute it. 104 00:07:40,870 --> 00:07:47,910 And if on this Airline I'm going to call head over there, we will see all these features get dummified, 105 00:07:47,920 --> 00:07:52,220 and all these features have some integer data. 106 00:07:52,340 --> 00:07:57,120 Now, what we have to do, we have to dummify this Source column. 107 00:07:57,130 --> 00:07:59,680 Let's see what I'm going to do. 108 00:07:59,740 --> 00:08:03,430 I have to access this Source column, so I'm going to access it. 109 00:08:03,430 --> 00:08:10,860 And on this Source, if I'm going to call value_counts, you will see over here Delhi has that much 110 00:08:10,870 --> 00:08:12,050 number of data points, 111 00:08:12,190 --> 00:08:13,660 Kolkata has that much, 112 00:08:13,660 --> 00:08:14,950 Bangalore has that much. 113 00:08:15,190 --> 00:08:21,200 Let's say I have to extract the distribution of this Source with respect to Price. 114 00:08:21,340 --> 00:08:24,070 So in such a case, I can again use this boxplot. 115 00:08:24,070 --> 00:08:32,800 So this time I'm going to say, on the x axis I have to just pass this Source. Now I have to just execute 116 00:08:32,800 --> 00:08:37,920 it, and it will return this beautiful boxplot, and we will see over here, 117 00:08:38,440 --> 00:08:46,180 Bangalore has the highest fluctuation in it, and we will see that Delhi definitely has the highest 118 00:08:46,180 --> 00:08:51,120 median compared to all other metropolitan cities of India. 119 00:08:51,130 --> 00:08:52,390 And we will see over here, 120 00:08:52,780 --> 00:08:56,460 this is exactly our distribution of Mumbai, 121 00:08:56,470 --> 00:08:59,560 this is of Chennai, with respect to all other cities as well. 122 00:09:00,160 --> 00:09:04,550 So now we have to dummify this Source column as well.
123 00:09:04,810 --> 00:09:06,760 So for this, this is what I'm going to do. 124 00:09:06,910 --> 00:09:11,820 I'm just going to copy this, and I have to just do some modification over here. 125 00:09:12,250 --> 00:09:16,060 So this time I have to do it for Source. 126 00:09:16,390 --> 00:09:21,460 And here I'm going to say I'm going to create a new data frame; 127 00:09:21,460 --> 00:09:24,070 let's say its name is Source. And simple, 128 00:09:24,190 --> 00:09:30,970 I'm going to call head on my data to get a preview of how exactly this new data frame looks. You will 129 00:09:30,970 --> 00:09:37,120 see Chennai for this, Delhi for this, Kolkata for this, Mumbai for this. 130 00:09:37,230 --> 00:09:38,990 So we can clearly understand that, 131 00:09:39,010 --> 00:09:39,320 yeah, 132 00:09:39,410 --> 00:09:42,760 here my Bangalore is available. 133 00:09:42,760 --> 00:09:50,380 It means wherever you have a one, it means in that particular row, that particular category is present. 134 00:09:50,770 --> 00:09:58,010 So now, after we dummify the Source, what we have to do is dummify our Destination as well. 135 00:09:58,240 --> 00:09:59,650 So I'm going to say I have to access 136 00:09:59,990 --> 00:10:06,710 the Destination as well, so I'm going to say this is nothing but this one. And if I am going to call 137 00:10:06,740 --> 00:10:13,610 our value_counts over there, we will observe over here it has that much number of count, 138 00:10:13,610 --> 00:10:15,170 this has that much. 139 00:10:15,680 --> 00:10:24,050 Let's say we have to extract what exactly is its distribution with respect to the Price. So for 140 00:10:24,050 --> 00:10:25,610 this, what we guys can do, 141 00:10:25,610 --> 00:10:28,730 we guys can copy the same thing, and simply 142 00:10:28,940 --> 00:10:35,120 we have to just paste it. What you can do, you can create a function for it as well, if you have to work 143 00:10:35,120 --> 00:10:36,720 in a much smarter way.
144 00:10:37,280 --> 00:10:45,280 So this time, on the x axis, basically I have my Destination feature, so just execute it. 145 00:10:45,530 --> 00:10:52,160 And this is a beautiful boxplot. You will see over here, in the case of New Delhi, wherever my New Delhi 146 00:10:52,370 --> 00:10:53,170 destination is, 147 00:10:53,480 --> 00:11:01,070 this is exactly our distribution of it. It means flights that are going to New Delhi have the highest 148 00:11:01,100 --> 00:11:07,550 fare, whereas the flights that are going towards Kolkata have the lowest price; you will observe it over 149 00:11:07,550 --> 00:11:09,500 here in the distribution of this Kolkata. 150 00:11:09,800 --> 00:11:11,550 Now, what do you have to do? 151 00:11:11,570 --> 00:11:12,920 You have to dummify 152 00:11:12,950 --> 00:11:16,730 this Destination. So for this, what I'm going to do: 153 00:11:16,730 --> 00:11:19,640 either you can copy, or you can manually type it. 154 00:11:20,030 --> 00:11:25,870 So just copy paste, and this time I have to dummify basically my Destination. 155 00:11:26,060 --> 00:11:28,190 So I have to access this Destination 156 00:11:28,190 --> 00:11:34,110 instead, and this time I have to store all this stuff, let's say, in Destination. 157 00:11:34,160 --> 00:11:36,200 So I'm just going to store it in Destination. 158 00:11:36,470 --> 00:11:41,150 And this time I have to just press Enter and just execute it. 159 00:11:41,150 --> 00:11:45,700 You will see, with respect to Destination, you have this dummified data frame. 160 00:11:45,710 --> 00:11:51,650 So in the upcoming session, we are basically going to deal with this Route feature, because you will 161 00:11:51,650 --> 00:11:54,590 also see over here, this column is a little bit messy. 162 00:11:54,800 --> 00:12:00,000 So you have to do a lot of preprocessing on this column to encode this column. 163 00:12:00,380 --> 00:12:01,820 So that's all about this session, 164 00:12:01,830 --> 00:12:03,860 and hopefully you loved the session very much.
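The session leaves the Airline, Source and Destination dummies in three separate frames; one way to stitch them together afterwards (a step only hinted at here, so treat this as an assumption rather than the video's own code) is pd.concat:

```python
import pandas as pd

# Toy categorical frame; the real one was built from train_data earlier
categorical = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Source": ["Delhi", "Kolkata", "Bangalore"],
    "Destination": ["Cochin", "Hyderabad", "New Delhi"],
})

# Dummify each column exactly as in the session
Airline = pd.get_dummies(categorical["Airline"], drop_first=True)
Source = pd.get_dummies(categorical["Source"], drop_first=True)
Destination = pd.get_dummies(categorical["Destination"], drop_first=True)

# Stitch the three dummified frames side by side (same row order, axis=1)
encoded = pd.concat([Airline, Source, Destination], axis=1)
```

Each toy column has three categories, so each dummy frame keeps two columns after drop_first, giving a combined frame of six integer-style features ready for the model.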
165 00:12:04,340 --> 00:12:05,100 Thank you, guys. 166 00:12:05,120 --> 00:12:06,230 Have a nice day. 167 00:12:06,290 --> 00:12:07,160 Keep learning. 168 00:12:07,160 --> 00:12:08,060 Keep growing. 169 00:12:08,330 --> 00:12:09,320 Keep practicing.