1 00:00:00,470 --> 00:00:06,810 So, in the previous session, what we did was basically extract some desired features 2 00:00:06,810 --> 00:00:12,720 from our data, and in all our previous sessions we have analyzed our data 3 00:00:12,720 --> 00:00:14,550 as well as understood our data. 4 00:00:14,940 --> 00:00:18,640 And now it's time to move toward machine learning. 5 00:00:18,930 --> 00:00:25,370 So in this session, what we are going to do: we have this amazing assignment in which we have to apply 6 00:00:25,770 --> 00:00:28,490 feature encoding techniques on the data. 7 00:00:28,500 --> 00:00:34,110 So first, let's understand why there is a need for feature encoding techniques and what the different 8 00:00:34,140 --> 00:00:36,240 types of encoding techniques are. 9 00:00:36,820 --> 00:00:42,240 So, yeah, you will figure out that here you have Direct, here you have Direct, and all these kinds of 10 00:00:42,330 --> 00:00:46,120 string-like things, but machine learning only understands numerical data. 11 00:00:46,140 --> 00:00:53,220 So it means we have to convert this data into some numerical format using some feature encoding technique. 12 00:00:53,400 --> 00:00:57,880 That's why feature encoding techniques come into existence. 13 00:00:58,110 --> 00:01:05,160 So in this session, what we are going to do: we are basically going to use a technique, a very 14 00:01:05,160 --> 00:01:09,650 popular technique known as the mean encoding technique. 15 00:01:09,900 --> 00:01:14,320 So let's understand what exactly the mean encoding technique is. 16 00:01:14,790 --> 00:01:20,750 So if you look at this market segment, let's say in this market segment, here you have... 17 00:01:21,000 --> 00:01:22,500 so let me show you a thing. 18 00:01:22,590 --> 00:01:28,710 data_cat of market_segment, just press Tab. 
19 00:01:28,830 --> 00:01:35,220 And if on this I call unique and just execute it, you will see it has 20 00:01:35,220 --> 00:01:37,710 that many unique categories. 21 00:01:37,860 --> 00:01:45,870 So what mean encoding will do: mean encoding will basically see whatever the mean is with respect 22 00:01:45,870 --> 00:01:47,670 to this Direct category. 23 00:01:47,670 --> 00:01:55,110 Or you can say, whatever the mean of Direct is with respect to this cancellation column, because cancellation 24 00:01:55,110 --> 00:01:58,290 is exactly my dependent feature that we have to predict. 25 00:01:58,320 --> 00:02:03,170 So whatever the mean of this Direct is with respect to this cancellation, 26 00:02:03,610 --> 00:02:06,830 I'm basically going to replace Direct with that mean. 27 00:02:06,840 --> 00:02:09,240 That's what my mean encoding will do. 28 00:02:09,360 --> 00:02:14,360 So if you're not that comfortable, let me show you the code so that you all can understand. 29 00:02:14,550 --> 00:02:20,400 So here, what we have to do first: let's say we have to extract all those columns on which we have 30 00:02:20,400 --> 00:02:21,590 to perform this technique. 31 00:02:22,050 --> 00:02:28,860 So if on this data_cat I call columns and just execute, these 32 00:02:28,860 --> 00:02:37,080 are exactly all the columns. Let's say here, in these columns, I just don't need 33 00:02:37,080 --> 00:02:42,330 this cancellation, because I don't have to encode this feature, because this is my dependent feature and 34 00:02:42,330 --> 00:02:44,620 this is what we have to predict. 35 00:02:44,940 --> 00:02:48,480 So it means I'm going to slice it, starting from zero. 36 00:02:48,630 --> 00:02:50,610 And here I'm going to say zero up to the cancellation column. 37 00:02:50,820 --> 00:02:55,200 And if I execute it, you will see you don't have this cancellation column. 
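Editor's sketch of the two steps just described: listing the distinct categories of one column, then collecting every column except the target. The frame name `data_cat`, the column names, and all values below are assumptions made for illustration; they are not the course dataset. The speaker slices the column index by position, whereas this sketch filters by name, which gives the same result without depending on column order.

```python
import pandas as pd

# Toy stand-in for the categorical frame (names and values are assumed).
data_cat = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "market_segment": ["Direct", "Online TA", "Direct", "Groups"],
    "cancellation": [0, 1, 1, 0],
})

# Distinct categories of one column, in order of first appearance.
segments = data_cat["market_segment"].unique()
print(segments)

# Every column except the target, which must not be encoded.
cols = [c for c in data_cat.columns if c != "cancellation"]
print(cols)
```

Filtering by name is only one option; a positional slice of `data_cat.columns` works equally well when the target is known to be the last column.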
38 00:02:55,230 --> 00:03:01,710 Now I'm going to store it somewhere else, let's say cols, and I'm just going to print cols, 39 00:03:01,710 --> 00:03:02,640 just execute. 40 00:03:02,760 --> 00:03:07,100 And you will figure out these are all the columns that we have to encode. 41 00:03:07,110 --> 00:03:13,920 So I'm going to say for col in this cols list; it means I have to fetch 42 00:03:13,920 --> 00:03:14,840 each and every column. 43 00:03:15,300 --> 00:03:19,010 Then I'm going to say data_cat. 44 00:03:19,020 --> 00:03:24,350 And this time I have to group my data on the basis of each and every column. For this, 45 00:03:24,360 --> 00:03:29,610 let me show you a very, very basic example over here. 46 00:03:29,850 --> 00:03:36,380 Let's say I'm going to say data_cat, and let's say I'm just going to access my hotel. 47 00:03:36,630 --> 00:03:39,900 So if I access this hotel, you will see all the values. 48 00:03:40,380 --> 00:03:45,270 And now what I have to do is group on the basis of this hotel. 49 00:03:45,300 --> 00:03:49,770 So for this, I'm going to say I have to group by this hotel. 50 00:03:50,310 --> 00:03:58,290 Once I have this group-by, then I'm going to access my cancellation, which is exactly this one. 51 00:03:58,300 --> 00:04:00,410 So I have to just access this. 52 00:04:00,780 --> 00:04:04,800 So I'm going to say this is exactly my cancellation. 53 00:04:04,800 --> 00:04:09,660 And on this, I am going to call mean, so let me just call mean. 54 00:04:09,690 --> 00:04:16,410 And if I execute, you will see with respect to City Hotel, you have that much mean; with 55 00:04:16,410 --> 00:04:20,090 respect to this Resort Hotel, you have that much mean. 56 00:04:20,370 --> 00:04:27,960 It means what you can do: wherever you have a City Hotel, just replace it with zero point two seven. 
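Editor's sketch of the group-by step just walked through: group the rows by `hotel`, select the target column, and take its mean per group. The frame name `data_cat` and the toy values are assumptions, so the means printed here are not the figures from the session.

```python
import pandas as pd

# Toy data: four City Hotel bookings, four Resort Hotel bookings (assumed values).
data_cat = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "cancellation": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Mean of the target within each hotel category.
# Because the target is 0/1, this mean is the cancellation rate of that category.
hotel_means = data_cat.groupby("hotel")["cancellation"].mean()
print(hotel_means)
```

In this toy frame, every `City Hotel` row would be replaced by the City Hotel mean and every `Resort Hotel` row by the Resort Hotel mean.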
57 00:04:28,140 --> 00:04:33,360 And internally, my machine learning model will be able to understand what exactly the meaning of this zero 58 00:04:33,360 --> 00:04:37,740 point two seven is, because machine learning doesn't understand what a City Hotel is. 59 00:04:37,760 --> 00:04:40,800 It just understands numerical data. 60 00:04:40,810 --> 00:04:42,950 So we have to tell the machine learning model, 61 00:04:42,970 --> 00:04:45,430 yeah, this is exactly my City Hotel. 62 00:04:45,450 --> 00:04:47,580 So this is the approach that you have to follow. 63 00:04:47,580 --> 00:04:50,010 And this is exactly what mean 64 00:04:50,010 --> 00:04:56,010 encoding is. Now you have to perform this approach for each and every feature. 65 00:04:56,040 --> 00:04:59,810 So here I am going to say I have to just perform this 66 00:05:00,200 --> 00:05:06,800 for each and every feature. Here, I have to say this gets replaced by basically my col, that's 67 00:05:06,800 --> 00:05:07,050 it. 68 00:05:07,490 --> 00:05:12,670 Now, what I have to do: basically, let's say I have to print something, so here 69 00:05:12,710 --> 00:05:14,600 I'm going to say just print it. 70 00:05:15,200 --> 00:05:21,560 And let's say after printing, to make it more user friendly, you can add some spacing as well. 71 00:05:21,670 --> 00:05:26,990 I'm just going to add it and execute; you will see with respect to City Hotel, you have that 72 00:05:26,990 --> 00:05:31,940 much value; with respect to Resort Hotel, you have that much value, and all this sort of thing. 73 00:05:32,390 --> 00:05:39,350 Now, what you have to do here: you have to simply convert all of this into a dictionary. 74 00:05:39,390 --> 00:05:43,120 Now, the question that you guys can ask, or a query, is: 75 00:05:43,610 --> 00:05:43,850 Yeah. 76 00:05:43,850 --> 00:05:45,940 Why is there a need to convert this? 
77 00:05:45,950 --> 00:05:51,350 The intuition is simply that you have to map: to that City Hotel, 78 00:05:51,350 --> 00:05:53,420 I have to map this value. 79 00:05:53,660 --> 00:05:57,820 And once you create a dictionary, the data is in the form of key-value pairs. 80 00:05:57,830 --> 00:06:02,650 It means City Hotel becomes the key and this becomes the value. 81 00:06:03,080 --> 00:06:08,000 So once you have all of this in your dictionary, you can easily map it. 82 00:06:08,450 --> 00:06:09,890 That's the power of a dictionary. 83 00:06:10,250 --> 00:06:15,250 So here I am going to say I have to convert all of this into a dictionary. 84 00:06:15,260 --> 00:06:17,500 Just call to_dict over there. 85 00:06:17,510 --> 00:06:19,310 And again, I'm going to execute it. Now 86 00:06:19,310 --> 00:06:23,550 you will see you have all of this available over here. 87 00:06:23,690 --> 00:06:26,050 All this is now simple. 88 00:06:26,060 --> 00:06:26,770 It's simple. 89 00:06:26,780 --> 00:06:29,800 Now you just have to map it. 90 00:06:30,260 --> 00:06:32,200 So for this, what I'm going to do: 91 00:06:32,210 --> 00:06:35,570 let's say this time I don't have to print; this time 92 00:06:35,570 --> 00:06:40,580 this is exactly my dictionary with respect to each and every column. 93 00:06:40,940 --> 00:06:43,550 So here I'm going to say this is exactly my dictionary. 94 00:06:44,000 --> 00:06:48,380 Once I have all of this, what I have to do is just map it. 95 00:06:48,380 --> 00:06:54,470 So for this, I'm going to say data_cat of col, dot map. 96 00:06:54,470 --> 00:06:57,200 What do I have to map? My dictionary. Simple. 97 00:06:57,620 --> 00:07:00,430 Then I have to update my column as well. 98 00:07:00,650 --> 00:07:07,270 So data_cat of col equals this one. 99 00:07:07,280 --> 00:07:08,960 Just execute it. 
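Editor's sketch of the full loop described above: for each categorical column, compute the per-category mean of the target, turn it into a dictionary with `to_dict`, and write the mapped values back with `map`. The names `data_cat`, `cols`, and `mapping`, plus all values, are assumptions for illustration. The dictionary is called `mapping` here to avoid shadowing Python's built-in `dict`.

```python
import pandas as pd

# Toy categorical frame with a 0/1 target (assumed names and values).
data_cat = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "market_segment": ["Direct", "Online TA", "Direct", "Online TA"],
    "cancellation": [1, 0, 0, 0],
})

# Columns to encode: everything except the target.
cols = [c for c in data_cat.columns if c != "cancellation"]

for col in cols:
    # Category -> mean of the target, as key-value pairs.
    mapping = data_cat.groupby(col)["cancellation"].mean().to_dict()
    # Replace each category string with its mean and update the column.
    data_cat[col] = data_cat[col].map(mapping)

print(data_cat.head())
```

One caveat worth keeping in mind: because the means are computed from the target itself, fitting them on the full dataset before a train/test split can leak target information, so in practice the mapping is usually learned on the training portion only.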
100 00:07:08,960 --> 00:07:09,950 It will take some time. 101 00:07:09,950 --> 00:07:19,490 And again, if I call head over there, now you will figure out all your string data gets 102 00:07:19,490 --> 00:07:26,600 converted into numerical values, and your machine learning model is now able to understand internally what 103 00:07:26,600 --> 00:07:31,370 exactly the meaning of this is, because machine learning only understands numerical data. 104 00:07:32,060 --> 00:07:38,390 So if you guys are not that comfortable with this mean encoding, you can definitely raise 105 00:07:38,390 --> 00:07:43,180 your query in the Q&A section, or you can personally text me as well. 106 00:07:43,730 --> 00:07:46,030 So I hope you will love this approach as well. 107 00:07:46,040 --> 00:07:50,690 There are tons of approaches to convert this into a numerical format, like label encoding 108 00:07:50,690 --> 00:07:51,500 and all these things. 109 00:07:51,890 --> 00:07:56,510 But here I'm going to show you a more advanced approach, because whenever you work on some 110 00:07:56,510 --> 00:08:00,790 real-world data, you have to use some advanced approach. 111 00:08:00,800 --> 00:08:04,650 You don't have to use basic approaches like dummy encoding. 112 00:08:04,700 --> 00:08:11,060 Now, what you have to do: you need your entire data frame, in which you have all the numerical features 113 00:08:11,060 --> 00:08:13,010 and in which you have all the categorical features. 114 00:08:13,510 --> 00:08:17,540 So for this, what I'm going to do: I'm just going to concatenate my two data frames. 115 00:08:18,140 --> 00:08:26,150 So here I'm going to say, in this, data_cat, and after it I have something else. 116 00:08:26,330 --> 00:08:31,580 So here I am going to say data of num_features. 
117 00:08:31,760 --> 00:08:38,390 After having all of this, I have to mention my axis equals one, because here I'm 118 00:08:38,390 --> 00:08:43,860 going to say I have to concatenate column-wise, side by side. After doing all of this, 119 00:08:43,880 --> 00:08:45,350 let me summarize. 120 00:08:45,350 --> 00:08:48,140 Let's say this is my entire data. 121 00:08:48,440 --> 00:08:50,170 I'm going to say this is my dataframe. 122 00:08:50,210 --> 00:08:51,630 Just execute this cell. 123 00:08:51,950 --> 00:09:00,050 And if I call this dataframe dot head to get a quick overview of what my data frame looks 124 00:09:00,050 --> 00:09:00,340 like, 125 00:09:00,380 --> 00:09:04,970 now you will see these are exactly all the features that I have over here. 126 00:09:05,000 --> 00:09:11,260 And you will see over here there are two features that are repeated, which are exactly this 127 00:09:11,270 --> 00:09:15,780 cancellation and canceled; it means you can drop either one of them. 128 00:09:16,100 --> 00:09:21,580 So for this, I'm just going to say dataframe dot drop. 129 00:09:21,590 --> 00:09:23,380 What do I have to drop? 130 00:09:23,390 --> 00:09:28,950 I have to drop this feature, simple; then I have to say axis equals one, because I have to remove 131 00:09:29,150 --> 00:09:30,260 a column. 132 00:09:30,260 --> 00:09:32,700 Then I'm going to say I have to update it as well. 133 00:09:33,020 --> 00:09:35,920 So inplace equals True. 134 00:09:36,590 --> 00:09:39,530 Just execute; all this gets executed. 135 00:09:39,530 --> 00:09:46,580 And if I call my dataframe dot shape over here, now you will see it has that many 136 00:09:46,580 --> 00:09:49,910 rows and it has that many columns. 137 00:09:49,920 --> 00:09:51,580 So that's all about this session. 
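Editor's sketch of the final assembly step: concatenate the encoded categorical frame with the numerical features along `axis=1`, then drop one of the two duplicated target columns. The frame names `data_cat` and `num_data`, the column names `lead_time` and `is_canceled`, and all values are assumptions for illustration. The speaker uses `inplace=True`; this sketch reassigns the result instead, which has the same effect here.

```python
import pandas as pd

# Already-encoded categorical features plus a copy of the target (assumed values).
data_cat = pd.DataFrame({"hotel": [0.5, 0.25], "cancellation": [1, 0]})

# Numerical features, which also carry the target under another name (assumed).
num_data = pd.DataFrame({"lead_time": [10, 200], "is_canceled": [1, 0]})

# axis=1 places the frames side by side, aligned on the row index.
dataframe = pd.concat([data_cat, num_data], axis=1)

# The target now appears twice; axis=1 tells drop to remove a column, not a row.
dataframe = dataframe.drop("cancellation", axis=1)

print(dataframe.shape)
```

Both frames must share the same row index for the side-by-side concatenation to line up; otherwise pandas fills the unmatched rows with missing values.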
138 00:09:51,650 --> 00:09:58,130 In the upcoming session, we are going to discuss what exactly an outlier is and how to handle outliers 139 00:09:58,130 --> 00:09:58,850 in your data. 140 00:09:59,330 --> 00:09:59,660 So I hope you liked this 141 00:10:00,020 --> 00:10:06,680 discussion very much, and if you still have a query, feel free to ask; just raise it in the Q&A section 142 00:10:06,710 --> 00:10:09,160 or you can personally text me as well. 143 00:10:09,170 --> 00:10:10,260 So thank you. 144 00:10:10,300 --> 00:10:11,170 Have a nice day. 145 00:10:11,350 --> 00:10:12,260 Keep learning. 146 00:10:12,260 --> 00:10:13,160 Keep growing. 147 00:10:13,340 --> 00:10:14,210 Keep practicing.