1 00:00:00,090 --> 00:00:03,250 Hello and welcome to the very first session on this project. 2 00:00:03,270 --> 00:00:06,980 So this is exactly the very first session, and in this session, 3 00:00:06,990 --> 00:00:11,370 this is our very first statement that we have to go through. 4 00:00:11,370 --> 00:00:18,090 The very first one is we have to perform data cleaning on our data and we have to prepare our data for 5 00:00:18,090 --> 00:00:25,200 our modeling purposes, because whenever you have your data in a real-world scenario, you will never have 6 00:00:25,200 --> 00:00:26,300 your clean data. 7 00:00:26,310 --> 00:00:28,550 You always have your raw data. 8 00:00:28,560 --> 00:00:30,240 You always have your messy data. 9 00:00:30,420 --> 00:00:36,690 So you have to deal with that raw and messy data and you have to prepare that data in such a way that 10 00:00:36,690 --> 00:00:40,880 you can build a cool and fancy machine learning model on top of that. 11 00:00:41,220 --> 00:00:43,500 So I'm just going to open my Jupyter notebook. 12 00:00:43,500 --> 00:00:49,860 I open it here and this is exactly my Jupyter notebook, so first I have to import some basic modules 13 00:00:49,860 --> 00:00:57,930 like pandas, NumPy, Matplotlib, Seaborn, Plotly for data extraction, manipulation and visualization as well. 14 00:00:58,530 --> 00:01:05,740 So very first, let's say I'm just going to import my pandas and I'm going to create its alias as pd, and just 15 00:01:05,760 --> 00:01:06,510 import it. 16 00:01:06,510 --> 00:01:09,180 And after that I have to import my NumPy as well, 17 00:01:09,180 --> 00:01:16,350 for mathematical computations or for descriptive statistics, all the stuff that is going to be covered in 18 00:01:16,350 --> 00:01:18,430 my NumPy module. After that, 19 00:01:18,450 --> 00:01:22,160 we have to use some data visualization library. 20 00:01:22,170 --> 00:01:27,590 So I'm just going to say import matplotlib.
21 00:01:27,960 --> 00:01:35,040 So here I'm going to say matplotlib dot pyplot as plt for better representation. 22 00:01:35,040 --> 00:01:38,280 You guys can also go ahead with Seaborn. 23 00:01:38,280 --> 00:01:40,190 So I'm going to import my Seaborn as well. 24 00:01:40,590 --> 00:01:41,820 So just execute it. 25 00:01:41,820 --> 00:01:48,750 It will take a while and all your stuff gets successfully executed over here. Now, very first, what do you 26 00:01:48,750 --> 00:01:49,320 have to do? 27 00:01:49,620 --> 00:01:56,240 You have to read your data, because before performing any kind of analysis, you have to prepare the data for 28 00:01:56,280 --> 00:01:58,770 analysis as well as for the modeling purpose. 29 00:01:58,980 --> 00:02:06,450 So I'm just going to say pd dot read underscore csv, because my data is exactly in the CSV 30 00:02:06,450 --> 00:02:06,900 format. 31 00:02:07,170 --> 00:02:09,660 So you will figure out this is exactly 32 00:02:09,660 --> 00:02:15,870 my dataset, which is hotel underscore bookings, but very first you have to give the path where exactly your data is 33 00:02:15,870 --> 00:02:16,260 available. 34 00:02:16,260 --> 00:02:21,780 So I'm just going to copy this entire path and I'm just going to paste it over here, and here, 35 00:02:21,990 --> 00:02:28,650 if you press Tab, you will get all the files available at that particular path. You will get this; this is 36 00:02:28,650 --> 00:02:34,130 exactly the data that you need to read. After reading the data, 37 00:02:34,170 --> 00:02:38,490 what am I going to do? It will exactly return some DataFrame object. 38 00:02:38,490 --> 00:02:42,490 So I'm going to say it is exactly my df. Just execute it. Next, 39 00:02:42,520 --> 00:02:47,250 I'm just going to call head to get a preview of how exactly my data looks.
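The import, read and preview steps described so far can be sketched roughly as follows. This is a minimal sketch, not the notebook from the session: the few CSV rows are made up so the example runs without the real file, and the actual path to hotel_bookings.csv is not given in the recording.

```python
import io

import numpy as np
import pandas as pd
# The session also imports the plotting libraries at this point:
# import matplotlib.pyplot as plt
# import seaborn as sns

# Hypothetical stand-in for hotel_bookings.csv (made-up rows, assumption).
sample_csv = io.StringIO(
    "hotel,adults,children,babies,country\n"
    "Resort Hotel,2,0,0,PRT\n"
    "City Hotel,1,1,0,\n"
    "City Hotel,0,0,0,GBR\n"
)

# With the real file, the argument would be the path to hotel_bookings.csv.
df = pd.read_csv(sample_csv)

# Preview the first rows of the DataFrame.
print(df.head())
```

In a notebook, `df.head()` as the last expression of a cell renders the preview without `print`.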
40 00:02:47,250 --> 00:02:53,370 You will see this is exactly that data on which you have to do lots of analysis, lots of preprocessing, 41 00:02:53,580 --> 00:02:56,670 and lots of insights you have to fetch from this data. 42 00:02:57,060 --> 00:03:03,450 And let me check how many rows and how many columns we have in this data. So for this, 43 00:03:03,450 --> 00:03:05,730 you guys can just call shape over here. 44 00:03:05,760 --> 00:03:10,530 So here I'm going to say df dot shape. Execute it and you will observe 45 00:03:10,740 --> 00:03:15,290 it has that many rows and that many columns. 46 00:03:15,570 --> 00:03:20,010 Now, let me check how many missing values we have in our data. For this, 47 00:03:20,010 --> 00:03:25,290 I'm just going to say df dot isnull, which means is any null value available, 48 00:03:25,560 --> 00:03:30,750 dot sum, to do a summation of all the missing values in your data. 49 00:03:31,260 --> 00:03:37,660 Just execute this command and you will figure out you have that huge a number of missing values available 50 00:03:37,660 --> 00:03:38,100 over here. 51 00:03:38,340 --> 00:03:40,890 You will see in this country column you have approx. 52 00:03:40,890 --> 00:03:43,650 five hundred. In this agent and company, 53 00:03:43,650 --> 00:03:47,580 you will see what a huge number of missing values we have. 54 00:03:47,580 --> 00:03:49,950 So we have to deal with these missing values. 55 00:03:50,070 --> 00:03:53,640 So let me define some function over here so that we can deal with that. 56 00:03:53,820 --> 00:04:00,390 So here I'm going to say, let's say, my function name is data underscore clean, and what 57 00:04:00,390 --> 00:04:01,980 exactly this function will receive, 58 00:04:01,980 --> 00:04:05,160 we will define later. And what will this function do? 59 00:04:05,520 --> 00:04:07,050 This function does nothing complicated.
60 00:04:07,050 --> 00:04:11,520 It will just fill my missing values with basically zero. 61 00:04:11,520 --> 00:04:16,860 So I'm just going to say df dot fillna, just going to fill my missing values with zero, and then after that I'm going 62 00:04:16,860 --> 00:04:18,900 to update the data frame as well, 63 00:04:18,900 --> 00:04:22,230 so just assign inplace equals to True over here. 64 00:04:22,230 --> 00:04:28,620 I'm going to say, whatever data frame this function will receive, just do all these operations. After that, what 65 00:04:28,620 --> 00:04:34,530 do we have to do? I have to print something. Let's say after performing all the manipulations I'm going to print 66 00:04:34,830 --> 00:04:38,460 whether I have any null value or not. So for this, just call 67 00:04:38,460 --> 00:04:40,200 isnull dot sum over here. 68 00:04:40,320 --> 00:04:42,090 Now we have to just execute this cell. 69 00:04:42,100 --> 00:04:44,580 And now what we have to do, we have to call this function. 70 00:04:44,580 --> 00:04:46,660 So I'm going to say data underscore clean. 71 00:04:47,010 --> 00:04:53,490 And in this function, you will see it just receives a single parameter, which is exactly your data 72 00:04:53,490 --> 00:04:53,790 frame. 73 00:04:53,790 --> 00:04:58,200 So I have to just pass this df over here and just execute it. 74 00:04:58,200 --> 00:04:59,010 It will take a while. 75 00:04:59,010 --> 00:04:59,490 And this 76 00:04:59,900 --> 00:05:05,630 beautiful statistic: you will figure out that you don't have any missing value in your data. 77 00:05:05,980 --> 00:05:08,750 That is exactly what we want. 78 00:05:08,770 --> 00:05:12,410 But still, you have to do a lot of preprocessing on your data. 79 00:05:12,480 --> 00:05:19,030 So very first, if I'm going to call df dot columns over here, you will see all your columns available 80 00:05:19,030 --> 00:05:19,510 over here. 81 00:05:19,870 --> 00:05:27,530 So in all these columns, you will figure out these adults, these children, and these babies.
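The shape check, the missing-value count and the data_clean function described above might look like this. It is a sketch under stated assumptions: the small DataFrame is hypothetical stand-in data (numeric columns only), and the function body is reconstructed from the spoken description rather than copied from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the hotel bookings data (assumption).
df = pd.DataFrame({
    "adults": [2, 1, 0],
    "children": [0.0, np.nan, 0.0],
    "agent": [np.nan, 9.0, np.nan],
    "company": [np.nan, np.nan, 40.0],
})

print(df.shape)           # (number of rows, number of columns)
print(df.isnull().sum())  # count of missing values per column


def data_clean(df):
    """Fill every missing value with 0 in place, then print what remains."""
    df.fillna(0, inplace=True)
    print(df.isnull().sum())


data_clean(df)
```

Because `inplace=True` modifies the DataFrame that was passed in, the caller's `df` is updated and nothing needs to be returned.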
82 00:05:27,820 --> 00:05:32,720 So let me copy all three variables and let me create a list here. 83 00:05:32,770 --> 00:05:34,780 So I'm just going to create a list. 84 00:05:35,050 --> 00:05:38,860 And on this list, what exactly is my strategy over here? 85 00:05:38,920 --> 00:05:45,070 So I'm just going to check how many unique values, or you can say the number of unique values, in each and 86 00:05:45,070 --> 00:05:47,770 every feature, in each and every element of this list. 87 00:05:48,010 --> 00:05:53,670 For this, I'm going to say for i in list. Very first, I have to iterate over each and every element. 88 00:05:53,670 --> 00:05:55,420 So I'm going to say for i in list. 89 00:05:55,720 --> 00:06:04,120 Once I have this iteration, then I'm going to say on this i, it means df of i dot unique. 90 00:06:04,540 --> 00:06:07,180 So what I have to do, I have to print this. Here 91 00:06:07,180 --> 00:06:09,780 I'm going to say I have to print this. 92 00:06:09,970 --> 00:06:15,610 So let me just add some placeholder, and whatever the value of this placeholder, 93 00:06:16,270 --> 00:06:20,210 it will exactly receive values from my format function. 94 00:06:20,440 --> 00:06:30,230 So here I'm going to say this placeholder "has unique values as", and again I have to add a placeholder, and 95 00:06:30,230 --> 00:06:38,380 the very first placeholder will receive its value from this format as whatever you have in your iteration, 96 00:06:38,500 --> 00:06:43,440 and the second placeholder will receive its value from this df 97 00:06:43,810 --> 00:06:46,240 of i dot unique. It means this 98 00:06:46,240 --> 00:06:53,730 i will get shifted over here and this df of i dot unique will get shifted over here. 99 00:06:53,740 --> 00:06:54,310 That's it. 100 00:06:54,640 --> 00:06:58,400 Just provide the small brackets and just execute. 101 00:06:59,110 --> 00:07:02,730 Now you will figure out adults has unique values as this.
102 00:07:02,770 --> 00:07:07,030 Children has unique values as this, and all this stuff over here. 103 00:07:07,690 --> 00:07:09,360 So you will figure it out over here. 104 00:07:09,640 --> 00:07:16,810 It seems to have some dirtiness in the data, because these adults and these children and these babies can't 105 00:07:16,810 --> 00:07:20,050 have zero at a time, because you will see here I have zero. 106 00:07:20,050 --> 00:07:20,800 Here I have zero. 107 00:07:20,860 --> 00:07:22,520 And here I have zero as well. 108 00:07:22,840 --> 00:07:27,040 So let me create a filter over here. So my filter will be nothing 109 00:07:27,040 --> 00:07:28,930 but, let me just create a filter, 110 00:07:29,140 --> 00:07:32,560 my filter is nothing but df of children 111 00:07:33,190 --> 00:07:34,380 equals equals to zero. 112 00:07:34,390 --> 00:07:37,750 So this is exactly my very first filter. 113 00:07:37,750 --> 00:07:43,750 So I'm going to say this is exactly my very first filter, and then I'm going to define my second 114 00:07:43,750 --> 00:07:49,350 filter, let's say df of adults equals equals to zero. 115 00:07:49,360 --> 00:07:51,940 So this is exactly my second filter. 116 00:07:52,290 --> 00:07:55,590 And you can say this is exactly my second condition of this filter. 117 00:07:55,780 --> 00:07:57,790 And after that we have a third condition. 118 00:07:57,790 --> 00:08:02,480 And the third condition is nothing but df of babies equals 119 00:08:02,500 --> 00:08:03,580 equals to zero. 120 00:08:03,580 --> 00:08:06,070 So this is exactly my entire filter. 121 00:08:06,080 --> 00:08:08,260 So I'm going to say this is exactly my filter. 122 00:08:08,530 --> 00:08:16,330 And if I'm going to pass this filter in my data frame, now you will get some data over here. 123 00:08:16,600 --> 00:08:22,360 Now you will figure out over here, and if I scroll over here, you will figure out you don't 124 00:08:22,360 --> 00:08:24,010 have all your columns.
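The unique-value loop and the three-condition filter described above can be sketched like this. The sample rows and the variable names `cols` and `no_guests` are assumptions for illustration; the recording does not show the names used in the notebook.

```python
import pandas as pd

# Hypothetical guest counts standing in for the real dataset (assumption).
df = pd.DataFrame({
    "adults": [2, 1, 0, 2],
    "children": [0, 1, 0, 0],
    "babies": [0, 0, 0, 1],
})

cols = ["adults", "children", "babies"]
for col in cols:
    # First placeholder gets the column name, second gets its unique values.
    print("{} has unique values as {}".format(col, df[col].unique()))

# True only for rows where all three counts are zero at the same time.
no_guests = (df["children"] == 0) & (df["adults"] == 0) & (df["babies"] == 0)

# Passing the boolean mask to the DataFrame keeps only the matching rows.
print(df[no_guests])
```

Each comparison is wrapped in parentheses because `&` binds more tightly than `==` in Python.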
125 00:08:24,250 --> 00:08:26,780 So let me display all my columns over here. 126 00:08:27,040 --> 00:08:28,690 So for this, what am I going to do? 127 00:08:28,690 --> 00:08:33,460 I'm just going to say pd dot set underscore option. 128 00:08:33,460 --> 00:08:36,070 You have this inbuilt function over here, and here 129 00:08:36,070 --> 00:08:44,440 I'm going to say display dot max underscore columns, and here I have 32 columns, so I have to display 130 00:08:44,440 --> 00:08:46,060 thirty-two. 131 00:08:46,060 --> 00:08:55,450 And if, again, I'm just going to copy all this stuff, paste it over here, and execute it again, now 132 00:08:55,450 --> 00:09:04,180 you will figure out this is exactly my data, and here you will see I have adults, children, babies, and all 133 00:09:04,180 --> 00:09:05,640 of these are zero. 134 00:09:06,070 --> 00:09:13,950 But if you think logically, it is not possible that adults, children and babies can all be zero 135 00:09:13,960 --> 00:09:14,950 simultaneously. 136 00:09:14,950 --> 00:09:17,920 It means these are exactly your noise. 137 00:09:18,130 --> 00:09:20,290 These are exactly your wrong 138 00:09:20,290 --> 00:09:21,690 entries in your data points. 139 00:09:21,700 --> 00:09:22,720 You have to remove these. 140 00:09:23,110 --> 00:09:24,790 So for this, what am I going to do? 141 00:09:24,790 --> 00:09:29,470 I'm just going to say I just need a negation of this filter. 142 00:09:29,480 --> 00:09:31,710 So for negation, you have to use this operator. 143 00:09:31,730 --> 00:09:33,640 I'm just going to say negation of this filter. 144 00:09:33,640 --> 00:09:34,140 That's it. 145 00:09:34,960 --> 00:09:38,370 So if I execute it now, you will see over here, 146 00:09:38,710 --> 00:09:45,970 this is exactly that data on which you have to perform all your analysis, on which you have to build 147 00:09:46,210 --> 00:09:51,370 your cool machine learning model after doing lots of feature engineering on your data.
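The display option and the negated filter might be written as below. This is a sketch with made-up rows; `no_guests` is a hypothetical name for the filter, `~` is the pandas operator for negating a boolean mask, and `data` is used as the result name to match what the session calls it.

```python
import pandas as pd

# Show up to 32 columns when a DataFrame is displayed.
pd.set_option("display.max_columns", 32)

# Hypothetical guest counts standing in for the real dataset (assumption).
df = pd.DataFrame({
    "adults": [2, 1, 0, 2],
    "children": [0, 1, 0, 0],
    "babies": [0, 0, 0, 1],
})

# Rows where adults, children and babies are all zero are treated as noise.
no_guests = (df["children"] == 0) & (df["adults"] == 0) & (df["babies"] == 0)

# ~ negates the mask, so only bookings with at least one guest are kept.
data = df[~no_guests]
print(data.head())
```

The original `df` is left unchanged; `data` holds the filtered rows, still carrying their original index labels.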
148 00:09:51,670 --> 00:09:54,280 So let's say this is exactly my data. 149 00:09:54,280 --> 00:09:57,510 So let me name this data as, let's say, data. 150 00:09:57,910 --> 00:09:59,740 And if I'm going to call 151 00:10:00,240 --> 00:10:02,430 head over here, you will figure out 152 00:10:02,460 --> 00:10:09,690 this is exactly the dataset on which you have to do lots of preprocessing, lots of cleaning, and then you have 153 00:10:09,690 --> 00:10:13,950 to build a model. That's all about the session. Hope you liked the session very much. 154 00:10:14,600 --> 00:10:18,810 Have a nice day. Keep learning, keep growing, keep practicing.