1 00:00:00,270 --> 00:00:01,260 Welcome back. 2 00:00:01,260 --> 00:00:06,780 In the last video we saw how to convert non-numerical data into numerical data. 3 00:00:06,780 --> 00:00:10,950 That way we could feed it to a machine learning model, or at least get one to try and figure out some patterns 4 00:00:10,950 --> 00:00:11,870 in our data. 5 00:00:11,940 --> 00:00:16,980 In this one we're going to figure out what we can do when we've got missing values. Creating a little heading 6 00:00:16,980 --> 00:00:18,240 here. 7 00:00:18,240 --> 00:00:23,220 "What if there were missing values?" 8 00:00:23,220 --> 00:00:26,010 There are two main ways to deal with missing data. 9 00:00:26,130 --> 00:00:28,170 Let's put that here, actually. 10 00:00:28,230 --> 00:00:33,690 So we'll go: one, fill them with some value, 11 00:00:33,690 --> 00:00:36,450 also known as imputation. 12 00:00:36,450 --> 00:00:41,100 So this is when you have, say, some samples and they're missing information. You could fill it with some 13 00:00:41,100 --> 00:00:41,900 value. 14 00:00:42,120 --> 00:00:43,730 That's called imputing data. 15 00:00:43,740 --> 00:00:50,680 The other one is to remove the samples with missing data altogether. 16 00:00:50,680 --> 00:00:55,640 Now it should be known that there's no real good way to fully deal with missing data. 17 00:00:55,790 --> 00:01:01,110 If you do fill missing data with some value, you're going to be kind of introducing some non-real data to 18 00:01:01,110 --> 00:01:02,030 your dataset. 19 00:01:02,160 --> 00:01:06,420 And if you do remove samples with missing values, well, then you're kind of going to be working with fewer 20 00:01:06,420 --> 00:01:06,870 samples. 21 00:01:06,870 --> 00:01:10,780 So in the real world you're not always going to have complete datasets. 22 00:01:10,830 --> 00:01:15,050 But when you do deal with missing data, how to do it is still kind of a big area of research.
23 00:01:15,060 --> 00:01:20,070 But we're going to go through basically these two ways of figuring it out so we can at least get a machine 24 00:01:20,070 --> 00:01:24,420 learning model working with data that once had missing values. 25 00:01:24,420 --> 00:01:28,290 So in saying that, we do need a dataset that has missing values. 26 00:01:28,380 --> 00:01:34,560 So luckily here's something I've prepared earlier: my beautiful Google Sheet, which is our faithful car 27 00:01:34,560 --> 00:01:35,550 sales dataset. 28 00:01:35,570 --> 00:01:37,210 It's missing data here. 29 00:01:37,260 --> 00:01:40,200 We keep going through. Far out. 30 00:01:40,200 --> 00:01:40,920 What have we done? 31 00:01:40,920 --> 00:01:44,120 Someone has, we've failed at recording with this dataset. 32 00:01:44,130 --> 00:01:48,350 And this is kind of what you'll often see in the real world when you're working with 33 00:01:48,350 --> 00:01:48,990 datasets. 34 00:01:49,050 --> 00:01:53,190 You'll have these samples, so like row one. Row one is missing the color of the car, 35 00:01:53,220 --> 00:01:55,260 but it has everything else. 36 00:01:55,320 --> 00:01:56,370 Well, that's all right. 37 00:01:56,370 --> 00:01:57,900 Let's see how we'd work with it. 38 00:01:57,900 --> 00:02:06,740 So the first thing we do is we're going to import the car sales missing data. We're going to call it something 39 00:02:06,740 --> 00:02:07,490 nice and easy. 40 00:02:07,490 --> 00:02:15,200 car_sales_missing equals pd.read_csv, and remember everything's in the data folder. It might not be 41 00:02:15,200 --> 00:02:22,590 on your machine, but in my case it is. Car sales extended, mine's called missing data dot CSV. 42 00:02:22,700 --> 00:02:23,650 Wonderful. 43 00:02:23,660 --> 00:02:24,060 I'm going to go 44 00:02:24,060 --> 00:02:26,820 car_sales_missing, we're going to look at the head of it. 45 00:02:26,900 --> 00:02:27,440 Beautiful. 46 00:02:28,160 --> 00:02:28,810 Excellent.
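The loading step described above can be sketched roughly as follows. The exact file name is only spoken in the video, not spelled out, so this sketch reads a few made-up rows from an in-memory string instead of guessing a path. The column names and values here are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the car sales CSV with missing data.
# Column names and values are assumed; empty cells mimic the gaps in the sheet.
csv_text = """Make,Colour,Odometer (KM),Doors,Price
Honda,White,35431,4,15323
BMW,,192714,5,19943
,Blue,,4,
Toyota,White,154365,,13434
"""

# pd.read_csv accepts a file path or any file-like object.
car_sales_missing = pd.read_csv(io.StringIO(csv_text))

# head() shows the first rows; empty CSV cells are parsed as NaN.
print(car_sales_missing.head())
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV in your own data folder.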
47 00:02:28,820 --> 00:02:29,450 So we've got it here. 48 00:02:29,460 --> 00:02:35,150 It doesn't look like it's missing any data, but a real quick way to check if it is: we go 49 00:02:35,150 --> 00:02:40,520 car_sales_missing. The convenience of having it in a pandas DataFrame is that we can call pandas functions: 50 00:02:41,170 --> 00:02:44,000 isna, for "is None" or "is NaN". 51 00:02:44,120 --> 00:02:45,780 I think in pandas it's called NaN. 52 00:02:45,860 --> 00:02:52,920 If we do import this, it will fill it with NaN, so it's something like NaN. I'll keep that blank there. 53 00:02:54,020 --> 00:02:59,890 So we'll go here isna dot sum, and this will show us how many missing values there are. 54 00:03:00,010 --> 00:03:05,470 So let's go Shift+Enter. So in the Make column there are forty-nine missing values. In the Color column 55 00:03:05,470 --> 00:03:11,140 there are fifty, and the other columns have about fifty-odd missing values each. 56 00:03:11,140 --> 00:03:14,590 All right, let's try and convert it to numbers. 57 00:03:14,590 --> 00:03:22,270 Let's try and convert our data to numbers, just like we did in the previous video. What we might do is 58 00:03:22,270 --> 00:03:29,110 cheat a little here. We'll move up, we'll come up here to the code that we've got before. We're going to 59 00:03:29,110 --> 00:03:35,500 just copy this, and this is one of the only times I'll allow you to copy code, just to save a little bit 60 00:03:35,500 --> 00:03:43,810 of time here. Go here. But first I think what we might need to do is transform this, make it into X and 61 00:03:43,810 --> 00:03:51,330 y, because otherwise we're going to be using it on the wrong X. So let's do that. We need a new X and y. 62 00:03:52,710 --> 00:03:57,960 Create X and y. We want X equals
63 00:03:57,960 --> 00:04:05,250 car_sales_missing dot drop, "Price". We're going to do the same again, we're going to use the example of trying 64 00:04:05,250 --> 00:04:13,900 to use these four columns to predict the Price column. It's "Price", we've got axis equals one. y equals 65 00:04:13,960 --> 00:04:21,700 car_sales_missing, and we're going to get the Price column. Wonderful. I want to try to do this and 66 00:04:21,700 --> 00:04:29,770 see what happens. ValueError. Why is this? So we've tried to do the exact same code as before, you saw me 67 00:04:29,770 --> 00:04:38,930 copy and paste that. We've got X. Why do we think that's not working? "Input contains NaN." 68 00:04:38,930 --> 00:04:42,640 Of course, that's why. car_sales_missing, let's check it out. 69 00:04:42,690 --> 00:04:45,050 Will this show us a NaN? Yes, it will. 70 00:04:45,050 --> 00:04:49,970 So we've got at least one NaN that we can see here. In the Make column there are about forty-nine of these 71 00:04:49,970 --> 00:04:54,950 NaNs. In the Color column, Odometer, Doors, and Price there are 50 of them. 72 00:04:55,270 --> 00:05:00,710 So what we have to do before we can even convert our data into numbers is do something 73 00:05:01,280 --> 00:05:03,090 with these missing values. 74 00:05:03,140 --> 00:05:05,300 So let's see how we might do that. 75 00:05:05,330 --> 00:05:07,680 There's a couple of ways. 76 00:05:07,730 --> 00:05:13,250 The first one is that we can fill missing data with pandas, so let's check this out. We'll make another 77 00:05:13,250 --> 00:05:21,200 heading, fill missing, might put in this: "Option 1: Fill missing data with pandas". 78 00:05:24,200 --> 00:05:25,180 Let's see what we might do. 79 00:05:25,190 --> 00:05:32,360 So we might go "fill the Make column". We'll do this for each column, actually. Fill the Make column, and 80 00:05:32,360 --> 00:05:32,720 then we'll go
81 00:05:32,720 --> 00:05:41,060 car_sales_missing. What we might do is, for any categorical column, we'll fill it with a "missing" string. Any 82 00:05:41,060 --> 00:05:45,190 of the others, like numerical, we might just fill with the mean of the column. 83 00:05:45,190 --> 00:05:49,070 Remember how I said filling data is never a perfect practice? 84 00:05:49,100 --> 00:05:52,790 So if there are missing values and you fill them with something, it's never going to be perfect, 85 00:05:52,790 --> 00:05:54,940 like if you actually had the real value there. 86 00:05:55,370 --> 00:06:00,590 But there are a few things that are kind of just like level-one ways of filling data. Like a categorical 87 00:06:00,590 --> 00:06:05,510 column, you might fill it with "missing". A numerical column, you might fill it with the mean, or like the average, 88 00:06:05,510 --> 00:06:09,950 of that column. Any missing values will get the average of that column, so they're not too far away 89 00:06:09,950 --> 00:06:16,130 from the rest of the values and you're not filling it with some massive outlier value. 90 00:06:16,130 --> 00:06:17,630 Let's see what that will look like. 91 00:06:17,630 --> 00:06:24,860 "Make", dot fillna, we're going to go "missing", and then we'll go inplace equals True because we want it to 92 00:06:24,860 --> 00:06:33,020 happen straight away. And we go "fill the Color column". We're going to do the same thing. 93 00:06:33,020 --> 00:06:40,040 car_sales_missing, we're going to get the Color column in there, fillna, and we can just do the exact 94 00:06:40,040 --> 00:06:45,300 same as we've got above because they're both categorical. inplace equals True. 95 00:06:45,350 --> 00:06:46,560 Wonderful. 96 00:06:46,580 --> 00:06:50,530 Now we're going to fill the Odometer column. 97 00:06:50,660 --> 00:06:53,530 Now remember, we're doing it in order here. 98 00:06:54,110 --> 00:07:00,510 So we've got Make, Color, Odometer. Fill the Odometer column. We might do the mean here.
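To make the two "level-one" fill strategies above concrete, here is a tiny sketch on two made-up pandas Series: one categorical, filled with the string "missing", and one numerical, filled with its own mean. The values are invented for illustration and are not from the car sales data.

```python
import numpy as np
import pandas as pd

# Made-up example values (assumptions, not taken from the real dataset).
colours = pd.Series(["White", None, "Blue"])
odometer = pd.Series([10000.0, np.nan, 30000.0])

# Categorical column: replace missing entries with a placeholder string.
colours_filled = colours.fillna("missing")

# Numerical column: mean() skips NaN by default, so the mean here is 20000.0.
odometer_filled = odometer.fillna(odometer.mean())

print(colours_filled.tolist())
print(odometer_filled.tolist())
```

The filled odometer value sits in the middle of the observed values, which is the point made above about not introducing a large outlier.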
99 00:07:00,590 --> 00:07:04,360 So car_sales_missing. 100 00:07:04,360 --> 00:07:06,080 How would we do this? 101 00:07:06,090 --> 00:07:07,180 You want to fill it. 102 00:07:07,180 --> 00:07:11,300 Odometer KM. 103 00:07:11,310 --> 00:07:13,920 We have to put that little space. Same way, fillna. 104 00:07:13,940 --> 00:07:17,160 But this time we want the mean. Scratches head. 105 00:07:17,160 --> 00:07:19,330 car_sales_missing, 106 00:07:19,610 --> 00:07:23,210 Odometer KM. 107 00:07:23,220 --> 00:07:25,510 This might turn out to be fairly long. 108 00:07:25,680 --> 00:07:26,960 It's actually pretty easy. 109 00:07:27,030 --> 00:07:31,890 Dot mean. Overthinking it. 110 00:07:31,950 --> 00:07:36,460 Classic. inplace equals True. 111 00:07:36,600 --> 00:07:37,260 All right. 112 00:07:37,260 --> 00:07:46,060 And then finally we want to fill the Doors column. And Doors is a different one as well, because 113 00:07:46,150 --> 00:07:52,230 we said before that it can kind of be a category. So what we might do is, because the majority of the doors, 114 00:07:52,390 --> 00:07:55,620 like say if we go here, we check it out. 115 00:07:55,630 --> 00:08:02,950 car_sales, how many doors would you say the average car has? Like, if you had to guess how many doors a 116 00:08:02,950 --> 00:08:05,620 random car, how many doors does it have? 117 00:08:06,550 --> 00:08:10,270 I'm going to say four, and our data kind of agrees with me there. 118 00:08:10,270 --> 00:08:15,670 So we're just going to fill the missing Doors values in our DataFrame with 4. 119 00:08:15,880 --> 00:08:24,320 So fillna, nice and simple, 4, inplace equals True. Wonderful. 120 00:08:24,340 --> 00:08:28,270 Now let's check it again, check our DataFrame again. 121 00:08:28,930 --> 00:08:35,200 So we want car_sales_missing, what's it look like? car_sales_missing dot isna dot sum. How many missing values 122 00:08:35,200 --> 00:08:35,890 do we have now? 123 00:08:37,230 --> 00:08:38,250 Wonderful.
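The four fills walked through above could look roughly like this on a small made-up DataFrame. One deliberate difference from the video: instead of calling fillna with inplace=True on a selected column, this sketch assigns the filled column back, because newer pandas versions may warn about (or not apply) in-place changes made through a column selection. The column names, including "Odometer (KM)", and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_sales_missing; column names and values are assumed.
car_sales_missing = pd.DataFrame({
    "Make": ["Toyota", None, "Honda", "BMW"],
    "Colour": ["White", "Blue", None, "Red"],
    "Odometer (KM)": [30000.0, np.nan, 90000.0, 60000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, 9000.0, np.nan, 22000.0],
})

# Categorical columns: fill with the placeholder string "missing".
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Numerical column: fill with the column mean (NaN is ignored by mean()).
odometer_mean = car_sales_missing["Odometer (KM)"].mean()
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(odometer_mean)

# Doors: fill with 4, the value guessed to be most common in the video.
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

# Count what is still missing; only Price should have gaps left.
missing_counts = car_sales_missing.isna().sum()
print(missing_counts)
```

Price is left alone on purpose here, since it is the label rather than a feature.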
124 00:08:38,280 --> 00:08:41,790 So all of the feature columns, so Make, Color, Odometer, 125 00:08:41,790 --> 00:08:45,540 Doors, are all full of values, except for the Price column. 126 00:08:45,540 --> 00:08:50,220 Now we kind of left this one out on purpose because Price is special, because it's the one we're trying 127 00:08:50,220 --> 00:08:57,580 to predict. What we might do is just remove the rows from our DataFrame that have missing Price values. 128 00:08:57,630 --> 00:09:03,630 So we are going to lose some data, but it's hard to really predict something when it doesn't have a label. 129 00:09:03,630 --> 00:09:05,230 So let's do that. 130 00:09:05,400 --> 00:09:13,430 "Remove rows with missing Price value". So we can do this: 131 00:09:13,430 --> 00:09:22,430 car_sales_missing dot dropna, inplace equals True, and that should just remove the Price rows, or 132 00:09:22,430 --> 00:09:26,660 the rows in the DataFrame which don't have a Price value. 133 00:09:26,660 --> 00:09:28,190 Well, let's check it one more time. 134 00:09:28,190 --> 00:09:34,100 car_sales_missing dot isna dot sum. Beautiful. 135 00:09:34,110 --> 00:09:37,310 So we have no missing values in our DataFrame anymore. 136 00:09:37,350 --> 00:09:44,090 So what this means is that we've kind of lost some data. We did have a thousand rows at the start. 137 00:09:44,090 --> 00:09:45,800 Now we have 950. 138 00:09:45,860 --> 00:09:51,440 We have lost 50 samples as the sacrifice for filling up all the data and removing any samples that don't 139 00:09:51,440 --> 00:09:55,490 have a label, a.k.a. the Price column. 140 00:09:55,490 --> 00:10:02,240 So now let's bring back our code from before, the one that errored and gave us that disgusting error. 141 00:10:02,390 --> 00:10:05,140 So let's bring this down here. 142 00:10:05,190 --> 00:10:07,480 We're going to fill this up.
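A minimal sketch of the row-removal step above, on made-up data where the feature columns are already filled and only the label has a gap. The video calls dropna with inplace=True and no other arguments, which has the same effect at that point because Price is the only column with missing values left. The subset argument used here is a variation that states the intent explicitly. Column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in: features are complete, one Price label is missing.
car_sales_missing = pd.DataFrame({
    "Make": ["Toyota", "missing", "Honda", "BMW"],
    "Doors": [4.0, 4.0, 4.0, 5.0],
    "Price": [15000.0, 9000.0, np.nan, 22000.0],
})

rows_before = len(car_sales_missing)

# Drop only the rows whose Price is missing; other columns are not considered.
car_sales_missing = car_sales_missing.dropna(subset=["Price"])

rows_after = len(car_sales_missing)
print(rows_before, rows_after)
print(car_sales_missing.isna().sum())
```

In the video the same step takes the data from 1000 rows to 950, since 50 rows had no Price.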
143 00:10:07,850 --> 00:10:19,470 Actually, we need to re-split our data into X and y, so X equals car_sales_missing dot drop, "Price", axis 144 00:10:19,560 --> 00:10:20,340 equals one. 145 00:10:20,340 --> 00:10:21,690 Now this is really good practice, right? 146 00:10:21,690 --> 00:10:26,910 It seems that we're just retyping a lot of the same code again, and we are. Usually you want 147 00:10:26,910 --> 00:10:32,790 to kind of avoid this, but we're seeing what it's like just continually filling up data or massaging 148 00:10:32,790 --> 00:10:34,470 data into the form that we need. 149 00:10:34,470 --> 00:10:40,500 So here's continual practice. What you'll always need to do is split data into X and y, or data and labels. 150 00:10:40,820 --> 00:10:47,610 And what am I saying here, "target"? I'm thinking of our categorical problem, our classification problem 151 00:10:47,610 --> 00:10:53,100 that we worked on before. Now let's copy and paste this code from before. We should have filled up all 152 00:10:53,100 --> 00:10:53,860 the values. 153 00:10:53,880 --> 00:10:59,200 So fingers crossed, this should work similar to how it did before. 154 00:10:59,290 --> 00:11:00,300 Wonderful. 155 00:11:00,310 --> 00:11:05,740 Now what it's kind of done is given this output instead of just outputting numbers. 156 00:11:05,870 --> 00:11:12,760 So what we might do is, we don't really need X, car sales... there we go. 157 00:11:13,330 --> 00:11:21,410 So we can see that now we have converted again our car sales missing data into numbers, and we've filled it. 158 00:11:21,550 --> 00:11:24,160 So that is just one option. 159 00:11:24,160 --> 00:11:28,520 You might be thinking, we're inside Scikit-Learn, why do we use pandas? 160 00:11:28,570 --> 00:11:35,020 Well, pandas is very versatile and kind of fun to use. I really like using pandas, but we can actually 161 00:11:35,020 --> 00:11:36,300 do what we've just done,
162 00:11:36,460 --> 00:11:42,450 so filling up these values here, we can replicate this, what we've done with pandas, with pure 163 00:11:42,450 --> 00:11:42,990 Scikit-Learn. 164 00:11:43,110 --> 00:11:44,960 So why don't we look at that? 165 00:11:45,030 --> 00:11:48,970 What we'll do is we'll wait until the next video for that, so we'll get this set up. 166 00:11:49,050 --> 00:11:58,270 So "Option 2: Fill missing values with Scikit-Learn". 167 00:11:58,290 --> 00:11:58,970 All right. 168 00:11:59,040 --> 00:12:00,890 So have a little review of what we've done there. 169 00:12:00,930 --> 00:12:02,460 Go back up and see what we've done. 170 00:12:02,580 --> 00:12:06,690 You might want to try playing around with the car_sales_missing DataFrame. 171 00:12:06,690 --> 00:12:12,270 See if you can fill the missing values, play around with these functions here with pandas, and then see 172 00:12:12,270 --> 00:12:14,280 if you can convert it into numbers. 173 00:12:14,340 --> 00:12:20,300 Otherwise I'll see you in the next video, where we'll learn how to fill missing values in a dataset with pure 174 00:12:20,380 --> 00:12:21,210 Scikit-Learn.
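For the practice exercise, the whole pandas route from this video (fill, drop rows without a label, split into X and y, convert to numbers) might be sketched end to end like this. The encoding code from the previous video is not shown in this part of the transcript, so the one-hot step below is an assumption: it uses scikit-learn's OneHotEncoder inside a ColumnTransformer. The data, column names, and the choice to treat Doors as categorical are also assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for car_sales_missing; column names and values are assumed.
car_sales_missing = pd.DataFrame({
    "Make": ["Toyota", None, "Honda", "BMW"],
    "Colour": ["White", "Blue", None, "Red"],
    "Odometer (KM)": [35000.0, np.nan, 84000.0, 11000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, 9000.0, np.nan, 22000.0],
})

# Step 1: fill the feature columns (a dict maps each column to its fill value).
fill_values = {
    "Make": "missing",
    "Colour": "missing",
    "Odometer (KM)": car_sales_missing["Odometer (KM)"].mean(),
    "Doors": 4,
}
car_sales_filled = car_sales_missing.fillna(fill_values)

# Step 2: remove rows that have no Price label.
car_sales_filled = car_sales_filled.dropna(subset=["Price"])

# Step 3: split into features (X) and labels (y).
X = car_sales_filled.drop("Price", axis=1)
y = car_sales_filled["Price"]

# Step 4: one-hot encode the categorical columns (assumed encoding approach);
# the remaining numerical column is passed through unchanged.
categorical_features = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), categorical_features)],
    remainder="passthrough",
)
transformed_X = transformer.fit_transform(X)

# ColumnTransformer may return a sparse matrix; convert to a dense array if so.
if hasattr(transformed_X, "toarray"):
    transformed_X = transformed_X.toarray()

print(transformed_X.shape)
```

Doing the fills before the split means X and y are built from the same cleaned rows, which is why the video re-creates X and y after filling.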