We saw in the last section how to fill missing values in a dataset and then convert it to numbers. We filled the missing values with pandas, and then we converted them to numbers using scikit-learn. But it'd be nice to fill the missing values with scikit-learn and convert them to numbers with scikit-learn too, so we're kind of sticking to the one library. Let's see how we do that.

So first things first, we'll need to re-import and recreate our car sales missing data, because we've kind of filled the missing values in place up here using pandas. The one we have right now already has its missing values filled, so we'll read it back in with read_csv. I'm going to go car-sales... I've got a few... extended... where is it... car-sales-extended-missing-data.csv. Beautiful, that's what we want. And we'll just check: car_sales_missing.head(). Wonderful.

Now we'll just clarify whether we have some missing data in this one again. How do we do that? isna().sum(). Wonderful. So we have some missing values, which is not so wonderful if you're actually working on this dataset, because ideally your dataset doesn't have missing values, but that's all right, we're going to figure out how to fill them.

So how do we do that? First things first, we're going to work on a subset. Well, first we're going to get rid of rows which don't have Price values, and then we'll split it into X and y, because we don't want to deal with data that doesn't have labels: car_sales_missing.dropna(subset=["Price"], inplace=True). So what we're saying with this line here is basically: take the car_sales_missing DataFrame and drop the NaN values that are in the subset of the Price column. The 50 rows in the Price column that don't have values, remove them from the DataFrame. And then we're going to recalculate how many missing values we have. I might put a little comment here: drop the rows with no labels. Beautiful. We've lost some from the Make, Colour, Odometer and Doors columns because they may have been overlapping with the samples in the Price column that had missing values. So what's next? Well, we want to split into X and y.
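If you're following along, a minimal sketch of those first few cells might look like this (the file name is an assumption based on the dataset we've been using in this course):

```python
import pandas as pd

# Re-import the dataset that still has missing values
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

# Check how many missing values are in each column
car_sales_missing.isna().sum()

# Drop the rows with no labels (rows missing a Price value)
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()
```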
We've had plenty of practice doing this. This is what a lot of machine learning problems end up being, right? It's turning your data into features and labels, and then getting a machine learning model to hopefully learn some patterns in those features and predict the labels. That's what we're going to try and do here. axis=1, beautiful. And y, we've seen this before: car_sales_missing["Price"]. So again, we're just using these four columns here to predict the Price column. Well, that's our goal anyway.

So how would we fill the missing data and take care of these missing values with scikit-learn? Well, let's see. We've got two little handy-dandy imports. Well, one we've seen before, but one we haven't: import SimpleImputer. What it'll do is fill missing values with scikit-learn. Now, you might remember, right up there when we started this part, Part 1 of the scikit-learn workflow, which is getting the data ready, that filling missing values is also called imputation, which is what this does. So we import SimpleImputer from scikit-learn, and that's going to help us fill missing values. And then we've seen this one before: from sklearn.compose import ColumnTransformer, which allows us to define some kind of transformer and then apply it to whichever columns we want to use it on.

We want to do it the exact same way we did it with pandas, but this time just reproduce it with scikit-learn. So: fill categorical values with "missing" and numerical values with the mean. I keep writing these notes so we can remind ourselves and communicate through our code, make sure we're kind of talking it through and doing the right steps, because otherwise you can get a bit lost with everything going on.

So SimpleImputer, we're calling this class that we just imported here, and we're going to tell it strategy="constant". So the strategy we want it to fill with, it's basically going, hey, run over the categorical values (I've shortened categorical to cat here), and for every value keep your strategy constant, and the fill value is going to be "missing".
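Continuing on from the cell above, a sketch of the split and that first imputer (the variable names here are just my shorthand for what's described above):

```python
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Split the data into features (X) and labels (y)
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Fill categorical values with the string "missing"
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
```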
OK, that's starting to make sense. And then we're going to have a special one for our Doors feature, another SimpleImputer, because Doors is again that little funny one that is numerical but is also categorical. We're going to keep the strategy as constant, and we're going to say fill_value=4. Nice and simple. And then we'll have a numerical imputer, which we could have really called num_imputer, actually let's do that, because we're keeping these fairly short: num_imputer = SimpleImputer(strategy="mean"). We'll see what these mean in a second, we're just kind of, as I say, working through it. Let's go define the columns. So the cat columns, or cat_features, equals Make and Colour, then we've got a door feature, because again, remember, Doors is a special case, and then we've also got num_features, which is Odometer (KM). Wonderful.

And then what we're going to do is create an imputer, something that fills missing data, because that's what imputation is, right? If you hear that term imputation, it's kind of referring to finding a missing value and filling it with something, or calculating something to fill it with. So this is where we're going to leverage our ColumnTransformer. I love that word, column transformer, it always makes me think of Optimus Prime for some reason. What's your favourite Transformer? Mine's definitely Optimus Prime, or Bumblebee, I'm on the hype train there. Getting distracted, Daniel, we're talking about machine learning. Well, actually, transformers are machines that can learn, so we're not really getting distracted.

So what we're doing here is setting up our imputer and kind of passing it a few things to get it ready. So just bear with me a second, I'm kind of coding and trying to talk at the same time and getting distracted. So for the door imputer, we're going to go door_imputer, and then we're going to go door_feature. And then finally, one more: we've got num_imputer, then we're going to go num_imputer, and then num_features. Beautiful.

We've got a fair bit of code here, and we're going to do one more thing: transform the data. So we're going to go filled_X, because remember, X has some missing values at the moment. If I go up here, we want to go X.isna().sum(), which is just the same as up here, except we've got no Price category there. So we come down here: filled_X equals, we're going to take our little imputer, imputer.fit_transform(X), and then we're going to have a look at filled_X.
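Putting the rest of those pieces together, the imputer setup described here looks roughly like this (column names assumed from the dataset we've been working with):

```python
# Doors is numerical but behaves like a category, so fill it with a constant
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# Fill numerical values with the mean
num_imputer = SimpleImputer(strategy="mean")

# Define which columns each imputer applies to
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data) using ColumnTransformer
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data (fill the missing values in X)
filled_X = imputer.fit_transform(X)
filled_X
```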
Will this work? Fingers crossed, we've got a lot of code here. I've got one, two... about 24 lines of code plus comments... it worked. No errors. What has actually just happened? Well, let's step through this. What we might do is zoom out a little bit so we can see it all in one hit. Beautiful.

So what we've done is we've imported the SimpleImputer class from scikit-learn, as well as the ColumnTransformer class, and then we've defined some imputers. And remember, imputers just fill missing data, using the SimpleImputer class, which takes a strategy and a fill value. If the strategy is constant, we have to pass it a fill value, saying, hey, go to the categorical columns and constantly fill them: if you find a missing value, you fill it with the string "missing". Same thing for our door imputer, it's going to say keep the strategy constant, so for every missing cell, do the same thing and fill it with 4. Yeah, that makes sense. And for our numerical columns, in this case the Odometer column, fill them with the mean. So keeping that one nice and simple.

Then we've defined which columns are which. So our categorical columns are Make and Colour. Realistically, Doors is also categorical, but we've kind of given it its own because it's again halfway between a number and a category. So we've got the door feature, and then we've defined the numerical features, which is our Odometer. And then we've used the ColumnTransformer class, which is what we imported, when we created our imputer, passing it the imputations we wanted to do, all the transformations we wanted to do.

So these are the names. That's what ColumnTransformer takes: it takes a list of multiple different transformers. So see here, we've got a list, and within the list we have tuples of the name, the imputer we want to use and the features we want it applied to. This is just the name of the imputer, right? So if we had to access this imputation later, we can use this as its name. I've just kept it simple and called it the exact same thing as the imputer variable. And these are the features that we want this specific imputer to change. So this one, cat_imputer, is going to be used on the categorical features, door_imputer on the door feature, and then the numerical imputer on the numerical features. That was a mouthful.
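As a side note, if you did want to grab one of these imputers back by its name after fitting, ColumnTransformer keeps the fitted versions in its named_transformers_ attribute. A quick check, assuming the names we gave above:

```python
# Access a fitted imputer by the name we gave it in the ColumnTransformer,
# then inspect the value it learned (the mean it used to fill Odometer (KM))
imputer.named_transformers_["num_imputer"].statistics_
```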
And then once our transformer is defined, we can call fit_transform. We've called this filled_X because we're going to use our imputer and fit_transform on our X data to fill up the values of X. But don't take my word for it, let's see it in code. So car_sales_filled, let's create another DataFrame, because we want to use the same checking method that we've used before. We'll pass it filled_X, and then we'll pass it the column names. These are just the names of the respective columns, you can see: Make, Colour, Doors and then Odometer. So that's all we're going to do here: Make, Colour, Doors, Odometer (KM) as well. Beautiful, that should work.

Now we're going to have a look at the head of this DataFrame: car_sales_filled.head(). Wonderful, that looks familiar to what we've seen before. But the icing on the cake here, the real test, is seeing what isna().sum() outputs. We now have no missing values, thanks to this bunch of code that we've written here. So we've used SimpleImputer plus ColumnTransformer to fill the missing values, these missing values here, with some preset values that we defined here, and now car_sales_filled, which is what we've made out of our transformed data, has no missing values. Beautiful.

So now what should we be able to do? We have no missing values, so we should be able to convert these into numbers. So we come up here, we have some code that has done this before, and we're going to copy it and come back down. This is the only time I'll let you copy and paste code. I know I've said that before, but again, just to save us a little bit of time. What we might do is again split car_sales_filled into X and y, so we want X = car_sales_filled.drop... oh no, that already is, that already is our X value. Beautiful, so we can just change this to car_sales_filled. Wonderful. So this is telling us we've got a sparse matrix here now.

What can we do now? We've got our data as numbers, and filled, no missing values. Let's fit a model, let's fit our model, with np.random.seed just to make sure everything's working right. Always test the code, always test the code. That's what we're going to do. We're going to re-import what we need.
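In code, that whole chunk looks something like this. The one-hot encoding part mirrors the code we copied from earlier in the notebook, so treat it as a sketch; remainder="passthrough" is what keeps the Odometer column alongside the encoded categories:

```python
from sklearn.preprocessing import OneHotEncoder

# Put the filled (imputed) data back into a DataFrame so we can inspect it
car_sales_filled = pd.DataFrame(filled_X,
                                columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled.head()
car_sales_filled.isna().sum()  # should now be all zeros

# Turn the categories into numbers (same one-hot encoding approach as before)
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X  # a sparse matrix of numbers with no missing values
```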
Now we need to import our RandomForestRegressor, and we're going to have heaps of practice doing this. From sklearn.model_selection we want something to split our data into train and test sets, and then we'll actually split our data into train and test by going X_train, X_test, y_train, y_test = train_test_split. Wonderful. And we're going to pass it, this time, transformed_X, see this variable here, because what we've done is we've passed it car_sales_filled and basically just turned all the categorical features into numbers, we've one-hot encoded them. So we'll go here transformed_X, and also pass it y, because y is still saved in memory. Why doesn't it have to change? Because that's just the labels, and they're already numbers. And the test_size is 0.2. Wonderful.

And so we'll set up our model, which is a RandomForestRegressor, and then we're going to call model.fit. This is telling our model: hey, RandomForestRegressor, find the patterns between X_train and y_train. And then we'll call model.score on our test data. That's going to say: hey, I know you've found some patterns in this training dataset, but now evaluate those patterns on this test dataset. Moment of truth, Shift+Enter... beautiful.

We're getting a warning here because there are some changes happening in a future version of scikit-learn. You might not have this warning if you're using scikit-learn version 0.22. All it's saying is that n_estimators will be changed from the default of 10 to 100 in version 0.22. So we can basically get rid of this warning by going n_estimators=100. Wonderful.

So even though this model has more estimators than the previous one we used on our car sales data, it performs worse. This one gets a score of about 0.219, and if we come up here, where is it, we trained a model in the previous section, that one got about 0.304. The maximum score is 1, so they're both not doing as well as we might ideally like. But why do you think this one performed worse? Well, it's because it's only got nine hundred and fifty samples. So if we go len(car_sales_filled), as well as len(car_sales_missing)... no, we want the original car_sales. There we go. So our previous model was built on this one, and the model we've just built was built on this one. So one has a thousand samples and the other only has 950 samples, which is why it's done slightly worse.
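And the modelling step, roughly, looks like this (the random seed value is an assumption for reproducibility, and car_sales is the original DataFrame we loaded earlier in the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

np.random.seed(42)  # seed value assumed, just so results are reproducible

# Split the transformed (numeric, no-missing-values) data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)

# Fit a model and evaluate it on the test set
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Compare dataset sizes to see why the score dropped (950 vs 1000 samples)
len(car_sales_filled), len(car_sales)
```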
And so that's a big paradigm of machine learning, right? Most of the time, if you have more data, a machine learning model is usually able to find better patterns. And that's kind of what's happened here: because we've dropped the samples that don't have labels, our model hasn't been able to find as many patterns, even though we only removed 50 out of a thousand samples.

We've covered a lot here already. The most important key takeaways from this are: most datasets you come across won't be in a form that's ready to use immediately with machine learning models, and some take a bit more preparation than others. Most of the time, your data will have to be numerical and it can't have missing values. The process of filling missing values is called imputation, and the process of turning your non-numerical values into numerical values is referred to as feature engineering or feature encoding.

So with that being said, we've covered Part 1 of how to deal with your data. Let's get into Part 2 and figure out: where the hell did I get this idea of choosing a RandomForestRegressor for our problem? Why did I pick this machine learning model? Well, we'll have to wait and see.