1 00:00:00,240 --> 00:00:05,160 Now, I know at the end of the last video we said we'd make sure all the data was filled, as in there 2 00:00:05,160 --> 00:00:06,420 were no missing values. 3 00:00:06,600 --> 00:00:11,720 But after reviewing it, I think it makes a little bit more sense to make sure it's all numerical first. 4 00:00:11,810 --> 00:00:13,350 So let's make a little heading. 5 00:00:13,440 --> 00:00:15,140 We'll call this 1.1. 6 00:00:15,330 --> 00:00:18,880 Make sure it's all numerical. 7 00:00:18,990 --> 00:00:20,160 Wonderful. 8 00:00:20,160 --> 00:00:23,620 And now, to begin with, we kind of need another dataset. 9 00:00:23,640 --> 00:00:24,540 Now, why is that? 10 00:00:24,960 --> 00:00:31,730 Well, because our heart disease dataset, this one here, is already all numerical. 11 00:00:31,800 --> 00:00:39,800 So, that being said, let's import another dataset, which is our trusty car sales data. 12 00:00:39,800 --> 00:00:43,300 Now, if you remember, right at the start I kind of revealed a little bit of this here. 13 00:00:43,520 --> 00:00:46,090 But this one only has 10 rows. 14 00:00:46,090 --> 00:00:53,630 And what I've done is I've gone and manufactured a little extended version of this, which has the same 15 00:00:53,660 --> 00:00:57,560 headings: make, color, odometer, doors and price. 16 00:00:57,560 --> 00:00:59,990 But this time we've got a thousand rows. 17 00:01:00,080 --> 00:01:02,270 I'm not going to scroll through because that would take too long. 18 00:01:02,270 --> 00:01:03,500 You're just gonna have to trust me on that. 19 00:01:03,830 --> 00:01:05,090 Anyway, I can prove it. 20 00:01:05,120 --> 00:01:08,720 car_sales equals pd.read_csv. 21 00:01:09,380 --> 00:01:14,060 Remember, I've moved all of our CSV files into a data folder, which is kind of best practice. 22 00:01:14,060 --> 00:01:19,170 Keep your working directory tidy. Car sales extended
23 00:01:19,760 --> 00:01:23,080 dot CSV. Wonderful, and we're going to have a look at it. 24 00:01:23,150 --> 00:01:27,490 car_sales.head(). Wonderful. 25 00:01:27,510 --> 00:01:33,610 And then we'll figure out how many there are: a thousand. Beautiful. 26 00:01:33,750 --> 00:01:39,390 And then we'll look at the data types: make is an object, color is an object, and odometer, doors and price 27 00:01:39,420 --> 00:01:41,170 are all integers. 28 00:01:41,230 --> 00:01:46,350 Now, these are objects because they're strings and they are categories, whereas these are numerical 29 00:01:46,350 --> 00:01:47,250 values. 30 00:01:47,250 --> 00:01:48,050 Now, what we have to do, 31 00:01:48,090 --> 00:01:53,040 because this section is "make sure it's all numerical", is that before we can run a machine learning model 32 00:01:53,040 --> 00:01:58,350 we have to convert these to numbers. But just to prove it, let's try anyway and see what happens. 33 00:01:58,710 --> 00:02:05,160 So first of all, what we're going to do is split the data into X and y, so we'll use these four columns, 34 00:02:05,370 --> 00:02:12,460 make, color, odometer and doors, not price, just these four, to try and predict the price of a car. 35 00:02:12,510 --> 00:02:14,970 Now, have a think to yourself: does this actually make sense? 36 00:02:14,970 --> 00:02:16,860 Could you do that in real life? 37 00:02:16,860 --> 00:02:20,190 Do we expect our machine learning model to do well on this kind of problem? 38 00:02:20,670 --> 00:02:22,390 If you were given only the make, 39 00:02:22,450 --> 00:02:30,290 so Honda, and the color, a white Honda, with 35,431 kilometers on the odometer and four doors, 40 00:02:30,420 --> 00:02:34,740 could you predict the price was fifteen thousand three hundred twenty-three dollars? 41 00:02:35,730 --> 00:02:39,530 Maybe, maybe not, but you'd probably need a little bit more information. 42 00:02:39,720 --> 00:02:42,050 The same probably goes for our machine learning model.
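The inspection steps described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the small DataFrame is a made-up stand-in for the extended car sales CSV (only the first row's values come from the video), and the column names `Make`, `Color`, `Odometer`, `Doors` and `Price` are assumed from the headings mentioned, not confirmed file headers.

```python
import pandas as pd

# Made-up stand-in for the extended car sales data.
# Column names are assumptions; with the real file you would use pd.read_csv(...) instead.
car_sales = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"],
    "Color": ["White", "Blue", "White", "White", "Blue"],
    "Odometer": [35431, 192714, 84714, 154365, 181577],
    "Doors": [4, 5, 4, 4, 3],
    "Price": [15323, 19943, 28343, 13434, 14043],
})

print(car_sales.head())   # first few rows
print(len(car_sales))     # number of rows
print(car_sales.dtypes)   # string columns show up as non-numeric dtypes

# Columns that still need converting before a model can use them.
non_numeric = car_sales.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Listing the non-numeric columns up front makes it clear which ones have to be converted before any model is fitted.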
43 00:02:42,150 --> 00:02:49,790 But nonetheless, let's try and make one. Split into X and y, which is, we want the feature matrix. 44 00:02:49,890 --> 00:02:51,930 So we're going to go X equals 45 00:02:52,020 --> 00:02:54,040 car_sales.drop. 46 00:02:54,210 --> 00:03:00,810 Now, we don't want the price column. Oh, what have I done? I actually pressed Shift+Enter. I'm getting very trigger 47 00:03:00,810 --> 00:03:02,220 happy in this neighborhood. 48 00:03:02,260 --> 00:03:09,240 Price, axis equals one, and we're going to do y equals car_sales. 49 00:03:09,240 --> 00:03:16,980 Now, this is going to be just the price column, because we want to use the X to predict the y, and then 50 00:03:16,980 --> 00:03:26,360 we'll also split into training and testing. Split into training and test. 51 00:03:26,530 --> 00:03:33,930 So X_train, X_test, y_train, y_test equals, 52 00:03:33,940 --> 00:03:43,850 we saw this before, train_test_split. We'll pass it X, we'll pass it y, and we'll pass it test_size equals. 53 00:03:43,870 --> 00:03:46,740 We'll use 20 percent of the rows as the test data. 54 00:03:47,260 --> 00:03:48,450 Wonderful. 55 00:03:48,460 --> 00:03:51,790 Well, let's try to build a machine learning model that will learn on the training data, 56 00:03:51,820 --> 00:04:00,610 so the X_train and the y_train, and then predict on the test data. All right, build machine learning 57 00:04:00,610 --> 00:04:02,110 model. 58 00:04:02,110 --> 00:04:10,570 Now, this time we're still going to use random forest, but I want you to guess what this random forest 59 00:04:10,570 --> 00:04:16,580 can be used for. So we're going from sklearn.ensemble 60 00:04:16,580 --> 00:04:18,630 import RandomForestRegressor. 61 00:04:18,960 --> 00:04:23,390 If you remember, back in our workflow we were predicting a classification problem. 62 00:04:23,390 --> 00:04:27,640 Now, this RandomForestRegressor is the same as the random forest classifier.
63 00:04:27,650 --> 00:04:31,220 But this time it can predict a number, which is what we're trying to do, right? 64 00:04:31,250 --> 00:04:36,190 We're trying to predict the price of a car given some attributes about it. 65 00:04:36,410 --> 00:04:38,650 So we're going to import a regression model. 66 00:04:38,780 --> 00:04:41,460 We're going to set it up. This time, instead of calling it clf, 67 00:04:41,450 --> 00:04:45,770 we'll call it model. Regressor. Beautiful. 68 00:04:45,890 --> 00:04:47,350 And then we'll go model.fit. 69 00:04:47,360 --> 00:04:52,660 We want to train it on the training data, so it's going to learn the patterns, the relationships between 70 00:04:52,730 --> 00:04:55,730 the X variables and the price, the label. 71 00:04:55,730 --> 00:05:02,260 And then we're gonna score it, evaluate it on the test data, y_test. Now, how beautiful is that? 72 00:05:02,270 --> 00:05:04,240 This is the power of sklearn, right? 73 00:05:04,250 --> 00:05:11,780 We can create a model, train it, run it and evaluate it in four lines of code. 74 00:05:11,780 --> 00:05:16,610 We could really do it in less, but four lines of code is pretty damn good to build a full-blown machine 75 00:05:16,610 --> 00:05:17,200 learning model. 76 00:05:17,660 --> 00:05:18,370 So let's do it. 77 00:05:19,280 --> 00:05:20,030 Oh no. 78 00:05:20,240 --> 00:05:21,170 We've got an error. 79 00:05:21,170 --> 00:05:22,790 What's happened? ValueError. 80 00:05:22,940 --> 00:05:24,710 We kind of expected this though, right? 81 00:05:24,710 --> 00:05:29,050 ValueError: could not convert string to float: Nissan. 82 00:05:29,810 --> 00:05:35,630 Now, I see what's happened is that our machine learning model can't deal with strings. 83 00:05:35,630 --> 00:05:39,750 So, as we kind of knew, we have to convert these into numbers.
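The failed attempt described above can be reproduced with a sketch like the one below. It assumes made-up rows and assumed column names (`Make`, `Color`, `Odometer`, `Doors`, `Price`) rather than the real CSV, and it catches the `ValueError` so the script keeps running instead of stopping at the error.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Made-up stand-in rows; column names are assumptions, not the real CSV headers.
car_sales = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"] * 2,
    "Color": ["White", "Blue", "White", "White", "Blue"] * 2,
    "Odometer": [35431, 192714, 84714, 154365, 181577] * 2,
    "Doors": [4, 5, 4, 4, 3] * 2,
    "Price": [15323, 19943, 28343, 13434, 14043] * 2,
})

# Features (X) are every column except the label; the label (y) is the price.
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# Hold out 20 percent of the rows as test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor()
error_message = None
try:
    model.fit(X_train, y_train)   # "Make" and "Color" still hold strings
    model.score(X_test, y_test)
except ValueError as error:
    error_message = str(error)

print(error_message)
```

The exact wording of the message depends on which string the model meets first, so no expected output is shown here.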
84 00:05:39,860 --> 00:05:46,880 So let's see how we do that with sklearn, and we're going to kind of go through this little chunk of 85 00:05:46,880 --> 00:05:52,250 turning the categories, make and color, into numbers with sklearn, and it might seem like a bit to take 86 00:05:52,250 --> 00:05:57,530 in to begin with, but we'll write it out, we'll see the code in action and then we'll go back through it 87 00:05:57,530 --> 00:05:59,350 and see what it's actually doing. 88 00:05:59,360 --> 00:06:08,270 So from sklearn.preprocessing, sklearn has a fair few modules that each do a number of different 89 00:06:08,270 --> 00:06:09,480 things, 90 00:06:09,590 --> 00:06:13,350 OneHotEncoder, one we haven't seen before. 91 00:06:13,520 --> 00:06:21,030 We'll put a little comment on here for what we're going to do: turn the categories into numbers. 92 00:06:21,200 --> 00:06:29,250 Wonderful. From sklearn.compose import ColumnTransformer. 93 00:06:29,520 --> 00:06:34,460 I wonder what that does. You can probably kind of guess what ColumnTransformer does, but we'll see it 94 00:06:34,460 --> 00:06:42,140 in action first. Then we're going to define our categorical features, so in our case our categorical features 95 00:06:42,140 --> 00:06:48,480 are make, color, and a tricky one is doors. 96 00:06:48,510 --> 00:06:51,560 Now, you might be thinking, why doors? 97 00:06:51,690 --> 00:06:54,160 Let's go back up and have a look at it. 98 00:06:54,210 --> 00:06:58,680 Doors: 4, 5, 4, 4, 3. 99 00:06:58,860 --> 00:06:59,760 Why is this? 100 00:06:59,760 --> 00:07:00,440 Well, let's see it. 101 00:07:00,450 --> 00:07:03,930 car_sales, doors, 102 00:07:04,110 --> 00:07:05,790 value_counts. 103 00:07:05,920 --> 00:07:12,030 It's safe to assume that doors is numerical, right, because you've got 4, 5 and 3 as values. 104 00:07:12,030 --> 00:07:17,260 But it's also categorical, because you can go: cars with four doors fit into this category,
105 00:07:17,370 --> 00:07:19,440 cars with five doors fit into this category, 106 00:07:19,440 --> 00:07:23,160 so there's seventy-nine of them, and cars with three doors fit into this category, 107 00:07:23,160 --> 00:07:24,820 so there's 65 of those. 108 00:07:25,020 --> 00:07:28,900 So we're going to, in our case, treat doors as categorical. 109 00:07:28,920 --> 00:07:31,530 And I'm just continually spamming Command+S 110 00:07:31,620 --> 00:07:36,750 so the notebook saves. It's like I've ingrained it into habit, just like pressing Shift and Enter to run 111 00:07:36,750 --> 00:07:37,710 a cell. 112 00:07:37,740 --> 00:07:42,600 So we've got three categorical features, beautiful, and then we're going to make this little variable 113 00:07:42,600 --> 00:07:43,550 called one_hot. 114 00:07:43,560 --> 00:07:48,870 We're going to take advantage of the OneHotEncoder class that we've imported. 115 00:07:48,960 --> 00:07:54,570 So we'll instantiate that there, and we're going to make a little variable called transformer, and then 116 00:07:54,660 --> 00:07:58,380 this one is going to use the ColumnTransformer. 117 00:07:58,530 --> 00:08:03,570 This is going to accept a list of tuples with a name. 118 00:08:03,600 --> 00:08:10,770 So in this case we'll call it one_hot, and then we'll pass the actual transformer that we want to use, 119 00:08:10,770 --> 00:08:14,200 which is the one_hot variable that we've just instantiated. 120 00:08:14,200 --> 00:08:20,370 We're going to format this to make sure it's nicely spaced. I'm going to pass it the list of features 121 00:08:20,370 --> 00:08:23,230 we'd like to transform. 122 00:08:23,670 --> 00:08:26,420 And then finally, we don't need that. 123 00:08:26,610 --> 00:08:32,580 Let's bring that back up. Stuff all over the place here. 124 00:08:32,580 --> 00:08:33,650 Bring that back up. 125 00:08:33,720 --> 00:08:34,460 Wonderful.
126 00:08:34,470 --> 00:08:41,340 And then here we're going to pass one final parameter to ColumnTransformer called remainder 127 00:08:41,700 --> 00:08:50,890 equals passthrough. Beautiful. And then we're going to have a little variable here called transformed 128 00:08:51,040 --> 00:08:55,990 X, which is basically going to be the version of our X data. 129 00:08:56,530 --> 00:09:01,420 Let's have a look at what X is. Basically transformed_X, 130 00:09:01,480 --> 00:09:08,840 we don't want all of those, is going to be the version of this except converted into numbers. 131 00:09:09,000 --> 00:09:10,590 So let's see that. 132 00:09:10,610 --> 00:09:12,720 So we've got our transformer here. 133 00:09:12,750 --> 00:09:21,540 This is what we're going to go: transformer.fit_transform on X, and then we're going to have a look 134 00:09:21,570 --> 00:09:24,730 at transformed_X. 135 00:09:24,730 --> 00:09:29,040 Well, we've got some invalid syntax. Oh, a comma here. 136 00:09:31,170 --> 00:09:33,630 NameError: transform is not defined. 137 00:09:33,630 --> 00:09:36,120 Classic. transformer. 138 00:09:36,130 --> 00:09:37,280 There we go. 139 00:09:37,280 --> 00:09:38,480 Wonderful. 140 00:09:38,480 --> 00:09:40,790 So this is spitting out an array. 141 00:09:40,790 --> 00:09:44,770 Maybe if we put this into a DataFrame that would make sense. 142 00:09:44,780 --> 00:09:47,700 Let's go pd.DataFrame. 143 00:09:47,840 --> 00:09:56,610 See if we can actually, transformed_X. Wonderful. So you might be looking at this going, what the hell is 144 00:09:56,610 --> 00:09:57,060 this? 145 00:09:57,300 --> 00:10:05,750 But if we bring back X, what we're going to see, dot head, we're a bit zoomed in, so we're going to have 146 00:10:05,750 --> 00:10:08,350 to kind of keep bouncing up and down a little.
147 00:10:08,840 --> 00:10:17,000 But what has happened is that, because we've told scikit-learn that, hey, make, color and doors are categories, 148 00:10:18,080 --> 00:10:21,620 what this code here has done is transformed it, 149 00:10:21,710 --> 00:10:24,460 specifically, one-hot encoded it. 150 00:10:24,500 --> 00:10:32,180 So now we've converted the make, color and doors columns into one-hot encoded variables, but the odometer 151 00:10:32,180 --> 00:10:35,810 column, which is here, you can see the numbers are actually still the same. 152 00:10:35,870 --> 00:10:39,860 So 35,431, yep, 192,714. 153 00:10:40,020 --> 00:10:41,240 Yep, correct, correct. 154 00:10:41,900 --> 00:10:44,340 So the odometer column hasn't been changed. 155 00:10:44,480 --> 00:10:50,450 So if we look back up here, the reason why make, color and doors have been changed is because we defined them 156 00:10:50,450 --> 00:10:51,330 here. 157 00:10:51,590 --> 00:10:53,690 We instantiated a OneHotEncoder. 158 00:10:53,720 --> 00:10:55,570 We'll see what that is in a moment. 159 00:10:55,570 --> 00:10:59,630 And we've created a transformer using ColumnTransformer. 160 00:10:59,630 --> 00:11:02,460 So basically this is saying, hey, ColumnTransformer, 161 00:11:02,630 --> 00:11:08,270 take the OneHotEncoder and apply it to the categorical features. 162 00:11:08,270 --> 00:11:12,210 And for the remainder of the columns that you find, pass through. 163 00:11:12,380 --> 00:11:14,220 Don't do anything to those. 164 00:11:14,300 --> 00:11:20,420 And then what we've done is created transformed_X and fit our transformer, 165 00:11:20,420 --> 00:11:20,950 so fit 166 00:11:20,960 --> 00:11:23,680 transform, to our X data. 167 00:11:23,750 --> 00:11:25,950 And that's kind of a bit of a mouthful there. 168 00:11:26,150 --> 00:11:28,940 And you might be thinking, what does one-hot encoding even mean? 169 00:11:29,390 --> 00:11:30,950 Well, let's have a look.
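The transformation just walked through can be sketched as follows. This is a hedged sketch, not the notebook's exact cell: the feature rows are made up, the column names (`Make`, `Color`, `Odometer`, `Doors`) are assumptions, and the `toarray` check is an addition to handle the case where scikit-learn returns a SciPy sparse matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up feature rows with assumed column names.
X = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"],
    "Color": ["White", "Blue", "White", "White", "Blue"],
    "Odometer": [35431, 192714, 84714, 154365, 181577],
    "Doors": [4, 5, 4, 4, 3],
})

# Turn the categories into numbers.
categorical_features = ["Make", "Color", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],  # (name, transformer, columns)
    remainder="passthrough",                       # leave the other columns untouched
)
transformed_X = transformer.fit_transform(X)

# fit_transform may return a SciPy sparse matrix; make it dense so it prints as a table.
if hasattr(transformed_X, "toarray"):
    transformed_X = transformed_X.toarray()

transformed_df = pd.DataFrame(transformed_X)
print(transformed_df)
```

With these rows there are four makes, two colors and three door counts, so the encoded columns come first and the untouched odometer values end up in the last column.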
170 00:11:31,040 --> 00:11:35,220 We'll go to our keynote, because this is one-hot encoding in a nutshell. 171 00:11:35,220 --> 00:11:41,530 If we have four cars, so car 0, 1, 2, 3, and their colors are red, green, blue and red, 172 00:11:42,080 --> 00:11:48,680 if we wanted to one-hot encode these, so basically turn the categories into numbers, it would look something 173 00:11:48,680 --> 00:11:50,930 like this. This is what one-hot encoding does. 174 00:11:50,990 --> 00:11:55,020 We've got car 0, and because the color of car 0 is red... 175 00:11:55,070 --> 00:11:58,930 This little arrow here could be the code that we've used to transform our DataFrame, 176 00:11:59,630 --> 00:12:01,930 so to transform our categories into numbers. 177 00:12:01,930 --> 00:12:03,950 So this is one-hot encoding. 178 00:12:04,130 --> 00:12:09,410 So because car 0's color is red, it gets a 1, and it gets a zero for the other two categories. 179 00:12:09,520 --> 00:12:10,760 And the same for green. 180 00:12:10,760 --> 00:12:17,150 But this time, because it's green, it gets a zero for red and a 1 for green and a zero for blue. Blue, 181 00:12:17,150 --> 00:12:22,550 exactly the same: zero for red, zero for green and a one for blue, because it's a blue car. 182 00:12:22,550 --> 00:12:30,860 And then finally, because car 3 is also red, it has a 1, just like car 0, and there's 0 and 0 for green and 183 00:12:30,860 --> 00:12:31,770 blue. 184 00:12:31,850 --> 00:12:36,140 Now, that's what we've done here, except not only with the color column. 185 00:12:36,140 --> 00:12:38,330 We've done it with the make and the doors columns. 186 00:12:39,050 --> 00:12:45,230 So that's why we have 0, 1, 2, all the way up to 11, features with zeros and ones. 187 00:12:45,230 --> 00:12:53,390 Now, what we've done is we've encoded the different parameters of each sample, so Honda, white and number 188 00:12:53,390 --> 00:12:56,450 of doors, into zeros or ones.
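The four-car color example above can be written out by hand in plain Python. This is only an illustration of the idea, with hypothetical variable names; it is not how scikit-learn implements `OneHotEncoder` internally.

```python
# Colors of car 0, 1, 2 and 3 from the slide.
colors = ["red", "green", "blue", "red"]

# One output column per category, in this order.
categories = ["red", "green", "blue"]

# Each car gets a 1 in the column matching its color and a 0 everywhere else.
one_hot_rows = [
    [1 if color == category else 0 for category in categories]
    for color in colors
]

for car_index, row in enumerate(one_hot_rows):
    print(car_index, row)
# 0 [1, 0, 0]
# 1 [0, 1, 0]
# 2 [0, 0, 1]
# 3 [1, 0, 0]
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.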
189 00:12:56,450 --> 00:13:01,970 So now the beautiful thing is our data is all numerical. 190 00:13:01,970 --> 00:13:02,960 So what can we do? 191 00:13:03,620 --> 00:13:08,200 Well, we should be able to fit a model. But just in case that wasn't clear, 192 00:13:08,240 --> 00:13:15,880 we've got one more way that we can do this. So we can go dummies equals, there's a function in pandas 193 00:13:15,890 --> 00:13:21,260 called get_dummies, which is kind of like one-hot encoding. I'm actually not sure why 194 00:13:21,260 --> 00:13:26,960 it's called dummies, I just know that the function is get_dummies. So dummies equals pd.get_dummies, and 195 00:13:26,960 --> 00:13:33,830 we want to transform car_sales and pass a list of the columns that we want to transform or turn into 196 00:13:33,830 --> 00:13:35,360 dummies. 197 00:13:35,360 --> 00:13:41,100 So here, doors. Now let's have a look at dummies. 198 00:13:41,120 --> 00:13:41,930 Here we go. 199 00:13:42,410 --> 00:13:46,630 So I think because doors is numerical, 200 00:13:46,690 --> 00:13:52,600 it hasn't worked on it here, but this is what we can see is happening. So we've got row zero, make 201 00:13:52,600 --> 00:14:01,030 BMW is zero, so it's a Honda, it's got zero for Nissan, zero for Toyota, and it's got zeros for black, blue, 202 00:14:01,240 --> 00:14:03,490 green, red and white. 203 00:14:03,520 --> 00:14:10,200 So if we went through this, all it's done is turn the make and the color into zeros and ones. 204 00:14:10,600 --> 00:14:11,860 Beautiful. 205 00:14:11,860 --> 00:14:17,560 So now that our data is all in zeros and ones, let's try and refit the model. 206 00:14:21,280 --> 00:14:25,990 So we want to go np.random.seed. 207 00:14:26,450 --> 00:14:30,090 We've got 42, just so our results are reproducible. 208 00:14:30,160 --> 00:14:33,550 We're going to use transformed_X from up here.
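The `pd.get_dummies` alternative mentioned above can be sketched like this, again with made-up rows and assumed column names. It also shows one likely reason the numeric doors column was left alone: when no `columns` argument is given, `get_dummies` only encodes string and categorical columns, whereas naming the columns explicitly encodes numeric ones as well.

```python
import pandas as pd

# Made-up rows with assumed column names.
car_sales = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"],
    "Color": ["White", "Blue", "White", "White", "Blue"],
    "Doors": [4, 5, 4, 4, 3],
})

# Without `columns`, only the string columns are encoded; "Doors" stays numeric.
default_dummies = pd.get_dummies(car_sales[["Make", "Color", "Doors"]])
print(default_dummies.columns.tolist())

# Naming the columns explicitly also encodes the numeric "Doors" column.
all_dummies = pd.get_dummies(car_sales, columns=["Make", "Color", "Doors"])
print(all_dummies.columns.tolist())
```

Whether doors should be encoded depends on the choice made earlier to treat it as a category rather than a quantity.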
209 00:14:33,700 --> 00:14:41,620 So, this one here instead of X. So let's do that. X_train, we'll set up new training and test data, 210 00:14:42,940 --> 00:14:52,300 equals train_test_split. We're gonna pass it transformed_X, and y can stay the same because y is already 211 00:14:52,300 --> 00:14:56,310 numerical, and then we're going to have test_size 212 00:14:56,410 --> 00:14:58,960 equals 0.2. 213 00:14:59,140 --> 00:14:59,720 Excellent. 214 00:14:59,740 --> 00:15:09,520 And then we can go model.fit, X_train, y_train. Beautiful. 215 00:15:09,540 --> 00:15:10,410 It worked. 216 00:15:10,410 --> 00:15:14,100 And now let's evaluate it: model.score, X_test, 217 00:15:14,460 --> 00:15:15,210 y_test. 218 00:15:18,560 --> 00:15:19,390 Wonderful. 219 00:15:19,550 --> 00:15:20,930 Now it worked, right? 220 00:15:20,930 --> 00:15:28,400 That's because we used transformed_X, and because up here we did this little bit of code to transform 221 00:15:28,400 --> 00:15:37,790 our data from being categorical into numerical. And previously our model could not convert 222 00:15:37,940 --> 00:15:40,010 strings to floats, for Nissan. 223 00:15:40,010 --> 00:15:48,490 So when we used our first X variable up here, our data was still in categorical form, so still using strings 224 00:15:48,490 --> 00:15:50,930 to represent a sample. 225 00:15:51,010 --> 00:15:57,370 And what happened? It errored out, we got a ValueError. But now, right down here, when we run the same code, 226 00:15:58,210 --> 00:16:02,830 model.fit on X_train, y_train, and score, we don't get any errors. 227 00:16:02,830 --> 00:16:03,410 Why? 228 00:16:03,430 --> 00:16:05,970 Because our data is all numerical. 229 00:16:05,980 --> 00:16:11,290 Now, as for this score here, we imagine the best score you can get is 1.0, so the model probably 230 00:16:11,290 --> 00:16:13,620 hasn't done as well as it could possibly do.
231 00:16:13,690 --> 00:16:20,070 But we talked about that: if you were to look at this data here, the X data, and you were 232 00:16:20,850 --> 00:16:25,260 trying to use this to predict a car's price, it would probably be pretty hard. 233 00:16:25,290 --> 00:16:29,700 So maybe the model has found some patterns, but they just weren't that great because there wasn't too 234 00:16:29,700 --> 00:16:31,550 much information about each sample. 235 00:16:32,250 --> 00:16:33,520 But that's not the point. 236 00:16:33,630 --> 00:16:35,520 We can look into evaluation metrics later. 237 00:16:35,520 --> 00:16:41,040 The point here is that we've converted our data from being non-numerical to completely 238 00:16:41,040 --> 00:16:45,840 numerical, and that has allowed us to fit a machine learning model on it. 239 00:16:45,840 --> 00:16:46,700 Now we've covered that, 240 00:16:46,710 --> 00:16:48,670 we've converted all our data to numbers. 241 00:16:48,750 --> 00:16:51,910 What happens if there were missing values? 242 00:16:52,040 --> 00:16:53,240 Let's check it out in the next video.
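Putting the steps from this video together, a minimal end-to-end sketch might look like the following. The data here is synthetic and generated only for illustration (the column names, value ranges and the price formula are all assumptions), so the printed score says nothing about the real car sales data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

np.random.seed(42)  # so the results are reproducible

# Synthetic stand-in data; every column name and value here is an assumption.
n_rows = 100
car_sales = pd.DataFrame({
    "Make": np.random.choice(["Honda", "BMW", "Toyota", "Nissan"], n_rows),
    "Color": np.random.choice(["White", "Blue", "Red", "Black", "Green"], n_rows),
    "Odometer": np.random.randint(10_000, 250_000, n_rows),
    "Doors": np.random.choice([3, 4, 5], n_rows),
})
# Made-up price: lower for higher odometer readings, plus some noise.
car_sales["Price"] = 30_000 - car_sales["Odometer"] // 10 + np.random.randint(0, 5_000, n_rows)

X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# Turn the categories into numbers; pass the remaining columns through unchanged.
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Make", "Color", "Doors"])],
    remainder="passthrough",
)
transformed_X = transformer.fit_transform(X)

# Same split and model as before, but on the numerical version of X.
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
```

For a regressor, `score` returns the coefficient of determination (R squared), whose best possible value is 1.0.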