1 00:00:00,280 --> 00:00:06,220 OK, so now we've got this behemoth of a function that's going to preprocess our data. 2 00:00:06,340 --> 00:00:12,240 Let's try running it on the df_test that we just pulled up here and tried to make some predictions on. 3 00:00:12,240 --> 00:00:14,070 So we imported it here. 4 00:00:14,070 --> 00:00:17,880 We tried to make some predictions on it, but it didn't work because it's not in the same format as the training 5 00:00:17,880 --> 00:00:18,820 data. 6 00:00:18,870 --> 00:00:25,980 Let's run it through our preprocess_data function and hopefully, if we've done it correctly, it's going to make 7 00:00:25,980 --> 00:00:29,640 the same changes on df_test as we have on the training data. 8 00:00:30,210 --> 00:00:31,230 So let's go here. 9 00:00:31,230 --> 00:00:36,030 df_test equals... actually, we'll put a little comment here. 10 00:00:36,030 --> 00:00:45,420 "Process test data", and we might just re-instantiate, to make sure that we've hit Shift+Enter here, Shift+Enter 11 00:00:45,420 --> 00:00:51,570 here, just to make sure that these cells have run. I've been notorious for writing a function and 12 00:00:51,570 --> 00:00:54,010 then not running the cell, so it actually doesn't work. 13 00:00:54,060 --> 00:01:03,350 preprocess_data(df_test), and then we should be able to check it out with df_test.head(). OK, 14 00:01:03,450 --> 00:01:05,830 101 columns. That's looking better. 15 00:01:05,840 --> 00:01:11,140 So we're up a fair few columns from what we started with. If we go back, we started with 52 16 00:01:11,150 --> 00:01:12,560 columns. 17 00:01:12,560 --> 00:01:18,770 Now if we go through here, right to the end, there's a fair few "is missing" columns. 18 00:01:18,770 --> 00:01:19,160 All right. 19 00:01:19,250 --> 00:01:19,610 OK. 20 00:01:19,640 --> 00:01:23,450 So it's going to be missing the SalePrice column. 21 00:01:23,460 --> 00:01:23,680 OK. 22 00:01:23,690 --> 00:01:28,880 That makes sense, but it's still a different shape to X_train. 
23 00:01:28,880 --> 00:01:32,180 So we've got 101 versus 102 columns. 24 00:01:32,360 --> 00:01:38,000 Now, if you remember, at the end of the last video I asked you to have a think about where this function might 25 00:01:38,000 --> 00:01:44,610 break, and if you're not sure, that's fine. I've stumbled across this many times in the process 26 00:01:44,610 --> 00:01:50,100 of trying to get our test data into the same format as our training data, and what 27 00:01:50,100 --> 00:01:58,470 it is is that the test data, when we imported it, is slightly different to the training data. Let 28 00:01:58,470 --> 00:02:00,850 me demonstrate. We're doing model-driven EDA. 29 00:02:00,870 --> 00:02:02,890 So we'll try to make a prediction again. 30 00:02:03,150 --> 00:02:09,740 So, "make predictions on updated test data". 31 00:02:10,120 --> 00:02:20,880 We'll go test_preds equals ideal_model.predict(df_test), and we get a ValueError again. 32 00:02:20,910 --> 00:02:22,350 Hold on, I just preprocessed it. 33 00:02:22,350 --> 00:02:22,920 What's wrong? 34 00:02:23,070 --> 00:02:29,270 Let's check it out: "Number of features of the model must match the input. Model n_features is 35 00:02:29,270 --> 00:02:29,590 102". 36 00:02:29,630 --> 00:02:29,950 OK. 37 00:02:29,960 --> 00:02:34,590 That's what our model has been trained on. "And input n_features is 101". 38 00:02:35,470 --> 00:02:39,200 Well, we just saw this when we went X_train.head(). 39 00:02:39,500 --> 00:02:45,320 That's 102 columns, so there is a difference in the number of columns that our test data 40 00:02:45,320 --> 00:02:47,150 frame and our training data frame have. 41 00:02:47,180 --> 00:02:49,970 Hence why our machine learning model doesn't work. 42 00:02:49,970 --> 00:02:52,430 So how would we figure out where the difference is? 
43 00:02:52,520 --> 00:02:57,020 Well, we can do that using the columns attribute and turning them into sets. 44 00:02:57,200 --> 00:02:57,820 Let's have a look. 45 00:02:58,340 --> 00:03:04,820 So, "we can find how the columns differ using a Python set". 46 00:03:04,820 --> 00:03:09,070 If you haven't heard of sets, they're basically like a list with only unique values in them. 47 00:03:09,390 --> 00:03:12,710 Go set(X_train.columns) 48 00:03:14,930 --> 00:03:24,910 minus set(df_test.columns), and now this is going to tell us which columns are in X_train but not in df_test. 49 00:03:24,990 --> 00:03:26,850 auctioneerID_is_missing. 50 00:03:26,850 --> 00:03:32,550 So our df_test has no column auctioneerID_is_missing. 51 00:03:32,630 --> 00:03:37,550 So what that means is, when we imported our test data frame, according to our function it didn't have 52 00:03:37,550 --> 00:03:40,010 any auctioneerID values that are missing. 53 00:03:40,010 --> 00:03:48,470 So what we can do is manually add a column to our df_test data frame that has all False values for 54 00:03:48,500 --> 00:03:48,890 auctioneerID_is_missing. 55 00:03:48,890 --> 00:03:55,910 The reason there is no such column... and this is me, if 56 00:03:55,910 --> 00:03:58,820 you can't tell, having trouble saying auctioneerID. 57 00:03:59,120 --> 00:04:04,660 The reason why our df_test data frame has no column named this, right, 58 00:04:04,760 --> 00:04:12,180 this one here I'm highlighting, is because all of the auctioneerID values were filled in the 59 00:04:12,230 --> 00:04:12,650 test set. 60 00:04:12,980 --> 00:04:29,990 So what we have to do is manually adjust df_test to have an auctioneerID_is_missing column, and this 61 00:04:29,990 --> 00:04:35,870 is just as simple as df_test... or I'll just copy this here. 
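The set comparison described above can be sketched on two toy data frames. This is only an illustration: the X_train and df_test frames below are small made-up stand-ins for the ones in the notebook, and YearMade is just a placeholder column. The second subtraction is an extra check not done in the video, to catch columns the test set has that the model never saw.

```python
import pandas as pd

# Toy stand-ins for the notebook's X_train and df_test (all values are made up).
X_train = pd.DataFrame({
    "YearMade": [1996, 2004],
    "auctioneerID": [3.0, 2.0],
    "auctioneerID_is_missing": [False, True],
})
df_test = pd.DataFrame({
    "YearMade": [1999, 2001],
    "auctioneerID": [1.0, 4.0],
})

# Columns the model was trained on that the test data frame lacks.
missing_from_test = set(X_train.columns) - set(df_test.columns)

# Columns in the test data frame that the training data never had.
extra_in_test = set(df_test.columns) - set(X_train.columns)

print(missing_from_test)
print(extra_in_test)
```

Set subtraction only goes one way, so running it in both directions is a cheap way to be sure the two frames differ only where you expect.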
62 00:04:36,080 --> 00:04:41,430 So we'll create a new column and we're going to set it to just False, because it had no missing values. 63 00:04:41,480 --> 00:04:47,860 And now let's go df_test.head(). 64 00:04:47,870 --> 00:04:51,170 Wrong one, different df. df_test.head(). 65 00:04:51,190 --> 00:04:52,060 102 columns. 66 00:04:52,060 --> 00:04:53,430 Beautiful. 67 00:04:53,510 --> 00:04:54,960 So if we come right to the end it should have 68 00:04:54,970 --> 00:04:56,980 auctioneerID_is_missing. 69 00:04:56,980 --> 00:04:57,790 Wonderful. 70 00:04:58,180 --> 00:05:11,650 So this means, "finally now our test data frame has the same features as our training data frame, 71 00:05:11,830 --> 00:05:17,110 we can make predictions". Let's do it. 72 00:05:18,460 --> 00:05:23,320 So, "make predictions on the test data". 73 00:05:23,320 --> 00:05:23,950 Wonderful. 74 00:05:23,980 --> 00:05:29,400 So we'll go test_preds equals, on df_test, 75 00:05:29,560 --> 00:05:30,700 ideal_model. 76 00:05:30,700 --> 00:05:39,190 So ideal_model.predict(df_test). Boom, that worked. 77 00:05:39,190 --> 00:05:39,970 Now let's have a look. 78 00:05:39,970 --> 00:05:43,400 They should all be prices, an array of prices, and it's going to be... 79 00:05:43,570 --> 00:05:45,310 How big is our test data frame? 80 00:05:45,310 --> 00:05:47,760 So these are all the sale prices that we've predicted. 81 00:05:47,980 --> 00:05:51,310 So 12,457 samples. 82 00:05:51,310 --> 00:05:58,870 But it's not going to help us in an array format, because Kaggle expects it to be 83 00:06:03,200 --> 00:06:09,190 in a data frame with two columns, SalesID and SalePrice. So what we can do is turn it into that. 84 00:06:09,230 --> 00:06:16,610 So, "format predictions into the same format Kaggle is after". 85 00:06:19,670 --> 00:06:29,350 "We've made some predictions but they're not yet in the same format Kaggle is asking for." 
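As a rough sketch of the column fix just described, the snippet below adds the missing indicator column to a toy df_test and then lines its columns up with a toy X_train. The frames and the YearMade column are made up for illustration. The reordering step is an addition that is not shown in the video: as far as I know, recent scikit-learn versions also compare column names and their order against what the model was fitted on, so matching the order is a reasonable precaution.

```python
import pandas as pd

# Toy stand-ins for the notebook's X_train and df_test (all values are made up).
X_train = pd.DataFrame({
    "YearMade": [1996, 2004],
    "auctioneerID_is_missing": [False, True],
    "auctioneerID": [3.0, 2.0],
})
df_test = pd.DataFrame({
    "YearMade": [1999, 2001],
    "auctioneerID": [1.0, 4.0],
})

# The test set had no missing auctioneerID values, so every row gets False.
df_test["auctioneerID_is_missing"] = False

# Precaution (assumption, not from the video): put the columns in the same
# order as the training data before handing the frame to the model.
df_test = df_test[X_train.columns]

print(df_test.shape)
print(list(df_test.columns) == list(X_train.columns))
```

With the columns matching, a frame like this is what would then be passed to ideal_model.predict in the notebook.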
86 00:06:30,110 --> 00:06:36,260 Now we'll just link that here, the evaluation page, so we know we can go there. 87 00:06:37,540 --> 00:06:37,940 OK. 88 00:06:37,970 --> 00:06:39,590 So how would we do this? 89 00:06:39,590 --> 00:06:43,940 Well, we need the SalesID column and we need the SalePrice prediction. 90 00:06:43,940 --> 00:06:52,400 So what we might do is just make a simple data frame: df_preds equals pd.DataFrame(), so an empty data frame. 91 00:06:52,810 --> 00:07:03,000 In df_preds, the SalesID column is just going to be SalesID from df_test, so we can go equals 92 00:07:03,090 --> 00:07:12,950 df_test SalesID. Ah, well, that needs to be the same capitalization. And then the SalePrice is 93 00:07:12,950 --> 00:07:16,810 going to be test_preds. 94 00:07:16,880 --> 00:07:24,520 So just our array here, and then let's have a look at df_preds. Boom. 95 00:07:24,760 --> 00:07:25,450 There we go. 96 00:07:25,450 --> 00:07:26,190 Look at that. 97 00:07:26,200 --> 00:07:27,520 How exciting. 98 00:07:27,520 --> 00:07:32,110 We've got a submission that we could submit to this Kaggle competition, but at the moment, well, this 99 00:07:32,110 --> 00:07:33,600 Kaggle competition is no longer running. 100 00:07:33,610 --> 00:07:39,070 But that's an example of how you could get your data into the format that it's asking for. And then to export 101 00:07:39,070 --> 00:07:40,270 that, how might you do that? 102 00:07:40,900 --> 00:07:47,770 Well, if we wanted to export this to CSV we might do something like df_preds.to_csv. We'll put a little comment 103 00:07:47,770 --> 00:07:48,090 here, 104 00:07:50,980 --> 00:08:02,980 "export prediction data". to_csv, and we'll put it into data, and we'll go test_predictions. 105 00:08:02,980 --> 00:08:04,500 Something easy. 106 00:08:04,520 --> 00:08:17,470 .csv, and index equals False. Wonderful. Actually, no, this is bluebook-for-bulldozers. 107 00:08:17,630 --> 00:08:19,110 We'll see if that'll export. 
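Here is a minimal sketch of building the two-column submission frame and exporting it. The SalesID values and predicted prices are made up, the SalesID and SalePrice column names are assumed to match Kaggle's sample submission, and the file goes to a temporary folder rather than the notebook's data folder so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Made-up stand-ins: three test rows and three predicted prices.
df_test = pd.DataFrame({"SalesID": [1000001, 1000002, 1000003]})
test_preds = np.array([17000.0, 19500.0, 48000.0])

# Build the submission frame column by column, starting from an empty frame.
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds

# index=False keeps the row index out of the CSV, leaving only the two columns.
out_path = Path(tempfile.mkdtemp()) / "test_predictions.csv"
df_preds.to_csv(out_path, index=False)

print(out_path.read_text())
```

Leaving out index=False would write the row index as an extra unnamed first column, which a submission checker expecting exactly two columns would likely reject.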
108 00:08:19,280 --> 00:08:21,650 Let's go up into our data folder. 109 00:08:21,800 --> 00:08:23,980 Do we have that test_predictions.csv? 110 00:08:24,070 --> 00:08:25,400 Wonderful. 111 00:08:25,400 --> 00:08:26,360 OK. 112 00:08:26,420 --> 00:08:31,730 And so now we've got some test predictions. We can't really evaluate these, though, because we don't have 113 00:08:31,730 --> 00:08:33,860 the ground truth labels for the test dataset. 114 00:08:33,860 --> 00:08:37,160 That's why we were doing our evaluation on the validation dataset. 115 00:08:37,940 --> 00:08:43,110 So what we might look at, one final thing to wrap up this project, is feature importance. 116 00:08:43,130 --> 00:08:49,550 So we've just used the patterns... as in, here where we made predictions on the test data, we've 117 00:08:49,550 --> 00:08:55,400 used the patterns that our model has found in the training dataset to make some predictions on sale 118 00:08:55,400 --> 00:09:01,870 price, and these patterns are how each of these columns contributes to predicting the sale price. 119 00:09:01,880 --> 00:09:11,000 So a logical thing, or a good idea to figure out, would be feature importance: which columns here, 120 00:09:11,090 --> 00:09:15,770 or which features, meant the most when the model was trying to make a prediction? 121 00:09:15,830 --> 00:09:17,150 So let's do that in the next video.