1
00:00:00,280 --> 00:00:06,220
OK, so now we've got this behemoth of a function that's going to preprocess our data.

2
00:00:06,340 --> 00:00:12,240
Let's try running it on the df_test that we just pulled up here and tried to make some predictions on.

3
00:00:12,240 --> 00:00:14,070
So we imported it here.

4
00:00:14,070 --> 00:00:17,880
Tried to make some predictions on it, but it didn't work because it's not in the same format as the training

5
00:00:17,880 --> 00:00:18,820
data.

6
00:00:18,870 --> 00:00:25,980
Let's run it through our preprocess_data function and hopefully, if we've done it correctly, it's going to make

7
00:00:25,980 --> 00:00:29,640
the same changes on df_test as we have on the training data.

8
00:00:30,210 --> 00:00:31,230
So let's go here.

9
00:00:31,230 --> 00:00:36,030
df_test equals... oh, actually, we'll put a little comment here.

10
00:00:36,030 --> 00:00:45,420
Process the test data. And we might just re-instantiate to make sure that we've hit Shift+Enter here, Shift

11
00:00:45,420 --> 00:00:51,570
Enter here, just to make sure that these cells have run. I've been notorious for writing a function and

12
00:00:51,570 --> 00:00:54,010
then not running the cell, so it actually doesn't work.

13
00:00:54,060 --> 00:01:03,350
preprocess_data(df_test), and then we should be able to check it out, like df_test.head(). OK,

14
00:01:03,450 --> 00:01:05,830
101 columns. Oh, that's looking better.

15
00:01:05,840 --> 00:01:11,140
So we're up a fair few columns from what we started with. If we go back up, we started with 52

16
00:01:11,150 --> 00:01:12,560
columns.

17
00:01:12,560 --> 00:01:18,770
Now if we go through here, going to the end, there's a fair few missing.

18
00:01:18,770 --> 00:01:19,160
All right.

19
00:01:19,250 --> 00:01:19,610
OK.

20
00:01:19,640 --> 00:01:23,450
So it's going to be missing the SalePrice column.

21
00:01:23,460 --> 00:01:23,680
OK.

22
00:01:23,690 --> 00:01:28,880
That makes sense, but it's still a different shape to X_train.

23
00:01:28,880 --> 00:01:32,180
So we've got one hundred and one versus one hundred and two columns.

24
00:01:32,360 --> 00:01:38,000
Now, if you remember, at the end of the last video I asked you to have a think about where this function might

25
00:01:38,000 --> 00:01:44,610
break. If you're not sure, that's fine. I've stumbled across this many a time in the process

26
00:01:44,610 --> 00:01:50,100
of trying to get our test data into the same format as our training data. And what

27
00:01:50,100 --> 00:01:58,470
it is, is that the test data, when we imported it, is slightly different to the training data. As in, let

28
00:01:58,470 --> 00:02:00,850
me demonstrate. We're doing model-driven EDA.

29
00:02:00,870 --> 00:02:02,890
So we'll try to make a prediction again.

30
00:02:03,150 --> 00:02:09,740
So make predictions on updated test data.

31
00:02:10,120 --> 00:02:20,880
We'll go test_preds equals ideal_model.predict(df_test), and we get a ValueError again.

32
00:02:20,910 --> 00:02:22,350
Hold on, I just preprocessed it.

33
00:02:22,350 --> 00:02:22,920
What's wrong?

34
00:02:23,070 --> 00:02:29,270
Let's check it out. Number of features of the model must match the input. Model n_features is

35
00:02:29,270 --> 00:02:29,590
102.

36
00:02:29,630 --> 00:02:29,950
Okay.

37
00:02:29,960 --> 00:02:34,590
That's what our model has been trained on, and input n_features is 101.

38
00:02:35,470 --> 00:02:39,200
Well, we just saw this when we went X_train.head().

39
00:02:39,500 --> 00:02:45,320
That's 102 columns, so there is a difference in the number of columns that our test data

40
00:02:45,320 --> 00:02:47,150
frame and our training data frame have.

41
00:02:47,180 --> 00:02:49,970
Hence why our machine learning model doesn't work.

42
00:02:49,970 --> 00:02:52,430
So how would we figure out where the difference is?

43
00:02:52,520 --> 00:02:57,020
Well, we can do that using the columns attribute and turning them into sets.

44
00:02:57,200 --> 00:02:57,820
Let's have a look.

45
00:02:58,340 --> 00:03:04,820
So we can find how the columns differ using a Python set.

46
00:03:04,820 --> 00:03:09,070
If you've heard of sets, they're basically like a list with only unique values in them.

47
00:03:09,390 --> 00:03:12,710
Go set(X_train.columns)

48
00:03:14,930 --> 00:03:24,910
minus set(df_test.columns), and now this is going to tell us which columns are in X_train but not in df_test.

49
00:03:24,990 --> 00:03:26,850
auctioneerID_is_missing.

50
00:03:26,850 --> 00:03:32,550
So our df_test has no column auctioneerID_is_missing.

51
00:03:32,630 --> 00:03:37,550
So what that means is, when we imported our test data frame, according to our function it didn't have

52
00:03:37,550 --> 00:03:40,010
any auctioneerID values that were missing.

53
00:03:40,010 --> 00:03:48,470
So what we can do is manually add a column to our df_test data frame that has all False values for

54
00:03:48,500 --> 00:03:48,890
auctioneerID_is_missing.

55
00:03:48,890 --> 00:03:55,910
The reason being, there is no column auctioneerID_is_missing... and if

56
00:03:55,910 --> 00:03:58,820
you can't tell, I'm having trouble saying auctioneerID.

57
00:03:59,120 --> 00:04:04,660
The reason why our df_test data frame has no column named this, right,

58
00:04:04,760 --> 00:04:12,180
this one here that I'm highlighting, is because all of the auctioneerID values were filled in the

59
00:04:12,230 --> 00:04:12,650
test.

60
00:04:12,980 --> 00:04:29,990
So what we have to do is manually adjust df_test to have an auctioneerID_is_missing column, and this

61
00:04:29,990 --> 00:04:35,870
is just as simple as going df_test... oh, we'll just copy this here.

62
00:04:36,080 --> 00:04:41,430
So we'll create a new column and we're going to set it to just False, because it had no missing values.

63
00:04:41,480 --> 00:04:47,860
And df_test.head()... and now, whoops, df_test.head().

64
00:04:47,870 --> 00:04:51,170
Wrong one. Different to df_test.head().

65
00:04:51,190 --> 00:04:52,060
102 columns.

66
00:04:52,060 --> 00:04:53,430
Beautiful.

67
00:04:53,510 --> 00:04:54,960
So if we come right to the end, it should have

68
00:04:54,970 --> 00:04:56,980
auctioneerID_is_missing.

69
00:04:56,980 --> 00:04:57,790
Wonderful.

70
00:04:58,180 --> 00:05:11,650
So this means finally now our test data frame has the same features as our training data frame.

71
00:05:11,830 --> 00:05:17,110
We can make predictions. Let's do it.

72
00:05:18,460 --> 00:05:23,320
So make predictions on the test data.

73
00:05:23,320 --> 00:05:23,950
Wonderful.

74
00:05:23,980 --> 00:05:29,400
So we'll go test_preds equals df_test... no, wait.

75
00:05:29,560 --> 00:05:30,700
ideal_model.

76
00:05:30,700 --> 00:05:39,190
So ideal_model.predict(df_test). Boom, that worked.

77
00:05:39,190 --> 00:05:39,970
Now let's have a look.

78
00:05:39,970 --> 00:05:43,400
They should all be prices, an array of prices, and it's going to be

79
00:05:43,570 --> 00:05:45,310
however big our test data frame is.

80
00:05:45,310 --> 00:05:47,760
So these are all the sale prices that we've predicted.

81
00:05:47,980 --> 00:05:51,310
So 12,457 samples.

82
00:05:51,310 --> 00:05:58,870
But it's not gonna help us in an array format because Kaggle expects it to be

83
00:06:03,200 --> 00:06:09,190
in a data frame with two columns, SalesID and SalePrice. So what we can do is turn it into that.

84
00:06:09,230 --> 00:06:16,610
So format predictions into the same format Kaggle is after

85
00:06:19,670 --> 00:06:29,350
We've made some predictions, but they're not yet in the same format Kaggle is asking for.

86
00:06:30,110 --> 00:06:36,260
Now we'll just link that here, the evaluation page, so we know we can go there.

87
00:06:37,540 --> 00:06:37,940
Okay.

88
00:06:37,970 --> 00:06:39,590
So how would we do this?

89
00:06:39,590 --> 00:06:43,940
Well, we need the SalesID column and we need the SalePrice prediction.

90
00:06:43,940 --> 00:06:52,400
So what we might do is just make a simple data frame: df_preds equals pd.DataFrame(), so an empty data frame.

91
00:06:52,810 --> 00:07:03,000
df_preds, the SalesID column, is just going to be SalesID from df_test, so we can go equals

92
00:07:03,090 --> 00:07:12,950
df_test SalesID... oh, well, that needs to be the same capitalization. And then the SalePrice is

93
00:07:12,950 --> 00:07:16,810
going to be test_preds.

94
00:07:16,880 --> 00:07:24,520
So just our array here, and then let's have a look at df_preds. Boom.

95
00:07:24,760 --> 00:07:25,450
There we go.

96
00:07:25,450 --> 00:07:26,190
Look at that.

97
00:07:26,200 --> 00:07:27,520
How exciting.

98
00:07:27,520 --> 00:07:32,110
We've got a submission that we could submit to this Kaggle competition, but at the moment... well, this

99
00:07:32,110 --> 00:07:33,600
Kaggle competition is no longer running.

100
00:07:33,610 --> 00:07:39,070
But that's an example of how you could get your data into the format that it's asking for. And then, to export

101
00:07:39,070 --> 00:07:40,270
that, how might you do that?

102
00:07:40,900 --> 00:07:47,770
Well, if we wanted to export this to CSV, we might do something like df_preds.to... we'll put a little comment

103
00:07:47,770 --> 00:07:48,090
here

104
00:07:50,980 --> 00:08:02,980
Export prediction data: to_csv. We'll put it into data and we'll go test_predictions.

105
00:08:02,980 --> 00:08:04,500
Something easy.

106
00:08:04,520 --> 00:08:17,470
.csv, and index equals False. Wonderful. Oh, actually, no, this is Bluebook for Bulldozers.

107
00:08:17,630 --> 00:08:19,110
We'll see if that'll export.

108
00:08:19,280 --> 00:08:21,650
Let's go up into our data folder.

109
00:08:21,800 --> 00:08:23,980
Do we have that? test_predictions.csv.

110
00:08:24,070 --> 00:08:25,400
Wonderful.

111
00:08:25,400 --> 00:08:26,360
Okay.

112
00:08:26,420 --> 00:08:31,730
And so now we've got some test predictions. We can't really evaluate these, though, because we don't have

113
00:08:31,730 --> 00:08:33,860
the ground truth labels for the test dataset.

114
00:08:33,860 --> 00:08:37,160
That's why we were doing our evaluation on the validation dataset.

115
00:08:37,940 --> 00:08:43,110
So what we might look at, as one final thing to wrap up this project, is feature importance.

116
00:08:43,130 --> 00:08:49,550
So we've just used the patterns... where is it? Here, where we made predictions on the test data, we've

117
00:08:49,550 --> 00:08:55,400
used the patterns that our model has found in the training data set to make some predictions on sale

118
00:08:55,400 --> 00:09:01,870
price, and these patterns are how each of these columns contributes to predicting the sale price.

119
00:09:01,880 --> 00:09:11,000
So a logical thing, or a good idea, to figure out would be feature importance: which columns here,

120
00:09:11,090 --> 00:09:15,770
or which features, meant the most when the model was trying to make a prediction.

121
00:09:15,830 --> 00:09:17,150
So let's do that in the next video.