In the last video we saw how we might find some of the most ideal hyperparameters for our RandomForestRegressor. But we only set the number of iterations to 2. In your case you might have done this differently, you might have had some more time, but in the interest of time for recording these videos I only set it to 2. However, in a previous experiment (a.k.a. step 6) I set it to 100 and found some more ideal hyperparameters than what we have here.

So that's what we'll do: in this video we're going to train a model with the best hyperparameters, and we'll put a little note here: "Note: these were found after 100 iterations of RandomizedSearchCV."

Now, even if you were to do something like this, 100 iterations of RandomizedSearchCV, you may still find different hyperparameters, potentially better hyperparameters. Hyperparameter tuning, finding the best hyperparameters for a model, is, as we've talked about before, not an exact science; it involves a lot of trial and error, a lot of experimentation. So for this one you're just going to have to kind of trust me that I've run those 100 iterations. It took a couple of hours on my Mac, so I'll just show you which hyperparameters we found to be most ideal.

Notice the pattern here: because RandomizedSearchCV was only trained on 10,000 examples (I'm scrolling back and forth here), what you'll probably do in practice is use RandomizedSearchCV to find some ideal hyperparameters across a search space like this on a subset of the data, because otherwise it would take hours and hours and hours, and then retrain your model on the full dataset. That's what we're about to do now, with the most ideal hyperparameters that were found by RandomizedSearchCV.

So let's call this one ideal_model and we'll set it up here: RandomForestRegressor. In my case, n_estimators was 40. Now this is interesting, right, because the default in RandomForestRegressor (if we go here, Shift+Tab) is 100. So RandomizedSearchCV actually found that 100 estimators weren't required; 40 still provides pretty good results. min_samples_leaf was found to be 1, min_samples_split was found to be 14, and max_features was found to be 0.5. n_jobs of course we're going to set to -1, because we want our computer to use all of the processors it has. And in my case I'll also set max_samples=None, because we want to train on all of the data; max_samples=None means it will train on as many samples as possible. And now we're going to fit the ideal model we've just instantiated on the training data, X_train and y_train.
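Here's a sketch of that cell. It assumes RandomForestRegressor was imported from sklearn.ensemble earlier in the notebook and that the X_train and y_train splits are already in scope; the random_state line is the one I mention adding a little later on:

```python
%%time

# Most ideal hyperparameters found with 100 iterations of RandomizedSearchCV
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,          # use all available processors
                                    max_samples=None,   # train on all of the samples
                                    random_state=42)    # so our results are reproducible

# Fit the ideal model on the full training dataset
ideal_model.fit(X_train, y_train)
```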
This may take a couple of minutes, so what we'll do is I'll run this cell and again do a little time travel; once the training is complete we'll be able to see how long it took, because we've got %%time there, and then we'll evaluate our model.

...Beautiful. So that took about one minute and 14 seconds, and that's on the entire dataset. The main reason it's so quick is that n_estimators is 40 rather than 100, which means the random forest is building 40 smaller models rather than 100, so almost two and a half times fewer. And I did forget one parameter that we probably should have put in here: random_state=42. I'm not going to rerun the model now, but just so it's there I'll put a little comment here: random_state. Setting random_state is the same idea as setting a NumPy random seed; it's so our results are reproducible. If we were to run this cell again, over and over and over, with random_state set to 42, we would get the same model being fit.

Again, that took about 1 minute 14 seconds on my computer; it may take a little bit longer or a little bit shorter on yours, depending on how much processing power you have. And if you're running RandomizedSearchCV for more iterations than what we've done up here, you may potentially find different hyperparameters to the ones I've found. If you have, I encourage you to share them so other people can try them out, and we'll evaluate them to see how they go, to see whose hyperparameters are the real best ones.

So let's check our model out with our handy show_scores function, and we'll pass it the model we just trained. Shift+Enter... beautiful. Because this one is trained on all the data, we should see a significant improvement over the model we trained before. So let's go up... here we go, rs_model. What we might do is bring these two closer to each other, so we can compare the scores for ideal_model (trained on all the data) against the scores for rs_model (only trained on 10,000 examples).
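That comparison is just two calls to the show_scores helper from earlier in the notebook (two separate cells in my case). As a reminder, if yours matches mine, show_scores returns a dictionary of training and validation metrics, including the valid RMSLE we care about here:

```python
# Scores for ideal_model (trained on all the data)
show_scores(ideal_model)
```

```python
# Scores for rs_model (only trained on ~10,000 examples)
show_scores(rs_model)
```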
It's the exact same line of code we ran a couple of cells up, just repeated here so we can directly compare the two. Alright, so let's look here: the valid RMSLE is the one we especially want to pay most attention to. As you can see, with our ideal model, in one minute and 14 seconds of training time it's gone through 400,000 or so rows and reduced the valid RMSLE (root mean squared log error... yeah, that's a mouthful) by, what is that, about 0.08. So that's pretty damn good.

And as we saw, we were already doing well on the Kaggle competition leaderboard with our original model. Let's see where our new and improved one, our ideal model, gets us. So we're looking at this here: valid root mean squared log error, 0.246. Where does that get us on the leaderboard? 0.246... alright, so around about there, about 30th. So we're almost in the top 30 on the Kaggle leaderboard with a model that took a minute to run, using a RandomForestRegressor straight out of scikit-learn. That's pretty damn impressive, and we haven't really done an incredible amount of data manipulation: we've turned our data into numbers and we've filled missing values, but we're already getting some incredible results.

So what we'll probably do is leave our ideal model there. We could search for longer with RandomizedSearchCV up here, find some better hyperparameters, and probably slightly improve our model even more, maybe push us right into the top 25. That'd be pretty cool. But what we might do now is see how we'd make a submission to Kaggle.

So what does this mean? If we come here, how do you even get on the leaderboard? Well, if we go here to Data: to get on the leaderboard we have to make some predictions on Test.csv. Right now our predictions are on Valid.csv, but all of these scores on the leaderboard come from the test set. So really we're just guesstimating where we'd end up, because at the moment we're predicting on Valid.csv; we're evaluating our model on a validation set. But if we come back to Overview and then Evaluation, what we have to do is submit a submission file. Okay: the submission file should be formatted as follows: SalesID, SalePrice.
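In other words, Kaggle wants a two-column CSV along these lines (the values below are made-up placeholders, not real predictions):

```
SalesID,SalePrice
1111111,30000.00
2222222,45000.00
...
```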
So that's what we might do now: import the test dataset and format it in a way so we can use our machine learning model on it, then create an example submission in the format Kaggle is asking of us. That sounds like a good plan. Now that we've got an ideal model, we'll use it to make predictions on the test dataset.
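As a rough preview of where that's heading, here's a minimal sketch with plenty of assumptions: the file path is a guess at the project layout, df_test and test_preds are hypothetical names, and in reality the test data needs the same preprocessing as the training data before predict will work, which is exactly what we'll cover next:

```python
import pandas as pd

# Import the test dataset (path and date column are assumptions)
df_test = pd.read_csv("data/Test.csv",
                      low_memory=False,
                      parse_dates=["saledate"])

# ...the same preprocessing steps we applied to the training data go here...

# Make predictions on the test data with the ideal model
test_preds = ideal_model.predict(df_test)

# Format the predictions the way Kaggle asks: SalesID, SalePrice
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds.to_csv("test_predictions.csv", index=False)
```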