In the last video we saw how we might find some of the most ideal hyperparameters for our RandomForestRegressor. But we only set the number of iterations to 2. In your case you might have done this differently, you might have had some more time, but in the interest of time for recording these videos I only set it to 2. However, in a previous experiment (a.k.a. step 6) I set it to 100 and found some more ideal hyperparameters than what we have here.

So that's what we'll do: in this video we're going to train a model with the best hyperparameters, and we'll put a little note here: "Note: these were found after 100 iterations of RandomizedSearchCV."

Now, even if you were to do something like this, 100 iterations of RandomizedSearchCV, you may still find different hyperparameters, potentially better hyperparameters. Hyperparameter tuning, finding the best hyperparameters for a model, is, as we've talked about before, not an exact science; it involves a lot of trial and error, a lot of experimentation. So for this one you're just going to have to kind of trust me that I've run those 100 iterations. It took a couple of hours on my Mac, so I'll just show you which hyperparameters we found to be most ideal.

Notice the pattern here: because RandomizedSearchCV was only trained on 10,000 examples (I'm scrolling back and forth here), what you'll probably do in practice is use RandomizedSearchCV to find some ideal hyperparameters across a search space like this on a subset of the data, because otherwise it would take hours and hours and hours, and then retrain your model on the full dataset. That's what we're about to do now, with the most ideal hyperparameters that were found by RandomizedSearchCV.

So let's call this one ideal_model and we'll set it up here: RandomForestRegressor. In my case, n_estimators was 40. Now this is interesting, right, because the default in RandomForestRegressor (if we go here, Shift+Tab) is 100. So RandomizedSearchCV actually found that 100 estimators weren't required; 40 still provides pretty good results. min_samples_leaf was found to be 1, min_samples_split was found to be 14, and max_features was found to be 0.5. n_jobs of course we're going to set to -1, because we want our computer to use all of the processors it has. And in my case I'll also set max_samples=None, because we want to train on all of the data; max_samples=None means it will train on as many samples as possible. And now we're going to fit the ideal model we've just instantiated on the training data, X_train and y_train.
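Here's a sketch of that cell. It assumes RandomForestRegressor was imported from sklearn.ensemble earlier in the notebook and that the X_train and y_train splits are already in scope; the random_state line is the one I mention adding a little later on:

```python
%%time

# Most ideal hyperparameters found with 100 iterations of RandomizedSearchCV
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,          # use all available processors
                                    max_samples=None,   # train on all of the samples
                                    random_state=42)    # so our results are reproducible

# Fit the ideal model on the full training dataset
ideal_model.fit(X_train, y_train)
```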
This may take a couple of minutes, so what we'll do is I'll run this cell and again do a little time travel; once the training is complete we'll be able to see how long it took, because we've got %%time there, and then we'll evaluate our model.

...Beautiful. So that took about one minute and 14 seconds, and that's on the entire dataset. The main reason it's so quick is that n_estimators is 40 rather than 100, which means the random forest is building 40 smaller models rather than 100, so almost two and a half times fewer. And I did forget one parameter that we probably should have put in here: random_state=42. I'm not going to rerun the model now, but just so it's there I'll put a little comment here: random_state. Setting random_state is the same idea as setting a NumPy random seed; it's so our results are reproducible. If we were to run this cell again, over and over and over, with random_state set to 42, we would get the same model being fit.

Again, that took about 1 minute 14 seconds on my computer; it may take a little bit longer or a little bit shorter on yours, depending on how much processing power you have. And if you're running RandomizedSearchCV for more iterations than what we've done up here, you may potentially find different hyperparameters to the ones I've found. If you have, I encourage you to share them so other people can try them out, and we'll evaluate them to see how they go, to see whose hyperparameters are the real best ones.

So let's check our model out with our handy show_scores function, and we'll pass it the model we just trained. Shift+Enter... beautiful. Because this one is trained on all the data, we should see a significant improvement over the model we trained before. So let's go up... here we go, rs_model. What we might do is bring these two closer to each other, so we can compare the scores for ideal_model (trained on all the data) against the scores for rs_model (only trained on 10,000 examples).
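That comparison is just two calls to the show_scores helper from earlier in the notebook (two separate cells in my case). As a reminder, if yours matches mine, show_scores returns a dictionary of training and validation metrics, including the valid RMSLE we care about here:

```python
# Scores for ideal_model (trained on all the data)
show_scores(ideal_model)
```

```python
# Scores for rs_model (only trained on ~10,000 examples)
show_scores(rs_model)
```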
It's the exact same line of code we ran a couple of cells up, just repeated here so we can directly compare the two. Alright, so let's look here: the valid RMSLE is the one we especially want to pay most attention to. As you can see, with our ideal model, in one minute and 14 seconds of training time it's gone through 400,000 or so rows and reduced the valid RMSLE (root mean squared log error... yeah, that's a mouthful) by, what is that, about 0.08. So that's pretty damn good.

And as we saw, we were already doing well on the Kaggle competition leaderboard with our original model. Let's see where our new and improved one, our ideal model, gets us. So we're looking at this here: valid root mean squared log error, 0.246. Where does that get us on the leaderboard? 0.246... alright, so around about there, about 30th. So we're almost in the top 30 on the Kaggle leaderboard with a model that took a minute to run, using a RandomForestRegressor straight out of scikit-learn. That's pretty damn impressive, and we haven't really done an incredible amount of data manipulation: we've turned our data into numbers and we've filled missing values, but we're already getting some incredible results.

So what we'll probably do is leave our ideal model there. We could search for longer with RandomizedSearchCV up here, find some better hyperparameters, and probably slightly improve our model even more, maybe push us right into the top 25. That'd be pretty cool. But what we might do now is see how we'd make a submission to Kaggle.

So what does this mean? If we come here, how do you even get on the leaderboard? Well, if we go here to Data: to get on the leaderboard we have to make some predictions on Test.csv. Right now our predictions are on Valid.csv, but all of these scores on the leaderboard come from the test set. So really we're just guesstimating where we'd end up, because at the moment we're predicting on Valid.csv; we're evaluating our model on a validation set. But if we come back to Overview and then Evaluation, what we have to do is submit a submission file. Okay: the submission file should be formatted as follows: SalesID, SalePrice.
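In other words, Kaggle wants a two-column CSV along these lines (the values below are made-up placeholders, not real predictions):

```
SalesID,SalePrice
1111111,30000.00
2222222,45000.00
...
```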
So that's what we might do now: import the test dataset and format it in a way so we can use our machine learning model on it, then create an example submission in the format Kaggle is asking of us. That sounds like a good plan. Now that we've got an ideal model, we'll use it to make predictions on the test dataset.
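As a rough preview of where that's heading, here's a minimal sketch with plenty of assumptions: the file path is a guess at the project layout, df_test and test_preds are hypothetical names, and in reality the test data needs the same preprocessing as the training data before predict will work, which is exactly what we'll cover next:

```python
import pandas as pd

# Import the test dataset (path and date column are assumptions)
df_test = pd.read_csv("data/Test.csv",
                      low_memory=False,
                      parse_dates=["saledate"])

# ...the same preprocessing steps we applied to the training data go here...

# Make predictions on the test data with the ideal model
test_preds = ideal_model.predict(df_test)

# Format the predictions the way Kaggle asks: SalesID, SalePrice
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds.to_csv("test_predictions.csv", index=False)
```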