Welcome back. We left off the last video asking the question: how do we improve this score? And what if Ridge wasn't working?

Then we referred back to our machine learning map. What I might do is put the link in here and go: let's refer back to the map, link there. And I'm going to pull it up here. Step one: check the scikit-learn machine learning map. You can find this pretty easily by searching "scikit-learn machine learning map".

Now, remember how right at the very start we talked about getting focused on just writing machine learning code: deciphering what problem we're working on, then finding the right model for it and applying that model? That's exactly what we've just done here. We've discovered what our problem is, we've imported our data, and we want to use these characteristics about different towns and cities to figure out the average house price. So that's what we've done. We've deciphered our problem, gone to the machine learning map and gone, "hey, we're going to follow this through", and we found Ridge regression, because we answered a few questions to get us to here. Then we figured, well, that's not working, so we might try something else. And that's where we come upon ensemble regressors. Let's click on that and see what it says.

Ensemble methods. "The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability" (that's a big word) "slash robustness" (robustness... far out, I can't speak) "over a single estimator." Now, what this means is that it's essentially combining a whole bunch of smaller models. So imagine if we used 10 different Ridge regression models, combined them, asked them to score each other, and then averaged that score. That's the premise of ensemble methods (there's a small sketch of this idea just below).

And why am I talking about ensemble methods? Well, that's where random forests come from: forests of randomized trees. Now, the reason why we go straight to randomized trees is because the random forest is kind of like the workhorse of machine learning.
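To make that premise concrete, here's a minimal sketch of the averaging idea. This isn't code from the video: it uses a made-up dataset from make_regression purely for illustration, trains 10 Ridge models on bootstrap samples, and averages their predictions, which is a hand-rolled version of what ensemble methods automate for us.

```python
# A minimal sketch of the ensemble premise (not from the video): train
# several base models on random samples of the data, then average their
# predictions. The dataset here is synthetic, purely for illustration.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
predictions = []
for _ in range(10):
    # Draw a bootstrap sample (random rows, with replacement) and fit a Ridge model on it
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    predictions.append(Ridge().fit(X_train[idx], y_train[idx]).predict(X_test))

# The ensemble's prediction is the average of the 10 individual predictions
ensemble_pred = np.mean(predictions, axis=0)
print(ensemble_pred[:5])
```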
Random forests. You see, there's a classifier version, RandomForestClassifier, and a regressor version, RandomForestRegressor. That's why we're focused on it. Almost no matter what structured data problem you have (and by structured data I mean things that you find in a dataframe or a table, something that looks like this), if there are patterns between the features and labels, a.k.a. these columns and the target column, chances are a random forest will find them, at least somewhat. So that's why we're focusing on the random forest and ensemble methods.

Now, how can we use an ensemble method? Well: from sklearn.ensemble import RandomForestClassifier, define X and y, instantiate the RandomForestClassifier, and then fit it. Huh, that seems pretty straightforward. In fact, we've actually seen this, but let's see it for our particular problem. So, we've tried the Ridge regression model; let's try the random forest.

So how do we do this? from sklearn.ensemble import RandomForestRegressor. Now we're going to set up a random seed, because we like our results to be reproducible: np.random.seed, and we love the number 42. Then we're going to create the data. We could actually just copy everything from the cell above and sub out Ridge for RandomForestRegressor, but we're not going to do that; we want to get in the habit of this kind of setup, so we're going to write it out. Create the data: X = boston_df.drop("target", axis=1), so we remove the target column with axis=1. Wonderful. y = boston_df["target"] (and this is boston_df; I wrote the wrong name there at first), because we want to use the target column here. Wonderful.

Split the data: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2). Beautiful, exactly the same code as above. Then we're going to instantiate the RandomForestRegressor. We could call it model, or maybe we'll call it something else just so we can quickly compare: rf = random... no, we need a capital. Can we hit tab to autocomplete RandomForestRegressor? Yes we can. Wonderful. We're going to rf.fit (rf stands short for random forest; I've just made that up) with X_train and y_train, and then we want to evaluate the random forest regressor. That makes sense. Beautiful. So, rf.score(X_test, y_test). Is that correct?
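Put together, the cell we've just written looks like this. It's a reconstruction from the narration above, so treat it as a sketch: it assumes boston_df (the Boston housing DataFrame with a "target" column) already exists from earlier in the notebook.

```python
# Reconstruction of the cell above. Assumes boston_df (the Boston housing
# DataFrame with a "target" column) was created earlier in the notebook.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Setup random seed so our results are reproducible
np.random.seed(42)

# Create the data
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the random forest regressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Evaluate the random forest regressor (score returns R^2 for regressors)
rf.score(X_test, y_test)
```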
Mm hmm. Before we even run it, what do you think will happen here?

Does it look kind of similar to this? Well, in fact it does; I kind of gave that away before, but let's run it. If in doubt, run the code. We get this little warning here because, in the future, if you're using scikit-learn 0.22 or above, n_estimators, which is a hyperparameter here, is going to be set to 100 by default. So we can remove that warning by setting n_estimators to 100. 0.873. Now, what did our Ridge model get? 0.666. Well, that could be an omen. Let's run model.score as well, just so we can see them side by side. Maybe put a little comment here so we know what's going on: check the Ridge model again, X_test, y_test. And we know they're using the same splits here, because we've set up a NumPy random seed, so model.score is running on the same data, right? We could even comment this out in a cell just to be sure. We'll comment it out... oh, that's even done better there. And you know why that score changes, even though we've got a random seed? It's because there's some randomness within RandomForestRegressor, hence "random" forest: it's randomly creating models every time you run it.

So, just by using a different model, we improved our score by about 0.2. You can clearly see that this number is bigger than this number. And that's all from just defining what our problem wants and then paying attention to our machine learning map.
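Here's what that comparison cell might look like, again reconstructed from the narration. It assumes model is the fitted Ridge instance from the earlier cell, and the exact scores will vary from run to run for the reason just mentioned.

```python
# Set n_estimators explicitly to silence the future-default warning
# (from scikit-learn 0.22 onwards, 100 is the default anyway)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# Check both models on the same test split. `model` is assumed to be the
# Ridge instance fit earlier; the splits match because of the random seed.
print(f"RandomForestRegressor: {rf.score(X_test, y_test):.3f}")    # about 0.87 in the video
print(f"Ridge:                 {model.score(X_test, y_test):.3f}")  # about 0.67 in the video
```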
So this is kind of what you want to do at the start of every machine learning problem or project that you're working on: figure out what kind of problem you're trying to work on. Is it regression, are you trying to predict a number? Is it classification, are you trying to predict whether something is one thing or another? Is it clustering, or is it dimensionality reduction? Once you know what problem you're working with, you can follow the steps in this flowchart and figure out which scikit-learn estimator, a.k.a. machine learning model, you can use.

So, what are we going to do next? That was an example of working through the map for a regression problem. You can probably try this ahead of time, if you want to be a daredevil like that: see if you can choose which estimator you should use for a classification problem. Start here, go through here, use the heart disease dataset we've been working with (heart-disease.csv), and see if you can figure out which one it is. Or you could just scroll up and see the code we've written before. But let's set that up: 2.2, choosing an estimator for a classification problem. And we'll remind ourselves: let's go to the map. Excellent.

So that's what we're going to cover in the next one. We've seen what it's like choosing an estimator for a regression problem, and now we're going to see what it's like for a classification problem. I'll see you in the next video.
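If you want to attempt that challenge before the next video, here's a minimal sketch of one possible setup. It assumes heart-disease.csv is in the working directory with a "target" column (as in earlier videos), and it jumps straight to RandomForestClassifier; following the map yourself may route you through other estimators first.

```python
# A possible starting point for the classification challenge (a sketch,
# not the course's official solution). Assumes heart-disease.csv has a
# "target" column, as in earlier videos.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)

heart_disease = pd.read_csv("heart-disease.csv")
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set
```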