1 00:00:00,560 --> 00:00:01,300 OK. 2 00:00:01,320 --> 00:00:02,400 Welcome back. 3 00:00:02,400 --> 00:00:07,010 Let's revisit our little list here of what we're covering to see where we're up to. 4 00:00:07,680 --> 00:00:12,390 OK, so we've done zero, an end-to-end scikit-learn workflow, and we've done one, 5 00:00:12,420 --> 00:00:14,110 getting your data ready. 6 00:00:14,130 --> 00:00:15,080 So now we're going to do 7 00:00:15,090 --> 00:00:15,820 number two: 8 00:00:15,930 --> 00:00:20,480 choose the right estimator/algorithm for our problem. 9 00:00:20,580 --> 00:00:23,050 Let's make a little heading for that too. 10 00:00:23,130 --> 00:00:30,540 Choosing the right estimator/algorithm for our problem. 11 00:00:30,540 --> 00:00:35,810 Now, a little thing to note here is that scikit-learn, let's just put it here: 12 00:00:35,820 --> 00:00:50,000 scikit-learn uses estimator as another term for machine learning model or algorithm. 13 00:00:50,170 --> 00:00:51,370 So that's an important thing to note. 14 00:00:51,370 --> 00:00:56,860 You'll come across it if you're using the scikit-learn documentation, or if you're just browsing general 15 00:00:56,860 --> 00:01:00,380 machine learning stuff online: you'll hear a few different names for it, 16 00:01:00,400 --> 00:01:04,370 so machine learning model, machine learning algorithm, or estimator. 17 00:01:04,390 --> 00:01:09,730 In the case of scikit-learn, they kind of have their own nomenclature just to make things a bit easier 18 00:01:09,790 --> 00:01:10,990 across the library. 19 00:01:10,990 --> 00:01:17,300 But if you hear those kinds of terms, just remember they're talking about a machine learning model. 20 00:01:17,390 --> 00:01:18,240 OK. 21 00:01:18,340 --> 00:01:19,930 How do we go about this?
22 00:01:19,930 --> 00:01:26,530 Well, some other things to note: first of all, before you choose an estimator/algorithm for 23 00:01:26,530 --> 00:01:31,250 your problem, you have to figure out what kind of problem you're working with. 24 00:01:31,330 --> 00:01:36,400 And up here we've seen a few different types of problems, but the main ones we're going to be looking 25 00:01:36,400 --> 00:01:46,260 at are classification, a.k.a. predicting whether a sample is one thing or another, 26 00:01:47,170 --> 00:01:56,340 and regression, predicting a number. So classification is like our heart disease problem: 27 00:01:56,430 --> 00:01:59,580 we're trying to predict whether someone has heart disease or not. 28 00:01:59,580 --> 00:02:04,260 And regression is where we're trying to predict a number, like with the Boston Housing dataset or with our 29 00:02:04,260 --> 00:02:05,680 car sales dataset: 30 00:02:05,730 --> 00:02:11,340 we're trying to predict a house price or a car price. That doesn't look very good. 31 00:02:11,340 --> 00:02:12,840 Let's go dot points. 32 00:02:12,840 --> 00:02:16,520 This thing's got to look good, right? 33 00:02:16,590 --> 00:02:21,860 So in the previous examples I've kind of just randomly imported this RandomForestRegressor. 34 00:02:21,900 --> 00:02:24,270 But where did I get that from? 35 00:02:24,330 --> 00:02:26,220 And I'll let you in on a little secret: 36 00:02:26,220 --> 00:02:27,860 it's not that trivial. 37 00:02:27,870 --> 00:02:33,570 There are a lot of different machine learning models, but scikit-learn has made something very helpful, 38 00:02:33,600 --> 00:02:35,800 and I'm very excited to show it to you. 39 00:02:35,820 --> 00:02:36,510 Let's have a look. 40 00:02:36,820 --> 00:02:44,750 So let's go google "sklearn ml map: choosing the right estimator". 41 00:02:44,840 --> 00:02:48,550 This is what we want. Now, when you first see this,
42 00:02:48,650 --> 00:02:54,470 it's gonna look like a whole bunch of different jargon, but as we start to dive into 43 00:02:54,470 --> 00:02:59,450 it, we'll start to realize, oh wow, this has some really useful things that we can use in our problems, 44 00:02:59,450 --> 00:03:02,990 our machine learning problems. And what is this? 45 00:03:02,990 --> 00:03:08,670 Well, to get to this website you just google "sklearn machine learning map" or "scikit-learn machine learning map". 46 00:03:08,840 --> 00:03:10,640 And this is "Choosing the right estimator". 47 00:03:10,780 --> 00:03:16,020 Remember, estimator in scikit-learn is the same as machine learning model or machine learning algorithm. 48 00:03:16,210 --> 00:03:22,030 And the documentation here says: often the hardest part of solving a machine learning problem can be finding 49 00:03:22,030 --> 00:03:23,910 the right estimator for the job. 50 00:03:23,910 --> 00:03:24,690 Yes. 51 00:03:24,700 --> 00:03:29,590 Different estimators are better suited for different types of data and different problems. 52 00:03:29,650 --> 00:03:35,290 That makes sense. If we have a classification problem, we might want to use one of these estimators. 53 00:03:35,440 --> 00:03:40,720 And if we have a regression problem, we might want to use one of these estimators. And then for something 54 00:03:40,720 --> 00:03:43,930 like a clustering problem, we have these estimators. 55 00:03:44,050 --> 00:03:47,310 And for something like a dimensionality reduction problem, 56 00:03:47,390 --> 00:03:48,750 you'd want to use one of these. 57 00:03:49,180 --> 00:03:54,130 But let's get hands-on, right? Because at the moment, this graph, or this flow chart, this map, whatever 58 00:03:54,130 --> 00:03:57,760 you want to call it, can seem a bit confusing to begin with. 59 00:03:57,760 --> 00:04:01,490 So what we're going to do is get hands-on with a problem. 60 00:04:01,510 --> 00:04:04,630 So if we start here: first we need some data.
61 00:04:04,660 --> 00:04:09,070 So this is what these little blue steps are. Do you have above 50 samples? 62 00:04:09,090 --> 00:04:09,530 No? 63 00:04:09,550 --> 00:04:10,460 Get more data. 64 00:04:10,540 --> 00:04:11,320 Simple. 65 00:04:11,320 --> 00:04:12,880 Do we have above 50 samples? 66 00:04:12,880 --> 00:04:13,590 Yes. 67 00:04:13,780 --> 00:04:15,130 Predicting a category? 68 00:04:15,220 --> 00:04:15,800 Yes. 69 00:04:15,820 --> 00:04:16,810 Do we have labeled data? 70 00:04:16,900 --> 00:04:17,890 Yes. 71 00:04:17,940 --> 00:04:19,600 Do we have under 100K samples? 72 00:04:19,600 --> 00:04:20,200 Yes. 73 00:04:20,200 --> 00:04:22,420 Use LinearSVC. 74 00:04:23,200 --> 00:04:24,280 Okay. 75 00:04:24,310 --> 00:04:24,670 All right. 76 00:04:24,670 --> 00:04:25,360 Enough talk. 77 00:04:25,450 --> 00:04:27,950 Let's get back to our notebook and start writing some code. 78 00:04:28,090 --> 00:04:34,440 What we're going to do to begin with is 2.1: we'll see how we did regression just recently, 79 00:04:34,450 --> 00:04:40,270 so we'll see how we would pick an estimator/algorithm for a regression problem. 80 00:04:40,330 --> 00:04:41,710 So let's do this. 81 00:04:41,790 --> 00:04:43,070 I'll make 2.1: 82 00:04:43,240 --> 00:04:53,620 picking a machine learning model for our regression problem. Beautiful. 83 00:04:53,710 --> 00:04:58,420 And so what we're going to do is use one of scikit-learn's built-in datasets, and that's 84 00:04:58,420 --> 00:05:00,620 the Boston Housing dataset. 85 00:05:00,670 --> 00:05:09,830 So let's go import Boston housing dataset, and we can do that with from sklearn.datasets 86 00:05:10,540 --> 00:05:14,440 import load_boston. And then we want to go 87 00:05:14,440 --> 00:05:22,960 boston, just to set it up, equals load_boston(), and then we want to see what it looks like: 88 00:05:23,050 --> 00:05:26,180 boston. OK.
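The loading step narrated above can be sketched in a few lines. One caveat: `load_boston` was removed from scikit-learn in version 1.2, so this sketch substitutes `load_diabetes`, another built-in regression dataset that returns the same dictionary-like structure; the substitution is an assumption for newer scikit-learn versions, not what the video types.

```python
# The video uses load_boston, which was removed in scikit-learn 1.2.
# load_diabetes follows the exact same pattern and is used here instead.
from sklearn.datasets import load_diabetes

dataset = load_diabetes()  # returns a dictionary-like Bunch object

# The same keys the video inspects on the Boston dataset
print(dataset.keys())           # includes 'data', 'target', 'feature_names'
print(dataset["data"].shape)    # (n_samples, n_features)
print(dataset["target"].shape)  # (n_samples,)
```

Any of scikit-learn's `load_*` datasets can be inspected this way before turning it into a DataFrame.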
89 00:05:26,350 --> 00:05:30,190 So it imports as a dictionary: we've got data as one of the keys, 90 00:05:30,190 --> 00:05:32,190 target is one of the keys, 91 00:05:32,380 --> 00:05:34,930 and then I think we have feature_names. 92 00:05:34,960 --> 00:05:35,890 What can we do with this? 93 00:05:36,070 --> 00:05:40,990 Well, let's first turn it into a pandas DataFrame so that we can see it a little bit better than being 94 00:05:40,990 --> 00:05:42,280 a dictionary. 95 00:05:42,280 --> 00:05:44,670 So we'll go boston_df. 96 00:05:44,890 --> 00:05:48,780 This is one of the first steps you'll usually do with any kind of problem, with any kind of data 97 00:05:48,780 --> 00:05:52,060 set: try to get it into a pandas DataFrame. 98 00:05:52,060 --> 00:05:52,270 Right? 99 00:05:52,270 --> 00:05:57,940 Because we've seen what pandas is capable of, and we know it looks good, and we know it's pretty malleable, 100 00:05:57,940 --> 00:06:01,510 and we can just do a whole bunch of different things once it's in a pandas DataFrame. 101 00:06:01,540 --> 00:06:07,260 So rather than having it in a dictionary, we'll get it into a DataFrame. And then we want to set up 102 00:06:07,420 --> 00:06:08,460 boston_df. 103 00:06:08,680 --> 00:06:13,930 And I'm kind of typing here without talking, but essentially what I'm doing is I'm taking the data key 104 00:06:14,170 --> 00:06:19,570 from the boston dictionary, setting the columns to be the feature_names from the dictionary, and 105 00:06:19,570 --> 00:06:22,350 then creating a target column here. 106 00:06:22,400 --> 00:06:30,220 This is what we're trying to predict, by setting it to pd.Series and taking the target key from 107 00:06:30,220 --> 00:06:32,290 the boston dictionary. 108 00:06:32,290 --> 00:06:37,970 So right now boston is a dictionary, and we're going to turn it into boston_df, which is a DataFrame.
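The dictionary-to-DataFrame step described above looks like this. Since `load_boston` was removed in scikit-learn 1.2, the sketch assumes `load_diabetes` as a stand-in with the same `data`/`target`/`feature_names` structure.

```python
# Sketch of the dictionary -> DataFrame step from the video.
# (load_boston was removed in scikit-learn 1.2, so load_diabetes,
# which has the same structure, is substituted here.)
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes()

# Feature columns come from the data key, named with feature_names
df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])

# Add what we're trying to predict as a 'target' column
df["target"] = pd.Series(dataset["target"])

print(df.head())
```

The same two lines work for any dataset dictionary with `data`, `feature_names`, and `target` keys.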
109 00:06:38,170 --> 00:06:47,130 So let's see what this bad boy looks like. KeyError: features names. It might just be feature_names. Columns... 110 00:06:47,340 --> 00:06:48,870 lots of typos. 111 00:06:48,960 --> 00:06:50,790 This is what happens when you type and talk. 112 00:06:50,880 --> 00:06:51,380 Okay. 113 00:06:51,540 --> 00:06:52,200 Excellent. 114 00:06:52,200 --> 00:06:53,370 So what can we see here? 115 00:06:54,540 --> 00:06:58,850 Well, we've got CRIM, ZN, INDUS, CHAS. 116 00:06:58,910 --> 00:07:00,440 Not sure what all of these are, 117 00:07:00,470 --> 00:07:02,010 but it's got a target column, 118 00:07:02,180 --> 00:07:06,200 so I'm assuming that's what we're trying to predict. And we can figure out what this actually is by just 119 00:07:06,200 --> 00:07:11,860 googling "sklearn Boston housing dataset". We're looking at sklearn.datasets. 120 00:07:11,870 --> 00:07:19,160 load_boston: so this is what we've just done, we've called this function from the sklearn.datasets 121 00:07:19,160 --> 00:07:20,150 module. 122 00:07:20,340 --> 00:07:21,780 We can read more in the user guide. 123 00:07:21,810 --> 00:07:24,180 I've seen this before, so I'm kind of familiar with it, 124 00:07:24,180 --> 00:07:27,100 but if you're first looking at it, you might have to dive in. 125 00:07:27,120 --> 00:07:30,900 So this is just going to give us a data dictionary of the columns that we're dealing with. 126 00:07:31,260 --> 00:07:37,270 So we see CRIM is per capita crime rate by town. Yep. 127 00:07:37,500 --> 00:07:38,180 OK. 128 00:07:38,200 --> 00:07:44,300 And then ZN, proportion of residential land zoned for lots over 25,000 square feet. 129 00:07:44,320 --> 00:07:44,670 Yep. 130 00:07:45,110 --> 00:07:45,610 OK. 131 00:07:45,850 --> 00:07:51,670 So, long story short, what this dataset is, is a whole bunch of different parameters about different towns 132 00:07:51,940 --> 00:07:52,870 in Boston.
133 00:07:52,870 --> 00:07:56,920 So each row of these (Boston is a city in America, by the way), 134 00:07:57,250 --> 00:08:02,250 so each row is a different town in Boston, and there are different features about each town. 135 00:08:02,260 --> 00:08:09,190 You can read all of these here, and what we're trying to do is use these features about the town to predict 136 00:08:09,220 --> 00:08:10,930 the median house price. 137 00:08:11,020 --> 00:08:15,400 And I believe this house price is in thousands, and this dataset is a little bit old too, that's why everything 138 00:08:15,400 --> 00:08:16,450 is so cheap. 139 00:08:16,540 --> 00:08:18,170 But the premise is here: 140 00:08:18,190 --> 00:08:20,440 this is a regression problem. 141 00:08:20,440 --> 00:08:27,400 So now we can figure out a few things about our DataFrame. We want to know how many samples we've 142 00:08:27,400 --> 00:08:27,610 got: 143 00:08:27,630 --> 00:08:32,420 len(boston_df): five hundred and six. 144 00:08:32,470 --> 00:08:33,100 Wonderful. 145 00:08:33,880 --> 00:08:39,670 So now that we know we have five hundred and six samples, and we have a regression problem, and we have 146 00:08:39,670 --> 00:08:48,550 labels, what can we do? We'll go back to our machine learning map, we come right back to the start, and we go, 147 00:08:48,640 --> 00:08:52,050 okay, we follow this little golden arrow. 148 00:08:52,090 --> 00:08:52,780 Beautiful. 149 00:08:52,780 --> 00:08:54,400 Do we have above 50 samples? 150 00:08:54,400 --> 00:08:55,340 Yes. 151 00:08:55,360 --> 00:09:00,820 See, if we didn't have above 50 samples, scikit-learn would be telling us to get more data, which makes sense. 152 00:09:00,820 --> 00:09:02,560 Are we predicting a category? 153 00:09:02,560 --> 00:09:03,470 No. 154 00:09:03,490 --> 00:09:05,440 Are we predicting a quantity? 155 00:09:05,440 --> 00:09:10,470 Yes, because we are working on a regression problem. 156 00:09:10,600 --> 00:09:14,080 Do we have under 100K samples?
157 00:09:14,260 --> 00:09:14,920 Yes. 158 00:09:14,980 --> 00:09:16,820 Few features should be important? 159 00:09:16,840 --> 00:09:18,820 We're actually not sure what this is. 160 00:09:18,970 --> 00:09:21,170 So, few features should be important: 161 00:09:21,220 --> 00:09:22,600 what have we got? 162 00:09:22,630 --> 00:09:29,000 We've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. 163 00:09:29,230 --> 00:09:34,000 We've got 13 features, but we don't actually know whether only a few of them will be important or not. 164 00:09:34,000 --> 00:09:36,870 So let's just say no for the time being. 165 00:09:36,910 --> 00:09:42,100 And so this is going to point us now to one of these green squares, and within each of these green squares 166 00:09:42,460 --> 00:09:46,060 is an estimator, or a machine learning model. 167 00:09:46,060 --> 00:09:53,010 So if we click on this one, ridge regression, this is gonna take us to the ridge regression and classification 168 00:09:53,010 --> 00:09:54,220 documentation. 169 00:09:54,360 --> 00:09:59,550 So it tells us, Ridge regression: Ridge regression addresses some of the problems of ordinary least squares, 170 00:09:59,570 --> 00:10:04,050 and we've got a whole bunch of math symbols, but really we just want to work out how we can use this 171 00:10:04,050 --> 00:10:05,840 machine learning model. 172 00:10:05,850 --> 00:10:09,640 Okay, so: from sklearn import linear_model, then linear_model.Ridge. Okay. 173 00:10:09,750 --> 00:10:11,620 So we can just import it like that. 174 00:10:11,640 --> 00:10:13,000 Well, let's see. 175 00:10:13,200 --> 00:10:16,240 Let's go back, let's get some code working. 176 00:10:17,140 --> 00:10:22,360 Let's try the Ridge regression model. 177 00:10:22,960 --> 00:10:29,840 So if we go back up here, we go from sklearn import linear_model, and it's linear_model.Ridge to use 178 00:10:29,840 --> 00:10:33,740 the ridge regression. But actually, we don't want to do it like that.
179 00:10:33,740 --> 00:10:43,550 We can save a line of code by going from sklearn.linear_model import Ridge, and 180 00:10:43,550 --> 00:10:46,010 then we're going to set up a random seed, 181 00:10:50,260 --> 00:10:59,250 np.random.seed, so we can make sure our results are reproducible, and then we want to create the data. 182 00:10:59,980 --> 00:11:03,040 So we're going to go X equals, we've seen this before, 183 00:11:03,070 --> 00:11:05,400 boston_df.drop: 184 00:11:05,410 --> 00:11:11,520 we want to drop the target column for X, because then it'll just be the features matrix, 185 00:11:11,550 --> 00:11:13,310 axis=1. 186 00:11:13,410 --> 00:11:18,590 I'm going to put a few rows down here so we've got some space. Then y goes: 187 00:11:18,660 --> 00:11:24,420 boston_df, and this is going to be the target column, because we want to use X to predict y, right? 188 00:11:24,630 --> 00:11:29,640 And then we want to go split into train and test sets. 189 00:11:29,690 --> 00:11:44,160 We're going to go X_train, X_test, y_train, y_test equals train_test_split(X, y), test_size, we use 20 percent 190 00:11:44,160 --> 00:11:49,020 again because that's a good general number, you'll see that come up over and over again in machine learning 191 00:11:49,020 --> 00:11:55,230 projects. And then the next thing to do, because we've imported Ridge, much like up here we instantiated 192 00:11:55,230 --> 00:11:57,350 the model with random forest, 193 00:11:58,170 --> 00:12:02,390 so model equals RandomForestRegressor(), we can do the same thing with Ridge. 194 00:12:02,520 --> 00:12:11,070 So we go instantiate Ridge model, and we'll call it the generic model again: model equals Ridge(). And then we're going 195 00:12:11,070 --> 00:12:23,100 to go model.fit(X_train, y_train), and then we want to go check the score of the Ridge model 196 00:12:23,280 --> 00:12:34,220 on the test data: model.score(X_test, y_test). Beautiful. Look at that, how quick was that?
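The steps typed out above come together as one short, runnable sketch. Since `load_boston` was removed in scikit-learn 1.2, this assumes `load_diabetes` as the regression dataset, so the score will differ from the one shown in the video.

```python
# The video's Ridge workflow in one sketch: seed, data, split, fit, score.
# (load_boston was removed in scikit-learn 1.2; load_diabetes substituted.)
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

np.random.seed(42)  # make our results reproducible

# Create the data: dictionary -> DataFrame with a 'target' column
dataset = load_diabetes()
df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
df["target"] = pd.Series(dataset["target"])

X = df.drop("target", axis=1)  # features matrix
y = df["target"]               # labels (what we're trying to predict)

# Split into train and test sets, holding out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the Ridge model and fit it (find patterns in the training data)
model = Ridge()
model.fit(X_train, y_train)

# Evaluate the model on the test data
print(model.score(X_test, y_test))
```

`score()` on a regressor returns the coefficient of determination, so the printed value is at most 1.0.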
197 00:12:34,410 --> 00:12:35,930 So what did we do here? 198 00:12:36,550 --> 00:12:38,050 Where did this come from? 199 00:12:38,050 --> 00:12:42,210 Remember, we went back to our machine learning map. 200 00:12:42,220 --> 00:12:43,190 We started here, 201 00:12:43,240 --> 00:12:46,200 we answered a few questions and followed the flow chart along here, 202 00:12:46,230 --> 00:12:47,610 and we clicked on ridge regression. 203 00:12:48,300 --> 00:12:51,600 And that took us to the documentation, and we went, here, 204 00:12:51,700 --> 00:12:52,000 okay, 205 00:12:52,000 --> 00:12:56,150 there's some example code, but we prefer to just start typing it out ourselves, which is what we did. 206 00:12:56,150 --> 00:13:01,690 I kind of skipped over looking at this because I wanted to just get it in here so you can get an example. 207 00:13:01,690 --> 00:13:08,930 And so now what we've done is split our data into X and y. We've got our Boston DataFrame here. 208 00:13:09,010 --> 00:13:13,450 We've taken these columns because we want to use these columns to predict the target. 209 00:13:14,120 --> 00:13:15,950 And so that's what we've done with X and y. 210 00:13:15,970 --> 00:13:20,980 Then we split into train and test sets, and we've instantiated a Ridge model, because that's the model that 211 00:13:20,980 --> 00:13:22,650 the map suggested we should use, 212 00:13:23,530 --> 00:13:30,690 this bad boy here. And then we fitted it to the data, a.k.a. asked our model to find the patterns between 213 00:13:30,720 --> 00:13:39,660 X_train and y_train, and then we evaluated our model on the test dataset. Wow, that was actually surprisingly 214 00:13:40,050 --> 00:13:40,830 easy. 215 00:13:40,890 --> 00:13:43,100 Now, this score: what is this? 216 00:13:43,110 --> 00:13:48,630 So, if we press Shift+Tab, this says it returns the coefficient of determination, R squared, of the prediction. 217 00:13:48,690 --> 00:13:51,670 Now, we'll look into more evaluation metrics as we go on.
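What `score()` computes for a regressor can be sketched by hand: the coefficient of determination is one minus the sum of squared residuals over the total sum of squares around the mean. A minimal check against scikit-learn's own implementation, using small made-up numbers:

```python
# A minimal sketch of what .score() computes for regressors:
# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual target values
y_pred = np.array([2.8, 5.2, 7.1, 8.9])   # a model's predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
r2 = 1 - ss_res / ss_tot

print(r2)                        # close to 1.0 because predictions are close
print(r2_score(y_true, y_pred))  # matches sklearn's calculation
```

Predicting every value exactly gives `ss_res = 0` and therefore a score of 1.0, which is why 1.0 is the highest possible score.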
218 00:13:51,810 --> 00:13:56,400 But just remember, the highest possible score you can get here is one point zero. 219 00:13:56,400 --> 00:14:02,710 So if our model was to predict each of these values exactly, then it would get a score of one point 220 00:14:02,720 --> 00:14:03,350 zero. 221 00:14:03,540 --> 00:14:06,630 So zero point six something is not too bad, right? 222 00:14:06,630 --> 00:14:10,300 The closer to 1, the better. What you might be asking is, how do we, 223 00:14:10,300 --> 00:14:11,900 how do we improve this score? 224 00:14:16,180 --> 00:14:17,530 Right. 225 00:14:17,590 --> 00:14:22,720 What if Ridge wasn't working? 226 00:14:22,720 --> 00:14:28,990 And luckily, if we go back to the machine learning map, we've got a little arrow here that says "not working", 227 00:14:29,700 --> 00:14:30,030 right? 228 00:14:30,040 --> 00:14:37,110 So if our ridge regression wasn't performing as well as we wanted, or wasn't working at all, well, this 229 00:14:37,120 --> 00:14:41,830 arrow's pointing us here, to go to one of these two green rectangles. 230 00:14:41,890 --> 00:14:43,970 So maybe we'll have a look at this in the next video. 231 00:14:44,110 --> 00:14:44,980 What's going on here? 232 00:14:44,980 --> 00:14:50,020 What if ridge regression wasn't working? At the moment it is, but we want to kind of improve this score, 233 00:14:50,770 --> 00:14:54,060 so we'll check out what we can do with that in the next video.