So the beautiful thing is that now our data has no missing values and has all been turned into numeric values, we should be able to build a machine learning model. So let's do that. We instantiated one way back up here as model, right back when we started the modelling section. We've covered a fair bit of ground so far; you should be proud. Modelling. Wonderful. We're going to do basically the exact same thing.

So let's build a machine learning model. We could copy this, but we're just going to write it out again and leave a bit of communication to ourselves: now that all of our data is numeric and our DataFrame has no missing values, we should be able to build a machine learning model. Wonderful.

So remember what our goal is: it's to find patterns in the data to predict the sale price. So that's what we'll have to do. Let's have a look, df_tmp.head(), one last look at our data before we model. So we need to drop the SalePrice column, so that we can build a machine learning model that takes in all of the other columns except for SalePrice, finds patterns, and then tries to predict SalePrice.
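The split described above can be sketched like this. This is a minimal sketch, assuming a DataFrame named df_tmp with a SalePrice target column (the tiny DataFrame here is a hypothetical stand-in for the real 412,000-row dataset):

```python
import pandas as pd

# Hypothetical miniature version of the preprocessed DataFrame (df_tmp):
# every column is numeric and there are no missing values.
df_tmp = pd.DataFrame({
    "YearMade": [1998, 2004, 2010, 2001],
    "MachineHoursCurrentMeter": [3500.0, 1200.0, 800.0, 2600.0],
    "SalePrice": [24000, 31000, 45000, 27500],
})

# Features (X): every column except the target.
# Labels (y): the target column we want to predict.
X = df_tmp.drop("SalePrice", axis=1)
y = df_tmp["SalePrice"]
```

The model will then learn patterns in X that map to y.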
So what we're going to do is set up a little timer magic. If you haven't seen this before, the double percentage sign (%%time) is a Jupyter Notebook magic function; you can search out a whole bunch of these. What this is going to do is basically just calculate how much time this particular cell takes to run, and I'm going to do it on my computer, which is a MacBook Pro.

How many rows do we have? We have four hundred thousand or so rows, so this may take a little while. In previous examples, when we've worked with less data, our machine learning models have been pretty quick, right? Because they only had to sort through a couple of hundred rows, but this is 412,000. I mean, of course there are much bigger datasets, but we're starting to get up there, right? We're in the hundreds of thousands of samples. Instantiate model.

So the reason for this timing is that our goal as a data scientist, as a machine learning engineer, is to do as many experiments as fast as possible. If you're training models on all of the data all the time, it's going to take a fairly long time. So I'm just going to demonstrate how long it takes my personal computer to use a baseline random forest regressor model to find the patterns in all 412,000 rows.
So don't worry, I'm not going to sit there and watch it run; once we kick it off, I'll pause the video and then come back and show you how long it took. The way of thinking I want you to start adopting is, when you're doing experiments, start to think to yourself: how can I reduce the amount of time it takes between my experiments? The reason being that you're not always trying to do the right thing; you're trying to figure out what's wrong, what doesn't work. So that's the rationale behind trying to speed up your time between experiments.

And as I type, as I talk, what we're doing here is just instantiating a random forest model. We're going to put random_state=42, that way our results will be reproducible. So this is all we're doing: we've imported the RandomForestRegressor class because we're working on a regression problem, and we're setting n_jobs equal to -1 because this is a fairly large dataset for what we've been working with so far, so I want to use all of the cores in my computer. And I want random_state to equal 42 so our results are reproducible. And then I'm going to fit the model to the data. So this is our X and this is our y. Remember, we're trying to predict SalePrice based on all the other columns except for SalePrice.
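Put together, the cell being described looks something like this sketch (in a Jupyter cell you would put %%time on the first line to get the timing; here a small random X and y stand in for the real 412,000-row feature matrix and SalePrice column):

```python
# In Jupyter, the first line of the cell would be:  %%time
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Hypothetical stand-in data for the real X (features) and y (sale prices).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 10000 + 25000 + rng.normal(scale=500, size=200)

# n_jobs=-1 uses every CPU core; random_state=42 makes results reproducible.
model = RandomForestRegressor(n_jobs=-1, random_state=42)
model.fit(X, y)
```

On the real dataset this fit is the step that takes minutes rather than milliseconds.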
So when I kick this off, it will take a few minutes, but I'm going to pause the video, and because we've got the little %%time magic function here, we can see how long it takes. So get ready to time travel in three, two, one. All right, we're back. That was pretty quick.

So our model is finished, and it says here that the wall time was six minutes and 58 seconds. So it took about seven minutes to go through 412,700 rows, which is actually pretty good, right? Think about how quickly you could look at that. How long would it take you to go through 412,000 different examples? You didn't have to wait for that time because I sped it up, but if you were to run this on your computer, it may take longer or it may take less time, depending on what sort of computer you have. I'm running this on a 13-inch MacBook Pro, I think from about 2016 or something like that. I'm not even exactly sure how much computing power mine has, but it's a laptop, so maybe not as much as a desktop, anyway. But you can imagine, as these numbers of rows start to really get up high, this could start to take a long time.
If our goal as machine learning engineers, if our goal as data scientists, is to reduce the amount of time between experiments, we're going to have to be pretty nifty when we try out different models, when we work on different data, because waiting seven minutes every time (even longer if this number were higher) for our cell to run is going to slow us down a fair bit. So we'll see a little trick soon for how we can improve this.

But finally, let's score the model, to see how our model did. If it's gone over 412,000 rows, surely it's found some patterns in here. So we're going to score it on the exact same data that it was trained on, so we'll copy this. And now remember, for a regression algorithm, the default score value returned here is going to be the coefficient of determination, which is R-squared. Now, if you need a refresher on what R-squared is, you can Google "what is coefficient of determination". But in other words, it's a score whose maximum is 1.0; it's 0 if all your model does is predict the mean of the SalePrice column, and it can go to negative infinity if your model is absolutely terrible. So we're looking for a value that's close to 1.

So again, this might take a little while, because the model has to compute the score of how it did across 412,000 rows... boom, 0.98. Wow.
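The scoring step, and the "predicting the mean scores exactly 0" claim, can be sketched like this, again with hypothetical stand-in data rather than the real bulldozers DataFrame (DummyRegressor is scikit-learn's built-in mean-predicting baseline):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
import numpy as np

# Hypothetical stand-in for the real features and sale prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_jobs=-1, random_state=42).fit(X, y)

# For a regressor, .score() returns the coefficient of determination (R^2).
# Scoring on the same data the model was trained on gives a value near 1.0.
r2_train = model.score(X, y)

# A baseline that only ever predicts the mean of y scores exactly 0.
baseline = DummyRegressor(strategy="mean").fit(X, y)
r2_baseline = baseline.score(X, y)
```

Note this R² is computed on the training data itself, which is exactly why the number should be treated with suspicion.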
If the maximum is 1.0, then 0.987 is so close, that's about 0.99. So, if the maximum is 1.0, does that mean a RandomForestRegressor has almost found a perfect pattern for predicting the sale price based on all the other columns in the DataFrame?

Not so fast. I want you to think about something before we get into the next video, where we're going to cover it: why is this metric not reliable? Right. So if we have a look here, we fitted the model on this data, and then we've evaluated it on the same data. Why doesn't this metric hold much water? Have a think about it; we'll talk about it in the next video.