1 00:00:00,450 --> 00:00:01,380 Beautiful. 2 00:00:01,410 --> 00:00:06,210 Now we've got our Jupyter notebook set up and our tools are importing thanks to installing them with Conda. 3 00:00:06,330 --> 00:00:07,970 Let's give this bad boy a name. 4 00:00:07,980 --> 00:00:12,430 End-to-end bulldozer price regression. 5 00:00:12,450 --> 00:00:14,700 Nice and simple. 6 00:00:15,240 --> 00:00:15,990 Wonderful. 7 00:00:16,080 --> 00:00:19,680 And what we're gonna do is start off by bringing some headings into our notebook. 8 00:00:19,680 --> 00:00:24,570 So we have some structure of what we're doing, and the way we're going to do it is we're gonna bring this 9 00:00:24,570 --> 00:00:31,020 framework in so we get all these six steps here: problem definition, data, evaluation, where we check Kaggle for 10 00:00:31,050 --> 00:00:36,260 what evaluation metric they use, features, modeling and then experiments. 11 00:00:36,270 --> 00:00:37,680 So let's do that. 12 00:00:38,070 --> 00:00:42,060 We'll create a new cell above here, just to predefine the notebook. 13 00:00:42,420 --> 00:00:44,260 So what could we start this off with? 14 00:00:44,300 --> 00:00:54,900 Give it a nice title, so: Predicting the sale price of bulldozers using machine learning. 15 00:00:54,900 --> 00:00:57,240 Let's put the emoji here. 16 00:00:57,240 --> 00:00:58,980 There we go, frequently used. 17 00:00:59,010 --> 00:01:00,640 Good one. 18 00:01:00,690 --> 00:01:02,280 Turn this into markdown. 19 00:01:02,310 --> 00:01:03,390 Wonderful. 20 00:01:03,390 --> 00:01:04,530 So in this notebook 21 00:01:07,750 --> 00:01:20,200 we're going to go through an example machine learning project with the goal of predicting the sale price 22 00:01:20,470 --> 00:01:21,840 of bulldozers. 23 00:01:22,120 --> 00:01:23,090 Nice and simple, right? 24 00:01:23,090 --> 00:01:26,670 You want to get into modeling and working on code as fast as possible. 
25 00:01:26,680 --> 00:01:31,370 So these kinds of layouts are just to help us out in the beginning. 26 00:01:31,390 --> 00:01:35,740 So problem definition was one, next is data. 27 00:01:35,740 --> 00:01:39,960 Number three is... see, I can't even remember off the top of my head. 28 00:01:40,420 --> 00:01:41,360 Let's go back here. 29 00:01:41,440 --> 00:01:43,190 This is why we have this, so we can check it out. 30 00:01:43,210 --> 00:01:45,640 Evaluation. Had a mind blank there. 31 00:01:45,790 --> 00:01:47,340 But these things will happen, right? 32 00:01:47,440 --> 00:01:51,670 Whenever you dive into a project it's not going to be like you know everything to do off 33 00:01:51,670 --> 00:01:52,210 by heart 34 00:01:52,210 --> 00:01:53,140 from the very start. 35 00:01:53,410 --> 00:01:56,320 That's what machine learning is: experimentation. Features. 36 00:01:56,320 --> 00:02:02,270 Let's go through these and then maybe down here somewhere we'll be modeling or whatever, but we'll wait 37 00:02:02,270 --> 00:02:03,490 till we get to that point. 38 00:02:03,580 --> 00:02:06,220 So let's define our problem in a single sentence. 39 00:02:06,220 --> 00:02:07,210 We'll go back to Kaggle. 40 00:02:08,050 --> 00:02:09,550 So we go to the overview. 41 00:02:09,550 --> 00:02:10,060 There we go. 42 00:02:10,300 --> 00:02:14,530 You could just put that in, but we'll rewrite it in our own words. 43 00:02:14,530 --> 00:02:16,110 That's a bit more fun, isn't it? 44 00:02:16,110 --> 00:02:16,960 So come here. 45 00:02:16,990 --> 00:02:18,010 What could we call this? 46 00:02:20,870 --> 00:02:21,980 Brilliant. Quote. 47 00:02:21,980 --> 00:02:31,010 How well can we predict the future sale price of a bulldozer, given its characteristics 48 00:02:35,540 --> 00:02:46,420 and previous examples of how much similar bulldozers have been sold for? 49 00:02:46,480 --> 00:02:47,860 Does that make sense? 
50 00:02:47,860 --> 00:02:53,710 So our goal here is to use some data to build a machine learning model to predict the sale price of 51 00:02:53,720 --> 00:02:55,030 bulldozers in the future. 52 00:02:55,030 --> 00:02:56,270 And now data. 53 00:02:56,470 --> 00:02:58,010 Now we've already imported this. 54 00:02:58,020 --> 00:03:01,990 So if we have a look at Jupyter Notebook, we've already got this open here. 55 00:03:01,990 --> 00:03:03,520 We don't need an extra window. 56 00:03:03,550 --> 00:03:04,660 Let's have a look at data. 57 00:03:04,690 --> 00:03:06,820 Blue Book for Bulldozers. 58 00:03:06,820 --> 00:03:07,630 There we go. 59 00:03:08,920 --> 00:03:11,390 We've got train and valid, train. 60 00:03:11,500 --> 00:03:11,850 All right. 61 00:03:11,860 --> 00:03:13,170 So I might just put a link in here. 62 00:03:13,180 --> 00:03:18,820 So, okay, the data is downloaded from the Kaggle 63 00:03:21,520 --> 00:03:24,750 Blue Book for Bulldozers competition 64 00:03:28,710 --> 00:03:33,270 and then you might come into this Kaggle competition and they've already given us some information, so 65 00:03:33,270 --> 00:03:38,640 I might just put something like this, so I might just copy this to remind ourselves of what train, valid 66 00:03:38,640 --> 00:03:39,930 and test are. 67 00:03:39,930 --> 00:03:43,820 Again, this would be different depending on whatever dataset you're working with. 68 00:03:43,830 --> 00:03:44,710 We'll put it in here. 69 00:03:44,720 --> 00:03:50,180 So train, valid and test, just so someone who is new to this would be able to go through our notebook and go, oh 70 00:03:50,210 --> 00:03:59,350 okay, I see what's happening. There are three main datasets. And we're just gonna put the link in here. This 71 00:03:59,350 --> 00:04:01,610 is all part of communicating your work, right? 72 00:04:01,630 --> 00:04:05,470 That's what you want when you're doing things like machine learning or any kind of coding project. 
73 00:04:05,470 --> 00:04:10,220 It can be easy to sort of get lost in making your own set of projects and then someone else stumbles upon 74 00:04:10,220 --> 00:04:14,950 it, and because you've done it all in your own style they're gonna be like, oh wow, what's going on here? 75 00:04:14,980 --> 00:04:16,030 So this is what we're trying to do. 76 00:04:16,030 --> 00:04:18,670 We're trying to help our future selves as well as anyone else 77 00:04:18,670 --> 00:04:25,240 we share our work with. So, evaluation. We can come to Kaggle competitions. Whenever a competition is posted 78 00:04:25,240 --> 00:04:26,290 on Kaggle, 79 00:04:26,420 --> 00:04:29,640 there's gonna be an overview, there's gonna be some prizes. 80 00:04:29,650 --> 00:04:35,440 There's gonna be a description of what's going on here and there's gonna be an evaluation, so how do you 81 00:04:35,440 --> 00:04:37,240 win the competition. 82 00:04:37,240 --> 00:04:44,740 So here is where we'll be shown it: the evaluation metric for this competition is the RMSLE, or root 83 00:04:45,070 --> 00:04:47,190 mean squared log error. 84 00:04:47,230 --> 00:04:48,940 Wow, that's a mouthful. 85 00:04:49,090 --> 00:04:55,690 Between the actual and predicted auction prices. See, that's what all evaluation metrics are: a comparison 86 00:04:55,960 --> 00:04:59,520 between the actual and predicted auction prices. 87 00:04:59,530 --> 00:05:03,700 So what we'll do is we're going to copy this in here. 88 00:05:03,700 --> 00:05:05,830 There we go. 89 00:05:05,830 --> 00:05:11,290 And don't worry, we're gonna have a look at that. We haven't actually gone over what RMSLE is 90 00:05:11,530 --> 00:05:14,950 but we have gone over mean squared error. 91 00:05:15,040 --> 00:05:23,500 Now you might be able to infer from this that root mean squared log error is just the root of the mean 92 00:05:23,500 --> 00:05:27,220 squared log error, but we'll have a look at that when we come to evaluation. 
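As a rough illustration of the metric just named, here is a minimal sketch of RMSLE as it is commonly defined: the square root of the mean of the squared differences between log(prediction + 1) and log(actual + 1). The function name `rmsle` and the sample prices are assumptions made for illustration only; the competition's evaluation page remains the authority on the exact formula.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared log error (common definition, assumed here).

    log1p(x) computes log(1 + x), so zero prices do not break the log.
    Inputs are assumed to be non-negative.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_log_errors = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_errors)))


# Made-up auction prices, purely for illustration.
actual_prices = [66000.0, 57000.0, 10000.0]
predicted_prices = [60000.0, 59000.0, 12000.0]
print(rmsle(actual_prices, predicted_prices))
```

Because the errors are taken on a log scale, the score reflects relative rather than absolute differences, so being 10% off on a cheap machine counts about as much as being 10% off on an expensive one. Lower is better.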
93 00:05:27,310 --> 00:05:34,430 For more on the evaluation of this project, check... we're going to put another link in there. 94 00:05:36,490 --> 00:05:42,880 So, as is the goal with most regression metrics, we might put in here: 95 00:05:42,880 --> 00:05:57,240 Note: the goal for most regression evaluation metrics is to minimize the error. For example, 96 00:05:57,360 --> 00:06:13,330 our goal for this project will be to build a machine learning model which minimizes RMSLE. So that's 97 00:06:13,360 --> 00:06:14,760 regression metrics, right? 98 00:06:14,760 --> 00:06:20,460 If you have mean absolute error, RMSLE or mean squared error, your goal is often to minimize it. And 99 00:06:20,460 --> 00:06:21,900 now features. 100 00:06:21,900 --> 00:06:26,910 So this is where we want to have a look at what different parts of the data there are. 101 00:06:26,910 --> 00:06:33,000 So if we come to Kaggle and we have a look at data, this dataset comes with a data dictionary. 102 00:06:33,000 --> 00:06:38,160 And the good thing is that we've downloaded that. It's in the form of an Excel file; that's the extension 103 00:06:38,160 --> 00:06:39,180 for Excel. 104 00:06:39,180 --> 00:06:44,160 Now I don't have Excel on my computer, but the beautiful thing is you can open Excel files in Google 105 00:06:44,160 --> 00:06:46,410 Sheets, which is what I've done here. 106 00:06:46,440 --> 00:06:54,680 So what we'll see is, we haven't imported our dataset yet, but these are the different columns. 107 00:06:54,690 --> 00:06:56,480 We'll have a look at it in a second. 108 00:06:56,550 --> 00:06:58,640 And what this is going to tell us is, 109 00:06:58,660 --> 00:07:00,310 okay, so this is the SalesID. 110 00:07:00,360 --> 00:07:02,030 That's a column, SalesID. 111 00:07:02,170 --> 00:07:07,310 The description is: unique identifier of a particular sale of a machine at auction. 112 00:07:07,320 --> 00:07:08,680 Wonderful. 113 00:07:08,730 --> 00:07:11,210 So I might copy this. 
114 00:07:11,430 --> 00:07:12,780 Anyone with a link can view. 115 00:07:12,810 --> 00:07:13,620 Beautiful. 116 00:07:14,070 --> 00:07:16,110 And go put it in here. 117 00:07:16,110 --> 00:07:19,290 Kaggle provides a data dictionary 118 00:07:21,620 --> 00:07:27,870 detailing all of the features of the dataset. 119 00:07:28,430 --> 00:07:31,250 You can view this data dictionary 120 00:07:34,100 --> 00:07:37,070 on Google Sheets, so I'll put another link there. 121 00:07:37,320 --> 00:07:37,990 Beautiful. 122 00:07:38,030 --> 00:07:41,780 So that way we sort of know what's happening there, what's going on. I mean, 123 00:07:41,790 --> 00:07:43,130 we haven't done much at all here, right? 124 00:07:43,130 --> 00:07:44,330 We haven't explored these at all. 125 00:07:44,330 --> 00:07:49,460 But what I prefer to do is just set out this little layout and then get cracking on the code as soon 126 00:07:49,460 --> 00:07:50,050 as possible. 127 00:07:50,060 --> 00:07:51,570 We'll finish this video there. 128 00:07:51,590 --> 00:07:53,820 So we've gone through steps one to four. 129 00:07:53,840 --> 00:07:55,180 Nice and simple, right? 130 00:07:55,190 --> 00:07:57,790 We've defined our problem so we know what we're doing. 131 00:07:57,860 --> 00:07:59,810 We know where the data is coming from. 132 00:07:59,810 --> 00:08:02,240 We know what evaluation metric we're moving towards, 133 00:08:02,240 --> 00:08:08,450 thanks to Kaggle, and thankfully Kaggle have provided a data dictionary for us. And these things, you won't 134 00:08:08,450 --> 00:08:11,890 always get them from the start as simply as this. 135 00:08:11,900 --> 00:08:15,720 So this is where it might take some work: if you didn't have these things, if you didn't have a problem definition, 136 00:08:15,830 --> 00:08:20,900 if your data wasn't in one source you could easily download from a link, or your evaluation metric wasn't 137 00:08:20,900 --> 00:08:22,040 just given to you. 
138 00:08:22,040 --> 00:08:23,140 Same with the data dictionary. 139 00:08:23,150 --> 00:08:25,470 These are things you might have to do some research to find. 140 00:08:25,580 --> 00:08:29,240 But luckily for this particular project we have them already. 141 00:08:29,240 --> 00:08:35,690 So without any further ado, since our tools are ready, let's import the data and get started on it.