1 00:00:00,730 --> 00:00:07,450 Now since we're starting a new project, it makes sense to take the steps in starting a new project. 2 00:00:07,740 --> 00:00:08,920 We're up to here. 3 00:00:09,010 --> 00:00:14,590 What we're gonna do in this video is create a project folder, get our data ready, create an environment 4 00:00:15,070 --> 00:00:19,540 and then launch a Jupyter notebook and make sure that we can import our tools, 5 00:00:19,540 --> 00:00:22,090 so they're ready to use for our problem. 6 00:00:22,240 --> 00:00:23,680 So let's get started. 7 00:00:23,680 --> 00:00:25,370 I'm gonna go to my desktop. 8 00:00:25,510 --> 00:00:29,840 I'm going to come to my machine learning course folder that I've been using. 9 00:00:29,920 --> 00:00:31,050 I'm gonna make a new folder. 10 00:00:31,060 --> 00:00:41,420 This one I'm gonna call, let's call it, bulldozer price prediction project. 11 00:00:41,470 --> 00:00:43,060 Nice and simple. 12 00:00:43,080 --> 00:00:44,560 Have I got any spelling errors? 13 00:00:44,560 --> 00:00:45,120 Nope. 14 00:00:45,160 --> 00:00:46,420 Beautiful. 15 00:00:46,420 --> 00:00:47,950 So come into this folder. 16 00:00:48,280 --> 00:00:48,870 Beautiful. 17 00:00:48,910 --> 00:00:50,150 We've got an empty folder. 18 00:00:50,440 --> 00:00:55,080 So if we go here we can tick this one off: create project folder. 19 00:00:55,340 --> 00:00:56,660 Now for the data. 20 00:00:56,750 --> 00:00:58,290 This one's a little bit different. 21 00:00:58,310 --> 00:01:07,130 We're going to go to Kaggle to get our dataset, because that's where the bulldozer problem is, so we go to Kaggle 22 00:01:07,130 --> 00:01:10,310 dot com, Blue Book for Bulldozers. 23 00:01:10,310 --> 00:01:16,650 Now this problem that we're working through originated as a Kaggle competition, so you can see here: predict 24 00:01:16,660 --> 00:01:23,920 the auction sale price for a piece of heavy equipment, a.k.a. a bulldozer, to create a blue book for bulldozers. 
25 00:01:23,920 --> 00:01:28,060 Now I'm assuming a blue book means maybe historic sales events or something like that. 26 00:01:28,060 --> 00:01:29,200 Not entirely sure. 27 00:01:29,920 --> 00:01:35,520 So if we look here we can see an overview of the project. When this competition first came out 28 00:01:35,620 --> 00:01:38,560 there was a ten thousand dollar prize pool for this competition. 29 00:01:38,560 --> 00:01:41,460 And if you're wondering, hey, I'm not even sure what Kaggle is, 30 00:01:41,470 --> 00:01:48,880 well, you can imagine it as the home for data scientists, because Kaggle is full of different competitions, 31 00:01:48,880 --> 00:01:54,820 different datasets, different notebooks and discussions, as well as heaps of courses. 32 00:01:54,820 --> 00:01:57,930 Kaggle is a plethora of resources for data scientists. 33 00:01:58,030 --> 00:01:59,590 If you're wondering what the competitions are, 34 00:01:59,650 --> 00:02:01,510 well, this is one here. 35 00:02:01,510 --> 00:02:09,400 What happens is a company might post a dataset and then it'll be your job to see how well you can model 36 00:02:09,400 --> 00:02:13,930 it, or how well you can build a system to solve the problem they're posting. 37 00:02:13,930 --> 00:02:16,540 And that's what we're going to be working through in this project. 38 00:02:16,540 --> 00:02:21,760 We're going to download the dataset from the original competition and then we're going to build 39 00:02:21,760 --> 00:02:25,520 an entire project to end up building a machine learning model. 40 00:02:25,690 --> 00:02:29,650 And we'll see how we would have ended up on the leaderboard. 41 00:02:29,650 --> 00:02:31,720 So you can see here we've got some scores. 42 00:02:31,720 --> 00:02:35,740 And these are different teams of people around the world, data scientists just like you. 43 00:02:36,040 --> 00:02:38,700 So that's what we'll have by the end of this project. 
44 00:02:38,830 --> 00:02:42,220 But we're talking too much, let's get the data. 45 00:02:42,220 --> 00:02:44,640 You can come to this URL here: 46 00:02:44,710 --> 00:02:49,650 kaggle.com/c/bluebook-for-bulldozers/data (the c is for competition). 47 00:02:49,840 --> 00:02:52,050 You can click this Download All link. 48 00:02:52,210 --> 00:02:56,420 Now you will need an account to log in to get this data. 49 00:02:56,590 --> 00:03:00,410 But since I'm already signed in I'm going to download it. 50 00:03:00,420 --> 00:03:05,500 I can see that it's going into my downloads folder, so if we come back to my file browser, 51 00:03:05,500 --> 00:03:08,080 you can see I've actually already downloaded it here. 52 00:03:08,080 --> 00:03:13,080 But this is just going to re-download the exact same file so that we've got it there. 53 00:03:13,850 --> 00:03:20,230 Now we could keep it here, but since we're working in our project folder I'm going to move it into the 54 00:03:20,230 --> 00:03:22,300 bulldozer price prediction project, 55 00:03:22,300 --> 00:03:27,430 the folder we just created. Your folder might be named differently. 56 00:03:27,490 --> 00:03:29,120 Now we can see, yeah, beautiful. 57 00:03:29,170 --> 00:03:31,180 We've got our zip file. 58 00:03:31,180 --> 00:03:37,550 Let's unzip that. Wonderful, that comes into a folder of more folders. 59 00:03:37,620 --> 00:03:38,130 Far out. 60 00:03:38,140 --> 00:03:42,070 We've got Train, TrainAndValid, TrainAndValid.csv. 61 00:03:42,120 --> 00:03:46,710 And if you're looking through these wondering what's going on, well, the next place to look is back in 62 00:03:46,710 --> 00:03:49,360 Kaggle, to see what's going on here. 63 00:03:49,380 --> 00:03:52,440 So we've got a data description. What I might do is zoom in. 64 00:03:52,470 --> 00:03:53,300 There we go. 65 00:03:53,400 --> 00:03:54,180 Pinch to zoom. 66 00:03:54,180 --> 00:03:54,630 Come on. 67 00:03:54,660 --> 00:03:56,590 And you know how to use a computer. 
68 00:03:56,620 --> 00:03:57,130 Okay. 69 00:03:57,230 --> 00:03:58,370 So this is what it's telling us. 70 00:03:58,410 --> 00:04:01,160 The data for this competition is split into three parts. 71 00:04:01,170 --> 00:04:04,890 And remember how I said that this is a time series problem? 72 00:04:04,920 --> 00:04:13,680 Well, we've got Train.csv, which contains data up to the end of 2011, and then we have Valid.csv, which 73 00:04:13,680 --> 00:04:20,310 is the validation set, which contains data from January 1 2012 to April 30 2012. 74 00:04:20,310 --> 00:04:27,500 Now you'll notice here that the validation set has data after the training set, which ends in 2011. 75 00:04:27,510 --> 00:04:30,120 We'll have a look at this when we're working through it. 76 00:04:30,120 --> 00:04:35,760 And then finally the test dataset, which won't be released until the last week of the competition, but 77 00:04:35,970 --> 00:04:40,780 because this competition has already passed we've got access to Test.csv. 78 00:04:41,010 --> 00:04:44,690 It contains data after the validation set. 79 00:04:44,700 --> 00:04:53,440 So see here, the validation set ends on April 30 2012, whereas this one continues from May 1 2012. 80 00:04:53,670 --> 00:04:54,050 OK. 81 00:04:54,120 --> 00:04:58,110 So this is some information here about the data we're going to be working with. 82 00:04:58,170 --> 00:05:02,190 There's also a data dictionary but we'll have a look at that in a second. 83 00:05:02,220 --> 00:05:06,240 Let's go back to the folder we were working with and we look here. 84 00:05:06,240 --> 00:05:15,300 Train.csv, well, Train.csv is still in the zip file, so we'll unzip that. Train.csv. 85 00:05:15,320 --> 00:05:16,020 Wonderful. 86 00:05:16,380 --> 00:05:18,020 Now we've got Valid.csv. 87 00:05:18,090 --> 00:05:22,400 OK, so this one must be Train.csv and Valid.csv in the one file. 
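To make the three date ranges from the data description concrete, here is a minimal sketch of a time-based split in pandas. The tiny table and the column names `saledate` and `SalePrice` are assumptions made up for illustration; the real competition files already come pre-split this way, so this only shows the idea.

```python
import pandas as pd

# Hypothetical miniature table standing in for the real sales data.
# The column names "saledate" and "SalePrice" are assumptions for illustration.
sales = pd.DataFrame({
    "saledate": pd.to_datetime([
        "2010-06-15", "2011-12-31", "2012-01-01",
        "2012-04-30", "2012-05-01", "2012-09-10",
    ]),
    "SalePrice": [21000, 35000, 18500, 42000, 27000, 30500],
})

# Boundaries taken from the data description read out in the video.
VALID_START = pd.Timestamp("2012-01-01")  # validation set starts January 1 2012
TEST_START = pd.Timestamp("2012-05-01")   # test set continues from May 1 2012


def split_by_date(df, date_col="saledate"):
    """Split rows into train/valid/test so later dates never leak into earlier sets."""
    train = df[df[date_col] < VALID_START]
    valid = df[(df[date_col] >= VALID_START) & (df[date_col] < TEST_START)]
    test = df[df[date_col] >= TEST_START]
    return train, valid, test


train, valid, test = split_by_date(sales)
print(len(train), len(valid), len(test))
```

Splitting on the date rather than at random matters for a time series problem: the model is always evaluated on sales that happened after everything it was trained on.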
88 00:05:22,400 --> 00:05:27,620 We'll have a look at that in a Jupyter notebook in a minute. We're just familiarizing ourselves with the 89 00:05:27,620 --> 00:05:31,290 data we've downloaded from Kaggle. Then we can see the data dictionary. 90 00:05:31,310 --> 00:05:31,870 Okay. 91 00:05:31,910 --> 00:05:34,750 That's pretty self-explanatory what that might be. 92 00:05:34,880 --> 00:05:37,340 And then there's Test.csv. 93 00:05:37,370 --> 00:05:37,670 OK. 94 00:05:37,700 --> 00:05:38,880 So we've got that. 95 00:05:38,870 --> 00:05:39,720 All right. 96 00:05:39,860 --> 00:05:43,320 Now let's go back to our project folder that we're working on. 97 00:05:43,670 --> 00:05:51,590 What I might do is, sticking with the folder nomenclature, which is a fancy name for naming 98 00:05:51,590 --> 00:05:57,380 system, that we've been using, we might create a data folder and just put all of our data in there, and 99 00:05:57,380 --> 00:06:02,540 then we can delete this zip file because we've got Blue Book for Bulldozers in there. 100 00:06:03,350 --> 00:06:04,110 Wonderful. 101 00:06:04,850 --> 00:06:06,570 Now we've got a file to work with. 102 00:06:06,590 --> 00:06:11,240 We come back here, bulldozer price prediction project. 103 00:06:11,240 --> 00:06:12,800 We've got our data. 104 00:06:12,800 --> 00:06:15,440 Let's go back to our workflow. 105 00:06:15,440 --> 00:06:17,680 We've got our data in our project folder. 106 00:06:17,690 --> 00:06:18,290 Excellent. 107 00:06:18,290 --> 00:06:23,060 The next step is to create an environment, or collection of tools, using conda. 108 00:06:23,070 --> 00:06:23,970 Let's do that. 109 00:06:24,380 --> 00:06:30,490 And to do that we're going to have to use the terminal. I'll make this window a little bit bigger. 110 00:06:30,650 --> 00:06:39,640 So I'm going to change directory into my desktop, which is where I'm storing this folder. Come back in 111 00:06:39,640 --> 00:06:40,270 the terminal. 
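The folder tidy-up done by hand in the file browser (unzip the download, keep everything under a data folder, delete the zip) could also be scripted. The sketch below is only an illustration: the archive name, the project folder name, and the helper `organise_download` are hypothetical, and it runs against a stand-in zip in a temporary directory rather than the real Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def organise_download(zip_path, project_dir):
    """Extract a downloaded archive into <project_dir>/data and return that folder."""
    data_dir = Path(project_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    return data_dir


# Demonstration with a stand-in archive (hypothetical file names and contents).
workspace = Path(tempfile.mkdtemp())
fake_zip = workspace / "bluebook-for-bulldozers.zip"
with zipfile.ZipFile(fake_zip, "w") as archive:
    archive.writestr("Train.csv", "SalesID,SalePrice\n1,10000\n")
    archive.writestr("Valid.csv", "SalesID,SalePrice\n2,12000\n")

project = workspace / "bulldozer-price-prediction-project"
data_dir = organise_download(fake_zip, project)
extracted = sorted(p.name for p in data_dir.iterdir())
print(extracted)

# The archive can be deleted once its contents are safely extracted.
fake_zip.unlink()
```

Keeping all raw files under one `data` folder means every notebook in the project can refer to the same relative path.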
112 00:06:40,280 --> 00:06:43,840 So change directory, Desktop, and ML course. 113 00:06:43,880 --> 00:06:48,790 And then within that I've got bulldozer price prediction. 114 00:06:48,830 --> 00:06:50,270 Wonderful. 115 00:06:50,270 --> 00:06:50,910 And so I'll go 116 00:06:50,950 --> 00:06:55,930 ls. It shows me that I have this data folder, which is exactly what we've got there. 117 00:06:56,030 --> 00:06:56,990 Wonderful. 118 00:06:56,990 --> 00:07:01,830 And so now we're going to create an environment, so we can do that with conda create. 119 00:07:01,880 --> 00:07:04,200 And because we want to create a little env folder, 120 00:07:04,260 --> 00:07:07,790 actually I need the prefix flag. 121 00:07:07,800 --> 00:07:12,950 Now we've seen this in our previous project, but we're going to create a new environment within here, 122 00:07:14,620 --> 00:07:15,940 so let's do that. 123 00:07:15,940 --> 00:07:20,100 So conda create --prefix ./env. 124 00:07:20,590 --> 00:07:21,460 Yes, correct. 125 00:07:21,460 --> 00:07:23,770 And then we're going to give it pandas, 126 00:07:23,980 --> 00:07:33,790 NumPy, Matplotlib, Jupyter and scikit-learn, because we're staying in line with the tools that we 127 00:07:33,790 --> 00:07:34,480 want to use. 128 00:07:34,480 --> 00:07:40,090 So pandas, Matplotlib, NumPy, Jupyter, Miniconda is what we're using to create this environment, and 129 00:07:40,090 --> 00:07:41,670 scikit-learn. 130 00:07:41,680 --> 00:07:44,380 So let's see if it works. 131 00:07:47,420 --> 00:07:48,190 Beautiful. 132 00:07:48,260 --> 00:07:52,650 And we're going to press yes, and there we go. 133 00:07:52,650 --> 00:07:56,120 So I'll let this load for about a minute or so and I'll come back once it's finished. 134 00:07:57,740 --> 00:07:58,530 Beautiful. 135 00:07:58,550 --> 00:08:00,580 So that took about a minute or so on my machine. 
136 00:08:00,580 --> 00:08:04,490 It'll depend on how quick your machine is or how fast your internet connection is, because conda has 137 00:08:04,490 --> 00:08:09,480 to download a few things. So we can see, to activate this environment, which we've seen before, 138 00:08:09,590 --> 00:08:15,620 we can use the code here: conda activate, Users, Daniel, Desktop, machine learning course, bulldozer price 139 00:08:15,620 --> 00:08:18,200 prediction project, slash env. 140 00:08:18,230 --> 00:08:23,480 And if we look back into our folder that we created, we've got an env folder, which is going to 141 00:08:23,480 --> 00:08:25,400 contain all of our tools. 142 00:08:25,610 --> 00:08:30,410 So let's activate that and see if we've got a Jupyter notebook ready to use, which we won't, 143 00:08:30,410 --> 00:08:37,400 but we'll start up a Jupyter server so we can get started in our workspace. Desktop slash ML course 144 00:08:37,910 --> 00:08:50,750 slash bulldozer price prediction project slash env. Wonderful. So you notice here that the base has now 145 00:08:50,750 --> 00:08:56,800 changed to this file path here, and to test if we've got Jupyter installed we can type in jupyter notebook. 146 00:08:57,380 --> 00:09:00,670 What this is going to do is create a server on our local machine, 147 00:09:00,830 --> 00:09:07,220 okay, this computer right here, running Jupyter, which is where we can start Jupyter notebooks and get 148 00:09:07,220 --> 00:09:11,060 our workspace up and running. Beautiful. 149 00:09:11,330 --> 00:09:15,220 So that took about 30 seconds or so. The first time you run it in a new environment 150 00:09:15,230 --> 00:09:17,210 it may take a little while to load. 151 00:09:17,210 --> 00:09:24,320 Now we've got the exact same folder that we created in our file explorer, or Finder in my case, up on 152 00:09:24,320 --> 00:09:24,980 Jupyter. 153 00:09:25,010 --> 00:09:25,670 Wonderful. 
154 00:09:25,700 --> 00:09:34,180 And the way we can test to see if our environment is set up correctly is we can go new notebook. Wonderful, 155 00:09:34,360 --> 00:09:41,020 and to test it quickly we're going to go import numpy as np. We need to check that we can import all 156 00:09:41,020 --> 00:09:52,170 of our tools: import pandas as pd, import matplotlib.pyplot as plt, import sklearn. 157 00:09:52,240 --> 00:09:57,370 Now if it's worked correctly this cell should run without any errors. 158 00:09:57,370 --> 00:10:02,380 Now again this might take a little while if it's the first time running in this environment, as conda 159 00:10:02,380 --> 00:10:07,430 and Jupyter prepare all these tools, so we'll just wait patiently while this runs through. 160 00:10:07,750 --> 00:10:08,880 Beautiful. 161 00:10:08,950 --> 00:10:12,700 So again on my machine that took about 30 seconds or so, maybe closer to a minute. 162 00:10:12,880 --> 00:10:16,300 But what this means is we've got our tools ready to use. 163 00:10:16,300 --> 00:10:23,170 So if we come back to our workflow, we've just imported pandas, Matplotlib, NumPy and scikit-learn, and 164 00:10:23,170 --> 00:10:29,470 we're working within a Jupyter notebook, and we've gone through these steps here. So we've created a project 165 00:10:29,470 --> 00:10:34,870 folder, we've downloaded our data from Kaggle, we've created an environment, which is a collection of tools, 166 00:10:35,410 --> 00:10:39,520 we're working in a Jupyter notebook, and now we have access to these tools here. 167 00:10:39,520 --> 00:10:40,660 So this is what we're going to do now: 168 00:10:40,660 --> 00:10:46,550 data analysis, manipulation and of course finishing off with some machine learning. 169 00:10:46,690 --> 00:10:48,940 Well, our notebooks are ready to go. 170 00:10:48,940 --> 00:10:51,810 This is what we're working on for the next few videos, so I'll see you then.
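The import test in the notebook fails on the first missing package, which hides whether anything else is missing too. As an optional alternative, the sketch below reports the status of every tool at once without importing them. The helper name `check_tools` is hypothetical; `sklearn` is the import name of the scikit-learn package.

```python
import importlib.util

# Import names of the tools installed into the environment.
TOOLS = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_tools(names):
    """Return a dict mapping each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_tools(TOOLS)
for name, available in status.items():
    print(f"{name}: {'ready' if available else 'MISSING'}")
```

If anything shows up as MISSING, the usual cause is that the notebook was started from a different environment than the one the tools were installed into.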