All right. So now we've got our notebook laid out a little, we know what steps we're taking, we know what problem we're trying to solve, we know where our data is coming from, and we know our evaluation metric and the features. Let's import the data and get started. In my case, we've downloaded it to a folder called data. If you followed along you might have done the same, but when I write the file path, your file path might be different to mine. Mine's going to be called data. So let's do that, and we're going to import the training and validation sets together to begin with. If you're wondering why we're importing them at the same time, this will all become clear in an upcoming video. So you'll just have to trust me for the time being that we're importing both in one go. We go bluebook, and we can probably use tab autocomplete. Surely. There we go. So we want TrainAndValid.csv. Wonderful. Hit Shift and Enter. Now this might take a while because, as we'll see in a second, this is a fairly large dataset. We're getting a warning. What do we have here? DtypeWarning: columns 13, 39, 40, 41 have mixed types. Specify dtype option on import or set low_memory=False. So if we go here, we can get rid of this warning by setting low_memory=False.
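A minimal sketch of that import step. The real file path (data/bluebook-for-bulldozers/TrainAndValid.csv) is an assumption based on the folder mentioned above; here a tiny in-memory CSV stands in for the Kaggle download so the snippet runs anywhere:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real file, which in the video lives at
# something like data/bluebook-for-bulldozers/TrainAndValid.csv
csv_data = io.StringIO(
    "SalesID,SalePrice,saledate\n"
    "1139246,66000,11/16/2006 0:00\n"
    "1139248,57000,3/26/2004 0:00\n"
)

# low_memory=False makes pandas read the whole file before inferring
# column dtypes, which avoids the mixed-dtype DtypeWarning on large files
df = pd.read_csv(csv_data, low_memory=False)
print(df.shape)  # (2, 3)
```

With the real dataset you would swap the `StringIO` object for the file path and get the full 412,698-row frame.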
What this is essentially saying is that when pandas' read_csv function runs, depending on how large your dataset is, because DataFrames live in RAM, pandas tries to minimize the amount of space they take up. So by setting low_memory to False we're just going, hey, we know you're trying to minimize space, but don't worry too much, we've got plenty of RAM, so this should work. We should be able to get rid of that warning. Wonderful. So this is just saying: store it however you need to, don't try to minimize the space that you're using. Wonderful. And the first thing we'll probably do is go df.info() to get a little bit of information about what's happening. So there we go, we'll have a look at that. This is the biggest dataset we've worked with yet, by a fairly long shot: we have 412,698 rows. That is a lot of rows. And 53 columns. Remember, if we have a look at our columns, these are our variables, and the data dictionary has their descriptions. So we should have SalesID, yes, that's a non-null integer, and MachineID, which is the identifier for a particular machine. Machines may have multiple sales. OK.
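The df.info() call above can be sketched like this. The two-row frame is a hypothetical stand-in for the real data (which has 412,698 rows and 53 columns); the column names match the ones discussed in the video:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the bulldozer data
df = pd.DataFrame({
    "SalesID": [1139246, 1139248],
    "SalePrice": [66000.0, 57000.0],
    "UsageBand": ["Low", None],
})

# .info() lists each column's dtype and non-null count, which is how we
# spot columns like UsageBand with lots of missing values
df.info()
```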
So that means a machine could be sold multiple times over the timeframe the data covers. The one we probably want to pay most attention to, because it's a time series, aside from SalePrice (that's how much our machines have been sold for), is the sale date. So this is when the machine has been sold. If we have a look at this: saledate, time of sale. Beautiful. OK. So what can we do now? Since the sale date is a date, what we might want to do is parse it as a date, but we won't do that yet, because we want to demonstrate something. So we're going to set up a plot, because this is what we usually do when we first begin any kind of project: try to make some visualizations. Looking at this, there's a lot going on here, right? It's hard to infer much if, like me, you're just looking at a list of words and numbers, and you just say, ah, I need to visualize things, I need to look at things. But what can we actually see from this? So we've got 412,698 entries. Most of these are non-null, but we can see we've got some missing values, especially for UsageBand, whatever that is.
So if we come up to our data dictionary: UsageBand, value low/medium/high, calculated by comparing this particular machine sale's hours to the average usage for the fiBaseModel. So it's based on what's really another feature; there we go, this aggregation of fiModelDesc. Far out, we've got a fair few pieces of information here about our bulldozers, right? This is actually a pretty good dataset. But we can see that for UsageBand there are a fair few missing values, because there are only 73,670 non-null objects. So that means there are 412,698 minus that missing values, and we can figure that out. Actually, before we plot, let's figure that out. So we've got df.isna(); this is how we're going to check if there are null values. There we go.
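The missing-value check can be sketched like this. The three-row frame is illustrative; in the real data UsageBand has only 73,670 non-null entries out of 412,698 rows, i.e. 339,028 missing:

```python
import pandas as pd

# Hypothetical mini-frame with missing values in UsageBand
df = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "UsageBand": ["Low", None, None],
})

# isna() marks each missing cell as True; summing gives the count of
# missing values per column
missing = df.isna().sum()
print(missing)
```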
So UsageBand has 339,028 missing values. Knowing what we know about machine learning models, that's something we're going to have to take care of before we can build a model. But let's push forward, let's plot. Because we're trying to predict SalePrice, and we know that our data is a time series, what we might want to look at is something like comparing sale dates with sale price. And because there are so many rows here, if we were to try to plot all of these data points in one plot (imagine trying to draw over 412,000 dots on a plot), the computer is going to have a hard time. So let's do about a thousand or so, why not? So here we'll do the first thousand, how about that? That's a fairly large number. We want SalePrice, and if you want to find out the names of the different columns, remember you can always go df.columns. So what we're doing here is taking saledate, wherever that is... saledate, there, which, if we come to our data dictionary (this is where we want to be referencing), is the time of sale, the date that it's been sold, as well as SalePrice. Now, is that in our data dictionary? I think that's pretty straightforward, actually.
What's SalePrice? There we go: SalePrice, cost of sale in US dollars. So let's check it out, let's check when some of these have been sold. We're going to put the date on the x-axis, because remember scatter is (x, y), and SalePrice on the y-axis. Let's check it out. Oh, we've missed our equals sign. So you see here, down the bottom, it's trying to plot the saledate of the first thousand examples. This is why our plot is looking a bit funky; it's just trying to pack everything in. If we have a look at our DataFrame, df["saledate"], that's what it's trying to plot on the x-axis. If we have a look at the first thousand, it's trying to plot all of these strings along here, so you can see how it just looks like one big chunk of a barcode and then a big chunk of black text. So what we might do is just plot a histogram, because we always like checking the distribution (which is the spread of the data) of as many variables as possible. But in this case, why not do SalePrice, because that's our target variable. So, SalePrice. And again, if what I'm doing here seems kind of unstructured, it's because it really is; there are so many different ways you can explore a dataset. Some of the ones I like best are df.info() and then checking to see if there are any null values, so df.isna(), as in NA, as in missing values.
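Both plots above can be sketched as below. The mini-frame is a hypothetical stand-in; because saledate is still a plain string column at this point, matplotlib treats the x values as text labels, which is what produces the "barcode" effect on the real data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in; saledate is still an unparsed string column here
df = pd.DataFrame({
    "saledate": ["11/16/2006 0:00", "3/26/2004 0:00", "2/26/2004 0:00"],
    "SalePrice": [66000, 57000, 10000],
})

# Scatter of the first 1000 rows (all 3 in this toy frame);
# string x values are plotted as categorical labels
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000])

# Distribution (spread) of the target variable on its own figure
fig2, ax2 = plt.subplots()
df["SalePrice"].plot.hist(ax=ax2)
```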
So: find out how many missing values there are, check the columns, plot one of the most important columns. For now I've just chosen saledate because it's a time series problem. Plot one of the most important columns versus the sale price, that is, versus the target column, and then do a distribution of a few columns of your choice. These are the stereotypical things that I like to do when first exploring a dataset. But what we can see here is that most of our sale prices are below 20,000 US dollars. So the biggest column here... or not most, but a large majority, a large portion. And then we've got some below 30,000; let's just say 30,000, 40,000, and then it keeps tapering off, there are fewer and fewer, and there aren't very many that cost upwards of 100,000 US dollars. So that's what we're looking at there. Okay. Now, what we might do next is deal with the fact that these dates aren't really in the format that we want them in. See how they're in dtype object? When you're dealing with time series, we want as much information as possible to be encoded in the data. So that's what we'll do. A little way that we can do this is called parsing dates. So let's have a look at what this looks like: parsing dates. So when we work with time series data, we want to enrich the time and date component as much as possible.
You might be thinking, well, how do we do that? Well, let's put it here: we can do that by telling pandas which of our columns has dates in it, using the parse_dates parameter. Wonderful. And what this is going to do is turn whichever column we pass to parse_dates into a datetime object. So let's first see what a datetime object in pandas is. If we go in here, Timestamp, to_datetime, you can find out a whole bunch of different things about what datetime is here: Timestamp is the pandas equivalent of Python's datetime and is interchangeable with it in most cases. OK. Beautiful. So basically, when we run that... let's do it again: import the data again, this time parsing dates. Remember, if in doubt, run the code. So pd.read_csv, and we're going to go data, bluebook-for-bulldozers, tab, and we're going to go TrainAndValid.csv. Wonderful. But this time we're going to pass it low_memory=False, so we don't get that warning, and then we're going to pass it parse_dates, and we'll pass it the column name, saledate. Why? Because that's where our dates are. And so take note of what this is: if we go saledate, it's dtype object.
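The re-import with parse_dates can be sketched like this. As before, an in-memory CSV stands in for TrainAndValid.csv so the snippet runs without the download:

```python
import io
import pandas as pd

# In-memory stand-in for data/bluebook-for-bulldozers/TrainAndValid.csv
csv_data = io.StringIO(
    "SalesID,SalePrice,saledate\n"
    "1139246,66000,11/16/2006 0:00\n"
    "1139248,57000,3/26/2004 0:00\n"
)

# parse_dates tells pandas which column(s) to convert to datetime64
# on import, instead of leaving them as plain object (string) columns
df = pd.read_csv(csv_data, low_memory=False, parse_dates=["saledate"])
print(df["saledate"].dtype)   # datetime64[ns]
print(df["saledate"].iloc[0]) # 2006-11-16 00:00:00
```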
We might just put a little note here: the dtype is dtype('O'). So when we do this, let's see what the difference is. Now, this is going to override our original DataFrame, so we can go df["saledate"].dtype... huh, that's a bit funky. What if we did df["saledate"], what does it look like? We've got saledate, the first thousand. Okay. So if we come up here, we've got saledate; see how it looked like this before? What pandas has done, because we've passed it to parse_dates, is gone over all of the values in the saledate column and converted them into this style. So we've got year, month, day, and we can see that the data type is datetime64. And if you were to Google and compare these two, because you might be wondering why it has a different value here compared to here, you'll find that in NumPy these two are equivalent. So you'll just have to take my word for it, or you could Google it. And we might actually do that, just so I can prove it to you. So we go here, Ctrl+C: "what data type is"... there we go. datetime64 is a general dtype, while M8[ns] is a specific one. There we go. This is Stack Overflow, so this is just the workflow, right, when you don't get something. There we go.
np.dtype("datetime64[ns]") == np.dtype("M8[ns]"). That's what we're after. But now that our dates have been parsed, let's do the same plot that we did before. So fig, ax = plt.subplots(). Wonderful. And we're going to do the exact same thing: ax.scatter(). We'll finish up this video in a second, actually, because we aren't in a rush, right? We're learning and we're exploring. SalePrice, and then we'll go here, we'll do a thousand, and see what looks different. Wonderful. So see how, down the bottom now, instead of the big wall of black we got when we came up here before (because it was trying to squeeze all those strings in), our saledate column is now in the form of a datetime object, so matplotlib is able to intelligently look at the data type and plot it accordingly. So we've got higher prices, and you can see there's a gap where there weren't any sales, maybe around 2005; something happened there. And 2008, I believe that might have been the financial crisis or something like that, so there weren't too many sales there either, between 2008 and 2009. We can see the highest sale was done just in 2007, and then there are a lot of sales around the middle, between 2009 and 2010. But again, this doesn't really tell us too much, right?
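The dtype equivalence mentioned above can be checked in one line:

```python
import numpy as np

# "datetime64[ns]" is the spelled-out name; "M8[ns]" (often displayed as
# "<M8[ns]") is NumPy's shorthand for the same nanosecond-resolution dtype
print(np.dtype("datetime64[ns]") == np.dtype("M8[ns]"))  # True
```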
We're just getting an idea of what's going on. So let's have a look one more time, check it out, at this data that we're working with. We've got SalesID, SalePrice, a whole bunch of different columns here, and we can see that there are a whole bunch of missing values. And if we wanted to find out more about what undercarriage pad width means, remember we've got our data dictionary and we can have a look there. There we go: machine configuration, width of crawler treads. Wonderful. So now that we've done a little bit of exploration, what we might do is sort our DataFrame by saledate and then keep going from there, make some more changes. We'll do that in the next video. But otherwise, have a go at exploring the DataFrame: import it, try parsing the dates with saledate, then import it without parsing the dates, and check the difference in data type when you do and don't parse dates.
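The sorting step previewed for the next video can be sketched as below, on a hypothetical mini-frame with already-parsed dates:

```python
import pandas as pd

# Hypothetical mini-frame with saledate already parsed to datetime64
df = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "saledate": pd.to_datetime(["2006-11-16", "2004-03-26", "2004-02-26"]),
})

# Sort rows into chronological order, which a time-series
# train/validation split relies on
df.sort_values(by="saledate", inplace=True, ascending=True)
print(df["saledate"].iloc[0])  # 2004-02-26 00:00:00
```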