While we were making good progress on the bulldozer price regression problem in the last video, we tried to build a machine learning model and learned that we've got some data that isn't in numeric format. So see here, we've got some strings, and if we check... actually, we're working with df_tmp. My bad. See, these are the things that you're going to forget, that you're going to mix up. df_tmp... still the same problem, right? Because we've got some columns here that are numeric and a whole bunch that aren't. So what can we do about that?

Well done, Daniel. What do you do when you're not even working with your temporary DataFrame? Okay. Well, this is all part of the game, right? Making mistakes and then going back and fixing them, that's part of being a data scientist. Heaps of experimentation.

So one way we can turn all of our string data, or non-numeric data, into numbers is by converting it into pandas categories. That's right. What does that mean? So: convert strings to categories. Now we'll put it here: one way we can turn all of our data into numbers is by converting it into pandas categories. Like the datetime object that we saw when we imported saledate, pandas has another data type called category.
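As a quick, hedged sketch of what that category dtype looks like, here's a made-up series of strings (not the Bulldozers data):

```python
import pandas as pd

# A plain series of strings is stored as the generic "object" dtype.
s = pd.Series(["High", "Low", "Medium", "Low"])
print(s.dtype)  # object

# Converting it to the pandas "category" dtype stores each unique
# value once and represents the rows as integer codes internally.
s_cat = s.astype("category")
print(s_cat.dtype)                 # category
print(list(s_cat.cat.categories))  # ['High', 'Low', 'Medium']
```

The categories come back in alphabetical order because pandas sorts the unique values when it builds the categorical.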
Let's have a look at the pandas types API. Oh, there we go, this is probably a bit better; we can see this here. So pandas.api.types... where can we have a look at this? Utility functions, data-type-related functionality. So here we go. Let's copy this, but let's just read it out first. What's going on here? Dtype introspection: so there's is_bool_dtype, is_categorical_dtype... okay. And pandas_dtype: convert input into a dtype object, or a NumPy dtype object. Okay, so we've got a fair few options here to have a look at different data types with pandas, but since we're focusing on converting strings to categories, let's see how we do that. Let's put a little link here: we can check the different data types compatible with pandas here.

So remember, a lot of machine learning is basically getting some data, which is what we've got here, and then preparing it to run with an existing machine learning model that we've chosen. Now, the machine learning magic is in here. So it's about getting some inputs, a.k.a. the whole bunch of data that we've got here, all this sort of stuff, and then massaging it... that's a good word for it, massaging it into a way that it works with a machine learning model.
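A few of those utility functions in action (a small sketch; the exact list of helpers varies a little between pandas versions):

```python
import pandas as pd

# pandas_dtype converts its input into a dtype object.
print(pd.api.types.pandas_dtype("int64"))  # int64

# The dtype introspection helpers answer yes/no questions about a column.
print(pd.api.types.is_bool_dtype(pd.Series([True, False])))  # True
print(pd.api.types.is_numeric_dtype(pd.Series([1.0, 2.0])))  # True
print(pd.api.types.is_string_dtype(pd.Series(["a", "b"])))   # True
```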
So that the machine learning model can find the patterns. Someone else has built this model for us, and we're going to thank them very much. And then it's going to find the patterns and give us some outputs, a.k.a. some predictions, some labels, something like that, and then we can evaluate how well that works. That's what we're doing here: we're manipulating our inputs to work with our machine learning model.

So let's have a look. Let's remind ourselves of what we're working with: df_tmp.head(). Okay, yep, we've seen this a fair few times, this is just another way to look at it. Always a good idea to look at your data.

And so what we might do is try to use one of these API types. So what can we use? Is there a string one? Here we go: pandas.api.types.is_string_dtype. How do we use this? is_string_dtype... okay, let's try this out. So pd.api.types.is_string_dtype(df_tmp["UsageBand"])... UsageBand... I keep misspelling UsageBand. Who knows, I'm going to keep making that mistake, but let's check. Okay, so it is a string. The UsageBand column is a string. Now we could do the same thing for almost every other column.
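That check can be sketched like this, using a tiny stand-in for df_tmp (the real frame comes from the Bulldozers training CSV; these values are made up):

```python
import pandas as pd

# Tiny stand-in for df_tmp with made-up values.
df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "UsageBand": ["Low", "High", "Medium"],
})

print(pd.api.types.is_string_dtype(df_tmp["UsageBand"]))  # True
print(pd.api.types.is_string_dtype(df_tmp["SalePrice"]))  # False
```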
But why would we go through and write that much code? Let's build a little for loop that does it for us. So: find the columns which contain strings, because if we know which columns contain strings, then we can convert them into some kind of number, and our machine learning model will be happy.

So, for label, content in df_tmp... now, I'm going to go through this one first and then we'll inspect it afterwards, because you might be looking at this going, "I don't know what df_tmp.items() is." But maybe you want to try it out. Maybe pause the video, try it out, and explain it to me. And if not, that's all right, we're going to go through it.

Now, all we're doing is we're just copying something like this, because we're trying to find all the columns with strings, and we're going to pass this... or actually, this is more "content", that's what the documentation says, so we're going to pass content here. So this makes sense. You don't know what df_tmp.items() does unless you've checked it. But we're going to go for label, content, and then we're going to check: if the content is a string data type, yes, then we're going to print the label. Now, you might be able to guess what this does, but if not, that's okay. We're going to see what it does anyway. If in doubt, run the code. And here we go.
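That little for loop, sketched against a made-up stand-in for df_tmp with a mix of numeric and string columns:

```python
import pandas as pd

# Made-up stand-in for df_tmp.
df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000],
    "state": ["Alabama", "Wyoming"],
    "UsageBand": ["Low", "High"],
})

# Find the columns which contain strings.
string_cols = []
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)
        string_cols.append(label)
```

On the real Bulldozers frame this prints every string-typed column name.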
Okay, so these are all the columns that are string data types. Does that make sense? If you're wondering what df_tmp.items() does: it treats our DataFrame like a dictionary, which is kind of what it is. It's a dictionary of keys and values. Okay? The keys are the column names and the values are what you see in the column.

So let's give a little example. If you're wondering what df_tmp.items() does, here's an example. So we'll create random_dict, a dictionary. Let's be creative: key1 equals "hello", and key2 equals... well, maybe you can be a little bit more creative than me. But we're going to go: for key, value in random_dict.items(). So see how we've just gone .items(), same as up here, df_tmp.items(). We're going to go: we want print, "this is a key", and then we'll pass it key, and then we'll also go, "this is a value", value. Wonderful. So if you've worked with Python dictionaries before, you might be able to tell what's going to happen here.

Well, what have we got here? Oh, I see, it's printed out twice. That's right. We can kind of get what's going on there.
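The dictionary version of that example, written out (the keys and values are placeholders):

```python
# .items() on a plain Python dictionary works the same way as on a
# DataFrame: each iteration yields a (key, value) pair.
random_dict = {"key1": "hello", "key2": "world"}

for key, value in random_dict.items():
    print(f"this is a key: {key}")
    print(f"this is a value: {value}")
```

Two keys, so the pair of print statements runs twice.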
So essentially what happens here is, for label, content: the labels are the column names and the content is the content of the column. So this is what this is doing: this brilliant little for loop tells us all the columns that contain the string data type. Wonderful.

So now that we know that, what can we do? Mm hmm. Well, we could read through our pandas data-type-related functionality... we're not going to do that, but it's somewhere in here, or somewhere in there. Or you could also search up how to change the data type of a pandas column. One of the things you might stumble across is converting strings into categories. Rather than talk about it, Daniel, come on, let's just code. Let's see it happen. All right, all right, all right. Let's do it.

So this will turn all of the string values into category values. And if you're wondering why we would do this, well, that's going to make a lot of sense in a second. So for label... we're going to just bring in the same for loop that we wrote above with df_tmp.items(), and we're going to go: if pd.api.types — it's a fairly long one — is_string_dtype... that's a mouthful.
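The conversion loop, sketched on a made-up stand-in for df_tmp:

```python
import pandas as pd

# Made-up stand-in for df_tmp.
df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000],
    "state": ["Wyoming", "Alabama"],
})

# Turn all of the string columns into (ordered) pandas categories.
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[label] = content.astype("category").cat.as_ordered()

print(df_tmp.dtypes)  # state is now "category"; SalePrice is untouched
```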
So basically: is the content a string? If true, we want df_tmp... we want to change that column. So see here, the column name. So we want to access that column name, and we want to set it equal to the content. So, right: keep the column name but change the content with .astype("category")... if you've spelled it right... .cat.as_ordered(). Oh my goodness, what is happening here? Well, let's have a look. What is going to happen? Mm hmm.

This might take a little while, because we've got to go through a lot of columns, but we're going to preload and execute the next cell. Right. Because when we're pumping through this beautiful df_tmp.info()... have a look here. What is going on? What have we changed? I'll give you a hint, it's something over here. Well, it's this: these used to be object, okay, they used to be strings. Now we've changed them into categories. Mm hmm.

Well, that's interesting. What does this mean? Well, aha, this is where we can have a look. Remember, we're trying to make sure all of our DataFrame is numeric, because we can't pass strings to a machine learning model.
This is what we're working towards. df_tmp... let's check out... which one did we check before? Was it the state column? Beautiful. Let's check that: df_tmp.state.cat.categories. Wonderful. Well, there's old faithful Wyoming. Shout out to Wyoming and every other state, wherever you're from. And every other country, to be honest, because this is on the internet.

All right, so we have categories here. And what this is going to do is... remember when we did df_tmp.state.value_counts()? Remember when we did this? So if we were to go through all of these... because remember how we did it up here... and there's a fair few things we've just covered here. Daniel, come on, explain them. as_ordered means these are now ordered; see how they're alphabetical? A, A, A, A... these aren't ordered down here, while up there they're ordered by their value counts.

But essentially, what this has done is turned the column from strings to categories. So when we called this, it assigned a numerical value to each category. Now, looking at this, it doesn't really make sense, right? Because these still look like strings.
However, under the hood, pandas is treating these as numbers. So Alabama might be 1, Alaska might be 2, Arizona might be 3. And we can check that by going df_tmp.state.cat... so we're accessing the category attributes. Remember how we accessed the datetime attributes using .dt? Well, now we're accessing the category ones: .cat.codes.

Okay. So now we have a way that our DataFrame can be accessed... at least all the string values, right, have now been converted to categories. Because, see here, we looped through our DataFrame, we figured out whether each column was a string or not, we changed their data type to category, and we gave them an order. So now we have all these categories here, and we have a way to access the numeric value. Hey, I'm rubbing my hands together; you may not be able to hear that, but I'm really excited. But even though we have a way to access our data as numeric, we still have a whole bunch of missing values. So let's write this down. Let's give some thanks: thanks to pandas categories, we now have a way to access all of our data in the form of numbers. Beautiful. But we still have a bunch of missing data.

Let's have a look at that. Check missing data. So we've got df_tmp.isna().sum(). Now, let's get some ratios here.
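Before we deal with the missing values, the .cat.categories / .cat.codes machinery can be recapped in a small sketch (made-up states, not the real column):

```python
import pandas as pd

states = pd.Series(["Wyoming", "Alabama", "Alaska", "Alabama"],
                   dtype="category")

# The unique values are stored once, in alphabetical order...
print(list(states.cat.categories))  # ['Alabama', 'Alaska', 'Wyoming']

# ...and under the hood each row is just an integer code into that list.
print(list(states.cat.codes))       # [2, 0, 1, 0]
```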
Right, so let's divide it by the length of the DataFrame. So we've got no missing values in these columns here, because these are all zeros. And then we have... okay, so not much missing there, but UsageBand has 80 per cent missing data. Eighty-five per cent missing data, 81 per cent missing data, 52 per cent, or 93... my goodness. Maybe something is going on there. If this was a dataset in the wild that we were working on with a client, maybe we'd ask them, but since it's from Kaggle, accessing a subject matter expert is probably pretty hard for this dataset.

So these are some things we might explore a bit more, but we're on our mission to build a machine learning model. So what we might have to do next is fill these missing values. But before we do that, since we've made some changes to our df_tmp, we'll save it to a new CSV, so that if we were to start this notebook fresh, we could just import our manipulated DataFrame and start exactly where we are now.

So: save preprocessed data. And we'll go: export current df_tmp DataFrame. So we'll go df_tmp.to_csv(), and we'll put it back in the data folder, so bluebook-for-bulldozers, and we're going to call it just something simple, like train_tmp.csv. Beautiful.
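That missing-value ratio check can be sketched like this (a tiny made-up frame; the real percentages come from the Bulldozers data):

```python
import pandas as pd

# Made-up frame: one fully-populated column, one mostly missing.
df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000, 38500],
    "UsageBand": ["Low", None, None, None],
})

# Count of missing values per column, as a ratio of all rows.
missing_ratios = df_tmp.isna().sum() / len(df_tmp)
print(missing_ratios)
# SalePrice    0.00
# UsageBand    0.75
```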
We want index=False, otherwise we'll come up with some funny things when we re-import it. And then we're going to import the preprocessed data. Now, you could do this at whatever stage you're at, right? If you've made some changes to the temporary DataFrame... I just decided to do it now because it crossed my mind. You can do it at whatever stage you're at: if you've updated some columns, if you've done some feature engineering, you can export it and re-import it. That way, if you're coming back at a later session, you don't have to make those changes again; you don't have to run your notebook from the top. You could just run it from here on out. We've written enough, Daniel, come on, we want to read it now. pd.read_csv... this should work, this should re-import it. And then if we went df_tmp.head(), we should have the exact same DataFrame that we're working with now. Wonderful.

And so again, we'll just remind ourselves that we have a lot of missing values to deal with. Beautiful. Just because we exported and re-imported our DataFrame doesn't mean that our missing values have been handled. So that's really what we're going to do... well, it is what we're going to do in the next video, as we work towards getting our data ready for a machine learning model.
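The save-and-reload round trip can be sketched like this (a made-up frame written to a temporary file rather than the real data folder):

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for df_tmp.
df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000],
    "state": ["Wyoming", "Alabama"],
})

# Export, then read straight back in.
out_path = os.path.join(tempfile.mkdtemp(), "train_tmp.csv")
df_tmp.to_csv(out_path, index=False)  # index=False avoids a stray index column

df_tmp_reloaded = pd.read_csv(out_path)
print(df_tmp_reloaded.equals(df_tmp))  # True -- same values either side
```

One caveat worth knowing: the category dtype itself does not survive a CSV round trip; reloaded string columns come back as plain object columns, so the category conversion would need to be re-run after importing.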
Let's start to fill some missing values.