1 00:00:00,360 --> 00:00:03,830 Now we've filled up all the missing numerical values in our data frame.
2 00:00:04,380 --> 00:00:05,190 Let's do the same,
3 00:00:05,190 --> 00:00:09,510 but for the rest of the missing values, which are these ones here with numbers.
4 00:00:09,510 --> 00:00:13,920 So we found this out using df_tmp.isna().sum().
5 00:00:14,100 --> 00:00:20,250 So we've got a fair few that are still missing, and I have a hunch that they're all categorical. Remember how
6 00:00:20,490 --> 00:00:24,390 we found columns before that with the string data type?
7 00:00:24,960 --> 00:00:32,520 Let's do that again, but this time, because we've filled up all the numeric data, let's find all the columns
8 00:00:32,520 --> 00:00:38,400 that aren't numeric data, and that should tell us which of these columns still have missing values.
9 00:00:38,400 --> 00:00:39,200 Does that make sense?
10 00:00:39,210 --> 00:00:43,380 Because we've filled up all of the columns with numeric data,
11 00:00:43,650 --> 00:00:47,150 we want to find the columns that aren't numeric data. Let's make a little heading here.
12 00:00:47,190 --> 00:00:53,610 So: filling and turning categorical variables into numbers.
13 00:00:53,610 --> 00:01:00,320 We're working on categorical variables now because we've already worked on our numerical variables. So
14 00:01:00,380 --> 00:01:08,500 let's code: check for columns which aren't numeric.
15 00:01:08,990 --> 00:01:12,850 We'll go for label, content in
16 00:01:13,080 --> 00:01:24,450 df_tmp.items(), and then we'll go if not pd.api.types.is_numeric_dtype.
17 00:01:24,950 --> 00:01:31,160 So does this make sense? If not, so if they're not numeric. Because we filled up all the variables which
18 00:01:31,160 --> 00:01:39,050 are numeric, now we want to find the ones which aren't numeric. Content, and then we want to go print label.
19 00:01:40,130 --> 00:01:41,690 Beautiful.
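A minimal sketch of the check dictated above, assuming a tiny stand-in for df_tmp. The SalePrice column and all of the values are hypothetical and only there so the loop has something to run on.

```python
import pandas as pd

# Hypothetical stand-in for df_tmp: one numeric column, two string columns.
df_tmp = pd.DataFrame({
    "SalePrice": [66000.0, 57000.0, 10000.0],
    "state": ["Alabama", "Texas", None],
    "UsageBand": ["Low", None, None],
})

# Check for columns which aren't numeric.
non_numeric = []
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)
        non_numeric.append(label)
```

Collecting the labels in a list as well as printing them is an addition for convenience; the spoken version only prints.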
20 00:01:41,720 --> 00:01:47,010 Now this is gonna tell us all of the values, all of the columns that aren't numeric, but some of them...
21 00:01:47,010 --> 00:01:47,290 OK.
22 00:01:47,300 --> 00:01:53,000 UsageBand, that has a fair few missing values, so that has three hundred thirty-nine thousand missing
23 00:01:53,000 --> 00:01:53,970 values.
24 00:01:53,990 --> 00:01:56,390 So now we know all of the columns that aren't numeric.
25 00:01:56,390 --> 00:01:56,930 What can we do?
26 00:01:56,930 --> 00:02:02,540 How do we fill these up? When we turned a string column into categories,
27 00:02:02,540 --> 00:02:10,100 we had access to an attribute called codes, which gave us a numerical value for all of the variables
28 00:02:10,670 --> 00:02:12,520 in that column.
29 00:02:12,890 --> 00:02:14,960 Let's have a look at an example before we do it.
30 00:02:14,960 --> 00:02:16,760 So pd.Categorical.
31 00:02:16,760 --> 00:02:22,940 So what this is going to do is go pandas Categorical type, and then we're going to pass it df_tmp
32 00:02:23,160 --> 00:02:28,190 state. We'll look at the state codes because that's the one we used as an example, and we'll see what this
33 00:02:28,190 --> 00:02:34,220 does. Categories, fifty-three, object. So this is categories.
34 00:02:34,250 --> 00:02:39,440 Okay, can we do dtype here? CategoricalDtype.
35 00:02:39,500 --> 00:02:41,300 Okay, yeah.
36 00:02:41,450 --> 00:02:45,880 And what we might do is .codes.
37 00:02:46,100 --> 00:02:53,880 Okay, so this is how we turn all of the state, the variables in the state column, into numbers.
38 00:02:54,540 --> 00:03:00,300 So how would we do this for everything, all of the categorical columns? Because that's what we need to
39 00:03:00,300 --> 00:03:06,150 do, right? We need to fill in the missing categorical variables and turn them into numbers at the same
40 00:03:06,150 --> 00:03:07,190 time.
41 00:03:07,500 --> 00:03:08,760 So we might use this.
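To make the codes attribute concrete, here is a small sketch using a hypothetical handful of state values rather than the real column. pd.Categorical infers the categories from the data and, for sortable values like strings, stores them in sorted order; each code is the position of a value within those categories.

```python
import pandas as pd

# Hypothetical values standing in for df_tmp["state"].
state = pd.Series(["Texas", "Alabama", "Texas", "Florida"])

cat = pd.Categorical(state)
print(cat.categories)  # the distinct labels, sorted
print(cat.dtype)       # a CategoricalDtype
print(cat.codes)       # one integer per row, indexing into cat.categories
```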
42 00:03:08,760 --> 00:03:11,610 Let's keep that there. We'll make another cell here.
43 00:03:11,610 --> 00:03:21,710 We might go: turn categorical variables into numbers and fill missing.
44 00:03:21,780 --> 00:03:26,460 So let's loop through: for label, content in df_tmp
45 00:03:26,460 --> 00:03:33,150 .items(), and then if not, so if it's not, just like we've done up here.
46 00:03:33,150 --> 00:03:39,810 So if it's not a numerical type: pd.api.types.is_numeric_dtype.
47 00:03:40,170 --> 00:03:46,030 But remember we've got the not here, so it's checking if it's not numeric dtype. Content.
48 00:03:46,260 --> 00:03:51,890 Wonderful. Because we know it's not numeric, we might check to see if it's missing again, and we're going
49 00:03:51,890 --> 00:03:56,940 to add a binary column just like we did in the previous video with numerical data.
50 00:03:56,940 --> 00:04:06,930 So: add a binary column to indicate whether a sample had missing value. Wonderful. So df_tmp,
51 00:04:07,070 --> 00:04:07,940 here we go.
52 00:04:07,940 --> 00:04:08,600 Label,
53 00:04:08,600 --> 00:04:11,320 plus, we'll do a little string
54 00:04:11,320 --> 00:04:19,700 here, "_is_missing", equals pd.isnull(content). So that's just going to return true or false
55 00:04:19,730 --> 00:04:21,040 if the value was missing. So
56 00:04:21,050 --> 00:04:25,540 if the categorical value is missing, it will return true.
57 00:04:25,600 --> 00:04:28,130 It will make a column and assign that to true.
58 00:04:28,430 --> 00:04:37,700 If it's not missing, it will just have false, a.k.a. true is one, false is zero. Turn categories into numbers
59 00:04:37,820 --> 00:04:39,440 and add plus 1.
60 00:04:39,440 --> 00:04:42,050 Then you might be wondering, Daniel, why are we adding plus 1?
61 00:04:42,050 --> 00:04:50,330 Well, actually, let's write the line of code first, because let's see the code: pd.Categorical(content).codes
62 00:04:50,420 --> 00:04:52,290 plus 1.
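Put together, the cell dictated above might look something like the sketch below. The data frame is a hypothetical stand-in, and the assignment of the shifted codes back to df_tmp[label] is an assumption about how the line is used. The sketch also loops over a snapshot of the column names instead of df_tmp.items(), a cautious substitution so that the loop is not iterating over columns while new ones are being added.

```python
import pandas as pd

# Hypothetical stand-in for df_tmp.
df_tmp = pd.DataFrame({
    "SalePrice": [66000.0, 57000.0, 10000.0],
    "UsageBand": ["Low", None, "High"],
})

# Turn categorical variables into numbers and fill missing.
for label in list(df_tmp.columns):  # snapshot of the labels before adding columns
    content = df_tmp[label]
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column to indicate whether a sample had a missing value.
        df_tmp[label + "_is_missing"] = pd.isnull(content)
        # Turn categories into numbers and add +1, so missing becomes 0.
        df_tmp[label] = pd.Categorical(content).codes + 1

print(df_tmp)
```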
63 00:04:52,350 --> 00:04:55,000 Now you might be wondering why we're doing plus 1.
64 00:04:55,010 --> 00:04:55,790 All these codes...
65 00:04:55,820 --> 00:04:58,560 So if we did this plus 1,
66 00:04:58,700 --> 00:05:00,780 see how it goes up?
67 00:05:00,800 --> 00:05:04,100 Keep that there. Goes up, OK.
68 00:05:05,370 --> 00:05:06,540 Why would we do this?
69 00:05:06,900 --> 00:05:12,690 Well, the way pandas Categorical works is, automatically, if there was some missing value in here,
70 00:05:12,720 --> 00:05:15,730 it assigns that a code of negative 1.
71 00:05:15,930 --> 00:05:17,520 But we don't want that.
72 00:05:17,550 --> 00:05:20,040 We want it to be not negative one, we want it to be zero.
73 00:05:20,100 --> 00:05:24,570 So maybe if we find a column that has missing values, we might be able to see this.
74 00:05:24,570 --> 00:05:27,140 If not... we go usage...
75 00:05:27,170 --> 00:05:27,710 No.
76 00:05:27,860 --> 00:05:28,860 Is it UsageBand?
77 00:05:28,950 --> 00:05:30,980 That one has a lot of missing values, doesn't it?
78 00:05:31,050 --> 00:05:31,620 UsageBand.
79 00:05:31,650 --> 00:05:35,070 Okay, let's have a look at this column.
80 00:05:35,220 --> 00:05:39,130 Codes are negative one, negative one, negative one. Beautiful.
81 00:05:39,300 --> 00:05:42,480 We don't want this to be negative one, so we're going to add one there,
82 00:05:42,480 --> 00:05:45,820 so none of the numbers in our data frame are negative.
83 00:05:45,840 --> 00:05:48,240 That's the reason why we add plus 1 here.
84 00:05:48,270 --> 00:05:55,170 So if we were to do this... boom. That way we know, when we turn our data frame numeric, into all numbers
85 00:05:55,320 --> 00:06:00,990 or categories, that way we know zero is actually missing. All the other categories are just plus 1 on
86 00:06:00,990 --> 00:06:03,360 what their category code is.
87 00:06:03,540 --> 00:06:07,200 So let's run this cell.
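The reason for the plus 1 can be seen on a small example. This sketch uses hypothetical UsageBand values; the only behaviour it relies on is that pd.Categorical gives missing entries the code -1.

```python
import pandas as pd

# Hypothetical values standing in for df_tmp["UsageBand"].
usage_band = pd.Series(["Low", None, "Medium", None])

codes = pd.Categorical(usage_band).codes
print(codes)      # missing entries are coded as -1

shifted = codes + 1
print(shifted)    # after the shift, 0 marks a missing value
```

Every real category simply moves up by one, so the mapping from category to number is still one to one.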
88 00:06:07,200 --> 00:06:11,040 This might take a little while. A fair few missing values, so we have to fill all of these.
89 00:06:11,040 --> 00:06:16,180 So what we're doing: we're turning all of these columns that are non-numeric...
90 00:06:16,180 --> 00:06:18,370 Remember, that's what we found here.
91 00:06:18,390 --> 00:06:19,560 Not numeric.
92 00:06:19,620 --> 00:06:23,830 We're taking those missing values, or actually, first we're turning them into categories.
93 00:06:23,880 --> 00:06:29,220 If they have a missing value, that row will be assigned negative 1, but we're adding plus 1 to the code,
94 00:06:29,850 --> 00:06:35,220 so all the missing values are going to be 0, and all of the values that are still there,
95 00:06:35,220 --> 00:06:41,760 the categories, are gonna be some number pertaining to whatever pandas decided that their categories
96 00:06:41,900 --> 00:06:44,090 are. Beautiful.
97 00:06:44,160 --> 00:06:46,040 It's worked, OK.
98 00:06:46,380 --> 00:06:54,560 So now we can check to see what we've done by going df_tmp.info(). So we see here...
99 00:06:54,630 --> 00:06:58,200 Okay, so now we have one hundred and three different columns.
100 00:06:58,210 --> 00:06:59,680 Wow.
101 00:06:59,680 --> 00:07:00,460 Well, let's have a look.
102 00:07:00,460 --> 00:07:01,170 So if we go
103 00:07:01,160 --> 00:07:12,780 df_tmp.head().T. Okay, so see, these are all now categorical is-missing columns that have been
104 00:07:13,020 --> 00:07:14,190 added on the end.
105 00:07:14,190 --> 00:07:20,010 Now this is from this line of code here, see what I mean? Add a binary column to indicate whether a
106 00:07:20,010 --> 00:07:21,970 sample had missing value.
107 00:07:22,380 --> 00:07:27,190 So this means that for sample 0, it had a Differential_Type...
108 00:07:27,450 --> 00:07:28,190 It was missing.
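A rough sketch of the inspection step, assuming a hypothetical already-encoded data frame with a single flag column. The values are made up; only the calls info() and head().T come from the walkthrough.

```python
import pandas as pd

# Hypothetical already-encoded stand-in for df_tmp.
df_tmp = pd.DataFrame({
    "SalePrice": [66000.0, 57000.0],
    "Differential_Type": [0, 1],  # 0 marks a value that was missing
    "Differential_Type_is_missing": [True, False],
})

df_tmp.info()            # column count, dtypes and non-null counts
transposed = df_tmp.head().T
print(transposed)        # each column becomes a row, which is easier to scan

# The flag columns are the ones appended on the end by the fill step.
flag_cols = [col for col in df_tmp.columns if col.endswith("_is_missing")]
print(flag_cols)
```

Here sample 0 has Differential_Type_is_missing set to True and Differential_Type set to 0, which is the pattern described for the real data.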
109 00:07:28,770 --> 00:07:34,920 So if we went back and found that column for sample 0, it would be filled with a value 0 instead of just
110 00:07:34,920 --> 00:07:35,300 missing.
111 00:07:39,070 --> 00:07:46,990 One last check is to see if there are any more missing values, because with what we've done, filling up the numeric
112 00:07:47,080 --> 00:07:51,720 and categorical values of our data frame and turning them into numbers,
113 00:07:51,790 --> 00:07:54,750 there shouldn't be any more missing values.
114 00:07:55,060 --> 00:07:56,140 Beautiful.
115 00:07:56,440 --> 00:07:57,690 So they should be all zero.
116 00:07:57,700 --> 00:08:01,810 So maybe I can... it's a bit cut off. Ten.
117 00:08:01,930 --> 00:08:12,870 See what I mean, it's getting truncated. Twenty. Wonderful. So we've got no more missing values in our data frame
118 00:08:13,020 --> 00:08:16,440 and our data should be all numeric.
119 00:08:16,440 --> 00:08:17,730 What is that telling you?
120 00:08:17,880 --> 00:08:19,860 What do you know about machine learning models?
121 00:08:19,860 --> 00:08:21,020 What are those two points
122 00:08:21,060 --> 00:08:22,250 hinting to you
123 00:08:22,650 --> 00:08:26,880 when thinking about machine learning models? We'll wait till the next video so we can find out.
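The final check could be sketched as below, again on a hypothetical encoded data frame. Using head(20) to view the first twenty counts and summing the counts into one total are assumptions about convenient ways to avoid the truncated display, not necessarily what was typed on screen.

```python
import pandas as pd

# Hypothetical fully encoded stand-in for df_tmp.
df_tmp = pd.DataFrame({
    "SalePrice": [66000.0, 57000.0],
    "UsageBand": [1, 0],
    "UsageBand_is_missing": [False, True],
})

# Missing values per column; every count should be zero.
missing_per_column = df_tmp.isna().sum()
print(missing_per_column.head(20))

# A single total is easier to read than a long, truncated listing.
total_missing = int(missing_per_column.sum())

# Confirm that every column now has a numeric (or boolean) dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(df_tmp[col]) for col in df_tmp.columns)
print(total_missing, all_numeric)
```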