1 00:00:00,033 --> 00:00:01,500 Okay, then. 2 00:00:01,500 --> 00:00:02,733 Almost the last step. 3 00:00:02,733 --> 00:00:05,233 Now remember, this is just an object. 4 00:00:05,233 --> 00:00:08,700 We haven't connected anything yet to our matrix of features. 5 00:00:08,966 --> 00:00:14,100 So the next step is indeed to apply this imputer object to the matrix of features. 6 00:00:14,600 --> 00:00:15,933 So how are we going to do that? 7 00:00:15,933 --> 00:00:20,066 Well, remember that a class contains a set of instructions 8 00:00:20,200 --> 00:00:24,166 but also some operations and actions which you can apply 9 00:00:24,400 --> 00:00:26,433 to other objects or variables. 10 00:00:26,433 --> 00:00:27,933 And these are called methods. 11 00:00:27,933 --> 00:00:29,566 You know, they're like functions. 12 00:00:29,566 --> 00:00:32,566 And one of them is exactly the fit method. 13 00:00:32,800 --> 00:00:38,300 The fit method will exactly connect this imputer to the matrix of features. 14 00:00:38,333 --> 00:00:41,433 In other words, what this fit method will do is it will 15 00:00:41,433 --> 00:00:45,033 look at the missing values in, you know, the salary column. 16 00:00:45,300 --> 00:00:48,400 And also it will compute the average of the salaries. 17 00:00:48,700 --> 00:00:51,000 All right. So that's simply what it will do. 18 00:00:51,000 --> 00:00:55,100 And then that will not be enough to do the replacement we want. To do 19 00:00:55,100 --> 00:00:59,233 the replacement, we'll have to call another method called transform, 20 00:00:59,400 --> 00:01:02,700 which will this time apply the transformation, meaning 21 00:01:02,966 --> 00:01:06,866 it will replace the missing salary here by the average of the salaries. 22 00:01:07,100 --> 00:01:09,400 Okay, so let's do this. 23 00:01:09,400 --> 00:01:11,966 Let's first call the fit method. 24 00:01:11,966 --> 00:01:16,066 In order to do this, well, of course we have to first call our imputer object.
25 00:01:16,733 --> 00:01:19,366 And then from this object, you know, adding a dot, 26 00:01:19,366 --> 00:01:22,366 we will call the fit method 27 00:01:22,600 --> 00:01:26,266 which has some parentheses because it's like a function inside a class. 28 00:01:26,300 --> 00:01:29,066 And what does this function expect as arguments? 29 00:01:29,066 --> 00:01:32,033 Well, it simply expects all the columns of X 30 00:01:32,033 --> 00:01:35,900 with numerical values, but only the ones with numerical values. 31 00:01:35,900 --> 00:01:38,900 Not the ones with text, strings or categories. 32 00:01:39,000 --> 00:01:41,033 And so, well, how do we get these columns? 33 00:01:41,033 --> 00:01:42,133 Well, first let's get 34 00:01:42,133 --> 00:01:46,066 the matrix of features X because that's where we want to replace the missing data. 35 00:01:46,066 --> 00:01:48,600 And from this matrix of features X, 36 00:01:48,600 --> 00:01:51,300 well, first we're going to take all the rows. 37 00:01:51,300 --> 00:01:52,433 You know, this fit 38 00:01:52,433 --> 00:01:57,166 method will read the whole columns that we specify inside this method. 39 00:01:57,466 --> 00:02:01,066 But then for the columns here, you know, we could specify all the columns 40 00:02:01,300 --> 00:02:03,833 where to look for some missing data. 41 00:02:03,833 --> 00:02:06,300 However, this first column is dangerous. 42 00:02:06,300 --> 00:02:09,700 You know, it is a column with strings, and therefore this might cause 43 00:02:09,700 --> 00:02:12,700 a warning or an error when looking for some missing data here. 44 00:02:12,833 --> 00:02:17,000 Therefore, we're only going to specify the columns with numerical values, 45 00:02:17,133 --> 00:02:18,666 age and salary. 46 00:02:18,666 --> 00:02:23,533 And therefore here we're going to enter the range from one to, be careful, 47 00:02:23,533 --> 00:02:27,600 not two, because remember the upper bound of a range in Python is excluded.
48 00:02:27,600 --> 00:02:30,400 So if we stop at two, this will leave out the salary. 49 00:02:30,400 --> 00:02:33,533 Therefore we have to go up to three, okay. 50 00:02:33,800 --> 00:02:34,366 Up to three. 51 00:02:34,366 --> 00:02:36,733 So that, well, this fit method 52 00:02:36,733 --> 00:02:40,933 will look for all the missing values in the age column and the salary column. 53 00:02:40,933 --> 00:02:44,700 So here we're specifying specific columns, which are the age column and 54 00:02:44,700 --> 00:02:45,700 the salary column. 55 00:02:45,700 --> 00:02:48,600 And that's because we know that there is a missing salary. 56 00:02:48,600 --> 00:02:51,000 And by the way, there is also a missing age. 57 00:02:51,000 --> 00:02:55,933 However, I recommend as a general rule to select all the numerical columns, 58 00:02:56,100 --> 00:02:59,100 because in your career you will actually work with huge data sets 59 00:02:59,300 --> 00:03:02,233 and you won't be able to see where the missing values are. 60 00:03:02,233 --> 00:03:07,600 So just include all the numerical columns to make sure to replace any missing data. 61 00:03:07,966 --> 00:03:11,100 Remember to exclude these ones, the string columns, okay, 62 00:03:11,400 --> 00:03:14,400 so that's just something important to consider. 63 00:03:14,433 --> 00:03:15,466 And then here we go. 64 00:03:15,466 --> 00:03:19,200 This will connect our imputer to our matrix of features. 65 00:03:19,200 --> 00:03:23,666 And now, final step, we have to call the transform method 66 00:03:23,866 --> 00:03:27,133 once again from our imputer object. 67 00:03:27,933 --> 00:03:32,233 And so this transform method will exactly do that 68 00:03:32,233 --> 00:03:36,900 replacement of the missing salary here by the mean of the salaries. 69 00:03:37,300 --> 00:03:38,766 And same for the missing age. 70 00:03:38,766 --> 00:03:42,833 It will be replaced by the mean of all the ages in the age column.
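As a rough sketch of the fit call described so far: the snippet below assumes the imputer is scikit-learn's `SimpleImputer` set up with `strategy="mean"`, which this part of the transcript does not state explicitly. The four-row matrix `X` is made-up illustration data (country, age, salary), not the course data set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix of features: country (text), age, salary.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 31.0, np.nan],   # missing salary
        ["Spain", np.nan, 54000.0],  # missing age
    ],
    dtype=object,
)

# Assumed setup: replace NaN values with the mean of their column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# All rows, columns 1 and 2 only: the upper bound 3 is excluded,
# and column 0 (strings) is left out on purpose.
numeric_columns = X[:, 1:3]

# fit only learns the column means; X itself is not modified yet.
imputer.fit(numeric_columns)

print(numeric_columns.shape)  # (4, 2)
print(imputer.statistics_)    # the learned means for age and salary
```

After `fit`, the missing entries in `X` are still missing; the replacement only happens once `transform` is called.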
71 00:03:43,400 --> 00:03:46,400 And so according to you, what do we have to input here? 72 00:03:46,533 --> 00:03:48,366 Well, there is no trap here. 73 00:03:48,366 --> 00:03:53,533 We of course have to input the columns of X where we want to replace missing data. 74 00:03:53,533 --> 00:03:57,000 And so these are the age column and the salary column. 75 00:03:57,233 --> 00:04:00,066 And therefore we simply have to input exactly 76 00:04:00,066 --> 00:04:03,066 the same as what was input in the fit method. 77 00:04:03,066 --> 00:04:08,366 So we're just going to take this, copy this and paste that inside the transform method. 78 00:04:09,400 --> 00:04:09,933 However, be 79 00:04:09,933 --> 00:04:15,166 careful, this transform method actually returns the new updated version 80 00:04:15,166 --> 00:04:16,400 of the matrix of features 81 00:04:16,400 --> 00:04:20,700 X with the two replacements of the missing salary and the missing age, 82 00:04:21,033 --> 00:04:24,033 and therefore what we want to do now, and that's the last thing we have to do, 83 00:04:24,133 --> 00:04:29,200 is to indeed update our matrix of features X, and to do this, well, since 84 00:04:29,500 --> 00:04:33,866 this exactly returns these two columns here, with that replacement done, 85 00:04:34,166 --> 00:04:38,766 well, what we want to do to update X is actually to take this, 86 00:04:38,766 --> 00:04:43,200 you know, take the second and third columns of the matrix of features X and 87 00:04:44,266 --> 00:04:48,333 replace them with what will be returned by this transform method 88 00:04:48,600 --> 00:04:52,600 of the imputer object, so that the second and third columns of X 89 00:04:52,733 --> 00:04:56,266 will be updated with that average age and average salary. 90 00:04:56,400 --> 00:04:59,833 And therefore the whole matrix of features X will be exactly the same, 91 00:04:59,833 --> 00:05:03,400 but with these new average age and average salary.
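The update of X described here might look like the sketch below. It again assumes scikit-learn's `SimpleImputer` with `strategy="mean"`, and `X` is the same made-up toy matrix rather than the course data set. The point it illustrates is that `transform` returns only the two columns it was given, so the result has to be assigned back into the same slice of `X`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix of features: country (text), age, salary.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 31.0, np.nan],   # missing salary
        ["Spain", np.nan, 54000.0],  # missing age
    ],
    dtype=object,
)

# Assumed setup: mean imputation of NaN values.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])

# transform returns a new array holding only the age and salary columns,
# with each NaN replaced by the mean learned during fit.
filled = imputer.transform(X[:, 1:3])

# Write the result back into the second and third columns of X.
# The first column (country) is untouched.
X[:, 1:3] = filled

print(X)
```

Calling `transform` without the assignment would leave `X` unchanged, because the returned array is a separate object.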
92 00:05:04,033 --> 00:05:06,000 Okay, we're going to check that right away. 93 00:05:06,000 --> 00:05:09,466 Now, to do so, we're going to create a new code cell 94 00:05:09,466 --> 00:05:13,200 where we're going to print the new matrix of features X. 95 00:05:13,433 --> 00:05:17,700 And let's see if indeed that missing value that we can clearly see 96 00:05:17,700 --> 00:05:22,200 here, once again as NaN, is replaced in this new version of X. 97 00:05:22,200 --> 00:05:25,533 So let's not forget to run these two cells here, 98 00:05:25,866 --> 00:05:29,400 this first cell to indeed replace the missing data. 99 00:05:29,400 --> 00:05:33,300 And now this cell to print the new matrix X. 100 00:05:33,566 --> 00:05:34,800 And there you go. 101 00:05:34,800 --> 00:05:39,433 As we can clearly see, that missing salary in the previous matrix of features 102 00:05:39,433 --> 00:05:44,500 X was indeed replaced by the average of the salaries in this column. 103 00:05:44,766 --> 00:05:46,600 Okay. Check that if you want. 104 00:05:46,600 --> 00:05:47,566 But there you go. 105 00:05:47,566 --> 00:05:51,266 Now you have another tool in your data preprocessing toolkit. 106 00:05:51,500 --> 00:05:52,833 So congratulations. 107 00:05:52,833 --> 00:05:58,633 And now we're going to proceed to a new tool which is to encode categorical data.
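For anyone who does want to check the result, one possible way is sketched below. It assumes scikit-learn's `SimpleImputer` with `strategy="mean"` and uses a made-up toy matrix, so the names and numbers are illustrative only. The idea is to compute the column means by hand before imputing, then confirm that the filled cells match them and that no NaN is left.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix of features: country (text), age, salary.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 31.0, np.nan],   # missing salary
        ["Spain", np.nan, 54000.0],  # missing age
    ],
    dtype=object,
)

# Means of the known ages and salaries, computed independently of the imputer.
expected_means = np.nanmean(X[:, 1:3].astype(float), axis=0)

# Assumed setup: mean imputation, then write the result back into X.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Check 1: no missing value remains in the numerical columns.
no_missing_left = not np.isnan(X[:, 1:3].astype(float)).any()

# Check 2: the formerly missing salary now equals the hand-computed mean.
salary_matches = bool(np.isclose(X[2, 2], expected_means[1]))

print(X)
print(no_missing_left, salary_matches)
```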