1 00:00:00,166 --> 00:00:00,966 Okay, my friends. 2 00:00:00,966 --> 00:00:04,966 So now let's add a new tool to our data preprocessing toolkit, 3 00:00:05,166 --> 00:00:08,166 which is taking care of missing data. 4 00:00:08,333 --> 00:00:14,133 So indeed, if we have a look again at our dataset, data.csv, we notice 5 00:00:14,166 --> 00:00:19,166 that there is a missing salary here for this specific customer from Germany, 6 00:00:19,466 --> 00:00:22,700 who is 40 years old and who purchased the product. 7 00:00:23,033 --> 00:00:26,466 So generally you don't want to have any missing data in your data 8 00:00:26,466 --> 00:00:29,633 set for the simple reason that it can cause some errors 9 00:00:29,633 --> 00:00:33,766 when training your machine learning model, and therefore you must handle it. 10 00:00:34,100 --> 00:00:36,333 There are actually several ways to handle it. 11 00:00:36,333 --> 00:00:40,500 A first way is to just ignore the observation by deleting it. 12 00:00:40,700 --> 00:00:44,900 That's one method, and this actually works if you have a large dataset 13 00:00:44,900 --> 00:00:49,133 and, you know, if you have only 1% missing data, you know, removing 1% 14 00:00:49,300 --> 00:00:54,133 of the observations won't change the learning quality of your model much. 15 00:00:54,133 --> 00:00:58,066 So 1% is fine, but sometimes you can have a lot of missing data 16 00:00:58,066 --> 00:01:01,066 and therefore you must handle it the right way. 17 00:01:01,100 --> 00:01:02,800 So that was the first way: to ignore the observations, 18 00:01:02,800 --> 00:01:03,766 to remove them. 19 00:01:03,766 --> 00:01:04,900 And now a second way, 20 00:01:04,900 --> 00:01:06,900 and this is what we're adding right now to 21 00:01:06,900 --> 00:01:10,900 the toolkit, is to actually replace the missing data, 22 00:01:10,933 --> 00:01:16,066 you know, the missing value, with the average of all the values in the column 23 00:01:16,233 --> 00:01:18,133 in which the data is missing.
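To make the two options just described concrete, here is a minimal sketch using pandas. The table below is a hypothetical toy example with made-up values, not the course's data.csv, and the use of pandas `dropna` and `fillna` is an assumption for illustration rather than the method used in the video.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values): one salary is missing.
df = pd.DataFrame({
    "Age": [44.0, 27.0, 40.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
})

# Option 1: ignore the incomplete observation by deleting its row.
dropped = df.dropna()

# Option 2: replace the missing salary with the mean of the known salaries.
# Series.mean() skips NaN values by default.
mean_salary = df["Salary"].mean()
filled = df.fillna({"Salary": mean_salary})
```

Deleting the row loses the age information for that customer as well, while filling with the mean keeps every observation.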
24 00:01:18,133 --> 00:01:19,933 So here we have a missing salary. 25 00:01:19,933 --> 00:01:23,566 What we want to do is to replace this missing salary 26 00:01:23,666 --> 00:01:26,566 with the average of all these salaries. 27 00:01:26,566 --> 00:01:29,266 This is a classic way of handling missing data. 28 00:01:29,266 --> 00:01:31,533 And I'm going to teach it to you right away. 29 00:01:31,533 --> 00:01:34,000 So here we go. Taking care of missing data. 30 00:01:34,000 --> 00:01:38,933 Let's create a new code cell and let's replace that missing salary 31 00:01:39,066 --> 00:01:41,933 with the average of all the salaries here. 32 00:01:41,933 --> 00:01:42,600 All right. 33 00:01:42,600 --> 00:01:45,300 So to do this we're going to use a library. 34 00:01:45,300 --> 00:01:49,633 And actually I'm about to introduce you to one of the best data science libraries. 35 00:01:49,833 --> 00:01:52,033 I'm talking about scikit-learn. 36 00:01:52,033 --> 00:01:56,733 Scikit-learn is an amazing data science library containing a lot of tools, 37 00:01:56,966 --> 00:01:59,966 including a lot of data preprocessing tools. 38 00:01:59,966 --> 00:02:03,900 You will see that we will actually use scikit-learn a lot in this course. 39 00:02:04,166 --> 00:02:07,433 You know, more than half of the machine learning models we will build 40 00:02:07,433 --> 00:02:10,433 in this course will be built with scikit-learn. 41 00:02:10,600 --> 00:02:12,600 So if you don't know scikit-learn yet, 42 00:02:12,600 --> 00:02:15,600 I'm telling you, you're going to absolutely love it. 43 00:02:15,733 --> 00:02:19,166 And so for the first time here we're going to use scikit-learn to handle 44 00:02:19,233 --> 00:02:20,466 missing data. 45 00:02:20,466 --> 00:02:23,200 And to do this the class that we're going to use 46 00:02:23,200 --> 00:02:26,200 from scikit-learn is called SimpleImputer. 47 00:02:26,633 --> 00:02:30,966 We're actually going to first import that SimpleImputer class.
48 00:02:31,166 --> 00:02:36,266 Then we will create an instance, you know, an object of the SimpleImputer class. 49 00:02:36,566 --> 00:02:39,933 This object will allow us to exactly replace 50 00:02:39,933 --> 00:02:43,100 this missing salary here with the average of the salaries. 51 00:02:43,300 --> 00:02:46,200 And then we will have an updated dataset. 52 00:02:46,200 --> 00:02:48,833 You know, actually an updated matrix of features, 53 00:02:48,833 --> 00:02:52,266 because we will apply this imputer on the matrix of features only. 54 00:02:52,500 --> 00:02:55,500 So we will have a new matrix of features with no missing data 55 00:02:55,500 --> 00:02:59,366 because the missing salary will have been replaced by the average salary. 56 00:02:59,866 --> 00:03:01,366 All right, let's do this. 57 00:03:01,366 --> 00:03:01,866 Perfect. 58 00:03:01,866 --> 00:03:05,566 So first, since this class belongs to scikit-learn, well, 59 00:03:05,566 --> 00:03:11,800 we're going to start here with from, then scikit-learn, which has the name sklearn. 60 00:03:11,800 --> 00:03:12,966 So sklearn. 61 00:03:12,966 --> 00:03:16,366 Then remember, in order to access a module we have to add a dot. 62 00:03:16,666 --> 00:03:20,466 Because actually this SimpleImputer class which we want to import 63 00:03:20,766 --> 00:03:24,800 belongs to a certain module of scikit-learn called impute. 64 00:03:24,933 --> 00:03:26,633 This one, impute. 65 00:03:26,633 --> 00:03:30,000 And from this impute module, well, we're going to import, 66 00:03:30,000 --> 00:03:32,066 there we go, the 67 00:03:33,433 --> 00:03:34,400 SimpleImputer class. 68 00:03:34,400 --> 00:03:36,166 Google Colab really assists 69 00:03:36,166 --> 00:03:38,900 you with the SimpleImputer class. Perfect. 70 00:03:38,900 --> 00:03:40,133 Then next step. 71 00:03:40,133 --> 00:03:44,100 As I said, the next step is to create an instance of this class, 72 00:03:44,100 --> 00:03:48,300 which you can exactly see as the tool itself.
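Written out, the import dictated above would look like the following line. This is a sketch of what is typed in the notebook cell, reconstructed from the spoken description.

```python
# scikit-learn is imported under the package name "sklearn";
# the SimpleImputer class lives in its "impute" module.
from sklearn.impute import SimpleImputer
```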
73 00:03:48,300 --> 00:03:49,133 You know, the tool 74 00:03:49,133 --> 00:03:53,100 that you'll use to replace that missing salary with the average of the salaries. 75 00:03:53,433 --> 00:03:54,766 So since we're about to 76 00:03:54,766 --> 00:03:58,566 create a new object, well, we have to introduce here a new variable. 77 00:03:58,800 --> 00:04:02,400 And we're going to call this variable imputer, okay? 78 00:04:02,400 --> 00:04:06,433 imputer, which will be exactly this object of the SimpleImputer class. 79 00:04:06,866 --> 00:04:10,300 And therefore, since it will be the object of this SimpleImputer class, well, 80 00:04:10,600 --> 00:04:14,866 we naturally have to call this class, SimpleImputer. 81 00:04:15,200 --> 00:04:18,633 So I'm going to copy this and paste that here. 82 00:04:18,866 --> 00:04:20,933 That's how you create an object of the class. 83 00:04:20,933 --> 00:04:22,433 You simply call the class. 84 00:04:22,433 --> 00:04:25,000 Then you add some parentheses and there you go. 85 00:04:25,000 --> 00:04:28,466 Now you're going to enter the right arguments in order to replace 86 00:04:28,466 --> 00:04:32,200 indeed this missing salary with the average of the salaries, because note 87 00:04:32,466 --> 00:04:35,500 that there are actually many replacements that you could do. 88 00:04:35,500 --> 00:04:38,633 You could, instead of replacing it with the average salary, 89 00:04:38,633 --> 00:04:41,100 you could replace it with the median salary, 90 00:04:41,100 --> 00:04:43,566 you know, there is a difference between the average and the median. 91 00:04:43,566 --> 00:04:47,700 You could also replace a missing value with the most frequent value. 92 00:04:47,833 --> 00:04:48,166 Right. 93 00:04:48,166 --> 00:04:51,300 That would be, for example, relevant for categories, okay? 94 00:04:51,300 --> 00:04:52,266 So we have many options. 95 00:04:52,266 --> 00:04:55,466 But the most classic one and the one option that I recommend 96 00:04:55,600 --> 00:04:58,566 is the average salary.
The mean salary, okay? 97 00:04:58,566 --> 00:05:01,566 And so that's exactly what we have to enter here. 98 00:05:01,733 --> 00:05:06,600 First we have to specify which missing values we have to replace. 99 00:05:06,866 --> 00:05:09,300 And so that's why we have to enter here 100 00:05:09,300 --> 00:05:14,400 a first argument called missing_values, which has to be equal to np, 101 00:05:14,433 --> 00:05:17,100 you know, the NumPy library, dot nan. 102 00:05:17,100 --> 00:05:20,000 And that's just to say that we want to replace 103 00:05:20,000 --> 00:05:23,233 all the missing values in the dataset, like this one. 104 00:05:23,233 --> 00:05:24,900 This is like an empty value. 105 00:05:24,900 --> 00:05:27,366 This is what this means, an empty value. 106 00:05:27,366 --> 00:05:31,000 And then the second argument we have to input here is exactly 107 00:05:31,000 --> 00:05:34,266 the one saying that indeed the missing values here, 108 00:05:34,266 --> 00:05:37,766 you know, the empty values of the dataset, will be replaced by the mean. 109 00:05:37,766 --> 00:05:42,733 And to do this we have to add the next argument here, which is strategy. 110 00:05:43,300 --> 00:05:47,966 And this argument will be equal to, in quotes, mean, okay? 111 00:05:47,966 --> 00:05:52,400 And that's just to say that we want indeed to replace all the missing values 112 00:05:52,400 --> 00:05:55,766 in the matrix of features with the mean of the feature itself.
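Put together, the object described above could be created as in the sketch below. The small matrix `X` is a hypothetical stand-in for the matrix of features, with made-up values, and the `fit_transform` call is an assumption about how the imputer gets applied, since the transcript has not reached that step yet. A median variant is included only to illustrate the alternative strategy mentioned earlier.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical matrix of features (age, salary); values are made up.
X = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [40.0, np.nan],   # the missing salary
    [38.0, 61000.0],
])

# missing_values=np.nan says which entries count as missing;
# strategy="mean" replaces each of them with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Assumed application step: learn the column means, then fill the gaps.
X_filled = imputer.fit_transform(X)

# For comparison only: the same replacement using the column median.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_median = median_imputer.fit_transform(X)
```

Only the missing entry changes; the ages and the known salaries are left as they were. For a categorical column, `strategy="most_frequent"` would be the corresponding choice.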