1 00:00:00,100 --> 00:00:02,100 Hello and welcome to this tutorial. 2 00:00:02,100 --> 00:00:05,233 Now we are finally going to start preparing the data 3 00:00:05,233 --> 00:00:08,233 so that our machine learning models run correctly. 4 00:00:08,400 --> 00:00:11,200 And the first problem that we have to deal with 5 00:00:11,200 --> 00:00:14,800 is the case where you have some missing data in your data set. 6 00:00:15,433 --> 00:00:17,666 And that happens quite a lot, actually, in real life. 7 00:00:17,666 --> 00:00:21,366 So you need to know the trick to handle this problem 8 00:00:21,366 --> 00:00:24,833 and make it all good for your machine learning model to run correctly. 9 00:00:25,266 --> 00:00:27,300 So let's have a look at the data set. 10 00:00:27,300 --> 00:00:30,300 Here I'm going to my Google Sheets tab. 11 00:00:30,800 --> 00:00:33,066 Okay, so here is the data set. 12 00:00:33,066 --> 00:00:36,066 And as you can see, there are two missing values. 13 00:00:36,066 --> 00:00:39,233 There is one missing value in the Age column here for Spain 14 00:00:39,633 --> 00:00:43,166 and one missing value in the Salary column for Germany. 15 00:00:43,566 --> 00:00:45,833 So how can we handle this problem? 16 00:00:45,833 --> 00:00:48,833 Well, the first idea is to remove 17 00:00:49,066 --> 00:00:52,433 the lines, the observations, where there is some missing data. 18 00:00:52,466 --> 00:00:57,266 So what we could do is remove this line and remove this line. 19 00:00:57,633 --> 00:01:00,566 But that can be quite dangerous, because imagine 20 00:01:00,566 --> 00:01:03,566 this data set contains crucial information. 21 00:01:03,866 --> 00:01:06,866 It would be quite dangerous to remove an observation. 22 00:01:06,966 --> 00:01:10,933 So we need to figure out a better idea to handle this problem. 23 00:01:11,533 --> 00:01:12,366 And another idea. 
24 00:01:12,366 --> 00:01:15,500 And that's actually the most common idea to handle missing 25 00:01:15,500 --> 00:01:19,100 data: take the mean of the column. 26 00:01:19,466 --> 00:01:22,866 So here we are going to replace this missing value here 27 00:01:23,100 --> 00:01:26,100 by the mean of all the values in the Age column. 28 00:01:26,466 --> 00:01:29,500 And that's the same for every feature that contains missing data. 29 00:01:29,866 --> 00:01:31,433 We replace this missing data 30 00:01:31,433 --> 00:01:35,100 by the mean of the values in the column that contains this missing data. 31 00:01:35,666 --> 00:01:37,500 Okay. So let's do this. 32 00:01:37,500 --> 00:01:40,433 First we need to take our data set. 33 00:01:40,433 --> 00:01:41,766 So dataset. 34 00:01:41,766 --> 00:01:45,100 Let's start by taking care of the missing value in the Age column. 35 00:01:45,500 --> 00:01:47,766 So here we will need to take the Age column. 36 00:01:47,766 --> 00:01:51,133 And to do this in R we need to add a dollar sign here. 37 00:01:51,566 --> 00:01:53,466 And here we choose Age. 38 00:01:53,466 --> 00:01:59,000 So by typing dataset$Age, we're taking the column Age of the data set. 39 00:01:59,700 --> 00:02:01,366 Then we're going to add equals. 40 00:02:01,366 --> 00:02:03,866 And then we're going to add an ifelse. 41 00:02:03,866 --> 00:02:07,866 So I'm going to type ifelse here, then parentheses. 42 00:02:09,000 --> 00:02:11,566 In the ifelse function you have to input three parameters. 43 00:02:11,566 --> 00:02:13,600 The first parameter is your condition. 44 00:02:13,600 --> 00:02:18,100 The second parameter is the value you want to input 45 00:02:18,100 --> 00:02:19,566 if the condition is true, 46 00:02:19,566 --> 00:02:22,600 and the third parameter is the value you want to input if the condition is false. 47 00:02:23,100 --> 00:02:25,033 So let's first input the condition. 48 00:02:25,033 --> 00:02:29,100 The condition is going to be is.na, parentheses. 
49 00:02:29,100 --> 00:02:32,100 dataset$Age. 50 00:02:32,933 --> 00:02:34,800 And that's it. 51 00:02:34,800 --> 00:02:37,033 is.na is a function that tells 52 00:02:37,033 --> 00:02:40,033 whether the value passed to the function is missing or not. 53 00:02:40,133 --> 00:02:43,200 So by putting is.na(dataset$Age), 54 00:02:43,333 --> 00:02:47,766 we are checking, for each value in the column Age, whether it is missing. 55 00:02:48,066 --> 00:02:52,800 So this will return TRUE if the value in the column Age is missing 56 00:02:53,066 --> 00:02:56,200 and FALSE if the value in the column Age is not missing. 57 00:02:56,566 --> 00:02:58,300 Okay, so that's the condition. 58 00:02:58,300 --> 00:03:02,366 And now the second parameter is the value that is going to be returned 59 00:03:02,400 --> 00:03:04,866 if the condition above is true. 60 00:03:04,866 --> 00:03:08,233 And of course if the condition is true, that means that there is a missing value. 61 00:03:08,400 --> 00:03:10,966 And that means that we have to replace it with the average. 62 00:03:10,966 --> 00:03:12,300 So here we're going to input 63 00:03:12,300 --> 00:03:15,633 as the second parameter the average of the column Age. 64 00:03:16,133 --> 00:03:16,466 Okay. 65 00:03:16,466 --> 00:03:19,033 And to compute the average in R there is a simple way. 66 00:03:19,033 --> 00:03:22,366 We can type ave, then dataset, 67 00:03:23,433 --> 00:03:24,133 dollar sign, 68 00:03:24,133 --> 00:03:27,133 Age, because we want to take the column Age. 69 00:03:27,500 --> 00:03:29,800 Then comma. 70 00:03:29,800 --> 00:03:31,566 And then here we're going to add a function. 71 00:03:31,566 --> 00:03:35,066 So we're going to type FUN in capitals, FUN equals, 72 00:03:35,400 --> 00:03:38,300 then function: function(x). 73 00:03:38,300 --> 00:03:39,966 This is still part of the R syntax. 74 00:03:39,966 --> 00:03:43,166 We're just making a function here, which is going to be the mean function. 
75 00:03:43,533 --> 00:03:46,533 And then we have to specify what this function will be. 76 00:03:46,600 --> 00:03:51,000 And so this function is of course the mean, which is an existing function in R. 77 00:03:51,600 --> 00:03:55,000 So here, parenthesis, x, comma. 78 00:03:55,000 --> 00:03:58,400 And here we're just going to add na.rm 79 00:03:58,633 --> 00:04:01,633 equals TRUE. 80 00:04:02,466 --> 00:04:04,000 And that means that we ask 81 00:04:04,000 --> 00:04:07,433 R to ignore the missing values 82 00:04:07,533 --> 00:04:11,333 when R goes through the whole column Age to compute the mean of the values. 83 00:04:11,600 --> 00:04:15,433 And then we still need to close the parenthesis here again, then comma. 84 00:04:15,466 --> 00:04:17,633 So that's it for the second parameter. 85 00:04:17,633 --> 00:04:20,366 And now we need to add the third parameter. 86 00:04:20,366 --> 00:04:22,300 And in your opinion, what is it going to be? 87 00:04:23,366 --> 00:04:26,300 Well, this third parameter is the value you want to return 88 00:04:26,300 --> 00:04:28,833 if the condition is not true. 89 00:04:28,833 --> 00:04:31,933 That means that the value in the column Age is not missing. 90 00:04:31,966 --> 00:04:34,333 So that means that the value exists. 91 00:04:34,333 --> 00:04:37,633 So here we are simply going to put dataset 92 00:04:39,533 --> 00:04:41,466 $Age. Okay. 93 00:04:41,466 --> 00:04:45,333 And that's done. By typing this, we replace the missing value 94 00:04:45,333 --> 00:04:48,466 in the column Age by the mean of the column Age itself. 95 00:04:48,900 --> 00:04:51,900 So let's select this and let's see what happens. 96 00:04:52,100 --> 00:04:54,533 So Command or Control plus Enter to execute. 97 00:04:54,533 --> 00:04:57,533 And here it is, executed properly. 98 00:04:57,600 --> 00:05:00,600 And now let's look at our data set by clicking on this tab here. 99 00:05:01,266 --> 00:05:02,233 And good, perfect. 
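The Age step dictated in the cues above can be written out as a short R sketch. The small data frame is hypothetical and only stands in for the tutorial's data set; the `ifelse`, `is.na`, `ave`, and `mean(..., na.rm = TRUE)` calls follow what is described in the narration.

```r
# Hypothetical toy data; the values are illustrative, not the tutorial's data set.
dataset <- data.frame(
  Country = c("France", "Spain", "Germany", "Spain"),
  Age     = c(44, 27, NA, 38),
  Salary  = c(72000, 48000, 54000, NA)
)

# ifelse(condition, value_if_true, value_if_false), applied element by element:
#   condition      : is.na(dataset$Age) is TRUE where Age is missing
#   value_if_true  : the column mean, computed while ignoring NA (na.rm = TRUE)
#   value_if_false : the existing Age value
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)

print(dataset$Age)
```

With no grouping variable, `ave` returns a vector as long as the column in which every element is the mean, which is why it lines up element by element with the condition inside `ifelse`.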
100 00:05:02,233 --> 00:05:05,200 The missing value that was here in the Age column was replaced 101 00:05:05,200 --> 00:05:07,933 by the mean of the column. Great. 102 00:05:07,933 --> 00:05:11,066 So now let's do the same for the salary. 103 00:05:11,633 --> 00:05:14,766 We're just going to copy this, paste, 104 00:05:15,300 --> 00:05:19,233 and we're just going to replace Age by Salary here. 105 00:05:20,000 --> 00:05:20,800 Here as well. 106 00:05:21,900 --> 00:05:23,700 Here as well. 107 00:05:23,700 --> 00:05:26,700 And also here. 108 00:05:26,800 --> 00:05:27,633 Great. 109 00:05:27,633 --> 00:05:29,933 And now we have to be careful with something. 110 00:05:29,933 --> 00:05:31,366 It has to be aligned here. 111 00:05:31,366 --> 00:05:33,633 So we just have to do this. 112 00:05:33,633 --> 00:05:35,633 And same for here. 113 00:05:35,633 --> 00:05:37,633 This, okay. And now we're fine. 114 00:05:37,633 --> 00:05:40,566 Now we're ready to select this code section here. 115 00:05:40,566 --> 00:05:42,933 Press Command or Control plus Enter to execute. 116 00:05:42,933 --> 00:05:43,566 Here it is. 117 00:05:43,566 --> 00:05:46,233 Let's check our data set, and perfect. 118 00:05:46,233 --> 00:05:49,266 The missing value in the salary that was here was replaced 119 00:05:49,266 --> 00:05:54,900 by the mean of the Salary column, $63,777.
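The copy-and-paste step for Salary repeats the same expression with a different column name. As a possible alternative, the sketch below wraps the expression in a small helper and loops over both columns. The helper name `impute_mean` and the toy values are assumptions made for illustration; the tutorial itself simply duplicates the `ifelse` line.

```r
# Hypothetical toy data for illustration only.
dataset <- data.frame(
  Age    = c(44, 27, NA, 38),
  Salary = c(72000, 48000, 54000, NA)
)

# impute_mean is a hypothetical helper, not part of the tutorial:
# it replaces each NA in x by the mean of the non-missing values of x.
impute_mean <- function(x) {
  ifelse(is.na(x),
         ave(x, FUN = function(v) mean(v, na.rm = TRUE)),
         x)
}

# Apply the same imputation to every numeric column that has gaps.
for (col in c("Age", "Salary")) {
  dataset[[col]] <- impute_mean(dataset[[col]])
}

print(dataset)
```

Writing the expression once avoids the risk of forgetting to rename one of the four occurrences of the column name when pasting.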