1 00:00:01,750 --> 00:00:08,350 In this session, we are going to see how to handle missing values. Missing values are very much a reality 2 00:00:09,560 --> 00:00:10,460 when we develop 3 00:00:11,570 --> 00:00:18,530 machine learning solutions. You have to handle missing values, and it deserves a high level of focus, because 4 00:00:19,370 --> 00:00:25,100 sometimes the machine learning models may even throw errors if there are missing values in your data 5 00:00:25,100 --> 00:00:25,410 set. 6 00:00:25,640 --> 00:00:26,070 OK. 7 00:00:26,360 --> 00:00:30,810 And more fundamentally, missing values can lead to a biased model. 8 00:00:31,680 --> 00:00:35,290 It can make wrong classifications, depending on the extent of the missing values. 9 00:00:35,300 --> 00:00:38,060 So the question is: why? 10 00:00:38,120 --> 00:00:41,330 Why do we have missing values in the first place? 11 00:00:41,930 --> 00:00:48,920 It can be because values are lost when you're extracting data from a source system, 12 00:00:49,790 --> 00:00:54,380 or, at the time of data collection, some values might not be collected. 13 00:00:54,380 --> 00:00:55,750 It can also be due to an error. 14 00:00:55,760 --> 00:00:58,520 There can be various reasons. 15 00:00:58,850 --> 00:01:01,100 But it's a reality. 16 00:01:01,100 --> 00:01:02,680 It's something that you cannot avoid. 17 00:01:03,080 --> 00:01:03,400 Right. 18 00:01:03,440 --> 00:01:06,790 You will have some missing values. 19 00:01:07,790 --> 00:01:08,050 Right. 20 00:01:08,330 --> 00:01:14,330 So how do we handle it? First strategy: delete them. Second strategy: 21 00:01:15,760 --> 00:01:25,750 use the value of the mean or median if that particular missing value corresponds to continuous 22 00:01:25,750 --> 00:01:33,340 data. Likewise, if the missing value corresponds to categorical data, use the mode. You've already seen 23 00:01:33,340 --> 00:01:36,190 what mean, median, and mode are. 24 00:01:37,630 --> 00:01:39,730 Let's quickly see that and come back.
25 00:01:39,730 --> 00:01:40,130 Right. 26 00:01:40,550 --> 00:01:46,330 If you see this, this is something we saw in the previous session. Mode is the most frequently occurring 27 00:01:46,330 --> 00:01:52,090 number, and median is the midpoint after arranging the data in ascending order. 28 00:01:52,750 --> 00:01:54,670 Mean is the arithmetic average. 29 00:01:55,630 --> 00:01:55,880 Right. 30 00:01:56,240 --> 00:02:07,150 So if you have missing values corresponding to continuous data, you use either the mean or the median. 31 00:02:08,170 --> 00:02:15,610 If the missing data corresponds to a categorical type of data, you use the mode. 32 00:02:17,400 --> 00:02:19,580 That's all. Very simple, right? 33 00:02:20,150 --> 00:02:30,440 So in our data, in the insurance case study that we are looking at, we have 149 null values, and I am 34 00:02:30,440 --> 00:02:37,320 taking the null values according to different factors, that is, different variables. 35 00:02:38,480 --> 00:02:43,730 Now, why am I taking it that way? Because my strategy will be different for different types of variables. 36 00:02:44,090 --> 00:02:50,030 If the variable is continuous, I replace with the mean or median. If the variable is categorical, 37 00:02:50,030 --> 00:02:51,050 I replace with the mode, 38 00:02:51,320 --> 00:02:51,530 right, 39 00:02:51,650 --> 00:02:54,980 which is the highest-frequency value. 40 00:02:55,820 --> 00:02:56,140 Right. 41 00:02:56,840 --> 00:02:59,300 So now I am replacing. 42 00:02:59,300 --> 00:03:00,040 I have null values. 43 00:03:00,590 --> 00:03:09,380 I am replacing with the mean in the case of a continuous variable. In the case of a categorical variable, 44 00:03:09,380 --> 00:03:12,290 I am replacing with the mode, as you can see in the code here. 45 00:03:13,450 --> 00:03:19,630 And this takes care of the missing values, because you are replacing them with the appropriate values. 46 00:03:19,630 --> 00:03:24,100 So now your dataset is complete and you're ready to move forward, right.
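The replacement strategy described above can be sketched in a few lines of Python. This is only a minimal sketch using the standard library, not the code shown on screen in the session (the transcript does not include it). The sample rows, the column names `age`, `charges`, and `region`, and the helper `impute` are all hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical sample rows (not the course dataset); None marks a missing value.
rows = [
    {"age": 25, "charges": 100.0, "region": "north"},
    {"age": None, "charges": 300.0, "region": "south"},
    {"age": 35, "charges": None, "region": "south"},
    {"age": 45, "charges": 200.0, "region": None},
]

def impute(rows, column, strategy):
    """Return copies of the rows with missing values in `column` replaced.

    `strategy` is a function applied to the observed (non-missing) values:
    mean or median for a continuous variable, mode for a categorical one.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill = strategy(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

# Continuous variables: replace with the mean or the median.
rows = impute(rows, "age", mean)
rows = impute(rows, "charges", median)
# Categorical variable: replace with the mode (most frequent value).
rows = impute(rows, "region", mode)
```

The fill value is computed from the observed values only, so the missing entries themselves never influence the mean, median, or mode.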
47 00:03:24,600 --> 00:03:29,530 A very, very simple technique, but a very powerful technique in handling missing values. 48 00:03:29,980 --> 00:03:38,020 As I said, missing values must be addressed appropriately, either deleted or replaced with the mean, median, 49 00:03:38,020 --> 00:03:38,460 or mode. 50 00:03:39,160 --> 00:03:39,500 Right. 51 00:03:39,970 --> 00:03:43,630 So this completes this short session. 52 00:03:43,870 --> 00:03:52,300 OK, so we go on to the next session, and we'll be talking about [unclear], OK. 53 00:03:53,710 --> 00:04:01,990 Just one last point: after replacing, check once whether there are any missing values. 54 00:04:03,120 --> 00:04:03,420 Right. 55 00:04:04,030 --> 00:04:09,990 Please ensure there are none missing. Just do the check, right, to be absolutely sure that there are 56 00:04:09,990 --> 00:04:11,230 no missing ones. 57 00:04:11,430 --> 00:04:16,470 I ran the code again to see whether there are any missing values. As you can see in the case study that we 58 00:04:16,470 --> 00:04:20,370 have taken, there are now no missing values. 59 00:04:20,460 --> 00:04:23,460 And we're good to move ahead, right. 60 00:04:24,360 --> 00:04:24,720 OK.
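The final check mentioned above, confirming that no missing values remain after replacement, might look like the following. Again this is a hedged sketch in plain Python rather than the code from the session; `count_missing` and the sample rows are hypothetical names and data chosen for illustration.

```python
def count_missing(rows):
    """Count missing (None) entries per column across all rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

# Hypothetical rows after replacement; None would mark a missing value.
imputed = [
    {"age": 25, "region": "north"},
    {"age": 35, "region": "south"},
]

missing = count_missing(imputed)
# Stop here if any column still has missing values.
assert all(n == 0 for n in missing.values()), f"still missing: {missing}"
```

Running the same count before and after replacement gives a quick per-variable view of how many values were filled in.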