1 00:00:01,780 --> 00:00:08,590 Now that we have looked at univariate analysis and bivariate analysis, it is time to sit back and 2 00:00:08,590 --> 00:00:10,990 again think about the variables that we have. 3 00:00:12,910 --> 00:00:16,630 Remember that we are trying to keep only relevant variables in the analysis. 4 00:00:17,970 --> 00:00:21,760 So for a variable like bus terminal, which has only one value, 5 00:00:21,790 --> 00:00:22,330 YES, 6 00:00:23,160 --> 00:00:24,330 do we need it? 7 00:00:26,110 --> 00:00:31,150 It is pretty indicative that this variable is not going to add any information to our model. 8 00:00:33,420 --> 00:00:34,250 It may be relevant. 9 00:00:34,980 --> 00:00:39,410 Maybe if a city does not have a bus terminal, the prices will be lower there. 10 00:00:40,380 --> 00:00:45,870 But we cannot see that by looking at the sample, because the sample has no observation 11 00:00:46,110 --> 00:00:47,730 where the value of bus terminal is NO. 12 00:00:48,660 --> 00:00:55,980 So since this variable is not useful, we will remove it from our dataset. 13 00:00:57,150 --> 00:01:04,230 Think about all these points again, these four points, to decide if the variables we have are relevant 14 00:01:04,230 --> 00:01:04,620 or not. 15 00:01:06,030 --> 00:01:10,530 First is variables with a single unique value, such as bus terminal. 16 00:01:11,340 --> 00:01:12,170 Those should be removed. 17 00:01:14,000 --> 00:01:17,990 Even for non-categorical variables, if they have only one value throughout, 18 00:01:18,920 --> 00:01:20,480 that is basically not a variable. 19 00:01:20,870 --> 00:01:22,700 That is basically behaving as a constant. 20 00:01:23,030 --> 00:01:25,730 We don't need such variables and we will be deleting them. 21 00:01:27,580 --> 00:01:31,450 Variables with a low fill rate are the second point. 22 00:01:31,660 --> 00:01:32,410 We saw that 23 00:01:33,280 --> 00:01:36,490 the hospital beds variable had eight values missing. 
24 00:01:36,850 --> 00:01:41,390 So we decided that we will replace these values by the mean of the other values. 25 00:01:42,820 --> 00:01:48,640 But suppose we had only 50 values and the remaining 450 values were empty. 26 00:01:49,890 --> 00:01:52,960 Does it make sense to impute values in that case? 27 00:01:54,530 --> 00:01:59,820 In such a case, we will not be able to capture the actual effect of that variable on the output. 28 00:02:01,320 --> 00:02:08,910 Even if you keep the variable, impute the mean value in the missing places, and run the analysis, nearly 29 00:02:09,000 --> 00:02:14,100 always there will not be any significant relationship between such a variable and the output. 30 00:02:16,040 --> 00:02:18,830 So we have the option of deleting such a variable too. 31 00:02:20,290 --> 00:02:20,830 Thirdly, 32 00:02:21,950 --> 00:02:25,640 remember that businesses are working within a regulatory framework, 33 00:02:26,870 --> 00:02:30,560 and the regulatory framework may not allow usage of certain variables. 34 00:02:31,400 --> 00:02:34,220 For example, suppose you decide to build a model 35 00:02:35,200 --> 00:02:37,720 to identify the creditworthiness of a customer 36 00:02:38,130 --> 00:02:39,760 based on the profile of the customer, 37 00:02:40,600 --> 00:02:43,140 and your model says that a person's gender 38 00:02:44,230 --> 00:02:46,540 or religion is a significant variable. 39 00:02:47,740 --> 00:02:50,680 Now, if you base your decisions on this model, 40 00:02:51,850 --> 00:02:56,350 you will be treating people of different genders or different religions differently. 41 00:02:57,340 --> 00:03:00,760 This situation can be considered as one of discrimination. 42 00:03:01,300 --> 00:03:06,610 And if you cannot base your decision on a particular variable, that is, if you will not accept the 43 00:03:06,610 --> 00:03:09,070 result of your model based on a particular variable, 44 00:03:09,550 --> 00:03:11,260 there is no point keeping that variable. 
45 00:03:13,110 --> 00:03:17,160 So keep this in mind when selecting sensitive variables for analysis. 46 00:03:18,980 --> 00:03:23,300 The last point is, again, reiterating the importance of business knowledge. 47 00:03:24,260 --> 00:03:27,260 Don't just take a variable because you have data available for it. 48 00:03:28,160 --> 00:03:33,590 I mean, yes, we can do exploratory analysis where, without business knowledge, we try to identify 49 00:03:33,590 --> 00:03:34,160 a pattern. 50 00:03:34,820 --> 00:03:40,430 But when we are establishing a cause-and-effect relationship, try to keep only variables that make 51 00:03:40,430 --> 00:03:41,310 logical sense. 52 00:03:42,660 --> 00:03:48,180 To complement your understanding of whether the relationship is there or not, you should use the 53 00:03:48,180 --> 00:03:49,170 bivariate analysis. 54 00:03:50,820 --> 00:03:56,810 If you think house prices will increase with better air quality, that should be reflected in the 55 00:03:56,820 --> 00:03:57,570 scatterplot. 56 00:03:58,400 --> 00:04:00,180 We saw the scatterplot of crime rate. 57 00:04:00,510 --> 00:04:05,990 It was non-linear, but still it had a recognizable pattern indicating some type of relationship. 58 00:04:07,310 --> 00:04:13,370 But when we plotted rainfall, it was completely uniformly distributed regardless of price. 59 00:04:14,640 --> 00:04:20,550 This is telling us that rainfall may not be having a significant impact on the output variable. 60 00:04:21,610 --> 00:04:24,640 So we have an option to delete this rainfall variable 61 00:04:24,710 --> 00:04:25,150 also. 62 00:04:26,540 --> 00:04:34,610 As you have seen, identifying, adding and removing variables is an iterative process, and we'll be 63 00:04:34,610 --> 00:04:37,280 doing it even post regression analysis. 64 00:04:38,880 --> 00:04:43,920 We will remove the variables we have identified using these software tools from our dataset.
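
The first two checks described above (a single unique value, and a low fill rate) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the software tool used in the course: the toy table, its column names, the helper functions `constant_columns` and `low_fill_columns`, and the 0.5 fill-rate threshold are all assumptions made for the example. The regulatory and business-knowledge checks are judgment calls and are not automated here.

```python
# Hypothetical toy table: column name -> list of values; None marks a missing value.
data = {
    "price": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
    "bus_terminal": ["YES", "YES", "YES", "YES", "YES", "YES"],
    "hospital_beds": [5.5, None, 7.3, 6.1, 8.0, 9.3],
    "mostly_empty": [None, 1.2, None, None, 3.4, None],
}


def constant_columns(table):
    """Return columns whose non-missing values take at most one distinct value."""
    return [
        name
        for name, values in table.items()
        if len({v for v in values if v is not None}) <= 1
    ]


def low_fill_columns(table, min_fill_rate=0.5):
    """Return columns whose share of non-missing values is below min_fill_rate.

    The 0.5 default is an assumed cut-off for illustration only.
    Assumes every column has at least one row.
    """
    flagged = []
    for name, values in table.items():
        fill_rate = sum(v is not None for v in values) / len(values)
        if fill_rate < min_fill_rate:
            flagged.append(name)
    return flagged


# Drop every column flagged by either check.
to_drop = set(constant_columns(data)) | set(low_fill_columns(data))
cleaned = {name: values for name, values in data.items() if name not in to_drop}

print(sorted(to_drop))
print(list(cleaned))
```

In this toy table, `bus_terminal` is flagged because it behaves as a constant, and `mostly_empty` is flagged because only two of its six values are filled. `hospital_beds` has a single gap, so it stays and remains a candidate for mean imputation.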