1 00:00:00,366 --> 00:00:01,933 Dummy variable trap. 2 00:00:01,933 --> 00:00:04,066 Last time we learned how to create dummy variables 3 00:00:04,066 --> 00:00:09,133 to replace our categorical predictor, state, in the model. We also discussed 4 00:00:09,133 --> 00:00:12,133 that you can never include both dummy variables at the same time. 5 00:00:12,466 --> 00:00:15,466 In our example, we omitted the California dummy. 6 00:00:15,733 --> 00:00:17,166 Now why is that? 7 00:00:17,166 --> 00:00:21,366 What will happen if we include the second dummy variable in the model as well? 8 00:00:22,000 --> 00:00:23,133 Let's see. 9 00:00:23,133 --> 00:00:26,533 The intuition here is that you're basically duplicating a variable. 10 00:00:26,733 --> 00:00:31,166 This is because D2 always equals one minus D1. 11 00:00:32,000 --> 00:00:35,633 The phenomenon where one or several independent variables 12 00:00:35,633 --> 00:00:39,166 in a linear regression predict another is called multicollinearity. 13 00:00:39,700 --> 00:00:42,700 As a result of this effect, the model cannot distinguish between 14 00:00:42,866 --> 00:00:48,300 the effects of D1 and the effects of D2, and therefore it won't work properly. 15 00:00:48,766 --> 00:00:51,533 And this is called the dummy variable trap. 16 00:00:51,533 --> 00:00:54,533 If you do the math behind this scenario, you will see that 17 00:00:54,633 --> 00:00:58,200 the real problem is that you cannot have these three elements 18 00:00:58,200 --> 00:01:02,366 in your model at the same time: the constant and both dummy variables. 19 00:01:02,800 --> 00:01:05,666 I will leave it up to you to figure this one out on your own, 20 00:01:05,666 --> 00:01:07,100 and let me know if you have any trouble. 21 00:01:07,100 --> 00:01:09,000 I'll definitely help you out. 22 00:01:09,000 --> 00:01:13,700 So to sum up, whenever you're building a model, always omit one dummy variable. 
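If you would like to check the math yourself, here is a minimal sketch of the trap in Python with NumPy. The six data points are made up purely for illustration, and the names `d1`, `d2`, `X_trap` and `X_ok` are my own assumptions, not something from the course files. It only shows that the constant, D1 and D2 together are linearly dependent, which is why the model cannot separate their effects.

```python
import numpy as np

# Hypothetical data: d1 is the dummy we keep, d2 is the California dummy.
d1 = np.array([1, 0, 1, 1, 0, 0], dtype=float)
d2 = 1.0 - d1                      # D2 always equals one minus D1
const = np.ones_like(d1)           # the constant (intercept) column

# Trap: constant plus both dummies. The constant equals d1 + d2,
# so one of the three columns carries no new information.
X_trap = np.column_stack([const, d1, d2])

# Fix: constant plus only one dummy from the set.
X_ok = np.column_stack([const, d1])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns in trap design:", X_trap.shape[1], "rank:", rank_trap)
print("columns in fixed design:", X_ok.shape[1], "rank:", rank_ok)
```

The design with all three columns has fewer independent columns than coefficients to estimate, so the least squares solution is not unique. Dropping one dummy brings the number of columns back in line with the rank.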
23 00:01:14,133 --> 00:01:17,233 And this applies irrespective of the number of dummy variables 24 00:01:17,533 --> 00:01:20,233 there are in that specific dummy set. 25 00:01:20,233 --> 00:01:22,533 If you have nine, then you should only include eight. 26 00:01:22,533 --> 00:01:23,933 If you have 100, 27 00:01:23,933 --> 00:01:26,366 then you should only include 99 of them. 28 00:01:26,366 --> 00:01:29,033 Also note that if you have two sets of dummy variables, 29 00:01:29,033 --> 00:01:32,000 then you need to apply the same rule to each set. 30 00:01:32,000 --> 00:01:34,900 For instance, we could have had a column which specifies 31 00:01:34,900 --> 00:01:37,900 the industry in which the companies operate. 32 00:01:38,000 --> 00:01:38,800 To build the model 33 00:01:38,800 --> 00:01:42,466 in that case, we would have had to perform exactly the same steps and create 34 00:01:42,466 --> 00:01:45,733 another set of dummy variables specifically for that column. 35 00:01:46,200 --> 00:01:51,200 And then we would include all but one of those dummy variables in our actual model. 36 00:01:51,633 --> 00:01:54,433 I hope this explanation was useful, and you will never fall 37 00:01:54,433 --> 00:01:57,133 victim to the dummy variable trap in your modeling. 38 00:01:57,133 --> 00:02:00,366 Next time we're going to cover the different ways you can build a model. 39 00:02:00,600 --> 00:02:02,700 We will learn all about backward elimination, 40 00:02:02,700 --> 00:02:06,300 forward selection, stepwise regression, and much more. 41 00:02:06,633 --> 00:02:09,600 It's going to be an exciting tutorial and I look forward to seeing you then.
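Coming back to the rule for several dummy sets, here is a rough sketch in plain Python of how "all but one per set" could look. The helper `make_dummies`, the column prefixes and the four example companies are all hypothetical and only there for illustration. Which category gets omitted is arbitrary; this sketch simply drops the first one in alphabetical order.

```python
def make_dummies(values, prefix):
    """Build 0/1 columns for every category except one omitted baseline."""
    categories = sorted(set(values))
    kept = categories[1:]  # omit the first category in sorted order
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in kept
    }

# Hypothetical data: two categorical columns for four companies.
states = ["New York", "California", "California", "New York"]
industries = ["Tech", "Retail", "Finance", "Retail"]

# Apply the same rule to each set separately.
columns = {}
columns.update(make_dummies(states, "state"))        # 2 categories -> 1 dummy
columns.update(make_dummies(industries, "industry")) # 3 categories -> 2 dummies

for name, col in columns.items():
    print(name, col)
```

With two states and three industries, the model ends up with one state dummy and two industry dummies, alongside the constant.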