1 00:00:01,266 --> 00:00:03,766 Today we're talking about dummy variables. 2 00:00:03,766 --> 00:00:06,366 So basically the information that we have is the profit 3 00:00:06,366 --> 00:00:09,566 of each company, or of each startup. 4 00:00:09,566 --> 00:00:13,500 Then the R&D spend, the admin spend, the marketing spend. 5 00:00:13,500 --> 00:00:16,400 So those three are expenses that the company incurred, 6 00:00:16,400 --> 00:00:20,166 and then the state in which it operates, either New York or California. 7 00:00:20,700 --> 00:00:24,333 The challenge that we're faced with is that the venture capitalist 8 00:00:24,333 --> 00:00:28,200 fund wants to see if there are any correlations between profit 9 00:00:28,400 --> 00:00:31,200 and the amounts that have been spent on 10 00:00:31,200 --> 00:00:35,466 different expenses, R&D, admin and marketing, and also 11 00:00:35,833 --> 00:00:39,000 the state in which the company operates. 12 00:00:39,033 --> 00:00:42,833 So is there a correlation between profit and all of these variables? 13 00:00:42,833 --> 00:00:47,933 And how would you go about creating a model that uses 14 00:00:48,100 --> 00:00:52,433 R&D spend, admin, marketing and state to predict profit? 15 00:00:52,800 --> 00:00:56,366 And so therefore profit is our dependent variable 16 00:00:56,566 --> 00:00:59,566 and the rest, the blue ones, are all independent variables. 17 00:00:59,700 --> 00:01:03,400 And what we need to do is build a linear regression. 18 00:01:03,700 --> 00:01:05,366 So let's go ahead and get started 19 00:01:06,433 --> 00:01:08,700 as we would with multiple linear regression. 20 00:01:08,700 --> 00:01:13,666 We would start by saying y, which is profit, is equal to what? 21 00:01:13,700 --> 00:01:14,466 What is it equal to? 22 00:01:14,466 --> 00:01:16,866 Well, first of all, there's a constant. 23 00:01:16,866 --> 00:01:18,200 In this case it's b0.
24 00:01:18,200 --> 00:01:22,200 And I've put it under the profit column just because I needed to fit it somewhere. 25 00:01:22,566 --> 00:01:26,400 And then we would start adding these variables into our equation. 26 00:01:26,400 --> 00:01:30,166 So then we've got the b1 coefficient times the x1 variable, 27 00:01:30,166 --> 00:01:32,066 which is R&D spend in this case. 28 00:01:32,066 --> 00:01:37,033 And x1 is actually the amount, the dollar amount that you see in the R&D column. 29 00:01:38,000 --> 00:01:40,300 Then you've got the admin 30 00:01:40,300 --> 00:01:43,200 variable, which is x2. 31 00:01:43,200 --> 00:01:45,266 It's got a coefficient of b2. Once again, 32 00:01:45,266 --> 00:01:48,900 in this case x2 is going to be the dollar amount you see in the admin column. 33 00:01:49,166 --> 00:01:50,700 Then you've got the marketing spend, 34 00:01:50,700 --> 00:01:54,566 x3, which will be the dollar amount you see in the marketing column. 35 00:01:54,866 --> 00:01:57,600 And then you've got the state variable. 36 00:01:57,600 --> 00:02:00,600 And when we get here, the question is 37 00:02:00,600 --> 00:02:04,900 what we should place in our equation for the state column. 38 00:02:04,900 --> 00:02:08,633 Because we don't actually have a number, we don't have a dollar value 39 00:02:08,633 --> 00:02:12,300 or any other type of number to add into our equation. 40 00:02:12,300 --> 00:02:14,666 We can't just add a word into our equation. 41 00:02:14,666 --> 00:02:19,866 And the thing here is that the state is actually a categorical variable. 42 00:02:19,866 --> 00:02:21,800 So we talked about types of variables before. 43 00:02:21,800 --> 00:02:23,633 And we understood that 44 00:02:23,633 --> 00:02:26,766 there are categorical variables and there are numeric variables. 45 00:02:27,000 --> 00:02:29,766 Well, in this case state is a categorical variable. 46 00:02:29,766 --> 00:02:32,400 And therefore we can't add it to our equation directly.
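Written out, the equation described so far might look like the following. This is a reconstruction of what is spoken, not a copy of the slide; the name b_3 for the marketing coefficient is an assumption based on the pattern of b_1 and b_2, since it is not said aloud, and the question mark stands for the state term that has no numeric value yet.

```latex
% y = profit, x_1 = R&D spend, x_2 = admin spend, x_3 = marketing spend
% b_3 is an assumed name; "?" marks the unresolved state term
y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \;?
```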
47 00:02:32,400 --> 00:02:35,400 We need to do something about this situation, 48 00:02:35,400 --> 00:02:39,633 and the approach that you need to take when you face categorical variables 49 00:02:39,633 --> 00:02:42,900 in regression models is to create dummy variables. 50 00:02:43,100 --> 00:02:45,533 Let's see how we can do that. 51 00:02:45,533 --> 00:02:49,466 First you need to go through your column and find all the different categories 52 00:02:49,466 --> 00:02:52,300 you have. So in this case we have two categories. 53 00:02:52,300 --> 00:02:55,766 So for every single category that you found you need to create a new column. 54 00:02:56,066 --> 00:02:56,700 For New York, 55 00:02:56,700 --> 00:02:58,833 we're going to create a column called New York. For California, 56 00:02:58,833 --> 00:03:00,333 we're going to create a column called California. 57 00:03:00,333 --> 00:03:05,533 So we're kind of expanding our dataset and adding some additional columns into it. 58 00:03:06,033 --> 00:03:07,366 And how do we populate the columns? 59 00:03:07,366 --> 00:03:08,666 So this is the fun part. 60 00:03:08,666 --> 00:03:11,533 To populate these columns, let's start with the New York column. 61 00:03:11,533 --> 00:03:15,266 You need to find all of your rows where state actually says New York, 62 00:03:15,266 --> 00:03:16,466 and 63 00:03:16,466 --> 00:03:21,100 for those rows, you need to put a one in the New York column. And then, 64 00:03:21,100 --> 00:03:25,000 for all the rows that say California, basically for all the rows 65 00:03:25,000 --> 00:03:28,333 that don't say New York, whatever else they say, you just put a zero. 66 00:03:28,900 --> 00:03:31,700 And then for California, for the column California, 67 00:03:31,700 --> 00:03:34,866 you do the same thing: wherever a row says California 68 00:03:34,866 --> 00:03:38,133 in the state column, you place a one in the California column.
69 00:03:38,533 --> 00:03:41,400 And for any other values in the state 70 00:03:41,400 --> 00:03:44,400 column, you place a zero in the California column. 71 00:03:44,700 --> 00:03:47,133 And so you end up with a dataset like this. 72 00:03:47,133 --> 00:03:50,900 And these two new columns are called dummy variables. 73 00:03:51,900 --> 00:03:54,900 And building your regression model from here is very simple. 74 00:03:54,900 --> 00:03:57,966 All you have to do is use the New York column, 75 00:03:58,300 --> 00:04:00,700 and you're going to be using it instead of state. 76 00:04:00,700 --> 00:04:02,766 You're not going to be using state anymore. 77 00:04:02,766 --> 00:04:07,933 And basically you add a term, which is b4 times D1, 78 00:04:08,100 --> 00:04:11,400 and D1 in this case is your dummy variable for New York. 79 00:04:11,933 --> 00:04:15,100 And you don't use the California column either. 80 00:04:16,033 --> 00:04:21,366 So as you can see here, all of the information in our data is preserved 81 00:04:21,600 --> 00:04:25,500 if we just stick to the one New York column, because you can tell right away: 82 00:04:25,500 --> 00:04:29,333 if D1 is a one, then it's a company that operates in New York. 83 00:04:29,533 --> 00:04:32,300 If D1 is a zero, it's a company that operates in California. 84 00:04:32,300 --> 00:04:37,000 So we didn't lose any information by including only the New York column. 85 00:04:37,200 --> 00:04:41,166 And we will actually talk more about why you should never include 86 00:04:41,166 --> 00:04:45,033 all of your dummy variable columns in your regression model. 87 00:04:45,033 --> 00:04:48,466 We'll talk more about that in the next tutorial, when we're talking 88 00:04:48,466 --> 00:04:51,466 about the dummy variable trap. 89 00:04:51,866 --> 00:04:54,200 But for now I would like to discuss two things.
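The column-building steps described above can be sketched in plain Python. This is only an illustration under assumptions: the helper name `add_dummy_columns`, the row layout, and the profit figures are made up for the example and do not come from the tutorial's dataset.

```python
def add_dummy_columns(rows, column):
    """Return copies of `rows` with one 0/1 column per category in `column`."""
    # Step 1: find all the different categories in the column.
    categories = sorted({row[column] for row in rows})
    expanded = []
    for row in rows:
        new_row = dict(row)
        # Step 2: put a 1 where the row matches the category, 0 otherwise.
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        expanded.append(new_row)
    return expanded, categories


# Hypothetical rows; the profit values are invented for illustration.
startups = [
    {"profit": 100.0, "state": "New York"},
    {"profit": 90.0, "state": "California"},
    {"profit": 95.0, "state": "New York"},
]

expanded, categories = add_dummy_columns(startups, "state")

# Only the New York column is kept as D1; the California column is dropped.
d1 = [row["New York"] for row in expanded]
print(categories)  # ['California', 'New York']
print(d1)          # [1, 0, 1]
```

Keeping only `d1` loses nothing here, because with two categories a 0 in the New York column can only mean California.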
90 00:04:54,200 --> 00:04:55,200 So first of all, 91 00:04:56,333 --> 00:04:57,600 the New 92 00:04:57,600 --> 00:05:01,000 York column, or any of the dummy variables, works as a switch. 93 00:05:01,366 --> 00:05:02,333 In this case, 94 00:05:02,333 --> 00:05:05,333 let's look at the New York column, which we're including in our regression. 95 00:05:05,666 --> 00:05:07,300 It works like a light switch. 96 00:05:07,300 --> 00:05:12,833 So if it's a one, then you know that this company is in New York. 97 00:05:12,833 --> 00:05:13,800 If it's a zero, 98 00:05:13,800 --> 00:05:15,933 so off in this case, 99 00:05:15,933 --> 00:05:19,233 in the case of the picture, then you know that the company doesn't operate 100 00:05:19,233 --> 00:05:20,233 in New York. 101 00:05:20,233 --> 00:05:23,433 So the dummy variables work like light switches. 102 00:05:23,433 --> 00:05:26,933 And that's why they're ones and zeros and they don't need any other values in them. 103 00:05:27,600 --> 00:05:31,300 And the second thing is that when you look at this approach, it might seem biased. 104 00:05:31,300 --> 00:05:33,833 So we are including a variable for New York. 105 00:05:33,833 --> 00:05:35,266 And there's a coefficient for New York. 106 00:05:35,266 --> 00:05:38,300 So we basically have this benefit of having 107 00:05:38,300 --> 00:05:41,533 a coefficient in our equation for New York, which is b4. 108 00:05:42,100 --> 00:05:46,233 But for California there's no coefficient, because when D1 is zero, 109 00:05:46,233 --> 00:05:48,633 that whole last part of the equation becomes zero.
110 00:05:48,633 --> 00:05:53,433 And there's no benefit of a coefficient in our equation for California. 111 00:05:53,433 --> 00:05:56,566 It might seem biased at first, but in reality, that's not the case, 112 00:05:56,566 --> 00:06:01,000 because the way regression models work is that they will take as the default 113 00:06:01,566 --> 00:06:06,133 that state, or that variable, that dummy variable that you have not included. 114 00:06:06,133 --> 00:06:11,133 It will become the default situation for this regression model. 115 00:06:11,133 --> 00:06:15,966 So basically what that means is that the coefficient for California 116 00:06:16,066 --> 00:06:19,933 is going to be included in the constant, in b0. 117 00:06:20,466 --> 00:06:24,766 And by default, when D1 is equal to zero, 118 00:06:24,766 --> 00:06:27,866 this whole equation will turn into what 119 00:06:27,866 --> 00:06:30,600 you can think of as an equation for California. 120 00:06:30,600 --> 00:06:34,433 But then when D1 becomes one, you're adding b4, 121 00:06:34,433 --> 00:06:38,866 which is, and once again this is a very basic explanation, 122 00:06:38,866 --> 00:06:41,866 you're adding a coefficient which is the difference 123 00:06:41,866 --> 00:06:43,466 between New York and California. 124 00:06:43,466 --> 00:06:44,700 So basically 125 00:06:44,700 --> 00:06:48,900 you're switching the equation from California to New York by flipping this light switch. 126 00:06:48,900 --> 00:06:51,600 If it's off, then it's kind of the default state, 127 00:06:51,600 --> 00:06:54,000 and the whole equation is working for California. 128 00:06:54,000 --> 00:06:59,133 If it's on, then by adding that b4 you're altering the equation 129 00:06:59,133 --> 00:07:02,133 from the default state of California to New York. 130 00:07:02,200 --> 00:07:06,333 So that's an intuitive way to think of dummy variables.
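The light-switch intuition above can be checked with a deliberately simplified sketch. It assumes a model with only the constant and the New York dummy (the three spend variables are left out to keep the arithmetic visible), uses the textbook least-squares formulas for a single regressor, and runs on made-up profit figures. The function name `fit_intercept_and_dummy` is hypothetical.

```python
def fit_intercept_and_dummy(d, y):
    """Least-squares fit of y = b0 + b4 * d for a single 0/1 regressor d."""
    n = len(y)
    mean_d = sum(d) / n
    mean_y = sum(y) / n
    # Slope = covariance(d, y) / variance(d); intercept follows from the means.
    sxy = sum((di - mean_d) * (yi - mean_y) for di, yi in zip(d, y))
    sxx = sum((di - mean_d) ** 2 for di in d)
    b4 = sxy / sxx
    b0 = mean_y - b4 * mean_d
    return b0, b4


# Invented profits: two New York companies (D1 = 1), two California (D1 = 0).
d1 = [1, 1, 0, 0]
profit = [110.0, 130.0, 100.0, 80.0]

b0, b4 = fit_intercept_and_dummy(d1, profit)
print(b0)       # 90.0, the California average (switch off)
print(b0 + b4)  # 120.0, the New York average (switch on)
```

In this stripped-down case the constant b0 is exactly the average for the omitted state, and b4 is the difference between the two states. With the spend variables included, the same reading holds with the other variables held fixed.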
131 00:07:06,333 --> 00:07:09,333 So there's nothing wrong with the fact that we're only including one. 132 00:07:09,633 --> 00:07:13,433 And once again, in the next tutorial, we'll talk more about why 133 00:07:13,433 --> 00:07:17,133 it is a bad idea to ever include both dummy variables. 134 00:07:17,566 --> 00:07:20,400 I look forward to seeing you next time. Until then, happy analyzing.