1 00:00:01,266 --> 00:00:02,466 Great to see you back here. 2 00:00:02,466 --> 00:00:05,000 Today we're going to look at a handy trick 3 00:00:05,000 --> 00:00:08,133 that will help your models become more robust. 4 00:00:08,433 --> 00:00:11,833 Before we continue, though, I want to quickly recap what we did last time. 5 00:00:12,166 --> 00:00:16,100 What we did is we used the backward elimination method 6 00:00:16,100 --> 00:00:20,133 to construct a multiple linear regression using our data. 7 00:00:20,466 --> 00:00:24,266 And through that method we constructed four separate models. 8 00:00:24,266 --> 00:00:25,966 You can see them in the background here. 9 00:00:25,966 --> 00:00:28,966 So models one, two, three and then four. 10 00:00:29,000 --> 00:00:31,700 And when we got to model four we were only left with one 11 00:00:31,700 --> 00:00:34,933 independent variable, spend on research and development, 12 00:00:35,100 --> 00:00:37,766 because all of the other ones were eliminated. 13 00:00:37,766 --> 00:00:40,800 And basically we completed the backward elimination process. 14 00:00:41,066 --> 00:00:44,866 However, we were left with kind of a feeling 15 00:00:45,133 --> 00:00:48,100 that maybe we shouldn't have excluded the last variable. 16 00:00:48,100 --> 00:00:49,266 And why is that? 17 00:00:49,266 --> 00:00:54,200 Well, first of all, we saw that this p-value is not that much bigger than 18 00:00:54,900 --> 00:00:56,633 the significance level that we selected. 19 00:00:56,633 --> 00:00:59,066 We selected a significance level of 0.05. 20 00:00:59,066 --> 00:01:00,633 The p-value here is 0.06. 21 00:01:00,633 --> 00:01:03,700 So just a bit greater than the threshold. 22 00:01:04,133 --> 00:01:06,933 So that kind of leaves us with a feeling 23 00:01:06,933 --> 00:01:09,600 that maybe we shouldn't have excluded that variable.
24 00:01:09,600 --> 00:01:12,900 The problem here is that 25 00:01:12,900 --> 00:01:17,200 these methods, these stepwise regression methods, are very arbitrary. 26 00:01:17,200 --> 00:01:21,433 So once you select your significance level threshold, 27 00:01:22,133 --> 00:01:23,400 you've got to stick to it, right? 28 00:01:23,400 --> 00:01:26,633 We selected 0.05, and this one is 0.06. 29 00:01:26,733 --> 00:01:27,666 So that's greater. 30 00:01:27,666 --> 00:01:31,266 And we've just got to cut it off and not look back anymore, 31 00:01:31,266 --> 00:01:33,366 and just proceed with the method. 32 00:01:33,366 --> 00:01:38,066 So how can we improve our method of building models 33 00:01:38,400 --> 00:01:41,066 to assess situations like that 34 00:01:41,066 --> 00:01:46,200 and give, you know, like an extra opinion, or have another criterion 35 00:01:46,200 --> 00:01:49,766 to tell us whether or not we should have actually kept this variable? 36 00:01:50,133 --> 00:01:54,000 And there is a way, and we're going to talk about it right now. 37 00:01:55,066 --> 00:01:58,500 So let's look at this top part over here. 38 00:01:59,700 --> 00:02:03,533 The top part is responsible for the variables. 39 00:02:03,533 --> 00:02:06,900 So you have the coefficients and you've got the p-values and so on. 40 00:02:06,900 --> 00:02:09,900 So we've talked about it quite extensively. 41 00:02:09,900 --> 00:02:13,633 But now let's look at the bottom part, the second part of our report. 42 00:02:13,966 --> 00:02:15,600 And here, as we discussed before, 43 00:02:15,600 --> 00:02:18,600 we've got the stats that are for the model as a whole. 44 00:02:18,600 --> 00:02:21,766 So how well it has been fitted, how well it's working and so on. 45 00:02:21,766 --> 00:02:24,866 So the stats that we'll be looking at today 46 00:02:24,866 --> 00:02:28,500 are R squared and adjusted R squared.
47 00:02:28,500 --> 00:02:32,800 They will help us come up with a cool approach, 48 00:02:32,800 --> 00:02:38,366 a great approach, to improve our backward elimination method. 49 00:02:38,800 --> 00:02:41,466 So we already talked about both R squared 50 00:02:41,466 --> 00:02:44,466 and adjusted R squared in an earlier section of this course. 51 00:02:44,500 --> 00:02:47,400 However, if you chose to skip that section because you're not interested 52 00:02:47,400 --> 00:02:50,266 in the formulas and so on, then I'll quickly recap for you now. 53 00:02:50,266 --> 00:02:56,300 So R squared over here is basically a characteristic 54 00:02:56,433 --> 00:03:01,133 or a parameter of your model which tells you about the goodness of fit. 55 00:03:01,133 --> 00:03:03,433 So how well your model has been fitted. 56 00:03:03,433 --> 00:03:06,166 R squared can never be greater than one, 57 00:03:06,166 --> 00:03:09,166 and you want it to be as close to one as possible. 58 00:03:09,166 --> 00:03:11,566 The closer to one it is, the better 59 00:03:11,566 --> 00:03:14,166 your model is deemed to be fitted. 60 00:03:14,166 --> 00:03:19,166 However, R squared is biased, and it's biased because of the way 61 00:03:20,466 --> 00:03:23,666 it is constructed and the way these models are run. 62 00:03:23,666 --> 00:03:27,700 So the ordinary least squares method, 63 00:03:28,000 --> 00:03:31,233 it doesn't allow R squared to ever decrease. 64 00:03:31,233 --> 00:03:35,666 So the more variables you add to your model, the greater R squared will be. 65 00:03:35,700 --> 00:03:39,000 So basically what this means is that 66 00:03:39,666 --> 00:03:42,666 as long as you keep adding variables, R squared will keep growing, or at least never drop. 67 00:03:43,200 --> 00:03:44,433 And we can observe that here. 68 00:03:44,433 --> 00:03:45,466 So if we start from the end, 69 00:03:45,466 --> 00:03:49,333 where we have only one variable, you can see that R squared is 0.94.
70 00:03:49,733 --> 00:03:51,200 Then R squared became, 71 00:03:51,200 --> 00:03:54,200 well, if you go this way, it's 0.95. 72 00:03:54,433 --> 00:03:56,700 But then it's 0.9507. 73 00:03:56,700 --> 00:04:00,366 So as you can see, the more variables we have, the greater R squared gets. 74 00:04:00,366 --> 00:04:02,100 And that's always going to be the case, 75 00:04:02,100 --> 00:04:05,566 just because of the way R squared is derived. 76 00:04:06,266 --> 00:04:09,233 And moreover, you can even include completely random variables. 77 00:04:09,233 --> 00:04:12,366 So if I throw into this model another variable, 78 00:04:12,366 --> 00:04:15,333 which is basically the temperature outside, 79 00:04:15,333 --> 00:04:18,533 like the air temperature outside right now, 80 00:04:18,733 --> 00:04:21,200 and I throw that in as an independent variable, 81 00:04:21,200 --> 00:04:22,566 of course it's not a predictor. 82 00:04:22,566 --> 00:04:26,266 It can't predict the profit of a company that works in New York or California. 83 00:04:26,700 --> 00:04:30,900 But R squared is still going to grow, and it's going to imply 84 00:04:30,900 --> 00:04:33,900 that our model is now even better fitted. 85 00:04:33,933 --> 00:04:35,600 So in that way R squared is biased. 86 00:04:35,600 --> 00:04:38,366 And that's where adjusted R squared comes into play. 87 00:04:38,366 --> 00:04:41,066 Adjusted R squared is very similar to R squared. 88 00:04:41,066 --> 00:04:43,800 It's got a very similar formula. 89 00:04:43,800 --> 00:04:47,500 But it actually has a penalization factor. 90 00:04:47,700 --> 00:04:51,333 So basically, just like R squared would grow 91 00:04:51,333 --> 00:04:55,266 if you add more variables, adjusted R squared would also grow. 92 00:04:55,266 --> 00:04:59,100 But there's a penalization factor which makes it smaller, 93 00:04:59,100 --> 00:05:02,466 which reduces adjusted R squared as you add more variables.
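[Editor's note] The video does not show the formulas, so the following is only the standard textbook form of the two statistics, given here to make the "penalization factor" concrete. The symbols are assumptions introduced for illustration: n is the number of observations, p is the number of independent variables (not counting the intercept), SS_res is the residual sum of squares and SS_tot is the total sum of squares.

```latex
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}},
\qquad
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Adding a variable can only lower SS_res or leave it unchanged, which is why R squared never drops. In the adjusted version the factor (n - 1)/(n - p - 1) gets larger as p grows, so a new variable raises adjusted R squared only if its gain in R squared outweighs that factor.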
94 00:05:02,700 --> 00:05:05,066 So there are kind of these two effects battling each other. 95 00:05:05,066 --> 00:05:07,666 On one hand it's growing because of the way it's constructed. 96 00:05:07,666 --> 00:05:12,266 On the other hand, the penalization factor is penalizing 97 00:05:12,266 --> 00:05:15,233 the adjusted R squared and reducing it every time you add a variable. 98 00:05:15,233 --> 00:05:18,666 So basically, if the variable that you added 99 00:05:18,666 --> 00:05:22,666 doesn't make R squared grow that much, 100 00:05:22,666 --> 00:05:27,566 like for instance here, you can see 0.9505 to 0.9507, 101 00:05:27,566 --> 00:05:31,400 so it only grew by a very, very small amount, 102 00:05:31,666 --> 00:05:32,700 well, if that happens, 103 00:05:32,700 --> 00:05:36,133 then the penalization factor is going to overwhelm this growth. 104 00:05:36,533 --> 00:05:39,566 And therefore the adjusted R squared is actually going to decrease 105 00:05:39,566 --> 00:05:40,266 in that scenario. 106 00:05:40,266 --> 00:05:44,400 And that way we can use, and we will use, the adjusted R squared 107 00:05:44,666 --> 00:05:48,566 to watch the goodness of fit of our models and how it changes. 108 00:05:49,400 --> 00:05:50,733 So let's go ahead and do that. 109 00:05:50,733 --> 00:05:53,466 Let's observe the adjusted R squared in our method. 110 00:05:53,466 --> 00:05:54,933 What was the adjusted R squared here? 111 00:05:54,933 --> 00:05:56,733 It was 0.94. 112 00:05:57,966 --> 00:05:58,533 Then when 113 00:05:58,533 --> 00:06:01,533 we excluded administration expenses, 114 00:06:01,900 --> 00:06:04,633 adjusted R squared went from 0.9. 115 00:06:04,633 --> 00:06:07,300 5 to 0.9475. 116 00:06:07,300 --> 00:06:10,300 So as you can see here, adjusted R squared went up. 117 00:06:10,466 --> 00:06:11,100 We reduced the number of variables, and 118 00:06:11,100 --> 00:06:16,200 basically what that means is that the model is now better.
119 00:06:16,200 --> 00:06:19,200 It's fitted better. It works. 120 00:06:19,200 --> 00:06:22,300 These variables in this combination 121 00:06:22,766 --> 00:06:26,433 fit this model to explain the profit variable 122 00:06:26,466 --> 00:06:28,666 better than these variables in this combination 123 00:06:28,666 --> 00:06:30,900 explain the profit variable, which is good. 124 00:06:30,900 --> 00:06:32,933 It's a good step, okay. 125 00:06:32,933 --> 00:06:35,033 So that means we improved our model here. 126 00:06:35,033 --> 00:06:37,866 Adjusted R squared is 0.9475. 127 00:06:37,866 --> 00:06:40,633 Let's see what happened to it when we moved to the next step. 128 00:06:40,633 --> 00:06:41,300 At the next step, 129 00:06:41,300 --> 00:06:44,866 adjusted R squared is 0.9483. 130 00:06:44,866 --> 00:06:49,500 So it went up again, meaning that once again we improved our model. 131 00:06:49,500 --> 00:06:51,600 So altogether these variables 132 00:06:52,800 --> 00:06:55,500 are doing a better job explaining profit 133 00:06:55,500 --> 00:06:58,800 than these variables together are doing explaining profit. 134 00:06:59,033 --> 00:07:01,800 That's great. We improved our model again. 135 00:07:01,800 --> 00:07:04,900 And now let's see what happens when we take out the last variable, 136 00:07:04,900 --> 00:07:06,033 marketing spend. 137 00:07:06,033 --> 00:07:10,066 So we went from adjusted R squared 0.9483 138 00:07:10,233 --> 00:07:13,966 to adjusted R squared 0.9454. 139 00:07:14,433 --> 00:07:15,733 And what does that tell us? 140 00:07:15,733 --> 00:07:18,900 Our adjusted R squared went down, and it went down 141 00:07:19,833 --> 00:07:24,433 by 0.003 approximately. 142 00:07:24,600 --> 00:07:27,866 So that means that this model, this new model, 143 00:07:28,500 --> 00:07:31,033 is actually worse than this model.
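[Editor's note] The comparison just walked through can be sketched in a few lines of Python. This is only an illustration: the three adjusted R squared values are the ones quoted above, while the labels "model 2" to "model 4" and the helper name first_drop are assumptions made for this sketch. The first model's value is left out because it is not stated clearly in the recording.

```python
# Adjusted R-squared values quoted in the walkthrough, in elimination order.
# The labels are hypothetical; only the numbers come from the transcript.
adjusted_r2_by_step = [
    ("model 2", 0.9475),
    ("model 3", 0.9483),
    ("model 4", 0.9454),  # after marketing spend was removed
]


def first_drop(steps):
    """Return the (previous, current) pair at the first decrease, or None."""
    for prev, curr in zip(steps, steps[1:]):
        if curr[1] < prev[1]:
            return prev, curr
    return None


result = first_drop(adjusted_r2_by_step)
if result is not None:
    (prev_name, prev_value), (curr_name, curr_value) = result
    print(f"Adjusted R squared fell from {prev_value} ({prev_name}) "
          f"to {curr_value} ({curr_name}); consider keeping {prev_name}.")
```

The drop is flagged at the last step, which is the point where the removal of marketing spend is worth questioning.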
144 00:07:31,033 --> 00:07:37,266 So this model was better fitted to predict or explain the variance in profit 145 00:07:37,333 --> 00:07:40,333 than this model is. 146 00:07:41,300 --> 00:07:41,966 So there you go. 147 00:07:41,966 --> 00:07:44,400 So even though we excluded a variable 148 00:07:44,400 --> 00:07:48,000 according to our backward elimination method, 149 00:07:48,333 --> 00:07:51,666 this variable should not have been excluded, because 150 00:07:51,966 --> 00:07:55,200 with this variable 151 00:07:55,200 --> 00:07:58,200 this model is actually working better. 152 00:07:58,233 --> 00:08:02,000 And that is your takeaway, the handy trick 153 00:08:02,000 --> 00:08:07,200 for how to observe your models and how to create them. 154 00:08:07,500 --> 00:08:11,000 You don't just follow backward elimination or whatever method 155 00:08:11,000 --> 00:08:15,133 you're using and arbitrarily follow the rules. 156 00:08:15,733 --> 00:08:18,500 Instead, follow the rules, 157 00:08:18,500 --> 00:08:23,133 but also watch the adjusted R squared and see if it's actually improving. 158 00:08:23,133 --> 00:08:25,166 So if it's growing, then you're doing the right thing. 159 00:08:25,166 --> 00:08:27,066 As soon as you see the adjusted R squared drop, 160 00:08:27,066 --> 00:08:30,366 then you have to stop and question: did I just do the right thing or not? 161 00:08:30,366 --> 00:08:31,533 And, you know, 162 00:08:31,533 --> 00:08:32,366 what is the trade-off 163 00:08:32,366 --> 00:08:36,933 of excluding a certain variable or, the opposite, of including the variable? 164 00:08:37,100 --> 00:08:40,100 So adjusted R squared is kind of your indicator 165 00:08:40,200 --> 00:08:41,200 of how you are going along the way. 166 00:08:42,500 --> 00:08:44,266 But that's it for today. 167 00:08:44,266 --> 00:08:47,700 I hope you find this useful, and I hope
168 00:08:47,700 --> 00:08:50,700 You'll find ways to apply this in your actual work. 169 00:08:50,933 --> 00:08:52,233 I look forward to seeing you next time. 170 00:08:52,233 --> 00:08:55,233 And until then, happy analyzing.
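[Editor's note] To try the claim about R squared and random variables for yourself, here is a minimal sketch using NumPy on synthetic data. It is not the course's dataset or tool: the data, the sample size of 50, the seed, and the function name r2_and_adjusted are all assumptions made for illustration. The noise column plays the role of the "temperature outside" variable from the lecture.

```python
import numpy as np


def r2_and_adjusted(X, y):
    """Fit ordinary least squares with an intercept.

    Returns (R squared, adjusted R squared) for the p columns of X.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted


rng = np.random.default_rng(0)       # fixed seed, arbitrary choice
n = 50                               # assumed sample size
x = rng.normal(size=(n, 1))          # one genuine predictor (synthetic)
y = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=n)
noise = rng.normal(size=(n, 1))      # unrelated column, like outside temperature

r2_small, adj_small = r2_and_adjusted(x, y)
r2_big, adj_big = r2_and_adjusted(np.hstack([x, noise]), y)

print(f"one predictor:     R2={r2_small:.4f}  adjusted={adj_small:.4f}")
print(f"plus noise column: R2={r2_big:.4f}  adjusted={adj_big:.4f}")
```

R squared for the larger model will not be lower than for the smaller one. Whether adjusted R squared goes up or down depends on the particular random draw, which is exactly why it is the more useful number to watch.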