1 00:00:00,300 --> 00:00:01,433 Hello and welcome to 2 00:00:01,433 --> 00:00:04,433 this R tutorial. In previous tutorials, we already 3 00:00:04,433 --> 00:00:09,266 implemented a multiple linear regression model that we fitted to the training set. 4 00:00:09,300 --> 00:00:11,833 But when we take a step back, do you think it's actually 5 00:00:11,833 --> 00:00:15,500 the optimal model that we can make with the data set that we have here? 6 00:00:15,800 --> 00:00:18,200 You know, because when we built this model, 7 00:00:18,200 --> 00:00:21,200 we actually used all the independent variables. 8 00:00:21,300 --> 00:00:22,200 But what if 9 00:00:22,200 --> 00:00:24,366 among these independent variables there are some 10 00:00:24,366 --> 00:00:28,633 that are highly statistically significant, that is, have great impact 11 00:00:28,633 --> 00:00:31,633 or great effect on the dependent variable profit, 12 00:00:31,900 --> 00:00:35,300 and some that are not statistically significant at all? 13 00:00:35,333 --> 00:00:40,133 That means that if we removed these non-statistically significant variables 14 00:00:40,133 --> 00:00:43,733 from the model, well, we would still get some amazing predictions. 15 00:00:44,533 --> 00:00:48,300 So the goal in this tutorial is to actually find a team, 16 00:00:48,466 --> 00:00:52,200 an optimal team of independent variables, so that each independent 17 00:00:52,200 --> 00:00:56,700 variable of the team has great impact on the dependent variable profit. 18 00:00:56,866 --> 00:00:59,433 That is, each independent variable of the team 19 00:00:59,433 --> 00:01:02,966 is a powerful predictor that is highly statistically significant 20 00:01:03,066 --> 00:01:06,500 and definitely has an effect on the dependent variable profit. 21 00:01:06,800 --> 00:01:08,566 And this effect can be positive. 
22 00:01:08,566 --> 00:01:11,833 That is, for an increase of one unit of the independent variable, 23 00:01:11,833 --> 00:01:14,600 the profit will increase, or it can be negative. 24 00:01:14,600 --> 00:01:17,600 That is, for an increase of one unit of the independent variable, 25 00:01:17,600 --> 00:01:19,700 the profit will decrease. 26 00:01:19,700 --> 00:01:22,966 And so we're going to divide this final step of building 27 00:01:22,966 --> 00:01:26,900 this optimal model using backward elimination into two tutorials. 28 00:01:27,266 --> 00:01:30,600 In this first tutorial that we are having right now, I'm going to walk you 29 00:01:30,600 --> 00:01:35,400 through the backward elimination algorithm, but without completing it up to the end. 30 00:01:35,866 --> 00:01:38,866 That means that at the end of this tutorial, you'll get a homework 31 00:01:38,966 --> 00:01:44,200 which will consist of completing what we started with backward elimination. 32 00:01:44,200 --> 00:01:48,000 So I'm sure you will have no problem, because I'm going to walk you through 33 00:01:48,000 --> 00:01:51,900 the introduction of backward elimination so that you can understand everything 34 00:01:51,900 --> 00:01:55,000 and have all the tools to complete the homework. 35 00:01:55,333 --> 00:01:58,166 And then in the next tutorial, I'll give you the solution of this 36 00:01:58,166 --> 00:02:01,433 homework and we will complete the backward elimination together. 37 00:02:02,000 --> 00:02:03,833 So I hope you're excited. 38 00:02:03,833 --> 00:02:04,833 Let's start right now 39 00:02:04,833 --> 00:02:07,266 with backward elimination. 
40 00:02:07,266 --> 00:02:10,466 So for those of you who followed the Python tutorial, 41 00:02:10,466 --> 00:02:13,466 you will notice that it's actually a little simpler here, 42 00:02:13,500 --> 00:02:17,300 because in the Python tutorial we had to use another library and another 43 00:02:17,300 --> 00:02:21,433 multiple linear regression model to implement backward elimination. 44 00:02:21,866 --> 00:02:25,300 And this time we will simply take the model that we created here, 45 00:02:25,466 --> 00:02:30,133 and we will use this amazing summary function of R that returns 46 00:02:30,133 --> 00:02:35,900 a great deal of statistical information that can help make our model more robust. 47 00:02:36,400 --> 00:02:39,966 And then, same as in Python, we are going to do some copy-pasting 48 00:02:40,100 --> 00:02:44,700 very simply until we get to the final team of our independent variables. 49 00:02:45,166 --> 00:02:46,500 So let's do this. 50 00:02:46,500 --> 00:02:48,333 You're going to see it's really quick and easy. 51 00:02:48,333 --> 00:02:50,166 We're going to do that very efficiently. 52 00:02:50,166 --> 00:02:54,066 And the first step of doing that is to actually take our model, 53 00:02:54,100 --> 00:02:55,633 because as I just mentioned, 54 00:02:55,633 --> 00:02:59,333 we are going to use this same model to implement backward elimination. 55 00:03:00,000 --> 00:03:03,633 So here I just copied the model and I'm going to paste it here. 56 00:03:04,166 --> 00:03:07,166 And now we're just going to change two things. 
57 00:03:07,500 --> 00:03:11,100 The first thing is, instead of having this dot here, 58 00:03:11,200 --> 00:03:14,200 you know, this dot that represents all the independent variables, 59 00:03:14,200 --> 00:03:17,766 we're going to write all the independent variables separated by a plus, 60 00:03:18,300 --> 00:03:21,966 because, you know, the principle of backward elimination 61 00:03:21,966 --> 00:03:25,200 is that we will remove each independent variable 62 00:03:25,200 --> 00:03:28,200 that is not statistically significant one by one. 63 00:03:28,500 --> 00:03:31,900 So we need here to write each of the independent variables 64 00:03:31,900 --> 00:03:36,400 so that when we copy-paste this model here we will just need to remove 65 00:03:36,600 --> 00:03:41,400 the non-statistically significant variable from this equation here. Okay. 66 00:03:41,400 --> 00:03:42,466 So let's first do this. 67 00:03:42,466 --> 00:03:46,700 I'm going to take our data set here to look at the independent variables. 68 00:03:47,166 --> 00:03:49,533 So the first independent variable is R&D Spend. 69 00:03:49,533 --> 00:03:51,566 So let's add it here. 70 00:03:51,566 --> 00:03:53,800 So R.D. 71 00:03:53,800 --> 00:03:56,166 So you know, as a reminder, there is a dot here 72 00:03:56,166 --> 00:04:00,666 because the original name for this independent variable is R&D 73 00:04:00,666 --> 00:04:04,200 Spend, and R just replaced the ampersand and the space by dots here. 74 00:04:04,633 --> 00:04:07,200 So that's good to know if you're working with R 75 00:04:07,200 --> 00:04:10,266 and you have some data sets with spaces in your column names. 76 00:04:10,900 --> 00:04:12,600 And then, okay, so what was the name? 77 00:04:13,833 --> 00:04:15,200 Another dot. 78 00:04:15,200 --> 00:04:18,066 So that means another space, so dot, and Spend. 79 00:04:18,066 --> 00:04:20,266 All right. So that's the first independent variable. 80 00:04:20,266 --> 00:04:22,000 And now let's add the second one. 
81 00:04:22,000 --> 00:04:25,200 So we need to separate them by a plus. Okay. 82 00:04:25,600 --> 00:04:27,300 And what is the second one? 83 00:04:27,300 --> 00:04:30,300 The second one is Administration. Okay. 84 00:04:30,633 --> 00:04:32,066 So here there is no dot. 85 00:04:32,066 --> 00:04:35,200 Everything is fine, just as it's spelled. Right. 86 00:04:35,500 --> 00:04:38,500 Administration, plus 87 00:04:40,766 --> 00:04:43,766 Marketing.Spend, 88 00:04:44,733 --> 00:04:45,566 plus. 89 00:04:45,566 --> 00:04:49,800 And we have one last independent variable, which is the State. 90 00:04:50,333 --> 00:04:54,366 So here we don't need to create the dummy variables as we did in Python, 91 00:04:54,366 --> 00:04:57,833 because remember, we used here this amazing 92 00:04:58,000 --> 00:05:02,300 factor function that encoded this State categorical variable 93 00:05:02,533 --> 00:05:05,333 into factor levels that are one, two, three. 94 00:05:05,333 --> 00:05:09,500 And there is no relational order between those categories. 95 00:05:09,500 --> 00:05:10,866 So everything is fine. 96 00:05:10,866 --> 00:05:13,433 We don't need to create any dummy variables. 97 00:05:13,433 --> 00:05:15,200 So that's the beauty of R. 98 00:05:15,200 --> 00:05:19,200 And so here, same, we don't need to add separate dummy variables. 99 00:05:19,200 --> 00:05:23,100 We can take the original independent variable State. Okay. 100 00:05:23,100 --> 00:05:25,833 So as I mentioned, there are two things that we would like to change here. 101 00:05:25,833 --> 00:05:29,900 The first thing was to replace the dot by all these independent 102 00:05:29,900 --> 00:05:31,766 variables separated by a plus. 
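For readers following along without the video, the formula typed in this passage can be sketched as below. This is only a minimal sketch on simulated data, since the tutorial's data file is not available here: the simulated values, the seed, and the object names `training_set`, `regressor_dot` and `regressor_explicit` are assumptions for illustration. Only the column names come from the narration.

```r
# Header names as described in the narration (the values below are simulated).
raw_names <- c("R&D Spend", "Administration", "Marketing Spend", "State", "Profit")

# read.csv() cleans headers with make.names() by default (check.names = TRUE),
# so characters that are not valid in an R name become dots.
clean_names <- make.names(raw_names)
print(clean_names)

# Simulated stand-in for the training set (assumption, not the course data).
set.seed(42)
n <- 40
training_set <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 470000),
  State           = factor(rep(1:3, length.out = n))
)
training_set$Profit <- 50000 + 0.8 * training_set$R.D.Spend +
  0.03 * training_set$Marketing.Spend + rnorm(n, sd = 9000)

# The dot shorthand: every column except Profit is used as a predictor.
regressor_dot <- lm(formula = Profit ~ ., data = training_set)

# The same model with each predictor written out, ready for one-by-one removal.
regressor_explicit <- lm(
  formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
  data = training_set
)

# Both spellings should give the same coefficients.
print(all.equal(coef(regressor_dot), coef(regressor_explicit)))
```

Writing the predictors out changes nothing about the fit; it only makes each later elimination step a one-word deletion from the formula.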
103 00:05:31,766 --> 00:05:32,100 And now 104 00:05:32,100 --> 00:05:35,633 the second thing that we would like to do, which is not compulsory and is just because 105 00:05:35,866 --> 00:05:38,866 I would like to use the whole data set to see the correlations, 106 00:05:39,000 --> 00:05:42,000 is to replace the training set here by 107 00:05:42,500 --> 00:05:45,033 simply our data set. 108 00:05:45,033 --> 00:05:46,666 So that's not compulsory. 109 00:05:46,666 --> 00:05:49,866 We can actually do backward elimination using the training set. 110 00:05:50,433 --> 00:05:54,566 But we're just taking the whole data set in order to have complete information 111 00:05:54,633 --> 00:05:58,533 about which independent variables are statistically significant 112 00:05:58,700 --> 00:06:01,566 and which independent variables are not. 113 00:06:01,566 --> 00:06:04,300 Okay. And now actually we're almost ready. 114 00:06:04,300 --> 00:06:09,400 We just need to use the summary function, which we actually already used before. 115 00:06:09,700 --> 00:06:12,600 And there is nothing simpler than using the summary function. 116 00:06:12,600 --> 00:06:15,400 We just need to take the summary function here. 117 00:06:15,400 --> 00:06:19,100 And then in parentheses we input our regressor. 118 00:06:19,500 --> 00:06:22,133 Here it is. And now that's actually ready. 119 00:06:22,133 --> 00:06:26,000 We're actually ready to start the first steps of our backward elimination. 120 00:06:26,000 --> 00:06:30,000 Well, speaking of backward elimination, let's have a look at the slide 121 00:06:30,000 --> 00:06:32,933 you saw with Kirill in the intuition tutorial. 122 00:06:32,933 --> 00:06:33,966 And here is the slide. 123 00:06:33,966 --> 00:06:37,266 So let's have a quick reminder about the five steps here. 
124 00:06:37,500 --> 00:06:38,966 So the first step is to select 125 00:06:38,966 --> 00:06:42,466 a significance level, that is, a threshold for our p value 126 00:06:42,700 --> 00:06:46,800 such that if the p value of an independent variable is below 127 00:06:46,800 --> 00:06:50,533 the significance level, then this independent variable stays in the model. 128 00:06:50,833 --> 00:06:53,700 And if the p value of the independent variable is higher 129 00:06:53,700 --> 00:06:57,033 than the significance level, then it will not stay in the model. 130 00:06:57,033 --> 00:06:58,833 We will remove it. 131 00:06:58,833 --> 00:07:02,166 But the first step is just to select a significance level. 132 00:07:02,166 --> 00:07:04,800 We don't have to do anything here with the independent variables. 133 00:07:04,800 --> 00:07:05,933 We just need to choose one. 134 00:07:05,933 --> 00:07:09,833 And we're going to choose 5%, that is 0.05. Okay. 135 00:07:09,833 --> 00:07:14,566 And now step two is to fit the full model with all possible predictors. 136 00:07:14,900 --> 00:07:16,633 So that's actually what we've just done, 137 00:07:16,633 --> 00:07:19,566 you know, by taking all our independent variables 138 00:07:19,566 --> 00:07:24,166 in our regressor using the lm function, which actually fits the full model 139 00:07:24,166 --> 00:07:27,666 with all the possible predictors, that is, with all the independent variables. 140 00:07:28,266 --> 00:07:30,200 Okay. So done. 141 00:07:30,200 --> 00:07:31,533 And now what is step three? 142 00:07:31,533 --> 00:07:35,366 Step three is to look at the predictor that has the highest p value. 143 00:07:35,966 --> 00:07:39,600 So we will find it thanks to our summary function. 144 00:07:40,100 --> 00:07:43,100 And if the p value is higher than the significance level, 145 00:07:43,200 --> 00:07:46,233 that is, if it's higher than 5%, then we'll go to step four. 
146 00:07:46,600 --> 00:07:49,600 And if that's not the case, our model is actually ready. 147 00:07:49,666 --> 00:07:51,866 But don't worry, it will not be that quick. 148 00:07:52,900 --> 00:07:54,000 So actually, 149 00:07:54,000 --> 00:07:54,866 let's suppose 150 00:07:54,866 --> 00:07:59,100 we found the highest p value higher than the significance level of 5%. 151 00:07:59,433 --> 00:08:01,466 Then we need to move on to step four. 152 00:08:01,466 --> 00:08:03,500 And step four is actually to remove 153 00:08:03,500 --> 00:08:06,533 this independent variable that has the highest p value. 154 00:08:07,266 --> 00:08:10,433 And once we remove the predictor, we're ready to move on to step five, 155 00:08:10,600 --> 00:08:13,800 which is to fit the model without this variable. 156 00:08:14,233 --> 00:08:17,633 So that's why, you know, we wrote all the independent variables 157 00:08:17,633 --> 00:08:19,866 one by one separated by a plus. 158 00:08:19,866 --> 00:08:23,966 Because, you know, once we reach step five here, we will just copy-paste 159 00:08:24,166 --> 00:08:29,366 the regressor and the summary function and remove this independent variable 160 00:08:29,366 --> 00:08:32,333 that had the highest p value from the regressor 161 00:08:32,333 --> 00:08:35,166 to build a new regressor without this variable. 162 00:08:35,166 --> 00:08:38,166 And that will fit this model without the variable. 163 00:08:38,233 --> 00:08:41,533 And once it's done, we go back to step three here 164 00:08:41,833 --> 00:08:45,266 to repeat this same pathway. That is, 165 00:08:45,266 --> 00:08:48,266 once again we're going to look at the independent variables 166 00:08:48,266 --> 00:08:51,033 among the new team of independent variables 167 00:08:51,033 --> 00:08:53,966 without the independent variable that we just removed. 168 00:08:53,966 --> 00:08:57,066 So we're going to look for the independent variable that has the highest p value. 
169 00:08:57,366 --> 00:08:58,233 And same: 170 00:08:58,233 --> 00:09:01,700 if the p value is higher than the significance level, we'll go to step four. 171 00:09:01,800 --> 00:09:04,300 And otherwise our model is ready. 172 00:09:04,300 --> 00:09:05,200 So let's do this. 173 00:09:05,200 --> 00:09:10,500 We already completed step one by choosing a significance level of 5%. 174 00:09:10,966 --> 00:09:12,266 And same for step two. 175 00:09:12,266 --> 00:09:14,833 We actually fitted the model with all possible predictors. 176 00:09:14,833 --> 00:09:17,833 Well, we need of course to execute the code. 177 00:09:17,933 --> 00:09:21,966 And now we will move on to step three, which will consist 178 00:09:21,966 --> 00:09:24,966 of looking for the independent variable that has the highest p value. 179 00:09:25,166 --> 00:09:26,200 So let's do this right now. 180 00:09:27,333 --> 00:09:27,866 Okay. 181 00:09:27,866 --> 00:09:32,466 So as I just mentioned, we need to execute this 182 00:09:32,466 --> 00:09:35,500 to build our regressor with all the independent variables. 183 00:09:35,500 --> 00:09:39,366 Well, actually we don't really need to execute this because we actually 184 00:09:39,366 --> 00:09:43,233 executed this code section here that builds exactly the same regressor. 185 00:09:43,233 --> 00:09:47,700 But just to complete all these steps in this tutorial, let's execute that again. 186 00:09:47,700 --> 00:09:49,800 That will not cause any issue. 187 00:09:49,800 --> 00:09:53,566 So I'm going to press Command or Control plus Enter to execute. 188 00:09:53,933 --> 00:09:54,900 And here we go. 189 00:09:54,900 --> 00:09:58,200 Same regressor created again with a different syntax. 190 00:09:58,500 --> 00:10:02,233 And that's because we want to remove the non-significant independent 191 00:10:02,233 --> 00:10:04,200 variables one by one. 192 00:10:04,200 --> 00:10:04,933 Great. 193 00:10:04,933 --> 00:10:07,933 So that actually completes step two. 
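Steps one and two, as completed at this point, can be sketched as follows. Again this is a hedged sketch on simulated data: the object names `dataset`, `regressor`, `model_summary` and `coefficient_table`, the seed and all values are assumptions, with only the column names taken from the narration.

```r
# Step 1: choose the significance level.
significance_level <- 0.05

# Simulated stand-in for the whole data set (assumption, not the course data).
set.seed(42)
n <- 50
dataset <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 470000),
  State           = factor(rep(1:3, length.out = n))
)
dataset$Profit <- 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)

# Step 2: fit the full model with every predictor, on the whole data set.
regressor <- lm(
  formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
  data = dataset
)

# summary() prints the coefficient table (estimates, standard errors,
# t values, p values), then the multiple and adjusted R-squared.
model_summary <- summary(regressor)
print(model_summary)

# The same coefficient table is available as a matrix for later steps.
coefficient_table <- coef(model_summary)
print(colnames(coefficient_table))
```

The factor `State` shows up as two rows, `State2` and `State3`, because R builds the dummy variables itself and uses the first level as the reference.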
194 00:10:07,933 --> 00:10:10,833 And now let's move on to step three, which was to look 195 00:10:10,833 --> 00:10:13,833 for the independent variable that has the highest p value. 196 00:10:14,066 --> 00:10:17,566 And to do this we are going to select this 197 00:10:17,566 --> 00:10:21,433 summary function with the regressor inside and press Command or Control plus 198 00:10:21,433 --> 00:10:22,433 Enter to execute. 199 00:10:23,866 --> 00:10:26,566 Let's move that up a little bit. 200 00:10:26,566 --> 00:10:29,300 So this information is very important 201 00:10:29,300 --> 00:10:32,366 when we want to build a robust model. 202 00:10:32,666 --> 00:10:36,333 And it's not only the information about the p values here 203 00:10:36,333 --> 00:10:39,666 that will help select the optimal team of independent variables, 204 00:10:39,933 --> 00:10:42,966 because below we also have this multiple R squared 205 00:10:42,966 --> 00:10:45,100 and this adjusted R squared 206 00:10:45,100 --> 00:10:48,900 that will help us build an even more robust model than the one 207 00:10:48,900 --> 00:10:51,000 we are going to make in the next two tutorials. 208 00:10:51,000 --> 00:10:54,700 Because at the end of this part there is this section called Evaluating models 209 00:10:54,700 --> 00:10:58,100 performance to actually improve the model performance. 210 00:10:58,600 --> 00:11:02,066 And in this part we will actually use the multiple R squared and the adjusted 211 00:11:02,066 --> 00:11:06,333 R squared to finalize our journey towards the most robust model. 212 00:11:06,533 --> 00:11:09,533 And you will perfectly understand why at the end of this part. 213 00:11:09,933 --> 00:11:12,266 But for now, let's focus on the p values. 214 00:11:12,266 --> 00:11:14,766 So the p values are actually in this column. 215 00:11:14,766 --> 00:11:18,666 And in R there's actually a shortcut to look at the statistical significance. 
216 00:11:18,966 --> 00:11:20,433 It's this last column here. 217 00:11:20,433 --> 00:11:22,300 Well, this last column doesn't have a name. 218 00:11:22,300 --> 00:11:25,500 But you need to look at the stars here because, as a reminder, 219 00:11:25,600 --> 00:11:29,900 the further the p value is below 5%, our significance level, 220 00:11:30,300 --> 00:11:33,533 the more statistically significant the independent variable will be 221 00:11:33,533 --> 00:11:37,000 for the dependent variable profit, and the further the p value 222 00:11:37,000 --> 00:11:40,000 is above the significance level of 5%, 223 00:11:40,033 --> 00:11:43,566 the less statistically significant the independent variable will be. 224 00:11:44,166 --> 00:11:48,233 So in short, the lower the p value, the more impact your independent 225 00:11:48,233 --> 00:11:51,366 variable will have on your dependent variable, 226 00:11:51,533 --> 00:11:54,366 and the higher the p value, the less effect 227 00:11:54,366 --> 00:11:58,333 your independent variable is going to have on the dependent variable. 228 00:11:59,000 --> 00:12:02,400 And there is this reminder here that says that if the p value 229 00:12:02,400 --> 00:12:05,800 is between zero and 0.1 percent, 230 00:12:06,300 --> 00:12:10,200 then it's three stars, meaning that it's highly statistically significant. 231 00:12:10,800 --> 00:12:14,500 Then if the p value is between 0.1% and 1%, 232 00:12:14,900 --> 00:12:18,500 then it's two stars, meaning that it's very statistically significant 233 00:12:18,500 --> 00:12:20,900 but less significant than when there are three stars. 234 00:12:20,900 --> 00:12:24,000 Then if the p value is between 1% and 5%, 235 00:12:24,300 --> 00:12:27,000 then it's one star, simply statistically significant. 236 00:12:27,000 --> 00:12:28,333 That is, your independent variable 237 00:12:28,333 --> 00:12:31,400 still has some good impact on the dependent variable. 
238 00:12:31,966 --> 00:12:37,166 And then if the p value is between 5% and 10%, then it's a dot, meaning 239 00:12:37,166 --> 00:12:41,833 that there is definitely a certain level of statistical significance. 240 00:12:41,833 --> 00:12:45,700 That is, your independent variable has a certain effect on your dependent variable, 241 00:12:46,200 --> 00:12:50,700 but definitely not as much as your other independent variables 242 00:12:50,700 --> 00:12:53,700 that are in these categories here, especially for this one. 243 00:12:54,166 --> 00:12:58,033 And finally, if your p value is between 10% and one, 244 00:12:58,333 --> 00:13:01,766 well, there's absolutely no statistical significance. 245 00:13:02,400 --> 00:13:05,266 So that means that with what we first observe here, 246 00:13:05,266 --> 00:13:10,333 well, we can see that the R&D Spend is highly statistically significant. 247 00:13:10,766 --> 00:13:13,700 But the rest seems to be not significant. 248 00:13:13,700 --> 00:13:17,900 But let's wait for the backward elimination to be completed to find out 249 00:13:17,900 --> 00:13:22,300 if our final team is actually only composed of R&D Spend, 250 00:13:22,700 --> 00:13:26,000 because removing some independent variables here will remove 251 00:13:26,000 --> 00:13:30,000 some possible bias, so that once some independent variables are removed, 252 00:13:30,200 --> 00:13:33,066 we can actually find an independent variable that is more 253 00:13:33,066 --> 00:13:37,200 statistically significant than what it appeared to be at the beginning, 254 00:13:37,200 --> 00:13:39,100 that is, at the first step here. 255 00:13:39,100 --> 00:13:40,466 So let's find out about that. 256 00:13:40,466 --> 00:13:41,966 And actually you will find out 257 00:13:41,966 --> 00:13:45,700 about that yourself because this will be the subject of the homework. 
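The star legend just described, and the search for the highest p value, can be sketched as below. This is an illustrative sketch only: `significance_code()` is a hypothetical helper written for this note (R prints the stars itself in the summary output), and the data are simulated under the same assumptions as before, so the printed p values are not those of the tutorial.

```r
# Hypothetical helper mirroring the legend printed under the coefficient table.
significance_code <- function(p) {
  cut(p,
      breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
      labels = c("***", "**", "*", ".", " "),
      include.lowest = TRUE)
}

# One example p value per category of the legend.
example_codes <- as.character(significance_code(c(0.0005, 0.005, 0.03, 0.07, 0.5)))
print(example_codes)  # "***" "**" "*" "." " "

# Simulated stand-in for the whole data set (assumption, not the course data).
set.seed(42)
n <- 50
dataset <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 470000),
  State           = factor(rep(1:3, length.out = n))
)
dataset$Profit <- 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)

regressor <- lm(
  formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
  data = dataset
)

# Pull the p value column out of the coefficient table, leaving out the intercept.
p_values <- coef(summary(regressor))[, "Pr(>|t|)"]
p_values <- p_values[names(p_values) != "(Intercept)"]

print(data.frame(p_value = round(p_values, 4), code = significance_code(p_values)))

# Step 3: the coefficient with the highest p value is the removal candidate.
highest <- names(which.max(p_values))
print(highest)
```

Reading the stars is a quick visual check; the comparison that drives backward elimination is still the highest p value against the 5% significance level.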
258 00:13:45,700 --> 00:13:48,866 But don't worry, I'm going to walk you through the first steps of 259 00:13:48,866 --> 00:13:51,866 backward elimination and you will complete it yourself. 260 00:13:52,000 --> 00:13:54,900 And in the next tutorial, of course, we'll have the solution 261 00:13:54,900 --> 00:13:56,366 and we'll complete it together. 262 00:13:56,366 --> 00:13:59,366 So I look forward to seeing if we get the same results. 263 00:13:59,466 --> 00:14:02,733 Okay, so now let's carry on with backward elimination. 264 00:14:02,866 --> 00:14:04,966 So remember we are at step three. 265 00:14:04,966 --> 00:14:06,633 And step three is actually to look 266 00:14:06,633 --> 00:14:10,000 for the independent variable that has the highest p value. 267 00:14:10,600 --> 00:14:12,766 And we can find it very easily. 268 00:14:12,766 --> 00:14:18,100 It's actually this one because indeed its p value is 0.999. 269 00:14:18,100 --> 00:14:19,900 That is 99%. 270 00:14:19,900 --> 00:14:22,400 So that's actually a very high p value. 271 00:14:22,400 --> 00:14:26,100 And we are way above the significance level of 5%. 272 00:14:26,333 --> 00:14:29,633 So this dummy variable for state, State2, 273 00:14:29,633 --> 00:14:33,366 is definitely not statistically significant at all. 274 00:14:33,600 --> 00:14:37,266 It has absolutely no effect on the dependent variable profit. 275 00:14:37,766 --> 00:14:43,033 And by the way, we also observe that State3 here has a 94% p value. 276 00:14:43,233 --> 00:14:46,233 And there is no way that if we remove State2, 277 00:14:46,333 --> 00:14:50,733 this p value will decrease below the 5% significance level. 278 00:14:51,066 --> 00:14:55,533 So we can actually remove this State3 independent variable as well, 279 00:14:55,933 --> 00:15:00,900 because clearly the state has no effect or impact on the dependent variable profit. 280 00:15:01,233 --> 00:15:03,900 So we will actually make some kind of a shortcut here. 
281 00:15:03,900 --> 00:15:07,966 And instead of removing the independent variable that has the highest p value, 282 00:15:08,033 --> 00:15:13,000 that is State2, we will actually remove both these dummy variables for state, 283 00:15:13,466 --> 00:15:17,100 because definitely the state is not statistically significant. 284 00:15:17,400 --> 00:15:18,766 So let's do this. 285 00:15:18,766 --> 00:15:21,766 Let's remove the State variable from our equation. 286 00:15:21,766 --> 00:15:23,533 So I'm going to put that down. 287 00:15:25,166 --> 00:15:25,500 Okay. 288 00:15:25,500 --> 00:15:27,000 So as I told you, it's very simple. 289 00:15:27,000 --> 00:15:31,733 We're just going to copy this and paste it here. 290 00:15:32,300 --> 00:15:34,733 And then, so what do we have to do now? 291 00:15:34,733 --> 00:15:39,200 We just need to remove the State independent variable from our equation 292 00:15:39,200 --> 00:15:41,366 here. 293 00:15:41,366 --> 00:15:44,800 And by doing that we complete step four. 294 00:15:44,866 --> 00:15:49,400 Because if we go back to our slide, step four is to actually remove the predictor. 295 00:15:49,800 --> 00:15:50,333 Great. 296 00:15:50,333 --> 00:15:53,800 And now we can move on to step five, which is to fit the multiple linear 297 00:15:53,800 --> 00:15:58,733 regression model without the independent variable State that we just removed. Okay. 298 00:15:58,733 --> 00:16:00,033 So we removed State. 299 00:16:00,033 --> 00:16:02,533 So step four completed. 300 00:16:02,533 --> 00:16:05,900 And now step five is to actually fit the model 301 00:16:05,900 --> 00:16:09,133 without this independent variable State that we just removed. 302 00:16:09,233 --> 00:16:10,200 So let's do this. 303 00:16:10,200 --> 00:16:13,433 We just need to select this and press Command or Control 304 00:16:13,433 --> 00:16:14,500 plus Enter to execute. 305 00:16:15,466 --> 00:16:16,500 And here it is. 
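Steps four and five as just carried out, removing State and refitting, might look like the sketch below. The nested-model F test at the end is an addition that is not part of the tutorial: it is one possible way to check the shortcut of dropping both State dummy variables at once. The data are simulated and the names `dataset`, `full_model`, `regressor`, `comparison` and `state_p_value` are assumptions.

```r
# Simulated stand-in for the whole data set (assumption, not the course data).
set.seed(42)
n <- 50
dataset <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 470000),
  State           = factor(rep(1:3, length.out = n))
)
dataset$Profit <- 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)

# The full model from step 2, kept here only for the comparison below.
full_model <- lm(
  formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
  data = dataset
)

# Step 4: delete State from the formula. Step 5: fit the model without it.
regressor <- lm(
  formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
  data = dataset
)
print(summary(regressor))

# Not in the tutorial: an F test comparing the two nested models tests both
# State dummy variables jointly, which is what removing the factor amounts to.
comparison <- anova(regressor, full_model)
state_p_value <- comparison[["Pr(>F)"]][2]
print(state_p_value)
```

A joint p value above the significance level would support removing the whole factor rather than one dummy variable at a time.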
306 00:16:16,500 --> 00:16:20,700 Our new regressor is ready without the State independent variable. 307 00:16:20,900 --> 00:16:23,700 So now we have a team of three independent variables. 308 00:16:23,700 --> 00:16:27,400 Wait and see which independent variable is going to be kicked out of the team. 309 00:16:27,900 --> 00:16:32,200 And speaking of that, this is where I'm going to leave you alone for the homework. 310 00:16:32,466 --> 00:16:35,566 But don't worry, you will have the solution in the next tutorial. 311 00:16:35,733 --> 00:16:38,666 But really try to implement this yourself. 312 00:16:38,666 --> 00:16:41,566 Complete backward elimination up to the end, 313 00:16:41,566 --> 00:16:44,133 and it will be fun to see if we get the same results. 314 00:16:44,133 --> 00:16:45,633 And there is actually kind of a decision 315 00:16:45,633 --> 00:16:47,933 to make at the end of backward elimination. 316 00:16:47,933 --> 00:16:51,633 So I'm curious to see how you make that decision, make that call, 317 00:16:51,666 --> 00:16:54,966 because both solutions are actually great 318 00:16:55,333 --> 00:16:58,300 and we will talk about that in the solution. 319 00:16:58,300 --> 00:17:01,466 So good luck with the homework. You're going to see, 320 00:17:01,466 --> 00:17:02,100 it's going to be fine. 321 00:17:02,100 --> 00:17:06,866 So basically all you have to do is to follow this backward elimination slide. 322 00:17:07,200 --> 00:17:10,933 And so together in this tutorial we went up to this step five here. 323 00:17:11,233 --> 00:17:11,700 And now, 324 00:17:11,700 --> 00:17:15,833 as you can see, you have to go back to step three and redo the steps 325 00:17:15,833 --> 00:17:18,833 three, four, five exactly like we just did, 326 00:17:18,833 --> 00:17:24,266 until you find that the highest p value is not higher than the significance level, 327 00:17:24,533 --> 00:17:27,533 and when it's the case, your model will be ready. 
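The repeat of steps three, four and five described for the homework can also be written as a loop. The sketch below is not the tutorial's method, which is manual copy-paste guided by `summary()`: it swaps in `drop1()` with an F test so that each term, including a factor such as State, gets a single p value and is kept or dropped as a whole. For a numeric predictor this p value equals the one in the summary table. The function name `backward_elimination()`, the data and the seed are assumptions, so the result says nothing about the homework's answer on the real data.

```r
# Hypothetical helper: automated backward elimination at the term level.
backward_elimination <- function(data, response, significance_level = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    # Step 2 / step 5: fit the model with the current team of predictors.
    model <- lm(reformulate(predictors, response = response), data = data)

    # Step 3: one p value per term (the first row of drop1() is "<none>").
    term_tests <- drop1(model, test = "F")
    p_values <- setNames(term_tests[["Pr(>F)"]][-1], rownames(term_tests)[-1])
    worst <- names(which.max(p_values))

    # The model is ready when the highest p value is not above the threshold.
    # This sketch also stops at one predictor instead of fitting an empty model.
    if (p_values[[worst]] <= significance_level || length(predictors) == 1) {
      return(model)
    }

    # Step 4: remove the predictor with the highest p value, then loop.
    predictors <- setdiff(predictors, worst)
  }
}

# Simulated stand-in for the whole data set (assumption, not the course data).
set.seed(42)
n <- 50
dataset <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 470000),
  State           = factor(rep(1:3, length.out = n))
)
dataset$Profit <- 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)

final_model <- backward_elimination(dataset, response = "Profit")
kept_terms <- attr(terms(final_model), "term.labels")
print(kept_terms)
```

Doing the steps by hand first is still worthwhile, because the call to make at the end of backward elimination is a judgment that a fixed threshold in a loop does not capture.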
328 00:17:28,233 --> 00:17:31,000 Okay, so I look forward to seeing you in the next tutorial. 329 00:17:31,000 --> 00:17:34,833 I look forward to comparing your results with mine, 330 00:17:35,266 --> 00:17:37,466 and I'm sure everything will be okay. 331 00:17:37,466 --> 00:17:39,566 Backward elimination is very practical 332 00:17:39,566 --> 00:17:43,000 and it will actually be fun and easy to complete this. 333 00:17:43,633 --> 00:17:46,866 So thank you for watching this tutorial and I look forward to seeing you 334 00:17:46,866 --> 00:17:49,000 in the next one for the solution. 335 00:17:49,000 --> 00:17:50,733 Until then, enjoy machine learning.