1 00:00:00,133 --> 00:00:02,633 Hello and welcome to this R tutorial. 2 00:00:02,633 --> 00:00:04,100 So today in this tutorial, 3 00:00:04,100 --> 00:00:07,566 we are ready to fit multiple linear regression to the training set. 4 00:00:07,933 --> 00:00:11,300 So as for simple linear regression, the first thing that we're going to do 5 00:00:11,300 --> 00:00:14,833 is to introduce the multiple linear regressor. 6 00:00:15,200 --> 00:00:18,200 And we're going to call it regressor. 7 00:00:18,233 --> 00:00:21,733 And then, as for simple linear regression, we're going to use the lm function, 8 00:00:22,000 --> 00:00:23,000 here, lm. 9 00:00:23,000 --> 00:00:27,333 And in parentheses we're going to input the parameters, okay. 10 00:00:27,366 --> 00:00:31,166 So let's have a look at the lm function here by pressing F1. 11 00:00:31,266 --> 00:00:33,566 Here we go. Fitting Linear Models. 12 00:00:33,566 --> 00:00:34,733 Let's look at our arguments. 13 00:00:34,733 --> 00:00:37,233 So that's the same as for simple linear regression. 14 00:00:37,233 --> 00:00:39,000 Actually the first argument is formula. 15 00:00:39,000 --> 00:00:43,466 But this time you're going to see how we will need to change the formula syntax. 16 00:00:43,666 --> 00:00:47,200 And then the second compulsory argument to input is the data. 17 00:00:47,433 --> 00:00:49,800 And of course it's going to be the training set. 18 00:00:49,800 --> 00:00:52,333 So let's first input the formula. 19 00:00:52,333 --> 00:00:54,133 So formula here. 20 00:00:54,133 --> 00:00:56,466 So remember, in simple linear regression 21 00:00:56,466 --> 00:01:01,100 we wrote the formula salary proportional to years of experience. 22 00:01:01,266 --> 00:01:03,533 And here it's almost going to be the same. 23 00:01:03,533 --> 00:01:07,066 But the difference is that we have several independent variables. 24 00:01:07,300 --> 00:01:10,100 We have four independent variables.
25 00:01:10,100 --> 00:01:12,900 In simple linear regression the dependent variable 26 00:01:12,900 --> 00:01:16,366 was proportional to the only independent variable. 27 00:01:16,866 --> 00:01:18,633 Well, here it's going to be the same. 28 00:01:18,633 --> 00:01:22,533 The profit is going to be proportional to these independent variables. 29 00:01:23,000 --> 00:01:27,700 Only, the correct way to say it is not proportional, but that the profit 30 00:01:27,900 --> 00:01:32,166 is going to be a linear combination of the independent variables. 31 00:01:32,700 --> 00:01:36,200 So formula equals Profit, 32 00:01:37,200 --> 00:01:38,733 tilde. 33 00:01:38,733 --> 00:01:42,400 And here the idea is to add all the independent variables 34 00:01:42,400 --> 00:01:44,466 separated with a plus sign. 35 00:01:44,466 --> 00:01:46,566 So let's look at them. 36 00:01:46,566 --> 00:01:49,766 It's R.D.Spend. 37 00:01:50,700 --> 00:01:53,633 I'm using the dot here because in the data set 38 00:01:53,633 --> 00:01:56,633 we have some spaces, and R replaced them with dots. 39 00:01:57,100 --> 00:01:59,233 So here it's a dot. Then 40 00:01:59,233 --> 00:02:03,600 plus, and then the other variable, which is Administration, 41 00:02:05,033 --> 00:02:08,000 plus Marketing.Spend, 42 00:02:08,000 --> 00:02:09,966 plus State. 43 00:02:09,966 --> 00:02:11,133 And that's all. 44 00:02:11,133 --> 00:02:13,900 So that's how you express the profit 45 00:02:13,900 --> 00:02:17,500 as a linear combination of all these independent variables. 46 00:02:17,700 --> 00:02:22,200 However, there is a trick to write this formula in a much more efficient way. 47 00:02:22,633 --> 00:02:25,700 Instead of writing all the independent variables here, 48 00:02:26,066 --> 00:02:28,700 we can simply write a dot. 49 00:02:28,700 --> 00:02:30,200 That's what we can simply write. 50 00:02:30,200 --> 00:02:32,366 And that's the exact same equation.
51 00:02:32,366 --> 00:02:35,266 R understands that you want to express the profit 52 00:02:35,266 --> 00:02:38,733 as a linear combination of all the independent variables. 53 00:02:39,033 --> 00:02:42,800 So the dot here just means all the independent variables, okay. 54 00:02:42,800 --> 00:02:44,900 So that's for the first argument. Formula equals 55 00:02:44,900 --> 00:02:48,000 profit as a linear combination of all the independent variables. 56 00:02:48,300 --> 00:02:49,933 And the second argument is data. 57 00:02:49,933 --> 00:02:53,833 And as we said, it is of course the training set, 58 00:02:54,766 --> 00:02:58,533 because we want to train our multiple linear regression model 59 00:02:58,533 --> 00:03:02,200 on the training set and then test the performance on the test set later. 60 00:03:02,700 --> 00:03:03,133 All right. 61 00:03:03,133 --> 00:03:07,233 So that's it for building our multiple linear regressor. 62 00:03:07,433 --> 00:03:11,300 So we're going to select this and execute. 63 00:03:11,700 --> 00:03:12,333 Here we go. 64 00:03:12,333 --> 00:03:15,600 And now there is something very important to do. 65 00:03:15,833 --> 00:03:21,633 And this is something very practical, very useful in R: we are going to 66 00:03:21,666 --> 00:03:25,800 look at the information about the regressor that we just built here. 67 00:03:26,200 --> 00:03:28,500 And to do this it's really, really simple. 68 00:03:28,500 --> 00:03:32,100 It's actually like simple linear regression, only 69 00:03:32,100 --> 00:03:33,466 this time it's going to be more interesting 70 00:03:33,466 --> 00:03:35,400 because we will have several independent variables, 71 00:03:35,400 --> 00:03:38,400 and some of them will have a stronger effect on the dependent variable. 72 00:03:38,600 --> 00:03:40,733 So we'll see. It's going to be more interesting.
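The regressor built above can be sketched as follows. This is a minimal illustration, not the course's own script: the column names follow the transcript, but the `training_set` values here are randomly generated as a hypothetical stand-in for the real data, and `regressor_explicit` is a name introduced only for the comparison.

```r
# Hypothetical stand-in for the training set: column names follow the
# transcript, values are random and for illustration only.
set.seed(42)
n <- 40
training_set <- data.frame(
  R.D.Spend       = runif(n, 0, 165000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 470000),
  State           = factor(sample(rep(1:3, length.out = n)), levels = c(1, 2, 3))
)
training_set$Profit <- 50000 + 0.8 * training_set$R.D.Spend +
  0.03 * training_set$Marketing.Spend + rnorm(n, sd = 5000)

# Explicit formula: every independent variable joined with a plus sign.
regressor_explicit <- lm(
  formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
  data = training_set
)

# Shorthand: the dot stands for every column of `data` other than Profit.
regressor <- lm(formula = Profit ~ ., data = training_set)

# Both calls describe the same model, so the coefficients should agree.
print(all.equal(coef(regressor_explicit), coef(regressor)))
```

The dot only works as intended when every remaining column of the data frame is meant to be a predictor; with extra columns (an ID, for example) the explicit form is the safer choice.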
73 00:03:40,733 --> 00:03:43,900 And so, like in simple linear regression, we're going to write summary 74 00:03:44,566 --> 00:03:45,900 of our regressor. 75 00:03:47,100 --> 00:03:47,700 All right. 76 00:03:47,700 --> 00:03:50,266 And just press enter. 77 00:03:50,266 --> 00:03:53,266 And that's it. I'm just going to move that up. 78 00:03:54,566 --> 00:03:58,933 And here we have all the info of our regressor, okay. 79 00:03:58,933 --> 00:04:02,800 So here the first thing is just a reminder of what the formula is. 80 00:04:02,800 --> 00:04:05,800 So the profit is expressed as a linear combination 81 00:04:05,933 --> 00:04:07,900 of all the independent variables. 82 00:04:07,900 --> 00:04:10,900 And our regressor is trained on the training set. 83 00:04:11,133 --> 00:04:13,266 Okay. Perfect. Residuals: 84 00:04:13,266 --> 00:04:16,266 we will be talking about that at the end of this part. 85 00:04:16,433 --> 00:04:18,866 So let's not focus on that right now. 86 00:04:18,866 --> 00:04:21,066 However, that's the important part. 87 00:04:21,066 --> 00:04:22,766 The coefficients here. 88 00:04:22,766 --> 00:04:24,700 That's what we need to focus on right now. 89 00:04:24,700 --> 00:04:27,900 Because as you can see, it gives some info 90 00:04:27,900 --> 00:04:30,900 for each of your independent variables. 91 00:04:30,933 --> 00:04:32,533 Oh, by the way, State2 and State3 92 00:04:32,533 --> 00:04:34,733 here are the dummy variables. 93 00:04:34,733 --> 00:04:38,300 And that's because, you know, I told you that the library 94 00:04:38,300 --> 00:04:40,200 takes care of everything for you. 95 00:04:40,200 --> 00:04:42,966 Not only did it create the dummy variables, R 96 00:04:42,966 --> 00:04:47,266 knew that it had to create dummy variables for the state variable, and we helped R 97 00:04:47,266 --> 00:04:51,166 to understand that because we encoded the state variable as a factor.
98 00:04:51,533 --> 00:04:56,466 So R did the whole job for you, because not only did it create the dummy 99 00:04:56,466 --> 00:05:00,600 variables for the state variable, but also R didn't fall into the dummy 100 00:05:00,600 --> 00:05:03,633 variable trap, because as you can see, it automatically removed 101 00:05:03,633 --> 00:05:08,700 one of the dummy variables to avoid a redundant dependency. 102 00:05:08,866 --> 00:05:09,866 So that's perfect. 103 00:05:09,866 --> 00:05:13,833 That's the beauty of R libraries, as well as Python libraries. In Python, 104 00:05:13,833 --> 00:05:17,300 I know we added a new section to remove the dummy variable, 105 00:05:17,300 --> 00:05:19,266 but that was just to remind you about the trap. 106 00:05:19,266 --> 00:05:20,566 We don't need to do it. 107 00:05:20,566 --> 00:05:22,533 R and Python take care of that for you.
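To see the dummy-variable handling described above without the course data, here is a small sketch. It assumes a hypothetical `training_set` with random values and only two predictors, so the numbers mean nothing; only the coefficient names matter. `coef_table` is a name introduced here for illustration.

```r
# Hypothetical data: a numeric predictor plus State encoded as a factor
# with three levels, as in the transcript. Values are random.
set.seed(1)
n <- 30
training_set <- data.frame(
  R.D.Spend = runif(n, 0, 165000),
  State     = factor(rep(1:3, length.out = n), levels = c(1, 2, 3))
)
training_set$Profit <- 50000 + 0.8 * training_set$R.D.Spend + rnorm(n, sd = 5000)

regressor <- lm(formula = Profit ~ ., data = training_set)

# State has three levels, but only State2 and State3 get a coefficient.
# With R's default treatment contrasts, level 1 is the baseline and is
# absorbed into the intercept, which is how the dummy variable trap is avoided.
print(levels(training_set$State))
print(names(coef(regressor)))

# The coefficient table printed by summary(regressor) can also be pulled
# out as a matrix, with one row per term.
coef_table <- summary(regressor)$coefficients
print(colnames(coef_table))
print(rownames(coef_table))
```

Each State coefficient is then read relative to the baseline level, not as an effect on its own.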