1 00:00:00,133 --> 00:00:02,800 Hello and welcome to this R tutorial. 2 00:00:02,800 --> 00:00:05,933 In the following tutorials we will be implementing multiple linear 3 00:00:05,933 --> 00:00:07,000 regression in R. 4 00:00:07,000 --> 00:00:10,133 And right now, as usual, we are going to start with the basics, 5 00:00:10,466 --> 00:00:13,466 which is to set our folder as working directory. 6 00:00:13,666 --> 00:00:15,933 So right now I'm on my desktop. 7 00:00:15,933 --> 00:00:18,300 I'm going to my Machine Learning A-Z folder. 8 00:00:18,300 --> 00:00:21,166 Then Part 2, Regression. 9 00:00:21,166 --> 00:00:23,900 And then we want to go to Multiple Linear Regression. 10 00:00:23,900 --> 00:00:25,333 And here is the folder. 11 00:00:25,333 --> 00:00:28,700 Make sure that you have the 50_Startups.csv file. 12 00:00:28,933 --> 00:00:32,266 And if that's the case you're ready to click on this More button here 13 00:00:32,566 --> 00:00:35,533 to set the folder as working directory. 14 00:00:35,533 --> 00:00:36,766 All right. 15 00:00:36,766 --> 00:00:41,666 Now let's start with step one, which is to prepare the data 16 00:00:41,666 --> 00:00:44,900 to get our multiple linear regression model ready to be built. 17 00:00:45,500 --> 00:00:48,066 So as usual we are going to use our template, 18 00:00:48,066 --> 00:00:51,000 the data pre-processing template that we made in part one. 19 00:00:51,000 --> 00:00:53,066 And we are just going to copy this 20 00:00:54,066 --> 00:00:57,066 and paste it here. 21 00:00:57,466 --> 00:00:58,833 All right. 22 00:00:58,833 --> 00:01:01,666 And now let's take care of the few things to change. 23 00:01:01,666 --> 00:01:04,666 So first we will change the name of the data set, 24 00:01:04,966 --> 00:01:08,200 which is here 50_Startups. 25 00:01:11,066 --> 00:01:13,633 All right, 50_Startups.csv. 26 00:01:13,633 --> 00:01:18,433 We can select this and execute to have a look at our data set. 27 00:01:19,233 --> 00:01:20,666 Here it is.
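The import step described above can be sketched in R roughly as follows. This is a minimal illustration, not the course's actual file: the three rows, the column names (`R.D.Spend`, `Administration`, `Marketing.Spend`, `State`, `Profit`) and the state labels are assumptions made up so the snippet runs on its own. With the real folder set as working directory, the single `read.csv` call on the CSV file would be enough.

```r
# Hypothetical stand-in rows so the sketch runs without the course file.
# All values, column names and state labels below are illustrative only.
sample_rows = data.frame(
  R.D.Spend = c(1000, 2000, 3000),
  Administration = c(500, 600, 700),
  Marketing.Spend = c(300, 400, 500),
  State = c('StateA', 'StateB', 'StateC'),
  Profit = c(1500, 2500, 3500)
)

# Write the stand-in data to a temporary CSV file.
csv_path = file.path(tempdir(), '50_Startups.csv')
write.csv(sample_rows, csv_path, row.names = FALSE)

# Importing the dataset. In the tutorial the working directory is set from
# the RStudio Files pane (or with setwd()), so the file name alone would do.
dataset = read.csv(csv_path)

# Have a look at the columns and their types.
str(dataset)
```

Selecting these lines and executing them should give a data frame with one row per startup and one column per variable, which is what appears in the environment pane in the video.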
28 00:01:20,666 --> 00:01:22,600 And that's the data set. 29 00:01:22,600 --> 00:01:24,500 I'll remind you what this data set is about. 30 00:01:24,500 --> 00:01:29,800 So this contains information about startups, actually 50 startups. 31 00:01:30,200 --> 00:01:33,200 And this information consists of some amounts of money spent. 32 00:01:33,500 --> 00:01:38,866 So for example there's the amount spent on R&D, administration and marketing. 33 00:01:39,333 --> 00:01:43,766 And then there is also the state in which the startup operates. 34 00:01:44,400 --> 00:01:47,400 And finally we have a last column here which is the profit. 35 00:01:47,666 --> 00:01:51,000 And that's the profit we want to predict with our multiple linear 36 00:01:51,000 --> 00:01:51,800 regression model. 37 00:01:51,800 --> 00:01:55,366 And we want to predict that profit based on these 38 00:01:55,866 --> 00:01:58,500 independent variables, which are the R&D spend, 39 00:01:58,500 --> 00:02:01,500 the administration spend, the marketing spend and the state. 40 00:02:01,566 --> 00:02:05,400 So we are doing this because we are on a mission for investors 41 00:02:05,400 --> 00:02:09,333 who want to know in which startup they should invest their money. 42 00:02:09,700 --> 00:02:12,466 And so not only do they want to predict the future 43 00:02:12,466 --> 00:02:15,466 profits for new startups based on the same information, 44 00:02:15,733 --> 00:02:17,233 but they also want to see 45 00:02:17,233 --> 00:02:21,000 which independent variable has the highest effect on the profit 46 00:02:21,266 --> 00:02:24,266 and which one governs the relationship between the profit 47 00:02:24,300 --> 00:02:26,033 and those independent variables. 48 00:02:26,033 --> 00:02:30,166 Is there an independent variable that has a higher effect than another one? 49 00:02:30,166 --> 00:02:34,333 Does the state in which the startup operates have an impact on the profit?
50 00:02:34,700 --> 00:02:38,400 We'll find that out thanks to our multiple linear regression model in R. 51 00:02:38,400 --> 00:02:40,033 And thanks to this model, 52 00:02:40,033 --> 00:02:44,566 the investors will be able to draw some insights from our results. 53 00:02:45,700 --> 00:02:46,600 Okay, so now the 54 00:02:46,600 --> 00:02:50,533 next step of the first step, data pre-processing, is to split 55 00:02:50,533 --> 00:02:53,533 the data set into the training set and the test set. 56 00:02:53,800 --> 00:02:56,633 But is this the step we need to do right now? 57 00:02:56,633 --> 00:02:57,100 I know that 58 00:02:57,100 --> 00:03:01,466 the template is suggesting that, but let's not forget that in our data set 59 00:03:01,466 --> 00:03:06,166 we have one specific variable which should catch our attention. 60 00:03:07,366 --> 00:03:08,300 Well, it's this one. 61 00:03:08,300 --> 00:03:09,333 It's the state variable, 62 00:03:09,333 --> 00:03:13,866 because it contains categories, which means it's a categorical variable. 63 00:03:14,200 --> 00:03:14,833 And remember, 64 00:03:14,833 --> 00:03:18,800 when we have a categorical variable like this with categories written in text, 65 00:03:19,200 --> 00:03:22,666 this would cause some issues in our machine learning model equations. 66 00:03:23,266 --> 00:03:24,333 Because how do you want 67 00:03:24,333 --> 00:03:28,233 to make a linear equation with one of the variables written as text? 68 00:03:28,233 --> 00:03:29,800 It wouldn't make any sense. 69 00:03:29,800 --> 00:03:32,033 So what we're going to do, of course, is to 70 00:03:33,133 --> 00:03:36,133 encode the state variable. 71 00:03:36,300 --> 00:03:37,166 And to do this 72 00:03:37,166 --> 00:03:41,600 we are going to use what we learned in part one, data pre-processing.
73 00:03:41,600 --> 00:03:44,933 However, we didn't include that in the template, because this will actually be 74 00:03:45,033 --> 00:03:49,033 one of the only examples where we'll need to encode our categorical data. 75 00:03:49,033 --> 00:03:51,100 We put it in a separate file. 76 00:03:51,100 --> 00:03:54,100 And so right now we are going to open the separate file.
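Encoding a text category column such as the state variable might look like the sketch below. It is an assumption about what the separate file contains, not a copy of it: the tiny data frame, the state labels `StateA`, `StateB`, `StateC` and the numeric labels 1, 2, 3 are all hypothetical. The only library call used is base R's `factor()` with its `levels` and `labels` arguments.

```r
# Hypothetical miniature data set; the state names and profit values
# are made up for illustration and are not taken from 50_Startups.csv.
dataset = data.frame(
  State = c('StateA', 'StateB', 'StateC', 'StateA'),
  Profit = c(1500, 2500, 3500, 1800)
)

# Encoding categorical data: turn the text column into a factor.
# levels lists the categories as they appear in the data;
# labels gives the code that replaces each category, in the same order.
dataset$State = factor(dataset$State,
                       levels = c('StateA', 'StateB', 'StateC'),
                       labels = c(1, 2, 3))

# The column is now a factor, so R treats it as categories, not free text.
print(levels(dataset$State))
print(is.factor(dataset$State))
```

Because the column becomes a factor rather than a plain number, R's `lm()` treats each state as a category and builds the corresponding dummy variables itself when the model is fitted, so the labels 1, 2, 3 are not read as an ordering.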