1 00:00:00,133 --> 00:00:02,666 Hello and welcome to this R tutorial. 2 00:00:02,666 --> 00:00:04,000 So in the following tutorials 3 00:00:04,000 --> 00:00:07,500 we're going to implement a simple linear regression model in R. 4 00:00:07,800 --> 00:00:10,666 So it's going to be the same steps as in Python. 5 00:00:10,666 --> 00:00:13,200 And let's start with the first step. 6 00:00:13,200 --> 00:00:16,100 So the first step is to actually set the working directory. 7 00:00:16,100 --> 00:00:18,866 As you can see, right now I'm on my desktop. 8 00:00:18,866 --> 00:00:23,100 So I'm going to go to my Machine Learning A-Z folder, Part 2, Regression, 9 00:00:23,700 --> 00:00:26,700 and then Section 4, Simple Linear Regression. 10 00:00:27,000 --> 00:00:27,800 And here we are. 11 00:00:27,800 --> 00:00:30,600 That's the folder we want to set as working directory. 12 00:00:30,600 --> 00:00:33,833 Make sure that it contains your Salary_Data.csv file. 13 00:00:34,000 --> 00:00:37,600 That's the data on which we will build our simple linear regression model. 14 00:00:37,633 --> 00:00:39,766 So make sure it is here. And now, 15 00:00:39,766 --> 00:00:41,500 to set this folder as working directory, 16 00:00:41,500 --> 00:00:43,733 you just need to click on this More button here. 17 00:00:43,733 --> 00:00:46,400 And then click on Set as Working Directory. 18 00:00:46,400 --> 00:00:48,000 And that's it. That's done. 19 00:00:48,000 --> 00:00:49,266 Now we're ready to start. 20 00:00:49,266 --> 00:00:53,000 We are ready to start with the real first step of making a machine learning model, 21 00:00:53,200 --> 00:00:56,100 which is the data pre-processing step. 22 00:00:56,100 --> 00:00:57,566 So we're going to use, of course, 23 00:00:57,566 --> 00:01:00,566 the data pre-processing template that we made in Part 1. 
24 00:01:00,766 --> 00:01:04,066 So I'm just going to copy the template here, only this: 25 00:01:04,566 --> 00:01:07,300 copy and then paste it 26 00:01:08,566 --> 00:01:09,433 here. 27 00:01:09,433 --> 00:01:10,466 All right. 28 00:01:10,466 --> 00:01:14,533 And now we just need to change a few things to adapt it to our data set. 29 00:01:14,833 --> 00:01:17,700 So of course we will need to change the name of the data set here. 30 00:01:17,700 --> 00:01:23,833 It is not Data.csv but Salary_Data. Okay. 31 00:01:24,066 --> 00:01:27,200 So then I'm going to select this to have a look at the data set. 32 00:01:28,000 --> 00:01:30,033 Here we go. Let's have a look. 33 00:01:30,033 --> 00:01:31,833 Here's the data set. Okay. 34 00:01:31,833 --> 00:01:34,200 So just as a reminder of what this data set is about: this data 35 00:01:34,200 --> 00:01:38,300 set contains some information about employees in a company. 36 00:01:38,866 --> 00:01:42,733 And the two pieces of information are the number of years of experience 37 00:01:42,733 --> 00:01:45,733 the employee has and the salary. 38 00:01:45,800 --> 00:01:48,600 So we are trying to understand if there is a correlation 39 00:01:48,600 --> 00:01:51,600 between the salary and the number of years of experience. 40 00:01:51,666 --> 00:01:54,600 And mostly we're trying to see if it's a linear correlation. 41 00:01:54,600 --> 00:01:55,800 That is, 42 00:01:55,800 --> 00:01:59,333 whether there is a linear dependency between these two variables. 43 00:01:59,800 --> 00:02:03,166 And so what we need to understand is that the first reflex that we must have 44 00:02:03,166 --> 00:02:05,800 when we make a model is that we must understand 45 00:02:05,800 --> 00:02:09,266 which is the independent variable and which is the dependent variable. 46 00:02:09,266 --> 00:02:10,966 So the independent variable 47 00:02:10,966 --> 00:02:14,733 is the number of years of experience, and the dependent variable is the salary. 
48 00:02:15,133 --> 00:02:18,133 And so what happens is that we are trying to predict 49 00:02:18,300 --> 00:02:21,900 the dependent variable, salary, based on the information 50 00:02:21,933 --> 00:02:25,900 of the independent variable, years of experience. Okay. 51 00:02:25,900 --> 00:02:26,966 So that's the data set. 52 00:02:26,966 --> 00:02:28,966 And now let's continue with our model. 53 00:02:28,966 --> 00:02:31,866 So let's go back to simple linear regression here. 54 00:02:31,866 --> 00:02:35,100 And we don't need to specify any column of interest. 55 00:02:35,100 --> 00:02:36,133 We have all we need. 56 00:02:36,133 --> 00:02:39,000 So we won't use this line here. Okay. 57 00:02:39,000 --> 00:02:43,000 Now we are ready to split the data set into the training set and the test set. 58 00:02:43,200 --> 00:02:46,166 So we perhaps need to change the split ratio. 59 00:02:46,166 --> 00:02:47,200 Let's see. 60 00:02:47,200 --> 00:02:50,366 The data set contains 30 observations. 61 00:02:50,366 --> 00:02:52,833 So what would be a good split ratio? 62 00:02:52,833 --> 00:02:54,633 It's really as you prefer. 63 00:02:54,633 --> 00:02:58,300 I know that I told you that a good split ratio is 75%. 64 00:02:58,666 --> 00:03:02,500 But just for the sake of beauty, let's take 20 observations 65 00:03:02,500 --> 00:03:06,933 in the training set and ten observations in the test set, so that means 66 00:03:06,933 --> 00:03:10,200 the split ratio would be two thirds. 67 00:03:10,800 --> 00:03:11,266 Okay. 68 00:03:11,266 --> 00:03:15,200 And of course, let's not forget to change the name of the dependent variable 69 00:03:15,200 --> 00:03:17,833 because this was the name of the data in the template. 70 00:03:17,833 --> 00:03:19,266 And now let's see what the name is. 71 00:03:19,266 --> 00:03:20,600 The name is Salary. 72 00:03:20,600 --> 00:03:23,400 So here, you know, that's the name of the dependent variable. 
73 00:03:23,400 --> 00:03:26,400 So we need to change Purchased into Salary. 74 00:03:27,400 --> 00:03:28,533 And now I think it's ready. 75 00:03:28,533 --> 00:03:31,766 We are ready to split the data set into the training set and the test set. 76 00:03:32,066 --> 00:03:35,066 So let's do it and let's see what happens. 77 00:03:35,900 --> 00:03:36,400 Here we go. 78 00:03:36,400 --> 00:03:38,366 It worked perfectly. 79 00:03:38,366 --> 00:03:41,366 So now let's have a look at the training set and the test set. 80 00:03:42,833 --> 00:03:43,633 Okay. 81 00:03:43,633 --> 00:03:47,700 The training set contains the 20 observations generated from the split. 82 00:03:48,000 --> 00:03:51,033 And in the test set we have our ten observations. 83 00:03:51,600 --> 00:03:56,000 So we are going to train our simple linear regression model on the training set. 84 00:03:56,000 --> 00:03:59,333 That means that our model is going to learn the correlations 85 00:03:59,333 --> 00:04:00,733 between the number of years of experience 86 00:04:00,733 --> 00:04:03,733 and the salary in this set here, in the training set. 87 00:04:04,000 --> 00:04:06,633 And then later we will test its performance, 88 00:04:06,633 --> 00:04:09,633 its power of prediction, on the test set. 89 00:04:09,766 --> 00:04:11,166 So let's continue. 90 00:04:11,166 --> 00:04:14,500 The last step of the data pre-processing is feature scaling. 91 00:04:14,800 --> 00:04:18,633 But the simple linear regression package that we are going to use here 92 00:04:18,633 --> 00:04:20,700 in R takes care of this. 93 00:04:20,700 --> 00:04:24,000 So we won't need to apply feature scaling manually. 94 00:04:24,233 --> 00:04:25,700 So we will be fine with that. 95 00:04:25,700 --> 00:04:29,433 And actually the data pre-processing phase is ready. 96 00:04:29,966 --> 00:04:31,033 So awesome. 97 00:04:31,033 --> 00:04:34,433 We are ready to start building the linear regression model. 
98 00:04:34,666 --> 00:04:36,466 We are going to do that in the next tutorial. 99 00:04:36,466 --> 00:04:38,166 So I can't wait to see you there. 100 00:04:38,166 --> 00:04:39,966 And until then, enjoy machine learning.
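
The split described in this tutorial (30 observations, a split ratio of two thirds, giving 20 training rows and 10 test rows) can be sketched in base R as below. This is only an illustration under stated assumptions: the transcript does not show the template's actual splitting call, so `split_by_ratio` is a hypothetical helper, the seed value is arbitrary, and the `demo` data frame is a stand-in that carries row numbers rather than the real salary data. Unlike the template described in the video, this helper does not take the dependent variable as an argument.

```r
# Hypothetical helper (not from the video): randomly split the rows of a
# data frame into a training set and a test set, using base R only.
split_by_ratio <- function(dataset, split_ratio = 2/3, seed = 123) {
  set.seed(seed)                                    # arbitrary seed, for reproducibility
  n <- nrow(dataset)
  n_train <- round(n * split_ratio)                 # number of training rows
  train_rows <- sample(seq_len(n), size = n_train)  # random row indices for training
  list(
    training_set = dataset[train_rows, , drop = FALSE],
    test_set     = dataset[-train_rows, , drop = FALSE]
  )
}

# Stand-in data frame with 30 rows. It holds row numbers only, not the
# real figures. With the real file one would read the CSV instead, for
# example read.csv("Salary_Data.csv") (exact file name is an assumption).
demo <- data.frame(row_id = 1:30)
parts <- split_by_ratio(demo, split_ratio = 2/3)

nrow(parts$training_set)  # size of the training set
nrow(parts$test_set)      # size of the test set
```

With a split ratio of 2/3 on 30 rows, the helper puts `round(30 * 2/3)` rows in the training set and the remaining rows in the test set, so no row appears in both.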