1
00:00:00,133 --> 00:00:02,666
Hello and welcome to this R tutorial.

2
00:00:02,666 --> 00:00:04,000
So in the following tutorials

3
00:00:04,000 --> 00:00:07,500
we're going to implement
a simple linear regression model in R.

4
00:00:07,800 --> 00:00:10,666
So it's going to be the same steps
as in Python.

5
00:00:10,666 --> 00:00:13,200
And let's start with the first step.

6
00:00:13,200 --> 00:00:16,100
So the first step is to actually set
the working directory.

7
00:00:16,100 --> 00:00:18,866
As you can see right now
I'm on my desktop.

8
00:00:18,866 --> 00:00:23,100
So I'm going to go to my Machine
Learning A-Z folder, Part 2 - Regression,

9
00:00:23,700 --> 00:00:26,700
and then Section 4 - Simple Linear
Regression.

10
00:00:27,000 --> 00:00:27,800
And here we are.

11
00:00:27,800 --> 00:00:30,600
That's the folder
we want to set as working directory.

12
00:00:30,600 --> 00:00:33,833
Make sure that it contains
your Salary_Data.csv file.

13
00:00:34,000 --> 00:00:37,600
That's the data on which we will build
our simple linear regression model.

14
00:00:37,633 --> 00:00:39,766
So make sure it is here.

15
00:00:39,766 --> 00:00:41,500
And now, to set this folder as working directory,

16
00:00:41,500 --> 00:00:43,733
you just need to click
on this More button here.

17
00:00:43,733 --> 00:00:46,400
And then click on Set
as Working Directory.

18
00:00:46,400 --> 00:00:48,000
And that's it. That's done.

19
00:00:48,000 --> 00:00:49,266
Now we're ready to start.

20
00:00:49,266 --> 00:00:53,000
We are ready to start with the real first
step of making a machine learning model,

21
00:00:53,200 --> 00:00:56,100
which is the data pre-processing step.

22
00:00:56,100 --> 00:00:57,566
So we're going to use of course

23
00:00:57,566 --> 00:01:00,566
the data pre-processing template
that we made in part one.

24
00:01:00,766 --> 00:01:04,066
So I'm just going to copy
the template here only this

25
00:01:04,566 --> 00:01:07,300
copy and then paste it

26
00:01:08,566 --> 00:01:09,433
here.

27
00:01:09,433 --> 00:01:10,466
All right.

28
00:01:10,466 --> 00:01:14,533
And now we just need to change
a few things to adapt it to our data set.

29
00:01:14,833 --> 00:01:17,700
So of course we will need to change
the name of the data set here.

30
00:01:17,700 --> 00:01:23,833
It is not Data.csv
but Salary_Data.csv. Okay.
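As a rough sketch of this import step in R, assuming the file is named Salary_Data.csv (the exact capitalization is an assumption) and that it sits in the working directory set earlier:
```r
# Hypothetical file name; adjust it to match the file in your folder.
csv_file <- "Salary_Data.csv"
if (file.exists(csv_file)) {
  dataset <- read.csv(csv_file)   # load the CSV into a data frame
  print(head(dataset))            # show the first rows
} else {
  # Guard so the sketch still runs when the file is not in the folder
  message(csv_file, " not found in ", getwd())
}
```
If the message appears, the working directory is probably not the folder that holds the file.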

31
00:01:24,066 --> 00:01:27,200
So then I'm going to select this
to have a look at the data set.

32
00:01:28,000 --> 00:01:30,033
Here we go. Let's have a look.

33
00:01:30,033 --> 00:01:31,833
Here's the data set. Okay.

34
00:01:31,833 --> 00:01:34,200
So just to remind you what this data set
is about: this data

35
00:01:34,200 --> 00:01:38,300
set contains some information about employees
in a company.

36
00:01:38,866 --> 00:01:42,733
And these two pieces of information
are the number of years of experience

37
00:01:42,733 --> 00:01:45,733
the employee has and the salary.

38
00:01:45,800 --> 00:01:48,600
So we are trying to understand
if there is a correlation

39
00:01:48,600 --> 00:01:51,600
between the salary
and the number of years of experience.

40
00:01:51,666 --> 00:01:54,600
And mostly we're trying to see
if it's a linear correlation.

41
00:01:54,600 --> 00:01:55,800
That means if it's a,

42
00:01:55,800 --> 00:01:59,333
that means if there is a linear dependency
between these two variables.

43
00:01:59,800 --> 00:02:03,166
And so what we need to understand is
that the first reflex we must have

44
00:02:03,166 --> 00:02:05,800
when we make a model
is that we must understand

45
00:02:05,800 --> 00:02:09,266
which is the independent variable
and which is the dependent variable.

46
00:02:09,266 --> 00:02:10,966
So the independent variable

47
00:02:10,966 --> 00:02:14,733
is the number of years of experience,
and the dependent variable is the salary.

48
00:02:15,133 --> 00:02:18,133
And so what happens
is that we are trying to predict

49
00:02:18,300 --> 00:02:21,900
the dependent variable salary
based on the information

50
00:02:21,933 --> 00:02:25,900
of the independent
variable, years of experience. Okay.

51
00:02:25,900 --> 00:02:26,966
So that's the data set.

52
00:02:26,966 --> 00:02:28,966
And now let's continue with our model.

53
00:02:28,966 --> 00:02:31,866
So let's go back to a simple linear
regression here.

54
00:02:31,866 --> 00:02:35,100
And we don't need to specify
any column of interest.

55
00:02:35,100 --> 00:02:36,133
We have all we need.

56
00:02:36,133 --> 00:02:39,000
So we won't use this line here. Okay.

57
00:02:39,000 --> 00:02:43,000
Now we are ready to split the data set
into the training set and the test set.

58
00:02:43,200 --> 00:02:46,166
So we perhaps need to change the split
ratio.

59
00:02:46,166 --> 00:02:47,200
Let's see.

60
00:02:47,200 --> 00:02:50,366
The data set contains 30 observations.

61
00:02:50,366 --> 00:02:52,833
So what would be a good split ratio?

62
00:02:52,833 --> 00:02:54,633
It's really as you prefer.

63
00:02:54,633 --> 00:02:58,300
I know that I told you that a good split
ratio is 75%.

64
00:02:58,666 --> 00:03:02,500
But just for the sake of beauty,
let's take 20 observations

65
00:03:02,500 --> 00:03:06,933
in a training set and ten observations
in a test set, so that means

66
00:03:06,933 --> 00:03:10,200
the split ratio would be two thirds.

67
00:03:10,800 --> 00:03:11,266
Okay.

68
00:03:11,266 --> 00:03:15,200
And of course, let's not forget to change
the name of the dependent variable

69
00:03:15,200 --> 00:03:17,833
because this was the name of the data
in the template.

70
00:03:17,833 --> 00:03:19,266
And now let's see what the name is.

71
00:03:19,266 --> 00:03:20,600
The name is salary.

72
00:03:20,600 --> 00:03:23,400
So here you know
that's the name of the dependent variable.

73
00:03:23,400 --> 00:03:26,400
So we need to
change Purchased into Salary.

74
00:03:27,400 --> 00:03:28,533
And now I think it's ready.

75
00:03:28,533 --> 00:03:31,766
We are ready to split the data set
into the training set and the test set.

76
00:03:32,066 --> 00:03:35,066
So let's do it and let's see what happens.

77
00:03:35,900 --> 00:03:36,400
Here we go.

78
00:03:36,400 --> 00:03:38,366
It worked perfectly.

79
00:03:38,366 --> 00:03:41,366
So now let's have a look
at the training set and the test set

80
00:03:42,833 --> 00:03:43,633
okay.

81
00:03:43,633 --> 00:03:47,700
The training set contains the 20
observations generated from the split.

82
00:03:48,000 --> 00:03:51,033
And in the test set
we have our ten observations.
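The two-thirds split described here can be sketched in base R as follows. This is only an illustration: the template from Part 1 may use a different splitting function, the data frame below is made-up placeholder data standing in for the 30-row salary file, and the column name YearsExperience and the seed value are assumptions.
```r
set.seed(123)  # arbitrary seed, only so the sketch is repeatable
# Placeholder data (invented values), not the real salary file
dataset <- data.frame(
  YearsExperience = seq(1, 10.5, length.out = 30),
  Salary = seq(40000, 120000, length.out = 30)
)
n_train <- round(nrow(dataset) * 2 / 3)       # 30 * 2/3 = 20 rows for training
train_rows <- sample(nrow(dataset), n_train)  # 20 distinct random row indices
training_set <- dataset[train_rows, ]         # rows the model will learn from
test_set <- dataset[-train_rows, ]            # the remaining 10 rows
print(c(nrow(training_set), nrow(test_set)))
```
With 30 observations, a ratio of two thirds gives 20 training rows and 10 test rows, matching the counts in the tutorial.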

83
00:03:51,600 --> 00:03:56,000
So we are going to train our simple linear
regression model on the training set.

84
00:03:56,000 --> 00:03:59,333
That means that our model
is going to learn the correlations

85
00:03:59,333 --> 00:04:00,733
between the number of years of experience

86
00:04:00,733 --> 00:04:03,733
and the salary in this
set here in the training set.

87
00:04:04,000 --> 00:04:06,633
And then later
we will test its performance,

88
00:04:06,633 --> 00:04:09,633
its power of prediction on the test set.

89
00:04:09,766 --> 00:04:11,166
So let's continue.

90
00:04:11,166 --> 00:04:14,500
The last step of the data
pre-processing is feature scaling.

91
00:04:14,800 --> 00:04:18,633
But the simple linear regression package
that we are going to use here

92
00:04:18,633 --> 00:04:20,700
in R takes care of this.

93
00:04:20,700 --> 00:04:24,000
So we won't need to apply
feature scaling manually.

94
00:04:24,233 --> 00:04:25,700
So we will be fine with that.

95
00:04:25,700 --> 00:04:29,433
And actually the data pre-processing
phase is ready.

96
00:04:29,966 --> 00:04:31,033
So awesome.

97
00:04:31,033 --> 00:04:34,433
We are ready to start building the linear
regression model.

98
00:04:34,666 --> 00:04:36,466
We are going to do that
in the next tutorial.

99
00:04:36,466 --> 00:04:38,166
So I can't wait to see you there.

100
00:04:38,166 --> 00:04:39,966
And until then, enjoy machine learning.