1
00:00:00,133 --> 00:00:02,800
Hello and welcome to this R tutorial.

2
00:00:02,800 --> 00:00:05,933
In the following tutorials
we will be implementing multiple linear

3
00:00:05,933 --> 00:00:07,000
regression in R.

4
00:00:07,000 --> 00:00:10,133
And right now, as usual,
we are going to start with the basics

5
00:00:10,466 --> 00:00:13,466
which is to set our folder
as working directory.

6
00:00:13,666 --> 00:00:15,933
So right now I'm on my desktop.

7
00:00:15,933 --> 00:00:18,300
I'm going to my Machine
Learning A-Z folder.

8
00:00:18,300 --> 00:00:21,166
Then Part 2, Regression.

9
00:00:21,166 --> 00:00:23,900
And then we want to go to multiple linear
regression.

10
00:00:23,900 --> 00:00:25,333
And here is the folder.

11
00:00:25,333 --> 00:00:28,700
Make sure that you have
the 50_Startups.csv file.

12
00:00:28,933 --> 00:00:32,266
And if that's the case you're ready
to click on this More button here

13
00:00:32,566 --> 00:00:35,533
to set the folder as working directory.

14
00:00:35,533 --> 00:00:36,766
All right.

15
00:00:36,766 --> 00:00:41,666
Now let's start with step one
which is to prepare the data

16
00:00:41,666 --> 00:00:44,900
to make our multiple linear regression
ready to be built.

17
00:00:45,500 --> 00:00:48,066
So as usual
we are going to use our template,

18
00:00:48,066 --> 00:00:51,000
the data pre-processing template
that we made in part one.

19
00:00:51,000 --> 00:00:53,066
And we are just going to copy this

20
00:00:54,066 --> 00:00:57,066
copy and paste it here.

21
00:00:57,466 --> 00:00:58,833
All right.

22
00:00:58,833 --> 00:01:01,666
And now let's take
care of the few things to change.

23
00:01:01,666 --> 00:01:04,666
So first we will change
the name of the data set

24
00:01:04,966 --> 00:01:08,200
which is here 50_Startups.

25
00:01:11,066 --> 00:01:13,633
All right, 50_Startups.csv.

26
00:01:13,633 --> 00:01:18,433
We can select this and execute
to have a look at our data set.

27
00:01:19,233 --> 00:01:20,666
Here it is.

28
00:01:20,666 --> 00:01:22,600
And that's the data set.

29
00:01:22,600 --> 00:01:24,500
I'll remind you what this data set is about.

30
00:01:24,500 --> 00:01:29,800
So this contains information about startups,
actually 50 startups.

31
00:01:30,200 --> 00:01:33,200
And this information
is some amounts of money spent.

32
00:01:33,500 --> 00:01:38,866
So for example, there's the amount spent
on R&D, administration, and marketing.

33
00:01:39,333 --> 00:01:43,766
And there is also the state
in which the startup operates.

34
00:01:44,400 --> 00:01:47,400
And finally we have a last column here
which is the profit.

35
00:01:47,666 --> 00:01:51,000
And that's the profit we want to predict
with our multiple linear

36
00:01:51,000 --> 00:01:51,800
regression model.

37
00:01:51,800 --> 00:01:55,366
And we want to predict that profit
based on these

38
00:01:55,866 --> 00:01:58,500
independent variables
which are the R&D spend,

39
00:01:58,500 --> 00:02:01,500
the administration spend,
the marketing spend, and the state.

40
00:02:01,566 --> 00:02:05,400
So we are doing this because we are doing
a mission for investors

41
00:02:05,400 --> 00:02:09,333
who want to know in which startup
they should invest their money.

42
00:02:09,700 --> 00:02:12,466
And so not only
do they want to predict the future

43
00:02:12,466 --> 00:02:15,466
profits for new startups
based on the same information,

44
00:02:15,733 --> 00:02:17,233
but also they want to see

45
00:02:17,233 --> 00:02:21,000
which independent variable
has the highest effect on the profit

46
00:02:21,266 --> 00:02:24,266
and which one governs the relationship
between the profit

47
00:02:24,300 --> 00:02:26,033
and those independent variables.

48
00:02:26,033 --> 00:02:30,166
Is there an independent variable that has
a higher effect than another one?

49
00:02:30,166 --> 00:02:34,333
Does the state in which the startup
operates have an impact on the profit?

50
00:02:34,700 --> 00:02:38,400
We'll find that out thanks to our multiple
linear regression model in R.

51
00:02:38,400 --> 00:02:40,033
And thanks to this model,

52
00:02:40,033 --> 00:02:44,566
the investors will be able to draw some
insights from our results.

53
00:02:45,700 --> 00:02:46,600
Okay, so now the

54
00:02:46,600 --> 00:02:50,533
next substep of the first step,
data pre-processing, is to split

55
00:02:50,533 --> 00:02:53,533
the data set into the training set
and the test set.

56
00:02:53,800 --> 00:02:56,633
But is this the substep
we need to do right now?

57
00:02:56,633 --> 00:02:57,100
I know that

58
00:02:57,100 --> 00:03:01,466
the template is suggesting that,
but let's not forget that in our data set

59
00:03:01,466 --> 00:03:06,166
we have one specific variable
which should catch our attention.

60
00:03:07,366 --> 00:03:08,300
Well it's this one.

61
00:03:08,300 --> 00:03:09,333
It's the state variable

62
00:03:09,333 --> 00:03:13,866
because it contains categories
which means it's a categorical variable.

63
00:03:14,200 --> 00:03:14,833
And remember

64
00:03:14,833 --> 00:03:18,800
when we have a categorical variable
like this with categories written in text,

65
00:03:19,200 --> 00:03:22,666
this would cause some issues
in our machine learning model equations.

66
00:03:23,266 --> 00:03:24,333
Because how do you want

67
00:03:24,333 --> 00:03:28,233
to make a linear equation
with one of the variables written as text?

68
00:03:28,233 --> 00:03:29,800
It wouldn't make any sense.

69
00:03:29,800 --> 00:03:32,033
So what we're
going to do, of course, is to

70
00:03:33,133 --> 00:03:36,133
encode the state variable.

71
00:03:36,300 --> 00:03:37,166
And to do this

72
00:03:37,166 --> 00:03:41,600
we are going to use what we learned
in part one, data pre-processing.

73
00:03:41,600 --> 00:03:44,933
Only we didn't include that in the template,
because this will actually be

74
00:03:45,033 --> 00:03:49,033
one of the only examples where we'll need
to encode our categorical data.

75
00:03:49,033 --> 00:03:51,100
We put it in a separate file.

76
00:03:51,100 --> 00:03:54,100
And so right now
we are going to open the separate file.