1
00:00:00,100 --> 00:00:01,700
Hello my friends, and welcome

2
00:00:01,700 --> 00:00:06,633
to this new practical activity
on multiple linear regression.

3
00:00:06,900 --> 00:00:08,033
So in this new section

4
00:00:08,033 --> 00:00:12,000
we're going to learn together how to build
a multiple linear regression model

5
00:00:12,233 --> 00:00:16,166
on the same data set that was introduced
by Kirill in the previous lectures.

6
00:00:16,466 --> 00:00:18,166
And just before we start,

7
00:00:18,166 --> 00:00:21,766
I just want to make sure
that everyone here is on the same page.

8
00:00:22,100 --> 00:00:26,266
This is the whole machine learning dataset
folder containing all the codes

9
00:00:26,366 --> 00:00:27,466
and data sets.

10
00:00:27,466 --> 00:00:31,666
And right before this tutorial,
I give you again the link to this folder.

11
00:00:31,800 --> 00:00:33,733
So make sure to connect to that link.

12
00:00:33,733 --> 00:00:37,966
And now we should all be on the same page
ready to start this new machine

13
00:00:37,966 --> 00:00:41,233
learning model,
which is of course in Part 2, Regression.

14
00:00:41,500 --> 00:00:45,233
And now we're going to go, of course,
to the Multiple Linear Regression folder.

15
00:00:45,500 --> 00:00:48,866
And we're going to start with Python
to implement this model.

16
00:00:49,100 --> 00:00:51,366
All right so this is the data set.

17
00:00:51,366 --> 00:00:55,900
And this is the Python
implementation in .ipynb format,

18
00:00:55,900 --> 00:01:00,333
which you can open with either Google
Colaboratory or Jupyter Notebook.

19
00:01:00,333 --> 00:01:02,100
Make sure you also have the folder

20
00:01:02,100 --> 00:01:05,466
downloaded on your machine
so that you can indeed get these files.

21
00:01:06,000 --> 00:01:06,300
All right.

22
00:01:06,300 --> 00:01:07,800
So before we start the implementation,

23
00:01:07,800 --> 00:01:10,800
let me just explain again
what this data set is about.

24
00:01:10,933 --> 00:01:14,500
So remember, a venture
capital fund hired you

25
00:01:14,500 --> 00:01:18,333
as a data scientist
to train a machine learning model.

26
00:01:18,333 --> 00:01:22,200
And actually a multiple linear
regression model to understand

27
00:01:22,200 --> 00:01:24,700
the correlations between these features,

28
00:01:24,700 --> 00:01:28,533
which are the spend in R&D,
administration and marketing.

29
00:01:28,533 --> 00:01:33,700
As well as the state
and the profit of, well, 50 startups.

30
00:01:33,900 --> 00:01:35,033
So in this data set, it's

31
00:01:35,033 --> 00:01:39,033
very important to understand that
each row corresponds to a certain startup.

32
00:01:39,033 --> 00:01:43,766
And for each startup, well, you, the data
scientist, collected the following data:

33
00:01:43,766 --> 00:01:47,166
R&D spend, administration
spend, marketing spend

34
00:01:47,166 --> 00:01:48,900
and the state of the startups.

35
00:01:48,900 --> 00:01:51,000
And of course their profit.

36
00:01:51,000 --> 00:01:53,766
Because the goal for this VC fund

37
00:01:53,766 --> 00:01:58,466
is to figure out in which startup
to invest based on this information.

38
00:01:58,633 --> 00:02:02,733
So this is all information
that we already know from 50 startups.

39
00:02:02,966 --> 00:02:06,666
And therefore, if you manage to train a
machine learning model that can understand

40
00:02:06,666 --> 00:02:09,866
well these correlations, well,
for the next step,

41
00:02:09,900 --> 00:02:12,900
you'll be able to deploy this model
on these features

42
00:02:12,933 --> 00:02:16,733
to predict what sort of profit
this new startup might generate.

43
00:02:16,800 --> 00:02:17,500
Okay.

44
00:02:17,500 --> 00:02:20,933
So for this fund, you definitely
want to build an accurate model.

45
00:02:21,533 --> 00:02:21,900
All right.

46
00:02:21,900 --> 00:02:24,133
And now we can start the implementation.

47
00:02:24,133 --> 00:02:25,500
But before we start

48
00:02:25,500 --> 00:02:26,966
I would like you to figure out

49
00:02:26,966 --> 00:02:30,300
what are going to be the first steps
of this implementation.

50
00:02:30,300 --> 00:02:32,333
You know, before I show it to you.

51
00:02:32,333 --> 00:02:32,633
All right.

52
00:02:32,633 --> 00:02:34,633
So first,
I hope that the first thing that came to

53
00:02:34,633 --> 00:02:38,766
your mind is that indeed the first step
is the data preprocessing phase.

54
00:02:39,000 --> 00:02:43,000
And remember, in the data preprocessing
phase we start by importing the libraries.

55
00:02:43,033 --> 00:02:44,100
That's for sure.

56
00:02:44,100 --> 00:02:45,833
Then we're going to import the data set.

57
00:02:45,833 --> 00:02:47,433
That's even more for sure.

58
00:02:47,433 --> 00:02:50,966
And then we will split the data set
into the training set and the test set.

59
00:02:51,200 --> 00:02:54,733
Because indeed
we want to train our model on one set

60
00:02:54,733 --> 00:02:57,733
and evaluate its performance
on a separate set.

61
00:02:57,866 --> 00:02:59,966
Okay. So that's always required.

62
00:02:59,966 --> 00:03:03,566
But then, is there
something else that we have to do here?

63
00:03:03,933 --> 00:03:06,666
Well, to answer this question
let's have a look

64
00:03:06,666 --> 00:03:09,766
at the columns
one by one, starting with R&D spend.

65
00:03:09,800 --> 00:03:13,500
So R&D Spend is a numerical column,
you know, containing numerical values.

66
00:03:13,800 --> 00:03:16,466
And when we scroll down
you know we can scroll down

67
00:03:16,466 --> 00:03:20,000
because there are only 50 observations
corresponding to 50 startups.

68
00:03:20,200 --> 00:03:22,800
And we can see that here
there is no missing data.

69
00:03:22,800 --> 00:03:26,733
So all good. Then, the second column,
the administration spend,

70
00:03:26,766 --> 00:03:30,266
you know, all the spending on administration,
like paying employee salaries

71
00:03:30,300 --> 00:03:31,800
or anything else.

72
00:03:31,800 --> 00:03:35,100
So this column is once again
numerical with numerical values.

73
00:03:35,100 --> 00:03:38,166
And there is once again no missing data.

74
00:03:38,466 --> 00:03:39,033
Perfect.

75
00:03:39,033 --> 00:03:42,666
So, so far our three steps of the data
preprocessing template

76
00:03:42,833 --> 00:03:45,966
are good. The next one,
the marketing spend column.

77
00:03:46,200 --> 00:03:49,633
Well same numerical column
with no missing data.

78
00:03:49,633 --> 00:03:50,833
All good.

79
00:03:50,833 --> 00:03:52,466
And now this column.

80
00:03:52,466 --> 00:03:53,966
You know the last feature.

81
00:03:53,966 --> 00:03:54,833
Notice once again

82
00:03:54,833 --> 00:03:57,166
that all the features
are in the first columns

83
00:03:57,166 --> 00:04:00,266
and the dependent variable which you want
to predict in the last column.

84
00:04:00,500 --> 00:04:04,900
So back to this State feature,
what reflex do you have in mind?

85
00:04:04,900 --> 00:04:06,600
Now you should have the reflex.

86
00:04:06,600 --> 00:04:09,900
If you paid attention to Part 1,
Data Preprocessing.

87
00:04:10,166 --> 00:04:13,833
Basically, the question I'm asking now is
do we need to apply

88
00:04:13,833 --> 00:04:16,833
a certain tool of our data
preprocessing toolkit,

89
00:04:17,066 --> 00:04:20,366
which we built in Part 1,
to this data set?

90
00:04:20,766 --> 00:04:23,766
Well, here the answer is obviously yes,

91
00:04:24,000 --> 00:04:28,166
because indeed this state
column is not numerical.

92
00:04:28,200 --> 00:04:30,100
It actually has some categories.

93
00:04:30,100 --> 00:04:34,433
It has three categories which are New
York, California and Florida.

94
00:04:34,433 --> 00:04:35,366
And therefore

95
00:04:35,366 --> 00:04:39,000
that's exactly the same situation
as in Part 1, Data Preprocessing.

96
00:04:39,266 --> 00:04:42,366
There is no order relationship
between these

97
00:04:42,366 --> 00:04:45,366
three states
New York, California and Florida.

98
00:04:45,633 --> 00:04:50,733
And therefore we will have to apply
one-hot encoding to that State column,

99
00:04:51,033 --> 00:04:55,066
and therefore we'll have to grab a tool
of our data preprocessing toolkit.

100
00:04:55,066 --> 00:04:59,766
And that's why I prepared it here,
in order to indeed one-hot encode

101
00:04:59,966 --> 00:05:02,966
that categorical variable,
the state variable.

102
00:05:03,000 --> 00:05:03,800
All right.

103
00:05:03,800 --> 00:05:05,100
And then the profit is fine.

104
00:05:05,100 --> 00:05:06,600
It is numerical.

105
00:05:06,600 --> 00:05:09,533
And besides there is once again
no missing data.

106
00:05:09,533 --> 00:05:11,900
So that's what, you know, you must do.

107
00:05:11,900 --> 00:05:14,833
First
you need to have a look at your data set.

108
00:05:14,833 --> 00:05:17,700
If it is not too long, you can check that
there is no missing data

109
00:05:17,700 --> 00:05:18,833
like we just did.

110
00:05:18,833 --> 00:05:22,600
If it is too long,
then I recommend that you apply your data

111
00:05:22,600 --> 00:05:26,700
preprocessing tool that handles missing
data and deploy it on this data set.

112
00:05:27,033 --> 00:05:30,633
And then you must check
if any feature is categorical.

113
00:05:30,633 --> 00:05:32,800
And here we could check that very easily.

114
00:05:32,800 --> 00:05:34,066
The state is categorical.

115
00:05:34,066 --> 00:05:37,300
So we're going to apply our one-hot
encoding tool

116
00:05:37,300 --> 00:05:40,300
of our data preprocessing toolkit
on this state column.

117
00:05:40,366 --> 00:05:43,800
And then of course
we will apply all the rest of the three

118
00:05:43,800 --> 00:05:46,800
essential steps of our data
preprocessing template.

119
00:05:46,800 --> 00:05:50,733
And once again we will do that
in a flash, because this is a template

120
00:05:50,733 --> 00:05:54,233
where we only have one thing to change,
which is the name of the dataset.