1
00:00:00,200 --> 00:00:02,100
And now what is this data set about?

2
00:00:02,100 --> 00:00:05,900
Well, that's a classic data
set from actually the UCI Machine

3
00:00:05,900 --> 00:00:08,900
Learning Repository,
which I encourage you to have a look at,

4
00:00:09,066 --> 00:00:10,500
because indeed it is a website

5
00:00:10,500 --> 00:00:13,500
that contains a lot of data
sets on which you can practice.

6
00:00:13,566 --> 00:00:17,300
And this one is actually called
Combined Cycle Power Plant.

7
00:00:17,566 --> 00:00:20,700
And it consists of trying to predict this

8
00:00:20,700 --> 00:00:23,933
dependent variable
which is actually an energy output.

9
00:00:23,966 --> 00:00:26,066
And don't worry,
you don't have to understand

10
00:00:26,066 --> 00:00:29,466
how energy works
or how the physics of this data set works.

11
00:00:29,700 --> 00:00:32,733
The only thing that you need to understand
is that we want to predict

12
00:00:32,733 --> 00:00:36,333
this dependent variable,
which turns out to be an energy output,

13
00:00:36,566 --> 00:00:41,233
and we are predicting this dependent
variable with these four features here,

14
00:00:41,266 --> 00:00:46,566
which are first the engine temperature,
second, the exhaust vacuum,

15
00:00:46,866 --> 00:00:51,233
third, the ambient pressure, and fourth
the relative humidity.

16
00:00:51,433 --> 00:00:53,933
All right.
So that's all that matters here.

17
00:00:53,933 --> 00:00:57,666
You have to see it as, you know,
a general data set where you have

18
00:00:57,666 --> 00:01:02,500
several features that you're going to use
to predict that dependent variable.

19
00:01:02,733 --> 00:01:03,733
And as you can see,

20
00:01:03,733 --> 00:01:06,966
the condition, you know,
in order to deploy our regression

21
00:01:06,966 --> 00:01:10,400
models on this data set and on the future
data sets you'll be working on

22
00:01:10,666 --> 00:01:13,500
is to have in the first columns
the features

23
00:01:13,500 --> 00:01:16,066
and in the last column
the dependent variable.

24
00:01:16,066 --> 00:01:18,066
All right. That's all that matters.

25
00:01:18,066 --> 00:01:22,800
If you have a data set like that which has
no missing data and no categorical data,

26
00:01:22,800 --> 00:01:27,133
well, you can deploy each
and every single one of these regression

27
00:01:27,133 --> 00:01:31,033
models by just having to change
the name of your data set.

28
00:01:31,033 --> 00:01:34,266
And if your data set has missing data
or categorical data,

29
00:01:34,266 --> 00:01:37,033
you just have to go to your data
preprocessing toolkit

30
00:01:37,033 --> 00:01:38,100
to take care of this.

31
00:01:38,100 --> 00:01:40,866
And then you can deploy these models.

32
00:01:40,866 --> 00:01:41,700
All right.

33
00:01:41,700 --> 00:01:45,166
So now, time for the demo.
I'm going to show you
I'm going to show you

34
00:01:45,166 --> 00:01:46,233
how we are going to quickly

35
00:01:46,233 --> 00:01:49,866
and efficiently plug and play
each of these regression templates

36
00:01:50,100 --> 00:01:53,400
by only having to change
the name of the data set.

37
00:01:53,700 --> 00:01:57,800
And then I'll show you
how we will quickly identify and select

38
00:01:57,800 --> 00:02:01,533
the best regression model
for this particular dataset.

39
00:02:01,733 --> 00:02:03,433
All right let's do this.

40
00:02:03,433 --> 00:02:08,500
So our first step here will be to create
a copy of each of these files.

41
00:02:08,500 --> 00:02:10,833
Because these are all in read-only mode

42
00:02:10,833 --> 00:02:13,066
because, you know,
this folder was shared with you.

43
00:02:13,066 --> 00:02:17,366
So since all of you will access it,
you can of course not modify it directly,

44
00:02:17,566 --> 00:02:21,200
but in order to modify it, you just need
to create a copy in your drive.

45
00:02:21,333 --> 00:02:26,566
And to do this, well, we can just do a
right click here and then make a copy.

46
00:02:26,566 --> 00:02:30,300
So we're going to do this
for each of the regression models here.

47
00:02:30,400 --> 00:02:31,333
Let's do this.

48
00:02:31,333 --> 00:02:34,266
Make a copy for multiple linear
regression.

49
00:02:34,266 --> 00:02:36,166
Then make a copy.

50
00:02:36,166 --> 00:02:38,866
Then random forest regression, make a copy.

51
00:02:38,866 --> 00:02:41,566
And finally support vector regression.

52
00:02:41,566 --> 00:02:43,533
And there we go.

53
00:02:43,533 --> 00:02:44,133
All right. Good.

54
00:02:44,133 --> 00:02:46,466
So we made a copy of each of these
regression models.

55
00:02:46,466 --> 00:02:50,333
And the copies
should be either on your main drive

56
00:02:50,333 --> 00:02:53,333
or in this Colab Notebooks folder.

57
00:02:53,466 --> 00:02:56,466
And, well, as you can see,
they are on my main drive.

58
00:02:56,633 --> 00:02:58,733
So you will actually very easily
find them.

59
00:02:58,733 --> 00:03:02,366
And now what we're going to do is open
each of these files

60
00:03:02,633 --> 00:03:05,633
in order to proceed with the demo.

61
00:03:05,700 --> 00:03:06,033
All right.

62
00:03:06,033 --> 00:03:09,033
So I have first multiple linear
regression.

63
00:03:09,300 --> 00:03:11,733
Then I'm going to open
polynomial regression.

64
00:03:11,733 --> 00:03:15,066
You know, in the same order
as the one we used

65
00:03:15,066 --> 00:03:18,900
to build our regression models
then support vector regression.

66
00:03:19,933 --> 00:03:21,100
Once again you can either

67
00:03:21,100 --> 00:03:24,666
open them with Google Colaboratory
or Jupyter Notebook,

68
00:03:24,666 --> 00:03:28,166
or even Spyder in Anaconda,
because I also gave you the folder

69
00:03:28,166 --> 00:03:30,266
containing all these codes
and the data set

70
00:03:30,266 --> 00:03:32,833
right before this tutorial in the article.

71
00:03:32,833 --> 00:03:35,833
So then let's open decision trees

72
00:03:36,000 --> 00:03:40,466
and finally, well,
random forest regression.

73
00:03:40,800 --> 00:03:41,933
All right.

74
00:03:41,933 --> 00:03:44,833
So actually let me put it like that.

75
00:03:44,833 --> 00:03:50,066
You know, the same order: support vector,
decision tree and random forest.

76
00:03:50,066 --> 00:03:50,400
All right.

77
00:03:50,400 --> 00:03:53,900
So now we have all our regression
models open.

78
00:03:54,300 --> 00:03:57,033
I'm first going to show you
the code templates one by one.

79
00:03:57,033 --> 00:03:59,633
And then we will deploy them
on the data set.

80
00:03:59,633 --> 00:04:03,233
And I'll show you how to quickly
figure out which one is the best model.

81
00:04:03,233 --> 00:04:04,066
All right.

82
00:04:04,066 --> 00:04:07,633
So starting with multiple linear
regression let's see the different steps.

83
00:04:07,833 --> 00:04:10,033
So we start by importing the libraries.

84
00:04:10,033 --> 00:04:13,033
Of course that's the first step
of the data preprocessing phase.

85
00:04:13,133 --> 00:04:14,700
Then we import the data set.

86
00:04:14,700 --> 00:04:18,100
And as you can see I made it
super generic, meaning that

87
00:04:18,100 --> 00:04:22,033
the only thing that you have to change is
actually the name of your data set here.

88
00:04:22,033 --> 00:04:25,366
That's why I specified in capital letters
that you can't miss it.

89
00:04:25,600 --> 00:04:30,533
Enter the name of your data set here and
we will actually do that in a few minutes.

90
00:04:30,900 --> 00:04:33,266
Then here
you have nothing to change of course,

91
00:04:33,266 --> 00:04:36,733
because this automatically selects
all the columns except the last one.

92
00:04:36,733 --> 00:04:39,900
Therefore your features
and this automatically selects

93
00:04:40,066 --> 00:04:42,600
the last column
meaning the dependent variable.

94
00:04:42,600 --> 00:04:46,800
All right, then we split the data
set into the training set and the test set.

95
00:04:47,033 --> 00:04:49,400
Of course here
that's very important to do this

96
00:04:49,400 --> 00:04:53,033
because since we want to select
the best model, well, we need this test set

97
00:04:53,166 --> 00:04:54,866
in order to evaluate the performance

98
00:04:54,866 --> 00:04:57,933
of each of them in order to compare it
and select the best one.

99
00:04:58,200 --> 00:05:00,633
So we have to do this step. Absolutely.

100
00:05:00,633 --> 00:05:04,900
Then once we have, well, the training set,
we will train our model

101
00:05:04,900 --> 00:05:06,433
on the training set.

102
00:05:06,433 --> 00:05:09,766
Then we will predict the test results,
you know, to have a look

103
00:05:09,766 --> 00:05:13,566
at the predictions and compare them
to the real results in y_test.

104
00:05:13,733 --> 00:05:17,400
And then finally
we will evaluate the model performance.

105
00:05:17,400 --> 00:05:21,400
And here I don't want to scroll down now
because we will discover together

106
00:05:21,400 --> 00:05:25,100
a bit later
the code to evaluate a regression model.

107
00:05:25,266 --> 00:05:28,166
You know, with the R squared coefficient.

108
00:05:28,166 --> 00:05:28,500
All right.

109
00:05:28,500 --> 00:05:32,800
So that's the code template
for multiple linear regression.

110
00:05:33,033 --> 00:05:36,400
And as I told you
and as you see it is super generic

111
00:05:36,400 --> 00:05:39,300
because for any of your future data
sets, provided

112
00:05:39,300 --> 00:05:42,700
that they have in the first columns
the features and in the last column

113
00:05:42,700 --> 00:05:46,000
the dependent variable, and also provided
that they don't have missing data

114
00:05:46,000 --> 00:05:47,366
or categorical data.

115
00:05:47,366 --> 00:05:49,400
Well,
the only thing that you have to change

116
00:05:49,400 --> 00:05:53,300
within this code template is just to enter
the name of your data set here.

117
00:05:53,300 --> 00:05:54,233
And that's it.

118
00:05:54,233 --> 00:05:57,366
And by just doing this,
you will be able to evaluate your model

119
00:05:57,566 --> 00:05:58,900
with relevant metrics.
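
The remaining steps of the template (split, train, predict, evaluate) might look roughly like this. It is a sketch rather than the course's actual notebook: it assumes scikit-learn, uses synthetic data in place of the real data set, and the 80/20 split and the random seeds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 4 features, and a dependent variable
# that is a noisy linear combination of them (coefficients are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Hold out a test set so each model can be scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training set, then predict the test set results.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# R squared compares the predictions with the real values in y_test.
score = r2_score(y_test, y_pred)
print(f"R squared on the test set: {score:.3f}")
```

Swapping LinearRegression for another scikit-learn regressor leaves the split and the scoring lines unchanged, which is what makes the test-set scores comparable across models.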