1
00:00:00,033 --> 00:00:04,633
Hello my friends, and welcome to this new
section where you will finally learn

2
00:00:04,633 --> 00:00:10,333
how to evaluate your regression models
and, above all, how to select the best one.

3
00:00:10,533 --> 00:00:14,700
All right, so this is indeed a section
that has been long awaited.

4
00:00:14,900 --> 00:00:19,300
Because indeed, throughout this Part 2
on regression, we built many machine

5
00:00:19,300 --> 00:00:20,200
learning models.

6
00:00:20,200 --> 00:00:23,400
And now most of you
must be asking: okay, that's cool,

7
00:00:23,400 --> 00:00:27,733
I have all these regression models
in the toolkit, but which one do I select?

8
00:00:27,733 --> 00:00:30,733
Which one should I apply to my data set?

9
00:00:31,000 --> 00:00:34,233
And well, I actually have
some very good news for you.

10
00:00:34,300 --> 00:00:37,866
We will give the answer
to this exact question in this tutorial.

11
00:00:38,100 --> 00:00:42,000
So I'm going to try to reveal everything
in this same tutorial so that, you know,

12
00:00:42,000 --> 00:00:47,166
this can be the ultimate tutorial
of regression, where you finally learn

13
00:00:47,166 --> 00:00:52,066
how to use your regression toolkit
the right way on your future datasets.

14
00:00:52,400 --> 00:00:57,800
So what I will do in this tutorial is
I will introduce you to this toolkit

15
00:00:57,800 --> 00:01:01,800
that I've just made and which contains
all the regression models we learned

16
00:01:01,800 --> 00:01:04,900
together, turned into some very generic code

17
00:01:04,900 --> 00:01:08,833
templates. By very generic code templates,
I mean that

18
00:01:08,933 --> 00:01:13,200
you will be able to use these code
templates on your future data sets

19
00:01:13,333 --> 00:01:18,733
by changing only one or two things.
I made them as generic as possible

20
00:01:18,733 --> 00:01:21,600
so that they can be ready
to deploy on your data sets.

21
00:01:21,600 --> 00:01:25,800
And besides, each of them contains,
at the end of the implementation,

22
00:01:25,800 --> 00:01:30,033
the evaluation tool,
you know, allowing you to evaluate your model

23
00:01:30,200 --> 00:01:35,400
so that you can very easily and quickly
compare the performance of each of them.

24
00:01:35,700 --> 00:01:39,933
In other words, you know, in short,
thanks to this toolkit, you will be able

25
00:01:39,933 --> 00:01:44,633
to select the best model for your data
set in a very short amount of time.

26
00:01:44,633 --> 00:01:46,400
You know, very, very efficiently.

27
00:01:46,400 --> 00:01:48,900
And that's exactly what I'll prove to you.

28
00:01:48,900 --> 00:01:51,566
You know what I'm going to show you
in this tutorial?

29
00:01:51,566 --> 00:01:53,100
We're going to take a real-world

30
00:01:53,100 --> 00:01:56,833
data set, you know, with several features
and lots of observations.

31
00:01:57,133 --> 00:02:01,333
I will deploy each of the regression
models of the toolkit on this data set,

32
00:02:01,566 --> 00:02:04,500
and you will see how quickly
and efficiently

33
00:02:04,500 --> 00:02:06,966
I manage to figure out the best model.

34
00:02:06,966 --> 00:02:09,466
And that's
actually the answer to the question:

35
00:02:09,466 --> 00:02:11,333
how should I select the best model?

36
00:02:11,333 --> 00:02:12,833
And the simple answer is

37
00:02:12,833 --> 00:02:17,200
try all your models, try all your models,
and just select the best one,

38
00:02:17,200 --> 00:02:19,400
the one with the best performance result.

39
00:02:19,400 --> 00:02:21,466
And that performance result is measured

40
00:02:21,466 --> 00:02:25,100
by, of course, the R-squared coefficient
or the adjusted R-squared.
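
To make this "try them all and keep the best score" idea concrete, here is a minimal sketch in plain Python. It is not the code from the toolkit: the function names, the two models and all the numbers are made up for illustration, and in practice the predictions would come from the fitted regression models on a test set.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """R-squared penalized for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def select_best(y_true, predictions):
    """Return (name, R-squared) of the model with the highest R-squared."""
    scores = {name: r_squared(y_true, y_pred)
              for name, y_pred in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up test-set targets and predictions from two hypothetical models.
y_test = [3.0, 5.0, 7.0, 9.0]
predictions = {
    "model_a": [2.5, 5.5, 7.0, 9.5],  # close to the true values
    "model_b": [6.0, 6.0, 6.0, 6.0],  # always predicts the mean
}

best_name, best_score = select_best(y_test, predictions)
# Adjusted R-squared, assuming the model used a single feature.
best_adjusted = adjusted_r_squared(best_score, n_samples=len(y_test), n_features=1)
print(best_name, best_score, best_adjusted)
```

A model that always predicts the mean of the targets gets an R-squared of zero, which is why it makes a handy baseline in this sketch.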

41
00:02:25,800 --> 00:02:26,400
All right.

42
00:02:26,400 --> 00:02:27,300
So there we go.

43
00:02:27,300 --> 00:02:29,533
Let me introduce you to this toolkit.

44
00:02:29,533 --> 00:02:32,533
And then let's proceed to the demo.

45
00:02:32,566 --> 00:02:35,833
But first let's make sure everyone here
is on the same page.

46
00:02:36,033 --> 00:02:40,500
This is a new folder, you know, different
from the whole machine learning

47
00:02:40,500 --> 00:02:43,000
folder containing the ten parts.

48
00:02:43,000 --> 00:02:46,900
This is a new folder where you will get,
you know, that regression toolkit

49
00:02:46,900 --> 00:02:48,666
containing all the regression models,

50
00:02:48,666 --> 00:02:52,533
and then, when we tackle Part 3,
the classification toolkit with all the

51
00:02:52,533 --> 00:02:57,133
classification models. And mostly, you know,
this is the model selection folder.

52
00:02:57,133 --> 00:03:00,400
This is the folder you will want to use
when you want to deploy

53
00:03:00,533 --> 00:03:03,533
either your regression models
or your classification models

54
00:03:03,633 --> 00:03:07,666
on your data set, in order to quickly
and efficiently select the best one.

55
00:03:08,000 --> 00:03:09,300
And now there we go.

56
00:03:09,300 --> 00:03:14,600
Let's enter this regression
folder for model selection and, as you see,

57
00:03:14,700 --> 00:03:18,333
it contains five regression models

58
00:03:18,333 --> 00:03:22,866
that we studied in this Part 2,
you know, multiple linear regression.

59
00:03:22,866 --> 00:03:25,866
And I didn't include simple
linear regression of course, because

60
00:03:25,933 --> 00:03:29,933
now we will work with a real-world
data set, and therefore with several features.

61
00:03:30,266 --> 00:03:32,100
Then we have polynomial regression.

62
00:03:32,100 --> 00:03:35,066
Then support vector regression,
then decision tree

63
00:03:35,066 --> 00:03:38,066
regression
and of course random forest regression.

64
00:03:38,166 --> 00:03:42,166
And as I told you,
I made each of these implementations

65
00:03:42,166 --> 00:03:46,533
very generic
so that you can deploy them on your future

66
00:03:46,533 --> 00:03:50,200
data sets by changing only one or two
things.

67
00:03:50,200 --> 00:03:54,000
Assuming, of course, that your data set
is in CSV format

68
00:03:54,300 --> 00:03:58,366
and contains all the features
in the first columns, and the dependent

69
00:03:58,366 --> 00:04:02,366
variable in the last column.
That's really the essential condition.
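
As a rough illustration of that layout, here is a minimal sketch using only Python's standard library. It is not the toolkit's own loading code: the helper name, the column names and the values are invented for the example, and it assumes a header row and purely numerical cells.

```python
import csv
import io


def split_features_target(csv_file):
    """Split a CSV into features (all columns but the last) and target (last column)."""
    reader = csv.reader(csv_file)
    header = next(reader)  # assumes the first row holds column names
    X, y = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        values = [float(v) for v in row]  # assumes numerical data only
        X.append(values[:-1])  # features: the first columns
        y.append(values[-1])   # dependent variable: the last column
    return header[:-1], header[-1], X, y


# Tiny in-memory CSV with made-up column names and values.
sample = io.StringIO("f1,f2,target\n1.0,2.0,10.0\n3.0,4.0,20.0\n")
feature_names, target_name, X, y = split_features_target(sample)
print(feature_names, target_name, X, y)
```

With a real file, the same helper could be called on `open("your_file.csv", newline="")`, where the file name is a placeholder for your own data set.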

70
00:04:02,600 --> 00:04:03,566
And then of course, here

71
00:04:03,566 --> 00:04:07,533
I chose a data set without missing values
or categorical data.

72
00:04:07,533 --> 00:04:09,000
That's because I trust

73
00:04:09,000 --> 00:04:12,866
you will know how to handle those
thanks to your data preprocessing toolkit.

74
00:04:13,033 --> 00:04:17,900
So this data set is quite classic,
yet real-world, because, as you can see,

75
00:04:17,900 --> 00:04:21,833
it contains several features
and many, many observations.

76
00:04:21,833 --> 00:04:25,566
Actually almost 10,000 observations
if we scroll down.

77
00:04:25,566 --> 00:04:27,133
Yes, almost 10,000.

78
00:04:27,133 --> 00:04:27,733
All right.

79
00:04:27,733 --> 00:04:31,900
With, as you can see, only numerical
values, no categorical data in strings.

80
00:04:32,066 --> 00:04:34,100
And once again, no missing data.

81
00:04:34,100 --> 00:04:38,133
And I chose such a data set
so that, you know, we can make our code

82
00:04:38,133 --> 00:04:41,800
templates for each of our regression
models 100% generic,

83
00:04:41,933 --> 00:04:45,233
so that you only have to change
the name of the data set.