1
00:00:00,433 --> 00:00:00,700
All right.

2
00:00:00,700 --> 00:00:03,000
So first let's start
by importing the libraries.

3
00:00:03,000 --> 00:00:04,933
That's easy. Done.

4
00:00:04,933 --> 00:00:07,033
Next step we import the data set.

5
00:00:07,033 --> 00:00:10,600
And we can do that now because we have the data
set uploaded in our notebook.

6
00:00:11,033 --> 00:00:12,533
So make sure to have it as well.

7
00:00:12,533 --> 00:00:14,300
Okay. Now the data set is imported.

8
00:00:14,300 --> 00:00:17,300
We have the matrix of features
and the dependent variable vector y.
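
Here is a minimal sketch of that split, assuming the data set is loaded with pandas and that the dependent variable is the last column. The three rows are made-up stand-ins, not the real data, and the column labels are only guessed from the names mentioned in this video.

```python
import pandas as pd

# Made-up stand-in rows; in the notebook the data would come from the uploaded file.
dataset = pd.DataFrame({
    "R&D Spend": [100.0, 90.0, 80.0],
    "Administration Spend": [50.0, 60.0, 40.0],
    "Marketing Spend": [300.0, 250.0, 200.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [500.0, 450.0, 400.0],
})

# Matrix of features: every column except the last one.
X = dataset.iloc[:, :-1].values
# Dependent variable vector: the last column only.
y = dataset.iloc[:, -1].values

print(X)
```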

9
00:00:17,400 --> 00:00:18,033
Good.

10
00:00:18,033 --> 00:00:21,233
And now I'm going to do a print
to show you the state of x.

11
00:00:21,233 --> 00:00:24,100
You know, what X is exactly at this stage.

12
00:00:24,100 --> 00:00:25,966
So I'm going to do print x.

13
00:00:25,966 --> 00:00:28,933
And I'm going to run this cell.

14
00:00:28,933 --> 00:00:30,933
All right. And let's see what we get.

15
00:00:30,933 --> 00:00:31,233
All right.

16
00:00:31,233 --> 00:00:36,933
So indeed we get exactly the same columns
as in this data set

17
00:00:36,933 --> 00:00:41,066
with first R&D spend then administration
spend and marketing spend and state.

18
00:00:41,100 --> 00:00:41,400
Right.

19
00:00:41,400 --> 00:00:45,333
We can clearly see that we get
the same columns here in the same order.

20
00:00:45,400 --> 00:00:47,466
All right.
So that's the matrix of features.

21
00:00:47,466 --> 00:00:48,400
All good.

22
00:00:48,400 --> 00:00:50,966
Now I'm not going to show you
the dependent variable vector

23
00:00:50,966 --> 00:00:52,166
because that's obvious.

24
00:00:52,166 --> 00:00:54,333
We're going to get the same profit.

25
00:00:54,333 --> 00:00:58,633
But what I want to show
you is what becomes x

26
00:00:59,000 --> 00:01:01,400
after we encode the categorical data.

27
00:01:01,400 --> 00:01:04,133
You know, and you can actually guess what
it will become.

28
00:01:04,133 --> 00:01:07,033
But you will see that the three columns
resulting here from

29
00:01:07,033 --> 00:01:10,033
one-hot encoding
will actually be placed at the beginning.

30
00:01:10,033 --> 00:01:11,500
All right. So let's check it out.

31
00:01:11,500 --> 00:01:15,433
Let's run the cell to actually apply
the categorical data encoding.

32
00:01:15,433 --> 00:01:20,533
And now let's create a new code cell
in which we're going to print X again.

33
00:01:20,933 --> 00:01:26,133
And let's run this
cell to see what x becomes.

34
00:01:26,633 --> 00:01:27,433
And there you go.

35
00:01:27,433 --> 00:01:31,933
Exactly as I told you,
we still have the three original first columns here.

36
00:01:31,933 --> 00:01:34,266
So that corresponds to R&D
spend that corresponds

37
00:01:34,266 --> 00:01:37,733
to the administration spend
and that corresponds to marketing spend.

38
00:01:38,033 --> 00:01:42,233
But now instead of having this state
column here, we indeed have these

39
00:01:42,233 --> 00:01:46,500
three new columns at the beginning,
encoding that state variable.

40
00:01:46,500 --> 00:01:47,700
And we can actually see

41
00:01:47,700 --> 00:01:50,800
what corresponds to what, you know,
if we have a look at our data set.

42
00:01:51,000 --> 00:01:54,600
Well,
the first row has New York as a state.

43
00:01:54,866 --> 00:01:59,066
And therefore New York was encoded
as zero, zero and one.

44
00:01:59,466 --> 00:02:02,166
Then let's see
the state of the second row,

45
00:02:02,166 --> 00:02:04,900
you know, corresponding
to the second store, we have California,

46
00:02:04,900 --> 00:02:08,866
and therefore California was encoded
as one zero and zero.

47
00:02:09,133 --> 00:02:14,866
And finally, well,
Florida was encoded as zero, one and zero.

48
00:02:15,000 --> 00:02:17,666
All right. So that's the one-hot encoding
that happens.
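
As a rough sketch of that step, assuming the encoding is done with scikit-learn's ColumnTransformer and OneHotEncoder (the video does not show the cell, so this is an assumption), and using made-up rows with the state in column index 3:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in rows; only the state column matters for this illustration.
X = np.array([
    [100.0, 50.0, 300.0, "New York"],
    [90.0, 60.0, 250.0, "California"],
    [80.0, 40.0, 200.0, "Florida"],
], dtype=object)

# One-hot encode column 3 and pass the other columns through unchanged.
# The encoded columns come first in the output, the untouched ones after.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X))

# Categories are sorted alphabetically: California, Florida, New York.
print(X_encoded)
```

Because the categories are sorted alphabetically, New York ends up in the third encoded column, which matches the zero, zero, one pattern on the first row.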

49
00:02:17,666 --> 00:02:18,666
And now all good.

50
00:02:18,666 --> 00:02:21,833
We have a fully pre-processed data set.

51
00:02:22,100 --> 00:02:23,466
And as I told you in part one,

52
00:02:23,466 --> 00:02:25,966
but I'm going to say it again here
because this is important:

53
00:02:25,966 --> 00:02:28,933
we don't have to apply feature scaling.

54
00:02:28,933 --> 00:02:32,566
Why? Because, you know, in the equation
of the multiple linear regression,

55
00:02:32,833 --> 00:02:34,233
you know, you have this coefficient

56
00:02:34,233 --> 00:02:36,633
that is multiplied
by each independent variable.

57
00:02:36,633 --> 00:02:37,933
You know, each feature.

58
00:02:37,933 --> 00:02:41,866
And therefore it doesn't matter that some
features have higher values than others,

59
00:02:42,033 --> 00:02:45,900
because the coefficients will compensate
to put everything on the same scale.

60
00:02:45,900 --> 00:02:49,433
And therefore remember this
in multiple linear regression,

61
00:02:49,466 --> 00:02:52,766
there is absolutely
no need to apply feature scaling.
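
To make that concrete, here is a small sketch on synthetic data, assuming an ordinary least squares fit with no regularization (computed here with NumPy rather than with the model used in the course). The data and the helper name fit_predict are invented for the illustration.

```python
import numpy as np

# Synthetic data: the second feature is on a much larger scale than the first.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))
X[:, 1] *= 1000.0
y = 3.0 + 2.0 * X[:, 0] + 0.005 * X[:, 1] + rng.normal(0, 0.01, size=20)

def fit_predict(features, target):
    """Fit ordinary least squares with an intercept and return (coefficients, predictions)."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, A @ coef

# Fit once on the raw features and once on standardized features.
coef_raw, pred_raw = fit_predict(X, y)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
coef_scaled, pred_scaled = fit_predict(X_scaled, y)

# The predictions agree; only the coefficients change to absorb the scale.
print(np.allclose(pred_raw, pred_scaled))
print(coef_raw[1:], coef_scaled[1:])
```

The coefficients on the standardized features are the raw coefficients multiplied by each feature's standard deviation, which is the compensation described above.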

62
00:02:53,166 --> 00:02:55,066
And one last thing
I would like to add as well,

63
00:02:55,066 --> 00:02:58,200
because I know a lot of you
ask this question, do

64
00:02:58,200 --> 00:03:02,633
we need to check the assumptions of linear
regression?

65
00:03:03,133 --> 00:03:07,066
The answer is absolutely not,
because I will explain

66
00:03:07,066 --> 00:03:10,933
at the end of this part, you know, Part 2,
Regression, that whenever you have

67
00:03:10,933 --> 00:03:14,666
a new data set and you want to experiment
with some machine

68
00:03:14,666 --> 00:03:18,733
learning models to figure out
which one leads to the highest accuracy.

69
00:03:18,966 --> 00:03:22,800
Well, even if your data set doesn't
have linear relationships,

70
00:03:23,133 --> 00:03:26,400
you can still try a multiple linear
regression on it.

71
00:03:26,666 --> 00:03:29,900
And if, you know, your data set
doesn't have linear relationships, well,

72
00:03:29,900 --> 00:03:33,933
your multiple linear regression model
will just perform poorly, and therefore

73
00:03:33,966 --> 00:03:38,533
it will get an accuracy lower
than the accuracy of your other models.

74
00:03:38,533 --> 00:03:41,733
So you will just not select
the multiple linear regression model.

75
00:03:41,733 --> 00:03:45,900
But you don't have to check the multiple
linear regression assumptions.

76
00:03:45,900 --> 00:03:47,766
It will just be a waste of time.

77
00:03:47,766 --> 00:03:51,566
Really, I will show you at the end
how you can very quickly and efficiently

78
00:03:51,633 --> 00:03:53,366
try each of the models on your data

79
00:03:53,366 --> 00:03:57,366
set and select very quickly
the one that has the highest accuracy.

80
00:03:57,566 --> 00:03:58,166
All right.

81
00:03:58,166 --> 00:04:00,000
So I just wanted to be clear
on that point.

82
00:04:00,000 --> 00:04:03,100
Don't worry about the multiple linear
regression assumptions.

83
00:04:03,300 --> 00:04:05,933
If your data set has linear relationships
then good.

84
00:04:05,933 --> 00:04:07,500
Your multiple linear regression

85
00:04:07,500 --> 00:04:11,533
will satisfy the assumptions indeed,
and will give you a high accuracy.

86
00:04:11,800 --> 00:04:15,300
And if your data set doesn't have linear
relationships, well, fine.

87
00:04:15,300 --> 00:04:18,033
Your multiple linear regression
will just perform poorly

88
00:04:18,033 --> 00:04:20,533
and you will just take another model
and that's it.

89
00:04:20,533 --> 00:04:22,566
That's as simple as that. All right.

90
00:04:22,566 --> 00:04:24,200
So now good.

91
00:04:24,200 --> 00:04:26,200
We're done with data preprocessing.

92
00:04:26,200 --> 00:04:29,900
We can therefore move on to the next step
which is to train

93
00:04:29,933 --> 00:04:32,933
the multiple linear
regression model on the training set.

94
00:04:32,966 --> 00:04:34,333
So take a little break here.

95
00:04:34,333 --> 00:04:37,466
And as soon as you're ready
to build your next machine learning model,

96
00:04:37,633 --> 00:04:40,666
well, join me in the next tutorial
to tackle this.

97
00:04:40,966 --> 00:04:42,866
And until then, enjoy machine learning.