1
00:00:00,533 --> 00:00:03,633
So I'm going to jump back to my folder.

2
00:00:03,633 --> 00:00:06,566
Part 1, Data Preprocessing,
which is here.

3
00:00:06,566 --> 00:00:07,666
And here we go.

4
00:00:07,666 --> 00:00:10,666
So the categorical data R file is here.

5
00:00:10,833 --> 00:00:13,833
So let's open it. Here it is.

6
00:00:13,933 --> 00:00:20,466
And right now what I'm going to do
is I'm going to take this and copy it.

7
00:00:20,800 --> 00:00:21,933
So you have the same file.

8
00:00:21,933 --> 00:00:26,000
So you can also take it from your folder
or take it from the course.

9
00:00:26,100 --> 00:00:30,600
And let's go back to our multiple linear
regression file and paste

10
00:00:31,700 --> 00:00:33,266
that here.

11
00:00:33,266 --> 00:00:34,166
Okay.

12
00:00:34,166 --> 00:00:36,766
Then of course
we need to change a few things here.

13
00:00:36,766 --> 00:00:40,033
So we need to change the name
of our categorical variable.

14
00:00:40,200 --> 00:00:41,766
So in part one it was country.

15
00:00:41,766 --> 00:00:44,733
And here it is state.

16
00:00:44,733 --> 00:00:45,500
Okay.

17
00:00:45,500 --> 00:00:46,200
Same here.

18
00:00:46,200 --> 00:00:49,200
We need to replace country with state.

19
00:00:49,600 --> 00:00:51,200
Let's not forget to align this.

20
00:00:51,200 --> 00:00:53,000
This is very important in R.

21
00:00:53,000 --> 00:00:56,000
And in any programming language.

22
00:00:56,466 --> 00:00:59,000
Here we go.

23
00:00:59,000 --> 00:01:00,766
And now we need to change the levels.

24
00:01:00,766 --> 00:01:03,566
So before, you know, the categorical
variable was the country.

25
00:01:03,566 --> 00:01:06,566
And the three categories were France,
Spain and Germany.

26
00:01:06,800 --> 00:01:11,500
And here our three categories are New York,
California and Florida.

27
00:01:12,800 --> 00:01:14,133
So let's do it on our dataset.

28
00:01:14,133 --> 00:01:17,133
Actually I will close that
because we no longer need it.

29
00:01:17,333 --> 00:01:21,366
So here our levels are, as we said, New York,

30
00:01:25,033 --> 00:01:28,033
California, and

31
00:01:29,566 --> 00:01:31,566
Florida.

32
00:01:31,566 --> 00:01:32,400
Okay.

33
00:01:32,400 --> 00:01:36,000
And then the labels,
that is, the numbers

34
00:01:36,000 --> 00:01:38,866
which are actually factors,
the numeric factors

35
00:01:38,866 --> 00:01:42,600
that are going to replace these three text values
here, New York, California and Florida,

36
00:01:42,600 --> 00:01:45,600
are the numbers
you choose here for labels.

37
00:01:45,800 --> 00:01:47,166
So here we have 1, 2, 3.

38
00:01:47,166 --> 00:01:49,833
That means that New York
is going to be one.

39
00:01:49,833 --> 00:01:52,833
California is going to be two
and Florida is going to be three.

40
00:01:52,866 --> 00:01:57,000
You're going to see. I'm
going to select this and execute.

41
00:01:58,100 --> 00:01:58,566
All right.

42
00:01:58,566 --> 00:02:01,300
And now let's look at our data set.

43
00:02:01,300 --> 00:02:05,100
As you can see, the state is now encoded
with the values 1, 2, 3.

44
00:02:05,200 --> 00:02:09,000
So one for New York, two for California
and three for Florida.

45
00:02:09,800 --> 00:02:11,733
Let's go back. Okay.

46
00:02:11,733 --> 00:02:13,266
So the encoding is done.

47
00:02:13,266 --> 00:02:15,966
And that's a much
better thing for our model.

48
00:02:15,966 --> 00:02:19,033
Now our model has a greater chance
to work.

49
00:02:19,500 --> 00:02:23,066
And now the last thing we need to do
is to split the dataset

50
00:02:23,066 --> 00:02:25,500
into the training set
and the test set.

51
00:02:25,500 --> 00:02:29,300
So here let's not forget to change
the name of the dependent variable here,

52
00:02:29,766 --> 00:02:32,466
which is not purchased but profit.

53
00:02:34,500 --> 00:02:35,466
All right.

54
00:02:35,466 --> 00:02:38,433
And then we need to change the split ratio
if necessary.

55
00:02:38,433 --> 00:02:40,000
Let's see. We have 50 observations.

56
00:02:40,000 --> 00:02:42,000
So a good split would be to have

57
00:02:42,000 --> 00:02:45,600
40 observations in the training set
and ten observations in the test set.

58
00:02:45,900 --> 00:02:50,833
So that actually makes an 80% split
ratio, 80% going to the training set.

59
00:02:51,200 --> 00:02:53,600
And this is already what we have. Perfect.

60
00:02:53,600 --> 00:02:56,233
So we don't have to do anything here
for the split ratio.

61
00:02:56,233 --> 00:02:59,233
And we are ready to take all of these.

62
00:02:59,400 --> 00:03:02,100
And execute.

63
00:03:02,100 --> 00:03:03,900
And here we go.

64
00:03:03,900 --> 00:03:06,900
Let's have a look at our training set
and our test set.

65
00:03:09,366 --> 00:03:09,900
Here it is.

66
00:03:09,900 --> 00:03:11,333
That's the training set.

67
00:03:11,333 --> 00:03:11,566
Okay.

68
00:03:11,566 --> 00:03:14,566
So it contains 40 entries
for the observations.

69
00:03:14,633 --> 00:03:15,600
Great.

70
00:03:15,600 --> 00:03:17,700
We have our encoded variable for state.

71
00:03:17,700 --> 00:03:18,966
That's perfect.

72
00:03:18,966 --> 00:03:22,300
And then a test set that contains
ten observations.

73
00:03:22,566 --> 00:03:24,700
And everything looks fine.

74
00:03:24,700 --> 00:03:28,433
All right, so let's
go back to multiple linear regression.

75
00:03:29,466 --> 00:03:31,600
And the last step is feature scaling.

76
00:03:31,600 --> 00:03:35,433
But as for simple linear regression,
we won't need to apply

77
00:03:35,433 --> 00:03:37,033
feature scaling manually.

78
00:03:37,033 --> 00:03:40,466
This will be taken care of
by the function that we're going to use

79
00:03:40,466 --> 00:03:43,800
to fit multiple linear
regression to our training set.

80
00:03:44,166 --> 00:03:46,166
So we're all fine. We're all good here.

81
00:03:46,166 --> 00:03:48,233
We are ready to move on to the next step.

82
00:03:48,233 --> 00:03:51,133
And that's what we're going to do
in the next tutorial.

83
00:03:51,133 --> 00:03:54,566
Thank you for watching this one and I look
forward to seeing you in the next one.

84
00:03:55,133 --> 00:03:58,133
Until then, enjoy machine learning.