1
00:00:00,600 --> 00:00:01,200
All right.

2
00:00:01,200 --> 00:00:02,466
So perfect.

3
00:00:02,466 --> 00:00:04,733
So now that everything is essentially
set,

4
00:00:04,733 --> 00:00:06,900
let's tackle
this data preprocessing phase.

5
00:00:06,900 --> 00:00:08,866
So we're going to do this
very efficiently.

6
00:00:08,866 --> 00:00:11,400
I'm going to go to my data
preprocessing template.

7
00:00:11,400 --> 00:00:15,300
And I'm going to copy paste
each of these code cells.

8
00:00:15,300 --> 00:00:16,766
You know the first ones.

9
00:00:16,766 --> 00:00:19,600
So I'm creating a new code
cell here pasting that here.

10
00:00:19,600 --> 00:00:21,633
That's for importing the libraries.

11
00:00:21,633 --> 00:00:25,600
Then we're going to take care of the 
second step of data preprocessing

12
00:00:25,600 --> 00:00:29,533
which is importing the data
set, therefore creating

13
00:00:29,533 --> 00:00:31,500
a new code cell here.

14
00:00:31,500 --> 00:00:33,400
And let's first take care of this.

15
00:00:33,400 --> 00:00:35,666
You know the last one the easy one.

16
00:00:35,666 --> 00:00:39,033
Splitting the data
set into the training set and the test set

17
00:00:39,566 --> 00:00:43,866
and pasting that in a new code
cell right here.
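
The splitting step pasted above can be sketched as follows. This is a minimal sketch that assumes the template relies on scikit-learn's `train_test_split`; the transcript does not show the call, so the 80/20 ratio, the `random_state` value, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the matrix of features X and the dependent variable vector y
# (illustrative values only, not taken from the course data set).
X = np.arange(40, dtype=float).reshape(10, 4)  # 10 observations, 4 features
y = np.arange(10, dtype=float)                 # 10 target values

# Assumed settings: 20% of the rows go to the test set, fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

With 10 rows and `test_size=0.2`, 8 rows end up in the training set and 2 in the test set.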

18
00:00:43,866 --> 00:00:45,400
All right. Perfect.

19
00:00:45,400 --> 00:00:48,600
And now before we encode
the categorical data, let's just make sure

20
00:00:48,600 --> 00:00:52,033
to replace
what's necessary here in this template.

21
00:00:52,300 --> 00:00:54,933
And once again
that's the beauty of the template.

22
00:00:54,933 --> 00:00:59,366
We only need to replace one little thing
which is of course the name of the data

23
00:00:59,366 --> 00:01:00,733
set here. Right.

24
00:01:00,733 --> 00:01:05,700
And the name is of course 50, underscore,
capital S, Startups, dot csv.

25
00:01:06,000 --> 00:01:06,566
So there we go.

26
00:01:06,566 --> 00:01:12,166
Let's do this: 50 underscore Startups. Great.

27
00:01:12,166 --> 00:01:15,133
And as a reminder
we don't have to change anything here

28
00:01:15,133 --> 00:01:19,733
because this automatically selects
all the columns except the last one.

29
00:01:19,733 --> 00:01:22,566
And therefore all the four features here.

30
00:01:22,566 --> 00:01:23,700
So that's perfect.

31
00:01:23,700 --> 00:01:27,566
And this line of code
automatically selects

32
00:01:27,566 --> 00:01:31,800
the last column,
which means the dependent variable profit.
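
The two selection lines described here can be sketched as follows. This is a minimal sketch that assumes the template uses pandas `iloc` slicing. The column names other than State and Profit and all the values are hypothetical stand-ins; in the notebook the frame would come from reading `50_Startups.csv` instead.

```python
import pandas as pd

# Tiny stand-in for the data set (hypothetical column names and values).
# In the notebook this would instead be: dataset = pd.read_csv("50_Startups.csv")
dataset = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "Administration": [500.0, 600.0, 700.0],
    "Marketing Spend": [300.0, 400.0, 500.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [1500.0, 2500.0, 3500.0],
})

X = dataset.iloc[:, :-1].values  # every column except the last: the four features
y = dataset.iloc[:, -1].values   # only the last column: the dependent variable Profit

print(X.shape)  # (3, 4)
print(y.shape)  # (3,)
```

Because the slices are written relative to the last column, nothing here depends on how many feature columns the data set has.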

33
00:01:31,900 --> 00:01:32,766
Okay.

34
00:01:32,766 --> 00:01:36,066
So once again we tackled data
preprocessing in a flash.

35
00:01:36,066 --> 00:01:40,100
And now we just have this one last tool
to add in our data

36
00:01:40,100 --> 00:01:44,266
preprocessing phase,
which is to encode that State variable here.

37
00:01:44,566 --> 00:01:47,800
So to do this we're going to get our data
preprocessing tools which you have

38
00:01:48,000 --> 00:01:50,900
in your Part 1, Data Preprocessing folder.

39
00:01:50,900 --> 00:01:53,700
And now we're going to scroll down to find

40
00:01:53,700 --> 00:01:57,300
that tool that you know
encodes the categorical data.

41
00:01:57,600 --> 00:02:01,400
So remember we actually have
two sub tools here if I may say that

42
00:02:01,400 --> 00:02:05,033
first tool that applies one hot encoding,
which is exactly what we want.

43
00:02:05,300 --> 00:02:08,366
And that tool that just encodes
a binary variable

44
00:02:08,366 --> 00:02:11,700
into zero and one
and of course what we need is this one.

45
00:02:11,700 --> 00:02:15,233
So I'm just going to copy paste
that piece of code.

46
00:02:15,433 --> 00:02:18,966
And then I'm going to paste that
right here

47
00:02:18,966 --> 00:02:21,433
in a new code cell
to encode categorical data.

48
00:02:21,433 --> 00:02:26,066
And now your turn, your turn to think
and figure out what we need to do next.

49
00:02:26,366 --> 00:02:29,700
Please press pause on this video
and figure out what you have

50
00:02:29,700 --> 00:02:35,333
to change here in order to indeed apply
one hot encoding on our data set.

51
00:02:35,366 --> 00:02:36,300
I'll give you a hint.

52
00:02:36,300 --> 00:02:39,866
You only have one little thing to change
and then you'll be good to go.

53
00:02:39,866 --> 00:02:41,500
So please press pause.

54
00:02:41,500 --> 00:02:42,000
Okay.

55
00:02:42,000 --> 00:02:44,366
And now I'm
going to give you the solution.

56
00:02:44,366 --> 00:02:48,466
So the only thing that you had to
change here is that index here.

57
00:02:48,466 --> 00:02:51,933
Remember this corresponds
to the index of the column

58
00:02:51,933 --> 00:02:54,833
you want to apply one-hot encoding to.

59
00:02:54,833 --> 00:02:57,833
And in our previous data set
you know data dot CSV.

60
00:02:57,966 --> 00:03:01,233
Well remember the categorical variable
was the first column.

61
00:03:01,233 --> 00:03:03,333
That's why we put index zero here.

62
00:03:03,333 --> 00:03:08,666
But for our new data set the categorical
variable is actually the fourth column.

63
00:03:08,966 --> 00:03:10,166
But be careful.

64
00:03:10,166 --> 00:03:12,700
Remember that indexes in Python
start from zero.

65
00:03:12,700 --> 00:03:15,433
Therefore this column has index zero.
This one is index one.

66
00:03:15,433 --> 00:03:18,266
This one has index two
and this one has index three.

67
00:03:18,266 --> 00:03:21,300
And therefore the index you need to change

68
00:03:21,300 --> 00:03:24,300
here is of course three right.

69
00:03:24,300 --> 00:03:29,466
So this will apply one hot encoding to the
column of index three in your data set.

70
00:03:29,633 --> 00:03:33,100
Therefore exactly this State
column. Perfect.
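
The index change discussed above can be sketched as follows. This is a minimal sketch that assumes the encoding tool is built on scikit-learn's `ColumnTransformer` and `OneHotEncoder`; the transcript never shows the code, so the transformer name `"encoder"` and the toy rows are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy matrix of features: three numeric columns, then the State column.
# Columns are counted from zero, so State (the fourth column) has index 3.
X = np.array([
    [1000.0, 500.0, 300.0, "New York"],
    [2000.0, 600.0, 400.0, "California"],
    [3000.0, 700.0, 500.0, "Florida"],
], dtype=object)

# One-hot encode the column at index 3 and pass the other columns through unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X))

print(X_encoded.shape)
print(X_encoded[0])
```

In this sketch the three one-hot columns come first in the output, ordered alphabetically by state name, followed by the three untouched numeric columns, so each row grows from 4 to 6 values.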

71
00:03:33,100 --> 00:03:36,100
So we're done with the data
preprocessing phase.

72
00:03:36,100 --> 00:03:39,900
So now we are going to observe the results
of what we just built.

73
00:03:39,900 --> 00:03:42,066
You know
just in terms of data preprocessing.

74
00:03:42,066 --> 00:03:44,166
And therefore
we're going to do several things here.

75
00:03:44,166 --> 00:03:48,766
First we're going to upload the data
set into our notebook.

76
00:03:48,766 --> 00:03:49,133
Right.

77
00:03:49,133 --> 00:03:52,466
And to do this we click this little
folder here and then upload.

78
00:03:53,566 --> 00:03:54,033
All right.

79
00:03:54,033 --> 00:03:57,533
So as usual my machine learning dataset
folder is on my desktop.

80
00:03:57,700 --> 00:04:00,500
So we're going to go inside.
Make sure to find it on your machine.

81
00:04:00,500 --> 00:04:02,400
Then we're going to go to Part 2,
Regression.

82
00:04:02,400 --> 00:04:05,800
Then Section 5,
Multiple Linear Regression, in Python.

83
00:04:06,000 --> 00:04:06,833
And there we go.

84
00:04:06,833 --> 00:04:11,300
That's the data set which we need to open
and upload into our notebook.

85
00:04:11,733 --> 00:04:13,200
All right it is uploaded.

86
00:04:13,200 --> 00:04:16,633
And now what we're going to do
is we're going to run each of these cells

87
00:04:16,633 --> 00:04:17,600
that we just made.

88
00:04:17,600 --> 00:04:19,800
But I'm going to add a few prints,

89
00:04:19,800 --> 00:04:21,866
you know,
so that you can really see what we did.

90
00:04:21,866 --> 00:04:25,166
You know, how the matrix
of features and dependent variable vector

91
00:04:25,166 --> 00:04:28,200
are created and modified
along the data preprocessing phase.