1
00:00:00,166 --> 00:00:02,300
Hello and welcome to this R tutorial.

2
00:00:02,300 --> 00:00:04,600
I'm super excited to be in the deep
learning part.

3
00:00:04,600 --> 00:00:08,000
This is one of the most fascinating
and exciting branches of machine learning,

4
00:00:08,366 --> 00:00:11,000
and besides, it's
one of the most powerful.

5
00:00:11,000 --> 00:00:13,666
In the following tutorials,
we're going to solve the business problem

6
00:00:13,666 --> 00:00:16,500
described by Kirill
at the beginning of this section,

7
00:00:16,500 --> 00:00:19,133
and you will see that
we are going to get strong results.

8
00:00:19,133 --> 00:00:23,466
Thanks to this artificial neural network
that we are about to build with R.

9
00:00:23,866 --> 00:00:24,600
So as usual,

10
00:00:24,600 --> 00:00:27,900
we are going to make this artificial
neural network model very efficiently.

11
00:00:28,100 --> 00:00:29,633
And we're going to use the best package

12
00:00:29,633 --> 00:00:33,166
for that, which I will let you find out
about in the next tutorials.

13
00:00:33,733 --> 00:00:34,766
So let's start.

14
00:00:34,766 --> 00:00:39,533
And the first step of our journey
is the boring step: data preprocessing.

15
00:00:39,800 --> 00:00:43,200
But we will do it very efficiently
because we have our template,

16
00:00:43,200 --> 00:00:45,766
our classification template
that I've prepared here.

17
00:00:45,766 --> 00:00:48,433
And why can we use this classification
template?

18
00:00:48,433 --> 00:00:52,200
Well, it's by the nature of the business
problem. You saw in the business

19
00:00:52,200 --> 00:00:55,200
problem description
that you have some independent variables.

20
00:00:55,200 --> 00:00:59,033
And with these independent variables,
you have to predict a dependent variable

21
00:00:59,033 --> 00:01:00,833
that has a binary outcome.

22
00:01:00,833 --> 00:01:03,833
And since the outcome of the dependent
variable is binary,

23
00:01:03,833 --> 00:01:06,033
that means it's a categorical variable.

24
00:01:06,033 --> 00:01:09,033
That means we have to predict
classes zero and one,

25
00:01:09,100 --> 00:01:12,800
and therefore that makes our problem
a classification problem.

26
00:01:13,033 --> 00:01:16,033
And so okay we're going to build a deep
learning model.

27
00:01:16,133 --> 00:01:17,066
But this deep learning

28
00:01:17,066 --> 00:01:20,066
model is going to be nothing other
than a classification model.

29
00:01:20,300 --> 00:01:24,133
And that's why we are going to use our
classification template that we have here.

30
00:01:24,433 --> 00:01:28,733
And that will save us a lot of time
to build our artificial neural network.

31
00:01:28,766 --> 00:01:32,400
And besides, we want to focus on the deep
learning model itself.

32
00:01:32,400 --> 00:01:36,033
So we will get there very quickly
thanks to this template.

33
00:01:36,500 --> 00:01:36,866
All right.

34
00:01:36,866 --> 00:01:42,333
So let's take everything from this
template from the top down to here.

35
00:01:42,333 --> 00:01:46,666
We cannot use this section here
because, you know, this is the section

36
00:01:46,666 --> 00:01:48,533
to visualize the training set results

37
00:01:48,533 --> 00:01:50,300
and the test set results as well,

38
00:01:50,300 --> 00:01:53,266
but only when we have two independent
variables,

39
00:01:53,266 --> 00:01:56,566
because one independent
variable corresponds to one dimension.

40
00:01:56,933 --> 00:01:58,200
And since now in the data

41
00:01:58,200 --> 00:02:02,200
set of the business problem, we have,
I think 10 or 11 independent variables.

42
00:02:02,400 --> 00:02:06,266
Well, then it's a little bit hard
to represent something in 11 dimensions.

43
00:02:06,533 --> 00:02:11,100
So we won't take this, but we will
definitely take everything that's above.

44
00:02:11,366 --> 00:02:15,866
So I'm going to copy that
and let's go back to our ANN model

45
00:02:16,166 --> 00:02:19,666
and paste this classification template
right here.

46
00:02:20,233 --> 00:02:20,966
All right.

47
00:02:20,966 --> 00:02:24,033
And now in this template
we're going to change very few things.

48
00:02:24,033 --> 00:02:28,133
And of course we are going to build
our artificial neural network

49
00:02:28,333 --> 00:02:31,666
right here in this section,
"create your classifier here".

50
00:02:31,833 --> 00:02:36,666
We can already replace "classifier" here by
"ANN", to then build the model.

51
00:02:37,033 --> 00:02:37,866
All right.

52
00:02:37,866 --> 00:02:38,700
But of course

53
00:02:38,700 --> 00:02:42,366
we need to make sure everything's
okay in all the data preprocessing steps.

54
00:02:42,666 --> 00:02:45,066
And that's
what we're going to do right now.

55
00:02:45,066 --> 00:02:46,133
All right. So let's start.

56
00:02:46,133 --> 00:02:49,833
Let's start with the basic step: setting
the right folder as the working directory.

57
00:02:50,133 --> 00:02:53,333
So right now I'm on my desktop. I'm
going to my Machine Learning A-Z folder.

58
00:02:53,333 --> 00:02:56,100
Then we are in Part 8, Deep Learning.

59
00:02:56,100 --> 00:02:59,100
And now Section
40, Artificial Neural Networks.

60
00:02:59,133 --> 00:02:59,966
Here we go.

61
00:02:59,966 --> 00:03:02,900
Make sure that you have the
Churn_Modelling.csv file.

62
00:03:02,900 --> 00:03:05,333
And if that's the case
you can click on this "More" button here.

63
00:03:05,333 --> 00:03:08,433
And then "Set As Working Directory". Great.

64
00:03:08,433 --> 00:03:10,433
And now let's change a few things.

65
00:03:10,433 --> 00:03:13,800
So first of all let's start
with this section importing the data set.

66
00:03:14,100 --> 00:03:17,033
Well, the name of the dataset
is not Social_Network_Ads.

67
00:03:17,033 --> 00:03:20,800
It's now Churn_Modelling.

68
00:03:21,500 --> 00:03:22,266
All right.

69
00:03:22,266 --> 00:03:24,300
We are now ready to import the data set.

70
00:03:24,300 --> 00:03:27,100
So let's do it right now
I'm going to select this line.

71
00:03:27,100 --> 00:03:28,733
And execute.

72
00:03:28,733 --> 00:03:29,033
All right.

73
00:03:29,033 --> 00:03:30,366
The dataset is well imported.

74
00:03:30,366 --> 00:03:32,833
Well, we actually have 14 variables.

75
00:03:32,833 --> 00:03:37,200
But let's see if we include
all these variables in the real data set.

76
00:03:37,500 --> 00:03:40,500
You know the one on which
we want to build our deep learning model.

77
00:03:40,733 --> 00:03:44,166
So let's see
I'm going to click on this data set here

78
00:03:44,500 --> 00:03:47,733
to see which independent variables
we include in the model.

79
00:03:48,433 --> 00:03:50,133
All right so just a quick reminder.

80
00:03:50,133 --> 00:03:53,766
This data set contains 10,000 observations
containing

81
00:03:53,766 --> 00:03:57,833
some information about customers in a bank,
like the surname,

82
00:03:57,833 --> 00:04:01,800
the credit score, geography, gender, age,
and all the other information here.

83
00:04:02,166 --> 00:04:05,966
And during six months,
the bank observed, for each customer,

84
00:04:05,966 --> 00:04:09,866
whether the customer stayed or left the bank
within the six-month period,

85
00:04:10,166 --> 00:04:11,033
and this result,

86
00:04:11,033 --> 00:04:15,433
whether the customer stayed or left
is given in this last column here, exited.

87
00:04:15,633 --> 00:04:19,800
So one means that the customer left
the bank during the six months, and zero

88
00:04:19,800 --> 00:04:22,833
means that the customer stayed in the bank
during the six months.

89
00:04:23,533 --> 00:04:27,666
So what's important to understand
now is that all these variables here,

90
00:04:27,666 --> 00:04:32,066
from row number to estimated salary,
are the independent variables.

91
00:04:32,233 --> 00:04:35,333
And the last column here
exited is the dependent variable.

92
00:04:35,833 --> 00:04:41,466
So right now our goal is to make a model
where we can predict this result.

93
00:04:41,466 --> 00:04:44,633
Exited here whether the customer left
or stayed in the bank

94
00:04:44,966 --> 00:04:48,400
from the information contained
in all these independent variables here.

95
00:04:49,100 --> 00:04:52,466
But the thing is
that in these independent variables,

96
00:04:52,466 --> 00:04:56,100
some definitely don't have an impact
on this dependent variable.

97
00:04:56,100 --> 00:04:57,000
Exited.

98
00:04:57,000 --> 00:05:01,066
And so now what we have to do
is only take the independent variables

99
00:05:01,266 --> 00:05:04,133
that could have an impact and correlations

100
00:05:04,133 --> 00:05:07,566
with the decision of the customer
to leave or stay in the bank.

101
00:05:08,200 --> 00:05:09,533
And so that's what
we're going to do right now.

102
00:05:09,533 --> 00:05:12,800
So let's look at each of these independent
variables one by one.

103
00:05:13,000 --> 00:05:16,000
And let's see
which one we keep in our model.

104
00:05:16,500 --> 00:05:19,200
All right so let's start
with the first one row number.

105
00:05:19,200 --> 00:05:22,900
Well row number has definitely no impact
on the dependent variable exited.

106
00:05:23,100 --> 00:05:26,400
So of course we will not include it.
Then customer ID.

107
00:05:26,733 --> 00:05:28,366
Well customer ID that's the same.

108
00:05:28,366 --> 00:05:30,600
That's just an identification number.

109
00:05:30,600 --> 00:05:34,066
This definitely doesn't have any impact
on the decision of the customer

110
00:05:34,066 --> 00:05:35,600
to stay in or leave the bank.

111
00:05:35,600 --> 00:05:37,900
So we will not include that either.

112
00:05:37,900 --> 00:05:38,866
Then the surname.

113
00:05:38,866 --> 00:05:40,100
Well that's the same.

114
00:05:40,100 --> 00:05:42,866
It's not because your name is Andrews
that you have more chance

115
00:05:42,866 --> 00:05:46,166
to leave the bank
than if your name is Romeo.

116
00:05:46,700 --> 00:05:49,700
All right,
so we don't include surname either,

117
00:05:49,900 --> 00:05:53,700
but then we have credit score,
and credit score might have an impact

118
00:05:53,700 --> 00:05:56,600
on the decision of the customer
to stay in or leave the bank.

119
00:05:56,600 --> 00:05:59,766
Indeed, we can assume that customers
with a low credit score

120
00:05:59,866 --> 00:06:03,366
are more likely to leave the bank
than customers with a high credit score,

121
00:06:03,533 --> 00:06:06,633
so definitely
we will include credit score in our model.

122
00:06:07,366 --> 00:06:09,300
All right then we have geography.

123
00:06:09,300 --> 00:06:09,566
Well,

124
00:06:09,566 --> 00:06:13,766
maybe some customers are more likely
to leave the bank in one specific country.

125
00:06:13,766 --> 00:06:17,200
And that can be due to external factors
like the economy of the country

126
00:06:17,400 --> 00:06:18,566
or any other factors.

127
00:06:18,566 --> 00:06:19,800
But yes, definitely,

128
00:06:19,800 --> 00:06:22,266
there might be some correlations
between the countries

129
00:06:22,266 --> 00:06:24,833
and the decision
to stay or leave the bank.

130
00:06:24,833 --> 00:06:27,000
So we will include that as well. Then,

131
00:06:27,000 --> 00:06:29,000
Gender. Well, that's the same.

132
00:06:29,000 --> 00:06:33,266
Maybe men or women are more likely
to stay in the bank than the other.

133
00:06:33,266 --> 00:06:35,533
So we need to check it out. Then, age.

134
00:06:35,533 --> 00:06:36,600
Well that's the same.

135
00:06:36,600 --> 00:06:38,800
And that's even quite intuitive.

136
00:06:38,800 --> 00:06:42,366
We might expect that younger people
are more likely to leave the bank

137
00:06:42,666 --> 00:06:43,666
than older people,

138
00:06:43,666 --> 00:06:46,933
because older people have more balance
and have more stability.

139
00:06:47,233 --> 00:06:49,566
So we include age as well. Then, tenure.

140
00:06:49,566 --> 00:06:52,566
So tenure is for how long
the customer has been in the bank.

141
00:06:52,900 --> 00:06:53,900
And so that's the same.

142
00:06:53,900 --> 00:06:57,000
We might expect that customers
that have been in the bank for a long time

143
00:06:57,233 --> 00:07:00,300
are more likely to stay in the bank
than recent customers.

144
00:07:00,600 --> 00:07:03,566
So yes, we'll take it. Then, balance.

145
00:07:03,566 --> 00:07:07,666
Well, balance, of course, we might expect
that this customer with this

146
00:07:07,666 --> 00:07:11,200
high balance has a lot more chance
to stay in the bank

147
00:07:11,400 --> 00:07:16,133
than this customer with the zero balance.
All right, then the number of products.

148
00:07:16,133 --> 00:07:19,200
So that's the number of banking products
the customers have in the bank.

149
00:07:19,400 --> 00:07:23,066
And so, of course, maybe the customers
with many products in the bank

150
00:07:23,233 --> 00:07:26,200
are more likely to stay than customers
with, for example,

151
00:07:26,200 --> 00:07:28,766
one product in the bank.
So we'll need to check it out.

152
00:07:28,766 --> 00:07:29,833
Those are just assumptions.

153
00:07:29,833 --> 00:07:33,300
It's the model that will find out
about these correlations more thoroughly.

154
00:07:33,566 --> 00:07:37,200
But you know definitely from our intuition
we need to include

155
00:07:37,200 --> 00:07:39,100
number of products as well.

156
00:07:39,100 --> 00:07:40,500
Then HasCrCard.

157
00:07:40,500 --> 00:07:43,500
Well that's a little bit of the same
as this variable.

158
00:07:43,533 --> 00:07:46,000
Customers that have a credit card
might be more likely

159
00:07:46,000 --> 00:07:49,000
to stay in the bank than customers
that don't have a credit card.

160
00:07:49,033 --> 00:07:52,033
So yes. Is active member: that's the same.

161
00:07:52,033 --> 00:07:53,400
If a customer is active,

162
00:07:53,400 --> 00:07:56,533
then this customer is more likely
to stay in the bank than a customer

163
00:07:56,533 --> 00:07:57,633
that is not active.

164
00:07:57,633 --> 00:08:00,533
So yes, it might be a significant
independent variable.

165
00:08:00,533 --> 00:08:02,400
Then estimated salary.

166
00:08:02,400 --> 00:08:05,500
Well, that's the salary of the customer
estimated by the bank.

167
00:08:05,866 --> 00:08:09,733
And it would make sense that customers
with a high estimated salary

168
00:08:09,966 --> 00:08:13,833
have more chance to leave the bank than
customers with a low estimated salary.

169
00:08:14,133 --> 00:08:14,533
All right.

170
00:08:14,533 --> 00:08:17,733
So that was the last independent
variable of this data set.

171
00:08:18,000 --> 00:08:21,900
So now we know which independent variables
we include in our data set.

172
00:08:22,200 --> 00:08:26,166
And that's what we're going to specify
right now by updating our data set

173
00:08:26,333 --> 00:08:28,733
taking only the indexes
of the independent variables

174
00:08:28,733 --> 00:08:30,500
we want to include in the model.

175
00:08:30,500 --> 00:08:33,200
So let's see what these indexes are. Okay.

176
00:08:33,200 --> 00:08:35,266
So indexes in R start at one.

177
00:08:35,266 --> 00:08:37,900
And so basically
we're taking all the independent variables

178
00:08:37,900 --> 00:08:41,300
from credit score up to estimated salary.

179
00:08:41,666 --> 00:08:45,900
So let's see: index one, index two,
index three, index four.

180
00:08:45,900 --> 00:08:50,366
So we are taking the indexes 4, 5, 6, 7, 8, 9,

181
00:08:50,366 --> 00:08:53,733
10, 11, 12 and 13.

182
00:08:54,133 --> 00:08:54,566
All right.

183
00:08:54,566 --> 00:08:58,733
So we are taking the indexes from 4 to 14.

184
00:08:58,733 --> 00:09:01,800
Because, you know, in
R it's not like in Python, where we separate

185
00:09:01,800 --> 00:09:05,166
a matrix of features
and the dependent variable vector.

186
00:09:05,400 --> 00:09:07,800
We include all the variables
in one data frame.

187
00:09:07,800 --> 00:09:10,300
And so we include the dependent
variable. Great.

188
00:09:10,300 --> 00:09:12,533
So let's input these indexes.

189
00:09:12,533 --> 00:09:16,300
So we just said that
we want to take the indexes from four.

190
00:09:16,500 --> 00:09:19,433
So that's the index of the first
independent variable

191
00:09:19,433 --> 00:09:23,600
up to the index 14 which is the index
of the dependent variable.

192
00:09:24,366 --> 00:09:25,200
And that's great.

193
00:09:25,200 --> 00:09:29,666
Now we can update our data
set by selecting this line and execute.
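A minimal R sketch of the column selection just described. The three-row data frame is made up for illustration and only mirrors the 14-column layout from the walkthrough. The column names are assumed spellings of the spoken names. In the real session the data frame would come from reading the CSV file instead.

```r
# Made-up stand-in for the imported data: 3 rows, same 14-column layout.
dataset <- data.frame(
  RowNumber       = 1:3,
  CustomerId      = c(101, 102, 103),
  Surname         = c("Andrews", "Romeo", "Smith"),
  CreditScore     = c(619, 608, 502),
  Geography       = c("France", "Spain", "Germany"),
  Gender          = c("Female", "Male", "Female"),
  Age             = c(42, 41, 39),
  Tenure          = c(2, 1, 8),
  Balance         = c(0, 83807.86, 159660.80),
  NumOfProducts   = c(1, 1, 3),
  HasCrCard       = c(1, 0, 1),
  IsActiveMember  = c(1, 1, 0),
  EstimatedSalary = c(101348.88, 112542.58, 113931.57),
  Exited          = c(1, 0, 1)
)

# Keep columns 4 to 14: the ten retained predictors plus the dependent variable.
dataset <- dataset[4:14]
```

After this step the data frame has 11 columns, with the dependent variable in the last one.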

194
00:09:30,433 --> 00:09:30,933
Great.

195
00:09:30,933 --> 00:09:34,533
And now as you can see
if I go back to the dataset here,

196
00:09:34,733 --> 00:09:39,900
we have all our potentially statistically
significant independent variables

197
00:09:39,900 --> 00:09:43,933
that might have an impact
on the dependent variable exited.

198
00:09:44,133 --> 00:09:48,166
And so now the first step of data
pre-processing is completed.

199
00:09:48,433 --> 00:09:49,566
We correctly imported

200
00:09:49,566 --> 00:09:53,300
the data set by choosing
all the relevant independent variables.

201
00:09:54,066 --> 00:09:56,366
Okay.
Now let's move on to the second step.

202
00:09:56,366 --> 00:09:59,566
The second step
is encoding the target feature as factor.

203
00:10:00,000 --> 00:10:03,633
Well we don't really need to do that
because the dependent variable of our data

204
00:10:03,633 --> 00:10:07,800
set is a categorical variable
with a binary outcome 1 or 0.

205
00:10:08,100 --> 00:10:10,633
And the thing to understand
is that the package we're going to use

206
00:10:10,633 --> 00:10:15,066
is going to recognize it as a categorical
variable with a binary outcome.

207
00:10:15,300 --> 00:10:21,000
So we actually don't need to encode
this target feature exited as a factor.

208
00:10:21,000 --> 00:10:23,866
So I'm going to remove this line.
We don't need it.

209
00:10:23,866 --> 00:10:27,933
However we do need to do something
regarding some categorical variables.

210
00:10:28,200 --> 00:10:29,933
Of course I'm talking about

211
00:10:29,933 --> 00:10:33,700
the two categorical independent variables
we have in our data set.

212
00:10:34,000 --> 00:10:37,866
And these two variables are of course
geography and gender.

213
00:10:38,433 --> 00:10:40,166
So we have two problems here.

214
00:10:40,166 --> 00:10:42,566
So we need to do two things here
for these variables.

215
00:10:42,566 --> 00:10:46,000
The first thing we need to do
is to convert them into factors.

216
00:10:46,333 --> 00:10:48,166
And then we will need to do something

217
00:10:48,166 --> 00:10:51,733
more than we used to do
when encoding our categorical variables.

218
00:10:52,000 --> 00:10:54,600
It's to set them as numeric.

219
00:10:54,600 --> 00:10:57,433
And to do this
we'll use the as.numeric function.

220
00:10:57,433 --> 00:11:00,000
And why do we need to do this,
especially here?

221
00:11:00,000 --> 00:11:00,600
Well that is

222
00:11:00,600 --> 00:11:03,966
just because the deep learning package
that we're going to use requires it.

223
00:11:04,233 --> 00:11:09,333
And that's the only reason. It
expects the factors to be set as numeric,

224
00:11:09,466 --> 00:11:10,666
that is, as numeric vectors.

225
00:11:10,666 --> 00:11:13,800
So let's do this.
I'm going back to my ANN model.

226
00:11:14,133 --> 00:11:16,933
And so first we're going to change this
to say

227
00:11:16,933 --> 00:11:19,800
that we're encoding the categorical

228
00:11:21,100 --> 00:11:23,533
variables as factors.

229
00:11:23,533 --> 00:11:24,000
All right.

230
00:11:24,000 --> 00:11:25,566
And now we're going to take this

231
00:11:25,566 --> 00:11:29,633
categorical data file that we made in part
one data preprocessing.

232
00:11:30,033 --> 00:11:34,766
Because you know there is the code ready
to encode any categorical data.

233
00:11:35,100 --> 00:11:40,733
So I'm going to select all of this
and paste it here in this second

234
00:11:40,733 --> 00:11:45,000
step of data preprocessing to encode
the categorical variables as factors.

235
00:11:45,700 --> 00:11:47,000
All right. So let's do this.

236
00:11:47,000 --> 00:11:50,766
We just need to replace
the names of the variables and then add

237
00:11:50,900 --> 00:11:54,700
this as.numeric function
to set the factors as numeric.

238
00:11:55,000 --> 00:11:57,600
So let's start
by replacing all the names here.

239
00:11:57,600 --> 00:12:00,466
Well the first categorical variable
gives the countries

240
00:12:00,466 --> 00:12:02,000
but it is not called country.

241
00:12:02,000 --> 00:12:04,200
It is called geography.

242
00:12:04,200 --> 00:12:08,400
So we will replace here
country by geography.

243
00:12:09,233 --> 00:12:10,000
Same here.

244
00:12:12,900 --> 00:12:13,800
And the good

245
00:12:13,800 --> 00:12:17,866
news is that now we don't need
to change the names of the categories here

246
00:12:17,866 --> 00:12:21,333
France, Spain and Germany
because that's the same names.

247
00:12:21,333 --> 00:12:22,800
So that's great.

248
00:12:22,800 --> 00:12:25,433
And we will keep the labels 1, 2, 3.

249
00:12:25,433 --> 00:12:26,333
All right that's good.

250
00:12:26,333 --> 00:12:31,466
And now we add this as.numeric
function

251
00:12:31,800 --> 00:12:34,800
to set the factors as numeric.

252
00:12:34,833 --> 00:12:37,000
So I'm putting this whole factor function

253
00:12:37,000 --> 00:12:40,666
here inside the parentheses
of the as.numeric function.

254
00:12:41,100 --> 00:12:43,366
And now I just need to align everything.

255
00:12:43,366 --> 00:12:45,066
Well here we go.

256
00:12:45,066 --> 00:12:48,066
And same for here.

257
00:12:48,133 --> 00:12:49,533
All right. Great.

258
00:12:49,533 --> 00:12:52,533
And now let's do the same
for the second categorical variable.

259
00:12:52,633 --> 00:12:55,766
So we need to replace
purchased here by gender.

260
00:12:56,533 --> 00:12:57,700
So let's do it.

261
00:12:57,700 --> 00:13:00,700
Purchased replaced by gender.

262
00:13:01,433 --> 00:13:02,100
All right.

263
00:13:02,100 --> 00:13:05,000
Same here gender.

264
00:13:05,000 --> 00:13:08,400
And now we replace the two categories
no and yes by

265
00:13:08,700 --> 00:13:11,700
female and male.

266
00:13:12,266 --> 00:13:14,533
And here we can give the labels we want.

267
00:13:14,533 --> 00:13:19,366
So let's for example take labels
one for female and two for male.

268
00:13:19,966 --> 00:13:20,566
All right.

269
00:13:20,566 --> 00:13:23,566
And let's not forget to add the

270
00:13:24,000 --> 00:13:27,066
as.numeric function,
which, I remind you, we just

271
00:13:27,066 --> 00:13:30,066
do for the future deep learning package
that we're going to use.

272
00:13:30,133 --> 00:13:33,366
So parentheses here parentheses here.

273
00:13:33,366 --> 00:13:35,600
And now let's align everything.

274
00:13:35,600 --> 00:13:38,600
Here we go. All right. Great.

275
00:13:38,800 --> 00:13:40,066
So now everything is ready.

276
00:13:40,066 --> 00:13:44,133
This section is ready
that encodes as required

277
00:13:44,133 --> 00:13:47,800
by the deep learning package
our categorical independent variables.
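The encoding step above can be sketched in R as follows. The sample values are made up, and the capitalized level names ("Female", "Male") are an assumption about how the categories are spelled in the file. The pattern is the one described: a factor() call wrapped inside as.numeric().

```r
# Made-up sample values; the country names come from the walkthrough.
dataset <- data.frame(
  Geography = c("France", "Spain", "Germany", "France"),
  Gender    = c("Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# factor() maps each category to a label; as.numeric() then returns the
# underlying level codes (1, 2, 3) as a plain numeric vector.
dataset$Geography <- as.numeric(factor(dataset$Geography,
                                       levels = c("France", "Spain", "Germany"),
                                       labels = c(1, 2, 3)))
dataset$Gender <- as.numeric(factor(dataset$Gender,
                                    levels = c("Female", "Male"),
                                    labels = c(1, 2)))
```

One detail worth keeping in mind: as.numeric() on a factor returns the position of each level, not the label text. The two coincide here only because the labels are 1, 2, 3 in level order.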

278
00:13:48,233 --> 00:13:48,600
All right.

279
00:13:48,600 --> 00:13:51,066
So I'm going to select all this section
here.

280
00:13:51,066 --> 00:13:54,066
And let's execute.

281
00:13:54,166 --> 00:13:55,866
All right. Executed properly.

282
00:13:55,866 --> 00:13:59,600
Now let's have a look at the data
set to see what the variables became.

283
00:13:59,766 --> 00:14:00,366
Perfect.

284
00:14:00,366 --> 00:14:02,300
Geography was encoded

285
00:14:02,300 --> 00:14:06,400
into one, two and three categories
that are numeric categories.

286
00:14:06,766 --> 00:14:10,333
And the gender one for female
and two for male.

287
00:14:10,800 --> 00:14:13,066
Great and again as numeric vectors.

288
00:14:14,066 --> 00:14:14,666
Perfect.

289
00:14:14,666 --> 00:14:16,933
So this section is now completed.

290
00:14:16,933 --> 00:14:19,166
And let's move on to the next one.

291
00:14:19,166 --> 00:14:21,900
We can see how we're getting
very efficient at this.

292
00:14:21,900 --> 00:14:25,066
The next one is about splitting
the dataset into the training set

293
00:14:25,066 --> 00:14:26,100
and the test set.

294
00:14:26,100 --> 00:14:29,866
We need to do that because we will train
our artificial neural network

295
00:14:30,100 --> 00:14:33,900
on the training set, and we will test
its performance on the test set.

296
00:14:34,200 --> 00:14:35,133
So we'll do that.

297
00:14:35,133 --> 00:14:37,000
But let's not execute too fast.

298
00:14:37,000 --> 00:14:38,933
We need to replace purchased here

299
00:14:38,933 --> 00:14:42,600
by the name of the dependent variable,
which is exited.

300
00:14:44,000 --> 00:14:47,000
And maybe we can change the split
ratio as well.

301
00:14:47,033 --> 00:14:52,766
You know, put 80% for the training set
so that we have 8000 observations to train

302
00:14:52,766 --> 00:14:56,800
our artificial neural network
and 2000 observations to test

303
00:14:57,133 --> 00:14:59,833
its performance on new observations.

304
00:14:59,833 --> 00:15:02,333
That is,
the new observations of the test set.

305
00:15:02,333 --> 00:15:03,400
So now that's ready.

306
00:15:03,400 --> 00:15:05,133
We don't have to do anything more here.

307
00:15:05,133 --> 00:15:09,133
The most important thing is not to forget
to replace purchased by exited.

308
00:15:09,533 --> 00:15:15,200
And so now I'm going to select
all this section and execute. Perfect.

309
00:15:15,333 --> 00:15:19,633
Now we have our training
set and our test set.
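A rough base-R sketch of an 80/20 split like the one just executed. The transcript does not show the template's own splitting call, so this uses plain random sampling as a stand-in. The data frame, the seed, and the names train_rows, training_set and test_set are illustrative assumptions.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Made-up stand-in for the 10,000-row dataset.
dataset <- data.frame(
  CreditScore = sample(350:850, 10000, replace = TRUE),
  Exited      = rbinom(10000, size = 1, prob = 0.2)
)

# Draw 80% of the row indices at random for training.
train_rows   <- sample(nrow(dataset), size = round(0.8 * nrow(dataset)))
training_set <- dataset[train_rows, ]
test_set     <- dataset[-train_rows, ]
```

This gives 8000 training rows and 2000 test rows. Unlike a split made with respect to the dependent variable, simple random sampling does not guarantee the same proportion of exited customers in both sets.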

310
00:15:20,633 --> 00:15:21,366
Great.

311
00:15:21,366 --> 00:15:22,933
So that is the whole data set.

312
00:15:22,933 --> 00:15:25,933
That is our training set
with 8000 observations.

313
00:15:26,200 --> 00:15:29,400
And that is our test set
with 2000 observations.

314
00:15:30,000 --> 00:15:30,600
Perfect.

315
00:15:30,600 --> 00:15:32,466
Now let's go back to our ANN,

316
00:15:32,466 --> 00:15:35,900
and we are finally getting
to the last step of data preprocessing.

317
00:15:36,100 --> 00:15:38,100
And that is feature scaling.

318
00:15:38,100 --> 00:15:39,400
So now the question is

319
00:15:39,400 --> 00:15:43,333
do we need to apply feature scaling
to train an artificial neural network?

320
00:15:43,600 --> 00:15:45,600
And the answer is yes.

321
00:15:45,600 --> 00:15:46,600
Absolutely.

322
00:15:46,600 --> 00:15:49,300
That's 100% compulsory.

323
00:15:49,300 --> 00:15:50,900
And that is because training

324
00:15:50,900 --> 00:15:53,900
an artificial neural network
is highly compute intensive.

325
00:15:54,066 --> 00:15:56,333
So there is going to be
a lot of computations.

326
00:15:56,333 --> 00:15:58,566
And besides parallel computations.

327
00:15:58,566 --> 00:16:00,900
So definitely
we need to apply feature scaling.

328
00:16:00,900 --> 00:16:03,900
And besides it is required by the package.

329
00:16:03,900 --> 00:16:05,333
So we will execute this.

330
00:16:05,333 --> 00:16:08,433
But before that, let's not forget to change
the indexes.

331
00:16:08,833 --> 00:16:11,933
This index three here was the index

332
00:16:11,933 --> 00:16:14,933
of the dependent variable in Part
1, Data Preprocessing.

333
00:16:15,000 --> 00:16:17,966
So right now
we just need to replace this index three

334
00:16:17,966 --> 00:16:21,133
here by our new index
of the dependent variable.

335
00:16:21,566 --> 00:16:22,700
And so what is this index?

336
00:16:22,700 --> 00:16:25,533
That is the index of the exited column.

337
00:16:25,533 --> 00:16:27,600
Well we can see that directly here.

338
00:16:27,600 --> 00:16:30,500
This data set has 11 variables.

339
00:16:30,500 --> 00:16:34,233
So that means that the exited column here
has index 11.

340
00:16:34,833 --> 00:16:40,433
So let's replace
three here by 11 then here as well

341
00:16:40,433 --> 00:16:44,300
11, 11 and 11.

342
00:16:44,866 --> 00:16:45,500
Great.

343
00:16:45,500 --> 00:16:47,800
And now the feature scaling
section is ready.

344
00:16:47,800 --> 00:16:50,766
So let's select the whole section.

345
00:16:50,766 --> 00:16:53,700
And execute. Great.
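A small R sketch of the feature scaling step, assuming it uses base R's scale() on every column except index 11. That reading is consistent with the four index replacements just made, but the transcript does not show the code itself. The data and the helper make_set are made up for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Made-up stand-ins: 10 numeric predictors in columns 1-10, Exited in column 11.
make_set <- function(n) {
  predictors <- as.data.frame(matrix(rnorm(n * 10, mean = 50, sd = 10), ncol = 10))
  cbind(predictors, Exited = rbinom(n, size = 1, prob = 0.2))
}
training_set <- make_set(80)
test_set     <- make_set(20)

# Standardize every column except column 11 (the dependent variable).
training_set[-11] <- scale(training_set[-11])
test_set[-11]     <- scale(test_set[-11])
```

Each predictor column ends up with mean 0 and standard deviation 1 within its own set, while the exited column keeps its 0 and 1 values.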

346
00:16:53,700 --> 00:16:56,700
And now if we have a look
at our training set

347
00:16:57,000 --> 00:17:00,000
well well yes definitely
everything is scaled.

348
00:17:00,066 --> 00:17:01,833
And our test set same.

349
00:17:01,833 --> 00:17:03,600
Everything is definitely scaled.

350
00:17:03,600 --> 00:17:04,466
We are happy.

351
00:17:04,466 --> 00:17:09,000
We are ready to build
our artificial neural network.

352
00:17:09,300 --> 00:17:11,533
And that's what we're going to do
in the next tutorial.

353
00:17:11,533 --> 00:17:13,400
So I'm super excited to start.

354
00:17:13,400 --> 00:17:14,866
I look forward to seeing you there.

355
00:17:14,866 --> 00:17:16,733
And until then, enjoy machine learning.