1
00:00:00,133 --> 00:00:00,966
Hello, my friends.

2
00:00:00,966 --> 00:00:02,966
All right. Are you ready to start this

3
00:00:02,966 --> 00:00:06,700
big implementation
of your very first artificial brain?

4
00:00:06,866 --> 00:00:08,700
Well, I'm definitely ready.

5
00:00:08,700 --> 00:00:11,633
So let's do this.
Let's smash this together.

6
00:00:11,633 --> 00:00:14,633
All right, so we're going to start
with the data preprocessing phase,

7
00:00:14,633 --> 00:00:18,600
which we will tackle in this same tutorial
because we quickly

8
00:00:18,600 --> 00:00:20,400
want to get to the interesting stuff.

9
00:00:20,400 --> 00:00:23,933
So let's do this efficiently
thanks to our data

10
00:00:23,933 --> 00:00:27,400
preprocessing template
but also our data preprocessing toolkit.

11
00:00:27,733 --> 00:00:32,700
And therefore the first thing we're going
to do here is to import the libraries.

12
00:00:32,700 --> 00:00:35,500
So we're going to create a new code
cell here.

13
00:00:35,500 --> 00:00:39,866
We're going to go into our data
preprocessing template to steal the cell

14
00:00:39,866 --> 00:00:44,966
we want meaning this one
and get it back into our implementation.

15
00:00:44,966 --> 00:00:46,033
The first cell.

16
00:00:46,033 --> 00:00:46,366
All right.

17
00:00:46,366 --> 00:00:47,166
So that's the first thing.

18
00:00:47,166 --> 00:00:50,000
However
I just want to show you something extra.

19
00:00:50,000 --> 00:00:53,000
It is about the beauty of Google Colab.

20
00:00:53,033 --> 00:00:54,833
I want to show you that indeed

21
00:00:54,833 --> 00:00:59,233
TensorFlow 2.0 is already
pre-installed in Google Colab.

22
00:00:59,233 --> 00:01:03,033
You know, in any Google Colab notebook
you will ever open.

23
00:01:03,033 --> 00:01:08,133
So the way for me to show you
this is to first import TensorFlow,

24
00:01:08,133 --> 00:01:11,400
because okay,
it is already pre-installed as a library

25
00:01:11,400 --> 00:01:14,400
inside a notebook,
but we still need to import it.

26
00:01:14,600 --> 00:01:18,866
And in fact here, since
we actually won't use matplotlib,

27
00:01:19,066 --> 00:01:21,933
we're just going to delete
this import of this library

28
00:01:21,933 --> 00:01:24,866
and then just add as a third library here.

29
00:01:24,866 --> 00:01:26,300
Well TensorFlow.

30
00:01:26,300 --> 00:01:29,433
And the way to import TensorFlow
we start with indeed import

31
00:01:29,566 --> 00:01:32,700
then the name of the library
which is TensorFlow of course.

32
00:01:33,066 --> 00:01:36,066
And then we add a shortcut, a simple one,

33
00:01:36,066 --> 00:01:39,100
tf, the classic one, okay.

34
00:01:39,100 --> 00:01:41,933
And now I'm going to create a new code
cell.

35
00:01:41,933 --> 00:01:46,733
And inside
I'm going to enter the following: tf dot

36
00:01:47,066 --> 00:01:50,766
double underscore version
double underscore again.

37
00:01:51,000 --> 00:01:55,433
And this will simply print
the version of TensorFlow we're using

38
00:01:55,433 --> 00:01:56,766
which I just want to show you

39
00:01:56,766 --> 00:01:59,933
is indeed TensorFlow
2, right, the brand new TensorFlow.
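[Note] A minimal sketch of the import cell and the version check described here. The aliases np, pd and tf are the conventional ones; the try/except guard is an addition, assuming the cell might also be run outside Colab where TensorFlow is not installed.
```python
import numpy as np
import pandas as pd
try:
    import tensorflow as tf  # pre-installed in Colab, but it still has to be imported
    tf_version = tf.__version__  # a plain version string
except ImportError:
    tf_version = None  # assumption: running somewhere without TensorFlow
print(tf_version)
```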

40
00:02:00,300 --> 00:02:01,600
So let's do this.

41
00:02:01,600 --> 00:02:04,166
First
we need to execute this cell and this one.

42
00:02:04,166 --> 00:02:07,700
But remember if we execute this cell
now this will take some time

43
00:02:07,700 --> 00:02:10,566
because actually the notebook
is not running yet.

44
00:02:10,566 --> 00:02:13,933
And the way to run
this is just to click this folder here.

45
00:02:14,000 --> 00:02:14,500
Right.

46
00:02:14,500 --> 00:02:18,600
And it is now that it will connect
to a runtime to enable file browsing,

47
00:02:18,600 --> 00:02:21,600
but mostly to,
you know, start running the notebook.

48
00:02:21,600 --> 00:02:22,200
Okay.

49
00:02:22,200 --> 00:02:26,333
And at the same time let's take
that opportunity to upload the data set.

50
00:02:26,466 --> 00:02:30,200
So please go to your Machine Learning
A-Z Codes and Datasets folder

51
00:02:30,200 --> 00:02:31,233
which you had to download

52
00:02:31,233 --> 00:02:35,400
either in the previous tutorial or at the
beginning of each practical activity.

53
00:02:35,566 --> 00:02:39,566
So there we go inside Part 8,
Deep Learning, almost close to the end.

54
00:02:39,566 --> 00:02:43,166
By the way, you must be very excited
almost seeing the tip of the tunnel.

55
00:02:43,500 --> 00:02:46,633
Then let's go to this section
on Artificial Neural Networks.

56
00:02:46,866 --> 00:02:49,666
Let's go to Python and let's select
this dataset

57
00:02:49,666 --> 00:02:53,100
churn modeling dot CSV. Open. Okay.

58
00:02:53,300 --> 00:02:54,500
And there we go.

59
00:02:54,500 --> 00:02:56,366
Now we have everything.
We have the data set.

60
00:02:56,366 --> 00:02:59,233
And besides our notebook
is already running.

61
00:02:59,233 --> 00:03:02,033
As you can notice
it takes a little bit more time than usual

62
00:03:02,033 --> 00:03:05,800
because you know the data set is this time
more real world and mostly bigger.

63
00:03:05,800 --> 00:03:06,900
Okay. All right.

64
00:03:06,900 --> 00:03:07,500
So let's do this.

65
00:03:07,500 --> 00:03:10,733
Let's run this first cell here
to import the libraries.

66
00:03:10,733 --> 00:03:13,500
This time numpy pandas and TensorFlow.

67
00:03:13,500 --> 00:03:18,466
And now let's play this cell
to indeed reassure ourselves

68
00:03:18,466 --> 00:03:23,700
that the TensorFlow version
we're going to be working with is 2.2.0.

69
00:03:23,700 --> 00:03:28,400
Basically TensorFlow two, which is so much
better than TensorFlow one.

70
00:03:28,400 --> 00:03:30,666
I'm so happy about their new version.

71
00:03:30,666 --> 00:03:31,466
Okay great.

72
00:03:31,466 --> 00:03:33,200
So it's good to have the confirmation.

73
00:03:33,200 --> 00:03:36,600
And now let's tackle this part
one, data preprocessing.

74
00:03:36,600 --> 00:03:37,333
We're going to do this

75
00:03:37,333 --> 00:03:41,066
efficiently thanks to our data
preprocessing template and toolkit.

76
00:03:41,233 --> 00:03:42,266
So let's create first

77
00:03:42,266 --> 00:03:46,066
a new code cell to import the data set
which we already have in a notebook.

78
00:03:46,366 --> 00:03:47,166
Perfect.

79
00:03:47,166 --> 00:03:49,600
So let's go to our data
preprocessing template.

80
00:03:49,600 --> 00:03:54,533
And let's now steal this second cell
to import the data set.

81
00:03:54,800 --> 00:03:55,466
All right.

82
00:03:55,466 --> 00:03:58,233
Let's go back
here. And let's paste that inside.

83
00:03:58,233 --> 00:04:01,800
And now of course the question is
what do we need to replace.

84
00:04:01,966 --> 00:04:06,000
Well the obvious first change
we need to do is the name of the data set,

85
00:04:06,000 --> 00:04:10,466
which this time is not data
dot CSV but churn

86
00:04:10,900 --> 00:04:14,666
underscore modeling dot CSV. Great.

87
00:04:15,000 --> 00:04:17,366
Then let's look at the rows one by one.

88
00:04:17,366 --> 00:04:19,200
This one is okay.

89
00:04:19,200 --> 00:04:21,000
Now what about this one?

90
00:04:21,000 --> 00:04:24,833
This line of code creates
the matrix of features X, and the way it does

91
00:04:24,833 --> 00:04:28,066
this is it takes all the columns
except the last one.

92
00:04:28,466 --> 00:04:32,200
But let's actually have a look again
at our data set.

93
00:04:32,466 --> 00:04:33,300
Right.

94
00:04:33,300 --> 00:04:38,100
We noticed when I described to you
the data set that the first columns

95
00:04:38,100 --> 00:04:41,433
are actually irrelevant in the sense
that they will not help

96
00:04:41,600 --> 00:04:44,466
to predict the outcome
of the dependent variable.

97
00:04:44,466 --> 00:04:49,733
And these columns, you know, the non
helpful columns, are obviously this one.

98
00:04:49,733 --> 00:04:52,133
This just gives the row number
of this data set.

99
00:04:52,133 --> 00:04:56,233
So we clearly don't want to include it
then customer ID as well.

100
00:04:56,233 --> 00:04:56,833
Right.

101
00:04:56,833 --> 00:05:00,700
The customer ID is just a key
identifier of each customer.

102
00:05:00,700 --> 00:05:04,966
Because you know each row
corresponds to a different customer.

103
00:05:04,966 --> 00:05:06,233
So of course the customer

104
00:05:06,233 --> 00:05:10,300
ID has absolutely no impact
on the dependent variable exited.

105
00:05:10,466 --> 00:05:13,633
So we will also exclude that column.

106
00:05:13,633 --> 00:05:14,200
We don't have to.

107
00:05:14,200 --> 00:05:16,933
You know, the neural network
will just figure it out.

108
00:05:16,933 --> 00:05:20,700
But let's just ease the learning process
of our future neural network.

109
00:05:20,700 --> 00:05:21,133
Right.

110
00:05:21,133 --> 00:05:23,100
We're all on the same boat here.

111
00:05:23,100 --> 00:05:25,066
Okay. Then what about the surname?

112
00:05:25,066 --> 00:05:27,133
Does the surname have an impact on

113
00:05:27,133 --> 00:05:30,900
whether the customer
will stay in or leave the bank?

114
00:05:30,900 --> 00:05:32,400
Well absolutely not.

115
00:05:32,400 --> 00:05:32,900
Right.

116
00:05:32,900 --> 00:05:36,766
Surname of course, has no impact
on the decision of a customer

117
00:05:36,766 --> 00:05:38,600
to stay in or leave the bank.

118
00:05:38,600 --> 00:05:41,333
So we will also exclude this column

119
00:05:41,333 --> 00:05:45,166
and then all the rest, you know,
all the other features here look fine.

120
00:05:45,166 --> 00:05:46,800
They might have an impact

121
00:05:46,800 --> 00:05:50,400
on the dependent variable,
meaning they might help to predict

122
00:05:50,400 --> 00:05:54,866
if each customer will stay in the bank
or leave the bank.

123
00:05:54,966 --> 00:05:59,566
Okay, so we will definitely keep all
the other ones, meaning all the features

124
00:05:59,700 --> 00:06:00,833
starting from this one.

125
00:06:00,833 --> 00:06:05,233
the credit score.
And so here in our implementation,

126
00:06:05,466 --> 00:06:10,066
instead of taking all the columns
except the last one, well we will take

127
00:06:10,066 --> 00:06:14,300
all the columns starting from this one
except the last one,

128
00:06:14,566 --> 00:06:18,333
meaning all the columns from credit
score up to estimated salary.

129
00:06:18,633 --> 00:06:23,400
And the way to do this is still to keep
that upper bound of the range.

130
00:06:23,400 --> 00:06:26,633
You know,
finishing at the one before last column.

131
00:06:26,833 --> 00:06:29,233
Right?
You know, that's exactly the upper bound.

132
00:06:29,233 --> 00:06:30,466
That's the range.

133
00:06:30,466 --> 00:06:33,966
But at the left of this range
we won't specify

134
00:06:33,966 --> 00:06:36,966
nothing, which means the first column,
the first index.

135
00:06:37,200 --> 00:06:41,700
But instead we will specify
the index of the column

136
00:06:41,800 --> 00:06:44,700
we want to start with
which is the credit score.

137
00:06:44,700 --> 00:06:46,366
Right. We know we want to start from here.

138
00:06:46,366 --> 00:06:49,833
And therefore now the question is
what is the index of that column.

139
00:06:50,033 --> 00:06:52,966
Well let's see indexes in Python
start from zero.

140
00:06:52,966 --> 00:06:54,900
So this has index zero.

141
00:06:54,900 --> 00:06:56,433
Then this has index one.

142
00:06:56,433 --> 00:06:59,200
This has index two
and this has index three.

143
00:06:59,200 --> 00:07:03,433
And therefore here instead of specifying
nothing here as a lower

144
00:07:03,433 --> 00:07:06,866
bound of the range,
well we will specify the index three

145
00:07:07,066 --> 00:07:11,400
so that we can take all the columns
starting from the column of index three

146
00:07:11,633 --> 00:07:16,266
up to the one before last, and taking all
the rows, all the values of the data set.

147
00:07:16,533 --> 00:07:19,766
And this will create a relevant
matrix of features.

148
00:07:20,266 --> 00:07:22,666
Perfect. So this line of code is done.

149
00:07:22,666 --> 00:07:24,200
Now what about the next one.

150
00:07:24,200 --> 00:07:26,466
Well obviously the next one is fine.

151
00:07:26,466 --> 00:07:30,000
It will just take the last column
of this data set, which is exactly

152
00:07:30,000 --> 00:07:33,766
what we want
for dependent variable exited.

153
00:07:34,066 --> 00:07:36,300
So all good here. Nothing to change.
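[Note] A minimal sketch of the slicing just described, using pandas iloc. The rows and the reduced set of columns below are made up for illustration; in the notebook the call would be pd.read_csv on the uploaded churn modeling CSV file rather than on an in-memory string.
```python
import io
import pandas as pd
# Made-up rows with a reduced, hypothetical set of columns (not the real data).
csv_text = """RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,EstimatedSalary,Exited
1,1001,Smith,600,France,Female,50000.0,1
2,1002,Jones,650,Spain,Female,60000.0,0
3,1003,Brown,700,Germany,Male,70000.0,1
"""
dataset = pd.read_csv(io.StringIO(csv_text))
X = dataset.iloc[:, 3:-1].values  # all rows, columns from index 3 up to (not including) the last
y = dataset.iloc[:, -1].values    # all rows, last column only (Exited)
print(X)
print(y)
```
The lower bound 3 skips the row number, customer ID and surname columns; the upper bound -1 leaves the dependent variable out of X.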

154
00:07:36,300 --> 00:07:40,166
We can just play the cell 
and we will have our data set,

155
00:07:40,366 --> 00:07:43,366
our matrix of features
and our dependent variable vector.

156
00:07:43,466 --> 00:07:44,200
Let's check it out.

157
00:07:44,200 --> 00:07:47,466
Actually let's create
two new code cells right.

158
00:07:47,666 --> 00:07:51,500
One where we will print
the matrix of features x,

159
00:07:51,700 --> 00:07:56,400
and one where we will print the dependent
variable vector y.

160
00:07:56,733 --> 00:07:57,800
Perfect. All right.

161
00:07:57,800 --> 00:07:58,600
So let's do this now

162
00:07:58,600 --> 00:08:03,466
actually let's play first this cell
to print the matrix of features x.

163
00:08:03,466 --> 00:08:04,466
And there we go.

164
00:08:04,466 --> 00:08:08,066
We have indeed all the features
starting from the credit score.

165
00:08:08,100 --> 00:08:09,500
This is a credit score.

166
00:08:09,500 --> 00:08:13,833
Then you know the country of residence
and then the gender and all the other ones

167
00:08:13,833 --> 00:08:16,500
you know has credit card.
Yes or no is active.

168
00:08:16,500 --> 00:08:19,666
And the last one is the estimated salary.

169
00:08:19,800 --> 00:08:22,333
Okay.
So we have all these features. Perfect.

170
00:08:22,333 --> 00:08:23,333
And of course we don't have

171
00:08:23,333 --> 00:08:27,466
the dependent variable values
because they are right here in Y.

172
00:08:27,833 --> 00:08:29,100
And there we go.

173
00:08:29,100 --> 00:08:33,400
These are all the decisions of the
customers to stay in or leave the bank.

174
00:08:33,400 --> 00:08:36,600
So of course this one here
corresponds to this customer here

175
00:08:36,900 --> 00:08:40,500
which obviously has decided
to leave the bank.

176
00:08:40,500 --> 00:08:40,800
Right.

177
00:08:40,800 --> 00:08:44,166
This is actually this
same one here exited one.

178
00:08:44,633 --> 00:08:48,033
And then well, this second customer

179
00:08:48,033 --> 00:08:51,733
has decided to stay in the bank
and corresponds to this one.

180
00:08:51,933 --> 00:08:52,366
Right.

181
00:08:52,366 --> 00:08:55,133
Which is exactly this one as well. Okay.

182
00:08:55,133 --> 00:08:56,666
This customer.

183
00:08:56,666 --> 00:08:56,966
All right.

184
00:08:56,966 --> 00:08:58,133
So all good so far.

185
00:08:58,133 --> 00:09:02,300
First step of the data
preprocessing phase done successfully.

186
00:09:02,433 --> 00:09:07,000
And now let's move on to the more advanced
steps of our data preprocessing phase

187
00:09:07,000 --> 00:09:10,000
which is about encoding
the categorical data.

188
00:09:10,100 --> 00:09:10,666
Right.

189
00:09:10,666 --> 00:09:14,500
Of course we noticed
that there are two categorical variables.

190
00:09:14,500 --> 00:09:18,700
This first one giving the country
of residence of the customers,

191
00:09:18,700 --> 00:09:21,700
and the second one
giving the gender of the customers.

192
00:09:21,866 --> 00:09:26,833
So we'll have to do some encoding work
here to encode these categorical data

193
00:09:26,833 --> 00:09:30,633
into either simple labels,
you know, zero and one for the gender

194
00:09:30,866 --> 00:09:34,933
or some one hot
encoding for this categorical variable

195
00:09:34,933 --> 00:09:39,566
in which indeed there is no order
relationship between these values, you know,

196
00:09:39,600 --> 00:09:42,300
between these categories
France, Spain and Germany.

197
00:09:42,300 --> 00:09:43,800
Okay. So let's do this.

198
00:09:43,800 --> 00:09:47,700
Let's start first with the label
encoding of the gender column.

199
00:09:47,700 --> 00:09:49,500
So let's create a new code cell.

200
00:09:49,500 --> 00:09:53,400
And now of course to do it efficiently
we're going to go into our data

201
00:09:53,400 --> 00:09:54,700
preprocessing toolkit.

202
00:09:54,700 --> 00:09:57,266
We're going to scroll down to find.

203
00:09:57,266 --> 00:09:59,700
By the way
there is no missing data in the data set.

204
00:09:59,700 --> 00:10:03,133
I checked them and in reality
you would also have to check them.

205
00:10:03,300 --> 00:10:04,000
But all good.

206
00:10:04,000 --> 00:10:05,600
We don't have to take care of any

207
00:10:05,600 --> 00:10:09,633
missing data so we can directly move
to encoding categorical data.

208
00:10:09,633 --> 00:10:15,233
And now since we're taking care of label
encoding the gender column,

209
00:10:15,366 --> 00:10:16,866
well we're going to take this.

210
00:10:16,866 --> 00:10:20,633
That's exactly the tool
we need to perform label encoding.

211
00:10:20,633 --> 00:10:22,133
So I'm stealing this code cell.

212
00:10:22,133 --> 00:10:25,466
Now I'm
adding it inside our notebook here.

213
00:10:25,466 --> 00:10:26,766
Our implementation.

214
00:10:26,766 --> 00:10:30,300
But remember that in our data
preprocessing toolkit we did this

215
00:10:30,300 --> 00:10:32,300
on the dependent variable vector.

216
00:10:32,300 --> 00:10:37,800
But now we want to do it on this specific
column of the matrix of features x.

217
00:10:37,800 --> 00:10:43,000
And therefore what we only need
to replace here is this y by that specific

218
00:10:43,000 --> 00:10:47,400
column of the matrix of features x
to which we want to apply label encoding.

219
00:10:47,700 --> 00:10:48,200
And so.

220
00:10:48,200 --> 00:10:51,200
Well now the question is
how can we get this column.

221
00:10:51,200 --> 00:10:55,366
Well we just need to get the index
and then call x with that index.

222
00:10:55,366 --> 00:10:56,900
And so well there we go.

223
00:10:56,900 --> 00:10:59,700
That's the first column of x.
It has index zero.

224
00:10:59,700 --> 00:11:02,266
That's the second column of x.
It has index one.

225
00:11:02,266 --> 00:11:05,266
And that's the third column of x
which has index two.

226
00:11:05,600 --> 00:11:08,600
And therefore here
we simply need to replace y

227
00:11:08,833 --> 00:11:13,766
by our matrix of features x
of which we're going to take all the rows.

228
00:11:13,766 --> 00:11:18,000
And I'm taking them with this colon,
you know, which means a range in Python.

229
00:11:18,200 --> 00:11:21,700
And then to take the column we want
meaning the gender column which has

230
00:11:21,700 --> 00:11:22,866
index two.

231
00:11:22,866 --> 00:11:27,166
Well I just need to add here
after the comma the index two

232
00:11:27,266 --> 00:11:30,900
so that it will take all the rows
but only the column of index two.

233
00:11:31,200 --> 00:11:33,500
And now of course we need to take this.

234
00:11:33,500 --> 00:11:39,200
And paste that inside the fit transform
method called from our object,

235
00:11:39,200 --> 00:11:42,200
which is an instance of the label encoder
class.

236
00:11:42,366 --> 00:11:43,333
And done.

237
00:11:43,333 --> 00:11:47,766
We just performed successfully label
encoding to the gender column

238
00:11:47,766 --> 00:11:49,266
of our matrix of features x.

239
00:11:49,266 --> 00:11:53,033
Let's make sure it's the case
by creating a new code cell here.

240
00:11:53,033 --> 00:11:57,400
And do a
new print of the matrix of features X.

241
00:11:57,766 --> 00:12:00,933
Let's run the cell you know, first.

242
00:12:01,366 --> 00:12:03,133
All right. Good.

243
00:12:03,133 --> 00:12:04,500
And now let's print X.

244
00:12:04,500 --> 00:12:06,533
And let's
just make sure that we no longer see

245
00:12:06,533 --> 00:12:09,000
female, female, female, female,
male, female.

246
00:12:09,000 --> 00:12:12,833
But whatever encoding there was,
which probably will be one

247
00:12:12,833 --> 00:12:16,933
for female or zero for female
and zero for male or one for male.

248
00:12:16,933 --> 00:12:18,166
Let's see what they did.

249
00:12:18,166 --> 00:12:20,000
All right. And there we go. Right.

250
00:12:20,000 --> 00:12:24,300
That's the new column
after the label encoding and so female

251
00:12:24,300 --> 00:12:28,266
was encoded into zero
and male was encoded into one.

252
00:12:28,266 --> 00:12:31,766
That's of course a random decision
of the machine to choose these

253
00:12:31,766 --> 00:12:32,966
integers associated.

254
00:12:32,966 --> 00:12:34,333
And so all good.

255
00:12:34,333 --> 00:12:37,200
Now this column is well label encoded.
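[Note] A minimal sketch of the label encoding step, assuming the toolkit cell uses scikit-learn's LabelEncoder as the narration suggests. The three rows are made up. One detail worth knowing: LabelEncoder numbers the categories in sorted order, which is consistent with Female becoming 0 and Male becoming 1.
```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
# Made-up rows: credit score, geography, gender (so gender sits at index 2).
X = np.array([[600, "France", "Female"],
              [650, "Spain", "Female"],
              [700, "Germany", "Male"]], dtype=object)
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])  # all rows (the colon), column of index 2 only
print(X)
print(le.classes_)  # categories in sorted order; position = assigned integer
```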

256
00:12:37,200 --> 00:12:42,700
And now we're going to proceed to the one
hot encoding of the geography column.

257
00:12:42,900 --> 00:12:45,400
And this time we have indeed to perform
one hot encoding

258
00:12:45,400 --> 00:12:49,200
because there is no order relationship
between France, Spain and Germany.

259
00:12:49,200 --> 00:12:52,066
So we couldn't,
you know, encode France into zero.

260
00:12:52,066 --> 00:12:54,200
Then Spain into one and Germany into two.

261
00:12:54,200 --> 00:12:56,833
We have to perform
one hot encoding instead.

262
00:12:56,833 --> 00:12:58,300
And so let's do this.

263
00:12:58,300 --> 00:13:01,433
Let's go back to our data
preprocessing toolkit.

264
00:13:01,766 --> 00:13:05,233
Let's take that cell this time,
which is exactly the

265
00:13:05,233 --> 00:13:08,533
cell that performs one hot encoding.

266
00:13:08,866 --> 00:13:12,800
And let's paste it inside a new code cell

267
00:13:13,166 --> 00:13:16,166
to one hot encode the geography column.

268
00:13:16,666 --> 00:13:17,333
All right.

269
00:13:17,333 --> 00:13:21,666
Now the question is of course
what do we have to replace or change

270
00:13:21,666 --> 00:13:26,833
in that cell to indeed perform
one hot encoding on the geography column?

271
00:13:27,100 --> 00:13:29,400
Well, remember,
the only thing that you have to

272
00:13:29,400 --> 00:13:33,933
change inside
this code is that index of the column

273
00:13:33,933 --> 00:13:36,933
you want to apply
one hot encoding on, right?

274
00:13:37,133 --> 00:13:42,233
And remember that in our data CSV
file of our part one, data preprocessing,

275
00:13:42,266 --> 00:13:44,000
well, the categorical variable

276
00:13:44,000 --> 00:13:46,400
with the three different states
was in the first column.

277
00:13:46,400 --> 00:13:48,600
That's why we had index zero here.

278
00:13:48,600 --> 00:13:52,300
But this time
this column is actually the second column.

279
00:13:52,300 --> 00:13:53,833
Therefore it has index one.

280
00:13:53,833 --> 00:13:59,966
And therefore very simply we just need
to replace zero here by one okay.

281
00:14:00,033 --> 00:14:00,933
And that's it.

282
00:14:00,933 --> 00:14:03,166
All the rest will be done automatically.

283
00:14:03,166 --> 00:14:06,166
Let me show
you this. Let's play that cell.

284
00:14:06,266 --> 00:14:10,533
And now let's create
a new code cell to print again X.

285
00:14:11,033 --> 00:14:12,266
All right. Good.

286
00:14:12,266 --> 00:14:16,733
And now let's play
that cell and see what x has become.

287
00:14:17,033 --> 00:14:20,600
And indeed well remember
when we perform one hot encoding.

288
00:14:20,600 --> 00:14:21,800
Well the dummy variables

289
00:14:21,800 --> 00:14:25,433
are actually moved to the first columns
of the matrix of features.

290
00:14:25,433 --> 00:14:28,966
We have them exactly here
you know in the three first columns.

291
00:14:29,200 --> 00:14:30,333
So let's see.

292
00:14:30,333 --> 00:14:32,666
Let's
see how the one hot encoding was done.

293
00:14:32,666 --> 00:14:35,400
This is the first combination

294
00:14:35,400 --> 00:14:38,533
of dummy variables
which corresponds to France.

295
00:14:38,533 --> 00:14:40,633
You know these are the same rows here.

296
00:14:40,633 --> 00:14:45,100
And therefore France
was encoded into 1, 0, 0.

297
00:14:45,533 --> 00:14:50,100
Now Spain was encoded into 0, 0, 1.

298
00:14:50,400 --> 00:14:54,166
And finally Germany was encoded into

299
00:14:54,433 --> 00:14:57,833
well, this one: 0, 1, 0. Okay.

300
00:14:58,066 --> 00:15:00,066
So that's all one hot encoding.
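[Note] A minimal sketch of the one hot encoding step, assuming the toolkit cell combines scikit-learn's ColumnTransformer and OneHotEncoder; the video does not show the exact code, so the object names and the sparse_threshold setting are illustrative choices. The rows are made up. The encoder orders the categories alphabetically (France, Germany, Spain), which matches the three combinations read out above.
```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Made-up rows: credit score, geography, already label-encoded gender.
X = np.array([[600, "France", 0],
              [650, "Spain", 0],
              [700, "Germany", 1]], dtype=object)
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],  # [1] = index of the geography column
    remainder="passthrough",  # keep the other columns unchanged
    sparse_threshold=0,       # always return a dense array
)
X = np.array(ct.fit_transform(X))
print(X)  # dummy columns come first, then the remaining columns
```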

301
00:15:00,066 --> 00:15:02,400
Then we no longer see the gender column.

302
00:15:02,400 --> 00:15:03,933
But no worries it is still here.

303
00:15:03,933 --> 00:15:04,933
And so perfect.

304
00:15:04,933 --> 00:15:09,800
One hot encoding was not only done
successfully, but also was done efficiently

305
00:15:09,833 --> 00:15:13,033
thanks to our data
preprocessing toolkit and template.

306
00:15:13,633 --> 00:15:14,100
Good.

307
00:15:14,100 --> 00:15:16,500
Now let's move on to the next step,
which is to split

308
00:15:16,500 --> 00:15:19,166
the data set into the training set
and the test set.

309
00:15:19,166 --> 00:15:22,166
And once again we're going to do that
so efficiently

310
00:15:22,200 --> 00:15:25,200
thanks to this time
our data preprocessing template.

311
00:15:25,400 --> 00:15:29,133
Indeed we have to steal now
this cell that splits the data

312
00:15:29,133 --> 00:15:31,700
set into the training set
and the test set.

313
00:15:31,700 --> 00:15:35,133
So let's paste it back into our implementation

314
00:15:35,133 --> 00:15:38,333
in a new code cell right here.

315
00:15:38,666 --> 00:15:40,766
And now we can just trust this 100%.

316
00:15:40,766 --> 00:15:42,233
We will just play that cell.
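[Note] A minimal sketch of the split cell, assuming it relies on scikit-learn's train_test_split. The arrays are made-up stand-ins for the preprocessed X and y, and the test_size and random_state values are assumptions, since the video does not state them.
```python
import numpy as np
from sklearn.model_selection import train_test_split
# Made-up stand-ins for the preprocessed X and y: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
# test_size=0.2 and random_state=0 are assumed values, not stated in the video.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```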

317
00:15:42,233 --> 00:15:45,300
And we don't have to do
a print of these four entities.

318
00:15:45,300 --> 00:15:48,733
We perfectly understand how they work,
but feel free to do it if you want.

319
00:15:48,966 --> 00:15:52,833
You're free to do any modification
in this copy, of course, of the notebook.

320
00:15:53,466 --> 00:15:56,000
And finally, we have a final step

321
00:15:56,000 --> 00:15:59,000
of our data preprocessing phase,
which is feature scaling.

322
00:15:59,100 --> 00:16:02,100
And now I want to say something very,
very important.

323
00:16:02,100 --> 00:16:06,466
Feature scaling is absolutely compulsory
for deep learning.

324
00:16:06,466 --> 00:16:11,200
Whenever you build an artificial neural
network you have to apply feature scaling.

325
00:16:11,200 --> 00:16:13,133
That's absolutely fundamental.

326
00:16:13,133 --> 00:16:17,200
And it is so fundamental
that we will actually apply feature

327
00:16:17,200 --> 00:16:21,033
scaling to all our features, you know,
regardless of whether they already

328
00:16:21,033 --> 00:16:22,700
have some values of zero and one.

329
00:16:22,700 --> 00:16:25,133
You know, like the dummy variables.
And same for these ones.

330
00:16:25,133 --> 00:16:28,900
We will just scale everything
because it is so important to do it

331
00:16:29,000 --> 00:16:30,366
for deep learning.

332
00:16:30,366 --> 00:16:33,833
So the feature scaling step here
will be very simple.

333
00:16:34,000 --> 00:16:37,000
We will just take our data
preprocessing toolkit.

334
00:16:37,100 --> 00:16:40,700
We will go right at the end
because I think this is our last tool.

335
00:16:40,700 --> 00:16:41,833
Yes there we go.

336
00:16:41,833 --> 00:16:46,800
We will take that full cell
and we will paste it right back

337
00:16:46,800 --> 00:16:51,233
in a new code cell just below
feature scaling. We'll paste it here.

338
00:16:51,233 --> 00:16:53,266
And now instead of selecting

339
00:16:53,266 --> 00:16:56,500
some specific indexes
here, we'll just take everything.

340
00:16:56,500 --> 00:17:01,266
So I'm just removing
all our index selections here right.

341
00:17:01,500 --> 00:17:03,566
So that we can just scale everything.

342
00:17:03,566 --> 00:17:06,800
And that's the way
it should be for neural network

343
00:17:06,800 --> 00:17:09,800
you know for building
and training a neural network.

344
00:17:10,133 --> 00:17:11,400
All right. So perfect.

345
00:17:11,400 --> 00:17:14,466
This will just apply feature scaling
to all the features

346
00:17:14,466 --> 00:17:18,066
of both the training set and the test set.

347
00:17:18,100 --> 00:17:22,566
But of course our scaler
object is only fitted to the training set.

348
00:17:22,700 --> 00:17:23,100
Right.

349
00:17:23,100 --> 00:17:26,633
Remember it's to avoid information leakage
that doesn't change.
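[Note] A minimal sketch of the feature scaling step, assuming the toolkit cell uses scikit-learn's StandardScaler. The numbers are made up. The point it illustrates is the one made above: the scaler is fitted on the training set only, and the test set is transformed with those same statistics.
```python
import numpy as np
from sklearn.preprocessing import StandardScaler
# Made-up training and test matrices with two features on very different scales.
X_train = np.array([[600.0, 50000.0], [650.0, 60000.0], [700.0, 70000.0]])
X_test = np.array([[650.0, 40000.0]])
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # mean and std are learned from the training set only
X_test = sc.transform(X_test)        # the test set reuses those training statistics
print(X_train.mean(axis=0), X_train.std(axis=0))
print(X_test)
```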

350
00:17:26,633 --> 00:17:27,733
But there you go.

351
00:17:27,733 --> 00:17:31,666
Now we have the code to perform feature
scaling already.

352
00:17:31,666 --> 00:17:32,600
So let's do this.

353
00:17:32,600 --> 00:17:39,233
Let's run this final cell and then
the data preprocessing phase will be over.

354
00:17:39,733 --> 00:17:43,833
So congratulations I hope we did it
efficiently enough for you.

355
00:17:43,866 --> 00:17:45,200
That's the way it should be.

356
00:17:45,200 --> 00:17:48,300
I'd like to remind, by the way, that,
you know, the data preprocessing

357
00:17:48,300 --> 00:17:52,800
phase counts
for 70% of the work of a data scientist.

358
00:17:53,000 --> 00:17:56,833
So that's why it was really important
for me to give you some very efficient

359
00:17:56,866 --> 00:18:00,766
data preprocessing template and toolkit
so that, as you can see, we can do it

360
00:18:00,766 --> 00:18:04,500
efficiently in less than 20 minutes,
you know, in less than 20 minutes.

361
00:18:04,500 --> 00:18:06,966
With my explanation,
but without the explanation,

362
00:18:06,966 --> 00:18:08,666
even in less than ten minutes.

363
00:18:08,666 --> 00:18:11,200
So I hope you understand
and appreciate the importance.

364
00:18:11,200 --> 00:18:12,333
And now, my friends,

365
00:18:12,333 --> 00:18:16,000
it is time for the exciting step,
the exciting part of this implementation.

366
00:18:16,000 --> 00:18:19,500
I'm talking
of course, about part two, building the ANN.

367
00:18:19,500 --> 00:18:20,666
So there we go.

368
00:18:20,666 --> 00:18:23,966
Recharge yourself with good energy,
and as soon as you're ready,

369
00:18:23,966 --> 00:18:27,066
let's tackle together
part two, where we're going to build

370
00:18:27,066 --> 00:18:32,100
for the first time an artificial brain
leveraging TensorFlow 2.0.

371
00:18:32,500 --> 00:18:34,633
I can't wait to see you
in the next tutorial.

372
00:18:34,633 --> 00:18:36,533
And until then, enjoy machine learning.