In this lesson we're going to preprocess our data so that it's easier to feed it into our neural network. Let's add a markdown cell in our notebook that reads "Preprocess Data".

One of the things we're going to do is change the kind of numbers being fed into our neural network, because at the moment, if we look at x_train_all and, say, the very first entry, we get an array. If we drill down a little further, one level deeper, we see that an array is nested inside another array. And if we drill deeper still, down to a particular pixel, we can see the three values stored for that pixel: the red, green, and blue values.

Now, if I want to look at this number 59 in isolation, some four levels down, and I check its type, I can see that it's an integer. But it's a uint8. What does that mean? It means it's an 8-bit unsigned integer, and an unsigned integer is simply a fancy name for a number that can't be negative. So 1984 would be an unsigned integer, but negative 10 would not be. Why? Because there's a sign in front of the ten: the negative sign. Unsigned just means no minus sign.

If I take my training dataset, x_train_all, and divide the whole thing by 255.0, I accomplish two things. First, I make all of these numbers a lot smaller. I know that 255 is the largest value I'm going to have, because on the RGB scale each of these sliders only goes up to 255. So dividing by 255 means that all the values inside x_train_all will be between zero and one. The second thing this accomplishes is a conversion from an integer to a float, a decimal number. A float is simply a number that has a decimal point and some digits after it; floating point numbers are what you'll encounter whenever you're doing some sort of scientific calculation.
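A rough sketch of this inspection in a notebook cell. The shapes quoted in the lesson (50,000 images of 32 by 32 pixels with 3 color channels) match Keras's built-in CIFAR-10 loader, so I'm using that here as a stand-in for however the course actually loads the data:

```python
from tensorflow.keras.datasets import cifar10

# Stand-in loader; the course may load the data differently,
# but the shapes match what's described in the lesson.
(x_train_all, y_train_all), (x_test, y_test) = cifar10.load_data()

print(x_train_all.shape)        # (50000, 32, 32, 3)
print(x_train_all[0].shape)     # (32, 32, 3) -> one image
print(x_train_all[0][0].shape)  # (32, 3)     -> one row of pixels
print(x_train_all[0][0][0])     # [59 62 63]  -> red, green, blue of one pixel
print(type(x_train_all[0][0][0][0]))  # <class 'numpy.uint8'>
```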
Now, the reason I'm dividing by 255 and bringing those values down to a range between 0 and 1 is because of our learning rate. If you watched the module on gradient descent, you'll know that the learning rate is typically quite small. By scaling our numbers down to values between 0 and 1, it becomes much easier to calculate the loss and adjust the weights given a typical learning rate. That's why we're bringing the range down.

I'm going to do this for both our training dataset and our testing dataset. So I write x_train_all, x_test = x_train_all / 255.0, x_test / 255.0. If I hit Shift+Enter on this, copy the earlier cell that checks the type, paste it below this line, and re-evaluate it after this line has run, I can see that the type has changed to a 64-bit floating point number. Brilliant. And if I want to see what that number is, it'll be 59 divided by 255, which is around 0.23137 and so on.
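In code, the scaling step looks roughly like this, assuming the arrays from the sketch above:

```python
# Rescale both datasets to the range [0, 1]; dividing a uint8 array
# by a Python float converts the result to float64 automatically
x_train_all, x_test = x_train_all / 255.0, x_test / 255.0

print(type(x_train_all[0][0][0][0]))  # <class 'numpy.float64'>
print(x_train_all[0][0][0][0])        # 0.23137254901960785, i.e. 59 / 255
```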
The next thing I'm going to do as part of our preprocessing is flatten out our dataset. Having four dimensions is fine, and sometimes we'll work with that, but I think it'll be a lot easier conceptually if we put all of these values into a single row, if you will: a single vector, a single row of numbers, to represent one image. This means that these three dimensions will be flattened, and to do it I'm going to use NumPy's reshape method. Check it out: I'll overwrite our x_train_all again by setting it equal to x_train_all.reshape, and I have to supply basically two inputs. The first input is the length, so I'll say x_train_all.shape[0], and that value is equal to 50,000. I could also have said len(x_train_all) and gotten the same answer. The second value I supply to reshape is what I want to collapse these three dimensions into: the number of pixels in the width of the image, multiplied by the number of pixels in the height of the image, multiplied by the color channels.

So this would read 32 times 32 times 3, but I don't really like magic numbers like these in my code. So I'll come back up to our constants and make this very explicit: IMAGE_WIDTH is equal to 32, IMAGE_HEIGHT is equal to 32, and then I'll add another one called IMAGE_PIXELS, which is equal to IMAGE_WIDTH times IMAGE_HEIGHT. Then my color channels (I'll stick with the American spelling here): COLOR_CHANNELS is equal to 3. So the total number of inputs, TOTAL_INPUTS, should be equal to the number of pixels times the color channels. Agreed? Great. That means I can come down here where I'm preprocessing my data and replace the expression with TOTAL_INPUTS.

Let me hit Shift+Enter on the cell, and after NumPy has completed its work, I'll check x_train_all.shape and take a quick look at what we got. We are now dealing with a NumPy array that is 50,000 by 3,072. Of course, whatever we do to the training dataset we should also do to our testing dataset, so I'll do exactly the same thing there and add a print statement with an f-string: "Shape of x_test is", curly braces, x_test.shape. Hit Shift+Enter, and let's see what we get: 10,000 images in our testing dataset, and they share the same dimension as our training dataset. Brilliant.
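Putting the constants and the reshape together, these cells might read as follows. The constant names follow what's said in the lesson; the exact spellings are my assumption:

```python
# Constants, defined near the top of the notebook
IMAGE_WIDTH = 32
IMAGE_HEIGHT = 32
IMAGE_PIXELS = IMAGE_WIDTH * IMAGE_HEIGHT
COLOR_CHANNELS = 3
TOTAL_INPUTS = IMAGE_PIXELS * COLOR_CHANNELS  # 3072

# Flatten each 32 x 32 x 3 image into a single row of 3,072 values
x_train_all = x_train_all.reshape(x_train_all.shape[0], TOTAL_INPUTS)
x_test = x_test.reshape(x_test.shape[0], TOTAL_INPUTS)

print(x_train_all.shape)                      # (50000, 3072)
print(f'Shape of x_test is {x_test.shape}')   # (10000, 3072)
```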
The next thing we're going to do is create something called a validation dataset. Imagine we've got our entire dataset, both our training and our testing data. What we're going to do is split our training data in two: part of the training dataset will become our validation dataset, so in total we'll have three parts. Why would we do this? Well, it has to do with our workflow. If you think about it, there are many little knobs we can turn and many little tweaks we can make to our model, and the validation dataset will allow us to select our best model, because the validation dataset is where we'll be evaluating all these little tweaks.

In other words, we're going to be doing two things: training our model, but also tuning it and making slight variations to it. The validation dataset will give us an unbiased evaluation of how our model is doing while we're tuning it, and this has the big advantage of saving our test dataset for later. The test dataset will remain untouched, and it will therefore be our final evaluation: only our best model will get to see the test dataset, because its job is to give us a realistic impression of how our model would do in the real world. If we didn't create this validation dataset, and we only had a training dataset and a test dataset, and we were tuning our model while repeatedly showing it the test dataset, then we'd be in danger of tuning our model so that it does well on that particular test dataset, and as a consequence we'd end up with unrealistic results.

Before we head back into the Jupyter Notebook and split up our NumPy array, one question you might be asking at this point is: how large should your validation dataset be? I think this really depends on the size of your dataset as a whole. For smaller datasets, the general rule of thumb is 60 percent training, 20 percent validation, 20 percent testing. On the other hand, if you have an absolutely enormous amount of data, then people might reserve only about 1 percent for validation and 1 percent for testing. But for the kind of data I'm usually working with, the 60/20/20 rule has served me very, very well, so it's not a bad rule of thumb to stick by.

Back in the Jupyter Notebook, I'm going to insert a subsection that reads "Create validation dataset", and I want to call this dataset x_val. I'll set it equal to x_train_all, square brackets, and take, say, the first ten thousand values: a colon and then the number 10000. Now, if we want to get rid of this magic number, we can once again pop up to our constants and add VALIDATION_SIZE, set equal to 10000. Before we come back down we have to hit Shift+Enter, of course, then scroll down and replace our 10000 with VALIDATION_SIZE. And we need to do the same for the y values, so we'll create another variable called y_val.
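A sketch of the validation split, again with the exact spellings assumed:

```python
VALIDATION_SIZE = 10000  # defined with the other constants

# The first 10,000 training examples become the validation set
x_val = x_train_all[:VALIDATION_SIZE]
y_val = y_train_all[:VALIDATION_SIZE]
```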
Just to check what we're doing, let's quickly print out the shape here, x_val.shape, and hit Shift+Enter on the cell. So: we've got 10,000 values with the same dimensions as above. Now, what about our training dataset? We've created our validation set, but given that the first 10,000 values are now part of it, we should take the last 40,000 values and store those in the new training dataset, the one we're actually going to use for our model. That way we'll have 40,000 images for training, 10,000 images for validation, and 10,000 images for testing.

I'll actually leave that up to you as a challenge. What I'd like you to do is create two NumPy arrays, x_train and y_train. x_train has to have the shape 40,000 by 3,072, and y_train needs to have the shape 40,000 by 1. Since we've used up the first 10,000 values of x_train_all for our validation dataset, I'd like you to store the last 40,000 values, the ones we haven't used, inside these x_train and y_train NumPy arrays. I'll give you a few seconds to pause the video and give this a go. I'll see you in a bit.

All right, so here is the solution. x_train is equal to x_train_all, square brackets, and since we're not taking the first 10,000 values but all the values from ten thousand onwards, we write VALIDATION_SIZE followed by a colon. That gives us the last 40,000 values from x_train_all that we're looking for, and we can do the very same thing for our labels, of course: y_train is equal to y_train_all[VALIDATION_SIZE:]. And that's it: x_train.shape is 40,000 by 3,072.
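Here is that solution as a sketch:

```python
# Everything from index 10,000 onwards stays in the training set
x_train = x_train_all[VALIDATION_SIZE:]
y_train = y_train_all[VALIDATION_SIZE:]

print(x_train.shape)  # (40000, 3072)
print(y_train.shape)  # (40000, 1)
```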
But I tell you what, training our models can actually be a little bit intensive, even with only 40,000 images. So what I think would be nice is to have an even smaller dataset to work with in the beginning, so that we can iterate without slowing down our computers too much, and then, once we're happy, move on to the larger dataset. This would almost simulate training on a small dataset that you have at first, then going out, gathering more data, and training on a larger dataset later on, which is something you'd often encounter in the real world. So I'm going to add another little markdown cell here that says "Create a small dataset", and this is mostly for illustration.

I'll just create two more NumPy arrays, x_train_xs and y_train_xs, the "xs" standing for extra small, and I'm going to take the first thousand values from the x_train that we created earlier. I do the same thing for our y values, and I'll add a little constant at the top that reads SMALL_TRAIN_SIZE and is equal to 1000, so that I know what this number is and what it does. SMALL_TRAIN_SIZE then takes the place of the magic number in the cell.

Now, one of the really good things about using these practice datasets is that they tend to be pretty clean. There doesn't tend to be much wrong with them, and that saves us a lot of time when it comes to data cleaning. So we're actually done with the preprocessing, and we're done with the data exploration side of things. Now we get to move on to our next steps, setting up our neural network and training our algorithm, and that's what we're all here for, right? So I'll see you in the next lesson. Take care.
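For reference, a minimal sketch of the small practice dataset created in this lesson; the _xs suffix follows what's said in the video, and the exact spelling is my assumption:

```python
SMALL_TRAIN_SIZE = 1000  # defined with the other constants

# The first 1,000 training examples form a small set for quick iteration
x_train_xs = x_train[:SMALL_TRAIN_SIZE]
y_train_xs = y_train[:SMALL_TRAIN_SIZE]

print(x_train_xs.shape)  # (1000, 3072)
print(y_train_xs.shape)  # (1000, 1)
```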