Okay, great. Now we have a little helper function that not only returns our images in the form of tensors (it processes the images for us) but also returns the labels, so we get them back in tuple form. So let's utilize that to write another function that's going to turn our data, all of our data, into batches, because this one only does one sample. We need a way to turn all of our data, so our X and y, our data and labels, into batches of size 32. Let's write that down. So we're trying to communicate to ourselves, here in our notebooks, a way to turn our data into tuples of tensors in the form (image, label). Let's make a function to turn all of our data, X and y, into batches. All right, wonderful. Now again, if you're wondering where all of this came from, a great place for any kind of data loading is to read through the documentation. It can be a bit boring sometimes, but trust me, read through it even if you don't understand it the first time; I didn't understand it the first time. Read through it, try it out, try it out again, and see how you go. This is going to be a big dog of a function. Oh, that's fitting, because we're working on Dog Vision, but it's going to save us a hell of a lot of time later on. So let's do it.
So we need to define the batch size; 32 is a good start. Remember, TensorFlow defaults to 32 a lot, so if you don't set it somewhere it'll probably default to 32, but we're going to define it anyway. So let's create a function to turn data into batches. We're going to write this function together, and it's going to do a few little things differently depending on whether we want a training data batch, a validation data batch, or a test data batch, because remember, those are our three sets. So we've got a training set we're going to turn into a training batch, a validation batch, and a test batch. All right. So we want def create_data_batches, and we want it to take in an X and a y, and actually y can be None. Now I want you to think about why y might be None, because y is our labels. This will probably make sense in a little bit, but just have a think about why we might want to create a data batch with no labels. Then valid_data=False and test_data=False, and we'll leave ourselves a little docstring here.
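The setup described so far can be sketched as a skeleton. This is only a sketch of the signature being built in this video; the parameter names follow the narration, and the body gets filled in over the rest of the section:

```python
# TensorFlow often defaults to a batch size of 32, but we define it explicitly
BATCH_SIZE = 32

# y defaults to None because the test set has no labels, and the two
# boolean flags select the validation/test branches discussed below
def create_data_batches(X, y=None, batch_size=BATCH_SIZE,
                        valid_data=False, test_data=False):
    """Creates batches of data out of image (X) and label (y) pairs."""
```
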
I used to have a colleague called Ron, and he told me about a technique called rubber ducking. He was a lot better coder than I was, and he said that if you're having trouble figuring out how to do something in code, write it out in plain language first, or say it to yourself as if you're talking to a rubber duck. And rubber ducks don't understand English very well, so you have to make it very clear. That's a great tip from Ron. Thank you, Ron. So, this function creates batches of data out of image (X) and label (y) pairs. It shuffles the data if it's training data, so it mixes the data around, because if there were some inherent order in the training files we downloaded (probably not, because these look pretty jumbled up), we don't want our model to remember that order; we want it to be as adaptable as possible no matter what order an image comes in. So: if it's training data, shuffle it, but don't shuffle it if it's validation data. Now I want you to have a think about why we might not shuffle validation data. Don't worry, we'll cover that, but just have a think about it for the meantime. The function also accepts test data as input. Now here is why y might be None.
Because test data doesn't have labels, of course. What we want to do is build a function that does much the same thing in each case but has some if clauses, or elif clauses, for whether it's validation data or test data. So we know roughly what we want to do, and we've got conditions for whether it's validation or test data. Let's start with the test data, because I think that involves the fewest steps. So, if the data is a test dataset, we probably don't... I was about to write "we probably don't have labels", but we know we don't have labels if it's a test dataset. So we're going to go: if test_data, print a little heads-up about what's going on: "Creating test data batches...". That's just to make sure that, later on, if we're making some data batches and we've accidentally passed the wrong value for test_data and it ends up handling the data differently, we'll notice. We don't want that; we want to know that it's making test batches. Then data equals tf.data.Dataset, so this is the tf.data module in TensorFlow. We're essentially making a batched dataset, and we're going to use (this is a little interesting one here) from_tensor_slices. That creates a dataset, a.k.a. just a TensorFlow dataset, whose elements are slices of the given tensors.
So basically what this says is: pass me some tensors and I'll create a dataset out of them. A little confusing, I know, but this is just how it's done in TensorFlow: tf.constant(X). So basically what this says is: I'm going to create a TensorFlow dataset from the tensor slices of X, and because we're passing X to tf.constant, it turns X into a tensor. So this is only file paths, no labels. Okay. And now data_batch. Here's where we turn our TensorFlow dataset into batches, because right now this would only turn all of X into a dataset; we need to turn it into batches of size 32. So data_batch equals data.map(process_image), and then we turn it into batches with .batch(BATCH_SIZE). Whoa, a lot of stuff going on here. "Daniel, what's happening?" Well, we've talked through the line above, but this one, all it's saying is basically: take this data, which is a TensorFlow dataset made of X (just file names in the form of tensors), and map our process_image function over it. See, this is why we made process_image a function: so we can call it here. It's going to go through all of those steps on our file names, import them, turn them into normalized tensors, and then turn them into batches of BATCH_SIZE. Let's go to the TensorFlow docs for tf.data and look up batch. This is the workflow I use whenever I'm looking at a function.
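The from_tensor_slices step just described can be seen in isolation. This is a minimal sketch; the file paths here are hypothetical stand-ins for the real test-image filenames:

```python
import tensorflow as tf

# Hypothetical file paths standing in for the real test-image filenames
test_paths = ["dog-1.jpg", "dog-2.jpg", "dog-3.jpg"]

# from_tensor_slices: pass it some tensors and it creates a dataset
# whose elements are slices of those tensors (here, one path each)
data = tf.data.Dataset.from_tensor_slices(tf.constant(test_paths))

elements = [p.numpy().decode() for p in data]
print(elements)  # ['dog-1.jpg', 'dog-2.jpg', 'dog-3.jpg']
```
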
So if we go to batch, there we go: "batch combines consecutive elements of this dataset into batches". There we go. So if we pass it a range of eight, and we want batches of three, it's going to split those eight elements into batches of three. So that (why is this in a new tab?) is exactly what it's going to do here, except ours is of size 32. Wonderful. So that's the test data. Next: if the data is a validation dataset, we don't need to shuffle it. Again, I want you to think about why we would not need to shuffle it. It's okay if you're not entirely sure; I got caught out on this a lot of times too. So we'll print a little progress statement here, and then we'll do much the same as above. We could just copy that, but we're going to practice writing it out: tf.data.Dataset.from_tensor_slices. Now, this is where, because our validation set has labels, we're going to pass it a tuple of tf.constant(X) and tf.constant(y). So we're just turning our file names and labels into tensors, and we'll note here: file paths, labels. So now we have a TensorFlow dataset of image file paths and labels, because it's a validation dataset.
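The documentation example just mentioned (eight elements batched in threes) is easy to reproduce directly:

```python
import tensorflow as tf

# batch() combines consecutive elements of a dataset into batches:
# eight elements in batches of three gives 3, 3, and a final 2
ds = tf.data.Dataset.range(8).batch(3)
batches = [b.numpy().tolist() for b in ds]
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Our function does the same thing, just with a batch size of 32.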
The validation dataset has labels, and we're going to turn it into batches: data_batch equals data.map, just the same as above; process the image and turn it into batches of BATCH_SIZE. (Whoops, getting trigger-happy.) Wonderful, and then return data_batch. And finally, if neither of those apply, so valid_data is False and test_data is also False, it's obviously going to be a training batch, so we're going to print "Creating training data batches..." and away we go. So if it is a training dataset, we want to shuffle it; that's the only difference here. So we want to turn the file paths and labels into tensors, just as we did with the validation data; we could just copy this. Now, there's probably a more efficient way to write this function with the same functionality, but I've tried to write it in a way that makes it fairly understandable what's going on. tf.constant(X)... and as I said that, I realized I claimed it was fairly understandable, but if you're coming across something like this for the first time, trust me, when I did, I was confused. But then I went through it step by step. Remember our technique for breaking down a function: write it out line by line, and then check what it's doing on each line. So we want to shuffle the pathnames and labels before mapping; we'll go into why in a second.
Shuffling the pathnames and labels before mapping the image-processing function is faster than shuffling full images. I'll just write the code so we can talk about it: data.shuffle(buffer_size=len(X)). So buffer_size just tells it how many elements we want to shuffle, and we want to shuffle the whole lot. Whoa, a lot going on here. Well, all this is saying is: hey, take this data, which is a TensorFlow dataset of our data (the file names and the labels), and shuffle it, using the full number of examples. If we only shuffled 100, it would take the first 100 examples of X and y and shuffle those, but note that we want to shuffle them all. And the reason for shuffling first: you might see other tutorials shuffling later, after running the map function, i.e. after data.map(process_image). If you shuffle after you've already processed the images, it takes a lot longer to shuffle a full image than it does just a file name. So, as I said, remember we're trying to minimize the time between experiments, so we want to grab any speed-up that we can.
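The shuffle-before-map idea can be sketched on its own. A minimal example with hypothetical filename/label pairs; the point is that only lightweight strings and ints are being shuffled, not full image tensors, and the pairing survives the shuffle:

```python
import tensorflow as tf

# Hypothetical filename/label pairs (stand-ins for the real data)
paths = [f"dog-{i}.jpg" for i in range(5)]
labels = list(range(5))

data = tf.data.Dataset.from_tensor_slices((tf.constant(paths),
                                           tf.constant(labels)))
# buffer_size=len(paths) shuffles across the whole dataset; the elements
# being moved around are just strings and ints, not decoded images
data = data.shuffle(buffer_size=len(paths))

pairs = [(p.numpy().decode(), int(l)) for p, l in data]
```
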
I've just realized we've made a little mistake up here in our validation dataset: because we have images and labels, we need to map the function get_image_label, which also processes our image, whereas for our test dataset, because we have no labels, we can run the process_image function directly. So let's replace this with get_image_label, and it's the same for our training dataset: get_image_label. Wonderful. And we might just write out what this is doing with a little comment: create (image, label) tuples (this also turns the image path into a preprocessed image). So, if you can, shuffle your data while it's in its smallest format, a.k.a. in our case file names rather than full images. And now, finally, we want to turn the training data into batches: data_batch equals data.batch(BATCH_SIZE), and then out here we return data_batch. Phew. That's a bit of a behemoth of a function, but as I said, it's going to save us a lot of time later on, and you'll see that. Remember, this is the kind of development workstyle I want you to start thinking about: hey, am I going to reuse some functionality later on? Well, if I am, I'm going to write it into a function. And you probably already know that, because you're a lot smarter developer than I am; I used to just fluster around and write things line by line.
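Putting the three branches together, the finished function described in this section looks roughly like the sketch below. This is a sketch under assumptions: process_image and get_image_label are the helpers built in earlier videos, and they're stubbed out here with dummy bodies so the snippet is self-contained (the real process_image reads a file, decodes it, and resizes it to 224x224x3):

```python
import tensorflow as tf

BATCH_SIZE = 32

# Hypothetical stubs for the helpers from earlier videos, so this
# sketch runs on its own; the real versions load and resize images.
def process_image(image_path):
    return tf.zeros([224, 224, 3])

def get_image_label(image_path, label):
    return process_image(image_path), label

def create_data_batches(X, y=None, batch_size=BATCH_SIZE,
                        valid_data=False, test_data=False):
    """Creates batches of data out of image (X) and label (y) pairs.

    Shuffles the data if it's training data but doesn't shuffle it if
    it's validation data. Also accepts test data as input (no labels).
    """
    if test_data:
        print("Creating test data batches...")
        # Only file paths, no labels
        data = tf.data.Dataset.from_tensor_slices(tf.constant(X))
        data_batch = data.map(process_image).batch(batch_size)
    elif valid_data:
        print("Creating validation data batches...")
        # File paths and labels, no shuffling needed
        data = tf.data.Dataset.from_tensor_slices((tf.constant(X),
                                                   tf.constant(y)))
        data_batch = data.map(get_image_label).batch(batch_size)
    else:
        print("Creating training data batches...")
        data = tf.data.Dataset.from_tensor_slices((tf.constant(X),
                                                   tf.constant(y)))
        # Shuffle pathnames and labels before mapping (cheaper than
        # shuffling already-processed images)
        data = data.shuffle(buffer_size=len(X))
        data_batch = data.map(get_image_label).batch(batch_size)
    return data_batch
```

With the stubs in place, calling it on a handful of hypothetical paths and labels produces one batch of 224x224x3 image tensors paired with their labels.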
Now let's test our function by creating training and validation data batches. So we'll go train_data; we're going to use our big dog function just above, and we're going to pass it X_train for the training data and y_train for the training labels, and that's all we need. And then for the val data, we go create_data_batches(X_val, y_val, valid_data=True). And now, fingers crossed, we should see our print progress messages. Yes! That is amazing. Okay, now let's check out the different attributes of our data batches. So if we look at train_data, it's now in the format of a batched dataset, which is TensorFlow's preferred way of processing things. There we go. This is our training data. The first element here, what's it saying? We've got shape (None, 224, 224, 3). This is our image. None stands in for the batch size. I know we've set it to 32, but it's going to show as None because, remember, batch size is flexible. So our images are now in batches of 32, and they've got a shape of 224 by 224 by 3 (height, width, color channels) of type float32, and we've got a tuple here with our label.
So these are (image, label) pairs in the form of tensors, and our labels also have a batch shape of None, because batch size is flexible, and they have a dimension of 120, which is because, if we look at y[0], there are 120 different dog breeds. And then it's the same again for the validation data. Beautiful. So that was a pretty in-depth one. Once this video is over, which will be in a few seconds, I'd go back through and just read through what's going on in this function. If you're not entirely sure, I'd look up the loading and preprocessing images tutorial in the TensorFlow documentation. Have a read through that, and then the best way to learn is to really just write this out, break down the function, and try to run it line by line. That's how I break things down for myself. But the way our data is at the moment, it's still kind of hard to understand. If you're new to the concept of batches, it can be a very difficult concept to grasp, and it's perfectly okay not to know what's going on right now. So to help ourselves understand, rather than just having this gobbledygook printed out when we're trying to check what's in our data batches, let's write a function which is going to help us visualize what's going on. That's what we'll do in the next video.
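The shapes just described can be inspected through element_spec. A toy example with the same element structure as train_data (using tf.zeros instead of real dog photos); the batch dimension shows as None because batch size is flexible:

```python
import tensorflow as tf

# A toy batched dataset shaped like train_data: (image, label) pairs,
# images of 224x224x3 and labels as 120-way vectors (120 dog breeds)
images = tf.zeros([4, 224, 224, 3])
labels = tf.zeros([4, 120])
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

image_spec, label_spec = ds.element_spec
print(image_spec.shape)  # (None, 224, 224, 3) -- None: flexible batch size
print(label_spec.shape)  # (None, 120)
```
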