Okay, great. Now we have a little helper function that not only returns our images in the form of tensors (it processes the images for us) but also returns the labels, so we get them back in tuple form. So let's utilize that to write another function that's going to turn our data, all of our data, into batches, because this one only does one sample. We need a way to turn all of our data, so our X and y, our data and labels, into batches of size 32. Let's write that down. So we're trying to communicate to ourselves, here in our notebooks, a way to turn our data into tuples of tensors in the form (image, label). Let's make a function to turn all of our data, X and y, into batches. All right, wonderful. Now again, if you're wondering where all of this came from, a great place for any kind of data loading is to read through the documentation. It can be a bit boring sometimes, but trust me, read through it even if you don't understand it the first time; I didn't understand it the first time. Read through it, try it out, try it out again, and see how you go. This is going to be a big dog of a function. Oh, that's fitting, because we're working on Dog Vision, but it's going to save us a hell of a lot of time later on. So let's do it.
So we need to define the batch size; 32 is a good start. Remember, TensorFlow defaults to 32 a lot, so if you don't set it somewhere it'll probably default to 32, but we're going to define it anyway. So let's create a function to turn data into batches. We're going to write this function together, and it's going to do a few little things differently depending on whether we want a training data batch, a validation data batch, or a test data batch, because remember, those are our three sets. So we've got a training set we're going to turn into a training batch, a validation batch, and a test batch. All right. So we want def create_data_batches, and we want it to take in an X and a y, and actually y can be None. Now I want you to think about why y might be None, because y is our labels. This will probably make sense in a little bit, but just have a think about why we might want to create a data batch with no labels. Then valid_data=False and test_data=False, and we'll leave ourselves a little docstring here.
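The setup described so far can be sketched as a skeleton. This is only a sketch of the signature being built in this video; the parameter names follow the narration, and the body gets filled in over the rest of the section:

```python
# TensorFlow often defaults to a batch size of 32, but we define it explicitly
BATCH_SIZE = 32

# y defaults to None because the test set has no labels, and the two
# boolean flags select the validation/test branches discussed below
def create_data_batches(X, y=None, batch_size=BATCH_SIZE,
                        valid_data=False, test_data=False):
    """Creates batches of data out of image (X) and label (y) pairs."""
```
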
I used to have a colleague called Ron, and he told me about a technique called rubber ducking. He was a lot better coder than I was, and he said that if you're having trouble figuring out how to do something in code, write it out in plain language first, or say it to yourself as if you're talking to a rubber duck. And rubber ducks don't understand English very well, so you have to make it very clear. That's a great tip from Ron. Thank you, Ron. So, this function creates batches of data out of image (X) and label (y) pairs. It shuffles the data if it's training data, so it mixes the data around, because if there were some inherent order in the training files we downloaded (probably not, because these look pretty jumbled up), we don't want our model to remember that order; we want it to be as adaptable as possible no matter what order an image comes in. So: if it's training data, shuffle it, but don't shuffle it if it's validation data. Now I want you to have a think about why we might not shuffle validation data. Don't worry, we'll cover that, but just have a think about it for the meantime. The function also accepts test data as input. Now here is why y might be None.
Because test data doesn't have labels, of course. What we want to do is build a function that does much the same thing in each case but has some if clauses, or elif clauses, for whether it's validation data or test data. So we know roughly what we want to do, and we've got conditions for whether it's validation or test data. Let's start with the test data, because I think that involves the fewest steps. So, if the data is a test dataset, we probably don't... I was about to write "we probably don't have labels", but we know we don't have labels if it's a test dataset. So we're going to go: if test_data, print a little heads-up about what's going on: "Creating test data batches...". That's just to make sure that, later on, if we're making some data batches and we've accidentally passed the wrong value for test_data and it ends up handling the data differently, we'll notice. We don't want that; we want to know that it's making test batches. Then data equals tf.data.Dataset, so this is the tf.data module in TensorFlow. We're essentially making a batched dataset, and we're going to use (this is a little interesting one here) from_tensor_slices. That creates a dataset, a.k.a. just a TensorFlow dataset, whose elements are slices of the given tensors.
So basically what this says is: pass me some tensors and I'll create a dataset out of them. A little confusing, I know, but this is just how it's done in TensorFlow: tf.constant(X). So basically what this says is: I'm going to create a TensorFlow dataset from the tensor slices of X, and because we're passing X to tf.constant, it turns X into a tensor. So this is only file paths, no labels. Okay. And now data_batch. Here's where we turn our TensorFlow dataset into batches, because right now this would only turn all of X into a dataset; we need to turn it into batches of size 32. So data_batch equals data.map(process_image), and then we turn it into batches with .batch(BATCH_SIZE). Whoa, a lot of stuff going on here. "Daniel, what's happening?" Well, we've talked through the line above, but this one, all it's saying is basically: take this data, which is a TensorFlow dataset made of X (just file names in the form of tensors), and map our process_image function over it. See, this is why we made process_image a function: so we can call it here. It's going to go through all of those steps on our file names, import them, turn them into normalized tensors, and then turn them into batches of BATCH_SIZE. Let's go to the TensorFlow docs for tf.data and look up batch. This is the workflow I use whenever I'm looking at a function.
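The from_tensor_slices step just described can be seen in isolation. This is a minimal sketch; the file paths here are hypothetical stand-ins for the real test-image filenames:

```python
import tensorflow as tf

# Hypothetical file paths standing in for the real test-image filenames
test_paths = ["dog-1.jpg", "dog-2.jpg", "dog-3.jpg"]

# from_tensor_slices: pass it some tensors and it creates a dataset
# whose elements are slices of those tensors (here, one path each)
data = tf.data.Dataset.from_tensor_slices(tf.constant(test_paths))

elements = [p.numpy().decode() for p in data]
print(elements)  # ['dog-1.jpg', 'dog-2.jpg', 'dog-3.jpg']
```
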
So if we go to batch, there we go: "batch combines consecutive elements of this dataset into batches". There we go. So if we pass it a range of eight, and we want batches of three, it's going to split those eight elements into batches of three. So that (why is this in a new tab?) is exactly what it's going to do here, except ours is of size 32. Wonderful. So that's the test data. Next: if the data is a validation dataset, we don't need to shuffle it. Again, I want you to think about why we would not need to shuffle it. It's okay if you're not entirely sure; I got caught out on this a lot of times too. So we'll print a little progress statement here, and then we'll do much the same as above. We could just copy that, but we're going to practice writing it out: tf.data.Dataset.from_tensor_slices. Now, this is where, because our validation set has labels, we're going to pass it a tuple of tf.constant(X) and tf.constant(y). So we're just turning our file names and labels into tensors, and we'll note here: file paths, labels. So now we have a TensorFlow dataset of image file paths and labels, because it's a validation dataset.
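The documentation example just mentioned (eight elements batched in threes) is easy to reproduce directly:

```python
import tensorflow as tf

# batch() combines consecutive elements of a dataset into batches:
# eight elements in batches of three gives 3, 3, and a final 2
ds = tf.data.Dataset.range(8).batch(3)
batches = [b.numpy().tolist() for b in ds]
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Our function does the same thing, just with a batch size of 32.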
The validation dataset has labels, and we're going to turn it into batches: data_batch equals data.map, just the same as above; process the image and turn it into batches of BATCH_SIZE. (Whoops, getting trigger-happy.) Wonderful, and then return data_batch. And finally, if neither of those apply, so valid_data is False and test_data is also False, it's obviously going to be a training batch, so we're going to print "Creating training data batches..." and away we go. So if it is a training dataset, we want to shuffle it; that's the only difference here. So we want to turn the file paths and labels into tensors, just as we did with the validation data; we could just copy this. Now, there's probably a more efficient way to write this function with the same functionality, but I've tried to write it in a way that makes it fairly understandable what's going on. tf.constant(X)... and as I said that, I realized I claimed it was fairly understandable, but if you're coming across something like this for the first time, trust me, when I did, I was confused. But then I went through it step by step. Remember our technique for breaking down a function: write it out line by line, and then check what it's doing on each line. So we want to shuffle the pathnames and labels before mapping; we'll go into why in a second.
Shuffling the pathnames and labels before mapping the image-processing function is faster than shuffling full images. I'll just write the code so we can talk about it: data.shuffle(buffer_size=len(X)). So buffer_size just tells it how many elements we want to shuffle, and we want to shuffle the whole lot. Whoa, a lot going on here. Well, all this is saying is: hey, take this data, which is a TensorFlow dataset of our data (the file names and the labels), and shuffle it, using the full number of examples. If we only shuffled 100, it would take the first 100 examples of X and y and shuffle those, but note that we want to shuffle them all. And the reason for shuffling first: you might see other tutorials shuffling later, after running the map function, i.e. after data.map(process_image). If you shuffle after you've already processed the images, it takes a lot longer to shuffle a full image than it does just a file name. So, as I said, remember we're trying to minimize the time between experiments, so we want to grab any speed-up that we can.
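The shuffle-before-map idea can be sketched on its own. A minimal example with hypothetical filename/label pairs; the point is that only lightweight strings and ints are being shuffled, not full image tensors, and the pairing survives the shuffle:

```python
import tensorflow as tf

# Hypothetical filename/label pairs (stand-ins for the real data)
paths = [f"dog-{i}.jpg" for i in range(5)]
labels = list(range(5))

data = tf.data.Dataset.from_tensor_slices((tf.constant(paths),
                                           tf.constant(labels)))
# buffer_size=len(paths) shuffles across the whole dataset; the elements
# being moved around are just strings and ints, not decoded images
data = data.shuffle(buffer_size=len(paths))

pairs = [(p.numpy().decode(), int(l)) for p, l in data]
```
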
I've just realized we've made a little mistake up here in our validation dataset: because we have images and labels, we need to map the function get_image_label, which also processes our image, whereas for our test dataset, because we have no labels, we can run the process_image function directly. So let's replace this with get_image_label, and it's the same for our training dataset: get_image_label. Wonderful. And we might just write out what this is doing with a little comment: create (image, label) tuples (this also turns the image path into a preprocessed image). So, if you can, shuffle your data while it's in its smallest format, a.k.a. in our case file names rather than full images. And now, finally, we want to turn the training data into batches: data_batch equals data.batch(BATCH_SIZE), and then out here we return data_batch. Phew. That's a bit of a behemoth of a function, but as I said, it's going to save us a lot of time later on, and you'll see that. Remember, this is the kind of development workstyle I want you to start thinking about: hey, am I going to reuse some functionality later on? Well, if I am, I'm going to write it into a function. And you probably already know that, because you're a lot smarter developer than I am; I used to just fluster around and write things line by line.
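Putting the three branches together, the finished function described in this section looks roughly like the sketch below. This is a sketch under assumptions: process_image and get_image_label are the helpers built in earlier videos, and they're stubbed out here with dummy bodies so the snippet is self-contained (the real process_image reads a file, decodes it, and resizes it to 224x224x3):

```python
import tensorflow as tf

BATCH_SIZE = 32

# Hypothetical stubs for the helpers from earlier videos, so this
# sketch runs on its own; the real versions load and resize images.
def process_image(image_path):
    return tf.zeros([224, 224, 3])

def get_image_label(image_path, label):
    return process_image(image_path), label

def create_data_batches(X, y=None, batch_size=BATCH_SIZE,
                        valid_data=False, test_data=False):
    """Creates batches of data out of image (X) and label (y) pairs.

    Shuffles the data if it's training data but doesn't shuffle it if
    it's validation data. Also accepts test data as input (no labels).
    """
    if test_data:
        print("Creating test data batches...")
        # Only file paths, no labels
        data = tf.data.Dataset.from_tensor_slices(tf.constant(X))
        data_batch = data.map(process_image).batch(batch_size)
    elif valid_data:
        print("Creating validation data batches...")
        # File paths and labels, no shuffling needed
        data = tf.data.Dataset.from_tensor_slices((tf.constant(X),
                                                   tf.constant(y)))
        data_batch = data.map(get_image_label).batch(batch_size)
    else:
        print("Creating training data batches...")
        data = tf.data.Dataset.from_tensor_slices((tf.constant(X),
                                                   tf.constant(y)))
        # Shuffle pathnames and labels before mapping (cheaper than
        # shuffling already-processed images)
        data = data.shuffle(buffer_size=len(X))
        data_batch = data.map(get_image_label).batch(batch_size)
    return data_batch
```

With the stubs in place, calling it on a handful of hypothetical paths and labels produces one batch of 224x224x3 image tensors paired with their labels.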
Now let's test our function by creating training and validation data batches. So we'll go train_data; we're going to use our big dog function just above, and we're going to pass it X_train for the training data and y_train for the training labels, and that's all we need. And then for the val data, we go create_data_batches(X_val, y_val, valid_data=True). And now, fingers crossed, we should see our print progress messages. Yes! That is amazing. Okay, now let's check out the different attributes of our data batches. So if we look at train_data, it's now in the format of a batched dataset, which is TensorFlow's preferred way of processing things. There we go. This is our training data. The first element here, what's it saying? We've got shape (None, 224, 224, 3). This is our image. None stands in for the batch size. I know we've set it to 32, but it's going to show as None because, remember, batch size is flexible. So our images are now in batches of 32, and they've got a shape of 224 by 224 by 3 (height, width, color channels) of type float32, and we've got a tuple here with our label.
So these are (image, label) pairs in the form of tensors, and our labels also have a batch shape of None, because batch size is flexible, and they have a dimension of 120, which is because, if we look at y[0], there are 120 different dog breeds. And then it's the same again for the validation data. Beautiful. So that was a pretty in-depth one. Once this video is over, which will be in a few seconds, I'd go back through and just read through what's going on in this function. If you're not entirely sure, I'd look up the loading and preprocessing images tutorial in the TensorFlow documentation. Have a read through that, and then the best way to learn is to really just write this out, break down the function, and try to run it line by line. That's how I break things down for myself. But the way our data is at the moment, it's still kind of hard to understand. If you're new to the concept of batches, it can be a very difficult concept to grasp, and it's perfectly okay not to know what's going on right now. So to help ourselves understand, rather than just having this gobbledygook printed out when we're trying to check what's in our data batches, let's write a function which is going to help us visualize what's going on. That's what we'll do in the next video.
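The shapes just described can be inspected through element_spec. A toy example with the same element structure as train_data (using tf.zeros instead of real dog photos); the batch dimension shows as None because batch size is flexible:

```python
import tensorflow as tf

# A toy batched dataset shaped like train_data: (image, label) pairs,
# images of 224x224x3 and labels as 120-way vectors (120 dog breeds)
images = tf.zeros([4, 224, 224, 3])
labels = tf.zeros([4, 120])
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

image_spec, label_spec = ds.element_spec
print(image_spec.shape)  # (None, 224, 224, 3) -- None: flexible batch size
print(label_spec.shape)  # (None, 120)
```
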