In this lesson we're going to preprocess our data so that it's easier to feed it into our neural network. Let's add a markdown cell in our notebook that reads "Preprocess Data".

One of the things we're going to do is change the kind of numbers being fed into our neural network, because at the moment, if we look at x_train_all and, say, the very first entry, we get an array. If we drill down a little further, one level deeper, we see that an array is nested inside another array. And if we drill deeper still, down to a particular pixel, we can see the three values stored for that pixel: the red, green, and blue values.

Now, if I want to look at this number 59 in isolation, some four levels down, and I check its type, I can see that it's an integer. But it's a uint8. What does that mean? It means it's an 8-bit unsigned integer, and an unsigned integer is simply a fancy name for a number that can't be negative. So 1984 would be an unsigned integer, but negative 10 would not be. Why? Because there's a sign in front of the ten: the negative sign. Unsigned just means no minus sign.

If I take my training dataset, x_train_all, and divide the whole thing by 255.0, I accomplish two things. First, I make all of these numbers a lot smaller. I know that 255 is the largest value I'm going to have, because on the RGB scale each of these sliders only goes up to 255. So dividing by 255 means that all the values inside x_train_all will be between zero and one. The second thing this accomplishes is a conversion from an integer to a float, a decimal number. A float is simply a number that has a decimal point and some digits after it; floating point numbers are what you'll encounter whenever you're doing some sort of scientific calculation.
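A rough sketch of this inspection in a notebook cell. The shapes quoted in the lesson (50,000 images of 32 by 32 pixels with 3 color channels) match Keras's built-in CIFAR-10 loader, so I'm using that here as a stand-in for however the course actually loads the data:

```python
from tensorflow.keras.datasets import cifar10

# Stand-in loader; the course may load the data differently,
# but the shapes match what's described in the lesson.
(x_train_all, y_train_all), (x_test, y_test) = cifar10.load_data()

print(x_train_all.shape)        # (50000, 32, 32, 3)
print(x_train_all[0].shape)     # (32, 32, 3) -> one image
print(x_train_all[0][0].shape)  # (32, 3)     -> one row of pixels
print(x_train_all[0][0][0])     # [59 62 63]  -> red, green, blue of one pixel
print(type(x_train_all[0][0][0][0]))  # <class 'numpy.uint8'>
```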
Now, the reason I'm dividing by 255 and bringing those values down to a range between 0 and 1 is because of our learning rate. If you watched the module on gradient descent, you'll know that the learning rate is typically quite small. By scaling our numbers down to values between 0 and 1, it becomes much easier to calculate the loss and adjust the weights given a typical learning rate. That's why we're bringing the range down.

I'm going to do this for both our training dataset and our testing dataset. So I write x_train_all, x_test = x_train_all / 255.0, x_test / 255.0. If I hit Shift+Enter on this, copy the earlier cell that checks the type, paste it below this line, and re-evaluate it after this line has run, I can see that the type has changed to a 64-bit floating point number. Brilliant. And if I want to see what that number is, it'll be 59 divided by 255, which is around 0.23137 and so on.
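In code, the scaling step looks roughly like this, assuming the arrays from the sketch above:

```python
# Rescale both datasets to the range [0, 1]; dividing a uint8 array
# by a Python float converts the result to float64 automatically
x_train_all, x_test = x_train_all / 255.0, x_test / 255.0

print(type(x_train_all[0][0][0][0]))  # <class 'numpy.float64'>
print(x_train_all[0][0][0][0])        # 0.23137254901960785, i.e. 59 / 255
```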
The next thing I'm going to do as part of our preprocessing is flatten out our dataset. Having four dimensions is fine, and sometimes we'll work with that, but I think it'll be a lot easier conceptually if we put all of these values into a single row, if you will: a single vector, a single row of numbers, to represent one image. This means that these three dimensions will be flattened, and to do it I'm going to use NumPy's reshape method. Check it out: I'll overwrite our x_train_all again by setting it equal to x_train_all.reshape, and I have to supply basically two inputs. The first input is the length, so I'll say x_train_all.shape[0], and that value is equal to 50,000. I could also have said len(x_train_all) and gotten the same answer. The second value I supply to reshape is what I want to collapse these three dimensions into: the number of pixels in the width of the image, multiplied by the number of pixels in the height of the image, multiplied by the color channels.

So this would read 32 times 32 times 3, but I don't really like magic numbers like these in my code. So I'll come back up to our constants and make this very explicit: IMAGE_WIDTH is equal to 32, IMAGE_HEIGHT is equal to 32, and then I'll add another one called IMAGE_PIXELS, which is equal to IMAGE_WIDTH times IMAGE_HEIGHT. Then my color channels (I'll stick with the American spelling here): COLOR_CHANNELS is equal to 3. So the total number of inputs, TOTAL_INPUTS, should be equal to the number of pixels times the color channels. Agreed? Great. That means I can come down here where I'm preprocessing my data and replace the expression with TOTAL_INPUTS.

Let me hit Shift+Enter on the cell, and after NumPy has completed its work, I'll check x_train_all.shape and take a quick look at what we got. We are now dealing with a NumPy array that is 50,000 by 3,072. Of course, whatever we do to the training dataset we should also do to our testing dataset, so I'll do exactly the same thing there and add a print statement with an f-string: "Shape of x_test is", curly braces, x_test.shape. Hit Shift+Enter, and let's see what we get: 10,000 images in our testing dataset, and they share the same dimension as our training dataset. Brilliant.
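Putting the constants and the reshape together, these cells might read as follows. The constant names follow what's said in the lesson; the exact spellings are my assumption:

```python
# Constants, defined near the top of the notebook
IMAGE_WIDTH = 32
IMAGE_HEIGHT = 32
IMAGE_PIXELS = IMAGE_WIDTH * IMAGE_HEIGHT
COLOR_CHANNELS = 3
TOTAL_INPUTS = IMAGE_PIXELS * COLOR_CHANNELS  # 3072

# Flatten each 32 x 32 x 3 image into a single row of 3,072 values
x_train_all = x_train_all.reshape(x_train_all.shape[0], TOTAL_INPUTS)
x_test = x_test.reshape(x_test.shape[0], TOTAL_INPUTS)

print(x_train_all.shape)                      # (50000, 3072)
print(f'Shape of x_test is {x_test.shape}')   # (10000, 3072)
```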
The next thing we're going to do is create something called a validation dataset. Imagine we've got our entire dataset, both our training and our testing data. What we're going to do is split our training data in two: part of the training dataset will become our validation dataset, so in total we'll have three parts. Why would we do this? Well, it has to do with our workflow. If you think about it, there are many little knobs we can turn and many little tweaks we can make to our model, and the validation dataset will allow us to select our best model, because the validation dataset is where we'll be evaluating all these little tweaks.

In other words, we're going to be doing two things: training our model, but also tuning it and making slight variations to it. The validation dataset will give us an unbiased evaluation of how our model is doing while we're tuning it, and this has the big advantage of saving our test dataset for later. The test dataset will remain untouched, and it will therefore be our final evaluation: only our best model will get to see the test dataset, because its job is to give us a realistic impression of how our model would do in the real world. If we didn't create this validation dataset, and we only had a training dataset and a test dataset, and we were tuning our model while repeatedly showing it the test dataset, then we'd be in danger of tuning our model so that it does well on that particular test dataset, and as a consequence we'd end up with unrealistic results.

Before we head back into the Jupyter Notebook and split up our NumPy array, one question you might be asking at this point is: how large should your validation dataset be? I think this really depends on the size of your dataset as a whole. For smaller datasets, the general rule of thumb is 60 percent training, 20 percent validation, 20 percent testing. On the other hand, if you have an absolutely enormous amount of data, then people might reserve only about 1 percent for validation and 1 percent for testing. But for the kind of data I'm usually working with, the 60/20/20 rule has served me very, very well, so it's not a bad rule of thumb to stick by.

Back in the Jupyter Notebook, I'm going to insert a subsection that reads "Create validation dataset", and I want to call this dataset x_val. I'll set it equal to x_train_all, square brackets, and take, say, the first ten thousand values: a colon and then the number 10000. Now, if we want to get rid of this magic number, we can once again pop up to our constants and add VALIDATION_SIZE, set equal to 10000. Before we come back down we have to hit Shift+Enter, of course, then scroll down and replace our 10000 with VALIDATION_SIZE. And we need to do the same for the y values, so we'll create another variable called y_val.
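A sketch of the validation split, again with the exact spellings assumed:

```python
VALIDATION_SIZE = 10000  # defined with the other constants

# The first 10,000 training examples become the validation set
x_val = x_train_all[:VALIDATION_SIZE]
y_val = y_train_all[:VALIDATION_SIZE]
```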
Just to check what we're doing, let's quickly print out the shape here, x_val.shape, and hit Shift+Enter on the cell. So: we've got 10,000 values with the same dimensions as above. Now, what about our training dataset? We've created our validation set, but given that the first 10,000 values are now part of it, we should take the last 40,000 values and store those in the new training dataset, the one we're actually going to use for our model. That way we'll have 40,000 images for training, 10,000 images for validation, and 10,000 images for testing.

I'll actually leave that up to you as a challenge. What I'd like you to do is create two NumPy arrays, x_train and y_train. x_train has to have the shape 40,000 by 3,072, and y_train needs to have the shape 40,000 by 1. Since we've used up the first 10,000 values of x_train_all for our validation dataset, I'd like you to store the last 40,000 values, the ones we haven't used, inside these x_train and y_train NumPy arrays. I'll give you a few seconds to pause the video and give this a go. I'll see you in a bit.

All right, so here is the solution. x_train is equal to x_train_all, square brackets, and since we're not taking the first 10,000 values but all the values from ten thousand onwards, we write VALIDATION_SIZE followed by a colon. That gives us the last 40,000 values from x_train_all that we're looking for, and we can do the very same thing for our labels, of course: y_train is equal to y_train_all[VALIDATION_SIZE:]. And that's it: x_train.shape is 40,000 by 3,072.
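Here is that solution as a sketch:

```python
# Everything from index 10,000 onwards stays in the training set
x_train = x_train_all[VALIDATION_SIZE:]
y_train = y_train_all[VALIDATION_SIZE:]

print(x_train.shape)  # (40000, 3072)
print(y_train.shape)  # (40000, 1)
```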
But I tell you what, training our models can actually be a little bit intensive, even with only 40,000 images. So what I think would be nice is to have an even smaller dataset to work with in the beginning, so that we can iterate without slowing down our computers too much, and then, once we're happy, move on to the larger dataset. This would almost simulate training on a small dataset that you have at first, then going out, gathering more data, and training on a larger dataset later on, which is something you'd often encounter in the real world. So I'm going to add another little markdown cell here that says "Create a small dataset", and this is mostly for illustration.

I'll just create two more NumPy arrays, x_train_xs and y_train_xs, the "xs" standing for extra small, and I'm going to take the first thousand values from the x_train that we created earlier. I do the same thing for our y values, and I'll add a little constant at the top that reads SMALL_TRAIN_SIZE and is equal to 1000, so that I know what this number is and what it does. SMALL_TRAIN_SIZE then takes the place of the magic number in the cell.

Now, one of the really good things about using these practice datasets is that they tend to be pretty clean. There doesn't tend to be much wrong with them, and that saves us a lot of time when it comes to data cleaning. So we're actually done with the preprocessing, and we're done with the data exploration side of things. Now we get to move on to our next steps, setting up our neural network and training our algorithm, and that's what we're all here for, right? So I'll see you in the next lesson. Take care.
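For reference, a minimal sketch of the small practice dataset created in this lesson; the _xs suffix follows what's said in the video, and the exact spelling is my assumption:

```python
SMALL_TRAIN_SIZE = 1000  # defined with the other constants

# The first 1,000 training examples form a small set for quick iteration
x_train_xs = x_train[:SMALL_TRAIN_SIZE]
y_train_xs = y_train[:SMALL_TRAIN_SIZE]

print(x_train_xs.shape)  # (1000, 3072)
print(y_train_xs.shape)  # (1000, 1)
```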