1 00:00:00,990 --> 00:00:04,550 In this lesson we're gonna do some data pre processing. 2 00:00:04,600 --> 00:00:06,460 We can do three things primarily. 3 00:00:06,580 --> 00:00:09,030 We're going to reskill our feature data. 4 00:00:09,030 --> 00:00:17,610 We're going to convert our target values into one hot encoding and we're also going to create our validation 5 00:00:17,610 --> 00:00:19,550 data set from our training data. 6 00:00:20,490 --> 00:00:29,340 So in your Jupiter notebook at a markdown cell here that reads data pre processing. 7 00:00:29,760 --> 00:00:37,080 And that way we have a little subsection to give us a bit more screen real estate with the toggle header 8 00:00:37,080 --> 00:00:41,550 here under review and we can work a little bit higher up. 9 00:00:42,840 --> 00:00:49,590 So one thing that we've already seen in the last lesson was how our values for our features are between 10 00:00:49,610 --> 00:00:52,130 0 and 255. 11 00:00:53,420 --> 00:01:00,320 Now considering that the learning rates for our optimizer are going to be very very small values it 12 00:01:00,320 --> 00:01:07,280 helps when the inputs to our neural network are going to be between 0 and 1. 13 00:01:07,280 --> 00:01:11,240 That's why we're going to reach scale our features. 14 00:01:11,660 --> 00:01:20,210 And the easiest way to do that is that we divide our training data and our testing data by two hundred 15 00:01:20,360 --> 00:01:21,790 and fifty five. 16 00:01:21,890 --> 00:01:23,480 That's the largest value right. 17 00:01:23,870 --> 00:01:29,510 So if there's a completely black pixel it'll have the value one after We've re scaled it. 18 00:01:29,510 --> 00:01:36,200 We can do all of this work in a single line of code X unequal score train on the score all come up X 19 00:01:36,350 --> 00:01:44,000 on a school test is gonna be equal to X and a score train on the score all divided by two hundred and 20 00:01:44,000 --> 00:01:54,280 fifty five point zero come on X on a score test divided by two hundred and fifty five point zero. 21 00:01:54,530 --> 00:02:01,310 This calculation will do two divisions one on all the values in our training data set and one and all 22 00:02:01,310 --> 00:02:05,300 the values in our testing data set and what it will also do. 23 00:02:05,300 --> 00:02:10,430 Because of this decimal point and this division is we're going to convert all of the integers that are 24 00:02:10,430 --> 00:02:16,920 currently in him to floating point numbers or our numbers with a decimal point so limit shift enter 25 00:02:16,950 --> 00:02:18,540 on this. 26 00:02:18,620 --> 00:02:25,200 And now we can tackle our target values these currently looks something like this. 27 00:02:25,240 --> 00:02:28,760 They're all numbers between 0 and 9. 28 00:02:28,760 --> 00:02:34,400 What we're gonna do is going to transform all of these into values between 0 and 1 actually. 29 00:02:34,460 --> 00:02:42,380 So what are effectively going to do is to take this sparse matrix right here which is what this is and 30 00:02:42,380 --> 00:02:45,410 turn it into a full matrix. 31 00:02:45,410 --> 00:02:47,280 Let me show you this through an example. 32 00:02:47,280 --> 00:02:54,590 So if I take these first five values here and I'll just store them in an array called values what I 33 00:02:54,590 --> 00:03:05,900 can do then is use a function from num pi called end p dot I 10 closing parentheses square brackets 34 00:03:06,740 --> 00:03:12,590 and then values me hit shift enter on this and show you what this does. 35 00:03:12,680 --> 00:03:13,030 All right. 36 00:03:13,060 --> 00:03:14,660 So what are we looking at here. 37 00:03:15,530 --> 00:03:25,460 Well we can see that now we find a one in the position of the numbers inside the values array. 38 00:03:25,760 --> 00:03:32,400 For example there's a one in the fifth position 0 1 2 3 4 5. 39 00:03:32,480 --> 00:03:36,530 There is a 1 in the first position for the second row. 40 00:03:36,530 --> 00:03:44,640 There's a 1 in the fourth position on the third row 0 1 2 3 4 and so on. 41 00:03:44,750 --> 00:03:46,160 What's going on here. 42 00:03:46,160 --> 00:03:49,010 Well let's take this step by step. 43 00:03:49,010 --> 00:03:51,230 This is actually something we've seen before. 44 00:03:51,260 --> 00:03:52,850 But in a slightly different form. 45 00:03:53,630 --> 00:04:01,790 So if I pull up the documentation for end p dot I then I can see here that this function returns a 2D 46 00:04:01,790 --> 00:04:07,910 array with ones the diagonal and zero elsewhere and N. 47 00:04:08,030 --> 00:04:12,430 So this first parameter is the number of rows in the output. 48 00:04:12,480 --> 00:04:21,590 So if I come down here and I just write and put out I on its own the number 10 then I get a 10 by 10 49 00:04:21,950 --> 00:04:25,620 matrix with ones down the diagonal. 50 00:04:25,800 --> 00:04:27,120 Why did I use the number 10. 51 00:04:27,900 --> 00:04:32,970 Well because we've got 10 different types of labels in our dataset. 52 00:04:33,150 --> 00:04:36,650 We've got the numbers between 0 and 9. 53 00:04:36,690 --> 00:04:40,980 So that's why I've created a 10 by 10 matrix. 54 00:04:40,980 --> 00:04:42,720 So what's happening next. 55 00:04:42,720 --> 00:04:46,080 Well this is not matrix multiplication. 56 00:04:46,080 --> 00:04:52,740 Instead what we're doing is actually array element indexing the second bit here. 57 00:04:52,740 --> 00:05:00,930 This values in the square brackets acts as the index array each number in the index array indicates 58 00:05:00,990 --> 00:05:07,580 which value in the preceding array to use in the place of the index. 59 00:05:07,620 --> 00:05:14,520 So check this out if I've caught n Pete and I and I use that same ten by ten matrix and then I have 60 00:05:14,520 --> 00:05:21,630 some square brackets after it and I put the number two there Then I get the third row extracted from 61 00:05:21,870 --> 00:05:29,120 my identity matrix my one here is in the third position I get this entire row coming on. 62 00:05:29,640 --> 00:05:36,330 So I hope you can see what's going on here now if I've got my values array like so and I pull out a 63 00:05:36,330 --> 00:05:43,260 particular value here with the square bracket notation like so then this is the form that's very familiar 64 00:05:43,260 --> 00:05:49,590 to us here I'm pulling out the fifth value are the number nine in this case here I'm pulling out the 65 00:05:49,590 --> 00:05:52,920 third row at index number two. 66 00:05:52,920 --> 00:06:01,890 So all we're doing him is we're using this entire array as an index and we're pulling out several of 67 00:06:01,890 --> 00:06:10,950 the rows from the identity matrix and this is how we can convert the entire training data set for the 68 00:06:10,950 --> 00:06:15,580 labels into a 1 Hot encoding. 69 00:06:16,110 --> 00:06:27,630 So let's do that now I'll add a little subheading here that reads convert target values to 1 Hot encoding 70 00:06:28,440 --> 00:06:34,540 at the moment our target values are sparse because they just have an integer for the class and what 71 00:06:34,540 --> 00:06:41,160 we're going to do is when I essentially reshape this entire thing so that it is in this format instead 72 00:06:42,180 --> 00:06:51,840 the way we can do this is simply by overwriting y on a squat train underscore all with and p dot I 10 73 00:06:52,760 --> 00:06:53,870 square brackets. 74 00:06:54,050 --> 00:07:01,650 Why underscore a train underscore all and if we don't want this number 10 floating around in here because 75 00:07:01,710 --> 00:07:03,330 we might not know what it stands for. 76 00:07:03,360 --> 00:07:11,870 When we come back to it in the future let's add a constant at the top that reads an R underscore classes 77 00:07:12,540 --> 00:07:15,890 and that's going to be equal to the number 10. 78 00:07:15,960 --> 00:07:23,610 If I refresh the cell I can now use my constant down here where I had this number 10 earlier and run 79 00:07:23,610 --> 00:07:25,400 this entire cell. 80 00:07:25,800 --> 00:07:32,880 If I check out why I was quatrain on a scroll that shape I can see that it is now at a rate of 60 thousand 81 00:07:32,880 --> 00:07:40,350 labels but for each label I've got this one hot encoding so I've got 10 columns and one of these columns 82 00:07:40,530 --> 00:07:49,780 will have a one at the position that corresponds to the label and this is in contrast to our flattened 83 00:07:49,780 --> 00:07:51,410 array that we had earlier. 84 00:07:51,490 --> 00:07:53,970 That was much more sparse. 85 00:07:54,010 --> 00:07:59,800 Of course we have to do the very same thing to our test labels as we've done to our training labels 86 00:07:59,980 --> 00:08:01,450 to be consistent. 87 00:08:01,450 --> 00:08:07,600 So we'll write y on the score test is equal to Pete dot identity. 88 00:08:07,600 --> 00:08:20,140 E y e parentheses in our classes y underscore test and now r y underscore test dot shape should also 89 00:08:20,140 --> 00:08:24,370 be a 1 Hot encoded array. 90 00:08:24,370 --> 00:08:31,920 In this case LP 10000 thousand by 10 the last thing that we'll do is we'll create our validation data 91 00:08:31,920 --> 00:08:32,680 set. 92 00:08:32,770 --> 00:08:44,360 So once again I'll add a subheading here that reads create validation data set from training data. 93 00:08:45,520 --> 00:08:51,430 What we want to do in this case is we want to split up our training data into our validation data set 94 00:08:51,850 --> 00:08:56,110 and the actual training data set that we're gonna use. 95 00:08:56,170 --> 00:09:02,230 The first thing that we'll do is we'll decide on the size of the validation data set I'm going to come 96 00:09:02,230 --> 00:09:06,080 back up to my constants and I'm going to add it up here. 97 00:09:06,400 --> 00:09:15,490 Validation on the score size and we'll set that equal to 10000 same size in this case as our training 98 00:09:15,490 --> 00:09:16,890 data set. 99 00:09:17,040 --> 00:09:23,280 I'll let shift enter on the cell and then I want to throw this over to you as a challenge. 100 00:09:23,320 --> 00:09:27,870 Once again this will be some good review for race slicing techniques. 101 00:09:27,880 --> 00:09:36,480 What I'd like you to do is to split the training data set into four smaller datasets namely the X underscore 102 00:09:36,500 --> 00:09:41,780 Val the Y underscore val x on escort train and y on escort train. 103 00:09:41,950 --> 00:09:46,960 Make use of the constant that we've created above so that you end up with the validation data set of 104 00:09:46,980 --> 00:09:51,950 10000 and a training data set of 50000. 105 00:09:52,270 --> 00:09:56,020 I'll give you a few seconds to pause the video before I show you the solution. 106 00:09:58,210 --> 00:09:59,390 Here's how you do it. 107 00:09:59,500 --> 00:10:05,980 X and a score Val is equal to x on the squat train on a squirrel square brackets. 108 00:10:05,980 --> 00:10:12,770 Colon validation size the same for r y underscore Val. 109 00:10:13,270 --> 00:10:20,350 So the first 10000 values in the large sixty thousand sample training dataset. 110 00:10:20,380 --> 00:10:21,800 Now let's do the other bit. 111 00:10:21,880 --> 00:10:29,230 So X on the squat train shall be equal to X on a squat train and a square all square brackets and then 112 00:10:29,230 --> 00:10:36,460 I'll be the last 50000 so it'll be everything from the 10000 sample onwards. 113 00:10:36,460 --> 00:10:39,760 So validation on this exercise. 114 00:10:39,910 --> 00:10:46,760 Colon closing square bracket and the same goes for the Y levels in the training dataset. 115 00:10:46,810 --> 00:10:49,010 Let me hit shift enter on this. 116 00:10:49,450 --> 00:10:56,320 And now when I pull up X on a squat train shape I shall see that it's fifty thousand by seven hundred 117 00:10:56,560 --> 00:10:57,490 and eighty four. 118 00:10:58,240 --> 00:11:08,150 An X on a score vowel don't shape it's gonna be ten thousand by seven hundred and eighty four and that's 119 00:11:08,150 --> 00:11:08,820 it. 120 00:11:09,290 --> 00:11:15,230 In the next lessons we're going to be busy setting up tensor flow and setting up our neural network 121 00:11:15,320 --> 00:11:18,190 architecture for all that and more. 122 00:11:18,410 --> 00:11:20,930 How see in the next lessons take out.