1
00:00:00,990 --> 00:00:04,550
In this lesson we're gonna do some data preprocessing.

2
00:00:04,600 --> 00:00:06,460
We're gonna do three things primarily.

3
00:00:06,580 --> 00:00:09,030
We're going to rescale our feature data.

4
00:00:09,030 --> 00:00:17,610
We're going to convert our target values into one-hot encoding, and we're also going to create our validation

5
00:00:17,610 --> 00:00:19,550
data set from our training data.

6
00:00:20,490 --> 00:00:29,340
So in your Jupyter notebook, add a Markdown cell here that reads "Data Preprocessing".

7
00:00:29,760 --> 00:00:37,080
And that way we have a little subsection to give us a bit more screen real estate with the toggle header

8
00:00:37,080 --> 00:00:41,550
here under review and we can work a little bit higher up.

9
00:00:42,840 --> 00:00:49,590
So one thing that we've already seen in the last lesson was how our values for our features are between

10
00:00:49,610 --> 00:00:52,130
0 and 255.

11
00:00:53,420 --> 00:01:00,320
Now considering that the learning rates for our optimizer are going to be very very small values it

12
00:01:00,320 --> 00:01:07,280
helps when the inputs to our neural network are going to be between 0 and 1.

13
00:01:07,280 --> 00:01:11,240
That's why we're going to rescale our features.

14
00:01:11,660 --> 00:01:20,210
And the easiest way to do that is that we divide our training data and our testing data by two hundred

15
00:01:20,360 --> 00:01:21,790
and fifty five.

16
00:01:21,890 --> 00:01:23,480
That's the largest value, right?

17
00:01:23,870 --> 00:01:29,510
So if there's a completely black pixel, it'll have the value one after we've rescaled it.

18
00:01:29,510 --> 00:01:36,200
We can do all of this work in a single line of code:

19
00:01:36,350 --> 00:01:44,000
x_train_all, x_test is gonna be equal to x_train_all divided by

20
00:01:44,000 --> 00:01:54,280
255.0, comma, x_test divided by 255.0.

21
00:01:54,530 --> 00:02:01,310
This calculation will do two divisions, one on all the values in our training dataset and one on all

22
00:02:01,310 --> 00:02:05,300
the values in our testing dataset. And what it will also do,

23
00:02:05,300 --> 00:02:10,430
because of this decimal point and this division, is convert all of the integers that are

24
00:02:10,430 --> 00:02:16,920
currently in here to floating point numbers, or numbers with a decimal point. So let me hit Shift+Enter

25
00:02:16,950 --> 00:02:18,540
on this.
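
Here's roughly what that cell looks like written out. This is a minimal sketch: the small random arrays are only stand-ins so it runs on its own, and the variable names x_train_all and x_test are taken from what's read out in the lesson.

```python
import numpy as np

# Stand-in pixel data (assumption): the real x_train_all and x_test
# were loaded in the previous lesson and hold integers from 0 to 255.
rng = np.random.default_rng(0)
x_train_all = rng.integers(0, 256, size=(6, 784), dtype=np.uint8)
x_test = rng.integers(0, 256, size=(2, 784), dtype=np.uint8)

# Two divisions in one line. Dividing by 255.0 also turns the
# integers into floating point numbers.
x_train_all, x_test = x_train_all / 255.0, x_test / 255.0

print(x_train_all.dtype)
print(x_train_all.min(), x_train_all.max())
```

After this, every value sits between 0 and 1.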

26
00:02:18,620 --> 00:02:25,200
And now we can tackle our target values. These currently look something like this.

27
00:02:25,240 --> 00:02:28,760
They're all numbers between 0 and 9.

28
00:02:28,760 --> 00:02:34,400
What we're gonna do is transform all of these into values between 0 and 1, actually.

29
00:02:34,460 --> 00:02:42,380
So what we're effectively going to do is take this sparse matrix right here, which is what this is, and

30
00:02:42,380 --> 00:02:45,410
turn it into a full matrix.

31
00:02:45,410 --> 00:02:47,280
Let me show you this through an example.

32
00:02:47,280 --> 00:02:54,590
So if I take these first five values here and I'll just store them in an array called values what I

33
00:02:54,590 --> 00:03:05,900
can do then is use a function from NumPy called np.eye: np.eye(10), square brackets,

34
00:03:06,740 --> 00:03:12,590
and then values. Let me hit Shift+Enter on this and show you what this does.

35
00:03:12,680 --> 00:03:13,030
All right.

36
00:03:13,060 --> 00:03:14,660
So what are we looking at here?

37
00:03:15,530 --> 00:03:25,460
Well we can see that now we find a one in the position of the numbers inside the values array.

38
00:03:25,760 --> 00:03:32,400
For example, there's a one at index five in the first row: 0, 1, 2, 3, 4, 5.

39
00:03:32,480 --> 00:03:36,530
There is a 1 in the first position for the second row.

40
00:03:36,530 --> 00:03:44,640
There's a 1 at index four on the third row: 0, 1, 2, 3, 4, and so on.

41
00:03:44,750 --> 00:03:46,160
What's going on here?

42
00:03:46,160 --> 00:03:49,010
Well let's take this step by step.

43
00:03:49,010 --> 00:03:51,230
This is actually something we've seen before.

44
00:03:51,260 --> 00:03:52,850
But in a slightly different form.

45
00:03:53,630 --> 00:04:01,790
So if I pull up the documentation for np.eye, then I can see here that this function returns a 2D

46
00:04:01,790 --> 00:04:07,910
array with ones on the diagonal and zeros elsewhere. And N,

47
00:04:08,030 --> 00:04:12,430
so this first parameter, is the number of rows in the output.

48
00:04:12,480 --> 00:04:21,590
So if I come down here and I just write np.eye on its own with the number 10, then I get a 10 by 10

49
00:04:21,950 --> 00:04:25,620
matrix with ones down the diagonal.

50
00:04:25,800 --> 00:04:27,120
Why did I use the number 10?

51
00:04:27,900 --> 00:04:32,970
Well because we've got 10 different types of labels in our dataset.

52
00:04:33,150 --> 00:04:36,650
We've got the numbers between 0 and 9.

53
00:04:36,690 --> 00:04:40,980
So that's why I've created a 10 by 10 matrix.

54
00:04:40,980 --> 00:04:42,720
So what's happening next?

55
00:04:42,720 --> 00:04:46,080
Well this is not matrix multiplication.

56
00:04:46,080 --> 00:04:52,740
Instead, what we're doing is actually array element indexing. The second bit here,

57
00:04:52,740 --> 00:05:00,930
this values in the square brackets, acts as the index array. Each number in the index array indicates

58
00:05:00,990 --> 00:05:07,580
which value in the preceding array to use in the place of the index.

59
00:05:07,620 --> 00:05:14,520
So check this out: if I've got np.eye and I use that same ten by ten matrix, and then I have

60
00:05:14,520 --> 00:05:21,630
some square brackets after it and I put the number two there, then I get the third row extracted from

61
00:05:21,870 --> 00:05:29,120
my identity matrix. My one here is in the third position. I get this entire row coming out.

62
00:05:29,640 --> 00:05:36,330
So I hope you can see what's going on here now. If I've got my values array like so and I pull out a

63
00:05:36,330 --> 00:05:43,260
particular value here with the square bracket notation, like so, then this is the form that's very familiar

64
00:05:43,260 --> 00:05:49,590
to us. Here I'm pulling out the fifth value, or the number nine in this case. Here I'm pulling out the

65
00:05:49,590 --> 00:05:52,920
third row at index number two.

66
00:05:52,920 --> 00:06:01,890
So all we're doing here is we're using this entire array as an index, and we're pulling out several of

67
00:06:01,890 --> 00:06:10,950
the rows from the identity matrix. And this is how we can convert the entire training dataset for the

68
00:06:10,950 --> 00:06:15,580
labels into a one-hot encoding.
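
To make the indexing step concrete, here's a small sketch you can run in its own cell. The five numbers in values are an assumption: the lesson only reads some of them out, so treat them as sample labels rather than the exact ones on screen.

```python
import numpy as np

# Assumed sample labels (the lesson only reads out some of these).
values = np.array([5, 0, 4, 1, 9])

identity = np.eye(10)        # 10 by 10 matrix with ones down the diagonal
third_row = identity[2]      # a single integer index pulls out one row
one_hot = identity[values]   # an index array pulls out one row per label

print(third_row)
print(one_hot.shape)
print(one_hot.argmax(axis=1))
```

The last line reads the position of the one back out of each row, which should hand you the original labels again.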

69
00:06:16,110 --> 00:06:27,630
So let's do that now. I'll add a little subheading here that reads "Convert Target Values to One-Hot Encoding".

70
00:06:28,440 --> 00:06:34,540
At the moment, our target values are sparse because they just have an integer for the class, and what

71
00:06:34,540 --> 00:06:41,160
we're going to do is essentially reshape this entire thing so that it is in this format instead.

72
00:06:42,180 --> 00:06:51,840
The way we can do this is simply by overwriting y_train_all with np.eye(10),

73
00:06:52,760 --> 00:06:53,870
square brackets,

74
00:06:54,050 --> 00:07:01,650
y_train_all. And if we don't want this number 10 floating around in here, because

75
00:07:01,710 --> 00:07:03,330
we might not know what it stands for.

76
00:07:03,360 --> 00:07:11,870
when we come back to it in the future, let's add a constant at the top that reads NR_CLASSES,

77
00:07:12,540 --> 00:07:15,890
and that's going to be equal to the number 10.

78
00:07:15,960 --> 00:07:23,610
If I refresh the cell I can now use my constant down here where I had this number 10 earlier and run

79
00:07:23,610 --> 00:07:25,400
this entire cell.

80
00:07:25,800 --> 00:07:32,880
If I check out y_train_all.shape, I can see that it is now an array of 60 thousand

81
00:07:32,880 --> 00:07:40,350
labels, but for each label I've got this one-hot encoding, so I've got 10 columns, and one of these columns

82
00:07:40,530 --> 00:07:49,780
will have a one at the position that corresponds to the label and this is in contrast to our flattened

83
00:07:49,780 --> 00:07:51,410
array that we had earlier.

84
00:07:51,490 --> 00:07:53,970
That was much more sparse.

85
00:07:54,010 --> 00:07:59,800
Of course we have to do the very same thing to our test labels as we've done to our training labels

86
00:07:59,980 --> 00:08:01,450
to be consistent.

87
00:08:01,450 --> 00:08:07,600
So we'll write y_test is equal to np.eye,

88
00:08:07,600 --> 00:08:20,140
e-y-e, parentheses, NR_CLASSES, square brackets, y_test. And now our y_test.shape should also

89
00:08:20,140 --> 00:08:24,370
be a one-hot encoded array.
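
Put together, the conversion cell might look something like this. It's a sketch with short stand-in label arrays so it runs by itself; NR_CLASSES, y_train_all and y_test are the names as read out in the lesson.

```python
import numpy as np

NR_CLASSES = 10  # the digits 0 to 9

# Stand-in integer labels (assumption): the real arrays hold
# 60,000 training labels and 10,000 test labels.
y_train_all = np.array([5, 0, 4, 1, 9, 2])
y_test = np.array([7, 2, 1])

# Overwrite each label array with the matching rows of the identity matrix.
y_train_all = np.eye(NR_CLASSES)[y_train_all]
y_test = np.eye(NR_CLASSES)[y_test]

print(y_train_all.shape)
print(y_test.shape)
```

Each array keeps one row per label and gains ten columns, one per class.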

90
00:08:24,370 --> 00:08:31,920
In this case it'll be 10,000 by 10. The last thing that we'll do is we'll create our validation data

91
00:08:31,920 --> 00:08:32,680
set.

92
00:08:32,770 --> 00:08:44,360
So once again I'll add a subheading here that reads "Create Validation Dataset from Training Data".

93
00:08:45,520 --> 00:08:51,430
What we want to do in this case is we want to split up our training data into our validation data set

94
00:08:51,850 --> 00:08:56,110
and the actual training data set that we're gonna use.

95
00:08:56,170 --> 00:09:02,230
The first thing that we'll do is we'll decide on the size of the validation dataset. I'm going to come

96
00:09:02,230 --> 00:09:06,080
back up to my constants and I'm going to add it up here.

97
00:09:06,400 --> 00:09:15,490
VALIDATION_SIZE, and we'll set that equal to 10,000, same size in this case as our testing

98
00:09:15,490 --> 00:09:16,890
data set.

99
00:09:17,040 --> 00:09:23,280
I'll hit Shift+Enter on the cell, and then I want to throw this over to you as a challenge.

100
00:09:23,320 --> 00:09:27,870
Once again, this will be some good review for array slicing techniques.

101
00:09:27,880 --> 00:09:36,480
What I'd like you to do is to split the training dataset into four smaller datasets, namely the x_val,

102
00:09:36,500 --> 00:09:41,780
the y_val, x_train and y_train.

103
00:09:41,950 --> 00:09:46,960
Make use of the constant that we've created above so that you end up with the validation data set of

104
00:09:46,980 --> 00:09:51,950
10000 and a training data set of 50000.

105
00:09:52,270 --> 00:09:56,020
I'll give you a few seconds to pause the video before I show you the solution.

106
00:09:58,210 --> 00:09:59,390
Here's how you do it.

107
00:09:59,500 --> 00:10:05,980
x_val is equal to x_train_all, square brackets,

108
00:10:05,980 --> 00:10:12,770
colon, VALIDATION_SIZE. The same for our y_val.

109
00:10:13,270 --> 00:10:20,350
So the first 10000 values in the large sixty thousand sample training dataset.

110
00:10:20,380 --> 00:10:21,800
Now let's do the other bit.

111
00:10:21,880 --> 00:10:29,230
So x_train shall be equal to x_train_all, square brackets, and then

112
00:10:29,230 --> 00:10:36,460
it'll be the last 50,000, so it'll be everything from the 10,000th sample onwards.

113
00:10:36,460 --> 00:10:39,760
So VALIDATION_SIZE,

114
00:10:39,910 --> 00:10:46,760
colon, closing square bracket. And the same goes for the y labels in the training dataset.

115
00:10:46,810 --> 00:10:49,010
Let me hit shift enter on this.

116
00:10:49,450 --> 00:10:56,320
And now when I pull up x_train.shape, I shall see that it's fifty thousand by seven hundred

117
00:10:56,560 --> 00:10:57,490
and eighty four.

118
00:10:58,240 --> 00:11:08,150
And x_val.shape, it's gonna be ten thousand by seven hundred and eighty four. And that's

119
00:11:08,150 --> 00:11:08,820
it.
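
For reference, the whole split can be sketched like this. The zero-filled arrays are placeholders with the same shapes as the real data, so only the slicing is the point here.

```python
import numpy as np

VALIDATION_SIZE = 10000

# Placeholder arrays (assumption) shaped like the real data:
# 60,000 flattened images and their one-hot labels.
x_train_all = np.zeros((60000, 784), dtype=np.uint8)
y_train_all = np.zeros((60000, 10), dtype=np.uint8)

# The first 10,000 samples become the validation dataset.
x_val = x_train_all[:VALIDATION_SIZE]
y_val = y_train_all[:VALIDATION_SIZE]

# Everything from sample 10,000 onwards stays as training data.
x_train = x_train_all[VALIDATION_SIZE:]
y_train = y_train_all[VALIDATION_SIZE:]

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
```

Slicing features and labels with the same boundary keeps every image lined up with its label.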

120
00:11:09,290 --> 00:11:15,230
In the next lessons we're going to be busy setting up TensorFlow and setting up our neural network

121
00:11:15,320 --> 00:11:18,190
architecture. For all that and more,

122
00:11:18,410 --> 00:11:20,930
I'll see you in the next lessons. Take care.