1
00:00:00,810 --> 00:00:07,130
Now that we've set up our notebook and downloaded our data, let's start exploring it a little bit and

2
00:00:07,130 --> 00:00:12,890
talk in a little more detail about what kind of data we're actually working with. We're going to

3
00:00:12,890 --> 00:00:14,470
be working with some image data.

4
00:00:14,510 --> 00:00:20,960
And this is image data of some handwritten digits. All of the data is made up of grayscale images of these

5
00:00:20,960 --> 00:00:22,410
handwritten digits.

6
00:00:22,430 --> 00:00:28,400
Lucky for us, the hardest part of the work, namely cleaning the data and getting this data ready for us

7
00:00:28,400 --> 00:00:37,110
to use, has already been done by the National Institute of Standards and Technology, the NIST.

8
00:00:37,250 --> 00:00:39,680
That's the NIST behind MNIST.

9
00:00:40,370 --> 00:00:47,030
They've collected all this data and formatted it for processing. Part of the original data is actually

10
00:00:47,030 --> 00:00:51,680
available on their website, and you can download it if you like.

11
00:00:51,680 --> 00:00:56,690
The only problem is that the format they've chosen is a little bit strange.

12
00:00:56,710 --> 00:00:58,800
It's very efficient, but

13
00:00:58,820 --> 00:01:06,350
it does require some preprocessing. What I've done instead is give you the CSV, because in the CSV

14
00:01:06,590 --> 00:01:13,040
we've got values for all the images in an array format, and this is very similar to what we've encountered

15
00:01:13,040 --> 00:01:13,520
before.

16
00:01:14,060 --> 00:01:20,110
So these images are 28 pixels wide and 28 pixels tall.

17
00:01:20,840 --> 00:01:25,960
And they're also in grayscale, meaning there's only a single color channel.

18
00:01:26,060 --> 00:01:32,290
And what this means is that each pixel in this image has a single value associated with it

19
00:01:32,390 --> 00:01:35,810
that shows how light or how dark this pixel is.

20
00:01:35,840 --> 00:01:45,680
This value is basically an integer between 0 and 255. As such, the total number of inputs for our perceptron

21
00:01:46,160 --> 00:01:53,180
is 28 times 28 times 1, 1 being the number of color channels.

22
00:01:53,180 --> 00:01:56,330
And that's equal to 784.
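
The input count described above can be checked with a few lines of Python. This is only a minimal sketch; the constant names are illustrative and do not come from the lesson's notebook.

```python
# Image dimensions as described in the lesson.
IMAGE_WIDTH = 28
IMAGE_HEIGHT = 28
CHANNELS = 1  # grayscale: a single color channel

# One input per pixel per channel.
total_inputs = IMAGE_WIDTH * IMAGE_HEIGHT * CHANNELS
print(total_inputs)  # 784
```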

23
00:01:56,360 --> 00:02:00,790
How are we getting these 784 data points into a CSV file?

24
00:02:00,830 --> 00:02:03,450
Well, this is how I structured them.

25
00:02:03,650 --> 00:02:07,490
I essentially put them all into a giant array.

26
00:02:07,550 --> 00:02:11,600
Let's take a look at this array in our Jupyter notebook here.

27
00:02:11,650 --> 00:02:18,640
I'll add a markdown cell that's going to read "Explore", and let's take a look at these x_train_all

28
00:02:18,640 --> 00:02:25,510
and x_test and y_test NumPy arrays in a bit more detail.

29
00:02:26,290 --> 00:02:34,490
So x_train_all.shape is going to show us that we've got 60,000 examples.

30
00:02:34,870 --> 00:02:42,340
And each example has 784 values associated with it. Here's what the very first

31
00:02:42,340 --> 00:02:48,970
one in our training dataset looks like. I'll pull this up with x_train_all square

32
00:02:48,970 --> 00:02:55,280
brackets zero, and what we see is an array, as promised.

33
00:02:55,340 --> 00:03:04,520
It's got values between 0 and 255. A value of 0 means that the pixel is completely

34
00:03:04,520 --> 00:03:11,960
white and a value of 255 means that the pixel is completely black.

35
00:03:11,960 --> 00:03:18,200
Everything else in between is a shade of gray, and there are more than 50 shades of gray here, as we can

36
00:03:18,200 --> 00:03:18,490
tell.
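
As a rough illustration of the inspection steps above, here is a small sketch using NumPy. It assumes the data is held in a 2-D integer array; `x_demo` is a hypothetical stand-in filled with random values (6 examples rather than 60,000), not the lesson's actual `x_train_all`.

```python
import numpy as np

# Hypothetical stand-in for x_train_all: 6 examples instead of 60,000,
# filled with random grayscale values rather than real digit images.
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 256, size=(6, 784), dtype=np.uint8)

print(x_demo.shape)     # (6, 784): (examples, pixel values per example)
print(x_demo[0].shape)  # (784,): the first example is one flat row

# Every value is a grayscale intensity in the range 0 to 255.
print(x_demo.min() >= 0 and x_demo.max() <= 255)  # True
```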

37
00:03:19,890 --> 00:03:27,210
Let me minimize this output now and quickly take a look at y_train_all, and we

38
00:03:27,210 --> 00:03:36,890
see that it has 60,000 labels to go along with the feature data for 60,000 examples. x_test,

39
00:03:36,890 --> 00:03:40,840
on the other hand, only has 10,000 examples.

40
00:03:41,360 --> 00:03:47,390
So our testing dataset is a tad smaller than our training dataset, which also explains the difference

41
00:03:47,390 --> 00:03:51,110
in file size and loading time.

42
00:03:51,200 --> 00:03:56,530
Now, what does our y_train_all actually look like?

43
00:03:56,540 --> 00:03:59,980
Let's take a look at the very first entry in there.

44
00:03:59,980 --> 00:04:02,110
We've got a five.

45
00:04:02,120 --> 00:04:04,230
What about the first five entries?

46
00:04:04,280 --> 00:04:13,460
So colon and then five shows us we've got 5, 0, 4, 1, 9.

47
00:04:13,640 --> 00:04:18,780
These here correspond to the categories or the classes for our digits.
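
The indexing and slicing described above might look like the sketch below. `y_demo` is a hypothetical stand-in for the start of `y_train_all`: only the first five labels come from the lesson, and the two trailing values are placeholders. The same syntax works on a NumPy array.

```python
# Stand-in for the start of y_train_all. The first five labels are the ones
# read out in the lesson; the last two are placeholders for illustration.
y_demo = [5, 0, 4, 1, 9, 2, 1]

first_label = y_demo[0]  # the very first entry
first_five = y_demo[:5]  # a colon, then 5: entries at index 0 through 4

print(first_label)  # 5
print(first_five)   # [5, 0, 4, 1, 9]
```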

48
00:04:18,860 --> 00:04:24,950
Now, one thing that you'll actually notice is that I've flattened our arrays for us, because what I've

49
00:04:24,950 --> 00:04:30,760
actually done is I've given you all the pixels laid out in a single row for each image.

50
00:04:30,830 --> 00:04:40,690
If I had not done that, then this shape would read 60,000 by 28 by 28 by 1, right, for the color channel.

51
00:04:40,790 --> 00:04:47,510
I've already combined all of these into a single row. Now, that makes it a little bit easier to work with,

52
00:04:47,660 --> 00:04:54,230
but it's not necessarily a good thing, because we actually lose a little bit of positional information

53
00:04:54,680 --> 00:05:02,510
on each pixel. Like, we don't know what other pixels surround that pixel by flattening them like this.
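
To make the flattening concrete, here is a hedged sketch of how a flat row relates to the image grid. It assumes the pixels were written out row by row (row-major order), which the lesson does not state explicitly; `flat` is a hypothetical stand-in with 3 examples, not the real dataset.

```python
import numpy as np

# Hypothetical stand-in: 3 flattened "images" with 784 values each.
flat = np.arange(3 * 784).reshape(3, 784)

# Restore (examples, height, width, channels), assuming row-major pixel order.
images = flat.reshape(-1, 28, 28, 1)
print(images.shape)  # (3, 28, 28, 1)

# Under that assumption, the pixel at row r, column c of image i
# sits at position r * 28 + c in the flat row.
r, c = 5, 7
same_pixel = images[0, r, c, 0] == flat[0, r * 28 + c]
print(same_pixel)  # True

# Flattening again gives back the original single-row layout.
round_trip_ok = np.array_equal(images.reshape(3, -1), flat)
print(round_trip_ok)  # True
```

In the flat layout, vertically adjacent pixels end up 28 positions apart, which is why a model fed the flat row does not see neighborhoods directly.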

54
00:05:02,510 --> 00:05:08,120
So that's something to bear in mind for your own projects and for any future projects and tutorials

55
00:05:08,390 --> 00:05:10,390
that you're working through.

56
00:05:10,400 --> 00:05:16,160
Now that we've explored our data and we've got a feeling for what it looks like, we're going to do a little

57
00:05:16,160 --> 00:05:20,780
bit more preprocessing so that we can feed it into our neural network.

58
00:05:20,810 --> 00:05:23,240
So I hope you're looking forward to that as much as I am.

59
00:05:23,690 --> 00:05:26,560
And that's coming right up in the next lesson.

60
00:05:26,570 --> 00:05:27,550
I'll see you there.