1 00:00:00,810 --> 00:00:07,130 Now that we've set up our notebook and downloaded our data let's start exploring it a little bit and 2 00:00:07,130 --> 00:00:12,890 talk a little bit more in detail about what kind of data we're actually working with we're going to 3 00:00:12,890 --> 00:00:14,470 be working with some image data. 4 00:00:14,510 --> 00:00:20,960 And this is image data of some handwritten digits all of the data is made of grayscale images of these 5 00:00:20,960 --> 00:00:22,410 handwritten digits. 6 00:00:22,430 --> 00:00:28,400 Lucky for us the hardest part of the work namely cleaning the data and getting this data ready for us 7 00:00:28,400 --> 00:00:37,110 to use has already been done by the National Institute of Standards and Technology the NIST. 8 00:00:37,250 --> 00:00:39,680 That's the nest behind amnesty. 9 00:00:40,370 --> 00:00:47,030 They've collected all this data and formatted it for processing part of the original data is actually 10 00:00:47,030 --> 00:00:51,680 available on their Web site and you can download it if you like. 11 00:00:51,680 --> 00:00:56,690 The only problem is is that the form that they've got chosen is a little bit strange. 12 00:00:56,710 --> 00:00:58,800 It's it's very efficient plant. 13 00:00:58,820 --> 00:01:06,350 It does require some pre processing what I've done instead is giving you the CSP because in the CSB 14 00:01:06,590 --> 00:01:13,040 we've got values for all the images in an array format and this is very similar to what we've encountered 15 00:01:13,040 --> 00:01:13,520 before. 16 00:01:14,060 --> 00:01:20,110 So these images are 20 pixels wide and 28 pixels tall. 17 00:01:20,840 --> 00:01:25,960 And they're also in grayscale meaning there's only a single color channel. 18 00:01:26,060 --> 00:01:32,290 And what this means is that each pixel in this image has a single value associated with it. 19 00:01:32,390 --> 00:01:35,810 That shows how light or how dark this pixel is. 20 00:01:35,840 --> 00:01:45,680 This value is basically an integer between 0 and 255 as such the total number of inputs for our perception 21 00:01:46,160 --> 00:01:53,180 is 28 times 28 times 1 1 being the number of color channels. 22 00:01:53,180 --> 00:01:56,330 And that's equal to seven hundred and eighty four. 23 00:01:56,360 --> 00:02:00,790 How are we getting these seven hundred and eighty four data points in a C as V file. 24 00:02:00,830 --> 00:02:03,450 Well this is how I structure them. 25 00:02:03,650 --> 00:02:07,490 I essentially put them all into a giant array. 26 00:02:07,550 --> 00:02:11,600 Let's take a look at this array in our Jupiter notebook here. 27 00:02:11,650 --> 00:02:18,640 I'll add a markdown cell that it's going to read explore and let's take a look at these x on this quatrain 28 00:02:18,640 --> 00:02:25,510 on the score all an X on the score test and Y underscore test number high raise and a bit more detail 29 00:02:26,290 --> 00:02:34,490 so X on this quatrain on a scroll the shape is going to show us that we've got sixty thousand examples. 30 00:02:34,870 --> 00:02:42,340 And each example has seven hundred and eighty four values associated with it here's how the very first 31 00:02:42,340 --> 00:02:48,970 one in our training data set looks like I'll pull this up with X on the squat train on a score all square 32 00:02:48,970 --> 00:02:55,280 brackets zero and what we see is an array as promised. 33 00:02:55,340 --> 00:03:04,520 It's got values between zero and two hundred and fifty five a zero value means that the pixel is completely 34 00:03:04,520 --> 00:03:11,960 white and a value of 255 means that the pixel is completely black. 35 00:03:11,960 --> 00:03:18,200 Everything else in between is a shade of gray and there are more than 50 shades of gray here as we can 36 00:03:18,200 --> 00:03:18,490 tell. 37 00:03:19,890 --> 00:03:27,210 Let me minimize this output now and quickly take a look at why underscore a train on a scroll and we 38 00:03:27,210 --> 00:03:36,890 see that it has 60000 labels to go along with the feature data for 60000 examples X on a school test. 39 00:03:36,890 --> 00:03:40,840 On the other hand only has ten thousand examples. 40 00:03:41,360 --> 00:03:47,390 So our testing dataset is a tad smaller than our training dataset which also explains the difference 41 00:03:47,390 --> 00:03:51,110 in file size and loading time down. 42 00:03:51,200 --> 00:03:56,530 What does r y on a squat train on a square all actually looked like. 43 00:03:56,540 --> 00:03:59,980 Let's take a look at the very first entry in there. 44 00:03:59,980 --> 00:04:02,110 We've got a five. 45 00:04:02,120 --> 00:04:04,230 What about the first five entries. 46 00:04:04,280 --> 00:04:13,460 So semicolon and then five shows us we've got 5 0 4 1 9. 47 00:04:13,640 --> 00:04:18,780 These here correspond to the categories or the classes four digits. 48 00:04:18,860 --> 00:04:24,950 Now one thing that you'll actually notice is that I've flattened our arrays for us because what I've 49 00:04:24,950 --> 00:04:30,760 actually done is I've given you all the pixels laid out in a single row for each image. 50 00:04:30,830 --> 00:04:40,690 If I had not done that then this shape would read 60000 by 28 by 28 by one right for the color channel. 51 00:04:40,790 --> 00:04:47,510 I've already combine all of these into a single row now that makes it a little bit easier to work with 52 00:04:47,660 --> 00:04:54,230 but it's not necessarily a good thing because we actually lose a little bit of positional information 53 00:04:54,680 --> 00:05:02,510 on each pixel like we don't know what other pixels surround that pixel by flapping them like this. 54 00:05:02,510 --> 00:05:08,120 So that's something to bear in mind for your own projects and for any future projects and tutorials 55 00:05:08,390 --> 00:05:10,390 that you're working through. 56 00:05:10,400 --> 00:05:16,160 Now that we explored our data and we've got a feeling for what it looks like we're gonna do a little 57 00:05:16,160 --> 00:05:20,780 bit more pre processing so that we can feed it into our neural network. 58 00:05:20,810 --> 00:05:23,240 So I hope you're looking forward to that as much as I am. 59 00:05:23,690 --> 00:05:26,560 And that's coming right up in the next lesson. 60 00:05:26,570 --> 00:05:27,550 I'll see you there.