1 00:00:00,180 --> 00:00:01,050 All righty. 2 00:00:01,080 --> 00:00:06,570 So now we've got all of our training image filenames, but these are just filenames at the moment, 3 00:00:06,720 --> 00:00:09,270 so, well, file paths is probably the better way to put it. 4 00:00:09,270 --> 00:00:12,080 I've called this filenames, so we'll just leave that there. 5 00:00:12,090 --> 00:00:17,680 So these are file paths to our training images, which means later on we'll have a way to access them. 6 00:00:17,730 --> 00:00:20,490 So it's important to note here that these are just strings. 7 00:00:20,520 --> 00:00:23,020 They're not the actual images just yet. 8 00:00:23,100 --> 00:00:25,170 We'll do that in a second. 9 00:00:25,170 --> 00:00:28,590 But what we're doing now is we need to get our labels ready. 10 00:00:28,650 --> 00:00:33,970 So these are our labels. We've just dealt with the file IDs and turned them into file paths. 11 00:00:34,080 --> 00:00:36,200 But let's now get our labels. 12 00:00:36,270 --> 00:00:42,090 So remember, the premise of machine learning is to turn our data into numbers. More specifically, when 13 00:00:42,090 --> 00:00:44,080 we're working with TensorFlow, 14 00:00:44,150 --> 00:00:46,180 it's to turn it into tensors. 15 00:00:46,230 --> 00:00:50,820 That's what we're moving towards in this little subsection at the moment. 16 00:00:50,820 --> 00:01:04,720 All right, a little bit of text here: since we've now got our training image file paths in a list, let's 17 00:01:04,840 --> 00:01:10,580 prepare our labels. Okay. 18 00:01:10,930 --> 00:01:13,800 We've got all our labels in labels_csv. 19 00:01:14,440 --> 00:01:16,480 Let's just have a look at them, actually. 20 00:01:16,480 --> 00:01:17,480 Let's go labels. 21 00:01:17,500 --> 00:01:23,050 Let's just create a variable called labels: labels_csv, breed. 22 00:01:23,050 --> 00:01:25,480 Keep it simple, you know. What does this come out as? 23 00:01:29,090 --> 00:01:29,330 OK.
24 00:01:29,340 --> 00:01:30,790 So that's dtype object. 25 00:01:30,790 --> 00:01:33,360 Remember, we have to start turning these into numbers. 26 00:01:33,420 --> 00:01:37,280 So we might turn them into a NumPy array. 27 00:01:37,290 --> 00:01:40,540 So import numpy as np. 28 00:01:40,710 --> 00:01:48,210 Now, I wonder if I could just do this. Otherwise, I know another way to do it. 29 00:01:48,930 --> 00:01:50,370 And what if we go labels? 30 00:01:53,240 --> 00:02:02,000 Wonderful. How many are in that? len labels: 10,222, which is amazing. 31 00:02:02,000 --> 00:02:08,750 The other way I was going to do it is, if we comment out this (I'm doing command slash, that's how you 32 00:02:08,750 --> 00:02:17,080 can comment out a whole line), then we go to_numpy. Nice simple method in pandas. 33 00:02:17,310 --> 00:02:17,820 Same thing. 34 00:02:23,590 --> 00:02:26,410 Does the same thing as above. 35 00:02:26,590 --> 00:02:27,680 Little trick there. 36 00:02:27,730 --> 00:02:29,110 Let's do the same thing as before. 37 00:02:29,140 --> 00:02:31,810 Compare the number of labels to the number of filenames. 38 00:02:31,900 --> 00:02:37,780 If we come back to here, after we've downloaded this and unzipped it into our Colab notebook, what we're 39 00:02:37,780 --> 00:02:45,280 doing here, by comparing the number of labels to the number of filenames, is checking for missing data. 40 00:02:45,280 --> 00:02:51,550 So remember, back in the previous projects with structured data, there were some cells in the data frame 41 00:02:51,550 --> 00:02:57,890 that were missing, but checking unstructured data for missing data is a bit harder. 42 00:02:57,910 --> 00:03:06,250 So this is kind of the workaround we're doing: check if the number of labels matches the number of file 43 00:03:06,250 --> 00:03:08,250 names. And we're just going to do the exact same as before.
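A rough sketch of the two ways of turning the breed column into a NumPy array that are described above. This is not the exact notebook cell: the small DataFrame is made up to stand in for labels_csv, the `breed` column name is taken from the narration, and the `id` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for labels_csv (the real one is read from the labels CSV file)
labels_csv = pd.DataFrame({
    "id": ["a1", "b2", "c3"],  # assumed column name for the image IDs
    "breed": ["boston_bull", "dingo", "pekinese"],
})

# Way 1: wrap the breed column in np.array
labels = np.array(labels_csv["breed"])

# Way 2: the pandas method, which does the same thing as above
labels_alt = labels_csv["breed"].to_numpy()

print(labels)
print(len(labels))  # one label per row of the DataFrame
```

Either way the result is still an array of strings, so the labels are not numeric yet.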
44 00:03:08,250 --> 00:03:16,830 So if len labels, so the length of labels, what we just did there, equals len filenames. Because what we're 45 00:03:16,830 --> 00:03:24,690 eventually going to have to do is pair up these labels with these filenames, like what we've just done 46 00:03:24,690 --> 00:03:25,400 here. 47 00:03:25,580 --> 00:03:30,530 So index nine thousand is Tibetan mastiff, and its file name at nine thousand 48 00:03:30,540 --> 00:03:39,770 is this absolute lion of a dog. So let's go here, print a little confirmation for ourselves: number 49 00:03:39,770 --> 00:03:53,420 of labels matches number of filenames. Else, print: number of labels does not match number 50 00:03:53,510 --> 00:04:02,880 of filenames, check data directories. Fingers crossed. 51 00:04:03,110 --> 00:04:03,860 Beautiful. 52 00:04:03,860 --> 00:04:04,750 That's what we're after. 53 00:04:04,760 --> 00:04:08,030 Number of labels matches number of filenames. 54 00:04:08,090 --> 00:04:14,920 OK, now finally, since a machine... well, not really finally, we've still got a lot to go with this project. 55 00:04:16,080 --> 00:04:20,590 So since a machine learning model can't take strings as input, 56 00:04:20,630 --> 00:04:22,310 and this is what labels currently is, right? 57 00:04:22,310 --> 00:04:25,160 It's an array of strings. 58 00:04:25,280 --> 00:04:32,660 What we have to do is convert it into numbers. How might we do this? Well, to begin, let's find a list of 59 00:04:32,750 --> 00:04:36,530 all the unique dog breeds. There's ten thousand two hundred twenty two labels in here, 60 00:04:36,800 --> 00:04:39,210 but how would we find the unique labels? 61 00:04:39,350 --> 00:04:40,810 So let's do this. 62 00:04:41,120 --> 00:04:43,990 Find the unique label values. 63 00:04:44,000 --> 00:04:46,050 So I want you to have a little think about this.
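The count check dictated above might look something like the following. The `labels` and `filenames` values here are made-up stand-ins, and the helper name `counts_match` is hypothetical (the narration writes the comparison inline); the printed messages follow the wording spoken in the video.

```python
import numpy as np

# Made-up stand-ins for the real labels array and filenames list
labels = np.array(["boston_bull", "dingo", "pekinese"])
filenames = ["train/a1.jpg", "train/b2.jpg", "train/c3.jpg"]

def counts_match(labels, filenames):
    """Return True when there is exactly one label per file path."""
    return len(labels) == len(filenames)

if counts_match(labels, filenames):
    print("Number of labels matches number of filenames!")
else:
    print("Number of labels does not match number of filenames, check data directories!")
```

This only compares counts. It is a cheap sanity check for missing data, not proof that every label is paired with the right image.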
64 00:04:46,190 --> 00:04:54,800 If you have a NumPy array, I'm giving you a little hint: how might you find the unique values in a given 65 00:04:54,800 --> 00:04:59,040 array using NumPy? Have a little 66 00:04:59,050 --> 00:05:01,460 think. You could even look it up. 67 00:05:01,550 --> 00:05:03,320 There's a good Google search for that. 68 00:05:03,320 --> 00:05:09,240 So np.unique is how we find the unique elements of an array. And now this is pretty cool with Colab, 69 00:05:09,250 --> 00:05:09,560 right? 70 00:05:09,560 --> 00:05:13,610 It just pops up this docstring without me even doing anything. 71 00:05:13,610 --> 00:05:16,070 So we want to find the unique labels. 72 00:05:16,100 --> 00:05:18,950 Let's find that: unique_breeds. 73 00:05:22,040 --> 00:05:26,530 Let's have a look at that. All right. 74 00:05:26,560 --> 00:05:29,270 There's a whole bunch of different dog breeds, so let's have a look at them. 75 00:05:30,600 --> 00:05:35,090 Brittany spaniel, never even heard of that. Chihuahua. 76 00:05:37,460 --> 00:05:39,410 Gordon setter, Great Dane. 77 00:05:39,410 --> 00:05:40,530 Yeah, he's a beast. 78 00:05:40,550 --> 00:05:41,480 Okay. 79 00:05:41,630 --> 00:05:43,480 Getting distracted by dogs. 80 00:05:43,700 --> 00:05:48,870 And so this should be 120. Wonderful. 81 00:05:48,880 --> 00:05:50,350 So that is because we have 82 00:05:53,550 --> 00:05:56,750 120 breeds of dogs. We're lining up here, right? 83 00:05:56,760 --> 00:06:00,350 We're getting our data ready, we're getting it prepared to get into tensors. 84 00:06:00,390 --> 00:06:02,100 We've only got two arrays of strings. 85 00:06:02,100 --> 00:06:04,160 How might we turn these into numbers? 86 00:06:04,200 --> 00:06:06,990 Why don't we turn it into an array of booleans? 87 00:06:06,990 --> 00:06:09,140 Let me give you an example rather than talk about it. 88 00:06:09,180 --> 00:06:19,170 So turn a single label into an array of booleans. Print labels zero.
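As a small illustration of the np.unique step above, using a made-up five-item labels array rather than the real 10,222 labels:

```python
import numpy as np

# Made-up labels with repeats, standing in for the real breed labels
labels = np.array(["boston_bull", "dingo", "pekinese", "dingo", "boston_bull"])

# np.unique returns the sorted unique values of the array
unique_breeds = np.unique(labels)

print(unique_breeds)
print(len(unique_breeds))  # number of distinct breeds (120 on the real data, per the video)
```

Because the result is sorted, each breed ends up at a fixed position in unique_breeds, which is what makes the boolean comparison in the next step line up.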
89 00:06:19,210 --> 00:06:23,250 Just use labels zero as an example. So labels zero. 90 00:06:23,280 --> 00:06:30,440 Now we're going to use the comparison operator to compare the first label to unique_breeds. 91 00:06:30,450 --> 00:06:37,890 And what this should return is an array of true and false values, where everywhere in unique_breeds 92 00:06:38,940 --> 00:06:49,210 that labels zero doesn't equal should be False, but the only location where it does equal should be True. 93 00:06:49,210 --> 00:06:51,190 There we go. 94 00:06:51,190 --> 00:06:53,430 Let's have a look at this. So that's Boston bull. 95 00:06:54,310 --> 00:06:56,530 But then if we have a look at unique_breeds, 96 00:06:59,840 --> 00:07:09,090 we can see True is here, so what's that, about twelve in? So if we go through here, 97 00:07:09,090 --> 00:07:12,540 count them up, if you count them up, you'd find Boston bull. 98 00:07:12,540 --> 00:07:13,140 There we go. 99 00:07:13,140 --> 00:07:13,620 True. 100 00:07:14,370 --> 00:07:15,950 That's what we're after. 101 00:07:15,960 --> 00:07:24,700 Let's do that for every single label. So turn every label into a boolean array. If you're looking at 102 00:07:24,700 --> 00:07:27,220 this, and this is just all trues and falses, 103 00:07:27,370 --> 00:07:29,890 you're probably wondering how this is going to be converted to numbers. 104 00:07:29,900 --> 00:07:35,160 But we'll see in a second, there's a special thing with NumPy arrays and boolean values. 105 00:07:35,320 --> 00:07:36,970 So let's go boolean_labels 106 00:07:37,270 --> 00:07:44,370 equals label... we'll do another list comprehension here. 107 00:07:44,490 --> 00:07:51,270 So label equals unique_breeds, for label in labels. 108 00:07:52,260 --> 00:07:53,880 Let's see what the code says first. 109 00:07:53,880 --> 00:07:54,720 Then we'll talk through it.
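A minimal sketch of the comparison and the list comprehension dictated above, again on a made-up labels array. The name `single_label_mask` is hypothetical, added only to hold the one-off example; `labels`, `unique_breeds` and `boolean_labels` follow the names used in the narration.

```python
import numpy as np

# Made-up labels standing in for the real breed labels
labels = np.array(["boston_bull", "dingo", "pekinese", "dingo", "boston_bull"])
unique_breeds = np.unique(labels)

# One-off example: compare the first label against every unique breed.
# The result is a boolean array that is True only where the breed matches.
single_label_mask = labels[0] == unique_breeds
print(labels[0])
print(single_label_mask)

# Scaled-up version: do the same comparison for every label
boolean_labels = [label == unique_breeds for label in labels]
print(len(boolean_labels))  # same length as labels
```

Each entry of boolean_labels has one slot per unique breed, with exactly one True.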
110 00:07:56,790 --> 00:08:08,030 If in doubt, run the code. Wonderful. And boolean_labels should be 10,000 long, or ten thousand two hundred 111 00:08:08,170 --> 00:08:09,500 twenty two, whatever it was. 112 00:08:12,560 --> 00:08:13,430 Wonderful. 113 00:08:13,460 --> 00:08:19,640 So that means we've turned every single label into a boolean label now. 114 00:08:19,670 --> 00:08:21,600 What is this little list comprehension doing? 115 00:08:21,650 --> 00:08:25,700 All it's doing is a scaled-up version of what we've got here. 116 00:08:25,790 --> 00:08:38,600 So it's saying do this little one-off example for every label in labels, and if we remember, len labels 117 00:08:39,360 --> 00:08:43,850 is ten thousand two hundred twenty two, because we have ten thousand two hundred twenty two labels. 118 00:08:43,850 --> 00:08:47,370 So it's done it for every single label. 119 00:08:47,510 --> 00:08:52,110 You might be thinking again, why are we turning them into true and false? Well, 120 00:08:52,130 --> 00:08:58,980 now we can convert these boolean arrays into numbers, more specifically one-hot encoded numbers. 121 00:08:59,250 --> 00:09:03,560 So if in doubt, run the code. Let's check out the code, then we'll talk it through. 122 00:09:04,070 --> 00:09:17,010 So, example: turning a boolean array into integers. Print labels 0. 123 00:09:17,010 --> 00:09:18,020 We'll get the first one. 124 00:09:18,180 --> 00:09:28,700 So this is the original label, straight from the data frame, straight from labels_csv. Print np.where unique_ 125 00:09:28,700 --> 00:09:32,090 breeds equals labels 126 00:09:32,120 --> 00:09:35,640 0. Print... 127 00:09:35,880 --> 00:09:43,720 So actually we should probably note this is the index where the label occurs. And then we're gonna go print 128 00:09:43,840 --> 00:09:52,030 boolean_labels 0 dot argmax.
129 00:09:52,030 --> 00:09:54,550 So this is the index 130 00:09:57,880 --> 00:10:08,020 where the label occurs in the boolean array. And then we're gonna go print boolean_labels 131 00:10:11,240 --> 00:10:26,500 0 dot astype int. So there will be a one, or there should be a one, where the sample label occurs, 132 00:10:26,820 --> 00:10:28,360 but we're gonna check it out. 133 00:10:28,750 --> 00:10:29,530 Look at that. 134 00:10:29,590 --> 00:10:38,410 So the original label is Boston bull, and it appears at the 19th index, which is the same in our boolean 135 00:10:38,410 --> 00:10:41,850 labels zero index, so it appears at 19. 136 00:10:42,010 --> 00:10:56,830 And if we were to count this, so 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 is where the one occurs. 137 00:10:56,830 --> 00:11:07,240 So now every single label in boolean_labels is actually in this type of format, so there's a one where 138 00:11:07,240 --> 00:11:09,730 the actual label occurs. 139 00:11:09,820 --> 00:11:20,910 So if we did the same thing for, let's go 2, and then print labels 2, 140 00:11:24,330 --> 00:11:30,350 it's Pekinese, it occurs right down here, and zero for everything else. 141 00:11:30,630 --> 00:11:31,830 Wonderful. 142 00:11:31,830 --> 00:11:39,120 So now we've got our labels in a numeric format and our image file paths easily accessible, because remember, 143 00:11:39,120 --> 00:11:40,630 we've got filenames. 144 00:11:40,890 --> 00:11:49,630 These are our image paths. They aren't numeric yet, but we can do that later on using TensorFlow. 145 00:11:49,630 --> 00:11:53,530 So now we've got our data in an accessible format. 146 00:11:53,560 --> 00:12:01,610 Let's create some training and validation sets, because if we come to our files, we notice when we downloaded 147 00:12:01,680 --> 00:12:05,590 it from Kaggle, we only have a train and test set. To do our experiments, 148 00:12:05,710 --> 00:12:08,800 we want to split our data into training and validation.
149 00:12:09,250 --> 00:12:11,290 So that's what we're doing in the next video.
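To recap the boolean-to-integer example from this video, here is a small sketch on made-up data. With three breeds instead of 120, the index printed is 0 rather than the 19 seen in the video for Boston bull; variable names follow the narration.

```python
import numpy as np

# Made-up labels standing in for the real breed labels
labels = np.array(["boston_bull", "dingo", "pekinese", "dingo", "boston_bull"])
unique_breeds = np.unique(labels)
boolean_labels = [label == unique_breeds for label in labels]

print(labels[0])                             # original label
print(np.where(unique_breeds == labels[0]))  # index where the label occurs in unique_breeds
print(boolean_labels[0].argmax())            # index of the single True in the boolean array
print(boolean_labels[0].astype(int))         # True/False become 1/0: a one-hot encoded label

# Same thing for the label at index 2
print(labels[2])
print(boolean_labels[2].astype(int))
```

The one-hot arrays are the numeric form of the labels; the file paths stay as strings until the images are read in later.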