1
00:00:00,180 --> 00:00:01,050
All righty.

2
00:00:01,080 --> 00:00:06,570
So now we've got all of our training image file names, but these are just file names at the moment,

3
00:00:06,720 --> 00:00:09,270
or file paths, probably the better way to put it.

4
00:00:09,270 --> 00:00:12,080
I've called this filenames, so we'll just leave that there.

5
00:00:12,090 --> 00:00:17,680
So these are file paths to our training images which means later on we'll have a way to access them.

6
00:00:17,730 --> 00:00:20,490
So it's important to note here that these are just strings.

7
00:00:20,520 --> 00:00:23,020
They're not the actual images just yet.

8
00:00:23,100 --> 00:00:25,170
We'll do that in a second.

9
00:00:25,170 --> 00:00:28,590
But what we're doing now is we need to get our labels ready.

10
00:00:28,650 --> 00:00:33,970
So these are our labels. We've just dealt with the file IDs and turned them into file paths.

11
00:00:34,080 --> 00:00:36,200
But let's now get our labels.

12
00:00:36,270 --> 00:00:42,090
So remember, the premise of machine learning is to turn our data into numbers. More specifically, when

13
00:00:42,090 --> 00:00:44,080
we're working with TensorFlow,

14
00:00:44,150 --> 00:00:46,180
it's to turn it into tensors.

15
00:00:46,230 --> 00:00:50,820
That's what we're moving towards in this little subsection at the moment.

16
00:00:50,820 --> 00:01:04,720
All right, a little bit of text here: since we've now got our training image file paths in a list, let's

17
00:01:04,840 --> 00:01:10,580
prepare our labels. Okay.

18
00:01:10,930 --> 00:01:13,800
We've got all our labels in labels_csv.

19
00:01:14,440 --> 00:01:16,480
Let's just have a look at them actually.

20
00:01:16,480 --> 00:01:17,480
Let's go labels.

21
00:01:17,500 --> 00:01:23,050
Let's just create a variable called labels: labels_csv["breed"].

22
00:01:23,050 --> 00:01:25,480
Keep it simple, you know. What does this come out as?

23
00:01:29,090 --> 00:01:29,330
OK.

24
00:01:29,340 --> 00:01:30,790
So that's dtype object.

25
00:01:30,790 --> 00:01:33,360
Remember we have to start turning these into numbers.

26
00:01:33,420 --> 00:01:37,280
So we might turn them into a NumPy array.

27
00:01:37,290 --> 00:01:40,540
So import numpy as np.

28
00:01:40,710 --> 00:01:48,210
Now I wonder if I could just do this otherwise I know another way to do it.

29
00:01:48,930 --> 00:01:50,370
And what if we go labels

30
00:01:53,240 --> 00:02:02,000
Wonderful. How many are in that? len(labels): 10,222, which is amazing.

31
00:02:02,000 --> 00:02:08,750
The other way I was going to do it is, if we comment out this, I'm doing Command-slash, that's how you

32
00:02:08,750 --> 00:02:17,080
can comment out a whole line, and then we go to_numpy, a nice simple method in pandas.

33
00:02:17,310 --> 00:02:17,820
Same thing

34
00:02:23,590 --> 00:02:26,410
does same thing as above.
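
The two conversions just described can be sketched roughly as follows. This is a minimal example on a tiny made-up DataFrame, not the real labels file; the column values and the names labels_a and labels_b are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the labels DataFrame (hypothetical data, not the real CSV)
labels_csv = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "breed": ["boston_bull", "dingo", "pekinese"],
})

labels_series = labels_csv["breed"]        # pandas Series, dtype object
labels_a = np.array(labels_series)         # option 1: wrap the Series in np.array
labels_b = labels_csv["breed"].to_numpy()  # option 2: pandas' own to_numpy method

print(type(labels_a), len(labels_a))
print(np.array_equal(labels_a, labels_b))  # both routes give the same array
```

Either route should give a NumPy array of strings, so which one to use is mostly a matter of taste.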

35
00:02:26,590 --> 00:02:27,680
Little trick there.

36
00:02:27,730 --> 00:02:29,110
Let's do the same thing as before.

37
00:02:29,140 --> 00:02:31,810
Compare the amount of labels to the number of file names.

38
00:02:31,900 --> 00:02:37,780
If we come back to here, after we've downloaded this and unzipped it into our Colab notebook, what we're

39
00:02:37,780 --> 00:02:45,280
doing here, by comparing the number of labels to the number of file names, is checking for missing data.

40
00:02:45,280 --> 00:02:51,550
So remember, back in the previous projects with structured data, there were some cells in the DataFrame

41
00:02:51,550 --> 00:02:57,890
that were missing, but checking whether there's missing data with unstructured data is a bit harder.

42
00:02:57,910 --> 00:03:06,250
So this is kind of the workaround we're doing: check if number of labels matches the number of file

43
00:03:06,250 --> 00:03:08,250
names. And we're just going to do the exact same as before.

44
00:03:08,250 --> 00:03:16,830
So if len(labels), so the length of labels, what we just did there, equals len(filenames), because what we're

45
00:03:16,830 --> 00:03:24,690
eventually going to have to do is pair up these labels with these file names like what we've just done

46
00:03:24,690 --> 00:03:25,400
here.

47
00:03:25,580 --> 00:03:30,530
So index 9000 is Tibetan mastiff, and its file name, 9000,

48
00:03:30,540 --> 00:03:39,770
is this absolute lion of a dog. So let's go here, print a little confirmation for ourselves: number

49
00:03:39,770 --> 00:03:53,420
of labels matches number of file names be a uniform else print number of labels does not match number

50
00:03:53,510 --> 00:04:02,880
of file names, check data directories. Fingers crossed.

51
00:04:03,110 --> 00:04:03,860
Beautiful.

52
00:04:03,860 --> 00:04:04,750
That's what we're after.

53
00:04:04,760 --> 00:04:08,030
Number of labels matches number of file names.
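
The length check described above might look something like this. It is only a sketch on made-up lists; the file paths are hypothetical and the message variable is an assumption added so the result is easy to inspect.

```python
# Toy stand-ins (hypothetical values, not the real dataset)
labels = ["boston_bull", "dingo", "pekinese"]
filenames = ["train/a1.jpg", "train/b2.jpg", "train/c3.jpg"]

# With unstructured data, comparing counts is a cheap proxy for a missing-data check
if len(labels) == len(filenames):
    message = "Number of labels matches number of filenames!"
else:
    message = "Number of labels does not match number of filenames, check data directories."

print(message)
```

A matching count does not prove every label is paired with the right image, but a mismatch is a clear sign something went wrong during download or unzipping.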

54
00:04:08,090 --> 00:04:14,920
OK, now finally, since a machine... or not really finally, we've still got a lot to go with this project.

55
00:04:16,080 --> 00:04:20,590
So a machine learning model can't take strings as input.

56
00:04:20,630 --> 00:04:22,310
This is what labels currently is right.

57
00:04:22,310 --> 00:04:25,160
It's an array of strings.

58
00:04:25,280 --> 00:04:32,660
What we have to do is convert it into numbers. How might we do this? Well, to begin, let's find a list of

59
00:04:32,750 --> 00:04:36,530
all the unique dog breeds out of the 10,222 in here.

60
00:04:36,800 --> 00:04:39,210
But how would we find the unique labels.

61
00:04:39,350 --> 00:04:40,810
So let's do this.

62
00:04:41,120 --> 00:04:43,990
Find the unique label values.

63
00:04:44,000 --> 00:04:46,050
So I want you to have a little think about this.

64
00:04:46,190 --> 00:04:54,800
If you have a NumPy array, I'm giving you a little hint, how might you find the unique values in a given

65
00:04:54,800 --> 00:04:59,040
array using NumPy? Have a little

66
00:04:59,050 --> 00:05:01,460
think. You could even look it up.

67
00:05:01,550 --> 00:05:03,320
There's a good Google search for that.

68
00:05:03,320 --> 00:05:09,240
So np.unique is how we find the unique elements of an array. And now this is pretty cool with Colab,

69
00:05:09,250 --> 00:05:09,560
right.

70
00:05:09,560 --> 00:05:13,610
It just pops up this docstring without me even doing anything.

71
00:05:13,610 --> 00:05:16,070
So we want to find the unique labels.

72
00:05:16,100 --> 00:05:18,950
Let's call that unique_breeds.

73
00:05:22,040 --> 00:05:26,530
Let's have a look at that. All right.

74
00:05:26,560 --> 00:05:29,270
There's a whole bunch of different dog breeds, so let's have a look at them.

75
00:05:30,600 --> 00:05:35,090
Brittany spaniel, never even heard of that. Chihuahua.

76
00:05:37,460 --> 00:05:39,410
Gordon setter Great Dane.

77
00:05:39,410 --> 00:05:40,530
Yeah he's a beast.

78
00:05:40,550 --> 00:05:41,480
Okay.

79
00:05:41,630 --> 00:05:43,480
Getting distracted by dogs.

80
00:05:43,700 --> 00:05:48,870
And so this should be 120. Wonderful.

81
00:05:48,880 --> 00:05:50,350
So that is because we have

82
00:05:53,550 --> 00:05:56,750
120 breeds of dogs we're lining up here right.
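
As a rough sketch of the np.unique step, here is the same idea on a small made-up array. The real data gives 120 unique breeds; this toy array only has three, and its contents are assumptions for illustration.

```python
import numpy as np

# Toy labels with repeats (hypothetical values, not the real dataset)
labels = np.array(["dingo", "boston_bull", "dingo", "pekinese", "boston_bull"])

# np.unique returns the sorted unique elements of the array
unique_breeds = np.unique(labels)

print(unique_breeds)
print(len(unique_breeds))
```

Because the result is sorted, each breed ends up at a stable index, which is what makes the boolean comparison later on line up consistently.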

83
00:05:56,760 --> 00:06:00,350
We're getting our data ready, we're getting it prepared to get into tensors.

84
00:06:00,390 --> 00:06:02,100
We've only got two arrays of strings.

85
00:06:02,100 --> 00:06:04,160
How might we turn these into numbers.

86
00:06:04,200 --> 00:06:06,990
Why don't we turn it into an array of booleans?

87
00:06:06,990 --> 00:06:09,140
Let me give you an example rather than talk about it.

88
00:06:09,180 --> 00:06:19,170
So turn a single label into an array of booleans. Print labels[0].

89
00:06:19,210 --> 00:06:23,250
Just use label zero as an example so labels zero.

90
00:06:23,280 --> 00:06:30,440
Now we're going to use that comparison operator to compare the first label to unique breeds.

91
00:06:30,450 --> 00:06:37,890
And what this should return is an array of True and False values, where everywhere in unique_breeds

92
00:06:38,940 --> 00:06:49,210
that labels[0] doesn't equal should be False, but the only location where it does equal should be True.

93
00:06:49,210 --> 00:06:51,190
There we go.

94
00:06:51,190 --> 00:06:53,430
Let's have a look at this. So it's Boston bull.

95
00:06:54,310 --> 00:06:56,530
But then if we have a look at unique breeds

96
00:06:59,840 --> 00:07:09,090
So we can see True is here. So what's that, about twelve in? So if we go through here.

97
00:07:09,090 --> 00:07:12,540
Count them up. If you counted them up, you'd find Boston bull.

98
00:07:12,540 --> 00:07:13,140
There we go.

99
00:07:13,140 --> 00:07:13,620
True.

100
00:07:14,370 --> 00:07:15,950
That's what we're after.
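
The single-label comparison can be sketched like this, assuming a tiny made-up set of breeds rather than the real 120. The name mask is an assumption added for illustration.

```python
import numpy as np

# Toy stand-ins (hypothetical values, not the real dataset)
unique_breeds = np.array(["boston_bull", "dingo", "pekinese"])
labels = np.array(["dingo", "pekinese", "boston_bull"])

# Comparing one label against the whole array is done elementwise by NumPy,
# giving True only at the position where the breed matches
mask = labels[0] == unique_breeds

print(labels[0])
print(mask)
```

The result has one entry per unique breed, so its length matches unique_breeds rather than labels.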

101
00:07:15,960 --> 00:07:24,700
Let's do that for every single label. So turn every label into a boolean array. If you're looking at

102
00:07:24,700 --> 00:07:27,220
this and this is just all Trues and Falses,

103
00:07:27,370 --> 00:07:29,890
You're probably wondering how is this going to be converted to numbers.

104
00:07:29,900 --> 00:07:35,160
But we'll see in a second, there's a special thing with NumPy arrays and boolean values.

105
00:07:35,320 --> 00:07:36,970
So let's go boolean_labels

106
00:07:37,270 --> 00:07:44,370
equals label... we'll do another list comprehension here.

107
00:07:44,490 --> 00:07:51,270
So label equals unique_breeds for label in labels.

108
00:07:52,260 --> 00:07:53,880
Let's see what the code says first.

109
00:07:53,880 --> 00:07:54,720
Then we'll talk through it.

110
00:07:56,790 --> 00:08:08,030
If in doubt, run the code. Wonderful. And boolean_labels should be 10,000 long, or ten thousand two hundred

111
00:08:08,170 --> 00:08:09,500
twenty two whatever it was

112
00:08:12,560 --> 00:08:13,430
wonderful.

113
00:08:13,460 --> 00:08:19,640
So that means we've turned every single label into a boolean label now.

114
00:08:19,670 --> 00:08:21,600
What is this little list comprehension doing?

115
00:08:21,650 --> 00:08:25,700
All it's doing is a scaled-up version of what we've got here.

116
00:08:25,790 --> 00:08:38,600
So it's saying, do this little one-off example for every label in labels. And if we remember, len(labels)

117
00:08:39,360 --> 00:08:43,850
is 10,222, because we have 10,222 labels.

118
00:08:43,850 --> 00:08:47,370
So it's done it for every single label.
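
A minimal sketch of that list comprehension, again on made-up data. With the real dataset the resulting list would have 10,222 entries; here it only has three.

```python
import numpy as np

# Toy stand-ins (hypothetical values, not the real dataset)
labels = np.array(["dingo", "pekinese", "boston_bull"])
unique_breeds = np.unique(labels)

# Repeat the one-off comparison for every label:
# each entry is a boolean array with a single True at that label's breed
boolean_labels = [label == unique_breeds for label in labels]

print(len(boolean_labels))
print(boolean_labels[:2])
```

The list has one boolean array per label, and each of those arrays has one slot per unique breed.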

119
00:08:47,510 --> 00:08:52,110
You might be thinking again, why are we turning them into True and False? Well,

120
00:08:52,130 --> 00:08:58,980
now we can convert these boolean arrays into numbers, more specifically one-hot encoded numbers.

121
00:08:59,250 --> 00:09:03,560
So if in doubt, run the code. Let's check out the code, then we'll talk it through.

122
00:09:04,070 --> 00:09:17,010
So, example: turning boolean array into integers. Print labels[0].

123
00:09:17,010 --> 00:09:18,020
We'll get the first one.

124
00:09:18,180 --> 00:09:28,700
So this is the original label, straight from the DataFrame, straight from labels_csv. Print np.where unique_breeds

125
00:09:28,700 --> 00:09:32,090
equals labels

126
00:09:32,120 --> 00:09:35,640
zero. Print.

127
00:09:35,880 --> 00:09:43,720
So actually we should probably comment this: this is the index where label occurs. And then we're gonna go print

128
00:09:43,840 --> 00:09:52,030
boolean_labels[0].argmax().

129
00:09:52,030 --> 00:09:54,550
So this is index

130
00:09:57,880 --> 00:10:08,020
where label occurs in boolean array. And then we're gonna go print boolean_labels[0]

131
00:10:11,240 --> 00:10:26,500
astype(int). So there will be a one, or there should be a one, where the sample label occurs as a big self

132
00:10:26,820 --> 00:10:28,360
but we're gonna check it out.

133
00:10:28,750 --> 00:10:29,530
Look at that.

134
00:10:29,590 --> 00:10:38,410
So the original label is Boston bull, and it appears at the 19th index, which is the same in our boolean

135
00:10:38,410 --> 00:10:41,850
labels zero index so it appears at 19.

136
00:10:42,010 --> 00:10:56,830
And if we were to count this so 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 is where one occurs.

137
00:10:56,830 --> 00:11:07,240
So now every single label in boolean_labels is actually in this type of format, so there's a one where

138
00:11:07,240 --> 00:11:09,730
the actual label occurs.

139
00:11:09,820 --> 00:11:20,910
So if we did the same thing for, let's go two, and then print labels[2],

140
00:11:24,330 --> 00:11:30,350
it's Pekinese: a one occurs right down here, and zero for everything else.

141
00:11:30,630 --> 00:11:31,830
Wonderful.
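
Putting the last few steps together, a hedged sketch of the boolean-to-integer conversion might look like this. The data is made up, and the names index_from_where, index_from_argmax and one_hot are assumptions for illustration.

```python
import numpy as np

# Toy stand-ins (hypothetical values, not the real dataset)
unique_breeds = np.array(["boston_bull", "dingo", "pekinese"])
labels = np.array(["dingo", "pekinese"])
boolean_labels = [label == unique_breeds for label in labels]

print(labels[0])  # original label

# Index where the label occurs in unique_breeds
index_from_where = np.where(unique_breeds == labels[0])[0][0]
print(index_from_where)

# Index where the label occurs in the boolean array (position of the True)
index_from_argmax = boolean_labels[0].argmax()
print(index_from_argmax)

# Casting booleans to int gives a one-hot vector: 1 at the label, 0 elsewhere
one_hot = boolean_labels[0].astype(int)
print(one_hot)
```

The two indices should agree, and the single 1 in the integer array sits at that same index.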

142
00:11:31,830 --> 00:11:39,120
So now we've got our labels in a numeric format and our image file paths easily accessible because remember

143
00:11:39,120 --> 00:11:40,630
we've got file names.

144
00:11:40,890 --> 00:11:49,630
These are our image paths. They aren't numeric yet, but we can do that later on using TensorFlow.

145
00:11:49,630 --> 00:11:53,530
So now we've got our data in an accessible format.

146
00:11:53,560 --> 00:12:01,610
Let's create some training and validation sets because if we come to our files we notice when we download

147
00:12:01,680 --> 00:12:05,590
it from Kaggle, we only have a train and test set. To do our experiments,

148
00:12:05,710 --> 00:12:08,800
We want to split our data into training and validation.

149
00:12:09,250 --> 00:12:11,290
So that's what we'll be doing in the next video.