Welcome back. Over the last few videos we've been spending a fair bit of effort getting our data into an accessible format, and now we've got that. Let's just quickly have a look. We've got boolean_labels, which is our labels in numerical format. We'll have a look at the first two so it doesn't take up too much room: an array of Trues and Falses. We can access those as zeros and ones really easily, and the same goes for our filepaths. So these are our training images here, and we left off discussing how Kaggle doesn't provide us with a validation set.

So, creating our own validation set. Since the dataset from Kaggle doesn't come with a validation set, we're going to create our own. And remember from way back in the machine learning section, one of the most important concepts in machine learning is the three sets: the training set, the validation set and the test set. Right now we've got the training set and a test set, but we'd like this validation set, which is like the practice exam. So to reiterate: we train our machine learning model on the training set, we evaluate it on the validation set to see if the experiments we've been doing on the training set are going all right, and then finally the final exam is our test set. So we'd like this little intermediate set to validate our initial experiments, and that's why we're going to make a validation set.

Now, to do so, what could we use? What have we used in the past? I want you to think about this. It was in the scikit-learn section. If you said train test split, you'd be correct. Now, it can be a bit confusing because it's called train test split, but really the functionality of the function is just splitting data into two different sets: one set of a certain size and another set of another size. Let's set that up.

But first, what we might do is set up X and y variables. The reason we do this is because having to remember boolean_labels and filenames is a bit tedious. So what we might do is create X, which is the standard variable for your data, and y, which is usually the standard variable for your labels. So we're just going to make copies of those. Beautiful.
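As a rough sketch of what this step might look like in the notebook, assuming the filepaths live in a list called filenames and the labels in an array called boolean_labels (names taken from the discussion above; yours may differ):

```python
# Labels in numerical (boolean) format -- peek at the first two
boolean_labels[:2]

# Set up X (data) and y (labels) so they're easier to refer to
X = filenames       # list of image filepath strings
y = boolean_labels  # labels as an array of Trues/Falses

len(X), len(y)      # both should be 10222
```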
Now, since we're working with 10,000+ images (10,222 to be exact), it's a good idea to start working with a portion of them first, to make sure things are working before we go and train a machine learning model on all of them. This is a thing to note for your own projects: when you're doing experiments, remember what our goal is. It's to minimize the time between experiments so we can figure out what works and what doesn't. And if we're using 10,000 examples every single time, that's going to take a while. If you've never trained a machine learning or deep learning model on 10,000+ images, so you don't have a perspective of how long that takes, you can take my word for it that it would take a long time. So we'll reduce it down and start with about 1,000 images to begin with, and then increase it as we need.

So let's write a little note for ourselves: we're going to start off experimenting with about 1,000 images and increase as needed. Now this goes not just for images but for all kinds of machine learning projects, and we've seen this in the past, using samples of our data. Maybe you're working on a text classification problem and you have 100,000 emails that you want to classify as spam or not spam; you might start with only 5,000 and then build up from there.

So let's set a little parameter: set the number of images to use for experimenting. Now, I want to show you something really cool within Colab. We'll create a little parameter here, and it's in capitals because a convention you might see here and there in deep learning projects is that these kinds of parameters, things that you can set as a user, are often written in capitals. That's why you might see these variables in full capitals. So let's go 1,000, and then Colab has this cool little tool, it's like a magic function: @param. Now, I'm still working these out, so maybe you can leave a little suggestion on where I can find out how to make more of these if you figure it out, because I only know a couple. Now watch this.

So what we've done is we've just gone: hey, this is a parameter that we can set, create a slider type, the minimum is 1,000, the max is 10,000, and the step is 1,000. NUM_IMAGES has now just appeared. Look, if we go like this: 5,000, 4,000, 8,000... well, I could play around with that all day.
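In code, the slider being played with here is Colab's forms feature: a #@param annotation in a comment after the assignment tells Colab to render an interactive widget for that variable (this is Colab-specific and has no effect in plain Jupyter). A minimal sketch:

```python
# Set number of images to use for experimenting.
# The "#@param" annotation makes Colab show a slider next to this cell,
# so NUM_IMAGES can be changed without editing the code directly.
NUM_IMAGES = 1000 #@param {type:"slider", min:1000, max:10000, step:1000}
```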
That's something that Colab has over Jupyter, aside from being backed by a GPU offering. All right, so now we've got a little parameter, and we've got X and y. Let's use scikit-learn's train test split: let's split our data into train and validation sets. From sklearn.model_selection import train_test_split. Now again, this is called train test split, but the basic functionality of this function is just to split. You could do this manually using indexing, but we'll just use a function to help us out. We're just splitting the data into two different sets, training and validation, of total size NUM_IMAGES, which is this variable we've set up here. Run that cell. Then X_train, X_val (this time, not X_test), y_train, y_val equals train_test_split of X up to NUM_IMAGES. So it's only going to take the first thousand filenames, because remember X is just our filenames; we haven't imported any images just yet, because it's a lot faster to work with just filenames. And y is our labels in boolean form. And test_size, again this is really the validation size, will be 20 percent. Then random_state is kind of like setting a random seed with NumPy: random_state=42 is the same as going np.random.seed(42). And if you're not sure what np.random.seed does, there's a video on that; otherwise you might want to just quickly Google "what does np.random.seed do".

Then let's go len(X_train), just to check that the lengths of our data are correct, because you wouldn't believe the amount of times I've been burned because my data was the wrong shape. You will spend hours battling to get your data into the right shape. Trust me, I've been there. All right, wonderful. So we've got a thousand samples and a 20 percent test size, which means there are going to be 800 samples in the training split and 200 samples in the validation split.

So let's just have a quick look, let's have a gaze at the training data, so X_train, just to verify that we're still on this earth, that we're still working with the right stuff. Actually, that's a bit too many; I've got to remember that these are full boolean labels, so there's a lot there. So for training data we've got filepaths: correct, that is what we want. And for training labels we have boolean labels: that is what we are after.
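Putting that together, the split roughly as walked through above (assuming X, y and NUM_IMAGES are defined as in the previous cells):

```python
# Split the first NUM_IMAGES filepaths/labels into training and validation sets
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X[:NUM_IMAGES],
                                                  y[:NUM_IMAGES],
                                                  test_size=0.2,    # 20% for validation
                                                  random_state=42)  # reproducible split

len(X_train), len(X_val), len(y_train), len(y_val)  # -> (800, 200, 800, 200)

# Have a quick gaze at a couple of training examples and their labels
X_train[:2], y_train[:2]
```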
All right. So now that we have our data in a training set and a validation set (a small subset, mind you, because remember, our goal to begin with is to minimize the time between experiments), what we can probably finally do is turn our images, or at least create some functionality to turn our images and labels, into tensors. Let's do that in the next video.