Welcome back. Over the last few videos we've been spending a fair bit of effort getting our data into an accessible format, and now we've got that. Let's just quickly have a look. We've got boolean_labels, which is our labels in numerical format. We'll have a look at the first two so it doesn't take up too much room: an array of Trues and Falses. We can access those as zeros and ones really easily, and the same goes for our filepaths. So these are our training images here, and we left off discussing how Kaggle doesn't provide us with a validation set.

So, creating our own validation set. Since the dataset from Kaggle doesn't come with a validation set, we're going to create our own. And remember from way back in the machine learning section, one of the most important concepts in machine learning is the three sets: the training set, the validation set and the test set. Right now we've got the training set and a test set, but we'd like this validation set, which is like the practice exam. So to reiterate: we train our machine learning model on the training set, we evaluate it on the validation set to see if the experiments we've been doing on the training set are going all right, and then finally the final exam is our test set. So we'd like this little intermediate set to validate our initial experiments, and that's why we're going to make a validation set.

Now, to do so, what could we use? What have we used in the past? I want you to think about this. It was in the scikit-learn section. If you said train test split, you'd be correct. Now, it can be a bit confusing because it's called train test split, but really the functionality of the function is just splitting data into two different sets: one set of a certain size and another set of another size. Let's set that up.

But first, what we might do is set up X and y variables. The reason we do this is because having to remember boolean_labels and filenames is a bit tedious. So what we might do is create X, which is the standard variable for your data, and y, which is usually the standard variable for your labels. So we're just going to make copies of those. Beautiful.
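As a rough sketch of what this step might look like in the notebook, assuming the filepaths live in a list called filenames and the labels in an array called boolean_labels (names taken from the discussion above; yours may differ):

```python
# Labels in numerical (boolean) format -- peek at the first two
boolean_labels[:2]

# Set up X (data) and y (labels) so they're easier to refer to
X = filenames       # list of image filepath strings
y = boolean_labels  # labels as an array of Trues/Falses

len(X), len(y)      # both should be 10222
```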
Now, since we're working with 10,000+ images (10,222 to be exact), it's a good idea to start working with a portion of them first, to make sure things are working before we go and train a machine learning model on all of them. This is a thing to note for your own projects: when you're doing experiments, remember what our goal is. It's to minimize the time between experiments so we can figure out what works and what doesn't. And if we're using 10,000 examples every single time, that's going to take a while. If you've never trained a machine learning or deep learning model on 10,000+ images, so you don't have a perspective of how long that takes, you can take my word for it that it would take a long time. So we'll reduce it down and start with about 1,000 images to begin with, and then increase it as we need.

So let's write a little note for ourselves: we're going to start off experimenting with about 1,000 images and increase as needed. Now this goes not just for images but for all kinds of machine learning projects, and we've seen this in the past, using samples of our data. Maybe you're working on a text classification problem and you have 100,000 emails that you want to classify as spam or not spam; you might start with only 5,000 and then build up from there.

So let's set a little parameter: set the number of images to use for experimenting. Now, I want to show you something really cool within Colab. We'll create a little parameter here, and it's in capitals because a convention you might see here and there in deep learning projects is that these kinds of parameters, things that you can set as a user, are often written in capitals. That's why you might see these variables in full capitals. So let's go 1,000, and then Colab has this cool little tool, it's like a magic function: @param. Now, I'm still working these out, so maybe you can leave a little suggestion on where I can find out how to make more of these if you figure it out, because I only know a couple. Now watch this.

So what we've done is we've just gone: hey, this is a parameter that we can set, create a slider type, the minimum is 1,000, the max is 10,000, and the step is 1,000. NUM_IMAGES has now just appeared. Look, if we go like this: 5,000, 4,000, 8,000... well, I could play around with that all day.
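In code, the slider being played with here is Colab's forms feature: a #@param annotation in a comment after the assignment tells Colab to render an interactive widget for that variable (this is Colab-specific and has no effect in plain Jupyter). A minimal sketch:

```python
# Set number of images to use for experimenting.
# The "#@param" annotation makes Colab show a slider next to this cell,
# so NUM_IMAGES can be changed without editing the code directly.
NUM_IMAGES = 1000 #@param {type:"slider", min:1000, max:10000, step:1000}
```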
That's something that Colab has over Jupyter, aside from being backed by a GPU offering. All right, so now we've got a little parameter, and we've got X and y. Let's use scikit-learn's train test split: let's split our data into train and validation sets. From sklearn.model_selection import train_test_split. Now again, this is called train test split, but the basic functionality of this function is just to split. You could do this manually using indexing, but we'll just use a function to help us out. We're just splitting the data into two different sets, training and validation, of total size NUM_IMAGES, which is this variable we've set up here. Run that cell. Then X_train, X_val (this time, not X_test), y_train, y_val equals train_test_split of X up to NUM_IMAGES. So it's only going to take the first thousand filenames, because remember X is just our filenames; we haven't imported any images just yet, because it's a lot faster to work with just filenames. And y is our labels in boolean form. And test_size, again this is really the validation size, will be 20 percent. Then random_state is kind of like setting a random seed with NumPy: random_state=42 is the same as going np.random.seed(42). And if you're not sure what np.random.seed does, there's a video on that; otherwise you might want to just quickly Google "what does np.random.seed do".

Then let's go len(X_train), just to check that the lengths of our data are correct, because you wouldn't believe the amount of times I've been burned because my data was the wrong shape. You will spend hours battling to get your data into the right shape. Trust me, I've been there. All right, wonderful. So we've got a thousand samples and a 20 percent test size, which means there are going to be 800 samples in the training split and 200 samples in the validation split.

So let's just have a quick look, let's have a gaze at the training data, so X_train, just to verify that we're still on this earth, that we're still working with the right stuff. Actually, that's a bit too many; I've got to remember that these are full boolean labels, so there's a lot there. So for training data we've got filepaths: correct, that is what we want. And for training labels we have boolean labels: that is what we are after.
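Putting that together, the split roughly as walked through above (assuming X, y and NUM_IMAGES are defined as in the previous cells):

```python
# Split the first NUM_IMAGES filepaths/labels into training and validation sets
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X[:NUM_IMAGES],
                                                  y[:NUM_IMAGES],
                                                  test_size=0.2,    # 20% for validation
                                                  random_state=42)  # reproducible split

len(X_train), len(X_val), len(y_train), len(y_val)  # -> (800, 200, 800, 200)

# Have a quick gaze at a couple of training examples and their labels
X_train[:2], y_train[:2]
```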
All right. So now that we have our data in a training set and a validation set (a small subset, mind you, because remember, our goal to begin with is to minimize the time between experiments), what we can probably finally do is turn our images, or at least create some functionality to turn our images and labels, into tensors. Let's do that in the next video.