1 00:00:00,610 --> 00:00:01,890 Look at us go. 2 00:00:01,890 --> 00:00:04,650 We're moving through this framework at lightning pace. 3 00:00:04,650 --> 00:00:06,110 We've done problem definition. 4 00:00:06,150 --> 00:00:10,150 We've looked at data, and we've decided on an evaluation metric. 5 00:00:10,170 --> 00:00:13,110 We've understood a few of the features we've got in our data. 6 00:00:13,110 --> 00:00:15,620 Now we're up to step five, which is modelling. 7 00:00:15,690 --> 00:00:17,640 Now, there are a few parts to modelling. 8 00:00:17,640 --> 00:00:21,750 So we've broken this down into four different sections. 9 00:00:21,750 --> 00:00:23,800 And this is Section One. 10 00:00:23,910 --> 00:00:28,420 And this is probably the most important concept in machine learning: the three sets. 11 00:00:28,630 --> 00:00:35,730 Now, over the whole of modelling, we want to answer the question: based on our problem and data, what 12 00:00:35,730 --> 00:00:43,570 machine learning model should we use? Modelling can be broken down into three parts: choosing and training 13 00:00:43,570 --> 00:00:50,470 a model, tuning a model, and model comparison. Before we get into these, though, 14 00:00:50,680 --> 00:00:57,160 part one of modelling, and the most paramount topic to discuss in this whole entire course, is the 15 00:00:57,160 --> 00:01:00,550 most important concept in machine learning: 16 00:01:00,760 --> 00:01:07,510 the train, validation and test splits, commonly referred to as the three sets. 17 00:01:07,510 --> 00:01:13,840 Now, since you want to be using machine learning models to gain insights on some data to predict the 18 00:01:13,840 --> 00:01:18,930 future, it's important to test how well they would do in the real world. 
19 00:01:19,150 --> 00:01:26,740 To do this, you split your data into three different sets: a training set to train your model on, a validation 20 00:01:26,740 --> 00:01:36,600 set to tune your model on, and a test set to test and compare your different models. Why is this important? 21 00:01:36,600 --> 00:01:42,270 Think of it like this: when you're at university, you might study the course materials all through the 22 00:01:42,270 --> 00:01:48,870 semester. Then, before the final exam, you might see how you could improve your knowledge on a practice 23 00:01:48,870 --> 00:01:50,070 exam. 24 00:01:50,070 --> 00:01:57,270 After doing well on the practice exam, you're confident you'll do well on the final exam. Then you take 25 00:01:57,270 --> 00:01:58,490 the final exam. 26 00:01:58,500 --> 00:02:03,330 And although you've never seen some of the problems before, you're able to adapt the knowledge you've 27 00:02:03,330 --> 00:02:10,440 learned from the study materials to the slightly different but similar questions on the final exam. 28 00:02:10,620 --> 00:02:15,730 Because of this, you pass the final exam with great marks. 29 00:02:15,780 --> 00:02:23,760 This adaptation that you had from the course materials and practice exams to the final exam is referred 30 00:02:23,760 --> 00:02:30,540 to in machine learning as generalisation, or the ability for a machine learning model to perform well 31 00:02:30,600 --> 00:02:34,880 on data it hasn't seen before because of what it's learned 32 00:02:34,950 --> 00:02:43,970 on another dataset. Now, where might this go wrong? Well, if your professor accidentally sent out the final 33 00:02:43,970 --> 00:02:49,000 exam for everyone to practice on, when it came time for the actual exam, 34 00:02:49,070 --> 00:02:52,780 everyone would have already seen it. 35 00:02:52,830 --> 00:02:58,000 Now, since people know what they should be expecting, they go through the exam. 
36 00:02:58,090 --> 00:03:03,590 They answer all the questions with ease, and everyone ends up getting top marks. 37 00:03:03,610 --> 00:03:10,530 Now, top marks might appear good, but did the students really learn anything, or were they just expert 38 00:03:10,540 --> 00:03:17,500 memorization machines? For your machine learning models to be valuable at predicting something in the 39 00:03:17,500 --> 00:03:24,130 future on unseen data, you'll want to avoid them becoming memorization machines. 40 00:03:24,130 --> 00:03:28,900 This is where training, validation and test splits come in. 41 00:03:28,900 --> 00:03:35,750 In our heart disease example, let's say there were 100 patients. You start off with 100. 42 00:03:35,800 --> 00:03:39,910 One way to create these splits is to shuffle these patients, 43 00:03:39,910 --> 00:03:45,440 then select 70 percent for training, which would be about 70 44 00:03:45,440 --> 00:03:46,560 patient records, 45 00:03:47,000 --> 00:03:54,110 and 15 percent for validation and 15 percent for testing, which means there'd be 70 patients in the training 46 00:03:54,110 --> 00:03:54,820 set, 47 00:03:54,830 --> 00:04:00,250 15 patients in the validation split and 15 patients in the test split. 48 00:04:00,260 --> 00:04:06,580 Now, the percentages of each of these may vary, but standard practice is usually around 70 to 80 percent 49 00:04:06,590 --> 00:04:07,640 for training, 50 00:04:07,640 --> 00:04:11,570 10 to 15 for validation and 10 to 15 for test. 51 00:04:11,630 --> 00:04:19,280 You may see in some examples that some data sets only get split into training and test, 52 00:04:19,280 --> 00:04:21,480 but that's a case-by-case scenario. 53 00:04:21,530 --> 00:04:27,030 Usually you'll have three different sets. Then, once you've got these splits, 54 00:04:27,030 --> 00:04:34,170 using a model you've chosen, you'd feed it the training data, or the information of these 70 patient 55 00:04:34,170 --> 00:04:35,310 records. 
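The shuffle-then-split step described above can be sketched in a few lines of plain Python. This is only a minimal illustration under assumptions: the function name `split_three_sets`, the fixed seed, and the use of integer patient IDs standing in for real patient records are all made up for the example and do not come from the course.

```python
import random


def split_three_sets(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle records, then cut them into train / validation / test lists.

    Hypothetical helper: whatever is left after the training and
    validation slices becomes the test set.
    """
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in for 100 patient records: just the IDs 0..99 (made-up data).
patients = list(range(100))
train, val, test = split_three_sets(patients)

print(len(train), len(val), len(test))  # 70 15 15
```

Each patient ends up in exactly one of the three lists, which is the property that matters: no record used for training also appears in validation or test.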
56 00:04:35,460 --> 00:04:41,550 And once your model has trained, you can check its results and see if you can improve them on the validation 57 00:04:41,550 --> 00:04:41,880 set. 58 00:04:42,180 --> 00:04:44,220 This is where you do model tuning. 59 00:04:44,220 --> 00:04:49,170 So even though your machine learning model's got one set of results on the patient records, you 60 00:04:49,170 --> 00:04:54,000 can actually improve them, and we'll see this in a future lesson, using the validation split. 61 00:04:54,080 --> 00:04:58,360 The validation split is where you should be testing to see if you can improve. 62 00:04:59,160 --> 00:05:05,910 Finally, once you've improved your model, you can check the model's results, as well as any other models' 63 00:05:05,910 --> 00:05:12,420 results that you might have produced during experimentation, on the test set. What's important to remember 64 00:05:12,450 --> 00:05:19,020 is that all three of these sets are separate. During training, the model never sees the validation split 65 00:05:19,290 --> 00:05:20,520 or the test split. 66 00:05:20,700 --> 00:05:26,850 And during testing, you're doing it on the test split, not the training set. It's the same as when you 67 00:05:26,850 --> 00:05:33,180 were studying for your exam: if you saw the final exam whilst practicing, that would be cheating, and your 68 00:05:33,180 --> 00:05:37,500 final result wouldn't reflect how well you'd learned. 69 00:05:37,610 --> 00:05:43,250 For now, think about the last time you went for a test. Did you practice beforehand? 70 00:05:43,250 --> 00:05:48,530 Was the practice you were doing helpful for the test? And when you're thinking about this, try and think 71 00:05:48,530 --> 00:05:55,740 of how it aligns with why it's important to not let a machine learning model see a test set or test data 72 00:05:55,740 --> 00:05:57,710 whilst it's training.
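To make the train, tune, then test workflow concrete, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: the data is synthetic (one made-up feature per "patient" with a noisy 0/1 label), and the model is a simple k-nearest-neighbours classifier, which is not necessarily the model used in this course. The setting being tuned is `k`, the number of neighbours.

```python
import random


def make_toy_data(n, seed):
    """Generate made-up (feature, label) pairs; NOT real patient data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        # Label tends to be 1 when x is high, with some noise added.
        y = 1 if x + rng.gauss(0, 0.15) > 0.5 else 0
        data.append((x, y))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train, rows, k):
    """Fraction of rows whose label the model predicts correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in rows)
    return correct / len(rows)


# Data is generated in random order, so slicing gives a 70/15/15 split.
data = make_toy_data(100, seed=0)
train, val, test = data[:70], data[70:85], data[85:]

# Scoring on the training set itself flatters the model: with k=1 every
# training point is its own nearest neighbour (pure memorization).
train_acc_k1 = accuracy(train, train, 1)

# Tuning: pick k using ONLY the validation set.
candidate_ks = [1, 3, 5, 7, 9]
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))

# Final check: the test set is used once, after tuning is finished.
test_acc = accuracy(train, test, best_k)

print("training accuracy with k=1:", train_acc_k1)
print("k chosen on validation:", best_k)
print("test accuracy with chosen k:", test_acc)
```

The training-set score for `k=1` is the "memorization machine" from the exam analogy: it looks perfect but says nothing about unseen patients. The test accuracy is the number to report, because the test split played no part in either training or tuning.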