1
00:00:00,610 --> 00:00:01,890
Look at us go.

2
00:00:01,890 --> 00:00:04,650
We're moving through this framework at lightning pace.

3
00:00:04,650 --> 00:00:06,110
We've done Problem Definition.

4
00:00:06,150 --> 00:00:10,150
We've looked at data, and we've decided on an evaluation metric.

5
00:00:10,170 --> 00:00:13,110
We've understood a few of the features we've got in our data.

6
00:00:13,110 --> 00:00:15,620
Now we're up to step five, which is modelling.

7
00:00:15,690 --> 00:00:17,640
Now there's a few parts to modelling.

8
00:00:17,640 --> 00:00:21,750
So we've broken this down into four different sections.

9
00:00:21,750 --> 00:00:23,800
And this is section one.

10
00:00:23,910 --> 00:00:28,420
And this is probably the most important concept in machine learning: the three sets.

11
00:00:28,630 --> 00:00:35,730
Now, over the whole of modelling, we want to answer the question: based on our problem and data, what

12
00:00:35,730 --> 00:00:43,570
machine learning model should we use? Modelling can be broken down into three parts: choosing and training

13
00:00:43,570 --> 00:00:50,470
a model, tuning a model, and model comparison. Before we get into these, though...

14
00:00:50,680 --> 00:00:57,160
Part one of modelling, and the most important topic to discuss in this entire course, is the

15
00:00:57,160 --> 00:01:00,550
most important concept in machine learning.

16
00:01:00,760 --> 00:01:07,510
The training, validation and test splits, commonly referred to as the three sets.

17
00:01:07,510 --> 00:01:13,840
Now, since you want to use machine learning models to gain insights from some data to predict the

18
00:01:13,840 --> 00:01:18,930
future, it's important to test how well they would do in the real world.

19
00:01:19,150 --> 00:01:26,740
To do this, you split your data into three different sets: a training set to train your model on, a validation

20
00:01:26,740 --> 00:01:36,600
set to tune your model on, and a test set to test and compare your different models. Why is this important?

21
00:01:36,600 --> 00:01:42,270
Think of it like this: when you're at university, you might study the course materials all through the

22
00:01:42,270 --> 00:01:48,870
semester. Then, before the final exam, you might see how you could improve your knowledge on a practice

23
00:01:48,870 --> 00:01:50,070
exam.

24
00:01:50,070 --> 00:01:57,270
After doing well on the practice exam, you're confident you'll do well on the final exam. Then you take

25
00:01:57,270 --> 00:01:58,490
the final exam.

26
00:01:58,500 --> 00:02:03,330
And although you've never seen some of the problems before, you're able to adapt the knowledge you've

27
00:02:03,330 --> 00:02:10,440
learned from the study materials to the slightly different but similar questions on the final exam.

28
00:02:10,620 --> 00:02:15,730
Because of this, you pass the final exam with great marks.

29
00:02:15,780 --> 00:02:23,760
This adaptation that you had from the course materials and practice exams to the final exam is referred

30
00:02:23,760 --> 00:02:30,540
to in machine learning as generalisation: the ability of a machine learning model to perform well

31
00:02:30,600 --> 00:02:34,880
on data it hasn't seen before because of what it's learned

32
00:02:34,950 --> 00:02:43,970
on another dataset. Now, where might this go wrong? Well, if your professor accidentally sent out the final

33
00:02:43,970 --> 00:02:49,000
exam for everyone to practice on, when it came time for the actual exam,

34
00:02:49,070 --> 00:02:52,780
everyone would have already seen it. Now,

35
00:02:52,830 --> 00:02:58,000
since people know what they should be expecting, they go through the exam.

36
00:02:58,090 --> 00:03:03,590
They answer all the questions with ease and everyone ends up getting top marks.

37
00:03:03,610 --> 00:03:10,530
Now, top marks might appear good, but did the students really learn anything, or were they just expert

38
00:03:10,540 --> 00:03:17,500
memorization machines? For your machine learning models to be valuable at predicting something in the

39
00:03:17,500 --> 00:03:24,130
future on unseen data, you'll want to avoid them becoming memorization machines.

40
00:03:24,130 --> 00:03:28,900
This is where training, validation and test splits come in.

41
00:03:28,900 --> 00:03:35,750
In our heart disease example, let's say there were 100 patients. You start off with 100.

42
00:03:35,800 --> 00:03:39,910
One way to create these splits is to shuffle these patients.

43
00:03:39,910 --> 00:03:45,440
Then select 70 percent for training, which would be about 70

44
00:03:45,440 --> 00:03:46,560
patient records.

45
00:03:47,000 --> 00:03:54,110
And 15 percent for validation and 15 percent for testing, which means there'd be 70 patients in the training

46
00:03:54,110 --> 00:03:54,820
set,

47
00:03:54,830 --> 00:04:00,250
15 patients in the validation split and 15 patients in the test split.

48
00:04:00,260 --> 00:04:06,580
Now, the percentages of each of these may vary, but standard practice is usually around 70 to 80 percent

49
00:04:06,590 --> 00:04:07,640
for training.

50
00:04:07,640 --> 00:04:11,570
10 to 15 for validation and 10 to 15 for test.
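
The shuffle-then-split idea described above could be sketched roughly like this. It's a minimal sketch using only Python's standard library; the function name `three_way_split`, the seed, and the integer patient IDs are assumptions made up for illustration, not something from the course materials.

```python
import random

def three_way_split(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle records, then cut them into train, validation and test lists."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left over
    return train, val, test

patients = list(range(100))  # hypothetical stand-in for 100 patient records
train, val, test = three_way_split(patients)
print(len(train), len(val), len(test))  # 70 15 15
```

Because each record ends up in exactly one list, no patient appears in more than one split.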

51
00:04:11,630 --> 00:04:19,280
You may see in some examples that some data sets only get split into training and test.

52
00:04:19,280 --> 00:04:21,480
But that's a case-by-case scenario.

53
00:04:21,530 --> 00:04:27,030
Usually you'll have three different sets. Then, once you've got these splits,

54
00:04:27,030 --> 00:04:34,170
using a model you've chosen, you'd feed it the training data, or the information of these 70 patient

55
00:04:34,170 --> 00:04:35,310
records.

56
00:04:35,460 --> 00:04:41,550
And once your model has trained, you can check its results and see if you can improve them on the validation

57
00:04:41,550 --> 00:04:41,880
set.

58
00:04:42,180 --> 00:04:44,220
This is where you do model tuning.

59
00:04:44,220 --> 00:04:49,170
So even though your machine learning model's got one set of results on the patient records, you

60
00:04:49,170 --> 00:04:54,000
can actually improve them, as we'll see in a future lesson, on the validation split.

61
00:04:54,080 --> 00:04:58,360
The validation split is where you should be testing to see if you can improve.

62
00:04:59,160 --> 00:05:05,910
Finally, once you've improved your model, you can check the model's results, as well as any other models'

63
00:05:05,910 --> 00:05:12,420
results that you might have got during experimentation, on the test set. What's important to remember

64
00:05:12,450 --> 00:05:19,020
is that all three of these sets are separate. During training, the model never sees the validation split

65
00:05:19,290 --> 00:05:20,520
or the test split.

66
00:05:20,700 --> 00:05:26,850
And during testing, you're doing it on the test split, not the training set. It's the same as when you

67
00:05:26,850 --> 00:05:33,180
were studying for your exam: if you saw the final exam whilst practicing, that would be cheating, and your

68
00:05:33,180 --> 00:05:37,500
final result wouldn't reflect how well you'd learned.
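
To make the train, tune, then test order a bit more concrete, here's a rough sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic, the tiny nearest-neighbour classifier is just a stand-in model (not the one used in this course), and names like `knn_predict` and `best_k` are made up.

```python
import random

def knn_predict(train_rows, x, k):
    """Majority vote among the k training rows whose feature is closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train_rows, rows, k):
    """Fraction of rows whose label is predicted correctly."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in rows)
    return hits / len(rows)

rng = random.Random(0)
# Synthetic data: one noisy feature, label 1 when the hidden value is above 0.5
data = []
for _ in range(100):
    hidden = rng.random()
    data.append((hidden + rng.gauss(0, 0.1), int(hidden > 0.5)))
rng.shuffle(data)
train, val, test = data[:70], data[70:85], data[85:]

# Tuning: pick k by looking only at the validation split
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, val, k))

# Testing: touch the test split once, at the very end
test_acc = accuracy(train, test, best_k)
print(best_k, test_acc)
```

The point of the ordering is that the test split plays no part in choosing `best_k`, so `test_acc` is measured on data the model has never been adjusted against.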

69
00:05:37,610 --> 00:05:43,250
For now, think about the last time you went for a test. Did you practice beforehand?

70
00:05:43,250 --> 00:05:48,530
Was the practice you were doing helpful for the test? And when you're thinking about this, try and think

71
00:05:48,530 --> 00:05:55,740
of how that aligns with why it's important to not let a machine learning model see a test set or test data

72
00:05:55,740 --> 00:05:57,710
simply whilst it's training.