So once we have trained our model, I told you that we can quantify its performance using the confusion matrix. We can see how many times we got it right and how many times we got it wrong.

But when we are drawing the confusion matrix on the same data which we used to train our model, the error it measures is called the training error, and the training error is not something we are interested in. We are interested in the accuracy of the predictions when we apply our method to previously unseen test data.

For example, when I'm predicting whether a house will be sold within three months or not, I don't really care how well the method predicts whether the house will be sold or not on the previously completed transactions, which my system has already seen. I want to know how well it will predict whether a house will be sold on future transactions. Similarly, if I want to predict the risk of a particular disease in different individuals, I want to do it for future patients and not for the ones I already know the outcome for.

So to handle this issue, what we're going to do is split our data into two parts. One part will be called the training set, and the other part will be called the test set. The training set will be used to train the model, and the test set will be used to test its performance.
So the test set will be the unseen data, and it will be used to assess the accuracy of our model.

Mathematically, I have these pairs of x's and y's: (x1, y1), (x2, y2), and so on. These n pairs of x's and y's will be my training set. I will use them to train my model, and once my model is trained, I have a functional form of the relationship between x and y.

Now I will take this previously unseen set of data and feed these observations into the model to predict the value of y. This predicted value of y and the actual value of y available for the test set, that is, the y's in the test set, will be compared to create the confusion matrix, and this confusion matrix will be used to assess the accuracy of our model.

So when we have three different types of classifiers, that is, logistic regression, linear discriminant analysis and K nearest neighbors, we will draw the confusion matrix on the test set for all three classifiers and then compare their performance on this previously unseen data instead of the training data.

The main reason why we have to separate the data into a test set and a training set is because there is no guarantee that if a model is giving a low training error, it will also have a low test error.
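The comparison of predicted y's against actual y's on the test set can be sketched in a few lines of plain Python. The labels below are made-up illustration data (1 = house sold within three months, 0 = not sold), not from any real dataset.

```python
# Hypothetical test-set labels: 1 = sold within three months, 0 = not sold.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual outcomes in the test set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # what the trained model predicted

# The four cells of the 2x2 confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)  # fraction we got right on unseen data
```

The same counts computed on the training set would give the training error instead, which is exactly what we want to avoid relying on.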
Roughly speaking, many statistical methods specifically estimate the y values so as to minimize the training error. For such methods, the training error may be very low, but the test error will be quite large.

In this graph, you can see that the true function of this dataset is the curved line. If I use a less flexible method, such as a straight line, to estimate the values, it will have a lot of error. But on the other hand, if I increase the flexibility too much, my line will exactly follow each and every point instead of following the general trend.

So having too much flexibility makes the model overfit the data, which also results in increasing error. In this situation, if you notice, the training error will be very low, as each point is captured by this curve. So this particular model will give a very low training error. But in fact, if you use this model on any unseen data, it will give a very high error rate, probably even more than the straight line.

So we need to find a balance when selecting the flexibility of our model, and hence we should compare our different models on the basis of their performance on unseen data instead of previously seen data, the training data.

This graph tells us how the error rate changes along with flexibility.
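The straight-line versus overly flexible fit described above can be reproduced with a small experiment. This is a sketch with made-up data: a sine curve plays the role of the true function, noise is added, and polynomials of increasing degree play the role of increasingly flexible models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a smooth true function (sine) plus noise.
x_train = np.linspace(0.0, 3.0, 15)
y_train = np.sin(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.05, 2.95, 15)
y_test = np.sin(x_test) + rng.normal(0.0, 0.2, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the data (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (1, 3, 9):                 # rigid, balanced, very flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)

# Training error can only shrink as flexibility grows (a higher-degree
# polynomial can always reproduce a lower-degree fit), but on the unseen
# test data the flexible fit tracks the noise and its error stays high.
```

Degree 1 is the straight line from the graph; degree 9 is the curve that chases every point. Only the test errors reveal which flexibility actually generalizes.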
If I keep on increasing the flexibility of my model, the training error, which is given by this light blue line, keeps on decreasing continuously. So we may be tempted to use a very flexible method so as to get a low training error. But in reality, if I present unseen data, the more flexible model will perform worse.

So we need to look at the test error rate, which first decreases and then increases as we increase the flexibility, and we have to identify the point where the test error rate is at its minimum.

Now, there are several techniques to split the data into a training set and a test set so that we can find this minimum point. We are going to discuss the three most popular techniques here. The first one is called the validation set approach, the second is leave-one-out cross-validation, and the third one is K-fold cross-validation.

The first technique, the validation set approach, is the simplest approach. We will randomly divide the data into two parts: a training set and a test set. The model will be fitted on the training set, and once the model is trained, the test error will be calculated on the test set. We usually split the available data in a ratio of 80 to 20. That is, 80 percent of the data will be used for training purposes and 20 percent will be used for testing purposes.
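The random 80:20 division in the validation set approach can be sketched with the standard library alone. The function name and the fixed seed here are illustrative choices, not part of any particular package.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Randomly divide the data into a training set and a test set (80:20 by default)."""
    shuffled = data[:]                       # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)    # random assignment of observations
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

train, test = train_test_split(list(range(100)))  # 100 toy observations
```

The seed is fixed only so the split is reproducible; in practice the random assignment is exactly what makes the resulting test error variable, which is one of the limitations discussed next.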
There are basically two limitations of this approach. One is that part of the available data will not be used for training, and as we know, the more data we use during training, the better the performance of the model will be. So if we keep some data aside for testing, the trained model will not be as good as it could be. And if you have a limited number of observations, that is, not a lot of observations, your training will be severely impacted.

Secondly, the test error can be highly variable, depending on which observations are selected for training and which observations are selected for testing.

So to handle these two issues, there are these two alternative approaches.

In leave-one-out cross-validation, we will keep the first observation for testing and run the model on the remaining n minus one observations. In the next round, we will keep the second observation for testing purposes and run the model on the remaining n minus one observations again. In each cycle, we will use just one observation for testing, and the error calculated in each cycle will be averaged to establish the test error in this method.

Since we need to run the model several times, this method can be computationally expensive. An alternative to leave-one-out cross-validation is K-fold cross-validation. In this, we will divide the data into K sets.
We will keep one set for testing and use the other K minus one sets for training. You can see that leave-one-out cross-validation is a special case of K-fold cross-validation: if you have K equal to n, that is, K is equal to the total number of observations, this is exactly the same as leave-one-out cross-validation.

We will not be covering these two techniques in the software package. We'll only be using the validation set approach.
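The cross-validation procedure described above can be sketched in one function, with leave-one-out falling out as the K = n case. The toy model (predicting every y as the training mean) and the `fit`/`loss` interface are illustrative assumptions, not from any specific library.

```python
def k_fold_cv(data, k, fit, loss):
    """Estimate the test error by K-fold cross-validation.

    data: list of observations; fit: trains a model on a list of them;
    loss: scores a trained model on one held-out observation.
    """
    n = len(data)
    # Divide the data into K near-equal-sized sets.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    errors, start = [], 0
    for size in fold_sizes:
        test_fold = data[start:start + size]             # the held-out set
        train_fold = data[:start] + data[start + size:]  # the other K-1 sets
        model = fit(train_fold)
        errors.extend(loss(model, obs) for obs in test_fold)
        start += size
    return sum(errors) / n   # average the errors over all held-out observations

# Toy model: predict every y as the mean of the training y's.
fit = lambda ys: sum(ys) / len(ys)
loss = lambda model, y: (model - y) ** 2

data = [1.0, 2.0, 3.0, 4.0]
loocv = k_fold_cv(data, k=len(data), fit=fit, loss=loss)  # K = n: leave-one-out
two_fold = k_fold_cv(data, k=2, fit=fit, loss=loss)
```

With K = n the loop runs once per observation, which is why leave-one-out is the computationally expensive extreme; smaller K trades some of that cost for fewer model fits.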