1
00:00:00,620 --> 00:00:05,800
In this video, we will learn how to split the available data into a test and train set.

2
00:00:06,960 --> 00:00:13,110
Then we will use the training set to train our model and we will create the confusion matrix on the

3
00:00:13,110 --> 00:00:20,160
test set to check the performance of our model. To split the data into test and train sets,

4
00:00:20,850 --> 00:00:25,620
I prefer to use this package called caTools.

5
00:00:28,030 --> 00:00:30,480
If you have it installed, it'll be shown here.

6
00:00:30,540 --> 00:00:33,100
You just have to tick it. If it is not installed,

7
00:00:33,270 --> 00:00:34,740
you know how to install a package.

8
00:00:34,920 --> 00:00:40,470
You have to write install.packages and within double quotation marks you need to mention the name

9
00:00:40,470 --> 00:00:41,160
of the package.

10
00:00:42,420 --> 00:00:44,270
So first we need this package,

11
00:00:44,340 --> 00:00:44,950
caTools.
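
The install-and-load step described above can be sketched as follows. This is a minimal sketch, not the exact commands typed in the video; the `requireNamespace()` guard is an addition so that the package is only installed when it is missing.

```r
# Install caTools only if it is not already available, then load it.
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools")
}
library(caTools)
```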

12
00:00:47,580 --> 00:00:50,690
Next, we are going to set a seed.

13
00:00:51,750 --> 00:00:59,670
The concept of setting a seed is that when we are splitting the data into test and train sets, we'll

14
00:00:59,670 --> 00:01:00,930
be doing it randomly.

15
00:01:02,610 --> 00:01:09,180
But if I set the seed at a particular value and you set the seed at the same value, we both will get

16
00:01:09,180 --> 00:01:10,200
the same split.

17
00:01:11,190 --> 00:01:15,120
That is, the observations which I will get in my training set,

18
00:01:16,050 --> 00:01:18,520
you will get the same observations in your training set.

19
00:01:20,920 --> 00:01:26,290
So we will be setting the seed at zero. We write set.seed,

20
00:01:31,070 --> 00:01:34,220
within the brackets we write zero, and run this.

21
00:01:34,990 --> 00:01:37,070
So the seed is now set at zero.
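
To illustrate why setting the seed gives everyone the same split, here is a small sketch in base R. The variable names `first_draw` and `second_draw` are made up for this illustration.

```r
# With the same seed, the same "random" draw comes out each time.
set.seed(0)
first_draw <- sample(1:10, 5)

set.seed(0)
second_draw <- sample(1:10, 5)

# Both draws are identical because the seed was reset to the same value.
identical(first_draw, second_draw)
```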

22
00:01:38,620 --> 00:01:46,760
So now we create a variable called split, which will have 80 percent of the values as TRUE and 20

23
00:01:46,760 --> 00:01:49,850
percent of the values as FALSE. To create that,

24
00:01:49,910 --> 00:01:50,180
we will

25
00:01:50,180 --> 00:01:50,420
write

26
00:01:50,600 --> 00:01:54,960
split is equal to sample.split.

27
00:01:58,580 --> 00:01:59,430
Within the brackets,

28
00:01:59,690 --> 00:02:02,120
the first parameter will be our data, which is df.

29
00:02:03,260 --> 00:02:08,250
And the second parameter is SplitRatio, which we'll set to 0.8.

30
00:02:10,430 --> 00:02:17,000
This means that 80 percent of the data will be the training set and 20 percent will be the test set.

31
00:02:17,930 --> 00:02:19,190
So we'll run this command.

32
00:02:19,850 --> 00:02:23,690
You can see that there is a variable called split.

33
00:02:25,830 --> 00:02:30,130
And it has values like FALSE, TRUE, TRUE, FALSE and so on.

34
00:02:31,990 --> 00:02:35,440
So the training set will be the subset of

35
00:02:35,630 --> 00:02:39,070
the df dataset which has split values TRUE.

36
00:02:40,150 --> 00:02:41,720
So we'll write train set

37
00:02:46,750 --> 00:02:48,130
is equal to subset.

38
00:02:51,250 --> 00:02:58,360
The data is df and the split value is split equal to equal to TRUE.

39
00:03:05,790 --> 00:03:06,240
A single

40
00:03:06,270 --> 00:03:12,600
equal to is an assignment operator, while a double equal to is used to compare values.

41
00:03:12,810 --> 00:03:14,850
So split values should be true.

42
00:03:20,070 --> 00:03:29,250
You can see that our training set has 386 observations, which is nearly 80 percent, not exactly 80

43
00:03:29,250 --> 00:03:31,710
percent of the observations, but nearly 80 percent.

44
00:03:33,120 --> 00:03:37,050
And for the test set, we will use the split value to be FALSE.

45
00:03:37,320 --> 00:03:38,950
So test set is equal to subset.

46
00:03:39,360 --> 00:03:41,790
The data is df and split is equal to equal to FALSE.

47
00:03:54,280 --> 00:03:55,210
You can see on the right,

48
00:03:55,480 --> 00:04:01,500
we have the train set with 386 observations and the test set with the remaining 120 observations.
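
The split described above can be sketched as follows. This is a minimal sketch under stated assumptions: the toy `df`, its columns, and the names `train_set` and `test_set` are hypothetical stand-ins, because the transcript does not show the real data. The caTools documentation describes the first argument of `sample.split()` as a vector of labels, so this sketch passes the outcome column `df$Sold`.

```r
library(caTools)  # provides sample.split()

# Hypothetical stand-in for the df used in the video.
set.seed(0)
df <- data.frame(
  price = rnorm(500, mean = 20, sd = 5),
  rooms = rnorm(500, mean = 6, sd = 1)
)
df$Sold <- rbinom(500, size = 1, prob = 0.5)

# Logical vector: TRUE marks training rows (about 80 percent), FALSE marks test rows.
split <- sample.split(df$Sold, SplitRatio = 0.8)

# == compares values; a single = would assign instead.
train_set <- subset(df, split == TRUE)
test_set  <- subset(df, split == FALSE)

nrow(train_set)
nrow(test_set)
```

Every row ends up in exactly one of the two sets, so the two row counts always add up to `nrow(df)`.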

49
00:04:02,560 --> 00:04:03,610
Now our job is done.

50
00:04:04,540 --> 00:04:08,910
We have to repeat the same things which we did on the complete dataset.

51
00:04:09,610 --> 00:04:12,270
So we will train the model using train set.

52
00:04:12,970 --> 00:04:16,370
And we create the confusion matrix using the test set.

53
00:04:18,010 --> 00:04:26,020
So I'll show you how to train the logistic regression model with the training set and create the confusion

54
00:04:26,020 --> 00:04:27,460
matrix using the test set.

55
00:04:27,830 --> 00:04:35,110
You'll have to train the linear discriminant analysis on the training set and create the confusion matrix

56
00:04:35,110 --> 00:04:36,640
on the test set on your own.

57
00:04:38,350 --> 00:04:40,750
You probably know how to do logistic regression.

58
00:04:41,270 --> 00:04:44,260
We will create a new variable called train.fit

59
00:04:47,620 --> 00:04:53,110
is equal to glm, the glm function, which is used for doing logistic regression.

60
00:04:56,350 --> 00:04:57,080
The dependent variable is Sold,

61
00:04:57,080 --> 00:04:57,410
then a tilde.

62
00:05:00,100 --> 00:05:06,160
We are using all the variables, so dot. Data is train set.

63
00:05:09,580 --> 00:05:11,210
And family is binomial.

64
00:05:19,660 --> 00:05:20,450
Let us run this.

65
00:05:22,750 --> 00:05:28,500
So now we have another variable, train.fit, which contains the information of the logistic regression

66
00:05:28,500 --> 00:05:28,800
model.

67
00:05:30,750 --> 00:05:36,090
Now to find the predicted probabilities for the test set, we will use the predict function.

68
00:05:38,130 --> 00:05:40,140
We will write test.probs.

69
00:05:43,330 --> 00:05:46,030
This is the variable name. It is equal to predict.

70
00:05:49,660 --> 00:05:53,260
The first parameter is the model, which is train.fit.

71
00:05:57,460 --> 00:06:00,700
Second parameter is the data on which we want to predict.

72
00:06:00,970 --> 00:06:02,260
It is test set.

73
00:06:04,700 --> 00:06:07,550
And the third

74
00:06:07,590 --> 00:06:09,620
parameter, type, is equal to response.

75
00:06:16,910 --> 00:06:17,790
Let us run this.

76
00:06:19,990 --> 00:06:25,810
So we have performed training on one set and testing on a completely different set.

77
00:06:27,380 --> 00:06:29,250
We have the probabilities for the test set.
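
The training and prediction steps can be sketched as below. This is a hedged illustration, not the video's exact script: the toy data, the predictor columns, and the names `train_set` and `test_set` are assumptions, and a base R `sample()` split stands in for the caTools split so that the block runs on its own.

```r
# Hypothetical toy data; the real df and its columns are not shown in the transcript.
set.seed(0)
n <- 500
df <- data.frame(
  price = rnorm(n, mean = 20, sd = 5),
  rooms = rnorm(n, mean = 6, sd = 1)
)
df$Sold <- rbinom(n, size = 1,
                  prob = plogis(0.2 * (df$price - 20) - 0.5 * (df$rooms - 6)))

# Base R stand-in for the 80/20 split: exactly 80 percent of rows are TRUE.
in_train <- sample(n) <= 0.8 * n
train_set <- df[in_train, ]
test_set  <- df[!in_train, ]

# Logistic regression: Sold explained by all other columns, fitted on the training set only.
train.fit <- glm(Sold ~ ., data = train_set, family = binomial)

# Predicted probabilities for the unseen test rows.
test.probs <- predict(train.fit, test_set, type = "response")
head(test.probs)
```

Passing `type = "response"` asks `predict()` for probabilities between 0 and 1 rather than values on the log-odds scale.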

78
00:06:30,050 --> 00:06:36,320
If we want to assign classes using a default boundary condition of point five, we will first create

79
00:06:36,320 --> 00:06:39,970
an array which will have all the values as no.

80
00:06:40,580 --> 00:06:45,400
And we will change those elements which have a probability value greater than point five to

81
00:06:45,490 --> 00:06:45,940
Yes.

82
00:06:46,580 --> 00:06:49,970
So first we create the array. We will create test

83
00:06:50,460 --> 00:06:50,840
dot pred.

84
00:06:53,280 --> 00:06:54,350
Test set predictions,

85
00:06:54,380 --> 00:06:55,930
so test.pred

86
00:06:56,000 --> 00:06:59,110
is equal to rep. Within brackets,

87
00:06:59,120 --> 00:06:59,310
we'll

88
00:06:59,310 --> 00:07:01,410
repeat

89
00:07:01,580 --> 00:07:02,120
No

90
00:07:05,540 --> 00:07:14,070
120 times. Since the test set has 120 observations, we will have an array which has No repeated

91
00:07:14,100 --> 00:07:16,430
120 times. We'll run this.

92
00:07:17,940 --> 00:07:23,960
Now, for all those elements where probability is greater than point five, we'll change it to Yes.

93
00:07:24,270 --> 00:07:25,960
So test.pred,

94
00:07:28,850 --> 00:07:29,660
square bracket,

95
00:07:29,810 --> 00:07:30,330
we'll write

96
00:07:33,030 --> 00:07:34,230
test.probs

97
00:07:38,930 --> 00:07:40,480
is greater than point five,

98
00:07:44,000 --> 00:07:44,670
is equal to

99
00:07:45,790 --> 00:07:46,230
Yes.

100
00:07:53,430 --> 00:07:53,830
Run this.

101
00:07:59,110 --> 00:08:02,180
So now, wherever probability is more than point five,

102
00:08:02,380 --> 00:08:03,510
we have Yes.

103
00:08:04,270 --> 00:08:06,100
Now we have the predicted probabilities

104
00:08:06,100 --> 00:08:09,610
for the test set, and we have the actual values in the Sold

105
00:08:09,700 --> 00:08:10,620
variable of the

106
00:08:10,620 --> 00:08:16,480
test set. We can compare these by creating a confusion matrix. To create a confusion matrix,

107
00:08:16,510 --> 00:08:18,800
we will use the table function. So we'll

108
00:08:18,820 --> 00:08:19,000
write

109
00:08:19,130 --> 00:08:19,600
table,

110
00:08:22,350 --> 00:08:30,820
test.pred, comma, the Sold variable of the test set, or test set dollar

111
00:08:36,580 --> 00:08:38,920
Sold.

112
00:08:42,750 --> 00:08:43,510
Let's run this.

113
00:08:45,800 --> 00:08:53,690
So you can see now we have 78 correct predictions out of a total of 120.

114
00:08:54,320 --> 00:08:56,440
So 78 by 120 is

115
00:08:56,450 --> 00:08:59,750
the prediction accuracy, which is lower than on the training set.

116
00:09:00,020 --> 00:09:02,560
But still, it can be considered a good classifier.
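
The class assignment and confusion matrix steps can be sketched on a tiny example. The six probabilities, the `actual` labels, and the names `conf_matrix` and `accuracy` are hypothetical; the transcript does not show how the real Sold column is coded, so "Yes" and "No" labels are assumed here.

```r
# Hypothetical predicted probabilities and actual labels for six test rows.
test.probs <- c(0.91, 0.12, 0.55, 0.48, 0.73, 0.30)
actual <- c("Yes", "No", "No", "No", "Yes", "Yes")

# Start with "No" everywhere, then flip to "Yes" where the probability exceeds 0.5.
test.pred <- rep("No", length(test.probs))
test.pred[test.probs > 0.5] <- "Yes"

# Rows are predicted classes, columns are actual classes.
conf_matrix <- table(Predicted = test.pred, Actual = actual)
conf_matrix

# Accuracy is correct predictions (the diagonal) divided by all predictions.
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
```

Four of these six toy predictions are correct. For the test set in the video, the same calculation is 78 / 120 = 0.65.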

117
00:09:04,820 --> 00:09:12,410
Now, you have to create this confusion matrix on the test set using linear discriminant analysis and

118
00:09:12,410 --> 00:09:18,580
compare the performance of LDA against logistic regression. In the coming videos,

119
00:09:18,650 --> 00:09:20,520
we'll be learning about K-nearest neighbors.

120
00:09:21,020 --> 00:09:25,670
And then we will be predicting the test set values using the KNN method.