1
00:00:00,530 --> 00:00:05,600
In this video, we will learn how to split the available data into test and train sets.

2
00:00:06,800 --> 00:00:13,080
Then we will train a model on the training set and find the mean squared error on the test set.

3
00:00:15,950 --> 00:00:21,140
To split the data into test and train, I prefer to install this package over other methods.

4
00:00:21,830 --> 00:00:23,800
This package is called caTools.

5
00:00:24,830 --> 00:00:26,270
You know how to install a package.

6
00:00:27,200 --> 00:00:29,930
You can just write install.packages,

7
00:00:32,980 --> 00:00:35,650
and within brackets and double quotation marks,

8
00:00:35,760 --> 00:00:36,060
write

9
00:00:36,220 --> 00:00:39,040
caTools, where the T of Tools is capital.

10
00:00:43,400 --> 00:00:43,960
Run this.

11
00:00:48,800 --> 00:00:51,710
You can see on the right that caTools is now available.

12
00:00:52,370 --> 00:00:55,520
We'll just tick this check box to load the package.

13
00:00:56,800 --> 00:01:03,910
Now we are going to set a seed. The concept of setting a seed is that when splitting the data into

14
00:01:03,910 --> 00:01:04,840
test and train,

15
00:01:05,110 --> 00:01:06,400
I'll be doing it randomly.

16
00:01:06,970 --> 00:01:13,510
But if I set the seed at a particular value and you set the seed at the same value, we both will

17
00:01:13,510 --> 00:01:14,860
get the same split.

18
00:01:15,100 --> 00:01:21,400
That is, the observations that I get in my training set, you will get the same observations in

19
00:01:21,400 --> 00:01:22,150
your training set.

20
00:01:24,720 --> 00:01:26,890
So we'll set the seed at zero.

21
00:01:27,120 --> 00:01:30,470
So we write set.seed.

22
00:01:30,720 --> 00:01:32,520
And within the brackets, we write zero.

23
00:01:37,260 --> 00:01:38,490
In brackets, we write zero.

24
00:01:39,460 --> 00:01:40,180
We'll run this.

25
00:01:41,140 --> 00:01:42,740
So the seed is set at zero.
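A minimal sketch of what setting a seed does, using base R only. The names first_draw and second_draw are hypothetical and are not part of the lecture's script. Resetting the seed to the same value before each random draw reproduces the same result.

```r
# Set the seed, then draw a random permutation of 1..10
set.seed(0)
first_draw <- sample(1:10)

# Reset the seed to the same value and draw again
set.seed(0)
second_draw <- sample(1:10)

# Both draws are the same permutation because the seed was the same
identical(first_draw, second_draw)  # TRUE
```

This is why two people who set the same seed before splitting get the same observations in their training sets.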

26
00:01:43,270 --> 00:01:45,260
Now, we will split the data. Write

27
00:01:45,520 --> 00:01:49,240
split is equal to sample.split.

28
00:01:54,390 --> 00:02:01,420
And within brackets, we'll write df, comma, SplitRatio is equal to 0.8.

29
00:02:03,770 --> 00:02:06,560
The S and the R of SplitRatio are capital.

30
00:02:08,440 --> 00:02:16,970
Let's run this. So a new variable called split is created, and it has a TRUE or FALSE value for each

31
00:02:16,970 --> 00:02:18,410
of the observations.

32
00:02:19,520 --> 00:02:24,980
We will assign the TRUE values to the training set, and the values that are FALSE we will assign to the test set.

33
00:02:25,100 --> 00:02:26,510
So training_set is equal to...

34
00:02:31,800 --> 00:02:36,120
training_set is equal to subset.

35
00:02:39,580 --> 00:02:40,700
It's a subset of df.

36
00:02:40,930 --> 00:02:41,630
So df,

37
00:02:42,770 --> 00:02:43,160
comma,

38
00:02:45,500 --> 00:02:46,040
split

39
00:02:48,730 --> 00:02:50,720
equal to, equal to TRUE.

40
00:02:52,690 --> 00:02:55,560
So we're checking wherever the split value is TRUE.

41
00:02:56,380 --> 00:03:04,300
We take out that subset of df and put it into the training_set variable. So you can see the training_set

42
00:03:04,300 --> 00:03:06,430
variable is also created.

43
00:03:06,880 --> 00:03:08,530
It has 378 observations.

44
00:03:08,830 --> 00:03:15,340
It will not contain exactly 80 percent of the observations, but nearly. Whatever value you mention in

45
00:03:15,340 --> 00:03:22,330
the split ratio, you will have nearly that share of observations. And the remaining values we will

46
00:03:22,330 --> 00:03:23,560
assign to the test set.

47
00:03:23,620 --> 00:03:25,360
So test_set

48
00:03:30,070 --> 00:03:31,180
is equal to subset.

49
00:03:32,460 --> 00:03:36,960
And within brackets, df, comma, split equal to, equal to FALSE.

50
00:03:42,770 --> 00:03:47,760
Run this. So the test_set variable is also created.
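The split steps above can be sketched as follows. The data frame here is a hypothetical stand-in for the lecture's df, and the column names are made up. The caTools documentation describes the first argument of sample.split as a vector of labels with one entry per observation, so this sketch passes the outcome column df$price rather than the whole data frame. That choice is an assumption of this example, not what is dictated in the video.

```r
# install.packages("caTools")  # one-time install, if not already present
library(caTools)

set.seed(0)  # fix the random number generator so the split is reproducible

# Hypothetical stand-in for the lecture's df: 100 rows with a price column
df <- data.frame(
  price = rnorm(100, mean = 25, sd = 5),
  rooms = rnorm(100, mean = 6, sd = 1)
)

# sample.split() returns a logical vector with one TRUE/FALSE per observation
split <- sample.split(df$price, SplitRatio = 0.8)

# TRUE rows go to the training set, FALSE rows to the test set
training_set <- subset(df, split == TRUE)
test_set     <- subset(df, split == FALSE)

nrow(training_set)
nrow(test_set)
```

Every row of df ends up in exactly one of the two sets, so the two row counts add up to nrow(df).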

51
00:03:49,450 --> 00:03:53,150
Now we will run a linear model on the training data set.

52
00:03:53,780 --> 00:03:57,620
We know how to run a linear model. We will create a variable,

53
00:03:58,010 --> 00:03:59,250
lm_a.

54
00:04:01,670 --> 00:04:03,380
And this is equal to lm.

55
00:04:04,590 --> 00:04:09,630
Within brackets, we'll write price, tilde, dot,

56
00:04:12,780 --> 00:04:13,200
comma,

57
00:04:15,440 --> 00:04:17,640
data is equal to training_set.

58
00:04:18,620 --> 00:04:21,270
We are not running this model on the complete data that we have.

59
00:04:21,600 --> 00:04:25,270
We are running it only on the 378 observations in the training set.

60
00:04:26,460 --> 00:04:27,270
So let's run this.

61
00:04:28,860 --> 00:04:31,310
The model is fit in lm_a.

62
00:04:31,980 --> 00:04:36,970
If you want to look at the summary, you can write summary and, within brackets, lm_a.

63
00:04:38,530 --> 00:04:44,100
But here we are going to find out the mean squared error of the training set

64
00:04:44,820 --> 00:04:48,170
and the test set. So to find the mean squared error,

65
00:04:48,960 --> 00:04:52,330
we need to first predict the value of price on the basis of

66
00:04:52,410 --> 00:04:55,650
this fitted model. To predict the value,

67
00:04:56,030 --> 00:04:57,420
we'll use a function called predict.

68
00:04:58,550 --> 00:05:01,190
The predict function takes two parameters.

69
00:05:01,370 --> 00:05:05,140
One is the model that we have created, which is lm_a.

70
00:05:05,600 --> 00:05:10,810
And the other is the data which is to be used to predict the values of y.

71
00:05:12,290 --> 00:05:20,090
So we'll get these predicted values of the training set into a variable called train_a. So we'll

72
00:05:20,090 --> 00:05:30,890
write train_a is equal to predict, and within brackets, the first parameter will be lm_a,

73
00:05:30,900 --> 00:05:37,270
which is the fitted model, comma, and then the data.

74
00:05:37,340 --> 00:05:38,630
So that is the training data.

75
00:05:39,610 --> 00:05:41,970
So we'll write training_set.

76
00:05:45,160 --> 00:05:52,570
So what this will do is it will take all the independent variables from this set, put them into this

77
00:05:52,570 --> 00:05:57,860
model, predict the value of the dependent variable, and store it in train_a.

78
00:05:59,050 --> 00:05:59,930
So let's run this.

79
00:06:02,800 --> 00:06:06,850
So we have train_a as another variable.

80
00:06:07,780 --> 00:06:09,960
We'll do the same thing for the test

81
00:06:09,960 --> 00:06:14,500
set also; just in place of train, we'll write test.

82
00:06:19,810 --> 00:06:24,230
So we'll get the predicted values of house price for our test set.

83
00:06:26,050 --> 00:06:34,450
Now, the mean squared error is the average of the squared differences between these predicted values

84
00:06:34,450 --> 00:06:35,500
and the actual values.
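Written as a formula, the definition just given reads as follows. The symbols are notation assumed here, not taken from the lecture: n is the number of observations in the set, y_i is the actual price of observation i, and \hat{y}_i is its predicted price.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```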

85
00:06:36,830 --> 00:06:39,680
So to get that average, we'll write mean.

86
00:06:43,530 --> 00:06:48,280
And within brackets, we have to square the differences of these.

87
00:06:49,000 --> 00:06:54,340
So it is the difference of training_set dollar price.

88
00:06:55,420 --> 00:07:01,330
So these are the actual values, minus the predicted values, which are in train_a.

89
00:07:05,400 --> 00:07:07,180
And we want to square these values.

90
00:07:07,360 --> 00:07:09,760
So we'll put another bracket around.

91
00:07:13,630 --> 00:07:15,690
And we'll square this difference.

92
00:07:18,320 --> 00:07:18,740
Run this.

93
00:07:20,940 --> 00:07:26,210
So twenty point six six is the mean squared error on the training data.

94
00:07:28,430 --> 00:07:35,420
So the average squared distance between the predicted values and the actual values on the training data

95
00:07:35,810 --> 00:07:36,590
is twenty

96
00:07:36,590 --> 00:07:37,460
point six six.

97
00:07:38,240 --> 00:07:38,900
Let's do this

98
00:07:39,180 --> 00:07:40,670
for the test data too.

99
00:07:44,070 --> 00:07:47,680
We will use test_set dollar price,

100
00:07:51,710 --> 00:07:53,900
minus test_a.

101
00:07:57,760 --> 00:08:05,770
So since this test data is previously unseen, most probably our model will not work as well on this data.

102
00:08:06,730 --> 00:08:13,330
The mean squared error is thirty three point zero four, which means it is performing worse on the unseen

103
00:08:13,330 --> 00:08:13,720
data.
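The fit, predict, and mean squared error steps above can be sketched end to end as follows. The data is simulated and the predictor names rooms and age are hypothetical, so the numbers it prints will not match the lecture's values. As an assumption of this sketch, sample.split is given the outcome column df$price.

```r
library(caTools)

set.seed(0)

# Hypothetical data: price depends linearly on two made-up predictors plus noise
n  <- 200
df <- data.frame(rooms = rnorm(n, 6, 1), age = runif(n, 0, 100))
df$price <- 5 + 4 * df$rooms - 0.05 * df$age + rnorm(n, sd = 3)

# Split into training and test sets
split        <- sample.split(df$price, SplitRatio = 0.8)
training_set <- subset(df, split == TRUE)
test_set     <- subset(df, split == FALSE)

# Fit on the training set only; "price ~ ." uses every other column as a predictor
lm_a <- lm(price ~ ., data = training_set)

# Predicted prices for each set
train_a <- predict(lm_a, training_set)
test_a  <- predict(lm_a, test_set)

# Mean squared error: average of the squared (actual - predicted) differences
mse_train <- mean((training_set$price - train_a)^2)
mse_test  <- mean((test_set$price - test_a)^2)

mse_train
mse_test
```

The test value is the one to compare across candidate models, since it is computed on observations the model did not see during fitting.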

104
00:08:15,300 --> 00:08:17,640
This is as discussed in the theory lectures also.

105
00:08:19,470 --> 00:08:20,650
So this is all.

106
00:08:20,760 --> 00:08:24,380
We split the data into a test set and a train set in

107
00:08:24,620 --> 00:08:32,140
R. We then ran the model on the training set and, using the model created on the training set, we

108
00:08:32,160 --> 00:08:34,830
predicted the values of the test set's dependent variable.

109
00:08:35,640 --> 00:08:39,450
We then found the estimated error on this test data.

110
00:08:40,140 --> 00:08:43,860
This estimated error is to be used when we are comparing different models.