1
00:00:02,470 --> 00:00:07,100
In this video, we are going to split our data into test and train.

2
00:00:08,710 --> 00:00:14,950
We do this so that we can see the performance of our model on previously unseen data.

3
00:00:16,150 --> 00:00:22,720
So we will train the model using the train part of the data and we will test its performance on the

4
00:00:23,020 --> 00:00:24,100
test part of the data.

5
00:00:26,380 --> 00:00:33,790
Usually we split the data in the ratio of 80 to 20, meaning that we will use 80 percent

6
00:00:33,790 --> 00:00:35,690
of the data to train the model.

7
00:00:36,400 --> 00:00:41,200
And the rest, 20 percent of the data, will be used to test its performance.

8
00:00:43,870 --> 00:00:46,350
To do this test-train split in R,

9
00:00:47,590 --> 00:00:49,270
this is the code that you need to run.

10
00:00:50,320 --> 00:00:56,410
First, we need a package called caTools. If it is installed in R,

11
00:00:56,770 --> 00:00:58,700
you do not need to run this command.

12
00:00:58,990 --> 00:01:06,670
If it is not, you have to run install.packages and, within inverted commas, you have to write caTools.

13
00:01:07,450 --> 00:01:09,370
The T of Tools is capital.

14
00:01:11,250 --> 00:01:16,170
Once this package is installed, you need to run library(caTools).

15
00:01:17,070 --> 00:01:20,430
What this basically does is, if you go to the Packages part,

16
00:01:23,430 --> 00:01:30,460
it is showing all the packages that are installed, but this check box is not ticked right now.

17
00:01:30,780 --> 00:01:32,640
That is, you cannot use caTools

18
00:01:32,730 --> 00:01:40,880
as of now. If you want to use caTools in this code, you have to tick it or run the library(caTools)

19
00:01:40,890 --> 00:01:41,910
command.

20
00:01:44,140 --> 00:01:47,620
So either tick it or run this command.

21
00:01:49,910 --> 00:01:53,000
Once the caTools package is ready for use,

22
00:01:54,360 --> 00:01:56,730
we will write these four lines.

23
00:01:57,090 --> 00:02:04,930
The first line is set.seed. Setting the seed is done so that we have reproducibility of the data.

24
00:02:05,490 --> 00:02:10,620
That is, if I set the seed at zero and you also set the seed at zero,

25
00:02:11,040 --> 00:02:18,820
when we are randomly selecting 80 percent of the data to be train data, that randomly selected 80

26
00:02:18,840 --> 00:02:22,320
percent of the data will be the same for me and for you.

27
00:02:23,470 --> 00:02:28,770
So if we do not set the seed, I will get a different set of observations in my 80 percent of the data,

28
00:02:30,200 --> 00:02:35,300
which I will use to train the model, and thus I will get a different model than the model that you will

29
00:02:35,300 --> 00:02:37,310
get with your 80 percent of the data.

30
00:02:39,080 --> 00:02:44,060
So setting seed ensures that both of us get the same split.

31
00:02:45,050 --> 00:02:46,610
So we'll run the code line by line.

32
00:02:53,490 --> 00:02:59,710
First is install.packages. Although caTools was already installed on my system,

33
00:03:00,220 --> 00:03:02,350
it will again go and reinstall it.

34
00:03:03,960 --> 00:03:06,200
Then library(caTools).

35
00:03:06,420 --> 00:03:09,560
And this has ticked the check box.

36
00:03:11,080 --> 00:03:12,790
Then we will set the seed.

37
00:03:14,520 --> 00:03:16,740
Next is to create a new variable.

38
00:03:19,290 --> 00:03:20,190
called split.

39
00:03:21,500 --> 00:03:27,290
This variable will be created on the movie dataset, meaning that it will have the same number of

40
00:03:27,350 --> 00:03:29,510
observations as the movie dataset.

41
00:03:30,650 --> 00:03:38,390
And we have a SplitRatio of 0.8, which means that 80 percent of the data in this split variable

42
00:03:38,870 --> 00:03:39,600
will have the value

43
00:03:39,730 --> 00:03:40,100
TRUE,

44
00:03:40,730 --> 00:03:44,060
and the remaining 20 percent will have the value FALSE.

45
00:03:45,010 --> 00:03:47,960
So if we run this line of code,

46
00:03:50,740 --> 00:03:52,870
a new variable split is created.

47
00:03:54,340 --> 00:03:57,820
It has values FALSE, TRUE, FALSE, TRUE,

48
00:03:58,330 --> 00:03:58,900
and so on.

49
00:03:59,800 --> 00:04:03,310
And we have 506 such observations.

50
00:04:07,180 --> 00:04:12,600
Now, wherever the split value is TRUE, which is nearly 80 percent of the time,

51
00:04:13,350 --> 00:04:21,180
wherever this value is TRUE, we will put that observation into the train set, and wherever this is FALSE,

52
00:04:21,250 --> 00:04:23,510
we will put that observation in the test set.

53
00:04:24,270 --> 00:04:28,530
So when I run this code, I'll get a new dataset called train.

54
00:04:31,560 --> 00:04:37,910
And it has 393 observations, which is nearly 80 percent, not exactly 80 percent,

55
00:04:38,370 --> 00:04:39,430
but nearly 80 percent.

56
00:04:42,210 --> 00:04:49,710
And then we have the test set, which will have the remaining 20 percent of the observations.

57
00:04:52,840 --> 00:04:59,270
So now this train variable, this train dataset, which has 393 observations,

58
00:04:59,800 --> 00:05:02,140
will be used to train the model,

59
00:05:03,090 --> 00:05:05,220
that is, to make the decision tree.

60
00:05:06,270 --> 00:05:11,920
Once that decision tree is created, we will check its performance on the test set.

61
00:05:12,240 --> 00:05:16,030
That is, we will predict the value of the collection variable for the test set,

62
00:05:16,620 --> 00:05:21,960
and we will compare the actual value with the predicted value of this variable.

63
00:05:24,290 --> 00:05:29,260
This is how we split the data into test and train in R.
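
The code shown on screen is not part of this transcript, so what follows is only a rough sketch of the four lines described above, not the exact code from the video. The movie data frame here is a hypothetical stand-in with made-up columns; only the row count of 506, the seed of zero and the 0.8 split ratio come from the narration. sample.split is documented as taking a vector of labels, so this sketch passes a single column rather than the whole data frame.

```r
# Install caTools only if it is missing, then load it.
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools", repos = "https://cloud.r-project.org")
}
library(caTools)

# Hypothetical stand-in for the movie dataset: 506 rows, made-up values.
set.seed(1)
movie <- data.frame(
  budget     = runif(506, min = 20000, max = 50000),
  collection = runif(506, min = 10000, max = 100000)
)

# Fix the seed so that the random split is reproducible.
set.seed(0)

# sample.split() returns a logical vector, one value per observation:
# TRUE marks a train row, FALSE marks a test row.
split <- sample.split(movie$collection, SplitRatio = 0.8)

# Rows flagged TRUE go to the train set, the rest go to the test set.
train <- subset(movie, split == TRUE)
test  <- subset(movie, split == FALSE)

# Resetting the same seed reproduces exactly the same split.
set.seed(0)
split_again <- sample.split(movie$collection, SplitRatio = 0.8)

cat("rows in train:", nrow(train), "\n")
cat("rows in test: ", nrow(test), "\n")
cat("same split with same seed:", identical(split, split_again), "\n")
```

The train set would then be used to fit the decision tree, and the test set to compare predicted and actual values of the collection variable.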