1 00:00:02,470 --> 00:00:07,100 In this video, we are going to split our data into test and train sets.
2 00:00:08,710 --> 00:00:14,950 We do this so that we can see the performance of our model on previously unseen data.
3 00:00:16,150 --> 00:00:22,720 So we will train the model using the train part of the data, and we will test its performance on the
4 00:00:23,020 --> 00:00:24,100 test part of the data.
5 00:00:26,380 --> 00:00:33,790 Usually we split the data in the ratio of 80 to 20, meaning that we will use 80 percent
6 00:00:33,790 --> 00:00:35,690 of the data to train the model,
7 00:00:36,400 --> 00:00:41,200 and the remaining 20 percent of the data will be used to test its performance.
8 00:00:43,870 --> 00:00:46,350 To do this train-test split in R,
9 00:00:47,590 --> 00:00:49,270 this is the code that you need to run.
10 00:00:50,320 --> 00:00:56,410 First, we need a package called caTools. If it is installed in R,
11 00:00:56,770 --> 00:00:58,700 you do not need to run this command.
12 00:00:58,990 --> 00:01:06,670 If it is not, you have to run install.packages and, within inverted commas, you have to write caTools.
13 00:01:07,450 --> 00:01:09,370 The T of Tools is capital.
14 00:01:11,250 --> 00:01:16,170 Once this package is installed, you need to run library(caTools).
15 00:01:17,070 --> 00:01:20,430 What this basically does is, if you go to the Packages tab,
16 00:01:23,430 --> 00:01:30,460 it is showing all the packages that are installed, but this check box is not ticked right now.
17 00:01:30,780 --> 00:01:32,640 That is, you cannot use caTools
18 00:01:32,730 --> 00:01:40,880 as of now. If you want to use caTools in this code, you have to tick it or run the library(caTools)
19 00:01:40,890 --> 00:01:41,910 command.
20 00:01:44,140 --> 00:01:47,620 So either tick it or run this command.
21 00:01:49,910 --> 00:01:53,000 Once the caTools package is ready for use,
22 00:01:54,360 --> 00:01:56,730 we will write these four lines.
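The install-and-load step described above can be sketched as follows. This is a minimal sketch, not the exact code shown in the video: the `requireNamespace` guard is an addition of mine, so that the package is installed only when it is missing.

```r
# Install caTools only if it is not already available on this machine.
# (The guard is an assumption added for this sketch; the video simply
# runs install.packages("caTools") directly.)
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools")
}

# Attach the package so that its functions can be called in this session.
# This has the same effect as ticking the caTools check box in the
# Packages tab.
library(caTools)
```

With the guard in place, re-running the script does not reinstall a package that is already present.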
23 00:01:57,090 --> 00:02:04,930 The first line is set.seed. Setting the seed is done so that we have reproducibility of the data split.
24 00:02:05,490 --> 00:02:10,620 That is, if I set the seed to zero and you also set the seed to zero,
25 00:02:11,040 --> 00:02:18,820 when we are randomly selecting 80 percent of the data to be the train data, that randomly selected 80
26 00:02:18,840 --> 00:02:22,320 percent of the data will be the same for me and the same for you.
27 00:02:23,470 --> 00:02:28,770 So if we do not set the seed, I get a different set of observations in my 80 percent of the data,
28 00:02:30,200 --> 00:02:35,300 which I will use to train the model, and thus I will get a different model than the model that you will
29 00:02:35,300 --> 00:02:37,310 get with your 80 percent of the data.
30 00:02:39,080 --> 00:02:44,060 So setting the seed ensures that both of us get the same split.
31 00:02:45,050 --> 00:02:46,610 So we will run the code line by line.
32 00:02:53,490 --> 00:02:59,710 First is install.packages. Although caTools was already installed on my system,
33 00:03:00,220 --> 00:03:02,350 it will again go and reinstall it.
34 00:03:03,960 --> 00:03:06,200 Then library(caTools). Then,
35 00:03:06,420 --> 00:03:09,560 and we have a tick on this check box.
36 00:03:11,080 --> 00:03:12,790 Then we will set the seed.
37 00:03:14,520 --> 00:03:16,740 Next is to create a new variable
38 00:03:19,290 --> 00:03:20,190 called split.
39 00:03:21,500 --> 00:03:27,290 This variable will be created on the movie dataset, meaning that it will have the same number of
40 00:03:27,350 --> 00:03:29,510 observations as the movie dataset.
41 00:03:30,650 --> 00:03:38,390 And we have a split ratio of 0.8, which means that 80 percent of the data in this split variable
42 00:03:38,870 --> 00:03:39,600 will have the value
43 00:03:39,730 --> 00:03:40,100 TRUE,
44 00:03:40,730 --> 00:03:44,060 and the remaining 20 percent will have the value FALSE.
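To make the seed and split-ratio explanation concrete, here is a hedged sketch. The `movie` data frame below is a made-up stand-in with 506 rows, and the `Budget` column is hypothetical; only the `Collection` name comes from the video. `sample.split` is documented as taking a vector of labels, so this sketch passes the `Collection` column. That is my choice for the illustration and may differ from the call used in the video.

```r
library(caTools)

# Hypothetical stand-in for the movie dataset: 506 rows of made-up values.
set.seed(123)
movie <- data.frame(Budget = runif(506), Collection = runif(506))

# Reset the seed before each call so that both calls draw the same split.
set.seed(0)
split_a <- sample.split(movie$Collection, SplitRatio = 0.8)
set.seed(0)
split_b <- sample.split(movie$Collection, SplitRatio = 0.8)

identical(split_a, split_b)  # TRUE when the seed is reset before each call
length(split_a)              # one TRUE/FALSE flag per row of movie
mean(split_a)                # share of TRUE flags, close to 0.8
```

If the second `set.seed(0)` is removed, the two vectors will generally differ, which is the situation the seed is meant to avoid.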
45 00:03:45,010 --> 00:03:47,960 So if we run this line of code,
46 00:03:50,740 --> 00:03:52,870 a new variable split is created.
47 00:03:54,340 --> 00:03:57,820 It has values FALSE, TRUE, FALSE, TRUE,
48 00:03:58,330 --> 00:03:58,900 and so on.
49 00:03:59,800 --> 00:04:03,310 And we have 506 such observations.
50 00:04:07,180 --> 00:04:12,600 Now, wherever split has the value TRUE, which is nearly 80 percent of the time,
51 00:04:13,350 --> 00:04:21,180 wherever this value is TRUE, we will put that observation into the train set, and wherever this is FALSE,
52 00:04:21,250 --> 00:04:23,510 we will put that observation in the test set.
53 00:04:24,270 --> 00:04:28,530 So when I run this code, I will get a new dataset called train.
54 00:04:31,560 --> 00:04:37,910 And it has 393 observations, which is nearly 80 percent, not exactly 80 percent,
55 00:04:38,370 --> 00:04:39,430 but nearly 80 percent.
56 00:04:42,210 --> 00:04:49,710 And then we have the test set, which will have the remaining 20 percent of the observations.
57 00:04:52,840 --> 00:04:59,270 So now this train variable, this train dataset, which has 393 observations,
58 00:04:59,800 --> 00:05:02,140 this will be used to train the model,
59 00:05:03,090 --> 00:05:05,220 that is, to make the decision tree.
60 00:05:06,270 --> 00:05:11,920 Once that decision tree is created, we will check its performance on the test set.
61 00:05:12,240 --> 00:05:16,030 That is, we will predict the value of the Collection variable for the test set,
62 00:05:16,620 --> 00:05:21,960 and we will compare the actual value with the predicted value of this variable.
63 00:05:24,290 --> 00:05:29,260 This is how we split the data into test and train in R.
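The final step, turning the TRUE/FALSE flags into train and test sets, might look like the sketch below. As before, the `movie` data frame is a made-up stand-in with a hypothetical `Budget` column, and passing `movie$Collection` to `sample.split` is an assumption of this sketch. The row counts it produces will not match the 393 observations reported in the video.

```r
library(caTools)

# Hypothetical stand-in for the movie dataset: 506 rows of made-up values.
set.seed(123)
movie <- data.frame(Budget = runif(506), Collection = runif(506))

# Flag roughly 80 percent of the rows as TRUE (train), the rest as FALSE (test).
set.seed(0)
split <- sample.split(movie$Collection, SplitRatio = 0.8)

# Rows flagged TRUE go to the train set, rows flagged FALSE to the test set.
train <- subset(movie, split == TRUE)
test  <- subset(movie, split == FALSE)

# Every row lands in exactly one of the two sets.
nrow(train) + nrow(test) == nrow(movie)
nrow(train) / nrow(movie)  # share of rows in the train set, close to 0.8
```

The model would then be fitted on `train` only, and its predictions of `Collection` compared against the actual values in `test`.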