1
00:00:00,990 --> 00:00:06,930
Now, once our data is ready, the first step that we take is to split the data into two parts.

2
00:00:08,250 --> 00:00:10,800
The first part will be called the training set.

3
00:00:11,790 --> 00:00:18,440
This part will be used to train the model, and then there will be a second set.

4
00:00:18,930 --> 00:00:26,160
That second set will be called the test set, and that test set will be used to test the performance of the

5
00:00:26,160 --> 00:00:26,940
created model.

6
00:00:28,920 --> 00:00:36,720
So in a business scenario, if you have a lot of data, it is always recommended that you train the model

7
00:00:36,990 --> 00:00:42,900
on one set of observations and you keep one set of observations which are not shown to your model.

8
00:00:43,710 --> 00:00:48,840
You test the performance of your model on this previously unseen data.

9
00:00:50,880 --> 00:00:54,380
Usually we do a split of 80-20.

10
00:00:54,870 --> 00:01:01,530
That is, we use 80 percent of the data available to train the model and the remaining 20 percent of the

11
00:01:01,530 --> 00:01:04,350
data to test the performance.

12
00:01:05,250 --> 00:01:13,710
So here I am showing you how to split the available data into two parts. To split the data in R, we

13
00:01:13,710 --> 00:01:16,380
use this package called caTools.

14
00:01:19,470 --> 00:01:28,020
If you have this package installed in R, it will be visible here in the Packages part, so you can scroll

15
00:01:28,050 --> 00:01:30,960
and see if caTools is available with you or not.

16
00:01:31,230 --> 00:01:33,370
If it is not, you run this command:

17
00:01:33,450 --> 00:01:34,770
install.packages,

18
00:01:36,250 --> 00:01:37,250
caTools.

19
00:01:38,520 --> 00:01:42,210
Note that caTools is written in single quotation marks.

20
00:01:43,460 --> 00:01:44,420
So we'll run this command.

21
00:01:49,730 --> 00:01:53,090
And it has installed caTools in the packages.

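The installation step described above could be scripted roughly as follows. This is a minimal sketch, not the exact command shown on screen: the `requireNamespace()` guard and the explicit `repos` mirror are assumptions added so the snippet can run non-interactively and skips the download when the package is already present.

```r
# Install caTools only when it is missing from the local library.
# The guard and the CRAN mirror URL are assumptions, not from the lecture.
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools", repos = "https://cloud.r-project.org")
}

# TRUE once the package is installed; it still has to be loaded with library().
caTools_installed <- requireNamespace("caTools", quietly = TRUE)
```

R treats single and double quotation marks around the package name the same way, so `install.packages('caTools')` is equivalent.
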
22
00:01:53,270 --> 00:01:55,520
You can see that caTools is available here.

23
00:01:57,080 --> 00:02:00,530
Now it is installed, but it is not active.

24
00:02:00,680 --> 00:02:02,000
That is, it cannot be used

25
00:02:02,060 --> 00:02:03,860
as of now. To make it active,

26
00:02:03,890 --> 00:02:05,330
we run this library command.

27
00:02:06,130 --> 00:02:10,580
You can also tick this checkbox, which will run this library command.

28
00:02:12,650 --> 00:02:14,510
And you can see that a tick comes here.

29
00:02:16,490 --> 00:02:19,520
And this caTools package is now active.

30
00:02:23,920 --> 00:02:26,210
The next step is setting the seed.

31
00:02:27,960 --> 00:02:32,280
We set the seed so that we have reproducible results.

32
00:02:34,000 --> 00:02:42,370
What I mean by this is, when we are doing a test-train split, we'll be randomly putting 80 percent of

33
00:02:42,370 --> 00:02:46,450
the data into the train set and 20 percent of the data into the test set.

34
00:02:47,800 --> 00:02:55,390
If we do not set the seed, my randomly selected 80 percent of the data will be different from your randomly

35
00:02:55,480 --> 00:02:57,190
selected 80 percent of the data.

36
00:02:58,510 --> 00:03:01,090
But if we set the seed to zero,

37
00:03:01,810 --> 00:03:07,990
if you also set the seed to zero and if I have set the seed to zero here, both of us will be getting

38
00:03:07,990 --> 00:03:09,310
the same split of data.

39
00:03:12,140 --> 00:03:19,670
And if we both get the same split of test and train data, we both will get the same model, and that same

40
00:03:19,670 --> 00:03:21,740
model will be giving us the same results.

41
00:03:24,310 --> 00:03:27,460
So that is why we are setting the seed to zero here.

42
00:03:31,140 --> 00:03:35,880
Next, we'll create a variable called split. In the split variable,

43
00:03:36,390 --> 00:03:37,710
we will have the values

44
00:03:37,780 --> 00:03:38,880
TRUE and FALSE.

45
00:03:39,690 --> 00:03:41,360
And we are giving a split ratio of

46
00:03:41,430 --> 00:03:48,870
0.8, which means that 80 percent of the values will be TRUE and 20 percent will be FALSE.

47
00:03:50,730 --> 00:03:57,990
So if I run this command, you can see that a split variable is created here, which contains the values

48
00:03:58,220 --> 00:04:00,450
TRUE, TRUE, TRUE, FALSE and so on.

49
00:04:01,470 --> 00:04:08,550
And randomly, nearly 80 percent of these values will be TRUE and nearly 20 percent will be FALSE.

50
00:04:10,260 --> 00:04:17,520
Now, in my training set, I'll be taking all the movie dataset values which have split value TRUE.

51
00:04:19,350 --> 00:04:24,710
And in the test set, I'll have all the movie dataset values which have split value FALSE.

52
00:04:26,190 --> 00:04:32,190
I've named this variable trainc because, for now, we are doing classification.

53
00:04:34,230 --> 00:04:39,020
That is, we are trying to predict a variable which is categorical. Later on in the course,

54
00:04:39,180 --> 00:04:41,740
we'll also see how to do regression in SVM.

55
00:04:43,200 --> 00:04:47,370
So here I have created two variables, trainc and testc.

56
00:04:47,440 --> 00:04:52,390
trainc will get values from the movie dataset which have split value TRUE.

57
00:04:52,630 --> 00:04:53,580
So at underscore my.

58
00:04:55,320 --> 00:05:01,590
And you can see nearly 80 percent, which is nearly 400 observations, are getting stored in the trainc

59
00:05:01,660 --> 00:05:02,120
variable.

60
00:05:02,940 --> 00:05:05,850
And I'll also run this one, which is testc.

61
00:05:06,740 --> 00:05:12,060
And I have one hundred eight observations in my test dataset.

62
00:05:13,100 --> 00:05:21,040
This is how we split the data into test and train. We'll be using the trainc dataset to train our

63
00:05:21,130 --> 00:05:21,690
SVM model.

64
00:05:22,100 --> 00:05:27,550
And later on, we will compare its performance on the testc dataset.
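
The workflow described in this lecture (load caTools, set the seed, build a TRUE/FALSE split vector, then subset into train and test) can be sketched as below. The exact commands appear on screen rather than in the transcript, so treat this as an illustration under assumptions: the `movie` data frame here is a synthetic stand-in, its column names `budget` and `hit` are hypothetical, and passing the target column to `sample.split()` is one common way to call it, not necessarily the call used in the video.

```r
library(caTools)  # provides sample.split()

# Hypothetical stand-in for the movie dataset: 500 rows with one numeric
# feature and a categorical (0/1) target. Column names are assumptions.
movie <- data.frame(
  budget = seq(1, 500),
  hit    = rep(c(0, 1), times = 250)
)

set.seed(0)  # fix the random number generator so the split is repeatable

# Logical vector with one entry per row: TRUE for about 80% of the rows.
split <- sample.split(movie$hit, SplitRatio = 0.8)

trainc <- subset(movie, split == TRUE)   # rows used to train the classifier
testc  <- subset(movie, split == FALSE)  # held-out rows used for testing

# Re-running with the same seed reproduces exactly the same split.
set.seed(0)
split_again <- sample.split(movie$hit, SplitRatio = 0.8)
same_split <- identical(split, split_again)
```

Calling `nrow(trainc)` and `nrow(testc)` afterwards is a quick way to confirm that the two parts together cover every row and that the proportions are close to 80-20.
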