1 00:00:00,750 --> 00:00:06,840 Now, usually we do not train our model on all the available data. 2 00:00:07,710 --> 00:00:12,650 We take a small portion of our available data to test 3 00:00:12,680 --> 00:00:13,410 our model. 4 00:00:15,500 --> 00:00:22,850 This small dataset will help us to estimate how our model will perform on real-life data, 5 00:00:23,690 --> 00:00:27,010 the data which is not used to train our model. 6 00:00:29,160 --> 00:00:37,940 So in practice, we generally use 80 percent of our available data for training and we take 20 percent of our available 7 00:00:37,950 --> 00:00:39,900 data as testing data. 8 00:00:41,060 --> 00:00:48,470 Then we test our model on that data and compare the performance of our different models on that 9 00:00:48,510 --> 00:00:53,100 test data to evaluate our models on real-life data. 10 00:00:54,600 --> 00:00:57,820 So we have five hundred and six rows. 11 00:00:58,440 --> 00:01:00,030 That is our available data. 12 00:01:01,300 --> 00:01:09,250 We will keep 20 percent of this data as our test data and we will train our model on the 80 percent 13 00:01:09,260 --> 00:01:10,040 of this data. 14 00:01:11,030 --> 00:01:17,210 This segregation of data into test and train parts is known as the train-test split. 15 00:01:19,050 --> 00:01:27,480 And it is very easy to perform a train-test split using sklearn. We will first import 16 00:01:27,480 --> 00:01:31,260 train_test_split from sklearn.model_selection. 17 00:01:35,870 --> 00:01:40,240 Now, this train_test_split method 18 00:01:41,270 --> 00:01:50,000 takes these parameters, which are our X values, our y values, and the size of our test data. 19 00:01:50,660 --> 00:01:57,230 Since I have told you that we usually take 20 percent of the data as our test data, you can provide 20 00:01:57,230 --> 00:01:58,310 0.2 here.
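The import and the call described so far can be sketched as follows. This is only an illustration: the lecture's actual data is not shown in the transcript, so X and y here are synthetic arrays with 506 rows, and the three feature columns are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lecture's data (an assumption):
# 506 rows, 3 made-up feature columns, and one target value per row.
rng = np.random.default_rng(42)
X = rng.normal(size=(506, 3))
y = rng.normal(size=506)

# test_size=0.2 holds out 20 percent of the rows as test data;
# the remaining rows are used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_test))
```

Because no seed is passed here, which rows land in the test set will differ from run to run, although the sizes of the two parts stay the same.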
21 00:01:59,450 --> 00:02:03,020 Then there is another parameter, that is random_state. 22 00:02:04,570 --> 00:02:12,730 Since we are randomly assigning our data into test and train, to get the same test data every time 23 00:02:12,940 --> 00:02:18,940 so that we can compare the performance of our models, we can use this random_state variable. 24 00:02:19,600 --> 00:02:21,190 This is just a random number. 25 00:02:21,310 --> 00:02:26,320 You can take zero, one, or any other value you want. 26 00:02:28,460 --> 00:02:31,310 The advantage of using this random number seed is: 27 00:02:32,950 --> 00:02:40,270 if I keep this random_state constant throughout my analysis, I will get the exact same split of 28 00:02:40,350 --> 00:02:41,050 test and train. 29 00:02:42,290 --> 00:02:49,130 So if you also run your train-test split with random_state equal to zero, you will get the same 30 00:02:49,130 --> 00:02:50,750 train-test split as me. 31 00:02:51,290 --> 00:02:59,130 For example, you have 10 rows, and suppose your third and fourth rows are going into test 32 00:02:59,840 --> 00:03:03,500 and the rest of the eight rows are going into the train dataset. 33 00:03:04,600 --> 00:03:13,490 If you keep that random_state constant, you will always get your third and fourth rows as the test set and 34 00:03:13,490 --> 00:03:15,700 the rest of the values in your train set. 35 00:03:17,560 --> 00:03:25,600 And this will help us to compare the performance of our model across different methods and to keep the 36 00:03:25,750 --> 00:03:28,420 output of our model always constant. 37 00:03:30,760 --> 00:03:34,590 So always stick to a single value of random_state. 38 00:03:34,950 --> 00:03:36,540 Don't change this value. 39 00:03:37,390 --> 00:03:42,610 Stick to the number of your choice. If you want to get the same split as me, 40 00:03:43,480 --> 00:03:45,700 select random_state equal to zero. 41 00:03:47,660 --> 00:03:51,590 Now we get four outputs from this function.
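A small sketch of the reproducibility point, using a made-up ten-row dataset like the one in the example above. The row values are just labels 0 to 9 chosen for illustration, and which two rows end up in the test set is not claimed here; the only point is that the same random_state gives the same split on the same data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows, labelled 0..9 so each row is easy to recognise.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# Two separate calls with the same random_state on the same data.
first = train_test_split(X, y, test_size=0.2, random_state=0)
second = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare X_train, X_test, y_train, y_test pairwise between the two calls.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```

Dropping random_state, or changing its value between calls, would generally put different rows into the test set.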
42 00:03:52,940 --> 00:03:55,700 The first output is your X train data. 43 00:03:56,450 --> 00:03:59,960 So I have named my variable X_train. 44 00:04:00,740 --> 00:04:04,190 The second output is your test X data. 45 00:04:05,000 --> 00:04:07,200 I have named it X_test. 46 00:04:07,970 --> 00:04:11,690 Then we have y_train and y_test. 47 00:04:12,560 --> 00:04:14,210 So if I run this, 48 00:04:17,880 --> 00:04:19,590 I have four more variables. 49 00:04:20,790 --> 00:04:22,050 I can check the head 50 00:04:22,220 --> 00:04:23,360 of our X_train 51 00:04:23,850 --> 00:04:25,890 and the shape of the X_train too. 52 00:04:35,980 --> 00:04:41,960 You can see this sample looks exactly the same as the sample from our X data frame. 53 00:04:42,880 --> 00:04:44,530 We don't have any y variable here. 54 00:04:44,560 --> 00:04:47,140 We only have X. 55 00:04:47,920 --> 00:04:51,850 And one thing to notice here is the indexes. 56 00:04:52,750 --> 00:04:59,680 You can see our indexes are shuffled now, since some of the rows are going into our X test data 57 00:04:59,830 --> 00:05:08,140 and some of the observations are coming into our X train data. We can check the shape of our X_train. 58 00:05:11,060 --> 00:05:19,250 This should contain 80 percent of all the values, that is 506 times 0.8, which comes 59 00:05:19,250 --> 00:05:20,770 out to roughly four hundred and four. 60 00:05:20,930 --> 00:05:27,860 So we have four hundred and four values in our X_train, and we have the rest of the values in our X_test. 61 00:05:35,110 --> 00:05:40,620 So a hundred and two values in test and four hundred and four values in train. In total, 62 00:05:40,810 --> 00:05:47,950 we have 506 observations. We will use this X_train to create our model and we will use X_test 63 00:05:49,220 --> 00:05:51,170 to evaluate the performance of our model. 64 00:05:53,200 --> 00:05:57,250 Similarly, you can check the shape of y_train and y_test.
65 00:05:58,550 --> 00:06:04,040 The number of observations should be the same as in X_train and X_test, respectively.
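The row counts mentioned above can be checked with a short sketch. The data is again a synthetic placeholder with 506 rows, and the four feature columns are an assumption. The arithmetic reflects how scikit-learn sizes the split when only a fractional test_size is given: the test count is rounded up and the train set receives the remaining rows.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 506
X = np.zeros((n_rows, 4))  # placeholder features; 4 columns is made up
y = np.zeros(n_rows)       # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 506 * 0.2 = 101.2, which is rounded up to 102 test rows;
# the remaining 506 - 102 = 404 rows form the train set.
n_test = math.ceil(n_rows * 0.2)
n_train = n_rows - n_test

print(X_train.shape, X_test.shape)  # (404, 4) (102, 4)
print(y_train.shape, y_test.shape)  # (404,) (102,)
```

The y outputs carry one value per row, so their lengths match X_train and X_test respectively.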