1 00:00:00,530 --> 00:00:05,600 In this video, we will learn how to split the available data into test and train sets. 2 00:00:06,800 --> 00:00:13,080 Then we will train a model on the training set and find the mean squared error on the test set. 3 00:00:15,950 --> 00:00:21,140 To split the data into test and train, I prefer to install this package over other methods. 4 00:00:21,830 --> 00:00:23,800 This package is called caTools. 5 00:00:24,830 --> 00:00:26,270 You know how to install a package. 6 00:00:27,200 --> 00:00:29,930 You can just write install.packages. 7 00:00:32,980 --> 00:00:35,650 And within brackets, in double quotation marks, 8 00:00:35,760 --> 00:00:36,060 write 9 00:00:36,220 --> 00:00:39,040 caTools, and the T of Tools is capital. 10 00:00:43,400 --> 00:00:43,960 Run this. 11 00:00:48,800 --> 00:00:51,710 You can see on the right, caTools is now available. 12 00:00:52,370 --> 00:00:55,520 We'll just tick this check box to make it available. 13 00:00:56,800 --> 00:01:03,910 Now we are going to set a seed. The concept of setting a seed is that when splitting the data into 14 00:01:03,910 --> 00:01:04,840 test and train, 15 00:01:05,110 --> 00:01:06,400 I'll be doing it randomly. 16 00:01:06,970 --> 00:01:13,510 But if I set the seed at a particular value and you set the seed at the same value, we both will 17 00:01:13,510 --> 00:01:14,860 get the same split. 18 00:01:15,100 --> 00:01:21,400 That is, the observations in the training set which I get, you will get the same observations in 19 00:01:21,400 --> 00:01:22,150 your training set. 20 00:01:24,720 --> 00:01:26,890 So we'll set the seed at zero. 21 00:01:27,120 --> 00:01:30,470 So we write set.seed. 22 00:01:30,720 --> 00:01:32,520 And within the brackets, we write zero. 23 00:01:37,260 --> 00:01:38,490 In brackets, we write zero. 24 00:01:39,460 --> 00:01:40,180 We'll run this. 25 00:01:41,140 --> 00:01:42,740 So the seed is set at zero.
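The install and seed steps dictated above can be sketched in R as follows. This is a minimal illustration, not the instructor's exact script: the install and load lines are commented out so the block runs without network access, and the `sample()` draw is only a stand-in (an assumption, not something shown in the video) to demonstrate that resetting the same seed reproduces the same random result.

```r
# One-time install; note the capital T in "caTools".
# install.packages("caTools")

# Loading from code has the same effect as ticking the package's check box.
# library(caTools)

# Setting the same seed before a random draw reproduces that draw.
set.seed(0)
first_draw <- sample(1:10, 5)   # illustrative random draw (not from the video)

set.seed(0)
second_draw <- sample(1:10, 5)  # same seed, so the same five numbers

identical(first_draw, second_draw)  # TRUE
```

This is why two people who both call `set.seed(0)` before the split can expect the same rows in their training sets, provided they run the same code on the same data.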
26 00:01:43,270 --> 00:01:45,260 Now we will split the data. Write 27 00:01:45,520 --> 00:01:49,240 split is equal to sample.split. 28 00:01:54,390 --> 00:02:01,420 And within brackets, we'll write df, comma, SplitRatio is equal to 0.8. 29 00:02:03,770 --> 00:02:06,560 The S and the R of SplitRatio are capital. 30 00:02:08,440 --> 00:02:16,970 Let's run this. So a new variable called split is created, and it has a TRUE or FALSE value for each 31 00:02:16,970 --> 00:02:18,410 of the observations. 32 00:02:19,520 --> 00:02:24,980 We will assign the TRUE values to the training set, and the values that are FALSE we will assign to the test set. 33 00:02:25,100 --> 00:02:26,510 So training_set is equal to... 34 00:02:31,800 --> 00:02:36,120 training_set is equal to subset. 35 00:02:39,580 --> 00:02:40,700 It's a subset of df. 36 00:02:40,930 --> 00:02:41,630 So df, 37 00:02:42,770 --> 00:02:43,160 comma, 38 00:02:45,500 --> 00:02:46,040 split 39 00:02:48,730 --> 00:02:50,720 equal to equal to TRUE. 40 00:02:52,690 --> 00:02:55,560 So we're checking wherever the split value is TRUE. 41 00:02:56,380 --> 00:03:04,300 We take out that subset of df and put it into the training_set variable. So you can see the training_set 42 00:03:04,300 --> 00:03:06,430 variable is also created. 43 00:03:06,880 --> 00:03:08,530 It has 378 observations. 44 00:03:08,830 --> 00:03:15,340 It will not take exactly 80 percent of the observations, but nearly. Whichever value you mention in 45 00:03:15,340 --> 00:03:22,330 the split ratio, you will have nearly that number of observations. And the remaining values we will 46 00:03:22,330 --> 00:03:23,560 assign to the test set. 47 00:03:23,620 --> 00:03:25,360 So test_set 48 00:03:30,070 --> 00:03:31,180 is equal to subset. 49 00:03:32,460 --> 00:03:36,960 And within brackets, df, comma, split equal to equal to FALSE. 50 00:03:42,770 --> 00:03:47,760 Run this. So the test_set variable is also created.
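The split described above might look like the sketch below. The data frame here is synthetic and its columns (`rooms`, and the values of `price`) are hypothetical stand-ins for the course data set. One deliberate difference from the dictated call: the caTools documentation describes the first argument of `sample.split` as a vector of labels, so this sketch passes the outcome column `df$price` (assuming `price` is the dependent variable) instead of the whole data frame, which yields one TRUE/FALSE flag per row.

```r
library(caTools)

# Synthetic stand-in for the course's data frame `df` (columns are hypothetical).
set.seed(0)
df <- data.frame(
  price = rnorm(500, mean = 25, sd = 5),
  rooms = rnorm(500, mean = 6, sd = 1)
)

# One logical flag per observation; roughly 80% of them are TRUE.
split <- sample.split(df$price, SplitRatio = 0.8)

# Rows flagged TRUE go to the training set, rows flagged FALSE to the test set.
training_set <- subset(df, split == TRUE)
test_set     <- subset(df, split == FALSE)

nrow(training_set)  # number of training observations
nrow(test_set)      # number of test observations
```

Checking `length(split) == nrow(df)` is a quick way to confirm that the flags line up with rows before subsetting.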
51 00:03:49,450 --> 00:03:53,150 Now we will run a linear model on the training data set. 52 00:03:53,780 --> 00:03:57,620 We know how to run a linear model. We will create a variable, 53 00:03:58,010 --> 00:03:59,250 lm_a. 54 00:04:01,670 --> 00:04:03,380 And this is equal to lm. 55 00:04:04,590 --> 00:04:09,630 Within brackets, we'll write price, tilde, dot, 56 00:04:12,780 --> 00:04:13,200 comma, 57 00:04:15,440 --> 00:04:17,640 data is equal to training_set. 58 00:04:18,620 --> 00:04:21,270 We are not running this model on the complete data that we have. 59 00:04:21,600 --> 00:04:25,270 We are running it only on the 378 observations in the training set. 60 00:04:26,460 --> 00:04:27,270 So let's run this. 61 00:04:28,860 --> 00:04:31,310 The model is fit in lm_a. 62 00:04:31,980 --> 00:04:36,970 If you want to look at the summary, you can write summary and, within brackets, lm_a. 63 00:04:38,530 --> 00:04:44,100 But here we are going to find out the mean squared error of the training set 64 00:04:44,820 --> 00:04:48,170 and the test set. So to find the mean squared errors, 65 00:04:48,960 --> 00:04:52,330 we need to first predict the value of price on the basis of 66 00:04:52,410 --> 00:04:55,650 this fitted model. To predict the values, 67 00:04:56,030 --> 00:04:57,420 we'll use a function called predict. 68 00:04:58,550 --> 00:05:01,190 The predict function takes two parameters. 69 00:05:01,370 --> 00:05:05,140 One is the model that we have fitted, which is lm_a. 70 00:05:05,600 --> 00:05:10,810 And the other is the data which is to be used to predict the values. 71 00:05:12,290 --> 00:05:20,090 So we'll get these predicted values of the training set into a variable called train_a. So we'll 72 00:05:20,090 --> 00:05:30,890 write train_a is equal to predict, and within brackets, the first parameter will be lm_a, 73 00:05:30,900 --> 00:05:37,270 which is the fitted model, comma, and the second is the data.
74 00:05:37,340 --> 00:05:38,630 So that is the training data. 75 00:05:39,610 --> 00:05:41,970 So we'll write training_set. 76 00:05:45,160 --> 00:05:52,570 So what this will do is it will take all the independent variables from this set, put them into this 77 00:05:52,570 --> 00:05:57,860 model, predict the value of the dependent variable, and store it into train_a. 78 00:05:59,050 --> 00:05:59,930 So let's run this. 79 00:06:02,800 --> 00:06:06,850 So we have train_a as another variable. 80 00:06:07,780 --> 00:06:09,960 We'll do the same thing for the test 81 00:06:09,960 --> 00:06:14,500 set also; just in place of train, we'll write test. 82 00:06:19,810 --> 00:06:24,230 So we'll get the predicted value of house price for our test set. 83 00:06:26,050 --> 00:06:34,450 Now, the mean squared error is the average of the squared differences between these predicted values 84 00:06:34,450 --> 00:06:35,500 and the actual values. 85 00:06:36,830 --> 00:06:39,680 So to get that average, we'll write mean. 86 00:06:43,530 --> 00:06:48,280 And within brackets, we have to square the differences of these. 87 00:06:49,000 --> 00:06:54,340 So it is the difference of training_set dollar price, 88 00:06:55,420 --> 00:07:01,330 so these are the actual values, minus the predicted values, which are train_a. 89 00:07:05,400 --> 00:07:07,180 And we want to square these values. 90 00:07:07,360 --> 00:07:09,760 So we'll put another bracket around it. 91 00:07:13,630 --> 00:07:15,690 And we'll square this difference. 92 00:07:18,320 --> 00:07:18,740 Run this. 93 00:07:20,940 --> 00:07:26,210 So twenty point six six is the mean squared error on the training data. 94 00:07:28,430 --> 00:07:35,420 So the average squared distance between the predicted values and the actual values on the training data 95 00:07:35,810 --> 00:07:36,590 is twenty 96 00:07:36,590 --> 00:07:37,460 point six six. 97 00:07:38,240 --> 00:07:38,900 Let's do this
98 00:07:39,180 --> 00:07:40,670 for the test data too. 99 00:07:44,070 --> 00:07:47,680 We will use test_set dollar price 100 00:07:51,710 --> 00:07:53,900 minus test_a. 101 00:07:57,760 --> 00:08:05,770 So since this test data is previously unseen, most probably our model will not work as well on this data. 102 00:08:06,730 --> 00:08:13,330 The mean squared error is thirty-three point zero four, which means it is performing worse on the unseen 103 00:08:13,330 --> 00:08:13,720 data. 104 00:08:15,300 --> 00:08:17,640 This is as discussed in the theory lectures also. 105 00:08:19,470 --> 00:08:20,650 So this is how 106 00:08:20,760 --> 00:08:24,380 we split the data into a test set and a train set in 107 00:08:24,620 --> 00:08:32,140 R. We then ran the model on the training set, and using the model created on the training set, we 108 00:08:32,160 --> 00:08:34,830 predicted the values of the test set's dependent variable. 109 00:08:35,640 --> 00:08:39,450 We then found the estimated error on this test data. 110 00:08:40,140 --> 00:08:43,860 This estimated error is to be used when we are comparing different models.
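The fit, predict, and mean-squared-error steps from this lesson can be put together as in the sketch below. It runs on synthetic data, so the numbers it prints will not match the values quoted in the video. The predictor columns `rooms` and `age` are hypothetical, and to keep the block free of package dependencies the 80/20 split uses base R's `sample()` rather than caTools; the names `lm_a`, `train_a`, `test_a`, `training_set`, and `test_set` follow the lesson.

```r
# Synthetic stand-in for the course data (predictor columns are hypothetical).
set.seed(0)
n  <- 500
df <- data.frame(rooms = rnorm(n, mean = 6, sd = 1),
                 age   = runif(n, min = 0, max = 100))
df$price <- 5 + 4 * df$rooms - 0.05 * df$age + rnorm(n, sd = 3)

# Base-R 80/20 split (stand-in for the caTools split used in the lesson).
train_rows   <- sample(seq_len(n), size = round(0.8 * n))
training_set <- df[train_rows, ]
test_set     <- df[-train_rows, ]

# Fit the linear model on the training set only: price against all other columns.
lm_a <- lm(price ~ ., data = training_set)

# Predict price for both sets from the fitted model.
train_a <- predict(lm_a, training_set)
test_a  <- predict(lm_a, test_set)

# Mean squared error: average of (actual - predicted)^2.
mse_train <- mean((training_set$price - train_a)^2)
mse_test  <- mean((test_set$price - test_a)^2)

mse_train
mse_test
```

When several candidate models are fitted on the same `training_set`, comparing their `mse_test` values is the kind of comparison the closing cues describe.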