1 00:00:00,620 --> 00:00:05,800 In this video, we will learn how to split the available data into a test set and a train set. 2 00:00:06,960 --> 00:00:13,110 Then we will use the training set to train our model, and we will create the confusion matrix on the 3 00:00:13,110 --> 00:00:20,160 test set to check the performance of our model. To split the data into test and train sets, 4 00:00:20,850 --> 00:00:25,620 we first need this package called caTools. 5 00:00:28,030 --> 00:00:30,480 If you have it installed, it'll be shown here. 6 00:00:30,540 --> 00:00:33,100 You just have to tick it. If it is not installed, 7 00:00:33,270 --> 00:00:34,740 you know how to install a package. 8 00:00:34,920 --> 00:00:40,470 You have to write install.packages, and within double quotation marks you need to mention the name 9 00:00:40,470 --> 00:00:41,160 of the package. 10 00:00:42,420 --> 00:00:44,270 So first we need this package, 11 00:00:44,340 --> 00:00:44,950 caTools. 12 00:00:47,580 --> 00:00:50,690 Next, we are going to set a seed. 13 00:00:51,750 --> 00:00:59,670 The concept of setting a seed is that when we are splitting the data into test and train, we'll 14 00:00:59,670 --> 00:01:00,930 be doing it randomly. 15 00:01:02,610 --> 00:01:09,180 But if I set the seed at a particular value and you use the same seed at the same value, we both will get 16 00:01:09,180 --> 00:01:10,200 the same split. 17 00:01:11,190 --> 00:01:15,120 That is, the observations which I will get in the training set, 18 00:01:16,050 --> 00:01:18,520 you will get the same observations in your training set. 19 00:01:20,920 --> 00:01:26,290 So we will be setting the seed at zero. To write it: set.seed. 20 00:01:31,070 --> 00:01:34,220 Within the brackets, we write zero. 21 00:01:34,990 --> 00:01:37,070 So the seed is now set at zero.
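A minimal sketch of the seed idea described above, using only base R so that it runs without extra packages. The vector `1:10` is a stand-in for row indices and is an assumption, not the data from the video; the commented lines show the install and load calls for caTools.

```r
# One-time install, then load (uncomment if the package is missing).
# install.packages("caTools")
# library(caTools)

# Setting the same seed before a random draw reproduces the draw exactly.
set.seed(0)
first_draw <- sample(1:10)

set.seed(0)
second_draw <- sample(1:10)

# Both draws are the same permutation because the seed was identical.
print(identical(first_draw, second_draw))
```

Resetting the seed immediately before the random step is what matters: any random call made in between would shift the sequence and change the result.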
22 00:01:38,620 --> 00:01:46,760 So now we create a variable called split, which will have 80 percent of the values as TRUE and 20 23 00:01:46,760 --> 00:01:49,850 percent of the values as FALSE. To create that, 24 00:01:49,910 --> 00:01:50,180 we will 25 00:01:50,180 --> 00:01:50,420 write 26 00:01:50,600 --> 00:01:54,960 split is equal to sample.split. 27 00:01:58,580 --> 00:01:59,430 Within the brackets, 28 00:01:59,690 --> 00:02:02,120 the first parameter will be our data, which is df. 29 00:02:03,260 --> 00:02:08,250 And the second parameter is SplitRatio, which we will set to 0.8. 30 00:02:10,430 --> 00:02:17,000 This means that 80 percent of the data will be the training set and 20 percent will be the test set. 31 00:02:17,930 --> 00:02:19,190 So we'll run this command. 32 00:02:19,850 --> 00:02:23,690 You can see that there is a variable called split. 33 00:02:25,830 --> 00:02:30,130 And it has values like FALSE, TRUE, TRUE, FALSE and so on. 34 00:02:31,990 --> 00:02:35,440 So the training set will be the subset of 35 00:02:35,630 --> 00:02:39,070 the df dataset which has split value TRUE. 36 00:02:40,150 --> 00:02:41,720 So we'll write train_set 37 00:02:46,750 --> 00:02:48,130 is equal to subset. 38 00:02:51,250 --> 00:02:58,360 The data is df, and the split condition is split equal-to equal-to TRUE. 39 00:03:05,790 --> 00:03:06,240 A single 40 00:03:06,270 --> 00:03:12,600 equal-to is an assignment operator; a double equal-to is used to compare values. 41 00:03:12,810 --> 00:03:14,850 So split values should be TRUE. 42 00:03:20,070 --> 00:03:29,250 You can see that our training set has 386 observations, which is nearly 80 percent, not exactly 80 43 00:03:29,250 --> 00:03:31,710 percent of the observations, but nearly 80 percent. 44 00:03:33,120 --> 00:03:37,050 And for the test set, we will use the split value FALSE. 45 00:03:37,320 --> 00:03:38,950 So test_set is equal to subset. 46 00:03:39,360 --> 00:03:41,790 The data is df, and split is equal-to equal-to FALSE.
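The split described above can be sketched as follows. The data frame here is a made-up toy with hypothetical `price` and `Sold` columns, not the data used in the video. One assumption to flag: the caTools documentation describes the first argument of `sample.split` as a vector of labels, so this sketch passes the outcome column rather than the whole data frame.

```r
library(caTools)  # provides sample.split()

set.seed(0)

# Toy data standing in for df; column names and values are illustrative assumptions.
df <- data.frame(
  price = rnorm(100, mean = 20, sd = 5),
  Sold  = factor(rep(c("NO", "YES"), each = 50))
)

# TRUE marks rows for the training set, FALSE marks rows for the test set.
split <- sample.split(df$Sold, SplitRatio = 0.8)

# == compares values; a single = would assign instead.
train_set <- subset(df, split == TRUE)
test_set  <- subset(df, split == FALSE)

# Every row lands in exactly one of the two sets.
print(c(train = nrow(train_set), test = nrow(test_set)))
```

Passing the label column lets `sample.split` keep roughly the same share of each class in both sets, which may be one reason a split on real data comes out near, rather than exactly at, the requested ratio.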
47 00:03:54,280 --> 00:03:55,210 You can see on the right, 48 00:03:55,480 --> 00:04:01,500 we have train_set with 386 observations and test_set with the remaining 120 observations. 49 00:04:02,560 --> 00:04:03,610 Now our job is done. 50 00:04:04,540 --> 00:04:08,910 We have to repeat the same things which we did on the complete dataset. 51 00:04:09,610 --> 00:04:12,270 So we will train the model using train_set. 52 00:04:12,970 --> 00:04:16,370 And we create the confusion matrix using test_set. 53 00:04:18,010 --> 00:04:26,020 So I'll show you how to train the logistic regression model with the training set and create the confusion 54 00:04:26,020 --> 00:04:27,460 matrix using the test set. 55 00:04:27,830 --> 00:04:35,110 You'll have to train the linear discriminant analysis on the training set and create the confusion matrix 56 00:04:35,110 --> 00:04:36,640 on the test set on your own. 57 00:04:38,350 --> 00:04:40,750 You probably know how to do logistic regression. 58 00:04:41,270 --> 00:04:44,260 We will create a new variable called train.fit 59 00:04:47,620 --> 00:04:53,110 is equal to glm. The glm function is used for doing logistic regression. 60 00:04:56,350 --> 00:04:57,080 The dependent variable 61 00:04:57,080 --> 00:04:57,410 is Sold. 62 00:05:00,100 --> 00:05:06,160 We are using all the variables, so dot. Data is train_set. 63 00:05:09,580 --> 00:05:11,210 And family is binomial. 64 00:05:19,660 --> 00:05:20,450 Let us run this. 65 00:05:22,750 --> 00:05:28,500 So now we have another variable, train.fit, which contains the information of the logistic regression 66 00:05:28,500 --> 00:05:28,800 model. 67 00:05:30,750 --> 00:05:36,090 Now to find the predicted probabilities for the test set, we will use the predict function. 68 00:05:38,130 --> 00:05:40,140 We will write test.probs. 69 00:05:43,330 --> 00:05:46,030 This is the variable name. It is equal to predict.
70 00:05:49,660 --> 00:05:53,260 The first parameter is the model, which is train.fit. 71 00:05:57,460 --> 00:06:00,700 The second parameter is the data on which we want to predict. 72 00:06:00,970 --> 00:06:02,260 It is test_set. 73 00:06:04,700 --> 00:06:07,550 test underscore set. 74 00:06:07,590 --> 00:06:09,620 And the parameter type is equal to response. 75 00:06:16,910 --> 00:06:17,790 Let us run this. 76 00:06:19,990 --> 00:06:25,810 So we have performed training on one set and testing on a completely different set. 77 00:06:27,380 --> 00:06:29,250 We have these probabilities for the test set. 78 00:06:30,050 --> 00:06:36,320 If we want to assign classes using a default boundary condition of 0.5, we will first create 79 00:06:36,320 --> 00:06:39,970 an array which will have all the values as NO. 80 00:06:40,580 --> 00:06:45,400 And we will change those elements which have a probability value greater than 0.5 to 81 00:06:45,490 --> 00:06:45,940 YES. 82 00:06:46,580 --> 00:06:49,970 So first we create the array. We'll create 83 00:06:50,460 --> 00:06:50,840 test.pred. 84 00:06:53,280 --> 00:06:54,350 That is, the test predictions. 85 00:06:54,380 --> 00:06:55,930 So test.pred 86 00:06:56,000 --> 00:06:59,110 is equal to rep, which stands for repeat. 87 00:06:59,120 --> 00:06:59,310 We 88 00:06:59,310 --> 00:07:01,410 repeat 89 00:07:01,580 --> 00:07:02,120 NO 90 00:07:05,540 --> 00:07:14,070 120 times. Since the test set has 120 observations, we will have an array which has NO 120 91 00:07:14,100 --> 00:07:16,430 times. We'll run this. 92 00:07:17,940 --> 00:07:23,960 Now, for all those elements where the probability is greater than 0.5, we will change it to YES. 93 00:07:24,270 --> 00:07:25,960 So test.pred, 94 00:07:28,850 --> 00:07:29,660 square bracket, 95 00:07:29,810 --> 00:07:30,330 we'll write 96 00:07:33,030 --> 00:07:34,230 test.probs 97 00:07:38,930 --> 00:07:40,480 is greater than 0.5,
98 00:07:44,000 --> 00:07:44,670 is equal to 99 00:07:45,790 --> 00:07:46,230 YES. 100 00:07:53,430 --> 00:07:53,830 Run this. 101 00:07:59,110 --> 00:08:02,180 So now, wherever the probability is more than 0.5, 102 00:08:02,380 --> 00:08:03,510 we have YES. 103 00:08:04,270 --> 00:08:06,100 Now we have the predicted classes 104 00:08:06,100 --> 00:08:09,610 for the test set. We have the actual values in this Sold 105 00:08:09,700 --> 00:08:10,620 variable in the 106 00:08:10,620 --> 00:08:16,480 test set. We can compare these by creating a confusion matrix. To create a confusion matrix, 107 00:08:16,510 --> 00:08:18,800 we will use the table function. So we'll 108 00:08:18,820 --> 00:08:19,000 write 109 00:08:19,130 --> 00:08:19,600 table, 110 00:08:22,350 --> 00:08:30,820 test.pred, comma, the Sold variable of the test set, or test_set dollar 111 00:08:36,580 --> 00:08:38,920 Sold. 112 00:08:42,750 --> 00:08:43,510 Let's run this. 113 00:08:45,800 --> 00:08:53,690 So you can see now we have 78 correct predictions out of a total of 120. 114 00:08:54,320 --> 00:08:56,440 So 78 by 120 115 00:08:56,450 --> 00:08:59,750 is the prediction accuracy, which is lower than on the training set. 116 00:09:00,020 --> 00:09:02,560 But still, it can be considered a good classifier. 117 00:09:04,820 --> 00:09:12,410 Now, you have to create this confusion matrix on the test set using linear discriminant analysis and 118 00:09:12,410 --> 00:09:18,580 compare the performance of LDA against logistic regression. In the coming videos, 119 00:09:18,650 --> 00:09:20,520 we'll be learning about K-nearest neighbors. 120 00:09:21,020 --> 00:09:25,670 And then we will be predicting the test set values using the KNN method.
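The workflow described above can be sketched end to end as follows. Everything about the data is an assumption: the toy data frame, its single predictor `x` and the `Sold` outcome are invented for illustration, and the 80/20 split uses base R `sample` rather than caTools so the block has no package dependency. The modelling calls (`glm`, `predict` with `type = "response"`, `rep`, `table`) are the base R functions named in the walkthrough.

```r
set.seed(0)

# Toy data standing in for the course data; names and values are assumptions.
n  <- 200
x  <- rnorm(n)
df <- data.frame(
  x    = x,
  Sold = factor(ifelse(x + rnorm(n) > 0, "YES", "NO"), levels = c("NO", "YES"))
)

# Base-R 80/20 split (used here instead of caTools to avoid a dependency).
train_rows <- sample(seq_len(n), size = 0.8 * n)
train_set  <- df[train_rows, ]
test_set   <- df[-train_rows, ]

# Fit logistic regression on the training set only.
train.fit <- glm(Sold ~ ., data = train_set, family = binomial)

# Predicted probabilities of the second factor level ("YES") on the test set.
test.probs <- predict(train.fit, test_set, type = "response")

# Start with all "NO", then switch to "YES" above the 0.5 boundary.
test.pred <- rep("NO", nrow(test_set))
test.pred[test.probs > 0.5] <- "YES"

# Confusion matrix: predictions in rows, actual values in columns.
conf_mat <- table(test.pred, test_set$Sold)
print(conf_mat)

# Accuracy is the share of test rows where prediction and actual value agree.
accuracy <- mean(test.pred == test_set$Sold)
print(accuracy)
```

Using `nrow(test_set)` in the `rep` call instead of a hard-coded 120 keeps the prediction vector the right length if the split size changes.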