In this lecture, we will learn how to split our dataset into two parts: a training dataset and a test dataset. Then we will train our logistic regression model on the training dataset, and we will evaluate the performance of that model using the test dataset.

I have already written all of the code, but I would encourage you to write all of this code on your own while you are practicing.

First, to split our data into train and test sets, we need the function train_test_split, and you can import this function from sklearn.model_selection.

The output of this function is in the form of four variables. The first variable should be your independent train data, the second variable should be your independent test data, and then your dependent train data and your dependent test data. There are four arguments for this function. The first one should be the independent variables from your original dataframe. We have already created the X variable; therefore, we are using X. The second argument here is the dependent variable. We have already created y; that's why we are using y. The next argument is for the test size. As we discussed in our theory lecture, the ideal split for train and test is 80 percent for train and 20 percent for test. Here you can mention the portion of your whole dataset you want in your test data.
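The split described above can be sketched as follows. This is a minimal sketch, not the lecture's own notebook: since the original DataFrame is not shown, a synthetic stand-in with the same 506 rows is generated with make_classification.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lecture's data: 506 rows, like the original dataset
X, y = make_classification(n_samples=506, random_state=0)

# Four outputs: independent train, independent test, dependent train, dependent test.
# test_size=0.2 keeps 20 percent for testing; random_state=0 makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# In Python 3, print is a function, so parentheses are required
print(X_train.shape)  # (404, 20)
print(X_test.shape)   # (102, 20)
```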
Since we want 20 percent of the data as test data, we will write 0.2. If you want 30 percent as test data, you can write 0.3 instead of 0.2. The next argument is for the random state.

Here you can provide any value, but if you write zero, you will get the same result as I am getting from this random split. If you write one or any other value, this function will work exactly the same, but the only difference will be that you will not get the exact same result as me, because your random numbers will be different from my random numbers. So if you want to get the same split every time, you will have to stick to only one random state. Right now I am using zero, and for all future train-test splits I will use zero.

Now, initially in our X and y variables there were five hundred and six rows. Let's look at the shapes of X_train, X_test, y_train and y_test. If you are using an updated version of Python, you need to put parentheses with the print command.

If we run this, you can see that 80 percent of the data is in X_train and the remaining 20 percent is in X_test: 404 observations are in our X_train data, and 102 observations are in our X_test data.

Now, let's fit a logistic regression model on our training dataset.
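Fitting the model on the training data only can be sketched like this; the object name clf_lr follows the lecture, while the synthetic stand-in dataset is an assumption since the original data is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (506 rows, as in the lecture)
X, y = make_classification(n_samples=506, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the classifier object and fit it on the training data only;
# max_iter is raised so the lbfgs solver converges cleanly
clf_lr = LogisticRegression(max_iter=1000)
clf_lr.fit(X_train, y_train)
```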
We'll follow the same steps. First, we will create the object clf_lr, and then we will fit our X_train and y_train in this object.

We have fitted our model. Now let's predict the values on our X_test dataset. You can notice here that I am using X_test instead of X_train, because I want to predict the values on my test dataset.

Now let's look at the accuracy and the confusion matrix of this model. To get the accuracy score and the confusion matrix, we will import these two from sklearn.metrics. We will run it, then we will create the confusion matrix for our y_test and y_test_pred. These are the actual values of our y_test, and these are the predicted values of our y_test.

Here, the rows represent the actual class and the columns represent the predicted class. So for these 36 records, we are seeing that the actual value was not sold and the predicted value was also not sold. If you remember, these are known as true negatives. Similarly, here our actual value is one, but the predicted value is zero; these are known as false negatives. And for these 22, the actual value was zero and the predicted value was one; these are known as false positives.
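The predict and confusion-matrix steps above can be sketched as follows (again on a synthetic stand-in dataset, so the cell counts will differ from the 36/22 figures in the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=506, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf_lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict on the test data, not the training data
y_test_pred = clf_lr.predict(X_test)

# Rows are the actual classes, columns the predicted classes:
# [[true negatives,  false positives],
#  [false negatives, true positives ]]
cm = confusion_matrix(y_test, y_test_pred)
print(cm)
```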
And these remaining records, where both the actual and the predicted values are one, are the true positives.

Now, the accuracy score is the percentage of observations my model is able to predict correctly, so the accuracy score will be the sum of 36 plus 31 divided by the total number of observations. You can do that manually also, but there is a small function to calculate it automatically.

The accuracy of our model on our test dataset is 0.65. It's not great, but it is good enough. And it is always recommended to use your test dataset to create the confusion matrix and to calculate the accuracy score.

Now, I have calculated this accuracy score and confusion matrix for logistic regression; you can try it on your own, and you can also compare the accuracy scores between the two models. In the next video, we will learn how to create a KNN model in Python.
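The accuracy calculation can be sketched both ways, manually from the confusion matrix and with the built-in helper; on the synthetic stand-in data the resulting number will differ from the lecture's 0.65.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=506, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf_lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_test_pred = clf_lr.predict(X_test)

# Manual accuracy: correct predictions (the diagonal) divided by all observations,
# exactly like (36 + 31) / 102 in the lecture
cm = confusion_matrix(y_test, y_test_pred)
manual_acc = (cm[0, 0] + cm[1, 1]) / cm.sum()

# The same number via the built-in helper
acc = accuracy_score(y_test, y_test_pred)
print(acc)
```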