1 00:00:00,200 --> 00:00:04,830 In the last video we left off scoring our machine learning model that we'd fitted to the data. 2 00:00:05,010 --> 00:00:10,470 After we'd filled all the missing numerical values, filled all the missing categorical values and turned all our 3 00:00:10,470 --> 00:00:18,350 data into numbers, the model worked: our data had no missing values and it was all numeric. But we left off 4 00:00:18,350 --> 00:00:21,160 with the question: why isn't this metric reliable? 5 00:00:21,170 --> 00:00:27,980 Let's put that here, because it's worth highlighting. Always remember: evaluating a machine 6 00:00:27,980 --> 00:00:33,290 learning model is just as important as fitting one. 7 00:00:33,290 --> 00:00:37,280 And that's where we're up to now. Because we've just fit a machine learning model, we have to make sure 8 00:00:37,280 --> 00:00:41,840 that our results hold water, right? Because we don't want to be promising things that our model can't 9 00:00:41,840 --> 00:00:42,630 do. 10 00:00:42,650 --> 00:00:47,810 So, question: why doesn't the above metric hold water? 11 00:00:50,900 --> 00:00:57,380 "Hold water" is an expression; it means: why isn't the metric reliable? 12 00:00:57,380 --> 00:00:59,950 That's just a way of speaking. 13 00:01:00,380 --> 00:01:01,300 So why doesn't it 14 00:01:01,400 --> 00:01:03,800 hold water? Why isn't it reliable? 15 00:01:03,810 --> 00:01:06,380 Have you had a chance to think about it? 16 00:01:06,580 --> 00:01:11,470 I'll show you a diagram that may refresh your memory if we come back to our keynote. 17 00:01:11,500 --> 00:01:13,510 Remember this from right back at the beginning: 18 00:01:13,510 --> 00:01:17,300 the most important concept in machine learning, a.k.a. the three sets. 19 00:01:18,250 --> 00:01:22,860 So what we've done is we've trained a machine learning model on one dataset. 20 00:01:22,870 --> 00:01:24,150 But then what have we done? 21 00:01:24,200 --> 00:01:25,560 If we come back to the code...
22 00:01:25,960 --> 00:01:33,200 We fit the model on some data, but then we've evaluated it on the exact same 23 00:01:33,200 --> 00:01:36,090 data. 24 00:01:37,020 --> 00:01:42,390 Okay, so if we come back here, essentially what we've done, if we think of this as being 25 00:01:42,390 --> 00:01:46,680 a university course, is we've learned the course materials. 26 00:01:47,040 --> 00:01:53,600 But instead of testing our model on the equivalent of a final exam, a.k.a. a test 27 00:01:53,680 --> 00:02:01,110 dataset, we've just evaluated our machine learning model on the exact same materials it learned 28 00:02:01,110 --> 00:02:06,600 from. It's as if you were in a class, you got given a book to read, and then 29 00:02:06,600 --> 00:02:11,280 you got tested with questions taken directly from that very book. 30 00:02:11,760 --> 00:02:12,420 Okay. 31 00:02:12,970 --> 00:02:19,450 But we're after our model's ability to generalize, a.k.a. the ability of a machine learning model to 32 00:02:19,450 --> 00:02:21,790 perform well on data it hasn't seen before. 33 00:02:21,790 --> 00:02:23,400 That's what we're after. 34 00:02:23,440 --> 00:02:30,150 If we go back to Kaggle, there's a training dataset, which contains data through the end of 2011. 35 00:02:30,190 --> 00:02:32,740 There's a validation set and there's a test set. 36 00:02:32,740 --> 00:02:34,180 So that's what we're after. 37 00:02:34,240 --> 00:02:39,680 We need to evaluate our model not on the data we've trained it on, which is this; we need to evaluate 38 00:02:39,680 --> 00:02:46,360 it on the test data. But before we do that, if you remember, right back up the top, let's go right back 39 00:02:46,360 --> 00:02:51,170 up to where we imported our first data frame. I said we'd revisit this. 40 00:02:51,330 --> 00:02:52,720 And now the time has come.
41 00:02:53,590 --> 00:02:58,570 So if we go here, we imported TrainAndValid.csv. 42 00:02:59,140 --> 00:03:00,920 You might wonder, why did we do that? 43 00:03:00,940 --> 00:03:03,850 Well, it's all going to become clear in this video. 44 00:03:03,850 --> 00:03:05,810 So there's TrainAndValid.csv. 45 00:03:05,930 --> 00:03:10,960 There's also Valid.csv, and we've got Train.csv. Why didn't we just import them separately? 46 00:03:10,960 --> 00:03:16,960 Well, I wanted to demonstrate what it's like creating your own validation set, rather than someone else 47 00:03:16,960 --> 00:03:21,600 creating it for you, with a time series dataset, which is what we're working on. 48 00:03:21,600 --> 00:03:24,270 So this is a perfect playground for that. 49 00:03:24,340 --> 00:03:26,140 So if we read here: 50 00:03:26,140 --> 00:03:28,500 Train.csv is the training set, 51 00:03:28,840 --> 00:03:33,360 basically what we've been working on, and Valid.csv is the validation set. 52 00:03:33,430 --> 00:03:36,890 The key point here is that because it's a time series, 53 00:03:36,890 --> 00:03:43,390 the training dataset has data up to the end of 2011, whereas the validation set has data from January 54 00:03:43,390 --> 00:03:46,720 1, 2012 to April 30, 2012. 55 00:03:46,720 --> 00:03:53,410 So to better evaluate our model, we need to split our data up into training and validation 56 00:03:53,410 --> 00:03:54,330 sets. So let's do that. 57 00:03:55,060 --> 00:03:56,920 Let's go here: 58 00:03:56,920 --> 00:04:05,140 splitting data into train and validation sets. Because that's what we're after now. We've got a model; we can 59 00:04:05,140 --> 00:04:05,920 build models now. 60 00:04:05,920 --> 00:04:10,180 But rather than just build better models, we need to make sure that what we're evaluating 61 00:04:10,180 --> 00:04:11,860 with makes sense. 62 00:04:11,860 --> 00:04:14,780 So let's check our data again, as we always do. 63 00:04:14,890 --> 00:04:17,770 So maybe we can use the year.
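That import step can be sketched in code like this. This is a minimal sketch: the real notebook reads the Bluebook for Bulldozers TrainAndValid.csv (about 412,000 rows), so an inline stand-in CSV is used here so the snippet runs on its own; the saledate column name follows that dataset.

```python
import io
import pandas as pd

# Inline stand-in for TrainAndValid.csv (the real file is far larger)
raw = io.StringIO(
    "SalesID,SalePrice,saledate\n"
    "1,66000,11/16/2006 0:00\n"
    "2,57000,3/26/2004 0:00\n"
    "3,10000,2/26/2012 0:00\n"
)

# Parse the date column on import, then sort by sale date so that
# later time-based splits follow the order the sales actually happened
df = pd.read_csv(raw, parse_dates=["saledate"])
df.sort_values(by=["saledate"], inplace=True, ascending=True)

print(df["saledate"].is_monotonic_increasing)  # True
```

Sorting by date up front is what makes a simple year-based train/validation split meaningful later on.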
64 00:04:17,980 --> 00:04:24,170 What was it called? saleYear. df_tmp.saleYear. There we go. 65 00:04:24,170 --> 00:04:26,990 Okay, so maybe, because our data frame is in order... 66 00:04:26,990 --> 00:04:32,600 Yes, this is going to help us: when we imported our data frame, we ordered it by sale date. 67 00:04:32,630 --> 00:04:38,310 So now, reading this, to create our own validation set we want to split our data. 68 00:04:38,330 --> 00:04:41,240 So all of the rows up to 2011, 69 00:04:41,240 --> 00:04:48,890 i.e. saleYear up to 2011, can be in the training set, and then all of the rows in 2012 can be the validation 70 00:04:48,890 --> 00:04:49,370 set. 71 00:04:49,370 --> 00:04:53,530 We won't worry about the test set for now, because that's in a separate dataset. 72 00:04:53,540 --> 00:04:58,940 The reason we're only working on train and valid is because we imported 73 00:04:58,980 --> 00:05:02,000 TrainAndValid.csv at the start of the notebook, 74 00:05:02,000 --> 00:05:07,140 this file here. So we have to create our own validation dataset. 75 00:05:07,290 --> 00:05:07,950 So let's do that. 76 00:05:07,950 --> 00:05:16,370 So we've got df_tmp; let's check df_tmp.saleYear.value_counts(). 77 00:05:16,440 --> 00:05:17,160 There we go. 78 00:05:17,190 --> 00:05:23,250 2012: so there are 11,573 samples in 2012. 79 00:05:23,280 --> 00:05:24,010 All right. 80 00:05:24,120 --> 00:05:29,480 So to split our data into training and validation, it should be as easy as going: 81 00:05:29,580 --> 00:05:30,590 okay, 82 00:05:30,840 --> 00:05:37,830 let's introspect the saleYear column, and every row where it's equal to 2012 will be in the validation set, 83 00:05:37,920 --> 00:05:40,700 for example, 84 00:05:40,710 --> 00:05:43,640 and every row where saleYear 85 00:05:43,660 --> 00:05:47,650 is not equal to 2012 will be in the training set.
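That value_counts() check looks like this in code. A sketch on toy data standing in for df_tmp; in the real dataset the 2012 count comes out as 11,573.

```python
import pandas as pd

# Toy stand-in for df_tmp with its derived saleYear column
df_tmp = pd.DataFrame({"saleYear": [2009, 2010, 2011, 2011, 2012, 2012]})

# How many samples fall in each sale year?
print(df_tmp["saleYear"].value_counts())
```

value_counts() returns the counts sorted from most to least frequent, which makes it a quick way to see how many rows the 2012 slice (our future validation set) will contain.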
86 00:05:47,700 --> 00:05:49,200 So let's stop talking about it, Daniel. 87 00:05:49,200 --> 00:05:55,280 Let's see the code to split the data into training and validation sets. 88 00:05:55,300 --> 00:05:58,910 This is something that you might have to do with your own time series data, right? 89 00:05:58,920 --> 00:06:02,590 Because you're not always going to be given it in a Kaggle format, right? 90 00:06:02,610 --> 00:06:06,660 When you're working with a client or working on a project, you're not always going to automatically have 91 00:06:06,660 --> 00:06:08,430 your data in train, valid and test sets. 92 00:06:08,430 --> 00:06:11,540 These are things you're going to have to create for yourself: 93 00:06:11,550 --> 00:06:16,020 looking at the sale column, or the time column, or the date column, and figuring out 94 00:06:16,260 --> 00:06:18,590 how you can make your own training and validation set. 95 00:06:18,600 --> 00:06:26,130 So that is exactly why we imported them as one set to begin with: so we could practice making our own 96 00:06:26,340 --> 00:06:28,080 training and validation sets. 97 00:06:28,200 --> 00:06:35,540 And I feel like I'm saying the word "set" a lot, but that's important, because if we go back to our keynote: 98 00:06:35,660 --> 00:06:40,460 this is the most important concept in machine learning, because whatever we train our model on, we want 99 00:06:40,460 --> 00:06:43,460 to make sure we're evaluating it on something else. 100 00:06:43,470 --> 00:06:50,670 So, coming here: the validation set is every row in df_tmp where the saleYear column equals 2012. 101 00:06:50,690 --> 00:06:51,590 Yes, that's correct. 102 00:06:52,010 --> 00:07:01,190 And the training set is every row in df_tmp where the saleYear column is not equal to 103 00:07:01,190 --> 00:07:02,180 2012. 104 00:07:04,700 --> 00:07:05,230 Wonderful. 105 00:07:05,270 --> 00:07:09,950 And then we might go len(df_val) and len(df_train).
106 00:07:10,340 --> 00:07:16,730 So all this is going to tell us is the length of these two data frames. Beautiful. 107 00:07:16,730 --> 00:07:23,210 So now we have a validation set which contains 11,573 rows, and a 108 00:07:23,210 --> 00:07:30,140 training dataset which contains 401,125 rows, or samples, and 109 00:07:30,140 --> 00:07:33,180 they're split on date. Beautiful. 110 00:07:33,190 --> 00:07:34,350 We're ticking boxes here. 111 00:07:34,360 --> 00:07:35,600 We are ticking boxes here. 112 00:07:35,800 --> 00:07:41,470 So now what we might do is split the data into X and y. 113 00:07:41,470 --> 00:07:48,670 We've seen this before, and that way we have an X train set and a y train set, and an X valid 114 00:07:48,670 --> 00:07:52,100 set and a y valid set, which are our data and labels. 115 00:07:52,330 --> 00:07:57,940 So we'll have X_train, y_train equals df_train... 116 00:07:58,150 --> 00:08:03,370 we're working with the training set here... .drop: we want to drop the SalePrice column on axis 117 00:08:03,370 --> 00:08:05,690 1.
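The year-based split described above can be written as two boolean-mask selections. A sketch on toy data; the saleYear and SalePrice column names follow the course notebook, and the real lengths come out as 11,573 and 401,125.

```python
import pandas as pd

# Toy stand-in for df_tmp: a few pre-2012 rows plus some 2012 rows
df_tmp = pd.DataFrame({
    "saleYear":  [2010, 2011, 2011, 2012, 2012],
    "SalePrice": [10_000, 20_000, 15_000, 30_000, 25_000],
})

# Time-based split: every 2012 row becomes validation data,
# every earlier row stays in the training set
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

print(len(df_val), len(df_train))  # 2 3
```

Because the two masks are exact complements, every row of df_tmp lands in exactly one of the two sets, with no overlap and no leakage of future (2012) sales into training.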
118 00:08:05,690 --> 00:08:12,880 Okay, that just drops the column from df_train, and y_train is going to be equal to the SalePrice 119 00:08:12,880 --> 00:08:20,610 column. See how I'm doing it with a comma here? I'm being a bit tricky by doing it in one line. Beautiful. 120 00:08:20,700 --> 00:08:31,950 And then here: X_valid, y_valid... just the same again, but 121 00:08:31,950 --> 00:08:40,830 this time with the validation set: df_val.drop... and this is actually going to be SalePrice. There we go. And 122 00:08:40,830 --> 00:08:46,020 now we might inspect our data, just to make sure we haven't made a little error somewhere; 123 00:08:46,400 --> 00:08:49,410 they should all be comparable shapes to each other. 124 00:08:52,060 --> 00:08:57,550 y_valid.shape. So we're just taking these datasets that we're creating here and finding 125 00:08:57,550 --> 00:09:00,590 out their shapes. Beautiful. 126 00:09:00,710 --> 00:09:10,400 So our X train set is about 401,000 rows with 102 features, 102 columns; our y train is about 401,000 127 00:09:10,430 --> 00:09:17,120 rows; and then validation is about eleven and a half thousand rows, with the same number of columns for 128 00:09:17,120 --> 00:09:22,760 X and no column dimension for y, because y is just one column. Let's have a look. 129 00:09:22,970 --> 00:09:25,810 Now look, this is what we do: make sure all of our data... 130 00:09:25,820 --> 00:09:27,470 Okay, so these are all sale prices. 131 00:09:28,280 --> 00:09:29,910 Beautiful. 132 00:09:30,340 --> 00:09:32,320 Well, okay.
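The one-line comma trick for the X/y split can be sketched like this. Toy frames stand in for df_train and df_val; in the real data X_train ends up with 102 feature columns rather than the one shown here.

```python
import pandas as pd

# Toy stand-ins for the df_train and df_val frames made earlier
df_train = pd.DataFrame({
    "saleYear":  [2010, 2011, 2011],
    "SalePrice": [10_000, 20_000, 15_000],
})
df_val = pd.DataFrame({
    "saleYear":  [2012, 2012],
    "SalePrice": [30_000, 25_000],
})

# Features (X) are every column except the target; labels (y) are SalePrice.
# Tuple assignment does both halves in one line
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train["SalePrice"]
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val["SalePrice"]

# Sanity check: row counts should match within each X/y pair
print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)
# (3, 1) (3,) (2, 1) (2,)
```

Note that the y shapes have no second dimension: each y is a one-dimensional Series, which is why the transcript says there are "no columns for y".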
133 00:09:32,330 --> 00:09:39,170 So we've got our data into train and validation sets. We've taken care of two of the three most important 134 00:09:39,170 --> 00:09:43,940 things, or most important sets; we'll have a look at the test set later. But this is what we've created: 135 00:09:43,940 --> 00:09:51,520 a training set and a validation set. It's time to keep building some more models, so we might end this 136 00:09:51,520 --> 00:09:58,120 video here. We still have to figure out a way to evaluate our machine learning model, so we'll probably 137 00:09:58,120 --> 00:09:59,620 have a look at that in the next video.