So the beautiful thing is that now our data has no missing values and has all been turned into numeric values, we should be able to build a machine learning model. So let's do that. We instantiated one way back up here as model, right back when we started the modelling section. We've covered a fair bit of ground so far; you should be proud. Modelling. Wonderful. We're going to do basically the exact same thing.

So let's build a machine learning model. We could copy this, but we're just going to write it out again and leave a bit of communication to ourselves: now that all of our data is numeric and our DataFrame has no missing values, we should be able to build a machine learning model. Wonderful.

So remember what our goal is: it's to find patterns in the data to predict the sale price. So that's what we'll have to do. Let's have a look, df_tmp.head(), one last look at our data before we model. So we need to drop the SalePrice column, so that we can build a machine learning model that takes in all of the other columns except for SalePrice, finds patterns, and then tries to predict SalePrice.
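The split described above can be sketched like this. This is a minimal sketch, assuming a DataFrame named df_tmp with a SalePrice target column (the tiny DataFrame here is a hypothetical stand-in for the real 412,000-row dataset):

```python
import pandas as pd

# Hypothetical miniature version of the preprocessed DataFrame (df_tmp):
# every column is numeric and there are no missing values.
df_tmp = pd.DataFrame({
    "YearMade": [1998, 2004, 2010, 2001],
    "MachineHoursCurrentMeter": [3500.0, 1200.0, 800.0, 2600.0],
    "SalePrice": [24000, 31000, 45000, 27500],
})

# Features (X): every column except the target.
# Labels (y): the target column we want to predict.
X = df_tmp.drop("SalePrice", axis=1)
y = df_tmp["SalePrice"]
```

The model will then learn patterns in X that map to y.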
So what we're going to do is set up a little timer magic. If you haven't seen this before, the double percentage sign (%%time) is a Jupyter Notebook magic function; you can search out a whole bunch of these. What this is going to do is basically just calculate how much time this particular cell takes to run, and I'm going to do it on my computer, which is a MacBook Pro.

How many rows do we have? We have four hundred thousand or so rows, so this may take a little while. In previous examples, when we've worked with less data, our machine learning models have been pretty quick, right? Because they only had to sort through a couple of hundred rows, but this is 412,000. I mean, of course there are much bigger datasets, but we're starting to get up there, right? We're in the hundreds of thousands of samples. Instantiate model.

So the reason for this timing is that our goal as a data scientist, as a machine learning engineer, is to do as many experiments as fast as possible. If you're training models on all of the data all the time, it's going to take a fairly long time. So I'm just going to demonstrate how long it takes my personal computer to use a baseline random forest regressor model to find the patterns in all 412,000 rows.
So don't worry, I'm not going to sit there and watch it run; once we kick it off, I'll pause the video and then come back and show you how long it took. The way of thinking I want you to start adopting is, when you're doing experiments, start to think to yourself: how can I reduce the amount of time it takes between my experiments? The reason being that you're not always trying to do the right thing; you're trying to figure out what's wrong, what doesn't work. So that's the rationale behind trying to speed up your time between experiments.

And as I type, as I talk, what we're doing here is just instantiating a random forest model. We're going to put random_state=42, that way our results will be reproducible. So this is all we're doing: we've imported the RandomForestRegressor class because we're working on a regression problem, and we're setting n_jobs equal to -1 because this is a fairly large dataset for what we've been working with so far, so I want to use all of the cores in my computer. And I want random_state to equal 42 so our results are reproducible. And then I'm going to fit the model to the data. So this is our X and this is our y. Remember, we're trying to predict SalePrice based on all the other columns except for SalePrice.
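Put together, the cell being described looks something like this sketch (in a Jupyter cell you would put %%time on the first line to get the timing; here a small random X and y stand in for the real 412,000-row feature matrix and SalePrice column):

```python
# In Jupyter, the first line of the cell would be:  %%time
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Hypothetical stand-in data for the real X (features) and y (sale prices).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 10000 + 25000 + rng.normal(scale=500, size=200)

# n_jobs=-1 uses every CPU core; random_state=42 makes results reproducible.
model = RandomForestRegressor(n_jobs=-1, random_state=42)
model.fit(X, y)
```

On the real dataset this fit is the step that takes minutes rather than milliseconds.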
So when I kick this off, it will take a few minutes, but I'm going to pause the video, and because we've got the little %%time magic function here, we can see how long it takes. So get ready to time travel in three, two, one. All right, we're back. That was pretty quick.

So our model is finished, and it says here that the wall time was six minutes and 58 seconds. So it took about seven minutes to go through 412,700 rows, which is actually pretty good, right? Think about how quickly you could look at that. How long would it take you to go through 412,000 different examples? You didn't have to wait for that time because I sped it up, but if you were to run this on your computer, it may take longer or it may take less time, depending on what sort of computer you have. I'm running this on a 13-inch MacBook Pro, I think from about 2016 or something like that. I'm not even exactly sure how much computing power mine has, but it's a laptop, so maybe not as much as a desktop, anyway. But you can imagine, as these numbers of rows start to really get up high, this could start to take a long time.
If our goal as machine learning engineers, if our goal as data scientists, is to reduce the amount of time between experiments, we're going to have to be pretty nifty when we try out different models, when we work on different data, because waiting seven minutes every time (even longer if this number were higher) for our cell to run is going to slow us down a fair bit. So we'll see a little trick soon for how we can improve this.

But finally, let's score the model, to see how our model did. If it's gone over 412,000 rows, surely it's found some patterns in here. So we're going to score it on the exact same data that it was trained on, so we'll copy this. And now remember, for a regression algorithm, the default score value returned here is going to be the coefficient of determination, which is R-squared. Now, if you need a refresher on what R-squared is, you can Google "what is coefficient of determination". But in other words, it's a score whose maximum is 1.0; it's 0 if all your model does is predict the mean of the SalePrice column, and it can go to negative infinity if your model is absolutely terrible. So we're looking for a value that's close to 1.

So again, this might take a little while, because the model has to compute the score of how it did across 412,000 rows... boom, 0.98. Wow.
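The scoring step, and the "predicting the mean scores exactly 0" claim, can be sketched like this, again with hypothetical stand-in data rather than the real bulldozers DataFrame (DummyRegressor is scikit-learn's built-in mean-predicting baseline):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
import numpy as np

# Hypothetical stand-in for the real features and sale prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_jobs=-1, random_state=42).fit(X, y)

# For a regressor, .score() returns the coefficient of determination (R^2).
# Scoring on the same data the model was trained on gives a value near 1.0.
r2_train = model.score(X, y)

# A baseline that only ever predicts the mean of y scores exactly 0.
baseline = DummyRegressor(strategy="mean").fit(X, y)
r2_baseline = baseline.score(X, y)
```

Note this R² is computed on the training data itself, which is exactly why the number should be treated with suspicion.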
If the maximum is 1.0, then 0.987 is so close, that's about 0.99. So, if the maximum is 1.0, does that mean a RandomForestRegressor has almost found a perfect pattern for predicting the sale price based on all the other columns in the DataFrame?

Not so fast. I want you to think about something before we get into the next video, where we're going to cover it: why is this metric not reliable? Right. So if we have a look here, we fitted the model on this data, and then we've evaluated it on the same data. Why doesn't this metric hold much water? Have a think about it; we'll talk about it in the next video.