0
1
00:00:00,630 --> 00:00:04,490
Ok, so we've talked about the model we're gonna use
1

2
00:00:04,500 --> 00:00:09,440
and we've talked about where the term regression actually came from.
2

3
00:00:09,450 --> 00:00:11,690
Now it's time to write some Python code.
3

4
00:00:11,940 --> 00:00:19,380
One of the habits that I want us to get into when training our algorithms is to split up our dataset
4

5
00:00:19,470 --> 00:00:21,280
into two parts.
5

6
00:00:21,330 --> 00:00:29,520
We're going to shuffle all our data and divide our dataset into a training dataset and a testing dataset
6

7
00:00:29,520 --> 00:00:39,230
because we want our algorithm to learn those theta parameters based only on the training dataset.
7

8
00:00:39,540 --> 00:00:46,800
And this means that we can use the other part of our dataset, the part that hasn't been used for training, for testing, because
8

9
00:00:47,100 --> 00:00:55,110
with the testing dataset we can see how our algorithm performs out of sample, how it performs on a data
9

10
00:00:55,110 --> 00:00:57,450
set that it hasn't seen yet.
10

11
00:00:57,600 --> 00:01:05,190
Back in Jupyter notebook I'm going to add a markdown cell with a section heading that reads "Training
11

12
00:01:05,670 --> 00:01:12,350
& Test Dataset Split". There we go. And in the cell below,
12

13
00:01:12,390 --> 00:01:14,220
I'm going to create two more variables,
13

14
00:01:14,250 --> 00:01:21,060
the first one is gonna be called "prices" and I'm going to set that equal to "data['
14

15
00:01:21,660 --> 00:01:30,390
PRICE']", so prices is going to hold on to our series of price data and then we'll have "features" which is
15

16
00:01:30,390 --> 00:01:32,920
gonna be equal to "data",
16

17
00:01:33,300 --> 00:01:41,610
so our entire data frame, then dot, then I'm going to say "drop()" and then in the parentheses I'm going to have
17

18
00:01:42,480 --> 00:01:45,260
"'PRICE', 
18

19
00:01:45,620 --> 00:01:48,550
axis = 1".
19

20
00:01:48,680 --> 00:01:51,090
What did I just do here?
20

21
00:01:51,170 --> 00:02:00,920
I took our entire data dataframe and I've dropped one column from it, namely our target values and to
21

22
00:02:00,920 --> 00:02:03,690
accomplish this I've used the drop method.
22

23
00:02:04,040 --> 00:02:11,520
So this method will return a new dataframe which I've stored in a variable called features.
23

24
00:02:11,750 --> 00:02:16,250
But this data frame will not include the PRICE column.
24
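[Editor's sketch] A minimal illustration of the step just described. The small dataframe below is a stand-in for the lesson's "data" dataframe: the feature columns RM and AGE and all the values are made up for illustration, and only the PRICE column name comes from the lesson.

```python
import pandas as pd

# Stand-in for the lesson's `data` dataframe (feature columns and values are hypothetical).
data = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1, 6.9],
    "AGE": [65.2, 78.9, 61.1, 45.8],
    "PRICE": [24.0, 21.6, 34.7, 33.4],
})

prices = data["PRICE"]                  # a Series holding the target values
features = data.drop("PRICE", axis=1)   # axis=1 means "drop a column", not a row

print(list(features.columns))    # ['RM', 'AGE']
print("PRICE" in data.columns)   # True: drop() returned a new dataframe, `data` is unchanged
```

Because drop() returns a new dataframe by default, the original "data" still contains the PRICE column afterwards.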

25
00:02:16,250 --> 00:02:24,590
The second argument here "axis = 1" is there to specify that we're looking to drop a column as
25

26
00:02:24,590 --> 00:02:29,330
opposed to a row. For a row you'd have "axis = 0", for a column,
26

27
00:02:29,420 --> 00:02:35,630
use "axis = 1". To split up our dataset in our notebook
27

28
00:02:35,630 --> 00:02:41,960
we're going to make use of scikit-learn's capabilities once more, but we're going to have to import this
28

29
00:02:41,960 --> 00:02:42,890
capability,
29

30
00:02:42,950 --> 00:02:46,810
we're going to have to add an import statement at the very top.
30

31
00:02:46,910 --> 00:02:57,250
I'm gonna add "from sklearn.model_selection
31

32
00:02:58,190 --> 00:03:05,010
import train_test_split",
32

33
00:03:05,120 --> 00:03:13,110
so that's all lower case "from sklearn.model_selection import train_test_
33

34
00:03:13,200 --> 00:03:14,930
split".
34

35
00:03:15,000 --> 00:03:20,640
Let me Shift+Enter on this cell and then go back to where we left off.
35

36
00:03:20,730 --> 00:03:28,620
Now this train_test_split function will actually return to us four values - a
36

37
00:03:28,620 --> 00:03:33,800
training and a testing data set for both our features and our targets.
37

38
00:03:33,870 --> 00:03:42,960
So we'll have X_train, X_test, y_train and y_test - four values that are going to be returned from this function.
38

39
00:03:43,350 --> 00:03:50,550
When a function returns multiple values, the Python syntax we'll use to store those values is called tuple
39

40
00:03:50,970 --> 00:03:52,160
unpacking.
40
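[Editor's sketch] Unpacking can be tried out without scikit-learn at all. The helper function min_and_max below is hypothetical and exists only for this illustration.

```python
def min_and_max(values):
    """Return two values at once; Python packs them into a tuple."""
    return min(values), max(values)

# One name on the left for each returned value, separated by commas.
lowest, highest = min_and_max([3, 9, 4])
print(lowest, highest)  # 3 9
```

The same pattern works for any sequence of the right length, so it applies equally to a function that returns four values.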

41
00:03:52,440 --> 00:03:58,640
So let's create the variables that will hold onto the output from this function.
41

42
00:03:58,680 --> 00:04:01,460
It's gonna be "X_train,
42

43
00:04:01,470 --> 00:04:13,190
X_test, y_train, y_
43

44
00:04:13,250 --> 00:04:15,060
test"
44

45
00:04:15,060 --> 00:04:24,270
and that's gonna be set equal to "train_test_split()", so the function
45

46
00:04:24,270 --> 00:04:25,440
itself.
46

47
00:04:25,440 --> 00:04:31,100
And now all we have to do is provide some arguments to this function call.
47

48
00:04:31,200 --> 00:04:35,020
So what arguments does this function need?
48

49
00:04:35,040 --> 00:04:41,520
Well, it has to know which features and which targets to shuffle and to split up.
49

50
00:04:41,940 --> 00:04:48,090
So the first argument is gonna be our features variable which we've created above.
50

51
00:04:48,090 --> 00:04:53,250
The second argument is gonna be our prices variable which we've created above.
51

52
00:04:53,370 --> 00:05:02,310
The third argument is gonna be what kind of split we want to make. Do we want to make a 50/50 split?
52

53
00:05:02,320 --> 00:05:05,020
Do we want to make a 60/40 split?
53

54
00:05:05,040 --> 00:05:07,950
What kind of split do we want this function to make?
54

55
00:05:08,790 --> 00:05:19,010
I'm gonna go with an 80/20 split, so I'm going to say "test_size = 0.2".
55

56
00:05:19,050 --> 00:05:26,480
What this means is that our test data set is going to be 20% of the total.
56

57
00:05:26,490 --> 00:05:31,800
Now I could leave this function like this, but I'm going to add one more argument.
57

58
00:05:31,800 --> 00:05:38,550
And the reason is that this function will shuffle and split the data,
58

59
00:05:38,940 --> 00:05:42,370
however this shuffling is random, right?
59

60
00:05:42,420 --> 00:05:46,570
So my shuffle will be different from your shuffle.
60

61
00:05:46,740 --> 00:05:56,640
If we want to get comparable results, then we have to shuffle our data set in exactly the same way. And
61

62
00:05:56,820 --> 00:05:58,700
to ensure that you and I can do that,
62

63
00:05:58,890 --> 00:06:02,010
let's supply another argument,
63

64
00:06:02,010 --> 00:06:11,190
the random state. So I'm going to say "random_state = " and then we can pick a number. As
64

65
00:06:11,190 --> 00:06:13,620
long as we pick the same number,
65

66
00:06:13,620 --> 00:06:14,520
we're gonna be good,
66

67
00:06:14,520 --> 00:06:21,720
we're gonna get the same shuffling. So I'm gonna pick 10, "random_state = 10".
67
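[Editor's sketch] Putting the pieces together, the finished call might look roughly like this. The 50-row dataframe is randomly generated stand-in data, and the feature column names RM and AGE are assumptions; only PRICE, test_size = 0.2 and random_state = 10 come from the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 rows, two made-up feature columns plus PRICE.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "RM": rng.normal(6, 1, 50),
    "AGE": rng.uniform(10, 100, 50),
    "PRICE": rng.normal(22, 5, 50),
})
prices = data["PRICE"]
features = data.drop("PRICE", axis=1)

# Shuffle the rows, then hold back 20% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=10)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```

Each row ends up in exactly one of the two sets, so the training and testing indices never overlap.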

68
00:06:21,720 --> 00:06:28,970
Let me come over here and hit Enter on my keyboard, so that my line doesn't get too long and wraps a
68

69
00:06:29,020 --> 00:06:32,350
bit more nicely like this. Okay,
69

70
00:06:32,380 --> 00:06:37,960
so there's actually quite a lot going on in this line of code because we are shuffling all our data and
70

71
00:06:37,960 --> 00:06:39,770
then splitting it up.
71

72
00:06:39,790 --> 00:06:46,900
The thing about the shuffling is that there is a random number generator which will generate this randomness
72

73
00:06:46,900 --> 00:06:48,990
and do the shuffling.
73

74
00:06:49,360 --> 00:06:56,230
The last argument that we supplied here, this random_state argument basically draws a line
74

75
00:06:56,230 --> 00:07:03,280
in the sand to kind of set the starting point for the random number generator that does our shuffling.
75

76
00:07:03,790 --> 00:07:05,980
If you and I both have the same starting point,
76

77
00:07:06,220 --> 00:07:08,400
then we get the same shuffle.
77
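[Editor's sketch] One way to convince yourself of this is to run the split twice with the same random_state and compare the results. The array of 20 row numbers is a made-up stand-in for a real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20)  # stand-in for the row labels of a dataset

# Same starting point for the random number generator: identical shuffles.
_, test_a = train_test_split(rows, test_size=0.2, random_state=10)
_, test_b = train_test_split(rows, test_size=0.2, random_state=10)

# A different starting point will usually select different rows.
_, test_c = train_test_split(rows, test_size=0.2, random_state=11)

print(np.array_equal(test_a, test_b))  # True
print(test_a, test_c)
```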

78
00:07:08,620 --> 00:07:14,490
Now a question you might ask is: why shuffle the data?
78

79
00:07:14,590 --> 00:07:16,540
What's the point?
79

80
00:07:16,550 --> 00:07:24,250
And the answer to that is that sometimes when you get a fresh dataset straight out of the box the data
80

81
00:07:24,250 --> 00:07:26,260
is actually in a certain order.
81

82
00:07:26,290 --> 00:07:29,420
A good analogy is a deck of playing cards.
82

83
00:07:29,530 --> 00:07:36,410
Did you ever buy a new deck of playing cards? Or did you ever watch a magician perform a con trick on
83

84
00:07:36,420 --> 00:07:38,030
stage with a new deck?
84

85
00:07:38,170 --> 00:07:44,170
When you take off the plastic wrapper on a deck of playing cards and you take the cards out of the box
85

86
00:07:44,500 --> 00:07:50,350
and you look through them, you'll notice that all the cards are in a certain order.
86

87
00:07:50,360 --> 00:07:56,570
Now obviously this means that before playing a card game with this deck, you have to shuffle the cards,
87

88
00:07:56,650 --> 00:08:01,830
otherwise you end up with a very, very terrible game of poker.
88

89
00:08:01,840 --> 00:08:05,370
In other words, the cards dealt have to be random,
89

90
00:08:05,590 --> 00:08:10,880
otherwise it defeats the purpose. And the same holds true for our training
90

91
00:08:10,960 --> 00:08:17,650
and our test datasets. The rows or the data points that are allocated to these datasets have to be
91

92
00:08:17,650 --> 00:08:19,150
random as well.
92

93
00:08:19,210 --> 00:08:25,000
There can't be a clear pattern in how the data points are allocated.
93

94
00:08:25,420 --> 00:08:33,160
And that's why it's customary to shuffle any dataset as well. So customary in fact that
94

95
00:08:33,160 --> 00:08:40,450
scikit-learn's split function does the shuffling of our rows in our dataframe automatically, and it's why we use
95

96
00:08:40,450 --> 00:08:48,190
this random state argument in the function call. Cool! And just like that we've created an 80/20 split
96

97
00:08:48,490 --> 00:08:50,990
with our data from our dataframe.
97

98
00:08:51,310 --> 00:08:53,370
But you don't have to take my word for it.
98

99
00:08:54,010 --> 00:08:57,580
You can verify the 80/20 split yourself.
99

100
00:08:57,580 --> 00:09:07,990
The percent of the training set will be the number of rows in X_train divided by the total number of
100

101
00:09:07,990 --> 00:09:10,450
rows in the dataset as a whole.
101

102
00:09:10,450 --> 00:09:19,510
We can calculate this using the length function, so "len(X_train)/len(
102

103
00:09:19,780 --> 00:09:30,200
features)" will do this calculation and that is equal to 0.7984.
103

104
00:09:30,220 --> 00:09:33,670
So very, very close to 80%.
104

105
00:09:33,850 --> 00:09:39,250
If you wanted to calculate the percentage of the test dataset,
105

106
00:09:39,490 --> 00:09:43,200
so "% of test data set",
106

107
00:09:43,540 --> 00:09:44,940
you could do the same thing,
107

108
00:09:44,950 --> 00:09:52,240
you could get the number of rows in the test dataset and divide by the number of rows in the feature
108

109
00:09:52,410 --> 00:09:54,000
dataset.
109

110
00:09:54,010 --> 00:10:03,730
Another way to get the number of rows is to say "X_test.shape" and grab the first element
110

111
00:10:04,060 --> 00:10:05,420
in the shape attribute,
111

112
00:10:05,500 --> 00:10:06,970
that's the number of rows.
112

113
00:10:07,060 --> 00:10:15,680
So "X_test.shape[0]/features.shape[0]"
113

114
00:10:16,050 --> 00:10:20,520
and that's approximately equal to 0.2. Brilliant!
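
[Editor's sketch] The check can be reproduced on stand-in data. The row count of 506 is an assumption, chosen because it gives the same 0.7984 training share quoted above; the column names x1 and x2 and all the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 506  # assumed row count, picked to match the 0.7984 figure
features = pd.DataFrame({"x1": np.arange(n_rows), "x2": np.arange(n_rows) * 2.0})
prices = pd.Series(np.arange(n_rows, dtype=float), name="PRICE")

X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=10)

# Two equivalent ways of counting rows: len() and shape[0].
train_share = len(X_train) / len(features)
test_share = X_test.shape[0] / features.shape[0]

print(round(train_share, 4), round(test_share, 4))  # 0.7984 0.2016
```

With 506 rows the test set gets 102 rows rather than 101.2, which is why the shares come out very close to 80/20 rather than exactly 80/20.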