0 1 00:00:00,630 --> 00:00:04,490 Ok, so we've talked about the model we're gonna use 1 2 00:00:04,500 --> 00:00:09,440 and we've talked about where the term regression actually came from. 2 3 00:00:09,450 --> 00:00:11,690 Now it's time to write some Python code. 3 4 00:00:11,940 --> 00:00:19,380 One of the habits that I want us to get into when training our algorithms is to split up our dataset 4 5 00:00:19,470 --> 00:00:21,280 into two parts. 5 6 00:00:21,330 --> 00:00:29,520 We're going to shuffle all our data and divide our dataset into a training dataset and a testing dataset 6 7 00:00:29,520 --> 00:00:39,230 because we want our algorithm to learn those theta parameters based only on the training dataset. 7 8 00:00:39,540 --> 00:00:46,800 And this means that we can use the other part of our dataset, the part that hasn't been used for training, for testing, because 8 9 00:00:47,100 --> 00:00:55,110 with the testing dataset we can see how our algorithm performs out of sample, how it performs on a 9 10 00:00:55,110 --> 00:00:57,450 dataset that it hasn't seen yet. 10 11 00:00:57,600 --> 00:01:05,190 Back in the Jupyter notebook I'm going to add a markdown cell with a section heading that reads "Training 11 12 00:01:05,670 --> 00:01:12,350 & Test Dataset Split". There we go. And in the cell below, 12 13 00:01:12,390 --> 00:01:14,220 I'm going to create two more variables, 13 14 00:01:14,250 --> 00:01:21,060 the first one is gonna be called "prices" and I'm going to set that equal to 14 15 00:01:21,660 --> 00:01:30,390 "data['PRICE']", so prices is going to hold on to our series of price data and then we'll have "features" which is 15 16 00:01:30,390 --> 00:01:32,920 gonna be equal to "data", 16 17 00:01:33,300 --> 00:01:41,610 so our entire dataframe, then dot, then I'm going to say "drop()" and then in the parentheses I'm going to have 17 18 00:01:42,480 --> 00:01:45,260 "'PRICE', 18 19 00:01:45,620 --> 00:01:48,550 axis = 1". 19 20 00:01:48,680 --> 00:01:51,090 What did I just do here? 
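Written out, the cell dictated above might look roughly like the sketch below. The three-row "data" dataframe and the "RM" and "AGE" column names are made-up stand-ins so the snippet runs on its own; only "prices", "features", "PRICE" and the drop call come from the narration.

```python
import pandas as pd

# Hypothetical stand-in for the course's "data" dataframe (made-up values).
data = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1],
    "AGE": [65.2, 78.9, 45.8],
    "PRICE": [24.0, 21.6, 34.7],
})

# The target values, kept as a pandas Series.
prices = data["PRICE"]

# drop() returns a NEW dataframe; axis=1 means "drop a column"
# (axis=0 would drop a row with that label instead).
features = data.drop("PRICE", axis=1)

print(list(features.columns))   # the remaining feature columns
print("PRICE" in data.columns)  # the original dataframe is left untouched
```

Because drop() returns a copy rather than changing "data" in place, the PRICE column is still available in the original dataframe afterwards.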
20 21 00:01:51,170 --> 00:02:00,920 I took our entire "data" dataframe and I've dropped one column from it, namely our target values, and to 21 22 00:02:00,920 --> 00:02:03,690 accomplish this I've used the drop method. 22 23 00:02:04,040 --> 00:02:11,520 So this method will return a new dataframe which I've stored in a variable called features. 23 24 00:02:11,750 --> 00:02:16,250 But this dataframe will not include the PRICE column. 24 25 00:02:16,250 --> 00:02:24,590 The second argument here, "axis = 1", is there to specify that we're looking to drop a column as 25 26 00:02:24,590 --> 00:02:29,330 opposed to a row. For a row you'd have "axis = 0", for a column, 26 27 00:02:29,420 --> 00:02:35,630 use "axis = 1". To split up our dataset in our notebook 27 28 00:02:35,630 --> 00:02:41,960 we're going to make use of scikit-learn's capabilities once more, but we're going to have to import this 28 29 00:02:41,960 --> 00:02:42,890 capability, 29 30 00:02:42,950 --> 00:02:46,810 we're going to have to add an import statement at the very top. 30 31 00:02:46,910 --> 00:02:57,250 I'm gonna add "from sklearn.model_selection 31 32 00:02:58,190 --> 00:03:05,010 import train_test_split", 32 33 00:03:05,120 --> 00:03:13,110 so that's all lower case "from sklearn.model_selection import 33 34 00:03:13,200 --> 00:03:14,930 train_test_split". 34 35 00:03:15,000 --> 00:03:20,640 Let me Shift+Enter on this cell and then go back to where we left off. 35 36 00:03:20,730 --> 00:03:28,620 Now this train_test_split function will actually return to us four values - a 36 37 00:03:28,620 --> 00:03:33,800 training and a testing dataset for both our features and our targets. 37 38 00:03:33,870 --> 00:03:42,960 So we'll have X_train, X_test, y_train and y_test - four values that are going to be returned from this function. 
38 39 00:03:43,350 --> 00:03:50,550 When a function returns multiple values, the Python syntax we'll use to store those values is called tuple 39 40 00:03:50,970 --> 00:03:52,160 unpacking. 40 41 00:03:52,440 --> 00:03:58,640 So let's create the variables that will hold onto the output from this function. 41 42 00:03:58,680 --> 00:04:01,460 It's gonna be "X_train, 42 43 00:04:01,470 --> 00:04:13,190 X_test, y_train, 43 44 00:04:13,250 --> 00:04:15,060 y_test" 44 45 00:04:15,060 --> 00:04:24,270 and that's gonna be set equal to "train_test_split()", so the function 45 46 00:04:24,270 --> 00:04:25,440 itself. 46 47 00:04:25,440 --> 00:04:31,100 And now all we have to do is provide some arguments to this function call. 47 48 00:04:31,200 --> 00:04:35,020 So what arguments does this function need? 48 49 00:04:35,040 --> 00:04:41,520 Well, it has to know which features and which targets to shuffle and to split up. 49 50 00:04:41,940 --> 00:04:48,090 So the first argument is gonna be our features variable which we've created above. 50 51 00:04:48,090 --> 00:04:53,250 The second argument is gonna be our prices variable which we've created above. 51 52 00:04:53,370 --> 00:05:02,310 The third argument is gonna be what kind of split we want to make. Do we want to make a 50/50 split? 52 53 00:05:02,320 --> 00:05:05,020 Do we want to make a 60/40 split? 53 54 00:05:05,040 --> 00:05:07,950 What kind of split do we want this function to make? 54 55 00:05:08,790 --> 00:05:19,010 I'm gonna go with an 80/20 split, so I'm going to say "test_size = 0.2". 55 56 00:05:19,050 --> 00:05:26,480 What this means is that our test dataset is going to be 20% of the total. 56 57 00:05:26,490 --> 00:05:31,800 Now I could leave this function like this, but I'm going to add one more argument. 57 58 00:05:31,800 --> 00:05:38,550 And the reason is that this function will shuffle and split the data, 58 59 00:05:38,940 --> 00:05:42,370 however this shuffling is random, right? 
59 60 00:05:42,420 --> 00:05:46,570 So my shuffle will be different from your shuffle. 60 61 00:05:46,740 --> 00:05:56,640 If we want to get comparable results, then we have to shuffle our dataset in exactly the same way. And 61 62 00:05:56,820 --> 00:05:58,700 to ensure that you and I can do that, 62 63 00:05:58,890 --> 00:06:02,010 let's supply another argument, 63 64 00:06:02,010 --> 00:06:11,190 the random state. So I'm going to say "random_state = " and then we can pick a number. As 64 65 00:06:11,190 --> 00:06:13,620 long as we pick the same number, 65 66 00:06:13,620 --> 00:06:14,520 we're gonna be good, 66 67 00:06:14,520 --> 00:06:21,720 we're gonna get the same shuffling. So I'm gonna pick 10, "random_state = 10". 67 68 00:06:21,720 --> 00:06:28,970 Let me come over here and hit Enter on my keyboard, so that my line doesn't get too long and wraps a 68 69 00:06:29,020 --> 00:06:32,350 bit more nicely like this. Okay, 69 70 00:06:32,380 --> 00:06:37,960 so there's actually quite a lot going on in this line of code because we are shuffling all our data and 70 71 00:06:37,960 --> 00:06:39,770 then splitting it up. 71 72 00:06:39,790 --> 00:06:46,900 The thing about the shuffling is that there is a random number generator which will generate this randomness 72 73 00:06:46,900 --> 00:06:48,990 and do the shuffling. 73 74 00:06:49,360 --> 00:06:56,230 The last argument that we supplied here, this random_state argument, basically draws a line 74 75 00:06:56,230 --> 00:07:03,280 in the sand to kind of set the starting point for the random number generator that does our shuffling. 75 76 00:07:03,790 --> 00:07:05,980 If you and I both have the same starting point, 76 77 00:07:06,220 --> 00:07:08,400 then we get the same shuffle. 77 78 00:07:08,620 --> 00:07:14,490 Now a question you might ask is: why shuffle the data? 78 79 00:07:14,590 --> 00:07:16,540 What's the point? 
79 80 00:07:16,550 --> 00:07:24,250 And the answer to that is that sometimes when you get a fresh dataset straight out of the box the data 80 81 00:07:24,250 --> 00:07:26,260 is actually in a certain order. 81 82 00:07:26,290 --> 00:07:29,420 A good analogy is a deck of playing cards. 82 83 00:07:29,530 --> 00:07:36,410 Did you ever buy a new deck of playing cards? Or did you ever watch a magician perform a con trick on 83 84 00:07:36,420 --> 00:07:38,030 stage with a new deck? 84 85 00:07:38,170 --> 00:07:44,170 When you take off the plastic wrapper on a deck of playing cards and you take the cards out of the box 85 86 00:07:44,500 --> 00:07:50,350 and you look through them, you'll notice that all the cards are in a certain order. 86 87 00:07:50,360 --> 00:07:56,570 Now obviously this means that before playing a card game with this deck, you have to shuffle the cards, 87 88 00:07:56,650 --> 00:08:01,830 otherwise you end up with a very, very terrible game of poker. 88 89 00:08:01,840 --> 00:08:05,370 In other words, the cards dealt have to be random, 89 90 00:08:05,590 --> 00:08:10,880 otherwise it defeats the purpose. And the same holds true for our training 90 91 00:08:10,960 --> 00:08:17,650 and our test datasets. The rows or the data points that are allocated to these datasets have to be 91 92 00:08:17,650 --> 00:08:19,150 random as well. 92 93 00:08:19,210 --> 00:08:25,000 There can't be a clear pattern in how the data points are allocated. 93 94 00:08:25,420 --> 00:08:33,160 And that's why it's customary to shuffle any dataset as well. So customary in fact that 94 95 00:08:33,160 --> 00:08:40,450 scikit-learn's split function does the shuffling of the rows in our dataframe automatically, which is why we use 95 96 00:08:40,450 --> 00:08:48,190 this random state argument in the function call. Cool! And just like that we've created an 80/20 split 96 97 00:08:48,490 --> 00:08:50,990 with our data from our dataframe. 
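Put together, the import, the tuple unpacking and the four arguments described above could look like the sketch below. The 506-row dataframe and its "RM" and "AGE" columns are assumptions invented so the snippet runs by itself; the train_test_split call, "test_size = 0.2" and "random_state = 10" are the parts taken from the narration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data; the row count and the RM/AGE columns are assumptions.
n_rows = 506
data = pd.DataFrame({
    "RM": np.linspace(4.0, 8.0, n_rows),
    "AGE": np.linspace(10.0, 100.0, n_rows),
    "PRICE": np.linspace(5.0, 50.0, n_rows),
})
prices = data["PRICE"]
features = data.drop("PRICE", axis=1)

# Tuple unpacking: the function returns four objects in this order.
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=10)

# Calling it again with the same random_state reproduces the same shuffle.
X_train_again, X_test_again, _, _ = train_test_split(
    features, prices, test_size=0.2, random_state=10)

print(X_train.index.equals(X_train_again.index))  # same rows, same order
print(X_train.index.equals(y_train.index))        # features and targets stay aligned
```

The number 10 itself carries no meaning; any integer works as long as everyone who wants matching results uses the same one.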
97 98 00:08:51,310 --> 00:08:53,370 But you don't have to take my word for it. 98 99 00:08:54,010 --> 00:08:57,580 You can verify the 80/20 split yourself. 99 100 00:08:57,580 --> 00:09:07,990 The percent of the training set will be the number of rows in X_train divided by the total number of 100 101 00:09:07,990 --> 00:09:10,450 rows in the dataset as a whole. 101 102 00:09:10,450 --> 00:09:19,510 We can calculate this using the length function, so 102 103 00:09:19,780 --> 00:09:30,200 "len(X_train)/len(features)" will do this calculation and that is equal to 0.7984. 103 104 00:09:30,220 --> 00:09:33,670 So very, very close to 80%. 104 105 00:09:33,850 --> 00:09:39,250 If you wanted to calculate the percentage of the test dataset, 105 106 00:09:39,490 --> 00:09:43,200 so "% of test data set", 106 107 00:09:43,540 --> 00:09:44,940 you could do the same thing, 107 108 00:09:44,950 --> 00:09:52,240 you could get the number of rows in the test dataset and divide by the number of rows in the feature 108 109 00:09:52,410 --> 00:09:54,000 dataset. 109 110 00:09:54,010 --> 00:10:03,730 Another way to get the number of rows is to say "X_test.shape" and grab the first element 110 111 00:10:04,060 --> 00:10:05,420 in the shape attribute, 111 112 00:10:05,500 --> 00:10:06,970 that's the number of rows. 112 113 00:10:07,060 --> 00:10:15,680 So "X_test.shape[0]/features.shape[0]" 113 114 00:10:16,050 --> 00:10:20,520 and that's equal to about 0.2016, so roughly 20%. Brilliant!
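Here is a small sketch of both checks side by side. It assumes a 506-row dataset, a row count chosen because it is consistent with the 0.7984 figure quoted above; the single "RM" column and the values in it are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data; 506 rows is an assumption that matches the 0.7984 ratio.
features = pd.DataFrame({"RM": range(506)})
prices = pd.Series(range(506), name="PRICE")

X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=10)

# Share of rows in the training set, using len().
train_share = len(X_train) / len(features)

# Share of rows in the test set, using the first element of .shape.
test_share = X_test.shape[0] / features.shape[0]

print(round(train_share, 4))  # 0.7984
print(round(test_share, 4))   # 0.2016
```

The shares are not exactly 0.8 and 0.2 because 506 rows cannot be cut into a whole number of rows at exactly 20%, so the test set ends up with 102 rows and the training set with 404.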