0 1 00:00:00,360 --> 00:00:08,940 So the next question is: how well does the model fit to data? Now, as a challenge, 1 2 00:00:08,940 --> 00:00:14,900 do you remember how to print out a regression's r-squared value from the previous lessons? 2 3 00:00:14,960 --> 00:00:19,160 I want you to print out two different r-squared values. 3 4 00:00:19,160 --> 00:00:25,550 I want you to print out the r-squared value for the training dataset and the r-squared value for the 4 5 00:00:25,550 --> 00:00:27,610 testing dataset. 5 6 00:00:27,740 --> 00:00:37,060 I'll give you a few seconds to pause the video and give this a shot. And here's the solution - to obtain 6 7 00:00:37,240 --> 00:00:39,100 the r-squared, 7 8 00:00:39,100 --> 00:00:43,490 we access it through the regression object's score method. 8 9 00:00:43,540 --> 00:00:46,120 So my first print statement is gonna read 9 10 00:00:49,850 --> 00:01:04,410 "'Training data r-squared: ', regr.score(X_train, y_ 10 11 00:01:04,420 --> 00:01:06,350 train)". 11 12 00:01:06,410 --> 00:01:15,060 So this will be the r-squared of the training dataset; and the r-squared of the testing dataset will 12 13 00:01:15,080 --> 00:01:29,720 be "'Test data r-squared:', regr.score(X_test, y_ 13 14 00:01:30,200 --> 00:01:31,310 test)". 14 15 00:01:31,520 --> 00:01:34,910 So let's take a look at what these turn out to be. 15 16 00:01:37,050 --> 00:01:45,270 For our training data we get an r-squared of 0.75 and for our test dataset we get 16 17 00:01:45,390 --> 00:01:50,370 an r-squared of 0.67. 17 18 00:01:50,420 --> 00:01:55,340 Now these values are actually both quite high. 18 19 00:01:55,500 --> 00:02:00,270 I mean when you think about it, the real world is a very, very complicated place. 19 20 00:02:00,410 --> 00:02:09,380 But we've been able to explain approximately 75 percent of the variance in our house prices with just 20 21 00:02:09,470 --> 00:02:21,590 13 features. 13 things and we can explain a lot of what's going on in the real world. Now a question that 21 22 00:02:21,590 --> 00:02:31,180 you might ask looking at these two values is: why is the r-squared lower for the test dataset? 22 23 00:02:31,280 --> 00:02:39,220 And the answer is is that what we've done so far is calculate the thetas only on our training data. 23 24 00:02:39,410 --> 00:02:43,650 The algorithm hasn't seen the test data before. 24 25 00:02:43,670 --> 00:02:50,170 As such, the performance of our model on the data that a model hasn't seen is always gonna be worse. 25 26 00:02:50,210 --> 00:02:56,480 It makes sense for the r-squared to be lower on the test data than on the training data. 26 27 00:02:56,480 --> 00:03:03,320 However, running the model on data that's not in the sample actually gives us some idea of its predictive 27 28 00:03:03,320 --> 00:03:04,180 power. 28 29 00:03:04,190 --> 00:03:10,110 We get a little bit of a feeling for how our model would fare in the real world. 29 30 00:03:10,610 --> 00:03:17,280 And it's very encouraging to see that the r-squared even on the test data is still quite high. 30 31 00:03:17,540 --> 00:03:20,530 So our model actually performs relatively well. 31 32 00:03:22,410 --> 00:03:27,300 Okay, so we spec'd out our model and we've run our first regression. 32 33 00:03:27,300 --> 00:03:30,870 We did a quick sense check and looked at the performance. 33 34 00:03:30,870 --> 00:03:37,720 Now we're getting to the evaluation stage. Now before we move on to the next lesson, 34 35 00:03:37,740 --> 00:03:47,530 let's add another markdown cell above this one here and add the section heading "Multivariable Regression". 35 36 00:03:47,610 --> 00:03:50,270 That way we can keep track of what we're doing. 36 37 00:03:52,420 --> 00:03:55,950 The other thing I'm going to do is take these two lines of code, 37 38 00:03:56,320 --> 00:04:02,670 cut them and paste them down here. 38 39 00:04:02,950 --> 00:04:08,620 That way when I hit Shift+Enter we'll be able to see our data frame of coefficients. 39 40 00:04:08,620 --> 00:04:14,430 Otherwise the output is hidden by the print statements. 40 41 00:04:14,550 --> 00:04:15,780 There we go. 41 42 00:04:15,780 --> 00:04:17,520 Now we're ready to move on. 42 43 00:04:17,520 --> 00:04:18,900 I'll see you in the next lessons.