0 1 00:00:00,240 --> 00:00:08,640 Now that we've shuffled and split our data into training and test datasets, we can finally run our regression 1 2 00:00:08,670 --> 00:00:12,200 using scikit-learn. To do that, 2 3 00:00:12,300 --> 00:00:21,510 I'm going to scroll back up and import this capability, so I'm going to say "from sklearn.linear_ 3 4 00:00:21,630 --> 00:00:32,970 model import LinearRegression", I'm going to hit Shift+Enter on this cell, 4 5 00:00:33,600 --> 00:00:41,850 scroll back down where I left off, and then I'm going to create a regression object called "regr", set that 5 6 00:00:41,850 --> 00:00:52,530 equal to "LinearRegression()" and then I'm going to use this regression object and call 6 7 00:00:52,560 --> 00:01:01,470 the "fit" method on it, so "regr.fit()" and now what do I add between these parentheses? 7 8 00:01:02,160 --> 00:01:06,420 What do I give as an argument to our fit method? 8 9 00:01:09,580 --> 00:01:14,330 The data that we should supply is our training datasets, right? 9 10 00:01:14,350 --> 00:01:23,710 So, X_train being the training features and y_train being the training target 10 11 00:01:23,710 --> 00:01:33,460 values. And hitting Shift+Enter on this will train our model. Great! But we don't see any output on the 11 12 00:01:33,460 --> 00:01:39,610 thetas that we've just estimated and also we don't even know how well this model fits our data. Now, 12 13 00:01:39,850 --> 00:01:47,140 before printing out these values, before printing out the values of our theta parameters, let's make some 13 14 00:01:47,140 --> 00:01:54,670 predictions on what we expect to see. Making these predictions is a very important step because it provides 14 15 00:01:54,700 --> 00:02:02,050 a sense check on what the computer is going to spit out to us. We should never blindly trust the numbers 15 16 00:02:02,110 --> 00:02:06,020 that we get back. That's how silly, silly mistakes are made.
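As a rough sketch of the import-and-fit steps just described: the snippet below uses a small synthetic dataset as a stand-in, because the notebook's real features are not reproduced here. The two made-up columns, the numbers used to generate them, and the `test_size` and `random_state` values are all assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the course dataset).
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "RM": rng.normal(6, 1, 200),      # made-up "rooms" column
    "LSTAT": rng.normal(12, 4, 200),  # made-up "LSTAT" column
})
# Made-up relationship: price rises with RM and falls with LSTAT.
target = 3.0 * features["RM"] - 0.5 * features["LSTAT"] + rng.normal(0, 0.5, 200)

# Shuffle and split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=10
)

# Create the regression object and train it on the training data only.
regr = LinearRegression()
regr.fit(X_train, y_train)
```

Note that `fit()` stores the estimates on the object rather than displaying them, which is why nothing about the thetas shows up at this stage.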
16 17 00:02:06,070 --> 00:02:13,300 Besides, we've done all this data exploration work up until now and we can use the knowledge that we 17 18 00:02:13,300 --> 00:02:15,620 gained to make some predictions. 18 19 00:02:15,820 --> 00:02:23,150 For example, we've used seaborn's pairplot to run individual regressions against our target prices. 19 20 00:02:23,260 --> 00:02:30,340 Also, we've created this wonderful correlation matrix to show the correlations of the individual features 20 21 00:02:30,850 --> 00:02:37,510 with our target. We know which correlations were positive and which correlations were negative. 21 22 00:02:37,540 --> 00:02:46,810 So given that this is what our model looks like, let's make some predictions on the signs of the coefficients. 22 23 00:02:47,390 --> 00:02:52,850 Let's predict what kind of signs we're going to see on these theta parameters. 23 24 00:02:52,960 --> 00:03:01,270 So, get out a piece of paper and write down if you expect the theta parameter for RM to be positive 24 25 00:03:01,630 --> 00:03:13,030 or negative, and then do the same thing for NOX, PTRATIO, CRIM, DIS and LSTAT. Did you pause 25 26 00:03:13,030 --> 00:03:15,160 the video and write this down? 26 27 00:03:15,160 --> 00:03:17,830 There is actually one more prediction that we said we would make. 27 28 00:03:17,830 --> 00:03:24,580 Remember how a few lessons ago, I asked you if you thought that being next to the Charles River was a good 28 29 00:03:24,580 --> 00:03:27,900 thing for property prices or a bad thing? 29 30 00:03:27,910 --> 00:03:36,740 Well now it's time to also add your prediction for the sign of the CHAS dummy variable as well. 30 31 00:03:36,790 --> 00:03:37,200 Okay. 31 32 00:03:37,240 --> 00:03:40,270 So here's the solution to this challenge.
32 33 00:03:40,450 --> 00:03:51,040 I'm going to print out the value of our intercept and it's gonna be "regr.intercept_" 33 34 00:03:51,670 --> 00:04:00,130 and then I'm going to print out the values of all the coefficients. And to format these values nicely I'm going to 34 35 00:04:00,130 --> 00:04:01,360 actually put them in a data frame. 35 36 00:04:01,360 --> 00:04:11,320 So I'm going to say "pd.DataFrame()" and then our first argument 36 37 00:04:11,410 --> 00:04:13,950 is our data that we're gonna supply 37 38 00:04:14,080 --> 00:04:20,800 and these are the coefficients which we can find under "regr.coef_". 38 39 00:04:20,800 --> 00:04:25,600 The second argument that we're going to supply is called index, 39 40 00:04:25,600 --> 00:04:28,060 so "index = " 40 41 00:04:28,060 --> 00:04:28,580 and then what 41 42 00:04:28,570 --> 00:04:37,270 we're going to supply as an index are the column names, so "X_train.columns". 42 43 00:04:37,390 --> 00:04:37,720 Yeah. 43 44 00:04:38,080 --> 00:04:40,720 So these are the column names. 44 45 00:04:40,720 --> 00:04:46,190 That way we can have the names of the coefficients next to the values. 45 46 00:04:46,210 --> 00:04:48,220 Let me show you what this looks like. 46 47 00:04:48,370 --> 00:04:51,110 Voila! We get something like this. 47 48 00:04:51,130 --> 00:04:58,220 We can see our intercept printed here and we get the name and the value of the different coefficients. 48 49 00:04:58,240 --> 00:05:04,530 Now I'm going to come back up here into the code where I'm creating our data frame and I'm going to 49 50 00:05:04,540 --> 00:05:07,480 specify a third argument called columns. 50 51 00:05:07,500 --> 00:05:08,160 Yeah. 51 52 00:05:08,290 --> 00:05:17,680 And I set that equal to "['coef']". When I hit Shift+Enter now, you'll 52 53 00:05:17,680 --> 00:05:19,320 see this little 0 here, 53 54 00:05:19,330 --> 00:05:24,700 this lack of a column name, changed to the value "coef".
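Here is a minimal sketch of the printing step described above, assuming a fitted model called `regr` and a training DataFrame called `X_train` as in the lesson. The data is again a synthetic stand-in with made-up columns and numbers, so the printed values will not match the ones on screen in the video.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training data (an assumption, not the course dataset).
rng = np.random.default_rng(1)
X_train = pd.DataFrame({
    "RM": rng.normal(6, 1, 100),
    "CHAS": rng.integers(0, 2, 100),  # dummy variable: 0 or 1
})
y_train = 3.0 * X_train["RM"] + 2.0 * X_train["CHAS"] + rng.normal(0, 0.5, 100)

regr = LinearRegression()
regr.fit(X_train, y_train)

# The intercept is a single number.
print("Intercept:", regr.intercept_)

# One row per feature: the index carries the feature names and
# columns=['coef'] replaces the default column label 0.
coef_table = pd.DataFrame(data=regr.coef_, index=X_train.columns, columns=["coef"])
print(coef_table)
```

Leaving out `columns=["coef"]` still works; pandas then labels the single column `0`, which is the little 0 mentioned above.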
54 55 00:05:24,850 --> 00:05:36,520 So "columns = " and then a list of one value, namely the string "coef", will add a column name to our data 55 56 00:05:36,520 --> 00:05:38,610 frame here. 56 57 00:05:38,710 --> 00:05:39,180 All right. 57 58 00:05:39,210 --> 00:05:42,230 So we've got our results. 58 59 00:05:42,250 --> 00:05:45,850 How do they stack up against your predictions? 59 60 00:05:45,860 --> 00:05:53,050 Remember, we're just looking at the sign of the coefficients here for now. Are the actual coefficients 60 61 00:05:53,260 --> 00:06:01,470 positive or negative according to our model? Did you predict the right sign for the coefficients? 61 62 00:06:01,470 --> 00:06:10,710 Because what we see here is that more crime is bad for house prices and also more pollution is bad 62 63 00:06:10,800 --> 00:06:12,140 for house prices. 63 64 00:06:12,420 --> 00:06:16,560 More rooms, on the other hand, is good for house prices. 64 65 00:06:16,560 --> 00:06:25,740 It's a positive coefficient. And more students in class is also a negative factor for the house prices, 65 66 00:06:26,220 --> 00:06:32,690 as are higher values for LSTAT. Now looking at this, 66 67 00:06:32,830 --> 00:06:40,390 I think this is some really, really good news because the signs of our coefficients passed our first sense 67 68 00:06:40,390 --> 00:06:43,210 check. All these signs that we've just talked about 68 69 00:06:43,300 --> 00:06:45,550 make sense logically. 69 70 00:06:45,550 --> 00:06:52,120 So the things that we would have expected to be bad for property prices from a logic point of view indeed 70 71 00:06:52,180 --> 00:06:56,030 have a negative coefficient associated with them. 71 72 00:06:56,140 --> 00:06:57,040 Oh and, 72 73 00:06:57,040 --> 00:07:01,210 do you remember how I asked you to make a prediction if living next to the Charles River was a good 73 74 00:07:01,210 --> 00:07:02,360 thing or not? 74 75 00:07:02,380 --> 00:07:04,340 Well, now we've got our answer.
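The sense check itself can also be written down in a few lines. This is only a sketch: the helper `sign_mismatches` is a hypothetical name, and the coefficient values are placeholders standing in for `regr.coef_`. Only their signs follow the discussion above; the magnitudes are not the fitted results.

```python
import math

# Signs written down before looking at the output (+1 = positive, -1 = negative).
expected_sign = {"CRIM": -1, "NOX": -1, "RM": +1, "PTRATIO": -1, "LSTAT": -1}

# Placeholder coefficients (assumption): only the signs follow the lesson.
fitted = {"CRIM": -1.0, "NOX": -1.0, "RM": 1.0, "PTRATIO": -1.0, "LSTAT": -1.0}


def sign_mismatches(expected, coefficients):
    """Return the feature names whose fitted sign differs from the expected sign."""
    return [
        name
        for name, sign in expected.items()
        if math.copysign(1, coefficients[name]) != sign
    ]


mismatches = sign_mismatches(expected_sign, fitted)
print(mismatches)  # an empty list means every sign passed the sense check
```

Any feature name that shows up in the list is one to investigate before trusting the model.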
75 76 00:07:04,540 --> 00:07:11,410 Looking at the sign on the CHAS variable, we can see that living next to the Charles River is indeed 76 77 00:07:11,410 --> 00:07:18,700 desirable and the properties that are on the river are more expensive than those that are away from 77 78 00:07:18,700 --> 00:07:23,570 it, all else equal. In fact, we can put a number on it: 78 79 00:07:23,760 --> 00:07:32,010 Bostonians are willing to pay a premium for a property that is next to the river and that premium is 79 80 00:07:32,040 --> 00:07:35,440 equal to about 2,000 dollars. 80 81 00:07:35,460 --> 00:07:44,010 Remember, CHAS only has the values 0 or 1. When CHAS is equal to 1 then it's next to the river and 81 82 00:07:44,190 --> 00:07:49,230 1 times approximately 2 is equal to 2, right? 82 83 00:07:49,660 --> 00:07:54,210 And since our target values, our property prices, are in thousands, 83 84 00:07:54,210 --> 00:07:58,350 this translates into 2,000 dollars. 84 85 00:07:58,590 --> 00:08:05,820 So what we've just done in our Jupyter notebook is calculate all these theta parameters. 85 86 00:08:05,820 --> 00:08:09,720 Our model now looks like this. Using our training data, 86 87 00:08:09,750 --> 00:08:17,370 we've estimated all the values of these theta parameters and the beautiful thing about this linear model 87 88 00:08:17,790 --> 00:08:24,900 is that all our coefficients have a very clear meaning - an increase in the number of rooms by one will 88 89 00:08:24,960 --> 00:08:31,520 increase the price of the property by 3,100 dollars. 89 90 00:08:31,560 --> 00:08:38,910 Our model is easy to understand and to interpret, it's the very opposite of a black box that just spits 90 91 00:08:38,910 --> 00:08:39,530 out an answer 91 92 00:08:39,690 --> 00:08:41,570 and we don't know where it came from.
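The arithmetic behind those dollar figures can be spelled out in a short sketch. The coefficient values below are the rounded figures quoted in the lesson (about 2 for CHAS and about 3.1 for RM), not the exact fitted numbers, and `price_change_in_dollars` is a hypothetical helper name.

```python
# Rounded values quoted in the lesson; the exact fitted numbers will differ.
chas_coef = 2.0  # thousands of dollars for being next to the river
rm_coef = 3.1    # thousands of dollars per additional room


def price_change_in_dollars(coefficient, feature_change):
    """Change in predicted price when one feature changes and all else stays equal.

    The target is measured in thousands of dollars, hence the factor of 1000.
    """
    return coefficient * feature_change * 1000


# CHAS goes from 0 to 1: the river premium.
river_premium = price_change_in_dollars(chas_coef, 1)

# RM goes up by one room.
extra_room = price_change_in_dollars(rm_coef, 1)

print(f"River premium: about {river_premium:,.0f} dollars")
print(f"One extra room: about {extra_room:,.0f} dollars")
```

This works because the model is linear: changing one feature by one unit moves the prediction by exactly that feature's coefficient, whatever values the other features take.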