In this lesson it's finally time to train our model. We're going to model our house prices using a technique called multivariable regression, which is also known as multiple linear regression. But what does that mean?

In a previous module we estimated movie revenue based on movie budgets. We were using a simple linear regression model that looked like this. We had one explanatory variable, the movie budget, and we were busy estimating the values of these theta parameters, theta 0 and theta 1. When fitting more than one explanatory variable, more than one feature, the equation for the regression will look like this instead. This is the format of our model for our multivariable regression.

Now, we don't have some generic n different terms. We've got 13 features in our data set, and these are the ones we're going to use. So in fact, when we adapt this generic model to our specific circumstances, our equation will actually look something like this: the estimate of the price is equal to theta 0, plus theta 1 times RM, plus theta 2 times our second feature, plus theta 3 times our third feature, and so on until we get to our last one. What this equation is telling us is that our estimate of the property price will be a linear combination of these 13 features. So it's still a linear model.

Okay. So this is the model for our regression that we will be fitting to our data. And when we run our Python code, we'll get a value for each of these theta parameters that we see here. These are going to be the coefficients we're interested in.
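To make that concrete, here's a minimal sketch of what the fitting step might look like in Python with scikit-learn. The names `features` and `prices` are hypothetical placeholders for a DataFrame holding the 13 feature columns and the matching series of property prices; the real code comes later in the lesson.

```python
# Minimal sketch, assuming `features` is a pandas DataFrame with the
# 13 explanatory variables (including RM) and `prices` is the target.
# Both names are placeholders, not the lesson's actual code.
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(features, prices)   # estimates theta 0 through theta 13

print(regr.intercept_)       # theta 0, the constant term
print(regr.coef_)            # theta 1 to theta 13, one per feature
```

Calling fit() is what "runs the numbers": scikit-learn solves for the theta values that minimise the squared errors and hands them back as the intercept and the coefficients.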
But before we write our code and run the numbers, did you ever wonder why this technique was called regression in the first place? Where does this name regression actually come from? I mean, regression is such a strange word to be using for fitting a line to our data, right?

So this technique and the name actually go back a long, long time to an English gentleman called Francis Galton, Sir Francis Galton in fact. Sir Francis lived in Victorian England and had quite a few talents. One of the many things he's remembered for today is his invention of the Galton board. Yep, he named this invention after himself.

Anyhow, a Galton board is basically what you end up with if you were to take a Japanese pachinko machine and then extract all the fun out of it. The Galton board is a contraption that has a bunch of balls and pins in it. When you turn this board upside down, the balls go through a single opening at the top and then they start bouncing off the pins.

And here's the fascinating part: what you end up with at the bottom of the board will look something like a histogram from matplotlib. In fact, you get a very particular type of arrangement in the slots at the bottom, and if you squint a little bit you'll actually recognise our old friend, the normal distribution. Most of the balls end up in the middle, but a very few balls bounce around like crazy on the pins and end up at the sides of the board, as outliers on the very far left or the very far right edge. Needless to say, a lot of statisticians at the time were very excited about this invention.
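You can see why that bell shape appears with a quick simulation; this is my own illustrative sketch, not code from the lesson. Each ball's final slot is simply the sum of many independent left-or-right bounces, and a histogram of those sums comes out bell-shaped.

```python
# Illustrative Galton board simulation (not the lesson's code):
# 10,000 balls each bounce off 12 rows of pins, going left (-1) or
# right (+1) with equal probability at every pin.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
bounces = rng.choice([-1, 1], size=(10_000, 12))
final_slots = bounces.sum(axis=1)              # slot where each ball lands

plt.hist(final_slots, bins=range(-13, 15, 2))  # one bin per slot
plt.show()                                     # most balls pile up in the middle
```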
But the thing is, Sir Francis Galton was not a one-trick pony. No, sir. Being quite interested in both distributions and outliers, he looked very, very carefully at the sizes of things, from the sizes of the seeds of his sweet peas to the sizes of people. In particular, he was very interested in how size changes from one generation to the next, especially with regard to outliers.

Now, one thing that Sir Francis noticed was that when you had a very, very tall father, and that father had a son, and that son grew up to be an adult, the adult son was usually shorter than his father. Sir Francis called this phenomenon regression to the mean. That's where we get that word from: regression.

Now, you can even observe this phenomenon of regression to the mean in, say, the NBA. Professional basketball players tend to be very, very tall, right? For example, Shaquille O'Neal was 7 foot 1, or 216 centimetres. At least, that's what he's listed at on Wikipedia. But what about the sons of an NBA player? Shaquille O'Neal's son Shareef O'Neal is slightly shorter than his father at 2 metres and 8 centimetres, or 6 foot 10. Another example is actually Michael Jordan, who was listed at 6 foot 6, and Michael Jordan's son Jeffrey Jordan is listed at 6 foot 1.

Now, I realise that those are just two examples, and there aren't that many sons of basketball players in the NBA whose height you can pull up on Wikipedia, but you'll find that this pattern holds: on average, NBA players' sons end up shorter than their dads. This is regression to the mean in action.
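That pattern falls out of basic statistics whenever a son's height is only partly correlated with his father's. Here's a small illustrative sketch with made-up population numbers; the 175 cm mean, 7 cm standard deviation and 0.5 correlation are assumptions of mine, not figures from the lesson.

```python
# Illustrative regression-to-the-mean simulation (not the lesson's code).
# Sons' heights are only partly correlated with their fathers', so sons
# of unusually tall fathers are, on average, closer to the mean.
import numpy as np

rng = np.random.default_rng(42)
mean, sd, r = 175.0, 7.0, 0.5                  # assumed population values, in cm
fathers = rng.normal(mean, sd, 100_000)
noise = rng.normal(0.0, sd * (1 - r**2) ** 0.5, 100_000)
sons = mean + r * (fathers - mean) + noise     # same mean and spread, correlation r

tall = fathers > 190                           # pick out the very tall fathers
print(fathers[tall].mean())                    # roughly 193 cm
print(sons[tall].mean())                       # roughly 184 cm: pulled back toward 175
```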