1 00:00:00,300 --> 00:00:01,633 Hello. Welcome back to the course. 2 00:00:01,633 --> 00:00:03,633 Super excited to have you on board today. 3 00:00:03,633 --> 00:00:06,600 We've got a very interesting and important tutorial: 4 00:00:06,600 --> 00:00:08,566 assumptions of linear regression. 5 00:00:08,566 --> 00:00:10,000 So let's have a look. 6 00:00:10,000 --> 00:00:11,266 Here we've got a data set 7 00:00:11,266 --> 00:00:13,133 with a linear regression applied. 8 00:00:13,133 --> 00:00:16,900 And this linear regression seems to be serving its purpose really well. 9 00:00:17,366 --> 00:00:20,233 However, if we look at the following three data sets, 10 00:00:20,233 --> 00:00:23,366 we can see that a linear regression is applied each time. 11 00:00:23,366 --> 00:00:26,800 And in fact it's practically the same linear regression as in the first case. 12 00:00:27,133 --> 00:00:30,133 However, those linear regressions are not serving their purpose. 13 00:00:30,166 --> 00:00:32,666 In fact, they are misleading. 14 00:00:32,666 --> 00:00:36,433 So we shouldn't be using linear regressions in those situations. 15 00:00:36,966 --> 00:00:39,733 These four data sets are called Anscombe's quartet, 16 00:00:39,733 --> 00:00:43,900 and they illustrate that you can't just blindly apply linear regression. 17 00:00:44,100 --> 00:00:49,300 You have to make sure that your data set is fit for using linear regression. 18 00:00:49,300 --> 00:00:52,300 And that's where assumptions of linear regression come in. 19 00:00:52,600 --> 00:00:53,733 So let's have a look at them. 20 00:00:53,733 --> 00:00:56,733 There are going to be five assumptions in total, plus an extra check. 21 00:00:57,000 --> 00:00:58,966 The first assumption is linearity. 22 00:00:58,966 --> 00:01:01,766 We want to make sure that there is a linear relationship 23 00:01:01,766 --> 00:01:05,200 between our dependent variable and each independent variable.
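As an aside to the Anscombe's quartet example above, here is a minimal Python sketch of the point that four very different data sets can share practically the same regression line. The numbers are the published values of Anscombe's quartet. The helper name `fit_line` and the variable names are illustrative assumptions, not something from the course, and the fit is a plain least-squares formula written by hand rather than any particular library's API.

```python
def fit_line(xs, ys):
    """Least-squares fit for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Anscombe's quartet: data sets I, II and III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X_FOUR = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                      6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                       6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X_FOUR, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                    5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in QUARTET.items():
    slope, intercept = fit_line(xs, ys)
    print(f"Data set {name}: slope={slope:.2f}, intercept={intercept:.2f}")
```

Each data set prints a slope of about 0.50 and an intercept of about 3.00, so the fitted line alone cannot tell you which data sets satisfy the assumptions. You still have to look at the chart.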
24 00:01:05,366 --> 00:01:06,900 And if you look at the chart 25 00:01:06,900 --> 00:01:10,833 here on the right, you'll see that the linear regression is misleading. 26 00:01:10,966 --> 00:01:11,400 That is because 27 00:01:11,400 --> 00:01:14,666 there is actually no linear relationship between the two variables. 28 00:01:14,666 --> 00:01:17,666 So we wouldn't use this kind of model there. 29 00:01:18,100 --> 00:01:20,766 The second assumption is homoscedasticity. 30 00:01:20,766 --> 00:01:21,300 And even though 31 00:01:21,300 --> 00:01:25,866 it sounds like a complex term, it actually simply means equal variance, 32 00:01:26,666 --> 00:01:30,166 meaning that you don't want to see a cone-type shape 33 00:01:30,166 --> 00:01:33,166 on your chart, whether an increasing cone or a decreasing cone, 34 00:01:33,233 --> 00:01:37,600 which would mean that the variance depends on the independent variable. 35 00:01:37,900 --> 00:01:40,900 So in this case we wouldn't use a linear regression either. 36 00:01:41,433 --> 00:01:44,466 The third assumption is multivariate normality, 37 00:01:44,500 --> 00:01:47,400 or normality of error distribution. If you look at the chart 38 00:01:47,400 --> 00:01:49,733 here on the right, you can feel that something is off. 39 00:01:49,733 --> 00:01:52,700 The best way to intuitively think about it is: 40 00:01:52,700 --> 00:01:55,833 if you look along the line of the linear regression, 41 00:01:55,833 --> 00:01:59,200 you want to see a normal distribution of your data points around it. 42 00:01:59,566 --> 00:02:02,933 In the case on the right here, we can see something different. 43 00:02:03,166 --> 00:02:06,600 And so again we wouldn't apply a linear regression there. 44 00:02:07,166 --> 00:02:10,266 The fourth assumption is independence of observations. 45 00:02:10,266 --> 00:02:13,300 And this includes the term no autocorrelation. 46 00:02:13,300 --> 00:02:16,566 Sometimes you'll see this assumption titled "no autocorrelation".
47 00:02:17,200 --> 00:02:19,900 And what that means is that we don't want 48 00:02:19,900 --> 00:02:23,200 to see any kind of pattern in our data. A pattern in the data 49 00:02:23,200 --> 00:02:27,733 like we see here indicates that our rows are not independent, 50 00:02:27,733 --> 00:02:31,800 that some rows are affecting other rows, which affect other rows, etc. 51 00:02:32,066 --> 00:02:35,166 A classic example of this would be the stock market, where 52 00:02:35,866 --> 00:02:40,000 previous prices affect future prices, which in turn affect later prices, and so on. 53 00:02:40,133 --> 00:02:43,666 So in this case, we wouldn't apply a linear regression model. 54 00:02:44,500 --> 00:02:47,400 The fifth assumption is lack of multicollinearity. 55 00:02:47,400 --> 00:02:50,633 Basically, we want our independent variables, or predictors, 56 00:02:50,966 --> 00:02:54,166 not to be correlated with each other. If they're not correlated, 57 00:02:54,333 --> 00:02:56,433 then we can build a linear regression. 58 00:02:56,433 --> 00:02:58,000 If they are correlated 59 00:02:58,000 --> 00:03:01,833 and we do proceed and build a linear regression model, then the coefficient 60 00:03:01,833 --> 00:03:05,366 estimates that we get in the model will become unreliable. 61 00:03:06,133 --> 00:03:08,766 And the sixth point is the outlier check. 62 00:03:08,766 --> 00:03:11,433 This is not an actual assumption, but rather an extra check 63 00:03:11,433 --> 00:03:15,366 that is important to keep in mind when building linear regression models. 64 00:03:15,766 --> 00:03:18,366 If you look at the chart here on the right, you can see that the outlier 65 00:03:18,366 --> 00:03:22,700 is significantly affecting the linear regression line that we get. 66 00:03:22,900 --> 00:03:25,900 So something that we want to consider is: 67 00:03:26,200 --> 00:03:30,100 should we remove the outliers before building a linear regression?
68 00:03:30,100 --> 00:03:33,600 Or do we want to build a linear regression with the outliers included? 69 00:03:33,833 --> 00:03:36,900 This will depend on your business knowledge and knowledge of the data set. 70 00:03:37,066 --> 00:03:37,833 So there we go. 71 00:03:37,833 --> 00:03:39,966 Those are the assumptions of linear regression. 72 00:03:39,966 --> 00:03:42,800 In this course we're going to assume by default 73 00:03:42,800 --> 00:03:45,800 that they hold for all the data sets that we work with. 74 00:03:45,933 --> 00:03:48,800 However, when you're working with your own data sets, it's important 75 00:03:48,800 --> 00:03:52,533 to do these checks to make sure you're building a linear regression 76 00:03:52,733 --> 00:03:55,533 only when it is fit for the data set. 77 00:03:55,533 --> 00:03:58,700 And to finish off this tutorial, I have a special bonus for you. 78 00:03:58,700 --> 00:04:02,333 If you enjoyed this explanation, you can download the PDF 79 00:04:02,333 --> 00:04:07,166 version of this slide and keep it at home printed out as a poster. 80 00:04:07,200 --> 00:04:09,033 Keep it somewhere handy 81 00:04:09,033 --> 00:04:12,933 for those times when you do need to check assumptions of linear regression. 82 00:04:13,300 --> 00:04:14,966 If that's something that you would like to have, 83 00:04:14,966 --> 00:04:16,000 head on over to 84 00:04:16,000 --> 00:04:19,366 superdatascience.com/assumptions and you can download your poster there. 85 00:04:19,833 --> 00:04:21,600 And I look forward to seeing you back here next time. 86 00:04:21,600 --> 00:04:23,400 Until then, enjoy machine learning.
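Two of the checks described in this tutorial can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the course's own method: it uses the Durbin-Watson statistic as one common way to look for autocorrelation in the residuals (values near 2 suggest little first-order autocorrelation, values toward 0 or 4 suggest positive or negative autocorrelation), and the variance inflation factor for the special case of exactly two predictors as one way to look at multicollinearity (values above roughly 5 to 10 are often treated as a warning sign). All function names, variable names and data values below are hypothetical and made up for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(a, b):
    """VIF when there are exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)


# Hypothetical data: a straight line plus a slow wave, so that
# neighbouring rows are related to each other (autocorrelation).
xs = list(range(20))
ys = [2.0 * x + 3.0 * math.sin(x / 3.0) for x in xs]
slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
dw = durbin_watson(residuals)
print(f"Durbin-Watson: {dw:.2f}")  # well below 2 for this wavy data

# Hypothetical predictors that move almost in lockstep (multicollinearity).
size = [1.0, 2.0, 3.0, 4.0, 5.0]
rooms = [2.1, 3.9, 6.2, 8.0, 9.9]
print(f"VIF: {vif_two_predictors(size, rooms):.1f}")  # far above 10 here
```

With more than two predictors the variance inflation factor needs a separate regression of each predictor on all the others, so the two-predictor shortcut here is only meant to show the idea.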