0 1 00:00:00,390 --> 00:00:06,650 Before running our regression analysis there's another thing that we should seek to understand 1 2 00:00:06,780 --> 00:00:14,790 prior to feeding our data into our machine learning algorithm. As part of the data exploration we need 2 3 00:00:14,790 --> 00:00:21,030 to understand to what extent our variables move together. 3 4 00:00:21,060 --> 00:00:30,990 We should look both at the correlation of our features with our target, but also at the correlation between 4 5 00:00:31,020 --> 00:00:32,610 our different features. 5 6 00:00:32,610 --> 00:00:38,440 So what is correlation and why should you give a ***? Excuse me, 6 7 00:00:38,520 --> 00:00:44,280 that was my inner Bostonian coming out, but let's talk about what correlation is first. 7 8 00:00:44,460 --> 00:00:51,060 Correlation is the degree to which things move together and I think you can see this best through some 8 9 00:00:51,060 --> 00:00:52,620 pictures. 9 10 00:00:52,740 --> 00:00:57,510 Now, it's no coincidence that a lot of these pictures might look quite similar to what we saw on our lessons 10 11 00:00:57,510 --> 00:01:00,150 on univariate linear regression. 11 12 00:01:00,150 --> 00:01:07,620 So here in our first picture on correlation we have a chart showing two variables, how much sun I get 12 13 00:01:08,130 --> 00:01:09,870 and how much ice cream I eat. 13 14 00:01:09,870 --> 00:01:13,470 These variables clearly have a relationship. 14 15 00:01:13,470 --> 00:01:20,430 The amount of sun and the amount of ice cream tend to move together. If one of these values is high then 15 16 00:01:20,430 --> 00:01:22,950 the other value tends to be high too. 16 17 00:01:23,070 --> 00:01:27,140 If one of them is low then the other one tends to be low as well. 17 18 00:01:27,150 --> 00:01:30,490 This is called a positive correlation. 18 19 00:01:30,570 --> 00:01:35,660 The amount of sun and the amount of ice cream I eat are positively correlated. 
19 20 00:01:35,670 --> 00:01:41,400 Okay, so what about a negative correlation? A negative correlation 20 21 00:01:41,460 --> 00:01:46,980 looks like this. The resulting graph will be downward sloping. 21 22 00:01:46,980 --> 00:01:52,500 What we're looking at here is the amount of time I spend on a packed train carriage hunched over with 22 23 00:01:52,500 --> 00:01:55,800 my face in somebody else's armpit on the London Underground, 23 24 00:01:56,280 --> 00:01:59,940 and how happy I am. And looking at this graph, 24 25 00:02:00,150 --> 00:02:05,610 the more time I spend on the London Underground the less happy I am. 25 26 00:02:05,610 --> 00:02:12,060 If one of these variables has a high value then the other one tends to have a low value. 26 27 00:02:12,300 --> 00:02:17,110 The variables tend to move in opposite directions. 27 28 00:02:17,270 --> 00:02:24,360 Now the emphasis is on the words "tends to" - when x is high y tends to be low. 28 29 00:02:24,420 --> 00:02:30,450 You see, correlations have a strength to them, a magnitude - correlations can be strong or they can 29 30 00:02:30,450 --> 00:02:37,730 be weak. If the value of x is high then the value of y can be low but it doesn't have to be. 30 31 00:02:38,040 --> 00:02:44,820 But you know what? I'd say this graph that we're looking at here is not an accurate representation of 31 32 00:02:45,060 --> 00:02:47,240 how I actually feel on the London Underground. 32 33 00:02:47,280 --> 00:02:52,340 The correlation with my happiness is actually much stronger than what I've represented here. 33 34 00:02:52,620 --> 00:02:59,160 Wanna have a guess as to what this graph would look like if these two variables were in fact perfectly 34 35 00:02:59,430 --> 00:03:07,320 negatively correlated? At the moment we kind of see a bit of a band, a cloud of data points. As the correlation 35 36 00:03:07,320 --> 00:03:08,550 gets stronger, 
36 37 00:03:08,550 --> 00:03:15,100 this cloud of data points actually becomes more and more narrow. In the extreme, 37 38 00:03:15,120 --> 00:03:20,850 all the data points would line up to pretty much form a straight line. 38 39 00:03:20,850 --> 00:03:28,560 Now this is a much more accurate representation of how I feel about British transport infrastructure 39 40 00:03:28,560 --> 00:03:30,510 from the days of Queen Victoria. 40 41 00:03:31,080 --> 00:03:39,510 Okay, so we've looked at a picture of perfect correlation, perfect negative correlation at that. 41 42 00:03:39,510 --> 00:03:46,500 What do you think the graph would look like if the reverse was true? What would this graph show if there 42 43 00:03:46,500 --> 00:03:51,570 was no correlation between the two variables at all? 43 44 00:03:51,570 --> 00:03:53,550 It would actually look like this. 44 45 00:03:53,550 --> 00:03:59,970 Remember how I said that the cloud of data points narrowed as the strength of the correlation increased? 45 46 00:03:59,970 --> 00:04:04,370 If there is no correlation then we get a plot that looks somewhat like this. 46 47 00:04:04,440 --> 00:04:08,120 We get a cloud of points with no clear pattern. 47 48 00:04:08,370 --> 00:04:15,960 In other words, the number of divorces in the state of Maine is uncorrelated with the amount of margarine 48 49 00:04:15,960 --> 00:04:16,800 consumed. 49 50 00:04:17,390 --> 00:04:23,850 Now so far I've been explaining the intuition behind correlation in a very visual manner. 50 51 00:04:24,570 --> 00:04:31,320 However, you'll typically see correlation written about in mathematical notation and the notation for 51 52 00:04:31,320 --> 00:04:33,890 correlation tends to look like this. 52 53 00:04:33,960 --> 00:04:38,280 We tend to use the Greek letter rho for correlation. 53 54 00:04:38,370 --> 00:04:46,980 Also the correlation is actually calculated as a single number between -1 and 1. 
-1 54 55 00:04:46,980 --> 00:04:55,440 is a perfect negative correlation and positive 1 is a perfect positive correlation and a correlation 55 56 00:04:55,440 --> 00:05:02,580 of 0 means that there's no correlation at all. The variables are uncorrelated. 56 57 00:05:02,610 --> 00:05:04,510 Okay so I hope that makes sense. 57 58 00:05:04,560 --> 00:05:12,690 Now we know that the correlation is a statistical measure of a linear relationship between two variables. 58 59 00:05:12,720 --> 00:05:14,940 Why does correlation matter? 59 60 00:05:14,940 --> 00:05:17,810 Why should we care about correlation? 60 61 00:05:17,820 --> 00:05:23,730 Why should we look at the correlations of our features during the data exploration stage? 61 62 00:05:23,730 --> 00:05:27,140 The answer is that we primarily care about two things. 62 63 00:05:27,330 --> 00:05:37,380 One, the strength of the correlation and two, the direction. Since our goal is to be able to predict house 63 64 00:05:37,380 --> 00:05:46,970 prices and build a valuation tool, our model should include features that are correlated with house prices. 64 65 00:05:47,100 --> 00:05:54,210 We want to include features in our model whose movement is associated with a big movement in the house 65 66 00:05:54,210 --> 00:05:55,080 prices. 66 67 00:05:55,080 --> 00:05:56,480 We want magnitude. 67 68 00:05:56,520 --> 00:06:00,520 We want a correlation that is not close to zero. 68 69 00:06:00,570 --> 00:06:06,120 So strength is important because it tells us how much correlation there actually is. 69 70 00:06:06,120 --> 00:06:09,410 What's the extent of the movement? 70 71 00:06:09,420 --> 00:06:12,300 The other thing that we want to know of course is the direction. 71 72 00:06:12,300 --> 00:06:19,710 If there is an increase in x does y go up or does y tend to go down? The direction is important because 72 73 00:06:19,710 --> 00:06:23,860 it tells us if the moves are in the same or in the opposite direction. 
73 74 00:06:23,880 --> 00:06:27,510 In other words, is the correlation positive or negative? 74 75 00:06:27,690 --> 00:06:28,940 But enough talking. 75 76 00:06:29,130 --> 00:06:35,790 How do we actually find these correlations? In two ways - by calculating them and by visualizing them in 76 77 00:06:35,790 --> 00:06:38,490 a Jupyter notebook. Let's do that now.
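
For the "calculating them" part, here is a minimal sketch of how that single number between -1 and 1 can be computed by hand. It assumes the Pearson correlation coefficient is the measure meant by rho in this lesson, which fits the description of a linear relationship between two variables. The function name `pearson_correlation` and all the data values are made up for illustration; they are not taken from the course notebook.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # How much x and y move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)

    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")

    return co_movement / math.sqrt(spread_x * spread_y)


# Hypothetical data, invented purely for this example.
hours_of_sun = [1, 2, 3, 4, 5]
ice_creams_eaten = [2, 4, 6, 8, 10]   # rises exactly in step with the sun
happiness_on_tube = [10, 8, 6, 4, 2]  # falls exactly in step
noisy_values = [3, 1, 4, 1, 5]        # only loosely follows the sun

r_positive = pearson_correlation(hours_of_sun, ice_creams_eaten)
r_negative = pearson_correlation(hours_of_sun, happiness_on_tube)
r_weak = pearson_correlation(hours_of_sun, noisy_values)

print("perfect positive:", r_positive)
print("perfect negative:", r_negative)
print("weak positive:", round(r_weak, 3))
```

The sign of the result gives the direction and the distance from zero gives the strength, which are the two things we said we care about. In practice a library routine would normally do this calculation; the hand-written version is only meant to show what goes into the number.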