1 00:00:01,450 --> 00:00:08,360 So when do we say that variables are correlated? When the values of those variables fluctuate together. 2 00:00:08,830 --> 00:00:13,100 We say that variables are positively correlated 3 00:00:13,240 --> 00:00:20,710 if one variable almost always increases when the other one increases, and negatively correlated if 4 00:00:20,710 --> 00:00:23,500 one decreases when the other one is increasing. 5 00:00:25,580 --> 00:00:32,960 For example, if you think logically, there should be a positive correlation between calorie intake 6 00:00:33,410 --> 00:00:34,470 and a person's weight. 7 00:00:34,530 --> 00:00:39,550 That is, if you increase your calorie intake, you should weigh more. 8 00:00:39,560 --> 00:00:42,080 And if you reduce it, you should weigh less. 9 00:00:43,730 --> 00:00:44,360 Similarly, 10 00:00:45,540 --> 00:00:49,230 the amount of time one studies should impact the GPA of that student. 11 00:00:50,400 --> 00:00:53,720 So this means they should have some correlation as well. 12 00:00:55,010 --> 00:00:55,580 However, 13 00:00:56,890 --> 00:01:03,490 selecting variables with no commonality results in uncorrelated variables. For example, 14 00:01:04,840 --> 00:01:10,310 a dog's name and the type of biscuit the dog prefers will not be correlated. 15 00:01:12,130 --> 00:01:15,040 So we say that for such scenarios, the correlation is nearly zero. 16 00:01:16,660 --> 00:01:22,870 You can think of any bizarre combination of two variables, and most probably those will have low correlation. 17 00:01:27,130 --> 00:01:30,950 The quantification of correlation is called the correlation coefficient. 18 00:01:32,230 --> 00:01:37,240 There is a statistical formula to compute the correlation between two variables, and the formula 19 00:01:37,240 --> 00:01:37,750 is simple. 
20 00:01:38,050 --> 00:01:44,500 But since we will never compute the correlation coefficient for our data manually, and will always be using 21 00:01:44,500 --> 00:01:46,060 some software to do 22 00:01:46,060 --> 00:01:48,940 this, we will not be discussing the formula here. 23 00:01:50,160 --> 00:01:54,660 I will just note certain characteristics of the correlation coefficient. 24 00:01:56,340 --> 00:02:00,030 First, its value is between minus one and one. 25 00:02:02,000 --> 00:02:09,080 So plus one means there is 100 percent positive correlation and minus one means 100 percent negative 26 00:02:09,080 --> 00:02:09,620 correlation. 27 00:02:10,960 --> 00:02:16,630 Secondly, a zero value of the coefficient indicates that there is no linear relationship between the variables. 28 00:02:19,140 --> 00:02:20,400 You can see the graphs here. 29 00:02:22,060 --> 00:02:28,450 The first one shows a positive correlation, that is, if the value on the X axis is increasing, the 30 00:02:28,450 --> 00:02:30,340 value on the Y axis is also increasing. 31 00:02:32,000 --> 00:02:34,850 The middle one is representing no correlation. 32 00:02:35,960 --> 00:02:40,330 Even if you increase the value of one of the variables, the other one does not change. 33 00:02:41,470 --> 00:02:46,960 The third one is showing negative correlation, that is, if you increase the value of one variable, the 34 00:02:46,960 --> 00:02:48,930 value of the other variable is going down. 35 00:02:51,880 --> 00:02:58,660 As a rule of thumb, if you find the absolute value of the correlation coefficient to be less than 0.1, we say 36 00:02:58,660 --> 00:03:02,310 that there is very little correlation between the two variables. 37 00:03:02,830 --> 00:03:08,140 And if we find that it is more than 0.8, we say that there is very high correlation between the 38 00:03:08,140 --> 00:03:08,680 variables. 39 00:03:13,900 --> 00:03:20,620 It is important to note the difference between correlation and causation. Correlation just represents one thing: 
40 00:03:21,750 --> 00:03:27,560 Whether the values of two variables in, for example, a dataset are moving together or not. 41 00:03:28,770 --> 00:03:32,950 But it does not tell us anything about the cause-effect relationship between the two. 42 00:03:34,140 --> 00:03:43,410 We cannot say that because the correlation coefficient is high, an increase in one variable will lead 43 00:03:43,410 --> 00:03:46,110 to an increase or decrease in the other variable. 44 00:03:47,310 --> 00:03:48,270 Take this example. 45 00:03:50,190 --> 00:03:58,470 This example shows the values of U.S. crude oil imports from Norway and the number of drivers killed 46 00:03:58,470 --> 00:04:00,120 in collisions with railway trains. 47 00:04:02,640 --> 00:04:07,740 You can see from this graph itself that both of these values are going hand-in-hand. 48 00:04:09,140 --> 00:04:14,960 If you find out the correlation between these two values, the coefficient comes out to 49 00:04:14,960 --> 00:04:15,470 0.95. 50 00:04:15,500 --> 00:04:17,380 That is 95 percent. 51 00:04:19,770 --> 00:04:21,690 Such a high value of correlation 52 00:04:22,770 --> 00:04:30,060 may lead you to think that these two variables are related, and that increasing one of them may 53 00:04:30,060 --> 00:04:31,440 lead to an increase in the other one. 54 00:04:32,370 --> 00:04:33,450 That is not the case. 55 00:04:33,780 --> 00:04:36,000 These are two completely random 56 00:04:36,000 --> 00:04:36,810 variables, 57 00:04:38,360 --> 00:04:41,560 and they do not have any relationship with each other. 58 00:04:44,150 --> 00:04:52,820 So to establish a cause-effect relationship, one of the most common ways is to use intuition, to look 59 00:04:52,820 --> 00:04:58,970 back at the business knowledge, and to find out whether a change in one of the variables actually impacts 60 00:04:59,020 --> 00:05:01,130 the other variable or not. 61 00:05:02,780 --> 00:05:05,390 The other method is a difficult one. 
62 00:05:05,600 --> 00:05:11,420 It is setting up a scientific experiment where we manipulate one variable in a controlled environment 63 00:05:11,420 --> 00:05:14,030 and see the impact on the other variable. 64 00:05:15,450 --> 00:05:17,040 We will not be covering that in this 65 00:05:18,030 --> 00:05:18,810 lecture series. 66 00:05:19,830 --> 00:05:25,500 But just in case you are doing a controlled experiment, that is one of the options that you have. 67 00:05:26,620 --> 00:05:28,090 So for our dataset, 68 00:05:29,700 --> 00:05:32,460 we will be depending on business intuition 69 00:05:33,590 --> 00:05:34,760 to establish causation. 70 00:05:39,220 --> 00:05:41,950 When we have a dataset with multiple variables, 71 00:05:43,530 --> 00:05:47,760 we want to see the correlation coefficient for each pair of variables. 72 00:05:50,460 --> 00:05:54,600 There is a tabular format in which we take all the variables on top, 73 00:05:56,780 --> 00:06:01,790 like this, and all those variables on the left also, like this. 74 00:06:03,360 --> 00:06:06,880 And then, for each combination of two variables, 75 00:06:07,710 --> 00:06:13,320 so for this "never to try to evade taxes" and "always to vote in elections", 76 00:06:13,350 --> 00:06:20,250 so for this combination of these two variables, the correlation coefficient is mentioned in the 77 00:06:20,250 --> 00:06:21,180 corresponding cell. 78 00:06:23,550 --> 00:06:26,160 This table is called the correlation matrix. 79 00:06:29,180 --> 00:06:33,050 If you look at this matrix, these diagonal values 80 00:06:34,010 --> 00:06:37,250 are actually the correlation of each variable with itself. 81 00:06:38,370 --> 00:06:43,710 So basically, whenever this variable increases, this is also increasing; whenever this is decreasing, 82 00:06:43,740 --> 00:06:47,080 this is also decreasing, and they are going perfectly hand-in-hand. 83 00:06:47,430 --> 00:06:49,980 That's why this value is always going to be one. 
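The table described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it assumes the Pearson correlation coefficient (the lecture does not name which coefficient it uses or show its formula), the helper names `pearson` and `correlation_matrix` are hypothetical, and the small dataset is made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (assumed measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict: matrix[row][col] is the correlation between the two named variables."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}

# Hypothetical data, invented purely for illustration.
data = {
    "calories": [1800, 2000, 2200, 2500, 3000],
    "weight_kg": [60, 64, 69, 75, 88],
    "hours_of_tv": [3, 1, 4, 2, 3],
}

matrix = correlation_matrix(data)
for name in data:
    # One printed row per variable, columns in the same order as the rows.
    print(name, [round(matrix[name][other], 2) for other in data])
```

Each variable appears both as a row and as a column, so the cell where a variable meets itself holds its correlation with itself.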
84 00:06:51,120 --> 00:06:53,550 All the values on the diagonal are going to be one. 85 00:06:56,550 --> 00:07:03,090 And this value, which is giving us the correlation between "always to vote in elections" and "never to 86 00:07:03,090 --> 00:07:10,230 try to evade taxes" [inaudible], will also come here, because these two 87 00:07:10,230 --> 00:07:15,390 variables are also present as the row and column for this cell. 88 00:07:17,810 --> 00:07:22,250 If you look at this matrix very carefully, 89 00:07:24,120 --> 00:07:27,900 it basically has a mirror image of itself about this diagonal. 90 00:07:29,210 --> 00:07:36,080 So this particular cell is the same as this cell, this particular cell is the same as this cell, and this one is the same as 91 00:07:36,080 --> 00:07:36,630 this one. 92 00:07:36,710 --> 00:07:39,470 So basically it is a mirror image about the diagonal. 93 00:07:43,150 --> 00:07:49,960 Once we get this matrix, we can easily see the patterns in the data. That was the primary goal of making 94 00:07:49,960 --> 00:07:50,650 this matrix. 95 00:07:51,690 --> 00:07:59,040 Another useful feature of this matrix is that you can quickly find out which two independent variables 96 00:07:59,490 --> 00:08:00,780 are highly correlated. 97 00:08:02,490 --> 00:08:07,070 But why are we trying to find out correlation values between two independent variables? 98 00:08:09,690 --> 00:08:15,840 We are doing this because whenever there is correlation, the two correlated variables are moving 99 00:08:15,840 --> 00:08:16,310 together. 100 00:08:17,670 --> 00:08:23,970 Therefore, if one of them is linearly related individually with the dependent variable, then the other 101 00:08:23,970 --> 00:08:26,820 one will also show a linear relationship with the dependent variable. 
102 00:08:28,710 --> 00:08:34,020 But when you take both the variables in your model and the model has to assign relative importance 103 00:08:34,020 --> 00:08:38,040 to each of them, it is a really tedious job for the model. 104 00:08:39,090 --> 00:08:43,980 Every time you show your model a new sample of data, it will assign different coefficients 105 00:08:43,980 --> 00:08:44,940 to these variables. 106 00:08:46,980 --> 00:08:50,430 This problem for the model is called multicollinearity. 107 00:08:53,110 --> 00:09:00,610 Therefore, it is important that we identify highly correlated independent variables and remove one 108 00:09:00,610 --> 00:09:04,170 of the two so that multicollinearity can be avoided. 109 00:09:08,490 --> 00:09:12,600 But which one of the two would you remove and which one would you keep? 110 00:09:13,980 --> 00:09:20,700 First, try to keep the one which makes more business sense. If both make business sense, look at the 111 00:09:20,700 --> 00:09:22,230 correlation coefficient of both 112 00:09:23,590 --> 00:09:29,350 with the dependent variable. Whichever is higher, you may want to keep that one and remove the other 113 00:09:29,350 --> 00:09:29,590 one. 114 00:09:31,440 --> 00:09:37,620 If both of them have the same correlation coefficient, then you may go and pick the variable 115 00:09:37,620 --> 00:09:39,270 for which the data is easier to get. 116 00:09:42,380 --> 00:09:46,610 So this is the use of the correlation coefficient and the correlation matrix.
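The selection rule above can be sketched in plain Python. This is a rough illustration, not the lecture's own procedure: it assumes the Pearson coefficient, borrows the 0.8 rule of thumb as the cutoff for "highly correlated", and covers only the second step (keep the variable more strongly correlated with the dependent variable). The business-sense and data-availability checks are judgment calls that code cannot make. The function name `variables_to_drop` and the dataset are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (assumed measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def variables_to_drop(features, target, threshold=0.8):
    """For each highly correlated pair of independent variables, mark for removal
    the one that is less correlated with the dependent variable."""
    dropped = set()
    for a, b in combinations(features, 2):
        if a in dropped or b in dropped:
            continue  # one of the pair is already being removed
        if abs(pearson(features[a], features[b])) > threshold:
            corr_a = abs(pearson(features[a], target))
            corr_b = abs(pearson(features[b], target))
            # Keep the variable with the stronger link to the target.
            dropped.add(b if corr_a >= corr_b else a)
    return dropped

# Hypothetical data, invented purely for illustration.
features = {
    "area_sqft": [800, 1000, 1200, 1500, 2000],
    "rooms": [2, 3, 3, 4, 5],
    "age_years": [30, 5, 18, 25, 10],
}
price = [100, 128, 150, 190, 252]  # dependent variable

print(variables_to_drop(features, price))
```

In this made-up data, `area_sqft` and `rooms` move together, so only one of them is kept; `age_years` is not strongly correlated with either and stays in.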