1 00:00:00,630 --> 00:00:07,600 In bivariate analysis, I told you that we use scatterplot and correlation matrix to understand the 2 00:00:07,600 --> 00:00:09,520 relationship between two variables. 3 00:00:11,150 --> 00:00:16,380 Now, let us discuss in detail these correlations that we see in a correlation matrix. 4 00:00:18,910 --> 00:00:25,690 So when do we say that variables are correlated? When the values of those variables fluctuate together. 5 00:00:26,290 --> 00:00:30,530 We say that variables are positively correlated 6 00:00:30,730 --> 00:00:38,110 if one variable almost always increases when the other one increases, and negatively correlated if 7 00:00:38,110 --> 00:00:40,900 one decreases when the other one is increasing. 8 00:00:43,000 --> 00:00:50,320 For example, if you think logically, there should be a positive correlation between calorie intake 9 00:00:50,950 --> 00:00:51,820 and a person's weight. 10 00:00:51,970 --> 00:00:54,280 That is, if you increase your calorie intake, 11 00:00:55,420 --> 00:00:56,920 you should have more weight. 12 00:00:57,070 --> 00:00:59,470 And if you reduce it, you should have less weight. 13 00:01:01,160 --> 00:01:01,820 Similarly, 14 00:01:03,060 --> 00:01:06,600 the amount of time one studies should impact the GPA of that student. 15 00:01:07,860 --> 00:01:11,130 So this means they should have some correlation as well. 16 00:01:12,470 --> 00:01:12,980 However, 17 00:01:14,350 --> 00:01:19,060 selecting variables with no commonality results in non-correlated variables. 18 00:01:20,230 --> 00:01:20,950 For example, 19 00:01:22,350 --> 00:01:27,700 a dog's name and the type of biscuit the dog will prefer will not be correlated. 20 00:01:29,590 --> 00:01:31,050 So for such scenarios, 21 00:01:31,100 --> 00:01:32,470 correlation is nearly zero. 22 00:01:34,120 --> 00:01:40,290 You can think of any bizarre combination of two variables, and most probably they should have low correlation.
23 00:01:44,610 --> 00:01:48,360 The quantification of correlation is called the correlation coefficient. 24 00:01:49,680 --> 00:01:53,220 There is a statistical formula to compute the correlation between two variables. 25 00:01:53,850 --> 00:01:59,130 The formula is simple, but since we will never find correlation coefficients for our data 26 00:01:59,160 --> 00:02:03,870 manually and will always be using some software to do this, 27 00:02:04,430 --> 00:02:06,350 we will not be discussing the formula here. 28 00:02:07,650 --> 00:02:12,060 Just note certain characteristics of the correlation coefficient. 29 00:02:13,800 --> 00:02:17,430 First, its value is between minus one and one. 30 00:02:19,460 --> 00:02:26,470 So plus one means there is 100 percent positive correlation and minus one means 100 percent negative 31 00:02:26,480 --> 00:02:27,050 correlation. 32 00:02:28,420 --> 00:02:34,060 Secondly, a zero value of the coefficient indicates that there is no linear relationship between the variables. 33 00:02:36,630 --> 00:02:37,830 You can see the graphs here. 34 00:02:39,540 --> 00:02:41,760 The first one shows a positive correlation. 35 00:02:41,850 --> 00:02:47,760 That is, if the value on the X axis is increasing, the value on the Y axis is also increasing. 36 00:02:49,460 --> 00:02:52,280 The middle one is representing no correlation. 37 00:02:53,480 --> 00:02:57,750 Even if you increase the value of one of the variables, the other one does not change. 38 00:02:58,960 --> 00:03:01,150 The third one is showing negative correlation. 39 00:03:01,180 --> 00:03:06,300 That is, if you increase the value of one variable, the value of the other variable goes down. 40 00:03:09,380 --> 00:03:16,070 As a rule of thumb, if you find the value of the correlation coefficient is less than 0.1, we say 41 00:03:16,070 --> 00:03:19,730 that there is very low correlation between the two variables.
42 00:03:20,360 --> 00:03:25,550 And if we find that it is more than 0.8, we say that there is very high correlation between the 43 00:03:25,550 --> 00:03:26,090 variables. 44 00:03:31,420 --> 00:03:38,050 It is important to note the difference between correlation and causation. Correlation is just representing 45 00:03:39,210 --> 00:03:44,940 whether the values in two variables for our sample data set are moving together or not. 46 00:03:46,260 --> 00:03:50,370 But it does not tell us anything about a cause-effect relationship between the two. 47 00:03:51,600 --> 00:04:00,780 We cannot say that because the correlation coefficient is high, an increase in one variable will lead 48 00:04:00,780 --> 00:04:03,510 to an increase or decrease in the other variable. 49 00:04:04,770 --> 00:04:05,700 Take this example. 50 00:04:07,680 --> 00:04:15,840 This example shows the values of U.S. crude oil imports from Norway and the number of drivers killed 51 00:04:15,840 --> 00:04:17,580 in collision with railway train. 52 00:04:20,130 --> 00:04:25,140 You can see from this graph itself that both of these values are going hand-in-hand. 53 00:04:26,660 --> 00:04:32,600 If you find out the correlation between these two values, this correlation is coming out to 54 00:04:32,600 --> 00:04:32,820 0.94. 55 00:04:32,890 --> 00:04:34,760 That is nearly 95 percent. 56 00:04:37,230 --> 00:04:39,120 Such a high value of correlation 57 00:04:40,230 --> 00:04:45,000 may lead you to think that these two variables are highly correlated 58 00:04:45,180 --> 00:04:48,840 and that increasing one of them may lead to an increase in the other one. 59 00:04:49,860 --> 00:04:50,880 That is not the case. 60 00:04:51,240 --> 00:04:53,370 These are two completely random 61 00:04:53,390 --> 00:04:54,230 variables. 62 00:04:55,850 --> 00:04:58,960 And they do not have any relationship between them.
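The characteristics of the correlation coefficient described above (a value between minus one and one, and the 0.1 and 0.8 rule-of-thumb thresholds) can be sketched in a few lines of Python. The lecture does not give the formula, so this sketch assumes the standard Pearson coefficient. The function names, the "moderate" label, the use of the absolute value in the rule of thumb, and all data values are made up for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: the lecture does not name the coefficient it uses;
    Pearson's r is the most common choice and is used here.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance without the 1/n factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


def describe(r):
    """Rule of thumb from the lecture: below 0.1 is very low, above 0.8 is very high.

    Assumption: the thresholds are applied to the absolute value so that
    strong negative correlation also counts as "very high". The label
    "moderate" for everything in between is not from the lecture.
    """
    strength = abs(r)
    if strength < 0.1:
        return "very low"
    if strength > 0.8:
        return "very high"
    return "moderate"


# Made-up sample data, for illustration only.
calories = [1800, 2000, 2200, 2400, 2600]   # daily calorie intake
weight = [60, 64, 67, 71, 76]               # body weight, rises with calories
hours = [1, 2, 3, 4, 5]                     # hours of practice
errors = [9, 7, 6, 4, 2]                    # mistakes made, falls with practice
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 4, 2]                         # no linear trend against x

for label, a, b in [
    ("calories vs weight", calories, weight),
    ("hours vs errors", hours, errors),
    ("x vs y", x, y),
]:
    r = pearson(a, b)
    print(f"{label}: r = {r:.3f} ({describe(r)} correlation)")
```

Note that a "very high" result here only says that the two columns move together in this sample. As the oil imports example shows, it says nothing about whether one causes the other.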
63 00:05:01,610 --> 00:05:10,220 So to establish a cause-effect relationship, one of the most common ways is to use intuition, to look 64 00:05:10,220 --> 00:05:15,920 back at deep business knowledge, and to find out whether the change in one of the variables actually 65 00:05:15,920 --> 00:05:18,500 impacted the change in the other variable or not. 66 00:05:20,240 --> 00:05:22,820 The other method is a difficult one. 67 00:05:23,150 --> 00:05:28,790 It is setting up a scientific experiment where we manipulate one variable in a controlled environment 68 00:05:28,880 --> 00:05:31,420 and see the impact on the other variable. 69 00:05:32,940 --> 00:05:34,440 We will not be covering that in this 70 00:05:35,530 --> 00:05:36,170 lecture series. 71 00:05:37,320 --> 00:05:42,860 But just in case you are doing a controlled experiment, that is one of the options that you have. 72 00:05:44,060 --> 00:05:45,500 So for our data set, 73 00:05:47,160 --> 00:05:49,840 we'll be depending on business intuition 74 00:05:51,020 --> 00:05:52,160 to establish causation. 75 00:05:56,710 --> 00:06:03,850 When you have a dataset with multiple variables, you may want to see the correlation coefficient for 76 00:06:03,970 --> 00:06:05,170 each pair of variables. 77 00:06:07,890 --> 00:06:12,000 There is a tabular format in which we take all the variables on top, 78 00:06:14,270 --> 00:06:19,190 like this, and all those variables on the left also, like this. 79 00:06:20,850 --> 00:06:24,330 And then for each combination of two variables, 80 00:06:25,170 --> 00:06:30,650 so for this "never to try to evade taxes" and "always to vote in elections", 81 00:06:30,810 --> 00:06:37,620 so this cell, for this combination of these two variables, the correlation coefficient is mentioned in the 82 00:06:37,650 --> 00:06:38,580 corresponding cell. 83 00:06:40,980 --> 00:06:43,560 This table is called a correlation matrix. 84 00:06:46,630 --> 00:06:50,440 Look at the diagonal values of this matrix.
85 00:06:51,510 --> 00:06:54,630 They are actually the correlation of each variable with itself. 86 00:06:55,810 --> 00:07:01,100 So basically, whenever this variable increases, this is also increasing; whenever this is decreasing, 87 00:07:01,120 --> 00:07:02,230 this is also decreasing. 88 00:07:02,440 --> 00:07:04,360 And they are going perfectly hand-in-hand. 89 00:07:04,910 --> 00:07:07,390 That's why this value is always going to be one. 90 00:07:08,620 --> 00:07:10,930 All the values on this diagonal are going to be one. 91 00:07:14,070 --> 00:07:20,370 And this value, which is giving us the correlation between "always to vote in elections" and "never 92 00:07:20,370 --> 00:07:21,790 to try to evade taxes", 93 00:07:22,440 --> 00:07:29,820 this same value will also come in here, because these two variables are also present 94 00:07:29,820 --> 00:07:32,830 as row and column for this cell. 95 00:07:35,310 --> 00:07:44,610 If you look at this matrix carefully, it is basically a mirror image of itself about 96 00:07:44,610 --> 00:07:45,300 this diagonal. 97 00:07:46,660 --> 00:07:51,260 So this particular cell is the same as this one, and this particular cell 98 00:07:51,470 --> 00:07:52,210 is the same as this one. 99 00:07:52,970 --> 00:07:54,080 This is the same as this one. 100 00:07:54,200 --> 00:07:56,840 So basically it is a mirrored image about this diagonal. 101 00:08:00,610 --> 00:08:05,230 Once we get this matrix, we can easily see these patterns in our data. 102 00:08:05,530 --> 00:08:08,050 That was the primary goal of making this matrix. 103 00:08:09,210 --> 00:08:16,470 Another useful feature of this matrix is that you can quickly find out which two independent variables 104 00:08:17,010 --> 00:08:18,150 are highly correlated. 105 00:08:19,950 --> 00:08:24,480 But why are we trying to find out correlation values between two independent variables?
106 00:08:27,070 --> 00:08:33,240 We are doing this because whenever there is correlation, the two correlated variables are moving 107 00:08:33,240 --> 00:08:33,670 together. 108 00:08:35,140 --> 00:08:41,110 Therefore, if one of them is linearly related individually with the dependent variable, then the 109 00:08:41,110 --> 00:08:44,200 other one will also show a linear relationship with the dependent variable. 110 00:08:46,170 --> 00:08:51,420 But when you take both the variables in your model and the model has to assign relative importance 111 00:08:51,420 --> 00:08:55,400 to each of them, it is a really tedious job for the model. 112 00:08:56,610 --> 00:09:01,350 Every time you show your model a new sample of data, it will assign different regression coefficients 113 00:09:01,380 --> 00:09:02,370 to these variables. 114 00:09:04,440 --> 00:09:07,890 This problem for the model is called multicollinearity. 115 00:09:10,570 --> 00:09:18,010 Therefore, it is important that we identify highly correlated independent variables and remove one 116 00:09:18,010 --> 00:09:21,580 of the two so that multicollinearity can be avoided. 117 00:09:25,970 --> 00:09:30,020 But which one of the two would you remove and which one would you keep? 118 00:09:31,440 --> 00:09:36,930 First, try to keep the one which makes more business sense. If both make business sense, 119 00:09:37,380 --> 00:09:39,630 look at the correlation coefficient of both 120 00:09:41,050 --> 00:09:46,750 with the dependent variable. Whichever is higher, you may want to keep that one and remove the other 121 00:09:46,750 --> 00:09:46,990 one. 122 00:09:48,970 --> 00:09:55,410 If both of them have the same correlation coefficient, then you may go and pick the variable for which 123 00:09:55,410 --> 00:09:56,650 the data is easier to get. 124 00:09:59,920 --> 00:10:03,920 This is the use of the correlation coefficient and the correlation matrix.
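The correlation matrix and the rule for choosing which of two highly correlated independent variables to drop can be sketched as follows. This is a minimal illustration, not the software used in the lecture: it assumes the Pearson coefficient, reuses the 0.8 rule-of-thumb threshold, and works on a small made-up dataset whose column names (`area_sqm`, `num_rooms`, `age_years`, `price`) are hypothetical. The "business sense" and "easier to get data" steps are judgment calls and are not coded.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed; the lecture gives no formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def highly_correlated_pairs(matrix, predictors, threshold=0.8):
    """Pairs of independent variables whose absolute correlation exceeds the threshold.

    Only cells above the diagonal are visited, because the matrix is a
    mirror image of itself about the diagonal.
    """
    pairs = []
    for i, a in enumerate(predictors):
        for b in predictors[i + 1:]:
            if abs(matrix[a][b]) > threshold:
                pairs.append((a, b))
    return pairs


def variable_to_drop(matrix, a, b, target):
    """Keep the variable more strongly correlated with the dependent variable.

    Returns the name of the one to remove. Ties are resolved by keeping `a`;
    the lecture suggests deciding ties by which data is easier to get.
    """
    return b if abs(matrix[a][target]) >= abs(matrix[b][target]) else a


# Hypothetical dataset, invented for this sketch.
data = {
    "area_sqm": [50, 60, 80, 100, 120, 150],
    "num_rooms": [2, 2, 3, 4, 4, 5],
    "age_years": [30, 5, 22, 9, 17, 12],
    "price": [100, 125, 160, 210, 235, 300],   # dependent variable
}
predictors = ["area_sqm", "num_rooms", "age_years"]

matrix = correlation_matrix(data)

# Print the matrix as a table: variables on top and on the left.
print(" " * 10 + "".join(f"{name:>11}" for name in data))
for row in data:
    print(f"{row:<10}" + "".join(f"{matrix[row][col]:>11.3f}" for col in data))

pairs = highly_correlated_pairs(matrix, predictors)
for a, b in pairs:
    drop = variable_to_drop(matrix, a, b, "price")
    print(f"{a} and {b} are highly correlated; candidate to remove: {drop}")
```

In practice a library would normally build the matrix for you. For example, pandas provides `DataFrame.corr()`. The pure-Python version is used here only to keep the sketch self-contained.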