1 00:00:00,480 --> 00:00:07,800 In this video, we will learn how to create a correlation matrix, which gives us the correlation between 2 00:00:08,010 --> 00:00:09,930 different variables of our dataset. 3 00:00:12,260 --> 00:00:19,580 To create a correlation matrix, we will use the data analysis adn which contains the correlation matrix 4 00:00:19,580 --> 00:00:23,000 option, so we'll click on data analysis. 5 00:00:27,100 --> 00:00:34,540 Here you can see the analysis tool, which is correlation, click on OK and the input range, we will 6 00:00:34,540 --> 00:00:35,880 select the whole table. 7 00:00:40,450 --> 00:00:45,790 To select the whole level, you can use the shortcut of control shift plus Iraqis. 8 00:00:48,860 --> 00:00:55,280 So first is the input range, which is the whole table, second is grouped by which is in columns, 9 00:00:55,280 --> 00:01:03,260 since columns contains are variables, next option is labeled in first row, which is true since our 10 00:01:03,260 --> 00:01:06,110 first row of the table contains the name of the variables. 11 00:01:08,360 --> 00:01:11,810 And lastly, it is asking us where do we want our output? 12 00:01:12,140 --> 00:01:14,020 We want our output in a new sheet. 13 00:01:14,240 --> 00:01:17,060 So that is selected and we can click on again on. 14 00:01:20,040 --> 00:01:25,410 So here is the correlation matrix for all the variables in our dataset. 15 00:01:27,550 --> 00:01:34,180 You can see that all the variables are listed in rows and columns, both so in the first column and 16 00:01:34,180 --> 00:01:36,690 in the first row we have the variable names. 17 00:01:37,870 --> 00:01:47,530 This fossil is giving me the correlation between the first column, which is price, and the first row, 18 00:01:47,530 --> 00:01:49,010 which is, again, price. 19 00:01:49,870 --> 00:01:57,790 So since it has the same variable on both sides, the correlation of that very well with itself is 100 20 00:01:57,790 --> 00:01:58,950 percent, which is one. 21 00:02:00,250 --> 00:02:05,030 So you can see the whole diagonal is intersection of the same variable with itself. 22 00:02:05,620 --> 00:02:11,380 That is why it is having the value of one in this second value. 23 00:02:12,370 --> 00:02:16,120 It is the correlation between crime rate and the price variable. 24 00:02:17,290 --> 00:02:20,130 And the value of this is minus point 46. 25 00:02:21,520 --> 00:02:26,350 That is, if crime rate is increasing, price values are in general decreasing. 26 00:02:26,540 --> 00:02:29,860 And if crime rate values are decreasing, prices increasing. 27 00:02:32,120 --> 00:02:39,110 Similarly, we get decorrelation values of all the combinations of variables. 28 00:02:41,630 --> 00:02:45,560 The corresponding second value on the row is empty. 29 00:02:48,450 --> 00:02:54,400 Because it is, again, the correlation between price and crime rate, which is the same as this value. 30 00:02:54,780 --> 00:03:01,410 So that is why the upper half of this diagonal is empty and the lower half is filled. 31 00:03:03,860 --> 00:03:11,690 Not only did he reelected, we can see that which all variables have very high correlation with the 32 00:03:12,050 --> 00:03:13,020 dependent variable. 33 00:03:14,060 --> 00:03:21,830 This will help us identify which all variables are important and are expected to have some impact on 34 00:03:21,830 --> 00:03:23,450 the response variable. 35 00:03:25,300 --> 00:03:33,460 So since this first column is the price column, which is are dependent variable, we can callachor 36 00:03:33,460 --> 00:03:41,320 this using conditional formatting in home menu under these conditions, formatting, we can use color 37 00:03:41,320 --> 00:03:41,800 skills. 38 00:03:43,790 --> 00:03:51,020 So if I use this particular color scale, a dark blue color would mean a very high positive correlation, 39 00:03:51,410 --> 00:03:59,420 a dark red color would mean a very high negative correlation, and no color or white would mean that 40 00:03:59,420 --> 00:04:03,080 there is very less correlation between these two particular variables. 41 00:04:05,920 --> 00:04:15,610 So from this, we can say that we expect ruminant teachers, poor Propp, all these variables to highly 42 00:04:15,610 --> 00:04:16,630 impact the price. 43 00:04:17,890 --> 00:04:26,620 Whereas in boardrooms, waterboarding, like waterboarding were waterboarding and they were all they 44 00:04:26,630 --> 00:04:30,520 were able to have a very little impact on the price variable. 45 00:04:31,900 --> 00:04:38,110 So this is how we interpret the relationship between independent and dependent variables versus their 46 00:04:38,110 --> 00:04:39,160 correlation matrix. 47 00:04:40,720 --> 00:04:45,880 We also need to see the correlation values for all the pairs of independent variables. 48 00:04:47,330 --> 00:04:53,390 We need to establish that are independent variables are not correlated amongst themselves. 49 00:04:53,780 --> 00:05:01,520 This is because if the independent variables are highly correlated, we face a problem named multipolarity 50 00:05:01,520 --> 00:05:05,070 in our analysis that may impact that final result. 51 00:05:05,660 --> 00:05:12,890 So we need to identify those variable pairs which have high correlation and delete one of the variables 52 00:05:12,890 --> 00:05:16,880 of that bid so that we do not encounter multiple linearity. 53 00:05:19,710 --> 00:05:27,270 So to identify a pair of rebels which have high correlation, again, use conditional formatting. 54 00:05:28,680 --> 00:05:30,990 We will select this and go to. 55 00:05:32,820 --> 00:05:35,550 The conditions for rating options and will give a neutral. 56 00:05:41,300 --> 00:05:50,630 And this we will select the role, that format only cells that contain cell value, not between minus 57 00:05:50,630 --> 00:05:54,230 point eight and plus point eight. 58 00:05:58,260 --> 00:06:04,110 Wherever this is the scenario, we want that to be filled with green color. 59 00:06:07,090 --> 00:06:15,250 So whenever the sale value is not between minus pointed and pointed, which means that the sale value 60 00:06:15,250 --> 00:06:22,540 is very high, there is very high, either positive or negative correlation between that pair of variables. 61 00:06:23,720 --> 00:06:26,450 The DNA should be color green. 62 00:06:28,190 --> 00:06:29,280 So we'll click on talk. 63 00:06:31,220 --> 00:06:43,610 You can see that only despair of box and air quality is colored green, that is the value of parks and 64 00:06:43,610 --> 00:06:49,410 air quality is is highly, positively correlated to prevent multicellularity. 65 00:06:49,670 --> 00:06:57,290 We need to delete either of these two variables now, which we're able to delete will depend on the 66 00:06:57,290 --> 00:06:58,090 business scenario. 67 00:06:58,400 --> 00:07:03,710 You have to go back and take the business knowledge on which of these two variables is more important 68 00:07:03,710 --> 00:07:06,320 to our analysis and which can be dropped. 69 00:07:08,180 --> 00:07:14,420 Alternatively, you can also consider the fact that collecting data for each of these variables is easier 70 00:07:15,350 --> 00:07:18,930 if there is no difference between these two variables in decent. 71 00:07:20,600 --> 00:07:28,160 You can opt for that variable, which is higher correlation with the dependent variable, for example, 72 00:07:28,160 --> 00:07:28,640 here. 73 00:07:28,700 --> 00:07:36,440 Air quality has minus zero point for two coalition and parties, minus zero point three nine. 74 00:07:36,440 --> 00:07:41,990 Coalition air quality has a higher value of correlation with the price. 75 00:07:41,990 --> 00:07:46,670 Variable will keep air quality and will the lead box. 76 00:07:48,670 --> 00:07:56,140 So we'll go to our house pricing data and we will delete the column which contains the value of box. 77 00:07:59,870 --> 00:08:06,680 So this is how we draw the correlation matrix and analyze it using conditional formatting in Microsoft 78 00:08:06,680 --> 00:08:07,010 Excel.