1 00:00:01,190 --> 00:00:04,290 In this video, we will learn about bivariate analysis. 2 00:00:05,430 --> 00:00:12,930 When we looked at EDD and box plots, we looked at individual variables, and that was univariate analysis. 3 00:00:14,250 --> 00:00:18,000 Now we will pick two variables, and this will be called bivariate analysis. 4 00:00:18,870 --> 00:00:24,030 So when we have two variables, we look at the relationship they have amongst themselves, or whether 5 00:00:24,030 --> 00:00:27,000 they seem to even have a relationship or not. 6 00:00:28,230 --> 00:00:31,800 There are two popular ways of looking at a two-variable relationship. 7 00:00:32,010 --> 00:00:36,570 One is a graphical way, which is called a scatterplot, and the second one is a tabular way, which is 8 00:00:36,570 --> 00:00:38,040 called a correlation matrix. 9 00:00:40,470 --> 00:00:46,650 How we should use a scatterplot is: we should first plot a scatterplot of each independent variable against 10 00:00:46,650 --> 00:00:47,730 the dependent variable. 11 00:00:49,660 --> 00:00:54,070 Then we should ask, is there a visible relationship between these variables? 12 00:00:54,700 --> 00:00:55,540 If there is none, 13 00:00:55,960 --> 00:00:57,310 we should go back and check 14 00:00:57,340 --> 00:00:58,150 business knowledge. 15 00:00:59,260 --> 00:01:05,320 If there is a visible relationship, then we should see if it is a linear relationship or not. 16 00:01:06,400 --> 00:01:12,310 If it is a linear relationship, we will straight away use that variable for linear regression analysis. 17 00:01:13,300 --> 00:01:18,640 If it is some other sort of relationship, we will transform the variable so that the transformed variable 18 00:01:18,880 --> 00:01:20,740 is now linearly related. 19 00:01:22,660 --> 00:01:25,990 We will look at variable transformations in a couple of minutes. 20 00:01:27,400 --> 00:01:33,460 So based on the scatterplot, we will decide whether to keep, discard or transform a variable.
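The tabular view named above, the correlation matrix, holds the linear (Pearson) correlation coefficient for every pair of variables. The following is a minimal sketch of how such a table could be computed with the Python standard library only. The column names `x1`, `x2`, `x3` and their values are made-up toy data, not the course dataset, and the helper names `pearson` and `correlation_matrix` are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data: x2 rises with x1, x3 falls as x1 rises.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [10, 8, 6, 4, 2],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In practice a data-frame library would usually produce this table in one call; the hand-written version is only meant to show what each cell contains.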
21 00:01:34,420 --> 00:01:42,010 Next is the correlation matrix. A correlation matrix gives the linear correlation coefficients between all pairs 22 00:01:42,010 --> 00:01:45,760 of variables of the dataset. In other terms, 23 00:01:46,450 --> 00:01:55,210 if X2 increases when X1 increases, and X2 decreases when X1 decreases, the variables tend to be highly 24 00:01:55,210 --> 00:01:56,380 positively correlated. 25 00:01:57,280 --> 00:02:02,770 Similarly, if X2 increases when X1 decreases, the correlation is said to be negative. 26 00:02:03,520 --> 00:02:07,180 And if the change is random, that is, 27 00:02:07,270 --> 00:02:11,140 X2 increases sometimes and decreases sometimes with an increase in X1, 28 00:02:11,530 --> 00:02:14,140 then the coefficient of correlation will be near zero. 29 00:02:15,370 --> 00:02:18,490 We will discuss this in depth in upcoming videos. 30 00:02:20,290 --> 00:02:22,260 When we plot the correlation matrix, 31 00:02:22,450 --> 00:02:24,400 we will look for the following two things. 32 00:02:25,810 --> 00:02:29,290 One, if the value is low, that is, near zero, 33 00:02:30,700 --> 00:02:33,490 between the dependent and the independent variables. 34 00:02:34,480 --> 00:02:42,730 This will basically represent that probably there is no correlation between the dependent and 35 00:02:42,790 --> 00:02:44,020 the independent variable. 36 00:02:44,740 --> 00:02:47,620 And we can consider discarding that independent variable. 37 00:02:49,840 --> 00:02:54,330 Second, if there is very high correlation amongst the independent variables. 38 00:02:55,180 --> 00:03:02,190 So if we take two independent variables and the correlation between these two is coming out too high, it may 39 00:03:02,190 --> 00:03:07,200 suggest that the independent variables that we have selected may not be truly independent. 40 00:03:08,900 --> 00:03:10,430 And we may have to let go
41 00:03:10,520 --> 00:03:16,790 of one of them, because having both in the analysis leads to a type of error called multicollinearity. 42 00:03:17,990 --> 00:03:21,640 All these types of errors will also be covered in the later part of this course. 43 00:03:22,490 --> 00:03:25,670 For now, we can consider that we will be removing 44 00:03:25,790 --> 00:03:32,300 any one variable of the pair of independent variables which show a very high correlation coefficient, 45 00:03:32,390 --> 00:03:37,240 say, more than 0.8. Here in this slide, 46 00:03:37,370 --> 00:03:39,320 I have some scatterplot examples. 47 00:03:40,940 --> 00:03:45,380 Here we take one variable on the X axis and the other on the Y axis. 48 00:03:45,830 --> 00:03:48,530 And each data point is plotted on the graph. 49 00:03:49,640 --> 00:03:54,440 The idea of doing this is to see if there is any visible relationship between the variables. 50 00:03:55,430 --> 00:04:01,910 Most of the time when we are plotting a scatterplot, the Y axis will be taken by the dependent variable, 51 00:04:02,150 --> 00:04:04,760 that is, the variable we are trying to predict. 52 00:04:07,490 --> 00:04:13,280 Now, if you look at the first scatterplot, it seems to be a linear relationship. 53 00:04:14,690 --> 00:04:17,570 It does not mean that a single line will pass through all points. 54 00:04:18,140 --> 00:04:23,990 But overall, if X has higher values, Y tends to be on the higher side. 55 00:04:25,010 --> 00:04:28,670 And when X increases, Y increases proportionately. 56 00:04:29,780 --> 00:04:34,650 So this can be used straight away in a linear regression problem. 57 00:04:36,540 --> 00:04:41,280 But in the other scatterplots, the relationship seems to be not linear. 58 00:04:43,500 --> 00:04:52,290 Y is increasing as X is increasing, but either faster or slower. In this right plot,
59 00:04:53,160 --> 00:05:01,950 initially when X was increasing, Y was increasing at a larger pace, but eventually Y starts increasing 60 00:05:01,950 --> 00:05:02,880 at a slower pace. 61 00:05:03,990 --> 00:05:07,380 This type of distribution is that of a logarithmic function. 62 00:05:09,620 --> 00:05:16,820 If we move to the next graph, Y starts increasing slowly, but as we move ahead and have higher 63 00:05:16,820 --> 00:05:21,290 values of X, the increase in the value of Y is higher. 64 00:05:23,090 --> 00:05:26,180 This type of distribution is an exponential distribution. 65 00:05:27,890 --> 00:05:34,850 And lastly, if you look at this distribution, all the points seem uniformly distributed on this graph. 66 00:05:35,630 --> 00:05:43,070 There is no particular functional distribution which seems to satisfy the relationship between these 67 00:05:43,070 --> 00:05:43,760 two variables. 68 00:05:44,060 --> 00:05:50,270 We can't really say with confidence that increasing X will increase Y or decrease Y. 69 00:05:50,840 --> 00:05:54,810 So when there is no recognizable pattern, we can choose to discard it 70 00:05:55,280 --> 00:06:01,500 based on our business knowledge, or we can even keep it still and let the model discard it later. 71 00:06:02,970 --> 00:06:09,710 For graphs two and three, when the relationship between the two variables is of some other functional form, 72 00:06:09,920 --> 00:06:11,570 we will need to transform the variable. 73 00:06:12,890 --> 00:06:17,690 We'll see how to transform a variable in the next couple of slides. 74 00:06:20,030 --> 00:06:27,560 So now let us focus on variable transformation. One type of variable transformation is what I told 75 00:06:27,560 --> 00:06:28,550 you in the previous slide.
76 00:06:28,670 --> 00:06:35,900 That is, if the dependent and independent variables are not related linearly, but in some other functional 77 00:06:35,900 --> 00:06:42,770 form, we can modify the independent variable so that the modified version has a more linear relationship 78 00:06:42,770 --> 00:06:43,910 with the dependent variable. 79 00:06:44,720 --> 00:06:49,040 Keep in mind, transforming a variable is not a mandatory thing. 80 00:06:49,490 --> 00:06:56,540 We are only transforming variables with the hope that it will eventually fit the model better. Once we 81 00:06:56,540 --> 00:06:58,280 know how to run the model, 82 00:06:58,940 --> 00:07:05,770 I would suggest that you run the model without doing the transformations and then with the transformations, 83 00:07:06,080 --> 00:07:10,070 so that you can see which of these two models fits your data better. 84 00:07:10,730 --> 00:07:16,490 Now we can do other types of variable transformations also, so that our variables give us more information 85 00:07:16,730 --> 00:07:18,050 or they become more relevant. 86 00:07:19,310 --> 00:07:26,840 One type of transformation is where we have two or more independent variables depicting a similar type 87 00:07:26,840 --> 00:07:33,650 of data and having a similar type of relation with the dependent variable. In such a case, 88 00:07:33,680 --> 00:07:40,550 we can decide to take the average of the values of these variables and put it into a new variable and 89 00:07:40,550 --> 00:07:43,040 use this variable instead of the other variables. 90 00:07:44,280 --> 00:07:46,500 Let me give you an example of how this is done. 91 00:07:47,640 --> 00:07:50,400 Let us go back to our dataset of house pricing. 92 00:07:51,450 --> 00:07:59,280 If you look carefully, we have these four variables which tell us the distance of the house from four different 93 00:07:59,340 --> 00:08:00,510 employment locations.
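The averaging idea described above can be sketched as follows. This is only an illustration under stated assumptions: the column names `dist1` to `dist4`, the new name `avg_dist`, the helper `add_mean_distance` and the sample rows are all hypothetical and are not taken from the course dataset.

```python
# Hypothetical rows: price plus four distance-to-employment columns.
houses = [
    {"price": 24.0, "dist1": 4.0, "dist2": 3.0, "dist3": 5.0, "dist4": 4.0},
    {"price": 21.6, "dist1": 5.0, "dist2": 4.5, "dist3": 5.5, "dist4": 5.0},
    {"price": 34.7, "dist1": 2.0, "dist2": 1.0, "dist3": 3.0, "dist4": 2.0},
]

DIST_COLS = ["dist1", "dist2", "dist3", "dist4"]


def add_mean_distance(rows, cols, new_name="avg_dist"):
    """Replace the columns in `cols` with a single column holding their mean."""
    transformed = []
    for row in rows:
        # Keep every column except the ones being averaged.
        new_row = {key: value for key, value in row.items() if key not in cols}
        new_row[new_name] = sum(row[c] for c in cols) / len(cols)
        transformed.append(new_row)
    return transformed


transformed = add_mean_distance(houses, DIST_COLS)
for row in transformed:
    print(row)
```

The original rows are left untouched, so the model can still be run on both versions and compared.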
94 00:08:02,140 --> 00:08:10,180 Basically, all four of these are trying to represent the availability of job opportunities in the nearby 95 00:08:10,180 --> 00:08:10,780 locations. 96 00:08:11,230 --> 00:08:16,300 Probably because we think house prices are higher where jobs are easy to get. 97 00:08:17,640 --> 00:08:23,890 Now, since all four of these are trying to represent the same thing, having four variables is actually 98 00:08:23,890 --> 00:08:26,200 over-representing this feature of employment. 99 00:08:27,190 --> 00:08:34,840 Also, house prices may not relate to each one of these individually, but there is a higher chance that 100 00:08:34,900 --> 00:08:39,010 it will relate to a variable which represents the overall ease of getting a job. 101 00:08:41,530 --> 00:08:48,700 And one way of creating that could be to get the mean distance to employment locations. So we can create 102 00:08:48,700 --> 00:08:56,020 a new variable, get the mean of these four values for each of the observations stored in it, and then 103 00:08:56,020 --> 00:08:58,420 remove these four variables from our dataset. 104 00:08:59,500 --> 00:09:05,380 This is one way in which we can transform multiple variables representing the same thing into one new 105 00:09:05,380 --> 00:09:10,960 variable which will likely have a better relationship with the dependent variable. 106 00:09:14,460 --> 00:09:19,260 The next method of transformation is creating ratio variables which are relevant to the business. 107 00:09:20,400 --> 00:09:23,530 What this means is, for example, 108 00:09:23,700 --> 00:09:29,250 suppose that the price of a house depends on the quality of education maintained in the locality or the 109 00:09:29,250 --> 00:09:29,670 city. 110 00:09:31,170 --> 00:09:36,180 So we may think that probably the number of schools in the region or the number of teachers in the region 111 00:09:36,630 --> 00:09:37,800 can reflect this feature.
112 00:09:39,060 --> 00:09:44,450 But if you think about it, quality will not depend just on the number of teachers, but on their density, 113 00:09:44,490 --> 00:09:47,010 that is, how many teachers per thousand students. 114 00:09:48,040 --> 00:09:52,030 Also, bigger cities tend to have more population and a larger number of schools. 115 00:09:52,810 --> 00:09:57,940 So the number of schools or teachers will not be independent in the true sense in such a case. 116 00:09:58,990 --> 00:10:05,410 Therefore, in a particular business context, ratio variables often make more sense. 117 00:10:05,830 --> 00:10:10,420 So transform such variables to ratio variables before the analysis. 118 00:10:11,410 --> 00:10:17,260 The third point is what we discussed in the previous slide. In the next two slides, I will be showing you 119 00:10:17,260 --> 00:10:23,260 some common types of relationships between two variables and how to transform one of them so that the 120 00:10:23,260 --> 00:10:26,110 relationship becomes closer to linear. 121 00:10:28,960 --> 00:10:31,510 So this first graph depicts the shape of 122 00:10:33,260 --> 00:10:35,870 Y as the natural log of X. 123 00:10:37,330 --> 00:10:43,660 If the shape of your scatterplot looks like this, ignore the scale and ignore where it is cutting 124 00:10:43,660 --> 00:10:45,480 the axis, just match the shape. 125 00:10:46,420 --> 00:10:55,200 If the shape is similar, you need to take the log of this variable. So instead of X, use the log 126 00:10:55,210 --> 00:10:55,840 of X. 127 00:10:57,300 --> 00:11:05,550 If your X has values between zero and one, it is advisable that you add a constant to X, that is, 128 00:11:06,240 --> 00:11:15,300 add one to X and then take the log, because the log within the range of zero to one behaves in a very erratic 129 00:11:15,300 --> 00:11:15,600 manner.
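The log transformation with the "add one first" advice can be sketched like this. The helper name `log_transform`, the rule for when to apply the shift, and the sample values are assumptions made for illustration; the lecture only says to add one when X has values between zero and one.

```python
import math


def log_transform(xs):
    """Return the natural log of each value, adding 1 first when small values are present."""
    if min(xs) < 0:
        raise ValueError("log transform needs non-negative values")
    # Between 0 and 1 the log is negative and falls steeply towards minus infinity,
    # so shift by 1 whenever the data reaches below 1 (assumed rule of thumb).
    shift = 1.0 if min(xs) < 1.0 else 0.0
    return [math.log(x + shift) for x in xs]


# Hypothetical values, not from the course dataset.
small_values = [0.0, 0.5, 1.0, 3.0, 9.0]
print([round(v, 3) for v in log_transform(small_values)])
```

With the shift applied, a value of zero maps to zero instead of being undefined, and the transformed values stay non-negative.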
130 00:11:17,230 --> 00:11:24,480 Once you do this transformation, this newly created variable, which contains the value of log X, will 131 00:11:24,490 --> 00:11:27,260 have a more linear relationship with Y. 132 00:11:28,900 --> 00:11:39,190 Similarly, in the second graph, Y has an exponential relationship with X. In such a 133 00:11:39,190 --> 00:11:39,640 case, 134 00:11:40,270 --> 00:11:44,290 you need to take e to the power X instead of X. 135 00:11:46,830 --> 00:11:52,830 Remember that your graph may not look exactly the same, but its shape should be similar. 136 00:11:53,610 --> 00:11:57,870 It may not intersect the axis at the points that we have shown. 137 00:11:58,860 --> 00:12:00,870 It would just look similar to this 138 00:12:01,010 --> 00:12:01,710 particular graph. 139 00:12:04,600 --> 00:12:06,250 And the third one is of 140 00:12:07,870 --> 00:12:08,920 a higher-order polynomial. 141 00:12:08,970 --> 00:12:14,620 I mean, if it is X squared, it looks like this. 142 00:12:15,880 --> 00:12:17,290 X cube has a higher slope. 143 00:12:17,740 --> 00:12:20,840 If you look at e to the power X and X squared, 144 00:12:21,040 --> 00:12:22,090 they also look similar. 145 00:12:22,390 --> 00:12:29,800 It is just that the exponential function grows much faster than the polynomial function. So 146 00:12:30,400 --> 00:12:38,380 if your scatterplot is showing something like this, probably we need to take a higher-order value of 147 00:12:38,380 --> 00:12:38,840 X. 148 00:12:39,040 --> 00:12:46,750 So try X squared, try X cube, whichever you think is the type of relationship which X has 149 00:12:46,750 --> 00:12:47,210 with Y. 150 00:12:49,330 --> 00:12:57,190 In this slide I'm showing you the effect of putting a negative sign on a plot of Y versus X. If you 151 00:12:57,190 --> 00:13:00,280 have a plot of X cube, it looks something like this.
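The lecture picks a transformation by matching the shape of the scatterplot by eye. As a rough numeric companion, and purely as an assumption of this sketch rather than something the lecture prescribes, one could also compare how strongly each transformed version of X correlates linearly with Y. The helper names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Candidate transformations discussed in the lecture (log needs positive X).
CANDIDATES = {
    "x": lambda x: x,
    "log(x)": math.log,
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
    "exp(x)": math.exp,
}


def rank_transforms(xs, ys):
    """Return (name, correlation) pairs sorted by absolute correlation, strongest first."""
    scores = [(name, pearson([f(x) for x in xs], ys)) for name, f in CANDIDATES.items()]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data where Y really is the cube of X.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

for name, r in rank_transforms(xs, ys):
    print(f"{name:7s} r = {r:.4f}")
```

On real, noisy data several candidates can score almost equally, so the ranking is a hint to check against the scatterplot, not a replacement for it.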
152 00:13:01,930 --> 00:13:06,280 But if you take a plot of minus X cube, it looks something like this. 153 00:13:08,470 --> 00:13:15,190 This plot is actually just the mirror image of this plot on the Y axis. 154 00:13:16,390 --> 00:13:25,600 So whenever you negate the X and then put a function on it, the new function is just the mirror image 155 00:13:25,720 --> 00:13:33,100 of the previous function. Why we are telling you this is because sometimes you will see a relationship 156 00:13:33,100 --> 00:13:36,580 like this and you will think that it looks like X cube, 157 00:13:36,610 --> 00:13:41,620 but how do I model it? Just take the negative of X and take its cube. 158 00:13:42,160 --> 00:13:43,510 This is how it will look. 159 00:13:44,680 --> 00:13:48,590 As I told you, you just need to remember the shape of different types of plots. 160 00:13:50,410 --> 00:13:53,190 And you can use that information to transform your variable. 161 00:13:54,160 --> 00:14:00,460 This is e to the power X. So e to the power minus X will be a mirror image of this. 162 00:14:00,580 --> 00:14:02,350 So it will look something like this, 163 00:14:02,800 --> 00:14:04,240 the way I'm moving my mouse cursor. 164 00:14:05,800 --> 00:14:12,580 But if I do e to the power X and then negate it, it becomes a mirror image on the X axis. 165 00:14:13,390 --> 00:14:15,790 So this will come down. 166 00:14:16,180 --> 00:14:19,420 So this is a mirror image on the X axis. 167 00:14:21,310 --> 00:14:24,880 So if you see something like this, it is actually 168 00:14:24,940 --> 00:14:27,130 e to the power X with a negative in front of it. 169 00:14:28,120 --> 00:14:29,300 You need not handle this. 170 00:14:29,410 --> 00:14:35,650 If you just take the exponential, your model will also find that it has a negative relationship. 171 00:14:36,550 --> 00:14:40,630 You need not worry about negating the function.
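The two mirror images described above can be checked numerically. This is a minimal sketch with made-up sample points; `fit_line` is a hypothetical helper doing an ordinary least-squares fit of a straight line, standing in for "the model" the lecture refers to.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


xs = [-2, -1, 0, 1, 2]  # hypothetical sample points, symmetric around zero

# Negating X before applying the function mirrors the curve on the Y axis.
cube = [x ** 3 for x in xs]
cube_of_negated_x = [(-x) ** 3 for x in xs]
print(cube, cube_of_negated_x)

# Case 1: Y = -e^X (mirror on the X axis). Feeding plain e^X to the fit is enough,
# because the fit simply returns a negative slope.
y_negated_function = [-math.exp(x) for x in xs]
slope_1, _ = fit_line([math.exp(x) for x in xs], y_negated_function)

# Case 2: Y = e^(-X) (mirror on the Y axis). Here the variable itself must be
# negated before the function is applied.
y_negated_variable = [math.exp(-x) for x in xs]
slope_2, _ = fit_line([math.exp(-x) for x in xs], y_negated_variable)

print(round(slope_1, 6), round(slope_2, 6))
```

The first fit recovers the sign on its own, while the second only becomes a straight-line relationship once X has been negated inside the transformation.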
172 00:14:41,200 --> 00:14:49,870 But if you see something like this, then you have to negate the variable itself. You must negate the variable 173 00:14:49,990 --> 00:14:51,070 and then apply the function. 174 00:14:53,470 --> 00:15:00,710 The second thing I want to emphasize is the effect of adding a constant. If you add a constant to X, 175 00:15:01,330 --> 00:15:04,540 it will shift this entire graph to the left-hand side. 176 00:15:05,560 --> 00:15:11,290 So if you are looking at a graph which has a shape like this and you want to move this graph a little to the 177 00:15:11,290 --> 00:15:13,240 left, add a constant, 178 00:15:13,990 --> 00:15:16,780 by however much you want to shift this graph. 179 00:15:17,940 --> 00:15:21,310 So if you look at X cube, it is passing through zero. 180 00:15:21,670 --> 00:15:23,150 But if I add ten to it, 181 00:15:23,320 --> 00:15:24,850 it is passing through minus 10. 182 00:15:25,840 --> 00:15:28,090 So it is just shifting this whole graph to the left. 183 00:15:30,220 --> 00:15:30,730 Instead, 184 00:15:30,880 --> 00:15:33,490 if you add a constant to the function, 185 00:15:33,970 --> 00:15:38,050 if you add 10 to e to the power X, it will just push it 186 00:15:38,830 --> 00:15:42,010 ten units up. So whatever constant you add here, 187 00:15:42,400 --> 00:15:46,300 it will push the whole graph that many units up on the Y axis. 188 00:15:49,400 --> 00:15:54,650 So basically, you need to remember the shapes of different types of functions, and that you can use 189 00:15:54,830 --> 00:15:55,940 to transform variables. 190 00:15:57,820 --> 00:16:00,460 So in our dataset of house pricing, 191 00:16:01,860 --> 00:16:06,810 we'll be looking at the relationship of crime rate versus house price, and then we'll be transforming 192 00:16:07,200 --> 00:16:13,740 the crime rate variable so that it has a more linear relationship with house price.
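The effect of adding a constant, described above, can be verified with a few lines. The helper names `shifted_input` and `shifted_output` are hypothetical and exist only for this sketch; the constant 10 is the one used in the lecture's example.

```python
import math


def cube(x):
    return x ** 3


def shifted_input(f, c):
    """Add the constant to X before applying f: the graph moves c units to the left."""
    return lambda x: f(x + c)


def shifted_output(f, c):
    """Add the constant after applying f: the graph moves c units up."""
    return lambda x: f(x) + c


cube_left = shifted_input(cube, 10)
exp_up = shifted_output(math.exp, 10)

# X cube crosses zero at X = 0; with 10 added to X it crosses zero at X = -10.
print(cube(0), cube_left(-10))

# e^X equals 1 at X = 0; with 10 added to the function the same point sits 10 units higher.
print(math.exp(0), exp_up(0))
```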