1 00:00:01,070 --> 00:00:07,820 In this video, we will assess the accuracy of the coefficients beta zero and beta one that we have calculated. 2 00:00:09,020 --> 00:00:11,750 I will start by restating the situation that we are in. 3 00:00:13,200 --> 00:00:16,380 In the world, millions of transactions are happening. 4 00:00:16,680 --> 00:00:24,120 We picked a small set, a small sample of 506 such observations, and decided to identify the relationship 5 00:00:24,120 --> 00:00:26,190 between house price and number of rooms. 6 00:00:27,430 --> 00:00:33,270 When we used the formula shown in the previous video, we got the values of beta zero and beta one. 7 00:00:34,470 --> 00:00:38,820 This line minimizes the squared error on the sample of 506 points. 8 00:00:40,060 --> 00:00:46,650 If we had selected some other 506 observations, we might have got some other values of beta zero and beta one. 9 00:00:47,610 --> 00:00:54,350 If we take all the million observations of the world and then fit a line, this line will be called the 10 00:00:54,390 --> 00:00:55,730 population regression line. 11 00:00:56,580 --> 00:01:00,150 This line may or may not be similar to the sample regression line. 12 00:01:01,980 --> 00:01:08,310 You can see in this graph this blue line; it is the sample regression line, which minimizes the error 13 00:01:08,550 --> 00:01:11,340 on these data points, the sample data points. 14 00:01:13,310 --> 00:01:21,400 But this red line is the population regression line, which would have come if I had solved the least squares 15 00:01:21,410 --> 00:01:27,860 equation for all the data points of the population. These two lines are not the same. 16 00:01:29,650 --> 00:01:33,470 Now, the problem is this: we do not have all the observations. 17 00:01:33,610 --> 00:01:35,850 Therefore, we cannot get the population regression line. 18 00:01:36,850 --> 00:01:40,420 We have sample data, so we can get only the sample regression line.
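The point that a different sample of 506 observations gives a different line can be illustrated with a small simulation. This is only a sketch: the "population" below is invented for illustration (the size `N`, the true coefficients, and the noise level are all assumptions, not the lecture's data), and `fit_line` and `sample_fit` are hypothetical helper names.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates (intercept b0, slope b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: invented numbers, not the lecture's housing data.
N = 100_000
pop_x = [rng.gauss(6.3, 0.7) for _ in range(N)]
pop_y = [-34.0 + 9.0 * x + rng.gauss(0.0, 6.0) for x in pop_x]

# Population regression line: fitted on every observation.
pop_b0, pop_b1 = fit_line(pop_x, pop_y)

def sample_fit(size=506):
    """Fit a sample regression line on `size` randomly chosen observations."""
    idx = rng.sample(range(N), size)
    return fit_line([pop_x[i] for i in idx], [pop_y[i] for i in idx])

b0_a, b1_a = sample_fit()
b0_b, b1_b = sample_fit()

print(f"population line: intercept={pop_b0:.2f}, slope={pop_b1:.2f}")
print(f"sample A line:   intercept={b0_a:.2f}, slope={b1_a:.2f}")
print(f"sample B line:   intercept={b0_b:.2f}, slope={b1_b:.2f}")
```

Each sample line should land near the population line without matching it exactly, which is the gap the standard error is meant to quantify.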
19 00:01:42,100 --> 00:01:48,640 We want to use the coefficients of the sample regression line as an estimate for the population regression 20 00:01:48,640 --> 00:01:48,850 line. 21 00:01:50,810 --> 00:01:59,650 How far off will these sample estimates, that is, beta zero hat and beta one hat, be from the population coefficients, 22 00:01:59,960 --> 00:02:07,220 which are beta zero and beta one? For this, we will use a quantity called the standard error of beta zero and beta one. 23 00:02:09,680 --> 00:02:16,520 The standard errors of beta zero and beta one are calculated using these formulas. Again, I am showing you these formulas 24 00:02:16,520 --> 00:02:22,550 and discussing the intuition behind them; you need not memorize them or know their derivation for practical 25 00:02:22,550 --> 00:02:23,090 purposes. 26 00:02:23,750 --> 00:02:29,240 But since these quantities will be part of the model output from our software, we need to understand 27 00:02:29,240 --> 00:02:29,750 their meaning. 28 00:02:31,700 --> 00:02:32,720 So in this formula, 29 00:02:33,700 --> 00:02:37,960 sigma squared is the variance of the population residuals. 30 00:02:39,580 --> 00:02:46,300 Remember, we discussed residuals: a residual is the value which is the difference of the actual y from 31 00:02:46,300 --> 00:02:48,570 the estimated y of the population. 32 00:02:49,700 --> 00:02:55,610 Sigma squared is the variance of these values for all the data points of the population. 33 00:02:56,890 --> 00:03:01,360 Since the population regression line is not known, sigma is also not known. 34 00:03:02,350 --> 00:03:10,960 We need to estimate it from our sample data. This estimate is given by this formula, which is: RSE 35 00:03:10,990 --> 00:03:18,550 is equal to the square root of RSS divided by n minus two. Here, RSE is called the residual standard error. 36 00:03:19,630 --> 00:03:21,070 RSS is the residual sum
37 00:03:21,220 --> 00:03:26,330 of squares, which we discussed in the last video; it is the sum of all the squared errors. 38 00:03:26,620 --> 00:03:32,170 So for all the data points, we get the errors, that is, the residuals. 39 00:03:32,650 --> 00:03:36,220 We square them and we add them; that is the RSS. 40 00:03:37,080 --> 00:03:43,090 We divide RSS by n minus two, where n is the number of observations. For our dataset, 41 00:03:43,110 --> 00:03:47,430 it is 506, so we will divide by 506 minus two. 42 00:03:48,650 --> 00:03:52,760 When we take its square root, we get the residual standard error. 43 00:03:54,550 --> 00:03:58,630 We will use this residual standard error as a proxy for sigma. 44 00:04:00,100 --> 00:04:06,250 Also, notice that if x is more spread out, then in these formulas for beta zero and beta one, 45 00:04:07,720 --> 00:04:15,790 the standard errors of beta zero and beta one will be small, which intuitively means that we have more leverage while 46 00:04:15,790 --> 00:04:17,740 estimating the slope in such a case. 47 00:04:19,120 --> 00:04:26,140 Now, what is the practical takeaway from these theoretical relations? The standard error will be used to give us 48 00:04:26,140 --> 00:04:30,640 a confidence interval. That is, for linear regression, 49 00:04:31,610 --> 00:04:37,610 there is a 95 percent chance that the true value of beta one lies in the interval 50 00:04:38,990 --> 00:04:41,830 from beta one hat minus two times the standard error of 51 00:04:41,850 --> 00:04:46,460 beta one, to beta one hat plus two times the standard error of beta one. 52 00:04:48,340 --> 00:04:55,840 So for this interval, where the lower value is this and the higher value is this, we are 95 percent confident 53 00:04:55,840 --> 00:04:59,410 that the actual beta one lies in this interval. 54 00:05:01,150 --> 00:05:11,070 Similarly, for beta zero, the lower value of the 95 percent confidence interval will be the estimated beta zero minus two times the standard 55 00:05:11,080 --> 00:05:12,430 error, that is, the
56 00:05:13,270 --> 00:05:14,710 standard error of beta zero, 57 00:05:15,860 --> 00:05:23,750 and the higher value will be beta zero hat plus two times the standard error of beta zero. To summarize, what 58 00:05:23,750 --> 00:05:29,500 we have done here is this: we had two lines. One line we got from the sample regression that we did. 59 00:05:30,440 --> 00:05:36,890 The other one is a hypothetical line, which is the true line for the population of all the points. 60 00:05:38,710 --> 00:05:45,520 We wanted to know whether we can use this sample line as an approximation of the population line. 61 00:05:47,150 --> 00:05:50,240 For that, we found out the confidence interval 62 00:05:51,480 --> 00:05:55,230 within which the population coefficient will lie. 63 00:05:56,920 --> 00:06:02,320 So the intercept of the population line will lie between 64 00:06:03,560 --> 00:06:11,150 these two values that we get from the sample regression, and the slope of the population regression line will lie 65 00:06:11,150 --> 00:06:14,150 between these two values from the sample regression. 66 00:06:16,370 --> 00:06:19,620 And we have assigned a probability to how confident we are. 67 00:06:20,000 --> 00:06:27,710 We are saying that there is a 95 percent chance that the population regression coefficients lie within 68 00:06:28,010 --> 00:06:31,220 these intervals that we found out using the sample. 69 00:06:35,160 --> 00:06:41,760 Another use of the standard error is to establish whether x and y actually have a relationship 70 00:06:41,760 --> 00:06:47,010 or not. In a linear model, this relationship is governed by beta one. 71 00:06:48,160 --> 00:06:51,850 We are saying that y is beta one times x plus a constant. 72 00:06:52,990 --> 00:06:57,280 So if beta one is zero, it means that there is no relationship. 73 00:06:58,780 --> 00:07:02,860 If there is no relationship, then the variable x cannot be used to predict y.
74 00:07:04,010 --> 00:07:08,540 So we need to show that the probability of beta one being zero is negligible. 75 00:07:10,920 --> 00:07:15,540 Let me tell you the concept first, and after that, I'll show you the way you will find it in 76 00:07:15,540 --> 00:07:17,160 every other statistics book. 77 00:07:18,060 --> 00:07:26,190 The concept is this: beta one has some value, that is, the most probable value, found out by the formula 78 00:07:26,190 --> 00:07:27,700 we saw in the previous videos. 79 00:07:29,670 --> 00:07:35,200 Now we want to see that we are sufficiently confident that it cannot be zero. 80 00:07:35,790 --> 00:07:37,020 This has two parts to it. 81 00:07:37,530 --> 00:07:41,490 First, the most probable value should be far from zero. 82 00:07:42,360 --> 00:07:49,230 And second, the standard error of beta one, which gives us the interval in which the true value lies, should be 83 00:07:49,230 --> 00:07:49,650 small. 84 00:07:51,450 --> 00:08:00,200 So basically, we want zero not to lie anywhere in this whole interval. I hope you get the idea. 85 00:08:01,210 --> 00:08:06,730 Let's do it the proper way now. This method is called hypothesis testing. 86 00:08:07,750 --> 00:08:09,800 We will construct two hypotheses. 87 00:08:09,820 --> 00:08:14,770 One is H zero, the null hypothesis, that is, there is no relationship between x and y. 88 00:08:15,780 --> 00:08:25,860 And its alternative, which is written as H a, is that there is a relationship between x and y. So H zero 89 00:08:27,360 --> 00:08:31,130 will be: beta one is equal to zero, and H a is: beta one is not equal to zero. 90 00:08:32,380 --> 00:08:34,600 And we want to disprove H zero. 91 00:08:36,720 --> 00:08:43,680 To disprove H zero, we will calculate something known as a t-statistic, which is given as t equal 92 00:08:43,680 --> 00:08:48,330 to beta one hat minus zero, divided by the standard error of beta one hat.
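The slides referred to as "these formulas" are not reproduced in this transcript. In the usual textbook notation for simple linear regression (an assumption about what the slides show, not a copy of them), the quantities discussed so far can be written as follows, where $\bar{x}$ is the sample mean of $x$ and $\hat{y}_i$ is the fitted value for observation $i$:

```latex
\begin{align*}
\mathrm{SE}(\hat\beta_0)^2 &= \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right] \\
\mathrm{SE}(\hat\beta_1)^2 &= \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \\
\mathrm{RSS} &= \sum_{i=1}^{n}(y_i-\hat{y}_i)^2, \qquad
\sigma \approx \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}} \\
\text{approximate 95\% interval for } \beta_1 &: \quad
\left[\hat\beta_1 - 2\,\mathrm{SE}(\hat\beta_1),\ \hat\beta_1 + 2\,\mathrm{SE}(\hat\beta_1)\right] \\
t &= \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}
\end{align*}
```

The denominator $\sum (x_i-\bar{x})^2$ is what makes the standard error of the slope shrink when $x$ is more spread out.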
93 00:08:50,130 --> 00:08:58,410 So you can see what this is representing here. In the numerator, it says how far beta one is from 94 00:08:58,410 --> 00:08:58,770 zero. 95 00:08:59,940 --> 00:09:05,170 And when you divide it by the standard error, you get that distance as a multiple of the standard error. 96 00:09:06,210 --> 00:09:13,050 So using this formula, we will get the t-value. We want this t-value to be large. 97 00:09:14,190 --> 00:09:19,260 By the way, it is called the t-value because it is based on the t-distribution, which is similar to the normal 98 00:09:19,260 --> 00:09:19,860 distribution. 99 00:09:21,400 --> 00:09:27,310 Basically, the t-distribution is another probability distribution, and if you have the t-value, you can 100 00:09:27,310 --> 00:09:32,920 get the probability of observing any value equal to the absolute value of t or larger. 101 00:09:34,660 --> 00:09:39,250 The probability that you will get from this distribution is called the p-value. 102 00:09:42,380 --> 00:09:44,090 Now, this p-value has meaning. 103 00:09:45,900 --> 00:09:48,230 We can interpret the p-value as follows. 104 00:09:49,400 --> 00:09:55,720 A small value of p will mean that it is highly unlikely that there is no relationship between predictor 105 00:09:55,730 --> 00:09:56,420 and response. 106 00:09:58,850 --> 00:10:04,160 Which means that we can reject the null hypothesis and declare that there is a relationship between 107 00:10:04,160 --> 00:10:04,780 x and y. 108 00:10:07,680 --> 00:10:08,790 Typically, we use 109 00:10:09,810 --> 00:10:16,080 a value like five percent or one percent as the cutoff value for p. That is, with a one percent cutoff, if p is less than 110 00:10:16,080 --> 00:10:21,810 0.01, then the variable x is significantly impacting y. 111 00:10:26,150 --> 00:10:28,400 Let us look at the results of our sample.
112 00:10:32,000 --> 00:10:36,920 Note that this is a result which we have received from the software package that we are using, and of 113 00:10:36,920 --> 00:10:44,660 course there is a separate video where you will learn how to run this analysis and get this result. 114 00:10:45,770 --> 00:10:51,110 I'm just discussing the result here as an example for all the theory that we just covered. 115 00:10:52,520 --> 00:10:59,630 So in the last video, we had just seen the beta one and beta zero values. This intercept is the beta zero, 116 00:11:00,380 --> 00:11:08,240 and this room variable has this slope; this 9.09 is the beta one for this variable. 117 00:11:10,930 --> 00:11:13,040 The first thing that we covered is the standard error 118 00:11:13,360 --> 00:11:14,220 of beta zero 119 00:11:14,260 --> 00:11:14,890 and beta one. 120 00:11:16,810 --> 00:11:22,800 The formula that I showed you earlier was used to calculate the standard errors of beta zero and beta one. 121 00:11:24,010 --> 00:11:33,070 How can the standard error be used? You can make the statement that, for the sample of 506 observations, 122 00:11:33,550 --> 00:11:37,360 the beta one that you calculated for minimum error was 9.09. 123 00:11:39,340 --> 00:11:48,700 But you are 95 percent confident that for the global data of house pricing, the beta one will lie between 124 00:11:48,880 --> 00:11:52,780 9.09 minus two times 0.41 125 00:11:54,050 --> 00:11:58,460 and 9.09 plus two times 0.41. 126 00:12:00,750 --> 00:12:06,930 So this interval tells you that the estimated betas that you have here 127 00:12:08,140 --> 00:12:15,940 can be used to make the statement that the true regression coefficients will lie in the interval given 128 00:12:15,940 --> 00:12:16,930 by this standard error. 129 00:12:18,870 --> 00:12:22,890 The next thing we discussed was the t-value and the p-value.
130 00:12:25,240 --> 00:12:31,840 For this t-value and p-value, we said that we have one hypothesis, which says that there is no relationship. 131 00:12:33,870 --> 00:12:39,270 To disprove that hypothesis, we calculated the t-value, which was given by this formula. 132 00:12:42,260 --> 00:12:49,730 So to get the t-value of this beta one, we take the beta one value divided by its standard error. 133 00:12:49,910 --> 00:12:55,100 So 9.09 divided by 0.41 gives you this t-value. 134 00:12:56,170 --> 00:13:02,800 So corresponding to this t-value, we calculate a p-value, which is written as the probability of getting 135 00:13:02,800 --> 00:13:07,120 a t-value which is greater than or equal to the t that we calculated. 136 00:13:08,090 --> 00:13:10,880 This probability value is coming out to be very small. 137 00:13:10,910 --> 00:13:17,280 It's an exponential with the power of minus 16, so this value is very small. 138 00:13:18,620 --> 00:13:29,510 So if this value is very small, we can say that it is unlikely that there is no relationship, which 139 00:13:29,510 --> 00:13:32,300 means that there is some relationship. 140 00:13:33,840 --> 00:13:43,290 And therefore, we are confident that the room number variable is impacting the house price, and the relationship 141 00:13:43,290 --> 00:13:46,440 between them is given by this equation. 142 00:13:48,470 --> 00:13:54,410 So the takeaway from this lecture is: we have beta one and beta zero values calculated from the sample. 143 00:13:55,840 --> 00:14:01,780 We calculated a standard error, which is helping us determine two things. One is the range 144 00:14:01,780 --> 00:14:04,540 in which the true values of beta one and beta zero lie. 145 00:14:05,140 --> 00:14:11,230 And the second thing is whether there is actually a relationship between the predictor and the response 146 00:14:11,230 --> 00:14:11,740 variables.
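The arithmetic walked through for the room variable can be reproduced in a few lines. This is a sketch using only the two figures quoted in the lecture (slope 9.09, standard error 0.41); the variable names are illustrative. The p-value line is an assumption: it uses a normal approximation to the t-distribution rather than the exact t-distribution with n minus two degrees of freedom, so the number it prints will not match what a statistics package reports.

```python
import math

beta1_hat = 9.09  # slope for the room variable, as quoted in the lecture
se_beta1 = 0.41   # its standard error, as quoted in the lecture

# Approximate 95% confidence interval: estimate +/- 2 standard errors.
ci_low = beta1_hat - 2 * se_beta1   # about 8.27
ci_high = beta1_hat + 2 * se_beta1  # about 9.91

# t-statistic for the null hypothesis beta one = 0.
t_value = (beta1_hat - 0) / se_beta1  # about 22.17

# Assumption: normal approximation to the t-distribution.
# Two-sided probability of a value at least as extreme as |t|.
p_value = math.erfc(abs(t_value) / math.sqrt(2))

print(f"95% interval for beta one: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"t-value: {t_value:.2f}")
print(f"approximate p-value: {p_value:.3g}")
print("reject H zero at the 1% level:", p_value < 0.01)
```

Zero lies far outside the interval and the p-value is far below either cutoff, which is the same conclusion the lecture reaches from the software output.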
147 00:14:12,840 --> 00:14:19,020 To establish that there is a relationship, we calculated a t-value using these two things. Using the 148 00:14:19,050 --> 00:14:26,280 t-value, we calculated a p-value, and if this p-value is less than a threshold of one percent or 149 00:14:26,280 --> 00:14:30,600 five percent, whichever you like to use, then we say that there is a relationship. 150 00:14:31,930 --> 00:14:37,750 So for our model, there is a relationship of house price with room number, and that relationship 151 00:14:37,750 --> 00:14:39,860 is this one, which is nine point.