1 00:00:02,290 --> 00:00:10,950 Let's understand the mathematics behind regression. In this session, we will use an example to get started. 2 00:00:11,990 --> 00:00:13,580 The example I'm considering 3 00:00:14,670 --> 00:00:15,240 here is this: 4 00:00:17,250 --> 00:00:23,180 A student studies for a certain number of hours and gets different marks. 5 00:00:23,970 --> 00:00:28,850 We are trying to establish a correlation between the number of hours studied and the marks obtained. 6 00:00:29,310 --> 00:00:34,110 OK, I'm going to use Excel to create the scatterplot. 7 00:00:34,440 --> 00:00:41,070 OK, this can be created using Python or R as well, OK, or any programming language for that matter. 8 00:00:42,840 --> 00:00:50,190 If you see this view, a scatterplot is used in regression, and it shows the extent of the relationship 9 00:00:50,550 --> 00:00:52,660 between X and Y. 10 00:00:53,220 --> 00:00:57,360 What are the dependent variable and the independent variable in this case? 11 00:00:58,340 --> 00:01:07,400 Marks obtained is the dependent variable, and number of hours studied is the independent variable, because marks obtained 12 00:01:07,730 --> 00:01:10,320 is dependent on the number of hours studied. 13 00:01:11,120 --> 00:01:11,430 Right. 14 00:01:12,380 --> 00:01:15,700 Number of hours studied is not dependent on anything. 15 00:01:15,740 --> 00:01:20,450 OK, maybe it's dependent on something, but we are not concerned with that for now. 16 00:01:20,620 --> 00:01:26,840 OK, we are interested in establishing the relationship between number of hours studied and the marks. 17 00:01:27,410 --> 00:01:30,740 OK, so how will you go about creating this in Excel? 18 00:01:31,250 --> 00:01:39,580 So you first select the data points, come to Insert, and then click the option here. 19 00:01:39,620 --> 00:01:45,410 OK, and then you click this scatterplot, OK, and you will have a graph like this. 20 00:01:45,980 --> 00:01:46,320 Right.
21 00:01:46,640 --> 00:01:49,780 So that is the starting point for you. And here, 22 00:01:50,210 --> 00:01:53,190 the relationship between the two is linear. 23 00:01:54,120 --> 00:01:58,350 OK, it is not logarithmic or anything else. 24 00:01:58,370 --> 00:01:59,600 So what is logarithmic? 25 00:01:59,610 --> 00:02:01,490 We are going to see that shortly. 26 00:02:02,130 --> 00:02:07,940 When I say that the relationship is linear, it means I can show this relationship in the 27 00:02:07,940 --> 00:02:09,860 form of a straight line. 28 00:02:10,760 --> 00:02:11,150 Right. 29 00:02:11,540 --> 00:02:14,120 So that is known as a linear relationship. 30 00:02:14,420 --> 00:02:19,910 OK, I can show the relationship between my dependent variable and independent variable in the form 31 00:02:19,910 --> 00:02:21,080 of a straight line. 32 00:02:21,440 --> 00:02:24,130 And what is the straight line that I'm talking about? 33 00:02:24,380 --> 00:02:29,450 I am going to fit a line that passes as close as possible to all these data points. 34 00:02:30,050 --> 00:02:33,320 In fact, that line is known as the line of best fit. 35 00:02:34,230 --> 00:02:37,690 Right, so let's create the line of best fit. How will I do it? 36 00:02:37,920 --> 00:02:44,160 I first select these data points. Just click one of the data points and right- 37 00:02:44,160 --> 00:02:46,920 click. You can see these options will be there. 38 00:02:47,400 --> 00:02:48,630 Click Add Trendline. 39 00:02:50,620 --> 00:02:57,130 And you see how this straight line came up, and you also choose Display Equation on chart. 40 00:02:57,680 --> 00:03:03,970 OK, you can also choose Display R-squared value. We covered R-squared in the previous session. 41 00:03:04,480 --> 00:03:06,330 So I display the equation. 42 00:03:06,350 --> 00:03:12,910 So this equation shows the relationship between the dependent variable and the independent variable.
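What the Excel trendline does can be sketched in Python as an ordinary least-squares fit. This is only a minimal illustration: the data values below are hypothetical (the spreadsheet values are not in the transcript), and the function names `fit_line` and `r_squared` are made up for this sketch.

```python
# Hypothetical data: the session's actual spreadsheet values are not given.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]       # independent variable (x)
marks = [52.0, 64.0, 75.0, 85.0, 96.0]  # dependent variable (y)


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(hours, marks)
r2 = r_squared(hours, marks, slope, intercept)
print(f"marks = {slope:.3f} * hours + {intercept:.3f}")
print(f"R^2 = {r2:.4f}")
```

An R-squared close to 1 means the straight line describes the points well, which is the same number Excel shows when Display R-squared value is ticked.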
43 00:03:13,970 --> 00:03:23,400 If you recollect your mathematics, this equation looks like y equals mx plus c. 44 00:03:23,900 --> 00:03:32,180 That means marks obtained equals 10.862 times number of hours studied plus 45 00:03:32,180 --> 00:03:33,800 41.901. 46 00:03:34,700 --> 00:03:41,720 Right. If, let's say, a student comes and tells me that he has studied 47 00:03:41,730 --> 00:03:49,620 for 1.25 hours, all I need to do is put 1.25 here. 48 00:03:49,640 --> 00:03:56,480 That is, 10.862 times 1.25 plus 41.901 will give 49 00:03:56,870 --> 00:04:03,400 the marks the student will get when he or she studies for 1.25 hours. 50 00:04:03,650 --> 00:04:08,720 So I have not only established the fact, I can also predict the future. 51 00:04:09,750 --> 00:04:10,110 Right. 52 00:04:10,410 --> 00:04:11,080 Are you getting it? 53 00:04:12,120 --> 00:04:14,610 So what is this y equals mx plus c? 54 00:04:15,570 --> 00:04:19,280 m is known as the gradient and c is known as the intercept. 55 00:04:19,920 --> 00:04:30,480 This intercept corresponds to the value where the line meets the y-axis, and the slope is nothing but this 56 00:04:30,480 --> 00:04:32,850 distance divided by this distance. 57 00:04:32,860 --> 00:04:37,050 It is also known as delta y divided by delta x. 58 00:04:37,470 --> 00:04:42,690 In this case, this distance is four and this distance is two. Four 59 00:04:42,690 --> 00:04:44,310 divided by two is two. 60 00:04:44,580 --> 00:04:54,300 I can also draw a smaller triangle, and the same height divided by base will give the same value 61 00:04:54,300 --> 00:04:54,780 of m. 62 00:04:55,650 --> 00:04:59,760 OK, I can draw a smaller triangle or a slightly bigger triangle also. 63 00:05:00,060 --> 00:05:01,320 OK, whatever.
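The prediction and the gradient idea can be checked with a few lines of Python. This is a small sketch: the two coefficients are the ones read out from the trendline in the session, while the function names `predict_marks` and `gradient` are hypothetical names chosen for illustration.

```python
# Coefficients read out in the session from the Excel trendline equation.
SLOPE = 10.862      # m: extra marks per additional hour studied
INTERCEPT = 41.901  # c: value where the line meets the y-axis


def predict_marks(hours):
    """Apply y = m*x + c to a number of hours studied."""
    return SLOPE * hours + INTERCEPT


def gradient(x1, y1, x2, y2):
    """Slope between two points: delta y divided by delta x."""
    return (y2 - y1) / (x2 - x1)


# A student who studied for 1.25 hours.
predicted = predict_marks(1.25)
print(round(predicted, 4))

# A big triangle and a small triangle drawn on the same line
# give the same height-over-base ratio.
big = gradient(0.0, predict_marks(0.0), 4.0, predict_marks(4.0))
small = gradient(1.0, predict_marks(1.0), 1.5, predict_marks(1.5))
print(round(big, 3), round(small, 3))
```

Both triangles return the slope of the line itself, which is why the size of the triangle does not matter.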
64 00:05:01,320 --> 00:05:09,120 You can draw any triangle on this line, and when you compute the height divided by base, the value of m 65 00:05:09,120 --> 00:05:11,930 that you get will be the same, right. 66 00:05:12,810 --> 00:05:14,490 Are you understanding linear regression? 67 00:05:15,790 --> 00:05:20,830 OK, number of hours studied and marks obtained. OK, in this case, 68 00:05:22,030 --> 00:05:30,040 the line meets the y-axis at minus four, so the intercept is minus four. And the gradient: see, I've drawn 69 00:05:30,040 --> 00:05:30,890 a smaller triangle. 70 00:05:30,910 --> 00:05:33,280 I didn't draw a bigger triangle like this. 71 00:05:33,940 --> 00:05:41,500 And the same height divided by base, two divided by one, will give me the slope or gradient. 72 00:05:45,070 --> 00:05:52,000 And you can have more than one variable also. In the example that I just shared, number of 73 00:05:52,000 --> 00:05:54,850 hours studied is the one independent variable. 74 00:05:55,770 --> 00:06:02,580 And in the case of multiple linear regression, you will have more than one independent variable. 75 00:06:03,480 --> 00:06:08,040 Please note that the dependent variable will always be only one, right? 76 00:06:08,610 --> 00:06:14,290 So here I am looking at two variables, right, in the case of multiple linear regression. 77 00:06:14,940 --> 00:06:19,940 I'm trying to predict what will be the quantum of carbon dioxide emission from a vehicle. 78 00:06:20,520 --> 00:06:29,090 And I am predicting that using two independent variables, volume of the vehicle and weight of the vehicle. 79 00:06:29,950 --> 00:06:30,310 Right. 80 00:06:30,540 --> 00:06:33,570 So this is a case of multiple linear regression. 81 00:06:34,600 --> 00:06:34,940 Right. 82 00:06:35,320 --> 00:06:41,140 I don't have just one factor, but I have more than one factor, and here also the relationship is linear. 83 00:06:41,170 --> 00:06:45,430 That means I can express this in the form of a straight line.
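With two independent variables the same idea extends to one intercept plus one coefficient per variable. The sketch below only shows the shape of such a model: the coefficient values and the function name `predict_co2` are hypothetical, since the session does not give fitted numbers or units.

```python
# Hypothetical coefficients for illustration only; not fitted values from the session.
B0 = 80.0    # intercept
B1 = 0.008   # change in CO2 emission per unit of vehicle volume
B2 = 0.0075  # change in CO2 emission per unit of vehicle weight


def predict_co2(volume, weight):
    """Multiple linear regression form: y = b0 + b1*x1 + b2*x2."""
    return B0 + B1 * volume + B2 * weight


# One dependent variable (CO2 emission), two independent variables.
example = predict_co2(volume=1300, weight=1000)
print(round(example, 2))

# Holding weight fixed, one extra unit of volume changes the output by B1.
step = predict_co2(1301, 1000) - predict_co2(1300, 1000)
print(round(step, 4))
```

Each coefficient plays the role that the gradient m played in the single-variable case, one per independent variable.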
84 00:06:46,360 --> 00:06:49,550 Right. Now, let's see logistic regression. 85 00:06:50,060 --> 00:06:52,990 OK, let's understand that with an example. 86 00:06:53,960 --> 00:07:00,050 Here we are trying to predict the malignancy of a tumor. What is malignancy? It means whether the 87 00:07:00,050 --> 00:07:04,790 tumor is really cancerous or not. Malignancy means cancer. 88 00:07:05,720 --> 00:07:09,480 I am trying to predict malignancy based on tumor size. 89 00:07:09,620 --> 00:07:18,200 I have plotted instances of tumor size and malignancy here, and I have represented them in the form of a 90 00:07:18,200 --> 00:07:18,560 graph. 91 00:07:19,520 --> 00:07:28,100 If I try to fit a linear relationship to this, which is a straight line, I cannot fit all the points, 92 00:07:28,100 --> 00:07:28,280 right. 93 00:07:28,340 --> 00:07:35,540 As you can see here, if I try to fit a straight line like this, even if I move that line a bit to 94 00:07:35,540 --> 00:07:38,630 the left, these points will be left out. 95 00:07:39,590 --> 00:07:39,960 Right. 96 00:07:40,640 --> 00:07:48,290 So my fit is not good at all. A straight line cannot cover all the data points. 97 00:07:49,200 --> 00:07:56,730 If you recollect the linear regression, the straight line covered almost all the data points. That 98 00:07:56,730 --> 00:07:58,140 means the fit was good. Here, 99 00:07:58,440 --> 00:08:03,080 the fit is so bad that you cannot have a linear regression. 100 00:08:03,630 --> 00:08:09,540 You need what is known as logistic regression, because when it is in the form of a curve like this, 101 00:08:09,870 --> 00:08:12,780 it covers all the data points. 102 00:08:13,710 --> 00:08:16,930 Right. Now, do you understand why? 103 00:08:17,040 --> 00:08:21,800 Because I have a curve, it covers all the data points, almost all the data points, OK?
104 00:08:22,380 --> 00:08:30,840 And in the case of a linear model, I have what is known as y equals mx plus c, or b0 plus b1 x. 105 00:08:31,680 --> 00:08:35,480 b1 is the slope and b0 is the intercept. 106 00:08:35,880 --> 00:08:41,820 In the case of logistic regression, we use what is known as the sigmoid function, OK, which is nothing 107 00:08:41,820 --> 00:08:47,660 but 1 divided by (1 plus e to the power of minus (b0 plus b1 x)). 108 00:08:48,150 --> 00:08:52,680 This e to the power of minus term is an exponential, the inverse of the natural logarithm. 109 00:08:52,960 --> 00:08:53,360 Right. 110 00:08:54,030 --> 00:08:59,550 And the output: in the case of linear regression, the output is a numeric value. 111 00:08:59,970 --> 00:09:05,880 In the case of a logistic regression or a logit model, the output is a probability. 112 00:09:06,180 --> 00:09:06,540 Right. 113 00:09:06,930 --> 00:09:13,390 That is, the probability that the tumor is cancerous; it will have a value between zero and one. 114 00:09:14,160 --> 00:09:17,820 Are you understanding the difference between a linear relationship and a 115 00:09:18,920 --> 00:09:20,660 logistic relationship? 116 00:09:21,440 --> 00:09:31,940 OK, let's see an example where we compare linear and logistic using the same independent variables. 117 00:09:31,970 --> 00:09:34,610 OK, let's take the carbon dioxide emission. 118 00:09:34,970 --> 00:09:38,860 I'm using these two, x1 and x2: vehicle volume and vehicle weight. 119 00:09:38,880 --> 00:09:40,580 Those are my independent variables. 120 00:09:41,030 --> 00:09:44,300 If I'm trying to predict the quantum of carbon dioxide emission, 121 00:09:44,780 --> 00:09:52,250 I use multiple linear regression. If I predict the likelihood of carbon dioxide emission coming 122 00:09:52,250 --> 00:09:56,130 from that vehicle, I use logistic regression. 123 00:09:56,180 --> 00:09:59,060 Please note here, I'm using the word likelihood.
124 00:09:59,060 --> 00:10:03,620 That means it is a probability, which means it can have a value between zero and one. 125 00:10:04,630 --> 00:10:05,340 Are you getting it? 126 00:10:06,570 --> 00:10:17,460 In logistic regression, one more aspect: because the dependent variable has a value between zero 127 00:10:17,460 --> 00:10:22,200 and one, it helps to bring your x1 and x2 onto a comparable scale as well. 128 00:10:22,900 --> 00:10:28,170 But you see here, the values here are not between zero and one. 129 00:10:28,170 --> 00:10:28,500 Right. 130 00:10:28,890 --> 00:10:30,110 They are large numeric values. 131 00:10:30,660 --> 00:10:33,820 So I need to rescale the numeric values, for example into 132 00:10:35,040 --> 00:10:40,410 logarithmic values, and then I will apply logistic regression. 133 00:10:41,750 --> 00:10:44,300 Are you understanding this clearly?
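The sigmoid function used in logistic regression can be sketched in Python as follows. This is a minimal illustration for the tumor-size example: the coefficients `B0` and `B1` are hypothetical values picked so that the curve crosses 0.5 at a tumor size of 3, and `predict_probability` is a made-up helper name.

```python
import math


def sigmoid(z):
    """Standard logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, b0, b1):
    """Logistic model: pass the linear part b0 + b1*x through the sigmoid."""
    return sigmoid(b0 + b1 * x)


# Hypothetical coefficients, not fitted values from the session.
B0 = -6.0
B1 = 2.0

for size in (1.0, 3.0, 5.0):
    p = predict_probability(size, B0, B1)
    print(f"tumor size {size}: probability of malignancy {p:.3f}")
```

Whatever value the linear part takes, the output stays strictly between zero and one, which is why it can be read as a probability, unlike the unbounded output of a straight line.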