1 00:00:00,866 --> 00:00:01,800 Hello and welcome back. 2 00:00:01,800 --> 00:00:03,500 Today we've got a very exciting tutorial. 3 00:00:03,500 --> 00:00:06,433 We're talking about the logistic regression. 4 00:00:06,433 --> 00:00:09,666 The logistic regression is used to predict a categorical 5 00:00:09,666 --> 00:00:13,166 dependent variable from a number of independent variables. 6 00:00:13,366 --> 00:00:17,066 So the key difference to a linear regression 7 00:00:17,066 --> 00:00:19,933 here is that we're not predicting a continuous variable. 8 00:00:19,933 --> 00:00:22,033 We're predicting a categorical variable. 9 00:00:22,033 --> 00:00:25,633 For example, you might be working for an insurance company. 10 00:00:25,633 --> 00:00:28,633 And you want to predict: will somebody purchase 11 00:00:28,766 --> 00:00:32,066 the health insurance that your company is offering? 12 00:00:32,066 --> 00:00:33,466 Yes or no. 13 00:00:33,466 --> 00:00:35,533 And it's a very simple yes or no. 14 00:00:35,533 --> 00:00:37,700 It's a categorical variable. 15 00:00:37,700 --> 00:00:41,533 And you might want to be predicting this dependent variable 16 00:00:41,866 --> 00:00:46,200 based on an independent variable such as age. 17 00:00:46,200 --> 00:00:49,200 So depending on their age, will they purchase 18 00:00:49,400 --> 00:00:52,233 the health insurance plan that your company is offering? 19 00:00:52,233 --> 00:00:54,833 So on the x axis we would have age. 20 00:00:54,833 --> 00:00:56,900 On the y axis we would have yes or no. 21 00:00:56,900 --> 00:00:59,533 Did they take up the offer or not? 22 00:00:59,533 --> 00:01:04,333 Let's say our x axis is somewhere between 18 years of age and 60 years of age. 23 00:01:04,333 --> 00:01:08,600 And our y axis simply has binary outcomes. 24 00:01:08,866 --> 00:01:10,366 Yes and no. 25 00:01:10,366 --> 00:01:12,900 So there's no in between. It's either yes or no. 
26 00:01:12,900 --> 00:01:17,500 So we're going to just add this horizontal line for illustrative purposes. 27 00:01:18,300 --> 00:01:20,666 So what will our data set look like? 28 00:01:20,666 --> 00:01:25,333 Let's say we've got a certain number of observations in our data set. 29 00:01:25,333 --> 00:01:29,733 So we know people's age and we know, when they were exposed to the offer, 30 00:01:29,733 --> 00:01:30,966 whether they purchased or didn't. 31 00:01:30,966 --> 00:01:34,900 So these people over here didn't purchase the health insurance. 32 00:01:34,900 --> 00:01:36,133 And these people over here 33 00:01:36,133 --> 00:01:39,133 did purchase the health insurance, and that's our data set. 34 00:01:39,133 --> 00:01:42,400 So as you can see, it's very different. 35 00:01:42,400 --> 00:01:46,466 This plot looks very different to what we were working with in the linear 36 00:01:46,700 --> 00:01:48,366 regression tutorials. 37 00:01:48,366 --> 00:01:52,200 So we can't simply draw a linear regression 38 00:01:52,200 --> 00:01:55,733 line, a sloped line, through these points. 39 00:01:55,733 --> 00:01:56,700 It makes no sense. 40 00:01:56,700 --> 00:02:01,466 So that's why the equation for logistic regression is slightly different. 41 00:02:01,466 --> 00:02:02,466 Here it is. 42 00:02:02,466 --> 00:02:06,000 So here on the left we've got, 43 00:02:06,533 --> 00:02:10,066 instead of y, a logarithm. 44 00:02:10,066 --> 00:02:14,966 And then on the right we've got the same part as we saw before. 45 00:02:15,400 --> 00:02:18,966 The important value to us here is P, and that is the probability 46 00:02:18,966 --> 00:02:21,533 that we'll be working with just now. 47 00:02:21,533 --> 00:02:22,600 We'll see it in action. 48 00:02:22,600 --> 00:02:24,900 So let's look at the logistic regression curve. 49 00:02:24,900 --> 00:02:26,766 The logistic regression curve looks like this. 50 00:02:26,766 --> 00:02:29,200 And it's also called the sigmoid curve. 
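The slide with the equation is not captured in this transcript. As a sketch, the standard single-variable form of the logistic regression equation, with age as x, is usually written as below. The coefficient names b_0 and b_1 are assumed notation, not taken from the video.

```latex
\ln\left(\frac{P}{1 - P}\right) = b_0 + b_1 x
\qquad\Longleftrightarrow\qquad
P = \frac{1}{1 + e^{-(b_0 + b_1 x)}}
```

The left-hand form has the logarithm in place of y. Solving it for P gives the right-hand form, which is the sigmoid curve.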
51 00:02:30,166 --> 00:02:33,433 And how does this work in action? 52 00:02:34,166 --> 00:02:36,900 Well, let's say we have two new observations. 53 00:02:36,900 --> 00:02:39,800 Let's say we've built this model based on this data. 54 00:02:39,800 --> 00:02:41,900 We have a logistic regression. 55 00:02:41,900 --> 00:02:44,233 How will it apply to new observations? 56 00:02:44,233 --> 00:02:45,666 Well, let's say we have two new observations. 57 00:02:45,666 --> 00:02:48,666 Somebody of age 35 and somebody of age 45. 58 00:02:48,766 --> 00:02:52,300 What we need to do is project these values 59 00:02:52,300 --> 00:02:55,466 onto our logistic regression and find out where they fit there. 60 00:02:55,766 --> 00:02:59,433 And the logistic regression will give us probabilities. 61 00:02:59,433 --> 00:03:04,266 So everything here is between 0 and 1. 62 00:03:04,266 --> 00:03:05,566 So no is a zero. 63 00:03:05,566 --> 00:03:06,833 Yes is a one. 64 00:03:06,833 --> 00:03:09,266 And in between are the probabilities. 65 00:03:09,266 --> 00:03:12,900 So the logistic regression gives us probabilities of somebody saying yes. 66 00:03:12,900 --> 00:03:14,933 So, of somebody taking up that offer. 67 00:03:14,933 --> 00:03:18,133 So for the 35 year old it's a 42% chance 68 00:03:18,133 --> 00:03:21,200 that they will take up the offer based on this model. 69 00:03:21,200 --> 00:03:22,233 So this is what the model is 70 00:03:22,233 --> 00:03:25,466 predicting: a 42% chance they will take up the offer. 71 00:03:25,800 --> 00:03:30,266 And for the 45 year old, there's an 81% chance they'll take it up. 72 00:03:30,566 --> 00:03:31,766 So we could stop there. 73 00:03:31,766 --> 00:03:32,600 That is the P. 74 00:03:32,600 --> 00:03:35,633 That is the probability that we see in the equation on the left, 75 00:03:35,966 --> 00:03:38,800 that's hidden inside that logarithm. 
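To make the projection step concrete, here is a minimal Python sketch of how a fitted curve turns an age into a probability. The video does not state the model's coefficients, so the values of b0 and b1 below are back-solved from the two probabilities quoted above (42% at age 35, 81% at age 45) purely for illustration. The function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability: the logarithm on the left of the equation."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumption: coefficients are NOT given in the video. They are back-solved
# here so that the curve passes through the two quoted points.
b1 = (logit(0.81) - logit(0.42)) / (45 - 35)  # slope per year of age
b0 = logit(0.42) - b1 * 35                    # intercept

def predict_proba(age):
    """Probability of 'yes' for a given age under the illustrative curve."""
    return sigmoid(b0 + b1 * age)

for age in (35, 45):
    print(age, f"{predict_proba(age):.0%}")
```

By construction, this curve returns the quoted probabilities for ages 35 and 45. A real model would instead estimate b0 and b1 from the observed data set.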
76 00:03:38,800 --> 00:03:42,200 So we could stop there and we could use these probabilities. 77 00:03:42,200 --> 00:03:43,733 And in some use cases, 78 00:03:43,733 --> 00:03:46,933 that's what the logistic regression is used for. 79 00:03:46,933 --> 00:03:49,500 We just deal with the probabilities once we have them. 80 00:03:49,500 --> 00:03:52,533 But in most cases we want a binary outcome, a yes or no outcome. 81 00:03:52,533 --> 00:03:56,833 And so for those situations, we split our curve into two. 82 00:03:56,866 --> 00:04:00,733 Anything above this line, this middle line, 83 00:04:00,733 --> 00:04:03,900 anything above with a probability of 50% or higher 84 00:04:04,133 --> 00:04:07,733 will be projected into a yes, into a binary one. 85 00:04:07,966 --> 00:04:13,033 And anything below the 50% line will be projected into a no, a binary zero. 86 00:04:13,333 --> 00:04:14,966 Our points would end up here. 87 00:04:14,966 --> 00:04:17,966 So based on this logistic regression, we would make the conclusion 88 00:04:17,966 --> 00:04:21,266 that the 35 year old would not purchase our insurance 89 00:04:21,266 --> 00:04:24,266 plan and the 45 year old would purchase our insurance plan. 90 00:04:25,166 --> 00:04:29,133 And just like with linear regression, you can have multiple independent 91 00:04:29,133 --> 00:04:29,633 variables. 92 00:04:29,633 --> 00:04:33,600 So for instance, age, income, level of education, 93 00:04:33,933 --> 00:04:38,166 how big their family is, or whether they have a family or they're single, 94 00:04:38,466 --> 00:04:43,266 and many other types of variables can be added depending on the use case. 95 00:04:43,800 --> 00:04:48,033 And our equation will look like this in this situation. 96 00:04:48,500 --> 00:04:49,200 So there you go. 97 00:04:49,200 --> 00:04:51,066 That's logistic regression in a nutshell. 98 00:04:51,066 --> 00:04:52,366 I look forward to seeing you next time. 
99 00:04:52,366 --> 00:04:54,566 And until then enjoy machine learning.
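The 50% cut-off and the multi-variable case described in the tutorial can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the video: the function names, the feature list and every coefficient value are made up for the example.

```python
import math

def sigmoid(z):
    """Maps any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefficients, intercept):
    """Multi-variable form: P = sigmoid(b0 + b1*x1 + b2*x2 + ... + bn*xn)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

def classify(probability, threshold=0.5):
    """Project a probability onto a binary outcome: 1 (yes) at or above the
    threshold, 0 (no) below it."""
    return 1 if probability >= threshold else 0

# The two probabilities quoted in the tutorial.
print(classify(0.42))  # 35 year old
print(classify(0.81))  # 45 year old

# Hypothetical person described by age, income and family size, with
# made-up coefficients (assumptions for illustration only).
person = [40, 50000, 2]
coefficients = [0.05, 0.00002, 0.3]
intercept = -4.0
p = predict_proba(person, coefficients, intercept)
print(p, classify(p))
```

With the default threshold, the 42% case maps to 0 (no) and the 81% case maps to 1 (yes), matching the conclusion in the tutorial. The threshold is left as a parameter because some use cases may call for a different cut-off.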