All right, so we've plotted our data and we suspect that there is a relationship there. Now it's time to talk about the algorithm that we will use to help quantify this relationship between budget and revenue. In this lesson we're going to talk about the theory behind regression.

So, we know what our real data will look like, but let's illustrate what the algorithm will do with a stylized example to help build our intuition. Our linear regression will get two kinds of data: it will get our film production budgets and it will get our film revenues. The budgets will be our feature, also called the independent variable, and the revenue is what we are trying to estimate - that will be our target. What the linear regression will do is try and represent the relationship between the budget and the revenue as a straight line. But here's the rub - what kind of line?

Let's think back to high school math class and think about what describes a line. From our math classes, we know that we can plot y as a function of x, and that's a line. If we cut the y-axis at 10, then we say that our line has an intercept of 10, and if, every time x increased by 2, y increased by 1, then we say that the line has a slope equal to one half. In that case our equation would look something like this: y = 1/2 x + 10. And that means that the generic equation for a line would be something like this: y = mx + c, where m is the slope and c is the constant.

So let me ask you this. What part of the equation for the line would tell you how strong the relationship is between x and y? In this case the slope is the key. The slope tells us how much y will change for a given change in x - the larger the value of the slope, the steeper the line becomes. Let's take a look at an example where there is no relationship between x and y. If there is no relationship, then we would simply have a flat, horizontal line; in this case the slope would be equal to zero. But if there is a relationship between the two, then the slope would be quite steep - and the stronger the relationship, the steeper the slope.
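To make that concrete, here is a minimal Python sketch of the line we just described. The function name and the sample inputs are mine, purely for illustration - this isn't code from the lesson's notebook:

```python
# A line with intercept c = 10 and slope m = 1/2, as described above.
def line(x, m=0.5, c=10):
    """Return y = m*x + c for a straight line with slope m and intercept c."""
    return m * x + c

print(line(0))       # 10.0 -> the line cuts the y-axis at the intercept, 10
print(line(2))       # 11.0 -> x went up by 2, y went up by 1: slope = 1/2
print(line(2, m=0))  # 10   -> slope of zero: y never changes, no relationship
```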
But here's the thing. There's a big difference between machine learning and pure mathematics; in machine learning, we don't actually know the true relationship, and that's why we refer to the slope and the intercept as parameters - and these parameters have to be estimated by our linear regression.

In fact, we even use a different notation. In our notation, we will replace the c for the constant with theta zero, and the slope coefficient will be written as theta one. We'll also change the order in this equation, so we'll have the constant first and then the slope. And instead of writing y, what you'll also often see is h theta of x, where h stands for hypothesis. This kind of notation is very popular in machine learning, and even though it can look quite intimidating when you first see it, all you're looking at here is the equation for a simple line: h_θ(x) = θ₀ + θ₁x. And the reason I'm showing you this right here is that if you ever pick up a book on machine learning, or you're reading some articles on the Internet, you're going to be confronted with notation that's very, very similar to this.

But at this point we still haven't talked about where the line ultimately comes from. How do we know which line to draw? Looking at the data, we just have data points. There is actually no line, right? And as a matter of fact, you can draw a whole bunch of different lines through the same set of data points. So, which line is best? Which line would you choose? Which line has the best possible theta zero and the best possible theta one?

If our dataset looked just like this, our job would be easy. All we would have to do is connect all the data points with a straight line. And this also seems like the best option, because we would know that in this case our estimates for theta zero and theta one would be very accurate. However, real data looks more like this. If we were to draw a line through this data, then there would always be a gap between the actual value and the line. In other words, there would be a difference between the actual data point and the point on the line. The point on the line is called the fitted value, or the predicted value. But let's talk more about these gaps, because it's these gaps that will help us choose the best possible intercept and the best possible slope for our line.
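To tie the notation to those gaps, here is a small sketch of the hypothesis as code. The budgets, revenues, and the values of theta zero and theta one below are made-up guesses for illustration, not estimated parameters:

```python
import numpy as np

# Hypothesis: h_theta(x) = theta_0 + theta_1 * x (constant first, then slope).
def h(x, theta_0, theta_1):
    return theta_0 + theta_1 * x

budget  = np.array([20.0, 45.0, 80.0])   # hypothetical budgets (the feature)
revenue = np.array([60.0, 95.0, 170.0])  # hypothetical revenues (the target)

fitted = h(budget, theta_0=10.0, theta_1=1.8)  # points on one candidate line
gaps = revenue - fitted                        # actual value minus fitted value
print(fitted)  # [ 46.  91. 154.]
print(gaps)    # [14.  4. 16.]
```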
These white lines are actually called residuals. Now, why will the residuals help us choose the best possible line for our data? Let me show you another line that we can draw through this data. If I draw a line down here, then what we see is that the gaps between the data points and the line are much larger. The residuals are way bigger, so the residuals can tell us something about how good the line is that we're drawing on this chart. Now we have a measure by which to compare the different lines that we can draw through the data: all we have to do is look at the size of the residuals and choose the line with the smallest residuals.

And that's great, because now our algorithm has a very clear objective. The goal of our linear regression is going to be to calculate the line that minimizes these residuals. But how exactly should that work? Let's take a look at that first residual. That first residual is going to be the difference between the actual value, y1, and the predicted value, which is the one on the line. That second residual would also just be the difference between the actual value, in white here, and the fitted value, in green. And the same is true for that third data point.

Now suppose we've actually calculated the values for these residuals, and they come out to 10, negative 6, and 4. In this case, what we can't do is just add them up and find the lowest sum, because that second data point is below the line - we have a negative number here. So what we have to do instead is turn all of these numbers positive, and the way we can do that is by squaring the residuals. Squaring 10, negative 6, and 4 gives us 100, 36, and 16, so now what we've got is a single number - the sum of the squared residuals, which here is 152. This is the number that the linear regression will try to minimize in order to choose the best parameters for the line. In other words, to find the best possible fit for our regression, what we need to do is choose an intercept - theta zero - and a slope - theta one - that minimize the sum of the squared residuals. And you'll also see this number referred to as the residual sum of squares, or RSS.

So now that we've talked about the theory and built up our intuition behind our regression, let's implement this in a Jupyter notebook (a quick sketch of the RSS calculation follows below). I'll see you in the next lesson.
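Before the notebook, here is a minimal sketch of the RSS idea using the residuals from the example. The toy x and y data at the end are mine, and np.polyfit stands in here for the least-squares fitting we'll build up properly in the coming lessons:

```python
import numpy as np

residuals = np.array([10, -6, 4])  # the three residuals from the example

# Simply summing lets the negative residual cancel the positive ones...
print(residuals.sum())             # 8 -> misleadingly small

# ...so we square first: every gap becomes positive, and the sum is the RSS.
rss = (residuals ** 2).sum()
print(rss)                         # 100 + 36 + 16 = 152

# Minimizing the RSS over theta_0 and theta_1 is exactly what an ordinary
# least-squares fit does. With toy data it's a one-liner:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([12.0, 11.0, 15.0, 18.0])
theta_1, theta_0 = np.polyfit(x, y, deg=1)  # slope, intercept with lowest RSS
print(theta_0, theta_1)                     # roughly 8.5 and 2.2
```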