All right, so we've plotted our data and we suspect that there is a relationship there. Now it's time to talk about the algorithm that we will use to help quantify this relationship between budget and revenue. In this lesson we're going to talk about the theory behind regression.

So, we know what our real data will look like, but let's illustrate what the algorithm will do with a stylized example to help build our intuition. Our linear regression will get two kinds of data: it will get our film production budgets and it will get our film revenues. The budgets will be our feature, also called the independent variable, and the revenue is what we are trying to estimate - that will be our target. What the linear regression will do is try and represent the relationship between the budget and the revenue as a straight line. But here's the rub - what kind of line?

Let's think back to high school math class and think about what describes a line. From our math classes, we know that we can plot y as a function of x, and that's a line. If we cut the y-axis at 10, then we say that our line has an intercept of 10, and if, every time x increased by 2, y increased by 1, then we say that the line has a slope equal to one half. In that case our equation would look something like this: y = 1/2 x + 10. And that means that the generic equation for a line would be something like this: y = mx + c, where m is the slope and c is the constant.

So let me ask you this. What part of the equation for the line would tell you how strong the relationship is between x and y? In this case the slope is the key. The slope tells us how much y will change for a given change in x - the larger the value of the slope, the steeper the line becomes. Let's take a look at an example where there is no relationship between x and y. If there is no relationship, then we would simply have a flat, horizontal line; in this case the slope would be equal to zero. But if there is a relationship between the two, then the slope would be quite steep - and the stronger the relationship, the steeper the slope.
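To make that concrete, here is a minimal Python sketch of the line we just described. The function name and the sample inputs are mine, purely for illustration - this isn't code from the lesson's notebook:

```python
# A line with intercept c = 10 and slope m = 1/2, as described above.
def line(x, m=0.5, c=10):
    """Return y = m*x + c for a straight line with slope m and intercept c."""
    return m * x + c

print(line(0))       # 10.0 -> the line cuts the y-axis at the intercept, 10
print(line(2))       # 11.0 -> x went up by 2, y went up by 1: slope = 1/2
print(line(2, m=0))  # 10   -> slope of zero: y never changes, no relationship
```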
But here's the thing. There's a big difference between machine learning and pure mathematics; in machine learning, we don't actually know the true relationship, and that's why we refer to the slope and the intercept as parameters - and these parameters have to be estimated by our linear regression.

In fact, we even use a different notation. In our notation, we will replace the c for the constant with theta zero, and the slope coefficient will be written as theta one. We'll also change the order in this equation, so we'll have the constant first and then the slope. And instead of writing y, what you'll also often see is h theta of x, where h stands for hypothesis. This kind of notation is very popular in machine learning, and even though it can look quite intimidating when you first see it, all you're looking at here is the equation for a simple line: h_θ(x) = θ₀ + θ₁x. And the reason I'm showing you this right here is that if you ever pick up a book on machine learning, or you're reading some articles on the Internet, you're going to be confronted with notation that's very, very similar to this.

But at this point we still haven't talked about where the line ultimately comes from. How do we know which line to draw? Looking at the data, we just have data points. There is actually no line, right? And as a matter of fact, you can draw a whole bunch of different lines through the same set of data points. So, which line is best? Which line would you choose? Which line has the best possible theta zero and the best possible theta one?

If our dataset looked just like this, our job would be easy. All we would have to do is connect all the data points with a straight line. And this also seems like the best option, because we would know that in this case our estimates for theta zero and theta one would be very accurate. However, real data looks more like this. If we were to draw a line through this data, then there would always be a gap between the actual value and the line. In other words, there would be a difference between the actual data point and the point on the line. The point on the line is called the fitted value, or the predicted value. But let's talk more about these gaps, because it's these gaps that will help us choose the best possible intercept and the best possible slope for our line.
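To tie the notation to those gaps, here is a small sketch of the hypothesis as code. The budgets, revenues, and the values of theta zero and theta one below are made-up guesses for illustration, not estimated parameters:

```python
import numpy as np

# Hypothesis: h_theta(x) = theta_0 + theta_1 * x (constant first, then slope).
def h(x, theta_0, theta_1):
    return theta_0 + theta_1 * x

budget  = np.array([20.0, 45.0, 80.0])   # hypothetical budgets (the feature)
revenue = np.array([60.0, 95.0, 170.0])  # hypothetical revenues (the target)

fitted = h(budget, theta_0=10.0, theta_1=1.8)  # points on one candidate line
gaps = revenue - fitted                        # actual value minus fitted value
print(fitted)  # [ 46.  91. 154.]
print(gaps)    # [14.  4. 16.]
```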
These white lines are actually called residuals. Now, why will the residuals help us choose the best possible line for our data? Let me show you another line that we can draw through this data. If I draw a line down here, then what we see is that the gaps between the data points and the line are much larger. The residuals are way bigger, so the residuals can tell us something about how good the line is that we're drawing on this chart. Now we have a measure by which to compare the different lines that we can draw through the data: all we have to do is look at the size of the residuals and choose the line with the smallest residuals.

And that's great, because now our algorithm has a very clear objective. The goal of our linear regression is going to be to calculate the line that minimizes these residuals. But how exactly should that work? Let's take a look at that first residual. That first residual is going to be the difference between the actual value, y1, and the predicted value, which is the one on the line. That second residual would also just be the difference between the actual value, in white here, and the fitted value, in green. And the same is true for that third data point.

Now suppose we've actually calculated the values for these residuals, and they come out to 10, negative 6, and 4. In this case, what we can't do is just add them up and find the lowest sum, because that second data point is below the line - we have a negative number here. So what we have to do instead is turn all of these numbers positive, and the way we can do that is by squaring the residuals. Squaring 10, negative 6, and 4 gives us 100, 36, and 16, so now what we've got is a single number - the sum of the squared residuals, which here is 152. This is the number that the linear regression will try to minimize in order to choose the best parameters for the line. In other words, to find the best possible fit for our regression, what we need to do is choose an intercept - theta zero - and a slope - theta one - that minimize the sum of the squared residuals. And you'll also see this number referred to as the residual sum of squares, or RSS.

So now that we've talked about the theory and built up our intuition behind our regression, let's implement this in a Jupyter notebook (a quick sketch of the RSS calculation follows below). I'll see you in the next lesson.
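Before the notebook, here is a minimal sketch of the RSS idea using the residuals from the example. The toy x and y data at the end are mine, and np.polyfit stands in here for the least-squares fitting we'll build up properly in the coming lessons:

```python
import numpy as np

residuals = np.array([10, -6, 4])  # the three residuals from the example

# Simply summing lets the negative residual cancel the positive ones...
print(residuals.sum())             # 8 -> misleadingly small

# ...so we square first: every gap becomes positive, and the sum is the RSS.
rss = (residuals ** 2).sum()
print(rss)                         # 100 + 36 + 16 = 152

# Minimizing the RSS over theta_0 and theta_1 is exactly what an ordinary
# least-squares fit does. With toy data it's a one-liner:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([12.0, 11.0, 15.0, 18.0])
theta_1, theta_0 = np.polyfit(x, y, deg=1)  # slope, intercept with lowest RSS
print(theta_0, theta_1)                     # roughly 8.5 and 2.2
```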