In the coming lessons we're going to be introducing some more realism and building out our intuition for the machine learning problems that are to come. We're now going to have a look at how you'd run a gradient descent algorithm given some data points and using a real cost function. This cost function is called the Mean Squared Error, or MSE.

Previously we've been using example cost functions to learn more about how our gradient descent algorithm behaves. That was fine at the time, but now we turn our attention to a cost function that is actually used in practice for many machine learning problems.

Now, to recap a little bit about linear regression: we said that the goal of linear regression was to minimize the distances between the points and the line, and that a good line was one that minimizes the residual sum of squares. When we were talking about the distances, we meant the distances between the data points, so the actual values, the ys, and our fitted values, or our hypothesis. To calculate a residual sum of squares, we took the difference between each actual value and its fitted value and then added them all together. But because some points would have been below the line, some of those differences would be negative, and adding them up just like this would create a problem. So what we have to do instead is square all the differences and then add them up. Thus our goal was to choose a line, that is, choose the parameters theta zero and theta one, that would minimize the sum of the squared differences, and the name for this equation was the Residual Sum of Squares, or RSS.

The way to interpret this number, this residual sum of squares, is as follows: it's how much of the dependent variable's variation our model did not explain. It's the sum of the squared differences between the actual y values and the predicted y values; the smaller the residual sum of squares, the better our model fits our data, and the greater the residual sum of squares, the poorer our model fits our data. Thinking about it this way, what would be a perfect fit? What kind of model would have a perfect prediction? Well, it would be one where the residual sum of squares is zero. Right? A value of zero for the residual sum of squares would imply a perfect fit.
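To make that recap concrete, here's a tiny numeric sketch in Python; the numbers are made-up illustration values, and the point is that the raw differences can cancel each other out while the squared differences cannot:

import numpy as np

y = np.array([3.0, 5.0, 7.0])       # actual values
fitted = np.array([4.0, 5.0, 6.0])  # fitted values from some candidate line

residuals = y - fitted              # [-1.0, 0.0, 1.0]
print(residuals.sum())              # 0.0 -> the negatives cancel the positives
print((residuals ** 2).sum())       # 2.0 -> the RSS still reveals the misfit
print(((y - y) ** 2).sum())         # 0.0 -> a perfect fit has an RSS of zero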
So I know that at this point you're probably thinking: would the residual sum of squares make a good cost function? And the answer is yes, it would, but with one small modification, and that modification is dividing the whole thing by the number of data points. Then the residual sum of squares gets a new name: this equation is called the Mean Squared Error, or MSE, and that's the cost function we're going to be using in this last example.

Back in our Jupyter notebook, let's add some section headings. I'm going to change my cell to Markdown here, put in one pound symbol, and then write "Example 5 - Working with Data & a Real Cost Function". Below that I'm going to add a subheading that reads "Mean Squared Error: a cost function for regression problems".

This is also a really good opportunity to try your hand at writing some more LaTeX notation in Jupyter. Let's add the formula for the residual sum of squares first. I'm going to make this fairly large, so I'm going to give it three pound symbols, then two dollar signs, "RSS = ", and now we need that summation symbol, which we get with the \sum tag. Let me add another two dollar signs as the closing tag, hit Shift+Enter, and show you what I mean. So "\sum" gives us the summation symbol.

Now we need to add some things to the top and the bottom of the summation symbol. I can add things to the bottom with the underscore and a pair of curly braces; inside them I'm going to write "i=1". Then I'm going to use the caret symbol and another pair of curly braces with an "n" inside to put something on top. Let me hit Shift+Enter and show you what this looks like.

Now I'm going to add the rest of the equation in some big parentheses, so I'm going to write a backslash and then "big" for the big tag, an open parenthesis, then a space, another backslash, "big", and a closing parenthesis. Everything inside the parentheses will be raised to the power of two, so after the closing parenthesis I'll add a caret and then a 2. Let's take a look at what this looks like now.

All right. Now I'm just going to add the final bit, and that's going to be "y^{(i)}", the i-th value of y, minus "h_\theta x^{(i)}", our hypothesis at the i-th value of x. I'm going to put another space here at the end to make it a bit more readable, hit Shift+Enter, and take a look at this equation.
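Putting all of those pieces together, the Markdown cell for the residual sum of squares ends up containing something like this (the exact spacing is just how I happen to type it):

### $$RSS = \sum_{i=1}^{n} \big( y^{(i)} - h_\theta x^{(i)} \big)^2 $$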
Okay. So I think that looks pretty solid, and in the process we've discovered a few more LaTeX tricks. One is that the underscore adds things to the bottom, like the "i = 1" below the summation symbol, or the Greek letter theta as a subscript on the h for our hypothesis. We've also seen the big parentheses in action with the big tag.

Now, getting the LaTeX equation for the mean squared error is actually going to be very easy. I'm just going to copy this, paste it, and modify it. Instead of RSS it's going to read MSE, and then just in front I'm going to add a fraction, "\frac{1}{n}". When I hit Shift+Enter I can now see my mean squared error equation, like so.

One thing I'll point out is that you might see an alternative way of writing this equation. Sometimes it's written with a y hat, a little caret symbol above the y, instead of this notation with the hypothesis function, this h theta x i. So let's add the alternative notation as well, so that we've got it on here. I'm going to select the part that reads "h_\theta x^{(i)}", delete it, and instead show you another LaTeX tag, which is "\hat" followed by a pair of curly braces with a y inside. When you hit Shift+Enter now, you should see this. The other thing I'll do is remove the caret, the curly braces, and the i from the preceding y value. So I'm going to delete that, hit Shift+Enter, and let's see what this looks like.

So these are the two notations that you're going to come across most often: the predicted values are either represented with the hypothesis function, using this h notation, or with this y hat notation, which is also very, very common. And y hat is also going to be the way I'll be referring to our predicted values in the Python code that we're going to write.
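Written out in full, the two forms of the mean squared error equation we've just discussed look like this:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \big( y^{(i)} - h_\theta x^{(i)} \big)^2$$

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \big( y - \hat{y} \big)^2$$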
Now, with the notation out of the way and just looking at these equations, you might be wondering: why is the mean squared error more useful as a cost function than the residual sum of squares? I mean, they look pretty similar, right? But let's think about it. What would happen when our dataset gets large? We're going to have a y and a y hat value for each and every data point, right? So as we add more and more data points to our dataset, that sum starts to get pretty big, because we're adding up all the squared differences. And we've already seen what happens when computers are confronted with very, very large numbers. That's right: when working with very, very large numbers we might encounter our old friend, the overflow error, and this is where the mean squared error comes to the rescue.

Because what happens when we divide by the number of samples? By dividing that sum by the number of data points we're taking an average, a mean, hence the name mean squared error. And by dividing by the number of samples you can handle very large datasets, datasets with tens of thousands of data points, and the cost function isn't going to run into an overflow error, because you're keeping that mean squared error manageably small.

I hope that explains why the mean squared error is a very, very popular cost function: it allows people to handle datasets of all sorts of sizes, and your gradient descent algorithm won't cause an overflow as you're scaling up.
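As a preview of the Python we'll be writing, here's a minimal sketch of the mean squared error as a function; the names y and y_hat are just my own placeholders for the actual and predicted values:

import numpy as np

def mse(y, y_hat):
    # Mean squared error: the average of the squared differences
    # between the actual values y and the predicted values y_hat.
    return np.mean((y - y_hat) ** 2)

# Toy usage example with made-up numbers.
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([3.5, 4.8, 7.1])
print(mse(y, y_hat))  # a small number, since the predictions are close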