All right. So previously we talked about the famous three-step machine learning process: predict, calculate the error, and learn. Now, the middle step is the one that involves calculating an error, and I glossed over that part in the last lesson. So what does that mean, and how exactly do we calculate the error?

To provide a little bit of context, let's briefly revisit something we talked about with linear regression. There, we had some data points, and our goal was to figure out which line best fits the data. In other words, we had to learn which line was best, because you could draw so many different lines through this data. And each of these lines had different values for the theta zero and theta one parameters associated with it. So our algorithm somehow had to work out what the best values were for theta zero and theta one.

Now, that word "best" is pretty vague, right? We need a hard criterion. By "best", what we actually mean is the line that minimizes the distance between the data points and the line. For example, this dark green line here is clearly better than the light green line below it. Why? Because the distance between the line and the data points is shorter.

That's pretty easy to see visually. Now all we have to do is construct some sort of metric from it, and what we need is a single number. In this case, the number is built like this: we add up all the differences between the line and the data points. The first difference might be 10, the second might be negative 6, the third might be 4. But because a data point below the line produces a negative difference, like that minus 6, we first have to square the differences. So we square all the differences and then add them up, and now we've got a single number.

And we've also got a goal, right? This single number has a name: it's called the residual sum of squares. And the residual sum of squares gives us a single metric for how good our estimates for theta one and theta zero are.
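To pin down the arithmetic, here is a minimal Python sketch of that calculation: for each data point we take the difference between the observed value and the value predicted by the line theta zero plus theta one times x, square it, and add everything up. The data points and the candidate theta values below are made-up numbers, purely for illustration.

```python
# Minimal sketch: residual sum of squares for one candidate line.
# The data and the theta values below are made-up illustrations.

data_x = [1.0, 2.0, 3.0, 4.0, 5.0]
data_y = [2.1, 3.9, 6.2, 8.1, 9.8]

theta_0 = 0.5   # candidate intercept
theta_1 = 1.8   # candidate slope

def residual_sum_of_squares(theta_0, theta_1, xs, ys):
    """Sum of squared differences between each data point and the line."""
    rss = 0.0
    for x, y in zip(xs, ys):
        prediction = theta_0 + theta_1 * x   # value of the line at this x
        difference = y - prediction          # negative if the point sits below the line
        rss += difference ** 2               # squaring removes the sign
    return rss

print(residual_sum_of_squares(theta_0, theta_1, data_x, data_y))
```

The squaring is exactly what stops a point 6 units below the line from cancelling out a point 6 units above it.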
So, with this in mind, we know that we can find the best possible values for theta zero and theta one by minimizing this residual sum of squares. We've given our algorithm a goal: if the residual sum of squares for one particular line is 100, then that's a better line than one whose residual sum of squares is 500. The lower the number, the better our line and the better our estimates for these coefficients.
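To make that concrete, here is a minimal Python sketch of what "minimizing the residual sum of squares" could look like: it simply tries a grid of candidate theta zero and theta one values and keeps the pair with the lowest cost. The toy data and the grid ranges are made-up illustrations, and a real implementation would typically use calculus or a library optimizer rather than a brute-force search.

```python
# Sketch only: brute-force search over a grid of candidate parameters.
# Toy data and grid ranges are made-up illustrations.

data_x = [1.0, 2.0, 3.0, 4.0, 5.0]
data_y = [2.1, 3.9, 6.2, 8.1, 9.8]

def rss(theta_0, theta_1):
    """Residual sum of squares of the line theta_0 + theta_1 * x on the toy data."""
    return sum((y - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(data_x, data_y))

best = None
best_cost = float("inf")

# Try a coarse grid of candidate intercepts and slopes.
for i in range(-30, 31):
    for j in range(-30, 31):
        theta_0, theta_1 = i * 0.1, j * 0.1
        cost = rss(theta_0, theta_1)
        if cost < best_cost:        # the lower the cost, the better the line
            best_cost = cost
            best = (theta_0, theta_1)

print("best theta_0, theta_1:", best, "with RSS:", best_cost)
```

The point is simply that once the error is a single number, "the best line" becomes a well-defined search problem.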
So why am I talking about this? Why am I going on about something we've already covered? Well, we can think of this number as measuring the size of our error and expressing how well or how badly we did. And that brings me to this lesson's word of the day: cost functions. Our residual sum of squares, also known as the sum of squared residuals, is, you guessed it, an example of a cost function, and a very big part of the machine learning process is optimizing for the solution with the lowest cost.

So finding the best solution falls under the broad topic of optimization, and optimization is a word you'll come across in many fields, not just machine learning. And if the topic of optimization comes up across different fields, it shouldn't really be a surprise that the idea of cost functions is also something you'll see across different areas. You'll find it in statistics, decision theory, computational neuroscience, operations research and engineering; you'll see it in many, many places.

And this brings us back to the topic of jargon and language. Because many different fields use different words for what is really a very similar concept, reading or learning about this topic can get a little bit confusing. So to make it easier, I want to introduce you to a lot of the jargon, a lot of the words that come back to this idea of cost functions, right off the bat.

Depending on the field and the context, people don't always use the term cost function, and that can make things a little confusing, especially if you're reading an article online or a textbook of some sort. Sometimes you'll see these kinds of functions referred to as loss functions. I've even come across people calling them error functions, though I think that's a bit less common, especially for this sort of application.

And finally, you might come across the term objective function. In the process of optimization, of trying to find the best solution for something, you'll very often find the words loss function and cost function used interchangeably. But the term objective function means something a little bit more general. If you think about it, the objective isn't always to minimize a cost or some bad thing; sometimes the objective is to maximize a good thing. The relationship between cost function and objective function is a bit like how a salmon is a fish, but not all fish are salmon.

And I think that about covers the jargon. In machine learning you'll mostly see the terms cost function and loss function, and that's what I'm going to try to stick to in this course.

Now, I don't know about you, but personally I can't wait to jump straight into a Jupyter notebook and write some Python code. So that's what we're going to do next: we're going to take a look at how we can go about minimizing a cost function in practice. Stay tuned.
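As a rough preview of what minimizing a cost function in code might look like, here is a hedged sketch that hands the residual-sum-of-squares cost to a general-purpose optimizer from SciPy. The toy data are made up, and the actual notebook in the next lesson may well take a different approach.

```python
# Rough preview sketch: minimizing the residual-sum-of-squares cost with a
# library optimizer. The toy data are made-up illustrations.
import numpy as np
from scipy.optimize import minimize

data_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def cost(theta):
    """Residual sum of squares for the line theta[0] + theta[1] * x."""
    predictions = theta[0] + theta[1] * data_x
    return np.sum((data_y - predictions) ** 2)

result = minimize(cost, x0=[0.0, 0.0])   # start from an arbitrary guess
print(result.x)                          # best estimates for theta_0 and theta_1
```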