All right. So previously we talked about the famous three-step machine learning process: predict, calculate the error, and learn. Now, the middle step is the one that involves calculating an error, and I glossed over that part in the last lesson. So what does that mean, and how exactly do we calculate the error?

To provide a little bit of context, let's briefly revisit something we talked about with linear regression. There, we had some data points, and our goal was to figure out which line best fits the data. In other words, we had to learn which line was best, because you could draw so many different lines through this data. And each of these lines had different values for the theta zero and theta one parameters associated with it. So our algorithm somehow had to work out what the best values were for theta zero and theta one.

Now, that word "best" is pretty vague, right? We need a hard criterion. By "best", what we actually mean is the line that minimizes the distance between the data points and the line. For example, this dark green line here is clearly better than the light green line below it. Why? Because the distance between the line and the data points is shorter.

That's pretty easy to see visually. Now all we have to do is construct some sort of metric from it, and what we need is a single number. In this case, the number is built like this: we add up all the differences between the line and the data points. The first difference might be 10, the second might be negative 6, the third might be 4. But because a data point below the line produces a negative difference, like that minus 6, we first have to square the differences. So we square all the differences and then add them up, and now we've got a single number.

And we've also got a goal, right? This single number has a name: it's called the residual sum of squares. And the residual sum of squares gives us a single metric for how good our estimates for theta one and theta zero are.
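To pin down the arithmetic, here is a minimal Python sketch of that calculation: for each data point we take the difference between the observed value and the value predicted by the line theta zero plus theta one times x, square it, and add everything up. The data points and the candidate theta values below are made-up numbers, purely for illustration.

```python
# Minimal sketch: residual sum of squares for one candidate line.
# The data and the theta values below are made-up illustrations.

data_x = [1.0, 2.0, 3.0, 4.0, 5.0]
data_y = [2.1, 3.9, 6.2, 8.1, 9.8]

theta_0 = 0.5   # candidate intercept
theta_1 = 1.8   # candidate slope

def residual_sum_of_squares(theta_0, theta_1, xs, ys):
    """Sum of squared differences between each data point and the line."""
    rss = 0.0
    for x, y in zip(xs, ys):
        prediction = theta_0 + theta_1 * x   # value of the line at this x
        difference = y - prediction          # negative if the point sits below the line
        rss += difference ** 2               # squaring removes the sign
    return rss

print(residual_sum_of_squares(theta_0, theta_1, data_x, data_y))
```

The squaring is exactly what stops a point 6 units below the line from cancelling out a point 6 units above it.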
So, with this in mind, we know that we can find the best possible values for theta zero and theta one by minimizing this residual sum of squares. We've given our algorithm a goal: if the residual sum of squares for one particular line is 100, then that's a better line than one whose residual sum of squares is 500. The lower the number, the better our line and the better our estimates for these coefficients.
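To make that concrete, here is a minimal Python sketch of what "minimizing the residual sum of squares" could look like: it simply tries a grid of candidate theta zero and theta one values and keeps the pair with the lowest cost. The toy data and the grid ranges are made-up illustrations, and a real implementation would typically use calculus or a library optimizer rather than a brute-force search.

```python
# Sketch only: brute-force search over a grid of candidate parameters.
# Toy data and grid ranges are made-up illustrations.

data_x = [1.0, 2.0, 3.0, 4.0, 5.0]
data_y = [2.1, 3.9, 6.2, 8.1, 9.8]

def rss(theta_0, theta_1):
    """Residual sum of squares of the line theta_0 + theta_1 * x on the toy data."""
    return sum((y - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(data_x, data_y))

best = None
best_cost = float("inf")

# Try a coarse grid of candidate intercepts and slopes.
for i in range(-30, 31):
    for j in range(-30, 31):
        theta_0, theta_1 = i * 0.1, j * 0.1
        cost = rss(theta_0, theta_1)
        if cost < best_cost:        # the lower the cost, the better the line
            best_cost = cost
            best = (theta_0, theta_1)

print("best theta_0, theta_1:", best, "with RSS:", best_cost)
```

The point is simply that once the error is a single number, "the best line" becomes a well-defined search problem.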
So why am I talking about this? Why am I going on about something we've already covered? Well, we can think of this number as measuring the size of our error and expressing how well or how badly we did. And that brings me to this lesson's word of the day: cost functions. Our residual sum of squares, also known as the sum of squared residuals, is, you guessed it, an example of a cost function, and a very big part of the machine learning process is optimizing for the solution with the lowest cost.

So finding the best solution falls under the broad topic of optimization, and optimization is a word you'll come across in many fields, not just machine learning. And if the topic of optimization comes up across different fields, it shouldn't really be a surprise that the idea of cost functions is also something you'll see across different areas. You'll find it in statistics, decision theory, computational neuroscience, operations research and engineering; you'll see it in many, many places.

And this brings us back to the topic of jargon and language. Because many different fields use different words for what is really a very similar concept, reading or learning about this topic can get a little bit confusing. So to make it easier, I want to introduce you to a lot of the jargon, a lot of the words that come back to this idea of cost functions, right off the bat.

Depending on the field and the context, people don't always use the term cost function, and that can make things a little confusing, especially if you're reading an article online or a textbook of some sort. Sometimes you'll see these kinds of functions referred to as loss functions. I've even come across people calling them error functions, though I think that's a bit less common, especially for this sort of application.

And finally, you might come across the term objective function. In the process of optimization, of trying to find the best solution for something, you'll very often find the words loss function and cost function used interchangeably. But the term objective function means something a little bit more general. If you think about it, the objective isn't always to minimize a cost or some bad thing; sometimes the objective is to maximize a good thing. The relationship between cost function and objective function is a bit like how a salmon is a fish, but not all fish are salmon.

And I think that about covers the jargon. In machine learning you'll mostly see the terms cost function and loss function, and that's what I'm going to try to stick to in this course.

Now, I don't know about you, but personally I can't wait to jump straight into a Jupyter notebook and write some Python code. So that's what we're going to do next: we're going to take a look at how we can go about minimizing a cost function in practice. Stay tuned.
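As a rough preview of what minimizing a cost function in code might look like, here is a hedged sketch that hands the residual-sum-of-squares cost to a general-purpose optimizer from SciPy. The toy data are made up, and the actual notebook in the next lesson may well take a different approach.

```python
# Rough preview sketch: minimizing the residual-sum-of-squares cost with a
# library optimizer. The toy data are made-up illustrations.
import numpy as np
from scipy.optimize import minimize

data_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def cost(theta):
    """Residual sum of squares for the line theta[0] + theta[1] * x."""
    predictions = theta[0] + theta[1] * data_x
    return np.sum((data_y - predictions) ** 2)

result = minimize(cost, x0=[0.0, 0.0])   # start from an arbitrary guess
print(result.x)                          # best estimates for theta_0 and theta_1
```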