1 00:00:00,766 --> 00:00:03,000 Hello and welcome back to the course on Machine Learning. 2 00:00:03,000 --> 00:00:06,900 In today's tutorial we're going to talk about the k nearest neighbors algorithm. 3 00:00:07,466 --> 00:00:09,833 All right so let's get started. 4 00:00:09,833 --> 00:00:12,566 So let's imagine that we have a scenario 5 00:00:12,566 --> 00:00:16,266 where we have two categories already present in our data set. 6 00:00:16,266 --> 00:00:18,133 So we've identified two categories 7 00:00:18,133 --> 00:00:21,766 and one is category one on the left which is red. 8 00:00:21,766 --> 00:00:23,533 Category two is green on the right. 9 00:00:23,533 --> 00:00:25,766 And for simplicity's sake we're just going to 10 00:00:25,766 --> 00:00:30,066 take into consideration two variables or two columns in our data set. 11 00:00:30,066 --> 00:00:34,333 So all of this grouping is happening based on these two columns x1 and x2. 12 00:00:34,766 --> 00:00:37,766 And now let's say we add a new data point into our dataset. 13 00:00:38,000 --> 00:00:39,166 The question is 14 00:00:39,166 --> 00:00:42,833 should it fall into the red category or should fall into the green category? 15 00:00:42,866 --> 00:00:44,133 How do we decide that? 16 00:00:44,133 --> 00:00:48,666 So how do we classify this new data point to a cluster of files 17 00:00:48,766 --> 00:00:51,766 as a, red data point or a green data point? 18 00:00:51,800 --> 00:00:54,900 And that's where the k nearest neighbors algorithm will come to assist us. 19 00:00:55,366 --> 00:00:59,366 At the end of performing, this algorithm will be able to identify 20 00:00:59,533 --> 00:01:01,333 whether it's a red or green point. 21 00:01:01,333 --> 00:01:04,133 And in this case, the point turned out to be red. 22 00:01:04,133 --> 00:01:07,666 So how does the k nearest neighbor algorithm work? 23 00:01:07,766 --> 00:01:08,866 How did it do it? 24 00:01:08,866 --> 00:01:14,266 Well, we're going to build a step by step rule guide to the kernel. 25 00:01:14,500 --> 00:01:16,033 And after we've built 26 00:01:16,033 --> 00:01:19,033 it we're going to actually perform it manually to see how it works. 27 00:01:19,133 --> 00:01:21,700 And as you'll see it's a very very simple algorithm. 28 00:01:21,700 --> 00:01:22,000 All right. 29 00:01:22,000 --> 00:01:25,800 So the first step is to choose the number k of neighbors 30 00:01:25,800 --> 00:01:27,233 that you're going to have in your algorithm. 31 00:01:27,233 --> 00:01:27,800 So you go to 32 00:01:27,800 --> 00:01:33,366 you have to identify whether k is equal to 1 to 2 three five or some other number. 33 00:01:33,500 --> 00:01:35,133 And one of the most common. 34 00:01:35,133 --> 00:01:37,966 A default values for k is five. 35 00:01:37,966 --> 00:01:40,600 The next you need to take the k nearest neighbors 36 00:01:40,600 --> 00:01:43,600 of the new data point according to their Euclidean distance. 37 00:01:43,833 --> 00:01:45,200 Now here you can. 38 00:01:45,200 --> 00:01:47,200 You don't have to use Euclidean distance. 39 00:01:47,200 --> 00:01:49,866 You can use other distances such as the Manhattan distance 40 00:01:49,866 --> 00:01:53,300 or any other distances that you might be considering, 41 00:01:53,600 --> 00:01:56,400 but in most cases the Euclidean distances are used. 42 00:01:56,400 --> 00:01:57,600 So we're going to stick to those. 43 00:01:57,600 --> 00:02:01,100 So once you've taken the nearest neighbors among these k 44 00:02:01,366 --> 00:02:04,566 neighbors, you need to count the number of data points in each category. 45 00:02:04,566 --> 00:02:08,833 So how many data points fell into one category, into the other category, 46 00:02:08,833 --> 00:02:09,200 and so on. 47 00:02:09,200 --> 00:02:12,033 If you might even have more than two categories in your data set. 48 00:02:12,033 --> 00:02:15,166 So you just need to calculate how many fall into each category. 49 00:02:15,533 --> 00:02:17,066 And then you need to assign the new data 50 00:02:17,066 --> 00:02:20,066 point to the category where you counted the most neighbors. 51 00:02:20,133 --> 00:02:21,133 As simple as that. 52 00:02:21,133 --> 00:02:23,333 That's why it's called K-nearest neighbors. 53 00:02:23,333 --> 00:02:25,166 And then your model is ready. 54 00:02:25,166 --> 00:02:25,666 As you can see, it's 55 00:02:25,666 --> 00:02:29,100 a very simple algorithm and model which is going to do a manual exercise 56 00:02:29,100 --> 00:02:31,233 right now to really solidify this knowledge. 57 00:02:31,233 --> 00:02:33,433 So let's move on to that. 58 00:02:33,433 --> 00:02:35,800 So here we've got the new data 59 00:02:35,800 --> 00:02:38,800 point has been added to our scatterplot as we saw previously. 60 00:02:38,833 --> 00:02:43,300 How do we find the nearest neighbors of this new data point. 61 00:02:43,833 --> 00:02:47,500 Well let's have a look at the Euclidean distance that we're going to use. 62 00:02:47,966 --> 00:02:52,833 So Euclidean distance is a very basic type of distance that we define in geometry. 63 00:02:52,833 --> 00:02:55,366 It's the one we use in geometry. 64 00:02:55,366 --> 00:02:59,066 And basically if you have two points over here p1 and p2, 65 00:02:59,200 --> 00:03:03,566 then the distance between the two points is measured according to this formula. 66 00:03:03,566 --> 00:03:05,533 So x two minus x one. 67 00:03:05,533 --> 00:03:09,366 The difference between the x coordinates and then squared 68 00:03:09,366 --> 00:03:13,033 plus the difference between the y coordinates squared. 69 00:03:13,033 --> 00:03:15,933 And then you take a square root out of all of that. 70 00:03:15,933 --> 00:03:20,666 And that is basically if you look at it this way it's a right angle triangle. 71 00:03:20,766 --> 00:03:24,233 And so you're taking you're taking one cathedral's and you're squaring it. 72 00:03:24,233 --> 00:03:26,466 You're taking the other thing, you're squaring it. 73 00:03:26,466 --> 00:03:28,166 You're taking the you're adding them up. 74 00:03:28,166 --> 00:03:29,500 You're taking the square root. 75 00:03:29,500 --> 00:03:32,433 And that gives you the length of the hypotenuse. 76 00:03:32,433 --> 00:03:33,433 So there we go. 77 00:03:33,433 --> 00:03:35,766 And that's how Euclidean distance work. 78 00:03:35,766 --> 00:03:37,500 Again you could use any type of distance. 79 00:03:37,500 --> 00:03:39,900 But this is the geometrical distance. 80 00:03:39,900 --> 00:03:41,266 And this is the one we're going to stick to. 81 00:03:41,266 --> 00:03:44,533 So basically on a scatterplot a two dimensional scatterplot 82 00:03:44,533 --> 00:03:47,533 we can just draw the lines and see what is closer. 83 00:03:48,133 --> 00:03:49,833 So here's our new data point. 84 00:03:49,833 --> 00:03:52,833 How are we going to identify which are the closest five neighbors. 85 00:03:53,066 --> 00:03:57,166 So basically we just look at them and we see the distances here. 86 00:03:57,166 --> 00:03:58,966 So we can see that's the closest one. 87 00:03:58,966 --> 00:04:02,533 That's probably the second closest one the third closest fourth closest 88 00:04:02,533 --> 00:04:05,366 fifth closest. So let's outline those. 89 00:04:05,366 --> 00:04:05,766 There we go. 90 00:04:05,766 --> 00:04:08,566 So now all we have to do is step three. 91 00:04:08,566 --> 00:04:10,200 Among these kind of k neighbors. 92 00:04:10,200 --> 00:04:12,766 Count the number of data points in each category. 93 00:04:12,766 --> 00:04:14,000 You can see in category one. 94 00:04:14,000 --> 00:04:15,433 In the red one we have three neighbors. 95 00:04:15,433 --> 00:04:17,466 In category two we have two neighbors. 96 00:04:17,466 --> 00:04:21,666 So therefore step four assign the new data point to the category 97 00:04:21,666 --> 00:04:23,533 where you counted the most neighbors. 98 00:04:23,533 --> 00:04:26,533 That means we need to assign it to the red category. 99 00:04:26,866 --> 00:04:28,500 As simple as that. There we go. 100 00:04:28,500 --> 00:04:32,733 Now we have classified this new point and your model is ready. 101 00:04:33,033 --> 00:04:33,500 There we go. 102 00:04:33,500 --> 00:04:36,033 It's a it's a very straightforward algorithm. 103 00:04:36,033 --> 00:04:37,500 One of the simplest you can think about it. 104 00:04:37,500 --> 00:04:41,566 So it's very good to get this section started with a simple example like that. 105 00:04:41,866 --> 00:04:45,266 And also we're going to have a look at some practical exercises 106 00:04:45,466 --> 00:04:48,666 and had learned we'll show you around R and Python. 107 00:04:48,900 --> 00:04:50,100 I look forward to seeing you next time. 108 00:04:50,100 --> 00:04:52,233 And until then enjoy machine learning.