1
00:00:00,766 --> 00:00:03,000
Hello and welcome back to the course
on Machine Learning.

2
00:00:03,000 --> 00:00:06,900
In today's tutorial we're going to talk
about the k nearest neighbors algorithm.

3
00:00:07,466 --> 00:00:09,833
All right so let's get started.

4
00:00:09,833 --> 00:00:12,566
So let's imagine that we have a scenario

5
00:00:12,566 --> 00:00:16,266
where we have two categories
already present in our data set.

6
00:00:16,266 --> 00:00:18,133
So we've identified two categories

7
00:00:18,133 --> 00:00:21,766
and one is category
one on the left which is red.

8
00:00:21,766 --> 00:00:23,533
Category two is green on the right.

9
00:00:23,533 --> 00:00:25,766
And for simplicity's sake
we're just going to

10
00:00:25,766 --> 00:00:30,066
take into consideration two variables
or two columns in our data set.

11
00:00:30,066 --> 00:00:34,333
So all of this grouping is happening
based on these two columns x1 and x2.

12
00:00:34,766 --> 00:00:37,766
And now let's say we add a new data point
into our dataset.

13
00:00:38,000 --> 00:00:39,166
The question is

14
00:00:39,166 --> 00:00:42,833
should it fall into the red category
or should fall into the green category?

15
00:00:42,866 --> 00:00:44,133
How do we decide that?

16
00:00:44,133 --> 00:00:48,666
So how do we classify this new data point
to a cluster of files

17
00:00:48,766 --> 00:00:51,766
as a, red data point
or a green data point?

18
00:00:51,800 --> 00:00:54,900
And that's where the k nearest neighbors
algorithm will come to assist us.

19
00:00:55,366 --> 00:00:59,366
At the end of performing,
this algorithm will be able to identify

20
00:00:59,533 --> 00:01:01,333
whether it's a red or green point.

21
00:01:01,333 --> 00:01:04,133
And in this case,
the point turned out to be red.

22
00:01:04,133 --> 00:01:07,666
So how does the k nearest neighbor
algorithm work?

23
00:01:07,766 --> 00:01:08,866
How did it do it?

24
00:01:08,866 --> 00:01:14,266
Well, we're going to build a step by step
rule guide to the kernel.

25
00:01:14,500 --> 00:01:16,033
And after we've built

26
00:01:16,033 --> 00:01:19,033
it we're going to actually perform
it manually to see how it works.

27
00:01:19,133 --> 00:01:21,700
And as you'll see
it's a very very simple algorithm.

28
00:01:21,700 --> 00:01:22,000
All right.

29
00:01:22,000 --> 00:01:25,800
So the first step is to choose the number
k of neighbors

30
00:01:25,800 --> 00:01:27,233
that you're going
to have in your algorithm.

31
00:01:27,233 --> 00:01:27,800
So you go to

32
00:01:27,800 --> 00:01:33,366
you have to identify whether k is equal
to 1 to 2 three five or some other number.

33
00:01:33,500 --> 00:01:35,133
And one of the most common.

34
00:01:35,133 --> 00:01:37,966
A default values for k is five.

35
00:01:37,966 --> 00:01:40,600
The next you need to take
the k nearest neighbors

36
00:01:40,600 --> 00:01:43,600
of the new data point
according to their Euclidean distance.

37
00:01:43,833 --> 00:01:45,200
Now here you can.

38
00:01:45,200 --> 00:01:47,200
You don't have to use Euclidean distance.

39
00:01:47,200 --> 00:01:49,866
You can use other distances
such as the Manhattan distance

40
00:01:49,866 --> 00:01:53,300
or any other distances
that you might be considering,

41
00:01:53,600 --> 00:01:56,400
but in most cases
the Euclidean distances are used.

42
00:01:56,400 --> 00:01:57,600
So we're going to stick to those.

43
00:01:57,600 --> 00:02:01,100
So once you've taken the nearest neighbors
among these k

44
00:02:01,366 --> 00:02:04,566
neighbors, you need to count the number
of data points in each category.

45
00:02:04,566 --> 00:02:08,833
So how many data points fell into one
category, into the other category,

46
00:02:08,833 --> 00:02:09,200
and so on.

47
00:02:09,200 --> 00:02:12,033
If you might even have more
than two categories in your data set.

48
00:02:12,033 --> 00:02:15,166
So you just need to calculate
how many fall into each category.

49
00:02:15,533 --> 00:02:17,066
And then you need to assign the new data

50
00:02:17,066 --> 00:02:20,066
point to the category
where you counted the most neighbors.

51
00:02:20,133 --> 00:02:21,133
As simple as that.

52
00:02:21,133 --> 00:02:23,333
That's
why it's called K-nearest neighbors.

53
00:02:23,333 --> 00:02:25,166
And then your model is ready.

54
00:02:25,166 --> 00:02:25,666
As you can see, it's

55
00:02:25,666 --> 00:02:29,100
a very simple algorithm and model
which is going to do a manual exercise

56
00:02:29,100 --> 00:02:31,233
right now to really solidify
this knowledge.

57
00:02:31,233 --> 00:02:33,433
So let's move on to that.

58
00:02:33,433 --> 00:02:35,800
So here we've got the new data

59
00:02:35,800 --> 00:02:38,800
point has been added to our scatterplot
as we saw previously.

60
00:02:38,833 --> 00:02:43,300
How do we find the nearest neighbors
of this new data point.

61
00:02:43,833 --> 00:02:47,500
Well let's have a look at the Euclidean
distance that we're going to use.

62
00:02:47,966 --> 00:02:52,833
So Euclidean distance is a very basic type
of distance that we define in geometry.

63
00:02:52,833 --> 00:02:55,366
It's the one we use in geometry.

64
00:02:55,366 --> 00:02:59,066
And basically if you have two points
over here p1 and p2,

65
00:02:59,200 --> 00:03:03,566
then the distance between the two points
is measured according to this formula.

66
00:03:03,566 --> 00:03:05,533
So x two minus x one.

67
00:03:05,533 --> 00:03:09,366
The difference between the x coordinates
and then squared

68
00:03:09,366 --> 00:03:13,033
plus the difference between the
y coordinates squared.

69
00:03:13,033 --> 00:03:15,933
And then you take a square root
out of all of that.

70
00:03:15,933 --> 00:03:20,666
And that is basically if you look at it
this way it's a right angle triangle.

71
00:03:20,766 --> 00:03:24,233
And so you're taking you're taking one
cathedral's and you're squaring it.

72
00:03:24,233 --> 00:03:26,466
You're taking the other thing,
you're squaring it.

73
00:03:26,466 --> 00:03:28,166
You're taking the you're adding them up.

74
00:03:28,166 --> 00:03:29,500
You're taking the square root.

75
00:03:29,500 --> 00:03:32,433
And that gives you the
length of the hypotenuse.

76
00:03:32,433 --> 00:03:33,433
So there we go.

77
00:03:33,433 --> 00:03:35,766
And that's how Euclidean distance work.

78
00:03:35,766 --> 00:03:37,500
Again you could use any type of distance.

79
00:03:37,500 --> 00:03:39,900
But this is the geometrical distance.

80
00:03:39,900 --> 00:03:41,266
And this is
the one we're going to stick to.

81
00:03:41,266 --> 00:03:44,533
So basically on a scatterplot
a two dimensional scatterplot

82
00:03:44,533 --> 00:03:47,533
we can just draw the lines
and see what is closer.

83
00:03:48,133 --> 00:03:49,833
So here's our new data point.

84
00:03:49,833 --> 00:03:52,833
How are we going to identify
which are the closest five neighbors.

85
00:03:53,066 --> 00:03:57,166
So basically we just look at them
and we see the distances here.

86
00:03:57,166 --> 00:03:58,966
So we can see that's the closest one.

87
00:03:58,966 --> 00:04:02,533
That's probably the second closest one
the third closest fourth closest

88
00:04:02,533 --> 00:04:05,366
fifth closest. So let's outline those.

89
00:04:05,366 --> 00:04:05,766
There we go.

90
00:04:05,766 --> 00:04:08,566
So now all we have to do is step three.

91
00:04:08,566 --> 00:04:10,200
Among these kind of k neighbors.

92
00:04:10,200 --> 00:04:12,766
Count the number of data points
in each category.

93
00:04:12,766 --> 00:04:14,000
You can see in category one.

94
00:04:14,000 --> 00:04:15,433
In the red one we have three neighbors.

95
00:04:15,433 --> 00:04:17,466
In category two we have two neighbors.

96
00:04:17,466 --> 00:04:21,666
So therefore step four assign
the new data point to the category

97
00:04:21,666 --> 00:04:23,533
where you counted the most neighbors.

98
00:04:23,533 --> 00:04:26,533
That means we need to assign it
to the red category.

99
00:04:26,866 --> 00:04:28,500
As simple as that. There we go.

100
00:04:28,500 --> 00:04:32,733
Now we have classified this new point
and your model is ready.

101
00:04:33,033 --> 00:04:33,500
There we go.

102
00:04:33,500 --> 00:04:36,033
It's a it's
a very straightforward algorithm.

103
00:04:36,033 --> 00:04:37,500
One of the simplest
you can think about it.

104
00:04:37,500 --> 00:04:41,566
So it's very good to get this section
started with a simple example like that.

105
00:04:41,866 --> 00:04:45,266
And also we're going to have a look
at some practical exercises

106
00:04:45,466 --> 00:04:48,666
and had learned
we'll show you around R and Python.

107
00:04:48,900 --> 00:04:50,100
I look forward to seeing you next time.

108
00:04:50,100 --> 00:04:52,233
And until then enjoy machine learning.