0
1
00:00:00,390 --> 00:00:07,430
Now I hope that you can't wait to calculate the mean squared error. From the previous cell up here,
1

2
00:00:07,530 --> 00:00:15,220
we already know what the lowest cost theta zero and theta one parameters are for our regression.
2

3
00:00:15,420 --> 00:00:18,760
And this is thanks to the scikit learn package.
3

4
00:00:18,930 --> 00:00:26,400
But let's work out what the mean squared error actually is for these two values.
4

5
00:00:26,640 --> 00:00:30,960
And for that we're going to need our estimated y values.
5

6
00:00:31,290 --> 00:00:37,340
Looking at our equation up here we need to calculate our y hat.
6

7
00:00:37,950 --> 00:00:51,070
What's this y hat equal to? Well, our y hat is gonna be equal to theta_0 + theta_1*x.
7

8
00:00:51,160 --> 00:00:58,780
This is the equation that underlies our linear regression. In this case, our y hat is based on the intercept
8

9
00:00:59,110 --> 00:01:00,970
plus the slope,
9

10
00:01:01,000 --> 00:01:03,050
times our X values.
10

11
00:01:03,190 --> 00:01:08,420
Let's translate that to Python code. So I'm going to create a variable called y_hat
11

12
00:01:08,750 --> 00:01:15,320
and I'm gonna set it equal to our theta_0 value which was this one here.
12

13
00:01:15,370 --> 00:01:16,920
I'm going to copy this.
13

14
00:01:17,220 --> 00:01:17,820
Come down here.
14

15
00:01:17,840 --> 00:01:20,970
Paste it in, add a plus sign.
15

16
00:01:21,300 --> 00:01:23,970
Take the theta_1 value,
16

17
00:01:23,970 --> 00:01:28,690
copy and paste it in and then write
17

18
00:01:28,740 --> 00:01:32,270
"* x_5".
18

19
00:01:32,460 --> 00:01:36,800
These, after all, are all the x values in our dataset.
19

20
00:01:37,230 --> 00:01:43,830
And that way we can calculate all the predicted values in our dataset.
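Here is a minimal sketch of that step. The numbers are made up for illustration: the x_5 values, theta_0 and theta_1 below are hypothetical stand-ins, not the values from the notebook, and x_5 is assumed to be a NumPy column vector.

```python
import numpy as np

# Hypothetical data: seven x values stored as a column vector, shape (7, 1).
x_5 = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]]).transpose()

# Hypothetical parameters, standing in for the intercept and slope
# that would be copied from the fitted regression.
theta_0 = 1.0
theta_1 = 2.0

# One line gives a prediction for every x, because NumPy applies
# the arithmetic to each element of the array.
y_hat = theta_0 + theta_1 * x_5
print(y_hat.shape)
```

Because x_5 is an array, no loop is needed to get a prediction for every data point.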
20

21
00:01:43,920 --> 00:01:45,590
Let's print these out for good measure.
21

22
00:01:45,600 --> 00:01:53,980
So I'm going to say "print" and then the string "'Est values y_hat are: ', 
22

23
00:01:54,090 --> 00:01:57,610
y_hat".
23

24
00:01:57,840 --> 00:02:00,670
Let me hit Shift+Enter and let's see what these look like.
24

25
00:02:01,440 --> 00:02:02,120
So here they are.
25

26
00:02:02,130 --> 00:02:05,790
These are all the predicted values. Now,
26

27
00:02:06,330 --> 00:02:15,030
one thing that you can do is if you don't want this first value here on the same line as your string
27

28
00:02:15,810 --> 00:02:21,700
what you can do is you can come in here where the string is and insert a special character -
28

29
00:02:21,840 --> 00:02:23,430
the new line character.
29

30
00:02:23,430 --> 00:02:25,730
So that's backslash and then
30

31
00:02:25,990 --> 00:02:36,180
"n". When I hit shift enter now that 0.969 will move to a new line, "\n" is
31

32
00:02:36,180 --> 00:02:40,020
like a special character for hitting return on your keyboard.
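As a small sketch of the newline trick, with a hypothetical three-value y_hat standing in for the real predictions:

```python
import numpy as np

# Hypothetical predicted values, for illustration only.
y_hat = np.array([[3.0], [5.0], [7.0]])

# The "\n" at the end of the string pushes the array onto its own line.
message = 'Est values y_hat are: \n'
print(message, y_hat)
```

Without the "\n", the first row of the array would start on the same line as the text.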
32

33
00:02:40,050 --> 00:02:44,170
So how do these predicted values compare to our actual values?
33

34
00:02:44,280 --> 00:02:45,410
Let's print those out as well.
34

35
00:02:45,430 --> 00:02:59,590
So I'm going to say "print" and then write "'In comparison, the actual y values are \n', y_
35

36
00:02:59,750 --> 00:03:00,720
5".
36

37
00:03:00,750 --> 00:03:03,690
These are actual values.
37

38
00:03:03,750 --> 00:03:05,040
There we go.
38

39
00:03:05,040 --> 00:03:12,750
Ideally we want these estimated values to be as close to these actual values as possible.
39

40
00:03:12,900 --> 00:03:14,750
And I'm looking at this.
40

41
00:03:14,850 --> 00:03:16,470
They're actually not too far off.
41

42
00:03:16,470 --> 00:03:18,270
This is not too bad.
42

43
00:03:18,420 --> 00:03:22,700
Given this theta_0 and this theta_1.
43

44
00:03:22,710 --> 00:03:30,420
So now that we know how to calculate our y hat values, let's work out the mean squared error of the regression.
44

45
00:03:30,780 --> 00:03:33,480
And I think this actually makes a really good challenge.
45

46
00:03:33,480 --> 00:03:45,780
So can you write a Python function that takes two inputs y and y hat and returns the mean squared error?
46

47
00:03:46,680 --> 00:03:54,380
And after you've done that, after you've written this MSE function can you call this function and print
47

48
00:03:54,400 --> 00:04:04,710
out the mean squared error with the y hat calculated above, so feed in this y hat value into your MSE
48

49
00:04:04,710 --> 00:04:08,790
function and print out what the mean squared error is.
49

50
00:04:08,790 --> 00:04:15,290
Now remember the mean squared error equation looks like this.
50

51
00:04:15,540 --> 00:04:22,050
We've got that formula in LaTeX notation, so all you need to do is translate it into Python code and
51

52
00:04:22,170 --> 00:04:27,400
look up what code to use for that pesky summation symbol right here.
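For reference, the standard mean squared error formula can be written as below. The exact notation in the notebook isn't visible in this transcript, so the symbols here (n for the number of samples, a superscript (i) for the i-th data point) are an assumption.

```latex
MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2,
\qquad \hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}
```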
52

53
00:04:28,290 --> 00:04:33,410
I'll give you a few seconds to pause the video and try this on your own.
53

54
00:04:36,280 --> 00:04:36,610
All right.
54

55
00:04:36,610 --> 00:04:37,210
Ready?
55

56
00:04:37,250 --> 00:04:39,060
So I hope you figured this out.
56

57
00:04:39,160 --> 00:04:40,950
Now in terms of the solution.
57

58
00:04:41,020 --> 00:04:42,980
I'm not just going to show you one solution.
58

59
00:04:43,060 --> 00:04:49,810
I'm going to show you three different ways that you can implement this, because there are actually many
59

60
00:04:49,810 --> 00:04:53,520
ways you can skin a cat as they say.
60

61
00:04:53,530 --> 00:04:58,060
So while you had the video paused and were solving the challenge,
61

62
00:04:58,060 --> 00:05:01,570
I was busy looking up possible uses for all that cat skin.
62

63
00:05:02,200 --> 00:05:08,060
And the best one I came across was a Japanese instrument called shamisen.
63

64
00:05:08,080 --> 00:05:13,120
Apparently the drum on this banjo was actually traditionally made from cat skin.
64

65
00:05:13,120 --> 00:05:19,510
So there's your random fact of the day. If you've come across any other uses do let me know in the
65

66
00:05:19,570 --> 00:05:21,610
comments section.
66

67
00:05:21,610 --> 00:05:27,910
The Python code that you will have written will probably look something like this.
67

68
00:05:27,910 --> 00:05:36,760
You're going to have "def mse():" and then for the parameters you'll have y and, say, y_hat, then a colon. And the
68

69
00:05:36,820 --> 00:05:39,490
first possible approach would look something like this.
69

70
00:05:39,500 --> 00:05:48,490
You might have a variable, say, called "mse_calc" and that would have been equal to 1 divided by 7 because
70

71
00:05:48,490 --> 00:05:54,290
we've got 7 data points and you multiply that by the sum,
71

72
00:05:54,340 --> 00:05:56,680
this is an inbuilt Python function;
72

73
00:05:56,680 --> 00:06:04,420
"sum()" and then inside the sum you'll have another set of parentheses and there you'll have
73

74
00:06:04,630 --> 00:06:10,550
"y-y_hat" to the power of 2.
74

75
00:06:10,600 --> 00:06:13,030
So **2.
75

76
00:06:13,250 --> 00:06:20,090
And because this function needs an output you would return the results of your calculation.
76

77
00:06:20,110 --> 00:06:25,650
So mse_calc and this is one possible way to do it.
77

78
00:06:25,660 --> 00:06:31,050
Let me call this function and print out the mean squared error of y hat that we calculated.
78

79
00:06:31,060 --> 00:06:39,070
So it'll be "mse(y_5,", because these are the actual y values and then I'll supply y_hat that we
79

80
00:06:39,070 --> 00:06:49,160
calculated a few cells above. I'm going to hit Shift+Enter and see what we get. We get 0.95 approximately.
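Put together, this first approach might look like the sketch below. The y_5 and y_hat arrays are hypothetical stand-ins for the notebook's data, so the number printed here will not be the 0.95 from the video.

```python
import numpy as np

def mse(y, y_hat):
    # 1/7 is hardcoded, so this is only correct for exactly seven data points.
    mse_calc = 1/7 * sum((y - y_hat)**2)
    return mse_calc

# Hypothetical actual and predicted values, both column vectors of shape (7, 1).
y_5 = np.array([[3.5, 4.5, 7.0, 9.5, 10.5, 13.0, 15.0]]).transpose()
y_hat = np.array([[3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]]).transpose()

# With column vectors, the built-in sum() adds up the rows and
# returns a one-element array rather than a plain float.
print(mse(y_5, y_hat))
```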
80

81
00:06:49,180 --> 00:06:53,830
Now there is one improvement that we can make to this function.
81

82
00:06:53,830 --> 00:07:02,050
So this 1/7 will probably be bothering you a little bit because it's hardcoded and
82

83
00:07:02,110 --> 00:07:05,450
it would only work or calculate the mean squared error correctly
83

84
00:07:05,620 --> 00:07:12,520
if we had 7 data points - if we had 8 or 9 then the mean squared error formula wouldn't be apt
84

85
00:07:12,520 --> 00:07:15,250
anymore, wouldn't be correct anymore with this Python code.
85

86
00:07:15,670 --> 00:07:18,450
So I'm going to comment this out.
86

87
00:07:18,520 --> 00:07:21,160
But I'm going to copy it and paste it below,
87

88
00:07:21,520 --> 00:07:28,580
and what I want to do is I'm going to replace this 1/7 with some different code.
88

89
00:07:28,690 --> 00:07:38,230
I'm going to say "1/y.size" - y.size will return the length or the number of
89

90
00:07:38,230 --> 00:07:43,380
samples that are fed into the placeholder for our y values.
90

91
00:07:43,420 --> 00:07:46,860
So this will also work.
91

92
00:07:46,900 --> 00:07:56,500
Let me hit Shift+Enter on this cell and the one below to prove just that. Writing the code this way makes
92

93
00:07:56,680 --> 00:08:03,590
the MSE function much more general because it figures out the number of samples inside of the function.
93

94
00:08:03,670 --> 00:08:11,170
It doesn't have to be supplied as an extra parameter and it's not hardcoded in the instructions themselves.
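A sketch of the more general version, tried on a hypothetical eight-point dataset to show that the sample count no longer needs to be seven. One thing to keep in mind: y.size counts every element in the array, which equals the number of samples here only because y is a single column.

```python
import numpy as np

def mse(y, y_hat):
    # mse_calc = 1/7 * sum((y - y_hat)**2)   # old version, hardcoded for 7 points
    mse_calc = 1/y.size * sum((y - y_hat)**2)
    return mse_calc

# Hypothetical data with eight points; every prediction is off by 0.5.
y_8 = np.array([[1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]]).transpose()
y_hat_8 = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]).transpose()

print(mse(y_8, y_hat_8))
```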
94

95
00:08:11,170 --> 00:08:17,710
So this is already quite nice but there is another way you could have done this as well, maybe you did
95

96
00:08:17,710 --> 00:08:26,410
a bit of googling and you figured out that numpy actually has a function called "average" and this does
96

97
00:08:26,410 --> 00:08:31,950
both the summation and the dividing by the number of samples for us.
97

98
00:08:31,990 --> 00:08:33,460
So let's see what this would look like.
98

99
00:08:33,550 --> 00:08:43,180
This would be "mse_calc = np.average()" and then it needs two things -
99

100
00:08:43,210 --> 00:08:50,560
it needs the expression itself that it's averaging and whether it should be averaging the rows, the columns
100

101
00:08:50,770 --> 00:08:52,130
or the whole thing.
101

102
00:08:52,210 --> 00:09:00,960
So first things first, the expression it's averaging would be (y-y_hat)**2
102

103
00:09:01,620 --> 00:09:07,390
and the way to tell it whether it should average the rows or the columns would be by supplying the "axis"
103

104
00:09:07,750 --> 00:09:15,820
argument, so "axis=0" averages over the rows instead of over both the rows and columns, which is not what
104

105
00:09:15,820 --> 00:09:16,670
we want.
105

106
00:09:16,690 --> 00:09:24,900
So when I hit Shift+Enter on this and then run the cell below again, we can see that this works as well.
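The np.average version could be sketched like this, again with hypothetical column vectors in place of the notebook's data:

```python
import numpy as np

def mse(y, y_hat):
    # np.average handles both the summing and the division by the sample count.
    # axis=0 averages down the rows, giving one value per column.
    mse_calc = np.average((y - y_hat)**2, axis=0)
    return mse_calc

# Hypothetical actual and predicted values, shape (7, 1).
y_5 = np.array([[3.5, 4.5, 7.0, 9.5, 10.5, 13.0, 15.0]]).transpose()
y_hat = np.array([[3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]]).transpose()

result = mse(y_5, y_hat)
print(result)
```

With a single column of y values the result is a one-element array, the same shape the sum-based version returns.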
106

107
00:09:25,540 --> 00:09:32,580
But just because we've printed out a mean square error in this cell here doesn't mean it's correct.
107

108
00:09:32,620 --> 00:09:37,450
So let's check it against an inbuilt function from scikit learn.
108

109
00:09:37,560 --> 00:09:40,690
So we wrap this in a print statement. Write
109

110
00:09:40,690 --> 00:09:42,280
print and then the string
110

111
00:09:44,920 --> 00:09:57,510
"Manually calculated MSE is:" and then we have our mse function in here. To use the inbuilt function
111

112
00:09:57,750 --> 00:10:01,260
that inbuilt mean square error function from scikit learn,
112

113
00:10:01,500 --> 00:10:03,130
we're gonna have to import it.
113

114
00:10:03,200 --> 00:10:07,980
I'll have to scroll to the very top and then here we're gonna add another import statement, we're going to say
114

115
00:10:07,980 --> 00:10:22,770
"from sklearn.metrics import mean_squared_error .
115

116
00:10:23,300 --> 00:10:24,720
Sorry sorry.
116

117
00:10:24,930 --> 00:10:31,380
I hope you're not too upset and will forgive me that I made you do all this work even though there is
117

118
00:10:31,380 --> 00:10:40,300
an inbuilt function for this already. Let's call this mean_squared_error function with both our
118

119
00:10:40,300 --> 00:10:47,110
manually calculated y_hat as well as with the output of the regression directly.
119

120
00:10:47,110 --> 00:10:53,440
So where we were adding our print statement let's add another one let's add "print('MSE
120

121
00:10:56,660 --> 00:11:08,090
regression using manual calc is:', )", and then "mean_squared_error()", and we will
121

122
00:11:08,090 --> 00:11:16,250
supply two inputs: y_5, our actual y values and y_hat, which we've calculated
122

123
00:11:16,790 --> 00:11:18,500
above just here.
123

124
00:11:22,250 --> 00:11:27,530
And now let me copy this line and paste it below.
124

125
00:11:27,530 --> 00:11:35,230
And this one will read "'MSE regression is', mean_squared_error(y_5,)" and instead of our manually calculated
125

126
00:11:35,230 --> 00:11:45,390
one we can use "regr.predict(x_5)" and when we hit Shift+Enter,
126

127
00:11:45,880 --> 00:11:53,530
we see that indeed all the outputs agree with one another. Our manually calculated mean squared error is
127

128
00:11:53,530 --> 00:11:58,240
the same as what we get from the machine learning package.
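The whole check can be sketched end to end as follows. The data is hypothetical, and fitting with LinearRegression is an assumption: the transcript only shows that the model object is called regr and comes from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical data, both column vectors of shape (7, 1).
x_5 = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]]).transpose()
y_5 = np.array([[3.5, 4.5, 7.0, 9.5, 10.5, 13.0, 15.0]]).transpose()

# Assumption: regr is a LinearRegression model fitted on this data.
regr = LinearRegression()
regr.fit(x_5, y_5)

# Rebuild the predictions by hand from the fitted intercept and slope.
theta_0 = regr.intercept_[0]
theta_1 = regr.coef_[0][0]
y_hat = theta_0 + theta_1 * x_5

def mse(y, y_hat):
    return np.average((y - y_hat)**2, axis=0)

manual_mse = mse(y_5, y_hat)
sklearn_manual = mean_squared_error(y_5, y_hat)
sklearn_direct = mean_squared_error(y_5, regr.predict(x_5))

print('Manually calculated MSE is:', manual_mse)
print('MSE regression using manual calc is:', sklearn_manual)
print('MSE regression is', sklearn_direct)
```

All three printed numbers should agree, apart from the first one being wrapped in a one-element array.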
128

129
00:11:58,240 --> 00:12:03,700
So we've dissected and implemented this cost function correctly.
129

130
00:12:03,700 --> 00:12:04,840
Brilliant.
130

131
00:12:04,840 --> 00:12:11,770
Now it's time to plot our cost function and visualize it. We'll do that in the next lesson.
131

132
00:12:11,830 --> 00:12:12,670
I'll see you there.