All right. So in this lesson, what we're going to talk about are some of the problems that are associated with neural networks, some of the disadvantages that these types of models have.

We've already talked about their black box nature: we don't really know why it is that a neural network might give a particular output. And this is actually very important when we care about the rationale for a particular decision that is made by a neural network. So imagine that a neural network is being used in the legal system, and this computer program's job is to set the amount of bail or the amount of jail time for a particular person. Now surely in this situation you'd want to be able to ask: wait, why is this guy getting two years and this other guy getting ten years? This is just one example of a situation where an opaque, unexplainable model is a real disadvantage.

But in this lesson I want to spend time talking about the other really big disadvantage of big, complex neural networks. That second really big disadvantage is cost. And this is a really weird one to wrap your head around at first, because what kind of cost are we talking about here? This cost actually comes in two forms: the first is the amount of data required, and the second is the amount of compute required to train a neural network. And when I say compute, what I'm talking about is the number of hours a model has to be trained on something like a GPU.

So what you're going to see in a minute is that these two costs are related, and they both go back to the structure of a neural network. Now, with any kind of model, a pretty good proxy for the degree of complexity of the model is the number of parameters.

So on this slide, for this example neural network where we're just estimating the weights, how many parameters do you think we actually need to estimate in total? Because as we're training this neural network, what we're doing is adjusting the weights of our different connections, right? So let's work out the total number of weights that we've got. On the very left we've got six input nodes, and then we've got the six nodes of our first hidden layer. So that's six times six, or a total of 36 connections, right? And then we can do the calculation for the rest of the layers too.
So between the second layer and the third layer we've got six times five, then five times four between the third and fourth layer, and then for the output layer we've just got four times one. Adding this all up, we've got 90 different connections, so 90 different weights that we need to estimate.

The issue is that the more parameters you have, the more data you need to figure out how to tweak each of them. A real-life example would be something like baking the perfect cake, like the one your grandma used to make. So you try it, and at the end you find out it doesn't taste quite the same; it's just not quite right. But where did things go wrong? Did you use too much flour, too few eggs, not enough sugar? Or did you leave it in the oven too long? To get this cake tasting just right, the way your gran used to make it, you might have to experiment quite a bit and bake it a few more times, tweaking each of those little parameters along the way. And of course, the more complex the problem, the more parameters you have, and the more you have to experiment to figure out what their values should be.

It's really a similar story with a neural network, or any model: the more complex the model, the more data it needs to chew through in order to get sensible parameter estimates. Now, the question you might ask at this point is: well, how much data do we need to train this network? And the answer is, it depends. But in the industry people tend to use a rule of thumb: you probably need about 10 times as many data points as you have parameters to estimate.

So in this case we've got 90 parameters, so we'd need around 900 data points. Now say that we were working with images. Instead of our six input nodes, we have a small image that we want to supply, and this image is going to be 25 pixels by 28 pixels. That means this black and white image would actually have 700 inputs, because 25 times 28 is 700. Now even if we keep the rest of the network the same, in that first hidden layer alone we've already got 4,200 different connection weights. The point I'm trying to make here is that the number of parameters goes up massively with the size of the network.
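To make that concrete, here's a minimal sketch in Python (my own illustration, not from the lecture slides) that counts the connection weights for the layer sizes used above. It assumes the architecture is 6-6-5-4-1 and, like the lecture, it ignores bias terms:

```python
# Count the connection weights in a fully connected network.
# Bias terms are ignored, since the lecture only counts weights.
layer_sizes = [6, 6, 5, 4, 1]  # input, three hidden layers, output

total_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total_weights)       # 36 + 30 + 20 + 4 = 90

# Rule of thumb from the lecture: ~10x as many data points as parameters.
print(total_weights * 10)  # 900

# Swap the 6 inputs for a 25x28 image and the first layer alone explodes:
image_inputs = 25 * 28     # 700 pixel inputs
print(image_inputs * 6)    # 4,200 weights into the first hidden layer
print(image_inputs * 7)    # 4,900 with just one extra hidden neuron
```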
So if we have these 700 inputs and we just add one extra neuron to that first hidden layer, all of a sudden we're jumping from 4,200 different weights to 4,900. And as the number of parameters goes up, the network needs to be fed more and more data to reliably estimate all of those weights during training.

Now in practice, depending on the field of application, it's actually not uncommon to have a hundred thousand, a million, or even ten million parameters in a neural network that need estimating. And at that point the amount of data that you need to get your hands on starts to become enormous. In addition, it has to be an enormous amount of good quality data. This is why the biggest and most complex neural networks always have the backing of large organizations, right, like Google, Facebook, Microsoft, or a government institution. Now the good news is that there's a lot of good data out there for the likes of you and me, and we're going to be able to build some pretty interesting things with the resources we have available.

So suppose that we manage to get hold of, say, half a million data points to train the ultimate cat detector. Now what? Well, we have to process this data, we have to store the data, we have to clean the data, and then eventually we have to train our neural network. And it's at this point that the second part of the cost that I was talking about hits you, because training means a lot of calculations. It means making a lot of predictions, estimating a lot of losses at each node in the neural network, and then making a lot of weight adjustments, for every single piece of training data that you feed through the network. So there is an incredible amount of computation going on for a large neural network. There's actually so much computation involved during training that your laptop just isn't going to be up for the job anymore; even with a desktop computer you're going to struggle. So at this point you're going to need some very serious hardware, and by that I mean you're probably going to need some GPUs. That's plural: GPUs. One might not even be enough.
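To give you a feel for where all that computation goes, here's a toy NumPy sketch (my own illustration with made-up shapes; real training code would use a framework like TensorFlow). Every single training example triggers a forward pass, a loss calculation, and an update to every weight:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(700, 6))     # 4,200 weights: the 700-pixel input layer
                                  # feeding a 6-neuron hidden layer

def train_step(x, y_true, W, lr=0.01):
    y_pred = x @ W                # forward pass: 700 x 6 multiply-adds
    error = y_pred - y_true       # loss signal at each output
    grad = np.outer(x, error)     # one gradient value per weight
    return W - lr * grad          # adjust all 4,200 weights at once

# Half a million data points means half a million of these steps per epoch,
# and usually many epochs. This demo only does ten.
for _ in range(10):
    x, y = rng.normal(size=700), rng.normal(size=6)
    W = train_step(x, y, W)
```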
Let's actually go on Amazon and take a look at what a top of the line GPU is going for these days. So if I fire up amazon.com and I type in, say, NVIDIA GeForce RTX 2080, right, that's probably one of the most popular gaming graphics cards these days. Taking a look here, we can see that it's about seven hundred and thirty dollars. That's pretty hefty. But I've actually got bad news for you: this pricey graphics card might be fine for gaming at 4K resolution, but in the world of deep learning, researchers tend to train their models on the likes of NVIDIA's Tesla cards, like the K80. The NVIDIA Tesla K80 actually costs about two and a half times as much, and it does on the order of eight trillion operations per second. This is what you'd actually find in a data center.

So needless to say, these are pretty expensive, and buying ten of them just probably isn't going to happen. Maybe we can, I don't know, write a really nice letter to Santa and hope for the best at Christmas. But these graphics cards are pretty pricey, and one reason is that people are buying them for cryptocurrency mining. We've got Bitcoin to thank for that, and it's pretty funny how NVIDIA's share price has pretty much been tracking the value of Bitcoin for the past two years.

So that's the world we live in nowadays. But not all hope is lost, because there's always the cloud. Maybe we can use somebody else's computer to train a large neural network. And you know, this isn't even such a bad solution: renting from someone like Amazon, or another company like FloydHub or Microsoft, is surely going to be a lot cheaper than buying your own GPUs. The last time I checked, the price per hour varied between 50 cents and about a dollar, so training a model that requires 100 GPU-hours will set you back 50 to 100 dollars. But what if your model requires 10,000 GPU-hours to train? Well, it still might be worth it, depending on your situation. But long training times do mean that you're probably going to be a little more thoughtful when it comes to iterating and tweaking your model, because retraining it is expensive. So what I'm saying is that Amazon AWS and these other services lower the barrier to entry for sure, it's still probably cheaper than buying your own GPUs, and there are even some good deals around these days. But even cloud services can get very expensive very quickly.
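The arithmetic here is simple, but it's worth seeing at both scales. Here's a quick sketch using the lecture's rough 50 cents to a dollar per GPU-hour (actual cloud prices vary by provider and change all the time):

```python
# Back-of-the-envelope cloud training cost at the lecture's rough rates.
def training_cost(gpu_hours, price_per_hour):
    return gpu_hours * price_per_hour

for hours in (100, 10_000):
    low = training_cost(hours, 0.50)
    high = training_cost(hours, 1.00)
    print(f"{hours:>6,} GPU-hours: ${low:,.0f} to ${high:,.0f}")

#    100 GPU-hours: $50 to $100
# 10,000 GPU-hours: $5,000 to $10,000
```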
And this brings me back to the questions I asked earlier. Can you use a neural network to solve almost any problem? Yes. Should you use a neural network to solve every problem? Mm, probably not. The data requirements, the training time, and the amount of computation it takes are why many people try to solve simple problems with simple methods. And even though neural networks can pretty much do anything, they're not always the best approach when you factor in the amount of data required, the amount of compute required, and the lack of transparency at the end. As a machine learning professional, what you really want to do is use the right tool for the job.

But I'm not going to leave this lecture on such a sad note, because neural networks are awesome and there is a way forward. So as part of this course, we're going to do a couple of things to make neural networks a bit more accessible. First off, we're going to be sticking to simpler models and smaller datasets, so that our computers can handle the computation. This is so we can learn the ropes without having to pay a large cost. But in addition, there is a way that we can actually play with a GPU for free. And so it is my pleasure to introduce to you Google Colab notebooks. Google Colab is based on the Jupyter notebooks that we've been using all along, but you access it through your browser. You don't have to install anything; all you need is a Google account and access to Google Drive, and all the calculations will then run on Google's servers. All you have to do to open a Colab notebook is go to drive.google.com, then go to New, then More, and here you'll see Colaboratory. You click on that, and it starts what is essentially a Jupyter notebook. After the page has loaded, you can go to Runtime, then Change runtime type, and under Hardware accelerator you can change this to GPU.

So that's the first piece of good news: a free GPU, hurray! This means we can do some pretty cool stuff regardless of how old our own personal computer is, because we can piggyback off Google's hardware. And I've read that when you select GPU here, you're in fact running on one of those Tesla K80s.
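Once you've switched the runtime over, it's worth confirming that Colab actually gave you a GPU. Here's a minimal check, assuming you're running TensorFlow in the notebook:

```python
import tensorflow as tf

# An empty list here means the runtime has no GPU attached.
print(tf.config.list_physical_devices('GPU'))

# In a Colab cell you can also run `!nvidia-smi` to see the exact card model.
```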
The second piece of good news that I've got for you is that you don't always have to train your own models. That's right: we can use other people's pre-trained models, which they've decided to share in a place called a model zoo. Essentially, some generous organizations and researchers have open-sourced their models and, more importantly, their weights for others to use. One website where you might find some of these models is modelzoo.co, and another is TensorFlow Hub. Both of these are libraries of machine learning models. So if I go to Modules and then to Image classification, you can pick and choose among these image classification models. And I know that some of these, like NASNet, have been trained for thousands of GPU-hours. I was actually listening to a podcast where one of the researchers explained how their best in class model for image classification was trained on over a million images from ImageNet, and it took them around 60,000 GPU-hours to train. And that's great news for us, right? Because it saves us having to rack up an enormous Amazon AWS bill. I still find it incredible that these kinds of models are available for us to use. And on that note, let's jump into the next lesson so I can show you how; the sketch below gives you a quick preview.
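Here's a minimal sketch of what loading a pre-trained image classifier from TensorFlow Hub can look like. This is my own preview, not the exact code from the next lesson; the MobileNetV2 handle is just one example module, and it assumes the tensorflow_hub package is installed:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Example handle for a MobileNetV2 ImageNet classifier on TensorFlow Hub.
# Any image classification module from tfhub.dev works the same way.
MODULE_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/4"

model = tf.keras.Sequential([
    # Someone else already paid the GPU-hours; we just download the weights.
    hub.KerasLayer(MODULE_URL, input_shape=(224, 224, 3)),
])

# Classify a dummy image: values in [0, 1], shape (batch, height, width, 3).
image = tf.random.uniform((1, 224, 224, 3))
logits = model(image)
print(logits.shape)  # (1, 1001): the 1,000 ImageNet classes plus background
```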