All right. So in this lesson, what we're going to talk about are some of the problems that are associated with neural networks, some of the disadvantages that these types of models have.

We've already talked about their black box nature: we don't really know why it is that a neural network might give a particular output. And this is actually very important when we care about the rationale for a particular decision that is made by a neural network. So imagine that a neural network is being used in the legal system, and this computer program's job is to set the amount of bail or the amount of jail time for a particular person. Now surely in this situation you'd want to be able to ask: wait, why is this guy getting two years and this other guy getting ten years? This is just one example of a situation where an opaque, unexplainable model is a real disadvantage.

But in this lesson I want to spend time talking about the other really big disadvantage of big, complex neural networks. That second really big disadvantage is cost. And this is a really weird one to wrap your head around at first, because what kind of cost are we talking about here? This cost actually comes in two forms: the first is the amount of data required, and the second is the amount of compute required to train a neural network. And when I say compute, what I'm talking about is the number of hours a model has to be trained on something like a GPU.

So what you're going to see in a minute is that these two costs are related, and they both go back to the structure of a neural network. Now, with any kind of model, a pretty good proxy for the degree of complexity of the model is the number of parameters.

So on this slide, for this example neural network where we're just estimating the weights, how many parameters do you think we actually need to estimate in total? Because as we're training this neural network, what we're doing is adjusting the weights of our different connections, right? So let's work out the total number of weights that we've got. On the very left we've got six input nodes, and then we've got the six nodes of our first hidden layer. So that's six times six, or a total of 36 connections, right? And then we can do the calculation for the rest of the layers too.
So between the second layer and the third layer we've got six times five, then five times four between the third and fourth layer, and then for the output layer we've just got four times one. Adding this all up, we've got 90 different connections, so 90 different weights that we need to estimate.

The issue is that the more parameters you have, the more data you need to figure out how to tweak each of them. A real-life example would be something like baking the perfect cake, like the one your grandma used to make. So you try it, and at the end you find out it doesn't taste quite the same; it's just not quite right. But where did things go wrong? Did you use too much flour, too few eggs, not enough sugar? Or did you leave it in the oven too long? To get this cake tasting just right, the way your gran used to make it, you might have to experiment quite a bit and bake it a few more times, tweaking each of those little parameters along the way. And of course, the more complex the problem, the more parameters you have, and the more you have to experiment to figure out what their values should be.

It's really a similar story with a neural network, or any model: the more complex the model, the more data it needs to chew through in order to get sensible parameter estimates. Now, the question you might ask at this point is: well, how much data do we need to train this network? And the answer is, it depends. But in the industry people tend to use a rule of thumb: you probably need about 10 times as many data points as you have parameters to estimate.

So in this case we've got 90 parameters, so we'd need around 900 data points. Now say that we were working with images. Instead of our six input nodes, we have a small image that we want to supply, and this image is going to be 25 pixels by 28 pixels. That means this black and white image would actually have 700 inputs, because 25 times 28 is 700. Now even if we keep the rest of the network the same, in that first hidden layer alone we've already got 4,200 different connection weights. The point I'm trying to make here is that the number of parameters goes up massively with the size of the network.
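To make that concrete, here's a minimal sketch in Python (my own illustration, not from the lecture slides) that counts the connection weights for the layer sizes used above. It assumes the architecture is 6-6-5-4-1 and, like the lecture, it ignores bias terms:

```python
# Count the connection weights in a fully connected network.
# Bias terms are ignored, since the lecture only counts weights.
layer_sizes = [6, 6, 5, 4, 1]  # input, three hidden layers, output

total_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total_weights)       # 36 + 30 + 20 + 4 = 90

# Rule of thumb from the lecture: ~10x as many data points as parameters.
print(total_weights * 10)  # 900

# Swap the 6 inputs for a 25x28 image and the first layer alone explodes:
image_inputs = 25 * 28     # 700 pixel inputs
print(image_inputs * 6)    # 4,200 weights into the first hidden layer
print(image_inputs * 7)    # 4,900 with just one extra hidden neuron
```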
So if we have these 700 inputs and we just add one extra neuron to that first hidden layer, all of a sudden we're jumping from 4,200 different weights to 4,900. And as the number of parameters goes up, the network needs to be fed more and more data to reliably estimate all of those weights during training.

Now in practice, depending on the field of application, it's actually not uncommon to have a hundred thousand, a million, or even ten million parameters in a neural network that need estimating. And at that point the amount of data that you need to get your hands on starts to become enormous. In addition, it has to be an enormous amount of good quality data. This is why the biggest and most complex neural networks always have the backing of large organizations, right, like Google, Facebook, Microsoft, or a government institution. Now the good news is that there's a lot of good data out there for the likes of you and me, and we're going to be able to build some pretty interesting things with the resources we have available.

So suppose that we manage to get hold of, say, half a million data points to train the ultimate cat detector. Now what? Well, we have to process this data, we have to store the data, we have to clean the data, and then eventually we have to train our neural network. And it's at this point that the second part of the cost that I was talking about hits you, because training means a lot of calculations. It means making a lot of predictions, estimating a lot of losses at each node in the neural network, and then making a lot of weight adjustments, for every single piece of training data that you feed through the network. So there is an incredible amount of computation going on for a large neural network. There's actually so much computation involved during training that your laptop just isn't going to be up for the job anymore; even with a desktop computer you're going to struggle. So at this point you're going to need some very serious hardware, and by that I mean you're probably going to need some GPUs. That's plural: GPUs. One might not even be enough.
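To give you a feel for where all that computation goes, here's a toy NumPy sketch (my own illustration with made-up shapes; real training code would use a framework like TensorFlow). Every single training example triggers a forward pass, a loss calculation, and an update to every weight:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(700, 6))     # 4,200 weights: the 700-pixel input layer
                                  # feeding a 6-neuron hidden layer

def train_step(x, y_true, W, lr=0.01):
    y_pred = x @ W                # forward pass: 700 x 6 multiply-adds
    error = y_pred - y_true       # loss signal at each output
    grad = np.outer(x, error)     # one gradient value per weight
    return W - lr * grad          # adjust all 4,200 weights at once

# Half a million data points means half a million of these steps per epoch,
# and usually many epochs. This demo only does ten.
for _ in range(10):
    x, y = rng.normal(size=700), rng.normal(size=6)
    W = train_step(x, y, W)
```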
Let's actually go on Amazon and take a look at what a top of the line GPU is going for these days. So if I fire up amazon.com and I type in, say, NVIDIA GeForce RTX 2080, right, that's probably one of the most popular gaming graphics cards these days. Taking a look here, we can see that it's about seven hundred and thirty dollars. That's pretty hefty. But I've actually got bad news for you: this pricey graphics card might be fine for gaming at 4K resolution, but in the world of deep learning, researchers tend to train their models on the likes of NVIDIA's Tesla cards, like the K80. The NVIDIA Tesla K80 actually costs about two and a half times as much, and it does on the order of eight trillion operations per second. This is what you'd actually find in a data center.

So needless to say, these are pretty expensive, and buying ten of them just probably isn't going to happen. Maybe we can, I don't know, write a really nice letter to Santa and hope for the best at Christmas. But these graphics cards are pretty pricey, and one reason is that people are buying them for cryptocurrency mining. We've got Bitcoin to thank for that, and it's pretty funny how NVIDIA's share price has pretty much been tracking the value of Bitcoin for the past two years.

So that's the world we live in nowadays. But not all hope is lost, because there's always the cloud. Maybe we can use somebody else's computer to train a large neural network. And you know, this isn't even such a bad solution: renting from someone like Amazon, or another company like FloydHub or Microsoft, is surely going to be a lot cheaper than buying your own GPUs. The last time I checked, the price per hour varied between 50 cents and about a dollar, so training a model that requires 100 GPU-hours will set you back 50 to 100 dollars. But what if your model requires 10,000 GPU-hours to train? Well, it still might be worth it, depending on your situation. But long training times do mean that you're probably going to be a little more thoughtful when it comes to iterating and tweaking your model, because retraining it is expensive. So what I'm saying is that Amazon AWS and these other services lower the barrier to entry for sure, it's still probably cheaper than buying your own GPUs, and there are even some good deals around these days. But even cloud services can get very expensive very quickly.
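The arithmetic here is simple, but it's worth seeing at both scales. Here's a quick sketch using the lecture's rough 50 cents to a dollar per GPU-hour (actual cloud prices vary by provider and change all the time):

```python
# Back-of-the-envelope cloud training cost at the lecture's rough rates.
def training_cost(gpu_hours, price_per_hour):
    return gpu_hours * price_per_hour

for hours in (100, 10_000):
    low = training_cost(hours, 0.50)
    high = training_cost(hours, 1.00)
    print(f"{hours:>6,} GPU-hours: ${low:,.0f} to ${high:,.0f}")

#    100 GPU-hours: $50 to $100
# 10,000 GPU-hours: $5,000 to $10,000
```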
And this brings me back to the questions I asked earlier. Can you use a neural network to solve almost any problem? Yes. Should you use a neural network to solve every problem? Mm, probably not. The data requirements, the training time, and the amount of computation it takes are why many people try to solve simple problems with simple methods. And even though neural networks can pretty much do anything, they're not always the best approach when you factor in the amount of data required, the amount of compute required, and the lack of transparency at the end. As a machine learning professional, what you really want to do is use the right tool for the job.

But I'm not going to leave this lecture on such a sad note, because neural networks are awesome and there is a way forward. So as part of this course, we're going to do a couple of things to make neural networks a bit more accessible. First off, we're going to be sticking to simpler models and smaller datasets, so that our computers can handle the computation. This is so we can learn the ropes without having to pay a large cost. But in addition, there is a way that we can actually play with a GPU for free. And so it is my pleasure to introduce to you Google Colab notebooks. Google Colab is based on the Jupyter notebooks that we've been using all along, but you access it through your browser. You don't have to install anything; all you need is a Google account and access to Google Drive, and all the calculations will then run on Google's servers. All you have to do to open a Colab notebook is go to drive.google.com, then go to New, then More, and here you'll see Colaboratory. You click on that, and it starts what is essentially a Jupyter notebook. After the page has loaded, you can go to Runtime, then Change runtime type, and under Hardware accelerator you can change this to GPU.

So that's the first piece of good news: a free GPU, hurray! This means we can do some pretty cool stuff regardless of how old our own personal computer is, because we can piggyback off Google's hardware. And I've read that when you select GPU here, you're in fact running on one of those Tesla K80s.
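Once you've switched the runtime over, it's worth confirming that Colab actually gave you a GPU. Here's a minimal check, assuming you're running TensorFlow in the notebook:

```python
import tensorflow as tf

# An empty list here means the runtime has no GPU attached.
print(tf.config.list_physical_devices('GPU'))

# In a Colab cell you can also run `!nvidia-smi` to see the exact card model.
```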
The second piece of good news that I've got for you is that you don't always have to train your own models. That's right: we can use other people's pre-trained models, which they've decided to share in a place called a model zoo. Essentially, some generous organizations and researchers have open-sourced their models and, more importantly, their weights for others to use. One website where you might find some of these models is modelzoo.co, and another is TensorFlow Hub. Both of these are libraries of machine learning models. So if I go to Modules and then to Image classification, you can pick and choose among these image classification models. And I know that some of these, like NASNet, have been trained for thousands of GPU-hours. I was actually listening to a podcast where one of the researchers explained how their best in class model for image classification was trained on over a million images from ImageNet, and it took them around 60,000 GPU-hours to train. And that's great news for us, right? Because it saves us having to rack up an enormous Amazon AWS bill. I still find it incredible that these kinds of models are available for us to use. And on that note, let's jump into the next lesson so I can show you how; the sketch below gives you a quick preview.
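Here's a minimal sketch of what loading a pre-trained image classifier from TensorFlow Hub can look like. This is my own preview, not the exact code from the next lesson; the MobileNetV2 handle is just one example module, and it assumes the tensorflow_hub package is installed:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Example handle for a MobileNetV2 ImageNet classifier on TensorFlow Hub.
# Any image classification module from tfhub.dev works the same way.
MODULE_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/4"

model = tf.keras.Sequential([
    # Someone else already paid the GPU-hours; we just download the weights.
    hub.KerasLayer(MODULE_URL, input_shape=(224, 224, 3)),
])

# Classify a dummy image: values in [0, 1], shape (batch, height, width, 3).
image = tf.random.uniform((1, 224, 224, 3))
logits = model(image)
print(logits.shape)  # (1, 1001): the 1,000 ImageNet classes plus background
```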