1 00:00:03,910 --> 00:00:10,300 The Docker Health Check command. It was a new feature added in 1.12, which came out mid-2016, the 2 00:00:10,300 --> 00:00:16,480 same time that SwarmKit and Swarm Mode became available in Docker. It was added really as a part of 3 00:00:16,480 --> 00:00:22,870 that toolkit, but it still works in all the different files like the Dockerfile, the Compose file, 4 00:00:22,870 --> 00:00:27,790 the docker run command uses it, the Stack files support it, the service update and service create commands 5 00:00:27,790 --> 00:00:28,250 support it. 6 00:00:28,260 --> 00:00:29,630 It's everywhere. 7 00:00:29,950 --> 00:00:36,220 I highly recommend that when you're going to production, you do engage in testing options for this health 8 00:00:36,220 --> 00:00:37,090 check command. 9 00:00:37,150 --> 00:00:40,330 It's going to work right out of the box with an exec. 10 00:00:40,390 --> 00:00:45,040 It's going to execute that command inside the container just like if you were running your own exec 11 00:00:45,040 --> 00:00:45,700 command. 12 00:00:45,730 --> 00:00:50,020 So, it's not running it from outside the container; it's running it inside, which means that even in 13 00:00:50,020 --> 00:00:55,360 simple workers that don't have exposed ports, you can run a simple command to validate whether 14 00:00:55,360 --> 00:00:57,780 they're returning good data or whatever. 15 00:00:58,090 --> 00:01:04,300 It's a simple execution of a command, which means it gets a simple return. It expects a 0 or a 1. 16 00:01:04,300 --> 00:01:09,850 In Linux and Windows, you have exit codes from commands and a 0 is a good thing. 17 00:01:09,850 --> 00:01:11,560 It means everything was fine. 18 00:01:11,590 --> 00:01:15,570 Anything other than a 0 is going to be an error in most applications. 19 00:01:15,580 --> 00:01:21,010 But in Docker, we need that application to exit with a 1 specifically. 
We'll show in a minute how you 20 00:01:21,010 --> 00:01:22,160 do that. 21 00:01:22,330 --> 00:01:28,420 There are only three states to a healthcheck in Docker. It starts out with starting. Starting is the 22 00:01:28,420 --> 00:01:32,480 first 30 seconds, by default, where it hasn't run a healthcheck command yet. 23 00:01:32,710 --> 00:01:34,130 Then it's going to run one. 24 00:01:34,180 --> 00:01:39,160 If that returns a 0, it'll change to the healthy state. 25 00:01:39,280 --> 00:01:45,640 It'll take that command and it'll run it every 30 seconds by default again. If it ever receives an unhealthy 26 00:01:45,640 --> 00:01:49,690 return, like an exit 1, then it marks it as an unhealthy container. 27 00:01:49,690 --> 00:01:52,790 We have options for controlling all of this including retries. 28 00:01:52,810 --> 00:01:53,940 We'll see that in a minute. 29 00:01:54,760 --> 00:02:00,010 This is a much better option than we've had in the past because Docker, until now, was just making 30 00:02:00,010 --> 00:02:01,870 sure the application was still running. 31 00:02:01,930 --> 00:02:05,740 It didn't have any insight into whether that application was doing what it was supposed to. 32 00:02:05,920 --> 00:02:10,120 Now we can do that inside the Docker container itself. 33 00:02:10,330 --> 00:02:13,980 But this isn't a replacement for your third party monitoring solution. 34 00:02:13,990 --> 00:02:20,440 This isn't going to give you graphs, or status over time, or any sort of third party tooling that you 35 00:02:20,440 --> 00:02:22,150 would expect out of a monitoring solution. 36 00:02:22,150 --> 00:02:27,780 This is about Docker understanding if the container itself has a basic level of health. 37 00:02:27,880 --> 00:02:36,490 So, with Nginx, it might request the root index file from localhost. A return of 200 or 300 is fine 38 00:02:36,520 --> 00:02:39,720 and gives it an exit code of 0, and it considers it healthy. 
39 00:02:39,910 --> 00:02:43,190 That's not a super advanced monitoring tool. 40 00:02:43,360 --> 00:02:49,630 But if it did return a 404 or 500 error, it would then consider it unhealthy and we can do something about 41 00:02:49,630 --> 00:02:50,620 that. 42 00:02:50,620 --> 00:02:54,270 Where are we going to see this Docker healthcheck in the UI? 43 00:02:54,520 --> 00:02:57,070 The first place is in container ls. 44 00:02:57,250 --> 00:02:59,430 You'll just see it as this new option. 45 00:02:59,430 --> 00:03:04,030 It's in the middle. We'll see in a second where it'll show us one of the three states if the health check 46 00:03:04,030 --> 00:03:06,530 is running, and that's how we actually know that there's a healthcheck. 47 00:03:06,580 --> 00:03:12,370 That's the easiest way, at least, to know. We'll see the history, the last five results of that healthcheck, 48 00:03:12,370 --> 00:03:19,040 show up in the inspect for that container. And we can see some basic trend over time there. 49 00:03:19,150 --> 00:03:23,460 But the docker run command does not take action on an unhealthy container. 50 00:03:23,620 --> 00:03:29,620 Once the healthcheck considers a container unhealthy, docker run is just going to indicate that in the ls 51 00:03:29,620 --> 00:03:32,690 command, and in the inspect, but it's not going to take action. 52 00:03:32,710 --> 00:03:36,050 That's where we expect the Swarm Services to take action. 53 00:03:36,070 --> 00:03:42,670 So the stacks and services will actually replace that container with a new task, on a new host possibly, 54 00:03:42,670 --> 00:03:44,100 depending on the scheduler. 55 00:03:44,410 --> 00:03:50,200 Even in the update command, we see a little extra bonus by using the healthchecks because the updates 56 00:03:50,550 --> 00:03:56,440 will consider the healthcheck as a part of the readiness for that container before it goes and changes 57 00:03:56,440 --> 00:03:57,370 the next one. 
58 00:03:57,370 --> 00:04:02,710 If a container comes up, but it doesn't pass its health check, then the service update won't go to 59 00:04:02,710 --> 00:04:06,750 the next one. Or it'll take action based on the options you give it. 60 00:04:07,470 --> 00:04:10,400 Let's look at a few examples before we go to the command line. 61 00:04:10,410 --> 00:04:12,680 This is one that we're using on docker run. 62 00:04:12,690 --> 00:04:17,670 This allows us to use an existing image that doesn't have a health check in it, and we're adding 63 00:04:17,670 --> 00:04:19,680 the health check in at runtime. 64 00:04:19,710 --> 00:04:26,220 In this case, we're using the Elasticsearch image. You can see the command is a cURL of localhost 65 00:04:26,270 --> 00:04:32,430 9200, which is the port that Elasticsearch is running on inside the container, not the published port, 66 00:04:32,670 --> 00:04:35,380 but inside the container. For Elasticsearch, 67 00:04:35,390 --> 00:04:38,040 there is an actual health URL. 68 00:04:38,070 --> 00:04:39,560 So, we can use that here. 69 00:04:39,570 --> 00:04:43,750 You'll notice the two pipes with the false at the end of that command. 70 00:04:43,890 --> 00:04:45,300 And that's going to be pretty common 71 00:04:45,300 --> 00:04:50,760 if you're using something like cURL or another tool that will send out an error code that's other than 1. 72 00:04:50,810 --> 00:04:56,010 Remember when I mentioned that a while ago? We need it to exit with 1 if there's a problem. Because that's 73 00:04:56,010 --> 00:04:59,240 the one error code that Docker is going to do something about. 74 00:04:59,310 --> 00:05:06,630 We need to make sure that in this case, the shell will always return the exit code of 1 from false 75 00:05:06,640 --> 00:05:10,430 if there's anything coming out of that command other than 0. 76 00:05:10,530 --> 00:05:13,250 It's a nice way to get around that problem. 
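A minimal sketch of why the `|| false` suffix matters, assuming a POSIX shell. The `probe_fails` and `probe_ok` functions are hypothetical stand-ins for a tool like cURL and are not part of Docker. The commented `docker run` line assumes the Elasticsearch `/_cluster/health` endpoint, an image tag chosen only for illustration, and an image that ships curl.

```shell
#!/bin/sh
# Hypothetical stand-ins for a probe command. curl, for example, exits with 7
# when it cannot connect, which is a failure code other than 1.
probe_fails() { return 7; }
probe_ok() { return 0; }

# Run a probe the way a health command would, with "|| false" appended,
# and print the exit code that the caller ends up seeing.
normalized_status() {
    status=0
    ( "$1" || false ) || status=$?
    echo "$status"
}

normalized_status probe_fails   # the 7 is reported as 1
normalized_status probe_ok      # success stays 0

# With Docker available, the same idea applied at runtime might look like:
#   docker run -d \
#     --health-cmd="curl -f localhost:9200/_cluster/health || false" \
#     elasticsearch:2
```

Writing `|| exit 1` instead of `|| false` has the same effect on the exit code.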
77 00:05:13,310 --> 00:05:18,180 It just so happens that cURL will give other potential error codes and we don't want it to 78 00:05:18,180 --> 00:05:18,880 do that. 79 00:05:19,380 --> 00:05:25,290 In the actual Dockerfiles, we can add the same command. The format's a little bit different. But you see 80 00:05:25,290 --> 00:05:31,850 that we have these options here. We have the interval, the timeout, the start period (which is new), and retries. 81 00:05:31,950 --> 00:05:35,670 The interval is what you would think it is. It's, by default, every 30 seconds. 82 00:05:35,730 --> 00:05:41,520 How often it's going to run this health check. The timeout is how long it's going to wait before it errors 83 00:05:41,520 --> 00:05:48,030 out and returns a bad code, if maybe the app is slow. The start period is a new feature that allows us now 84 00:05:48,060 --> 00:05:56,280 in 17.09 and newer, to give a longer wait period than the first 30-second interval. Before, it 85 00:05:56,280 --> 00:05:59,810 would always just wait the interval time before it started the healthcheck. 86 00:05:59,820 --> 00:06:04,710 But maybe you have a Java app, or database, or something that takes a lot longer to start. 87 00:06:04,710 --> 00:06:06,400 Maybe it takes five minutes. 88 00:06:06,540 --> 00:06:11,880 You could add that start period in there. It'll still do healthchecks. But what it will do is it won't 89 00:06:11,880 --> 00:06:17,010 alarm on an unhealthy check until that time has elapsed. 90 00:06:17,010 --> 00:06:22,750 So if you set two minutes in there, even though it's health checking every 30 seconds, it's going to only 91 00:06:22,750 --> 00:06:28,320 consider it unhealthy once it's past that two minute mark. The last one there, retries, means that 92 00:06:28,320 --> 00:06:33,650 we will try this health check x number of times before we consider it unhealthy. 
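The interplay of interval, start period, and retries can be sketched as a small simulation. This is a simplified model of the behavior described above, assuming a POSIX shell; it is not Docker's own implementation, and the `simulate_health` function and its argument order are made up for illustration.

```shell
#!/bin/sh
# Simplified model of the health states; not Docker's own code.
# Arguments: retries, start period (seconds), interval (seconds),
# then one probe exit code per check, in the order the checks run.
simulate_health() {
    retries=$1; start_period=$2; interval=$3
    shift 3
    state=starting
    streak=0      # consecutive failures that count toward "retries"
    elapsed=0
    for code in "$@"; do
        elapsed=$((elapsed + interval))   # one check fires per interval
        if [ "$code" -eq 0 ]; then
            state=healthy
            streak=0
        elif [ "$state" = starting ] && [ "$elapsed" -le "$start_period" ]; then
            :   # failure inside the start period: not counted
        else
            streak=$((streak + 1))
            if [ "$streak" -ge "$retries" ]; then
                state=unhealthy
            fi
        fi
    done
    echo "$state"
}

simulate_health 3 0 30 0 1 1       # two failures after a success
simulate_health 3 0 30 0 1 1 1     # three consecutive failures
simulate_health 3 120 30 1 1 1 1   # failures inside a two minute start period
```

In this model the first call ends in `healthy`, the second in `unhealthy`, and the third stays in `starting` because every failure falls inside the start period.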
93 00:06:33,720 --> 00:06:38,940 That gives maybe a potentially unstable app a chance to come back with a healthy and recover on 94 00:06:38,940 --> 00:06:42,420 its own before we consider this a truly unhealthy container. 95 00:06:42,510 --> 00:06:45,680 The basic healthcheck command you would use in a Dockerfile is called HEALTHCHECK, 96 00:06:45,720 --> 00:06:51,510 all capital letters there. The same format exists where if we're just doing a simple cURL of the localhost 97 00:06:51,630 --> 00:06:55,010 because maybe it's a PHP app or something. We can do that. 98 00:06:55,200 --> 00:06:59,760 This is how you would add all those options into a Dockerfile. You can see how I add the 99 00:06:59,770 --> 00:07:04,130 timeout, interval, and the retries before the command itself. 100 00:07:04,290 --> 00:07:09,450 The first one there for the basic command, notice I don't have to put in a CMD if I'm just giving it the 101 00:07:09,450 --> 00:07:14,230 command to run. But if I want to show options, if I want to give it custom options out of the box with 102 00:07:14,230 --> 00:07:19,190 the timeout and so on, then I have to specify which one is the command. 103 00:07:19,200 --> 00:07:20,550 Now these aren't two different lines. 104 00:07:20,550 --> 00:07:23,820 Notice the backslash on the end of the first line there. 105 00:07:23,940 --> 00:07:25,530 So don't get that confused. 106 00:07:26,290 --> 00:07:31,270 Here we have a simple example of what it might be like if you had a static application running inside 107 00:07:31,270 --> 00:07:32,450 an Nginx server. 108 00:07:32,500 --> 00:07:37,360 You could set the interval and the timeout from your Dockerfile, and you would just have it simply 109 00:07:37,360 --> 00:07:39,830 do a cURL command on the localhost. 110 00:07:39,850 --> 00:07:45,910 If it returns a 200 or 300, it considers that fine. If it returns a 400, or 500, or something else, it considers 111 00:07:45,910 --> 00:07:46,660 that an error. 
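A hedged sketch of what the Nginx Dockerfile described above might contain, printed from a shell function so nothing is built. It uses the `HEALTHCHECK [OPTIONS] CMD command` form documented in the Dockerfile reference. The function name `print_nginx_dockerfile`, the untagged `nginx` base image, and the presence of curl in that image are assumptions.

```shell
#!/bin/sh
# Print an example Dockerfile for a static site on Nginx (illustrative only).
print_nginx_dockerfile() {
cat <<'EOF'
FROM nginx
# Options come before CMD. The trailing backslash continues the same
# instruction onto the next line; this is one HEALTHCHECK, not two.
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost/ || exit 1
EOF
}

print_nginx_dockerfile
# To try it with Docker installed (paths are hypothetical):
#   print_nginx_dockerfile > Dockerfile && docker build -t healthcheck-demo .
```

The `-f` flag makes curl exit non-zero on HTTP error responses such as 404 or 500, and `|| exit 1` turns any such failure into the exit code 1.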
112 00:07:46,660 --> 00:07:51,050 You'll notice here that I have an exit 1, which is the same thing as a false. 113 00:07:51,100 --> 00:07:55,540 I did that just to show you that certain examples on the Internet will have a false. Certain examples 114 00:07:55,540 --> 00:07:56,680 will have an exit 1. 115 00:07:56,680 --> 00:07:58,090 They both do the same thing. 116 00:07:58,390 --> 00:08:00,500 Here's a little bit more advanced example. 117 00:08:00,580 --> 00:08:04,240 In this case, we're using a PHP app that's combined with Nginx. 118 00:08:04,270 --> 00:08:12,070 What I've done is, in the resources, you'll find a link to this PHP example. I've added in a custom 119 00:08:12,160 --> 00:08:18,360 Nginx config file that uses Nginx and PHP-FPM status URLs. 120 00:08:18,400 --> 00:08:24,640 Both of those applications have their own status page and sort of a healthcheck ping URL. 121 00:08:24,910 --> 00:08:29,950 You can use those in your apps if you're using PHP or Nginx. There are two different URLs, 122 00:08:30,100 --> 00:08:32,910 but you can use both of them inside the same healthcheck. 123 00:08:32,950 --> 00:08:37,900 In this case, we're using just one of them, and we're throwing in the localhost/ping, which is 124 00:08:37,930 --> 00:08:39,200 actually a PHP-FPM 125 00:08:39,200 --> 00:08:48,050 status command, but you have to enable that inside your PHP-FPM. Again, in the resources of this lecture, 126 00:08:48,070 --> 00:08:51,990 there's a link to PHP Docker Good Defaults. 127 00:08:52,090 --> 00:08:56,450 You can go check that out on GitHub, where I've shown this example in a little bit more detail. 128 00:08:56,470 --> 00:09:01,740 Next we have a Postgres example, so in the Dockerfile I can use a different command. 129 00:09:01,750 --> 00:09:08,140 Here we have a Postgres application where in the healthcheck command, I'm using a command of pg_isready. 130 00:09:08,140 --> 00:09:13,810 Now, with different apps, there are different tools. 
Postgres comes with a built-in tool 131 00:09:13,810 --> 00:09:17,650 that's a very simple test of a connection to a Postgres server. 132 00:09:17,650 --> 00:09:22,330 It doesn't validate that you have good data, or that your database is mounted properly. It's simply 133 00:09:22,330 --> 00:09:26,430 going to say, 'Does this database server allow connections? Yes or no?' 134 00:09:26,440 --> 00:09:28,970 That's a neat one that you can do out of the box. 135 00:09:29,640 --> 00:09:32,710 Here's what it would look like in a Compose/stack file. 136 00:09:32,790 --> 00:09:33,990 Very similar. 137 00:09:33,990 --> 00:09:37,830 You'll notice that the start period down there requires a different version. 138 00:09:37,830 --> 00:09:44,370 Since the healthcheck command came out in 1.12, it was actually supported in version 2.1 of the Compose 139 00:09:44,370 --> 00:09:45,090 file. 140 00:09:45,150 --> 00:09:50,790 But, if you're going to use the start period, that means you have to update your Compose file to version 141 00:09:50,870 --> 00:09:56,070 3.4 in order to support that. Because the start period came out over a year after the healthcheck 142 00:09:56,070 --> 00:09:58,690 command did. 143 00:09:58,700 --> 00:10:01,130 Let's start out with some simple run commands. 144 00:10:01,250 --> 00:10:06,050 What we're gonna do here is we're going to start a Postgres database server without the healthcheck 145 00:10:06,080 --> 00:10:08,040 because by default, it doesn't come with one. 146 00:10:08,210 --> 00:10:13,130 Then we're going to run it again with a manual healthcheck command that we'll add at the command 147 00:10:13,130 --> 00:10:18,730 line, and we'll see the difference. 148 00:10:18,730 --> 00:10:24,500 Here, we're just going to call the first one p1. We'll run it detached from the official Postgres 149 00:10:24,520 --> 00:10:25,280 image. 
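For the Compose/stack file form mentioned above, a sketch with a Postgres service could look like the following, printed from a shell function so it runs without Docker. The service name `db`, the timing values, the function name `print_compose_file`, and the `POSTGRES_PASSWORD` value are assumptions for illustration; recent official Postgres images will not start without a password setting.

```shell
#!/bin/sh
# Print an example Compose/stack file with a healthcheck (illustrative only).
print_compose_file() {
cat <<'EOF'
version: "3.4"   # start_period needs file format 3.4 or newer
services:
  db:
    image: postgres
    environment:
      POSTGRES_PASSWORD: example   # placeholder value, change before real use
    healthcheck:
      # CMD-SHELL runs the string through a shell, so "|| exit 1" works here
      test: ["CMD-SHELL", "pg_isready -U postgres || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 2m
EOF
}

print_compose_file
```

The `healthcheck` keys mirror the `--health-*` flags of `docker run`, with underscores instead of dashes.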
150 00:10:25,420 --> 00:10:31,220 If I do a docker container ls, you'll see that there's nothing indicating a healthcheck here. 151 00:10:31,420 --> 00:10:40,490 If we do that same command again, and call it p2 this time, we're going to add a health command. 152 00:10:44,140 --> 00:10:48,280 This time, we're going to use pg_isready, which we talked about earlier, 153 00:10:49,850 --> 00:10:54,710 to test that the connections are available on this Postgres server. We're going to tell it that the 154 00:10:54,710 --> 00:10:59,600 user we need is the postgres user. We don't actually need to give it a password. It's not going to try 155 00:10:59,600 --> 00:11:04,680 to log in. It's just going to try to validate. We'll use the Postgres image. 156 00:11:06,230 --> 00:11:14,620 Now if we do a docker container ls, and I zoom out a little bit, you'll see that it says, 'Up 4 seconds 157 00:11:14,620 --> 00:11:16,050 (health: starting).' 158 00:11:16,150 --> 00:11:22,510 Now we get this additional feature in the status of our ls command. It will stay in the starting 159 00:11:22,510 --> 00:11:23,940 state for the default 160 00:11:23,940 --> 00:11:27,450 30 seconds until it runs the healthcheck command for the first time. 161 00:11:27,760 --> 00:11:34,760 Now that we've waited over 30 seconds, you'll see that it's changed to a status of healthy. 162 00:11:34,810 --> 00:11:44,240 If we do a docker container inspect on that p2, we'll see at the very top of that that we have 163 00:11:44,240 --> 00:11:48,780 this new health status section. In this case, it's only had time to run twice. 164 00:11:49,070 --> 00:11:55,120 You can see the output there, that it's showing it's accepting connections. 165 00:11:55,140 --> 00:12:00,750 All right. Let's do some service create commands with that same database and that same test healthcheck. 
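The run, ls, and inspect steps from this demo might be scripted roughly as follows. This is a sketch that needs a local Docker engine, so the functions are only defined, not called. The function names and the `POSTGRES_PASSWORD` value are assumptions; recent official Postgres images require a password setting, which the recording does not show.

```shell
#!/bin/sh
# Start Postgres with a health command added at runtime.
# pg_isready uses several non-zero exit codes, so "|| exit 1" normalizes them.
start_pg_with_healthcheck() {
    docker container run -d --name p2 \
        -e POSTGRES_PASSWORD=example \
        --health-cmd="pg_isready -U postgres || exit 1" \
        postgres
}

# Show the health state in the STATUS column, then the stored health record
# (current status plus the log of recent check results) as JSON.
show_pg_health() {
    docker container ls --filter name=p2
    docker container inspect --format '{{json .State.Health}}' p2
}

# With Docker running, call them in order, waiting past the first interval:
#   start_pg_with_healthcheck && sleep 35 && show_pg_health
```

The `--format` template limits the inspect output to the health section instead of the full container description.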
166 00:12:00,760 --> 00:12:05,680 What we'll see here when we do this is that there are three different states that a service goes through 167 00:12:05,680 --> 00:12:06,530 on starting up. 168 00:12:06,540 --> 00:12:11,800 It's preparing, which usually means it's downloading the image. It's starting, which means it's executing 169 00:12:11,890 --> 00:12:14,850 the container and bringing it up. Then it's running. 170 00:12:14,980 --> 00:12:19,600 Without the healthcheck command, the starting and running are very quick. They're almost instantaneous. 171 00:12:19,720 --> 00:12:25,170 We'll see that here with a docker service create --name p1 postgres. 172 00:12:25,480 --> 00:12:30,430 Once it's done preparing by downloading the image, you'll see that it goes immediately from starting 173 00:12:30,430 --> 00:12:35,590 to running, because there is no healthcheck. It doesn't have anything else to do other than start the 174 00:12:35,590 --> 00:12:37,610 container and say, 'Yep. The binary is running.' 175 00:12:37,750 --> 00:12:43,600 But if we do that same command, docker service create, and call it p2 like before, and give it that same 176 00:12:43,600 --> 00:12:44,240 health command, 177 00:12:49,110 --> 00:12:52,470 we start this service with the healthcheck command built in. 178 00:12:52,590 --> 00:12:58,050 What we'll see is that it'll go from preparing to starting, and it will sit at the starting state for 179 00:12:58,050 --> 00:13:01,780 the default 30 seconds until the first healthcheck runs. 180 00:13:01,890 --> 00:13:08,190 This is now the Docker service expecting a healthy state before it considers this service fully 181 00:13:08,190 --> 00:13:14,280 running. After the 30 seconds is over, it'll shift to the running state. Then we get the last little 182 00:13:14,280 --> 00:13:17,670 verify there, just to make sure that it's considered stable, and then we're done. 
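The service version of the same test might be written as below. It is a sketch that assumes a node already in Swarm mode, so the function is only defined here. The function name and the `POSTGRES_PASSWORD` value are assumptions, as in the earlier container example.

```shell
#!/bin/sh
# Create a Swarm service whose tasks must pass the health command before
# the service is reported as running.
create_pg_service_with_healthcheck() {
    docker service create --name p2 \
        -e POSTGRES_PASSWORD=example \
        --health-cmd="pg_isready -U postgres || exit 1" \
        postgres
}

# On a Swarm manager node:
#   create_pg_service_with_healthcheck
```

With default settings the task should sit in the starting state until the first check passes, which matches the roughly 30 second pause seen in the demo.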
183 00:13:17,670 --> 00:13:21,560 You can already see, out of the box, that with services, as well as service updates, 184 00:13:21,570 --> 00:13:26,950 we're going to get this extra bonus of health awareness if we use these healthcheck commands whenever we can.