Non-blocking Docker logging

Dale Frohman
3 min read · Jan 10, 2021

With the recent AWS Kinesis outages (11/25/2020 and 12/07/2020), you may have experienced an outage when Docker blocked application execution due to log delivery.

Application Logging

It may be a tough decision based on regulation or compliance requirements, but most would agree that keeping the application available is more important than writing logs.

Regardless of the logging library used, it is important that logging does not block the application's threads and that the application can recover from a backlog of log messages. We would rather discard logs than suffer an outage.

Such was the case recently when several applications were using Kinesis streams for logging and the service was unavailable.

What can we do to prevent this in the future?

Docker Log Driver Mode

Containers deployed to ECS use the awslogs Docker log driver to push logs to CloudWatch. This log driver ingests logs from the standard out and standard error of the container. By default, the awslogs driver blocks to guarantee delivery of messages. This blocking also affects writes to standard out and standard error within the application. To prevent the application from blocking, the driver should be configured with "mode" set to "non-blocking".

How to change?

Awslogs log driver options

The awslogs log driver supports the following options in Amazon ECS task definitions. For more information, see the CloudWatch Logs logging driver documentation.

mode

Required: No

Valid values: non-blocking | blocking

Default value: blocking

The delivery mode of log messages from the container to awslogs. For more information, see the Configure logging drivers documentation.

max-buffer-size

Required: No

Default value: 1m

When non-blocking mode is used, the max-buffer-size log option controls the size of the ring buffer used for intermediate message storage.

You set these options in your task definition. You can learn more about how to define this at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html
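As a rough sketch, the logConfiguration block of a container definition with non-blocking mode enabled might look like the following (the log group, region, stream prefix, and the 4m buffer size are placeholder values to adapt to your own environment):

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "my-app-logs",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "my-app",
    "mode": "non-blocking",
    "max-buffer-size": "4m"
  }
}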

Docker Log Driver Buffer

While not required, it is important to consider the buffer and its sizing when setting up non-blocking mode.

When non-blocking mode is enabled, the driver uses an in-memory ring buffer to queue logs before they are sent to CloudWatch (roughly every 5 seconds). If the buffer fills before it can be drained, data that has not yet been ingested is discarded as new logs are written.

WARNING: Docker has a built-in log line buffer size of 16 KB. When non-blocking mode is enabled, log lines greater than 16 KB will be split into multiple lines. This will affect parsing of JSON logs in CloudWatch and Splunk for any log messages longer than 16 KB. If your application requires large single-line log events, blocking mode may be necessary, along with an async logging configuration within the application.

When configuring max-buffer-size, it is important to consider high load during normal operation so that logs are not lost (make sure the buffer is not too small). It is also important to consider the amount of loss acceptable during a CloudWatch, Kinesis, or network outage (make the buffer large enough to cover an outage).

For example, a container that produces 2.5 MB of logs an hour can be configured with a max-buffer-size of 10 MB to survive a 4-hour outage. However, a container that generates 250 MB of logs an hour will need to weigh the trade-off of holding a large in-memory buffer against how long it can tolerate an outage before logs are discarded; a 1 GB buffer size is not feasible for most deployments.

The following formula can be used to help tune the max-buffer-size (4 hours being a target time-frame that may need to be adjusted per application):

rps = requests per sec
lpr = average log size per request (access log + warn logs + error stack trace) (bytes)
lv  = average log volume (bytes/s)
t   = time until buffer exceeded (sec) [target: 4 hrs * 60 * 60 = 14400]
b   = max buffer size (bytes)

rps * lpr = lv
t * lv = b
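To make the formula concrete, here is a worked example based on the 2.5 MB/hour container mentioned above (the split between request rate and log size per request is an assumption for illustration):

rps = 1 request/s
lpr = 730 bytes per request
lv  = 1 * 730 = 730 bytes/s  (roughly 2.5 MB per hour)
t   = 14400 s (4-hour target)
b   = 14400 * 730 = 10,512,000 bytes ≈ 10 MB

which lines up with the 10 MB max-buffer-size suggested earlier.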

I hope this is helpful in preventing future outages when remote logging is not available to your Docker container.

