Microservice architectures are gaining traction, and many systems are fanning out into sets of applications distributed across multiple servers. If you don't keep your logging in check, it becomes harder and more tedious to compile a full trace of your complete system's behavior. In fact, one of the arguments against distributed systems is that they are harder to debug. I intend to show that with a little effort spent on logging and infrastructure, we can bust that myth. Get ready to level up your logging game!
Everything we will cover here assumes you already have a grip on standard application logging with a basic setup: your logs end up in files on disk, you use a standard logging framework like slf4j and logback, and you know what to log and how to log it. Timestamps, thread names and class names are part of your logger pattern.
If you do not have one already, you need to set up a log aggregator. This is essential if you want to reap any benefits from your logs in a distributed system. Luckily, it is pretty easy to get going with a basic setup. A personal favorite is the ELK stack, which gives you everything you need: Elasticsearch to store the logs, Logstash to collect and manage them, and Kibana to view the logs and pull out reports. The ELK stack is widely used and works well. If you have the budget for a license, Splunk is a worthy alternative. Other free alternatives include Apache Flume, which stores data in HDFS, or rsyslog in conjunction with the datastore of your choice. Once you have a log aggregator running with a nice UI on top to browse your logs, it becomes much easier to correlate and filter them. This means you can enable a lot more log information in all your applications! Anyone who does not need that specific information can filter it out at query time with a simple log-level filter.
When debugging a complex issue in a multi-app environment, you need to be able to trace the control flow across applications. To make this a lot easier, follow these simple rules.
- Log the `username` of the user initiating the first request. Log the `system-name` if the initiator is a system, like a scheduled service.
- Log the `user-agent` of the application calling your application.
- Log a `traceId` throughout the call chain, to be able to quickly search across all applications for every log trace for a given user action.
- Log a `spanId` in your request scope in every application, to be able to distinguish separate requests to the same app.
These rules imply a couple of important things. First, you have to propagate the `username`, `user-agent` and `traceId` whenever your control flow jumps from one application to the next. If your applications communicate using REST, you can transmit these bits of information as HTTP headers.
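As a sketch of the propagation step, here is how an outgoing call might carry the metadata along using the JDK's `java.net.http` client. The `X-Trace-Id` and `X-Username` header names are hypothetical; any convention works as long as every application in the chain uses the same one.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class TracePropagation {
    // Header names are a project convention, not a standard -- pick your own and be consistent.
    static final String TRACE_ID_HEADER = "X-Trace-Id";
    static final String USERNAME_HEADER = "X-Username";
    static final String USER_AGENT_HEADER = "User-Agent";

    /** Builds an outgoing request that carries the tracing metadata of the current request. */
    static HttpRequest outgoingRequest(String url, String traceId, String username, String callingApp) {
        return HttpRequest.newBuilder(URI.create(url))
                .header(TRACE_ID_HEADER, traceId)
                .header(USERNAME_HEADER, username)
                .header(USER_AGENT_HEADER, callingApp)
                .GET()
                .build();
    }
}
```

On the receiving side, the first thing the application does is read those same headers back out and put them into its own logging context.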
The `traceId` and `spanId` can be randomly generated short strings or numbers. A new `traceId` should be generated when a request arrives without one set, and then propagated and re-used for logging in all cascading requests. A new `spanId` is generated every time a request hits an application. For an explanation of these terms, see the Dapper section below.
To make sure every bit of meta-information is included in every log statement in your application, a good solution is to put it in your log pattern and fetch the values from MDC.
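In logback, the `%X{key}` conversion word pulls a value from MDC into the pattern. A minimal appender configuration might look like the following; the MDC key names `traceId`, `spanId` and `username` are a convention you choose yourself, matched by wherever you call `MDC.put`:

```xml
<appender name="FILE" class="ch.qos.logback.core.FileAppender">
  <file>app.log</file>
  <encoder>
    <!-- %X{...} reads from MDC; a key renders as empty if it was never set -->
    <pattern>%d{ISO8601} [%thread] %-5level %logger{36} traceId=%X{traceId} spanId=%X{spanId} user=%X{username} - %msg%n</pattern>
  </encoder>
</appender>
```

Populate the values with `MDC.put("traceId", ...)` as early as possible in the request (typically in a servlet filter), and call `MDC.clear()` when the request completes, since MDC is thread-local and threads are pooled.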
Log a line as soon as a request enters your application, and as soon as you transfer the control flow to another component. This typically means you log a "hello" and a "goodbye" message at the entry and exit points of your REST or SOAP endpoints. Also log a "calling" message whenever you initiate a request to another service, and a "received" message when the other service responds. This will enable you to identify which external service calls are the performance drag in your system with a quick glance at the timestamps of your logs. You can also correlate the timestamps from the calling service and the callee to identify network latency issues.
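A minimal sketch of the "calling"/"received" pair, using `java.util.logging` to keep the example self-contained (in a real app you would log through slf4j; the service name is just whatever you call the downstream system):

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

class CallLogging {
    private static final Logger LOG = Logger.getLogger(CallLogging.class.getName());

    /** Wraps a call to another service in "calling"/"received" log lines with the elapsed time. */
    static <T> T loggedCall(String serviceName, Supplier<T> call) {
        LOG.info("calling " + serviceName);
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            LOG.info("received from " + serviceName + " in " + elapsedMs + " ms");
        }
    }
}
```

With the traceId in your log pattern, the gap between the "calling" timestamp here and the "hello" timestamp in the callee's log is your network (plus queueing) latency.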
With a library like Metrics you will get even easier-to-read output about your application's runtime performance. Since it is ridiculously easy to get started with, there's almost no reason not to. If it looks like a daunting task to apply meters and timers to every application, start with the services that have the highest risk of running into performance problems or becoming a bottleneck. Log everything to your log files, or to another specialized aggregator if you intend to use your metrics for fancier purposes. A good set of historic, logged metrics is paramount to retroactively debugging a performance problem in your production environment. Investing a little time here makes it possible to quickly pinpoint bottlenecks, and also to set up alarms whenever a certain service suffers from low throughput or high latency.
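Metrics' timers track counts, rates and latency percentiles for you out of the box; to make the core idea concrete without pulling in the library, here is a bare-bones, thread-safe timer in plain Java:

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

/** A bare-bones request timer -- a stand-in for what Metrics' Timer does, minus percentiles and reporters. */
class SimpleTimer {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    private final LongAccumulator maxNanos = new LongAccumulator(Math::max, 0L);

    /** Record one observed duration, e.g. the elapsed time of a service call. */
    void record(long nanos) {
        count.increment();
        totalNanos.add(nanos);
        maxNanos.accumulate(nanos);
    }

    long count() { return count.sum(); }

    long meanMillis() {
        long c = count.sum();
        return c == 0 ? 0 : totalNanos.sum() / c / 1_000_000;
    }

    long maxMillis() { return maxNanos.get() / 1_000_000; }
}
```

Periodically logging `count`, `mean` and `max` per service is enough to spot a degrading dependency; the real library adds sliding-window percentiles, which matter once mean and max start hiding tail latency.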
Consider logging in a structured log format. This makes it a lot easier to parse your logs when answering complex queries through your log aggregator. That said, there are good arguments as to why using a structured log format might be a waste of time. My gut feeling is that it's a cheap investment in a new system, but I'd be reluctant to spend much time re-working old log statements. Migrating a system to a structured log format is probably worth the effort if you find yourself repeatedly constructing complex queries to parse log messages and correlate certain data points. Keep in mind that humans will still read your logs, too. Ideally you want your logs to be both easily read and easily parsed.
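To make "easily parsed" concrete, here is a naive single-line JSON renderer for a log event. A real setup would use a JSON encoder for logback or a proper JSON library rather than hand-rolled escaping; this only illustrates the shape of the output.

```java
import java.util.Map;

class StructuredLog {
    /** Renders a message plus context fields as one line of JSON (naive escaping, for illustration only). */
    static String json(String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{\"message\":\"").append(escape(message)).append('"');
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(",\"").append(escape(e.getKey())).append("\":\"").append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A line like `{"message":"user logged in","traceId":"abc123"}` can be indexed field-by-field by your aggregator, so queries become filters on fields instead of regexes over free text.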
Consider just-in-time logging. This pattern is a bit dated, but it can be helpful if you are struggling with logging I/O or log size quotas that you cannot work around. If you have already implemented the previously mentioned measures, like log aggregation and a trace/span id for all requests, most of the listed benefits of JIT logging have already been reaped.
Consider using an implementation of Google's Dapper, such as Twitter's Zipkin. This is basically a toolset that helps you do all of the above, through a few well-defined interfaces and some utility aggregator daemons. Zipkin leverages adaptive sampling, failover and other techniques required to scale really well (far better than most normal-sized projects need). It also gives you some pretty nifty graphs, a powerful query language and good monitoring capabilities. However, for most purposes a log aggregator like the ELK stack and a basic trace-id/metrics logging setup is good enough.
There is no need to worry about traceability or debuggability in your multi-application system if you have a plan for how you produce and use your logs. Getting everything up and running in a log aggregator, with a little bit of metadata added to each log statement, should get you a long way. Depending on your needs, you can tweak your logging to solve any problem in a distributed system almost as easily as you could in a single-node monolith.