What Can Distributed Systems Teach Us About Concurrent Coding?
Three important questions plagued the asker of this Stack Exchange question on designing concurrent systems in 2010:
- How do you figure out what can be made concurrent vs. what has to be sequential?
- How do you reproduce error conditions and view what is happening as the [concurrent] application executes?
- How do you visualize the interactions between the different concurrent parts of the application?
What is striking is that, seven years later, and at least for systems that are truly distributed in the sense of spanning multiple machines, these problems have been mostly, if not fully, solved. As I read the question, I wondered: can the comparatively less byzantine failures of ordinary concurrent code (say, a web server) be tackled in a similar way?
In this post, I would like to discuss these particular solutions and how they can be applied to writing resilient concurrent code.
Why Distributed Systems Solutions Are Good Candidates for Simple Concurrent Applications
First, some background.
Concurrent code involves the interleaved execution of two or more tasks. That is it.
In practice, this is not so straightforward:
- Concurrent code can be built on kernel-level or userspace-level abstractions: Linux pthreads, green threads, entire processes, or even simple coroutines. In other words, the unit of concurrency varies by implementation.
- Concurrent code often carries some sort of coordination constraint that requires operations to be carried out in a certain order (such as acquiring a mutex or waiting on a semaphore).
- Concurrent code is usually subject to the scheduling and out-of-memory whims of the underlying operating system, may depend on external resources, and cannot be relied upon to produce consistent, synchronised logging.
- Finally, as a side note, very few concurrent toolkits actually offer pre-emptive multitasking to the user. Most favour cooperative multitasking, meaning that starvation (and even deadlock) can easily occur if any individual unit of execution is long-running.
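The cooperative-multitasking hazard above is easy to demonstrate. Here is a minimal sketch using Python's asyncio; the `hog` and `heartbeat` names are illustrative, with `time.sleep` standing in for any long-running synchronous section:

```python
import asyncio
import time

async def hog():
    # No await inside: under cooperative multitasking the event loop
    # cannot switch away while this section runs.
    time.sleep(0.2)  # stands in for a CPU-bound or blocking operation

async def heartbeat(ticks):
    # A well-behaved task that yields control at every iteration.
    for _ in range(4):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)

async def main():
    ticks = []
    await asyncio.gather(heartbeat(ticks), hog())
    return ticks

ticks = asyncio.run(main())
gaps = [b - a for a, b in zip(ticks, ticks[1:])]
print(max(gaps))  # one gap is ~0.2s: the hog starved the heartbeat
```

The gap between heartbeats balloons to the full length of the hog's blocking section, because the scheduler only regains control at an `await` point.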
Comparing Concurrent Applications to Distributed Systems
Concurrent applications need not be distributed, though it is often the case that distributed systems are inherently concurrent.
But the above list makes it clear that they often share the same pitfalls: deadlocks, race conditions, Byzantine failures, runtime differences and so on affect any system that involves asynchronous coordination.
In addition, the conceptual abstractions map one-to-one from services to execution contexts. The distinction is practically invisible under multiprocessing, where each process has its own virtual address space and thus may well be a service.
In fact, concurrent code differs from a distributed system in only one major aspect:
- Concurrent code always has access to a central clock that synchronises actions. This is in contrast to a distributed system, where services hosted on different machines cannot assume a central clock server.
This isomorphism, together with the fact that concurrent code makes only one assumption stronger than a distributed system's (a central clock), gives good reason to believe that concurrent code can benefit from the same solutions.
Answering Question 3: Gaining Visibility into Concurrent Systems
Distributed tracing is the process of gathering timing data for individual concurrent parts of a distributed system.
Distributed tracing works as follows: spans, each with a unique identifier, track the lifecycle of every unit of work within a task (called a trace) handed off to a system.
Spans may nest: a child span created within a parent, for instance, inherits the parent's span identifier while carrying its own. Every span within a trace is identified by the combination of its span ID and the trace ID.
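The span/trace structure described above fits in a few lines of code. The following toy tracer is a sketch of the idea, not any real system's API; the `finished_spans` list stands in for the reporting server:

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # stands in for the time-series server

class Tracer:
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self._stack = []  # the chain of currently open spans

    @contextmanager
    def span(self, name):
        # Each span carries its own ID plus its parent's, tying the
        # whole tree back to a single trace ID.
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex,
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "name": name,
            "start": time.monotonic(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["end"] = time.monotonic()
            self._stack.pop()
            finished_spans.append(record)  # reported only when the span ends

tracer = Tracer()
with tracer.span("handle_request"):
    with tracer.span("db_query"):
        pass

print([(s["name"], s["parent_id"] is None) for s in finished_spans])
```

Note that `db_query` is reported before `handle_request`: a span is emitted only when it closes, a property that matters later when we discuss deadlocks.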
At the end of a trace, this aggregated information is reported back to a simple time-series server hosted by the developer.
Visualised, concurrent executions appear interleaved, allowing you to see what is running in parallel and what is not, with child spans grouped under their parents.
Pros and Cons of Incorporating Distributed Tracing into Concurrent Code
The benefits of a distributed tracing system are as follows:
- You can visualise and time the structure of a concurrent application from start to finish in graphical format, thus providing easy visibility.
- Distributed tracing is useful for profiling, which matters both for analysing bottlenecks and as a valuable SLA metric.
- It lets you identify unusual variations across different 'runs' of your application. Distributed systems engineers care more about percentiles than averages, and that paradigm has merit in testing concurrent code too.
- It lets you add contextual information to the reporting of any task. For example, you could report the database table being read in a call, rather than relying on an external service to report those metrics directly.
- It reports errors. A concurrent service that errors out is still a valid span context, and can include stack-trace information specific to the span in the final UI. Most application code doesn't natively wrap itself in error handlers, so you'll have to add your own try-catch and log the error into the span there. (This strategy beats simply publishing to logs, because it groups together the events that led up to the crash.)
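The error-reporting point deserves a concrete shape. A minimal sketch of the wrapper idea, with `run_in_span` as a hypothetical helper that supplies the try-catch the application code lacks and keeps the stack trace attached to the span record:

```python
import time
import traceback

def run_in_span(name, fn, reported):
    # Application code rarely wraps itself in error handlers, so the
    # span wrapper supplies the try/except and attaches the stack
    # trace to the span record, keeping the error with its context.
    span = {"name": name, "start": time.monotonic(), "error": None}
    try:
        return fn()
    except Exception:
        span["error"] = traceback.format_exc()
        raise  # the caller still sees the failure
    finally:
        span["end"] = time.monotonic()
        reported.append(span)  # errored spans are still valid spans

reported = []
try:
    run_in_span("flaky_task", lambda: 1 / 0, reported)
except ZeroDivisionError:
    pass

print(reported[0]["error"] is not None)  # → True
```

The span is reported whether or not the task succeeded, so the trace UI can group the failure with every sibling event that led up to it.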
The drawbacks of many commercial distributed tracing systems are as follows:
- Not all distributed tracing systems are submitter-preserving or contention-preserving. In other words, they don't always reveal who shares a contended resource such as a lock, though they can track who made each request, letting you see where race conditions might occur.
- Distributed tracing systems report asynchronously, and only at the end of a span. A span that has begun but is blocked indefinitely will never appear in your UI, making such systems fairly useless for identifying deadlocks.
- Storage and hosting are left up to you, although companies such as LightStep appear to be moving into this space with their own solutions.
- Distributed tracing can be heavy on your system, though I have no benchmarks to support this, only the original Google Dapper paper. I recommend using it liberally in testing and sparingly on production traffic.
Answering Question 2: Reproducing Error Conditions In Concurrent Applications
A quick and clear example of this sort of solution is Sentry, which is used by many production-ready distributed services worldwide.
The scope of tools such as Sentry is much smaller than that of distributed tracing, though they provide comparable visibility and follow many of the same ideas.
Error tracking software is designed to keep track of the individual events that led up to any particular failure by means of breadcrumbs. You can effectively get reproduction steps for an error this way, including goodies like the environment in which the failure occurred.
Error tracking software reports specifically in the event of unexpected failure. You still need to write your own try-catch and report there, but the net benefit is that you only receive information about errors, rather than the full history of successful runs.
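The breadcrumb mechanism is simple enough to sketch by hand. This toy version is illustrative only (Sentry's real SDK does all of this for you, and the names here are mine): a bounded trail of recent events that rides along silently and is attached to a report only when something fails.

```python
from collections import deque

class Breadcrumbs:
    def __init__(self, limit=50):
        # A bounded trail: old crumbs fall off, recent ones survive.
        self.trail = deque(maxlen=limit)

    def record(self, category, message):
        self.trail.append((category, message))

    def capture(self, exc):
        # Only failures produce a report; successful runs cost nothing.
        return {"error": repr(exc), "breadcrumbs": list(self.trail)}

crumbs = Breadcrumbs()
crumbs.record("db", "opened connection")
crumbs.record("cache", "miss for key 'user:42'")
try:
    raise RuntimeError("upstream unavailable")
except RuntimeError as e:
    report = crumbs.capture(e)

print(len(report["breadcrumbs"]))  # → 2
```

The payoff is exactly the reproduction-steps property described above: the report carries the sequence of events that preceded the failure, not just the failure itself.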
I won’t spend much time on this here since there is very little to say that hasn’t been extensively covered by documentation, and because many of the chief benefits I associate with these tools have already been addressed in talking about distributed tracing.
Answering Question 1: Writing Better Concurrent Code
I wrote above that the chief differentiating feature between concurrent applications and distributed systems is simply the stronger guarantee of consistent time.
Most readers familiar with distributed systems understand, however, that the chief characteristic of a distributed system is really the promise of failure. Failure, and our inability to guarantee against it, is what makes problems such as the Two Generals' Problem unsolvable. Distributed systems cannot always be available.
On the flip side, what makes regular concurrent code hard is often the requirement for consistency rather than availability. This is not to say failures cannot happen in concurrent code too, however:
- Shared resources can become inaccessible. A rolling upgrade wipes an important file, for instance. The concurrent code then waits - and waits - and waits… or simply calls it a day and quits.
- The underlying operating system may kill a process without warning, especially if system resource limits are being breached.
Issues like these make it hard to give the classic answer of 'dependent components should be in serial': an architecture that assumes upstream services will always be up is simply waiting for errors to reach the end user.
My advice then is simple:
Anticipate error. Anticipate bad application states. Come up with ways for services to know something is wrong upstream, and react accordingly.
A serial component in a concurrent application should indeed be one with upstream and downstream dependencies, but it should also be a part that can do any of the following:
- Instantiate timeouts on resources.
- Instantiate a way to propagate errors downstream without affecting users.
- Invoke a watcher process to try to fix things.
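The first two points above can be sketched together. A minimal example under assumed names (`slow_upstream` stands in for any wedged dependency), using Python's `concurrent.futures` to put a timeout on a resource and pass a sentinel downstream instead of hanging:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_upstream():
    # Stands in for a wedged shared resource, e.g. a file a rolling
    # upgrade wiped out from under us.
    time.sleep(0.5)
    return "data"

def fetch_with_timeout(timeout=0.05):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_upstream)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            # Propagate a sentinel downstream instead of waiting
            # forever; a watcher process could be notified here too.
            return None

result = fetch_with_timeout()
print(result)  # → None
```

Downstream consumers check for the sentinel and degrade gracefully, so an upstream stall never turns into an end-user hang.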
Truly concurrent components should be able to exist independently of each other. They could be launched in parallel and never know of each other's existence. All they should have to worry about is data flow: whom they read from, and whom they write to.
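This data-flow-only contract is what queues buy you. A sketch with Python's standard `queue` and `threading` modules, where the `doubler` stage knows only its inbox and outbox, nothing about who sits on either side:

```python
import threading
import queue

def doubler(inbox, outbox):
    # The stage's entire world: read from inbox, write to outbox.
    while True:
        item = inbox.get()
        if item is None:       # sentinel: upstream is done
            outbox.put(None)   # pass the signal along
            break
        outbox.put(item * 2)

source, sink = queue.Queue(), queue.Queue()
worker = threading.Thread(target=doubler, args=(source, sink))
worker.start()

for n in [1, 2, 3]:
    source.put(n)
source.put(None)
worker.join()

results = []
while True:
    item = sink.get()
    if item is None:
        break
    results.append(item)
print(results)  # → [2, 4, 6]
```

Swapping in a second worker, or moving the stage to another process or machine, changes nothing about the contract: the queues are the only coupling.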
Case Study: Zookeeper
An example of this sort of resilience comes from concurrent systems that use the silver bullet of Zookeeper to coordinate between nodes in a cluster.
Zookeeper provides simple notification-based primitives in a filesystem-like API that can be used to instantly distribute messages about some change in the configuration of the system to listening processes. Listening processes know what’s going on with other processes!
The beauty of Zookeeper as this sort of concurrency primitive is that the failure modes that can occur can all actually be planned for:
- Zookeeper itself dies (indicated by a timeout when a listening connection fails to hear from it). All the nodes now think the rest of the system is dead, and can begin their own crash-recovery processes.
- A single node dies (indicated by a timeout when Zookeeper tries to reach it with a heartbeat signal). Zookeeper can now remove the node from its list and let everyone else know something has gone awry.
Timeouts are error-prone beasts that can occur for any reason, but the important thing is that they can be used as robust indicators for liveness in the only sense that matters - connectivity.
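The timeout-as-liveness idea fits in a few lines. The `Coordinator` below is a toy of my own, loosely modelled on the heartbeat-and-session-timeout behaviour described above, not Zookeeper's actual API; timestamps are passed in explicitly to keep it deterministic:

```python
class Coordinator:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def live_nodes(self, now):
        # Liveness in the only sense that matters: a node is alive if
        # we have heard from it within the session timeout.
        return {n for n, t in self.last_seen.items()
                if now - t <= self.timeout}

coord = Coordinator(timeout=1.0)
coord.heartbeat("node-a", now=0.0)
coord.heartbeat("node-b", now=0.0)
coord.heartbeat("node-a", now=2.0)  # node-b has gone silent

print(sorted(coord.live_nodes(now=2.5)))  # → ['node-a']
```

A real coordinator would also notify the survivors that `node-b` dropped out, which is exactly the notification primitive Zookeeper provides.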
Lessons we can learn:
- A concurrent architecture that applies timeouts to resources both avoids deadlock and allows for systematic dissection of application state.
- A concurrent architecture that uses notification queues gives the rest of the system a stronger guarantee of consistency. (Message brokers are thus commonly found in concurrent systems.)
But the single biggest lesson to draw from Zookeeper and other distributed systems, one directly relevant to their simpler concurrent cousins, is that robust systems have robust communication.
I go on a minor tirade about metrics and monitoring in my actual answer to the question, because this principle goes beyond abstract processes in an abstract use case: it touches just as much on how our code speaks to us.
If we look carefully, everything discussed above - metrics, monitoring, error handling - is really just an aspect of a single concept: metadata. Code metadata is code written about code. It comprises tests, profiling, documentation, function signatures - everything.
This sort of metadata is vital to thinking about how we make our code work. Until we understand the pitfalls of our thinking, we will never be able to escape them; and until we capture this kind of metadata, we will never discover how to make things better.
Distributed systems bring out some of the worst that bad design can throw at us. Concurrent systems exhibit only a fraction of those nasty inconveniences, but in both cases we benefit heavily from talking in numbers and specifics rather than vague generalities.
Numbers first, patterns second, intuition last.