High Availability (HA) is a common requirement. Unfortunately it is often the case that HA considerations need to be made from the ground up while designing a service. Sometimes a component of your service might be the best tool for your business case, but the nature of its business task or a trade off made during its design will make HA difficult to achieve. Recently we ran in to this scenario with a service that integrated Riemann in to its stack.

Riemann is a fast and lite weight stream processing engine and for this particular service requirement it was a great tool for the job, but its performance comes at the cost of some in-memory state. An index of stream object types and aggregate details is stored locally in memory and is used during stream processing. Of course that local state makes it difficult distribute the processing workload, but for our use case Riemann seemed to be holding up its end of the bargain. Ultimately scaling wasn't as big of a concern as availability.

So how might we solve the HA problem? The approach we decided to take was to send all stream data to one Riemann instance, that instance is then responsible for streaming data to every other instance in the cluster. Every instance behaves the same way, if it receives a stream, send that stream to every other known Riemann instance

So now we have a cluster of Riemann instances, one is the leader and the rest are a warm failover. But how do we decide who acts as the leader? Fortunately we are using Consul (http://consul.io) for service discovery and health management in this environment. I've included a link below to the documentation for using Consul to facilitate leader election. But the basic outline is.

  1. Establish a session between the agent and the cluster.
  2. Agree on a key that contains the information about the leader
  3. Attempt to acquire the session lock on that key
  4. The agent that has acquired the lock is the leader.

You can store anything you want in the key but in our case we are using the hostname of the machine that has the lock.

Now how do we handle failover? Every component of the service is being watched by Riemann, if the component fails or if the Consul agent dies, the lock on the session will be released. The next agent that tries to acquire the lock will be automatically promoted.

When you are processing stream data in an operations setting, generally your goal is to take some action based on it. But we can't have all of the Riemann instances acting. In our case they would all act at the same time, and all but the first one would probably be causing trouble. So in our case, before an action is taken, the component that acts on the stream simply checks if that node is the leader before an action is taken.

There are several approaches to a pre-action leader check. A very lite weight solution would be to check the payload of the leader key and see if it points to the acting agent but this is possibly subject to race conditions or a delay in consensus. Another option is to just attempt to acquire the key before any action is taken, if the action returns true, you can proceed, if false do nothing. We used a hybrid of both of these approaches.

At the moment we don't have a perfect solution, failover seems to take less than 20 seconds but that could likely be improved upon and there are lots of ways to handle the leader check. But I like this example because it highlights a not so obvious use case for a tool that many of us are already using every day.


Consul Leader Election: https://www.consul.io/docs/guides/leader-election.html Riemann: http://riemann.io