Jun 01, 2017
Testing distributed systems through failure/downtime scenarios with Delmo
For the last year or so, at Stark & Wayne we've been developing production-grade data services that support highly available failover and automatic disaster recovery. We are planning for these data platforms to run 1000s of databases, so every failover and every user's requirement for disaster recovery needs to work every time and without human intervention. We want to TDD our data services. A year ago Justin Carter created a new testing harness to help us - called Delmo - and a year later it's still proving fundamentally useful.
Watch this blog post
I've felt like Delmo was our little secret weapon - but really it's been by accident that I've not talked more about it. Recently I talked about Delmo at a meetup in Brisbane AU. This blog post was written after this talk to share the tutorial with everyone.
Since giving this talk, I've cleaned up the tutorial demonstration so it's easy for anyone to follow:
It uses the example of testing the behavior of a web application (Ruby on Rails app in this case as it was a Ruby meetup) during the unknown periods when it loses access to its database (or any other dependent subsystem or microservice).
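To make the shape of such a scenario concrete, here is a minimal sketch in Python. It is not Delmo's configuration format or API, and it does not use the Rails demo app: `FakeDatabase`, `FakeApp`, `wait_until`, and `run_outage_scenario` are all hypothetical names invented for illustration, and the "database" is an in-memory stand-in. It only shows the sequence a failure test walks through: check the app while healthy, take the dependency away, check the degraded behavior, bring the dependency back, and wait for recovery.

```python
import time


class FakeDatabase:
    """Hypothetical stand-in for a database that can be stopped and started."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False

    def start(self):
        self.running = True

    def query(self):
        if not self.running:
            raise ConnectionError("database unavailable")
        return ["row"]


class FakeApp:
    """Hypothetical web app that depends on the database."""

    def __init__(self, db):
        self.db = db

    def get_status(self):
        # Return an HTTP-like status code instead of crashing on an outage.
        try:
            self.db.query()
            return 200
        except ConnectionError:
            return 503


def wait_until(check, timeout=2.0, interval=0.05):
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False


def run_outage_scenario(app, db):
    """Record the app's status before, during, and after a database outage."""
    observed = [("before", app.get_status())]

    db.stop()  # simulate losing the dependency
    observed.append(("during", app.get_status()))

    db.start()  # the dependency returns
    wait_until(lambda: app.get_status() == 200)
    observed.append(("after", app.get_status()))
    return observed


if __name__ == "__main__":
    db = FakeDatabase()
    for phase, status in run_outage_scenario(FakeApp(db), db):
        print(phase, status)
```

In a real test the stop and start steps would act on actual containers or processes rather than an in-memory flag, and the polling step matters more, since a real service takes an unknown amount of time to come back.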
To get started with the tutorial:
git clone https://github.com/starkandwayne/delmo-rails-pg-demo
cd delmo-rails-pg-demo
git checkout step-1
cat README.md
We've also been using Delmo to test our Habitat plans:
These plans are using Delmo to dev/test the behavior of each database in a cluster - testing what happens when a node is lost, and later returns; and testing what happens after downtime, and checking that each service automatically performs disaster recovery.
Finally, the original use of Delmo was with individual Dingo PostgreSQL clusters:
As well as the Open Service Broker compatible Dingo Redis: