What is the value of a tool? It fixes a problem we are experiencing, ideally in the most efficient and effective way possible.
When it comes to Kubernetes, many people like to talk about the tools that they use. This assumes that the tool is more important than the problem we are trying to solve. The tools we use change and evolve, but the problems we are trying to solve typically do not.
Stark & Wayne runs Kubernetes itself and also helps clients with their own deployments. Standing up the first Kubernetes cluster is a Day 1 problem, but what happens after that? What problems will companies face that lie just beyond what Kubernetes solves and call for additional tools?
At Stark & Wayne, we believe in open source software and are proud of that approach. All of the tools mentioned below are open source, so they are continually improved or changed out for more effective tools. And, since we believe in good tools, we think you should have good tools too.
We start with a problem that has almost too many tools: deploying a Kubernetes cluster the same way a few thousand times. The tool we use is Genesis, whose latest version was designed specifically with Kubernetes in mind. A set of parameters is defined in a kit environment YAML file and then deployed to your infrastructure of choice. Need another Kubernetes deployment? With a couple of lines of YAML, you can deploy another cluster based on your defined default configuration.
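To make the "defaults plus a couple of lines" idea concrete, here is a minimal Python sketch of layering a small per-cluster override on top of a shared default configuration. It is an illustration of the general pattern only, not how Genesis is implemented; the function `merge_environment` and every key and value shown are hypothetical, not real Genesis kit parameters.

```python
from copy import deepcopy


def merge_environment(defaults: dict, overrides: dict) -> dict:
    """Return a new dict: defaults with overrides layered on top, recursively."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_environment(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the default.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical shared defaults for every cluster (illustrative names only).
DEFAULTS = {
    "kit": {"name": "example-kit", "version": "1.0.0"},
    "params": {"worker_count": 3, "network": "default"},
}

# A new cluster only states what differs from the defaults.
NEW_CLUSTER = {"params": {"worker_count": 5}}

cluster_config = merge_environment(DEFAULTS, NEW_CLUSTER)
print(cluster_config)
```

The new cluster inherits everything it does not mention, so adding a cluster stays a small change, and the shared defaults are never modified in place.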
When developers have a predefined catalog of services available, they can focus on creating applications that bring value to the organization instead of becoming experts at installing, managing, and backing up a service like PostgreSQL. The developer's goal is typically to write the application, not the service. To provide developers with a set of services, we use a tool called Blacksmith.
Blacksmith solves these problems:
Blacksmith works by acting as a service broker and provisioning dedicated service instances on virtual machines or shared service instances on containers. Visit https://github.com/blacksmith-community for the full list of forges and check out the UI to see all the provisioned services.
Services are typically the stateful parts of the application or platform and are usually the critical components targeted when performing migrations and disaster recovery. The scope of problems related to the backups of service instances is quite large:
For this set of problems, Stark & Wayne uses a tool called SHIELD. Alongside its GUI, SHIELD has a useful CLI that allows tasks to be automated when combined with Concourse. Stark & Wayne has helped many organizations use this tool to recover successfully from a range of errors when managing services. More information can be found at https://shieldproject.io/, or contact someone at Stark & Wayne to talk about some of the more interesting SHIELD success stories we've had!
How do you manage and upgrade a single Kubernetes instance? Likely, it's a very manual process that doesn't scale well, especially as the number of clusters increases. Imagine the workload if you have hundreds or thousands of clusters.
To manage this problem, we use a CI/CD tool called Concourse to pipeline our deployments. One type of pipeline we create rolls out changes to a set of environments. Let's say you want to upgrade to a new version of Kubernetes: you try out the upgrade in development first, and then want the change to propagate automatically to testing, staging, production, or countless other combinations of environments without user intervention. Concourse can be configured to do all this, with a simple UI displaying the current state of the pipelines.
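The gating logic behind such a rollout can be sketched in a few lines of Python. Real Concourse pipelines are defined in YAML, so this is only a model of the promotion behavior under stated assumptions: the `promote` function, the `fake_deploy` stand-in, and the version string are all hypothetical.

```python
from typing import Callable, List


def promote(version: str, environments: List[str],
            deploy: Callable[[str, str], bool]) -> List[str]:
    """Deploy version to each environment in order; stop at the first failure.

    Returns the environments that were upgraded successfully.
    """
    upgraded = []
    for env in environments:
        if not deploy(env, version):
            # A failed environment blocks everything downstream of it.
            break
        upgraded.append(env)
    return upgraded


def fake_deploy(env: str, version: str) -> bool:
    """Hypothetical deploy step: pretend the upgrade fails in staging."""
    return env != "staging"


ENVIRONMENTS = ["development", "testing", "staging", "production"]
result = promote("v1.2.3", ENVIRONMENTS, fake_deploy)
print(result)
```

Because the failure in staging halts the loop, production is never touched, which is the property that makes unattended propagation safe.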
There is an excellent tutorial on configuring Concourse pipelines at https://concoursetutorial.com/, which is both informative and entertaining.
One of the more tedious parts of maintaining a platform is patching for CVEs and other security holes in the operating system. Doesn't exactly sound like fun, does it? If only there were a way to automatically pull in OS patches and roll them out in a controlled fashion to all the virtual machines.
BOSH Stemcells to the rescue! When BOSH deploys a virtual machine, it combines a stemcell (a base OS image) with a set of packages (from BOSH releases) and attaches any persistent disks required from the underlying infrastructure. Stemcells are frequently updated and easily found at https://bosh.io/stemcells/. We've also put together a tutorial about BOSH at https://ultimateguidetobosh.com/.
If you've deployed your Kubernetes, Cloud Foundry, or services as BOSH deployments, you can simply upload the latest stemcell and roll it out with the help of Concourse pipelines to keep your operating systems patched and the security folks happy.
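Before rolling out a new stemcell, it helps to know which deployments are behind. The Python sketch below shows one way to compare versions, assuming stemcell versions are dotted integers; the deployment names, the version numbers, and the helper functions are made up for illustration and are not part of BOSH.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '621.125' into (621, 125)."""
    return tuple(int(part) for part in version.split("."))


def outdated_deployments(current: Dict[str, str], latest: str) -> List[str]:
    """Return the names of deployments whose stemcell is older than latest."""
    latest_parsed = parse_version(latest)
    return sorted(
        name for name, version in current.items()
        if parse_version(version) < latest_parsed
    )


# Hypothetical inventory: deployment name -> stemcell version in use.
DEPLOYMENTS = {"kubernetes": "621.90", "cf": "621.125", "postgres": "621.9"}

stale = outdated_deployments(DEPLOYMENTS, latest="621.125")
print(stale)
```

Comparing parsed tuples rather than raw strings matters here: as strings, "621.9" would sort after "621.125" and the oldest stemcell would be missed.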
How do you monitor the health of Kubernetes, the underlying infrastructure, or the individual services that make up the platform or are bound to an application?
At Stark & Wayne, we've used several iterations of a tool called Prometheus for health monitoring and alerting. Using Node Exporter and BOSH Exporter, you get predefined dashboards for virtual machine statistics. There are dashboards and alerts for Kubernetes and Cloud Foundry, populated with data from their corresponding exporters. Insightful dashboards also exist for a diverse set of services such as PostgreSQL, MySQL, ElasticSearch, Ceph, Gluster, AWS ECS, fluentd, and even NRPE if you are feeling particularly nostalgic. A more comprehensive list of exporters can be found at https://prometheus.io/docs/instrumenting/exporters/.
When a certificate expires, does it make a sound? Yes, typically in the form of a PagerDuty alert and operators groaning in annoyance.
How do we keep an eye on certificates used in the platform and know when they are due to expire before they actually expire?
We've helped many organizations end their production outage by finding the expired certificate lurking somewhere on their platform. After helping clients do this more times than we care to admit, we created a tool called Doomsday. Simply point it at the tool that stores your certificates (Vault and CredHub are typical), and Doomsday will scan the certificates and display the ones that are expired or close to expiring. With this vital piece of information, you can rotate your certificates BEFORE they expire and prevent the PagerDuty notification.
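The core check is simple enough to sketch in Python: given each certificate's expiration time, sort it into expired, expiring soon, or fine. This is an illustration of the idea only, not Doomsday's implementation; the `classify_certificates` function, the secret paths, the dates, and the 30-day warning window are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def classify_certificates(expirations: Dict[str, datetime], now: datetime,
                          warn_within: timedelta) -> Dict[str, List[str]]:
    """Group certificate paths by how close they are to expiring."""
    report: Dict[str, List[str]] = {"expired": [], "expiring_soon": [], "ok": []}
    for path, not_after in sorted(expirations.items()):
        if not_after <= now:
            report["expired"].append(path)
        elif not_after - now <= warn_within:
            report["expiring_soon"].append(path)
        else:
            report["ok"].append(path)
    return report


# Hypothetical inventory: secret path -> certificate expiration time.
NOW = datetime(2024, 1, 1, tzinfo=timezone.utc)
CERTS = {
    "secret/example/api": NOW - timedelta(days=2),
    "secret/example/router": NOW + timedelta(days=10),
    "secret/example/db": NOW + timedelta(days=200),
}

report = classify_certificates(CERTS, NOW, warn_within=timedelta(days=30))
print(report)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.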