This is another interesting day in the life of modern platforms (BOSH and Cloud Foundry) and automation (Concourse).

The Problem

Recently we ran into an issue with Concourse. After building a seemingly successful pipeline and using it to deploy the microbosh to AWS, we ran into a snag where the deploys always failed for the same reason: the deployment couldn't find the AWS instance using the ID in the manifest. But why?

Symptoms

Initially diagnosing the behavior was reasonably straightforward: when running the pipeline, an error like the following would appear:

Started deploy micro bosh > Mount disk. Done (00:00:01)instance i-22222222 has invalid disk: Agent reports  while deployer's record shows vol-11111111.  

Fix Attempt #1

Going into AWS -> EC2 -> Volumes and searching for vol-11111111 pulled up the volume easily, but it was attached to a different instance, i-33333333. In fact, going into Instances and searching for i-22222222 showed that there were no instances with that ID!

This meant that, for some reason, the bosh-deployments.yml file was wrong. That file is the "database" that holds the bosh micro deploy state. At this point I wasn't yet sure why it held the incorrect state, so I fixed it to match reality according to the AWS Console:

---
instances:  
- :id: 1
  :name: microbosh
  :uuid: (( some UUID ))
  :stemcell_cid: (( AMI ))
  :stemcell_sha1: (( some SHA ))
  :stemcell_name: (( some stemcell ))
  :config_sha1: (( some SHA ))
  :vm_cid: i-33333333
  :disk_cid: vol-11111111
disks: []  
registry_instances:  
- :id: 14
  :instance_id: i-33333333
  :settings: (( bunch of stuff ))

Great! Everything is kosher. Trigger the pipeline aaaaaaannnnnnndddddd…

Started deploy micro bosh > Mount disk. Done (00:00:01)instance i-33333333 has invalid disk: Agent reports  while deployer's record shows vol-11111111  

Going back into AWS showed that i-33333333 had been terminated, and inspecting the volume vol-11111111 showed that it was now attached to a new instance, i-44444444. The bosh-deployments.yml file, however, still had i-33333333.

Hmmm.
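
Comparing the state file against the console by hand gets old after the second round, so the check can be scripted. What follows is a minimal sketch, not something from the original pipeline: it assumes the AWS CLI is installed and configured on the box, that the state file has the single-instance layout shown earlier, and the function names (recorded_cid, attached_instance, check_state) are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: these function names are hypothetical, not part of BOSH or Concourse.

# recorded_cid FILE KEY: print the value stored under ":KEY:" in the state file.
# Assumes one instance entry with Ruby-symbol style keys (":vm_cid:", ":disk_cid:").
recorded_cid() {
  awk -v key=":$2:" '$1 == key { print $2; exit }' "$1"
}

# attached_instance VOLUME_ID: print the instance the volume is attached to.
# Assumes the AWS CLI is installed and has credentials for the right account.
attached_instance() {
  aws ec2 describe-volumes --volume-ids "$1" \
    --query 'Volumes[0].Attachments[0].InstanceId' --output text
}

# check_state FILE: compare the recorded VM against what AWS reports.
check_state() {
  local file="$1" vm disk actual
  vm="$(recorded_cid "$file" vm_cid)"
  disk="$(recorded_cid "$file" disk_cid)"
  actual="$(attached_instance "$disk")"
  if [ "$vm" = "$actual" ]; then
    echo "OK: $disk is attached to $vm"
  else
    echo "MISMATCH: state file says $vm, AWS says $actual"
    return 1
  fi
}
```

Running check_state bosh-deployments.yml before and after a pipeline run would show whether the file and AWS have drifted apart, without clicking through the console each time.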

Fix Attempt #2

Using one of our earlier blog posts as a guide, I cleaned out all the "dynamic bits" and tried triggering the pipeline again. Unfortunately, this did not resolve the issue: even though neither the instance_id nor the vm_cid field was present when I started the pipeline, the wrong instance ID was populated in both places once it ran, and the pipeline terminated with the same error.

Fix Attempt #3

At this point I deleted the EC2 instance that was supposed to be attached to the persistent disk. (It is probably obvious that the volume was not set to delete along with its instance, or else the volume would have been disappearing as well, but you know I double-checked that anyway. Because human error and whatnot.) Then I created a NEW instance manually using the criteria in the manifest. I updated the bosh-deployments.yml file and did a manual bosh deploy. SUCCESS! I triggered the pipeline to run - SUCCESS!

BUT because of the change I made to the pipeline, the pipeline was triggered to run a second time after the successful completion of my manual run. This time it FAILED.

And the instance ID was wrong again.

Deeper Troubleshooting

Clearly, something a bit deeper was going on in the pipeline itself. Since this particular pipeline pushes its changes to GitHub as a sort of audit trail, I looked through all of its git commits to track down where the problem was. This is where the problem became a little more obvious.

The commits showed that the problem was rooted in the gap between when our pipeline was triggered and when it grabbed the deployments. Basically, the pipeline grabbed the "state of the universe" at the beginning and used that to populate bosh-deployments.yml. It then started to run and changed the state of the universe, but still used the bosh-deployments.yml file, with its now-outdated information, to try to deploy. This, of course, caused the failure.

To prevent the pipeline from triggering prematurely and running with out-of-date information, I updated the resources in the pipeline.yml file to ignore our pipeline-inputs.yml:

resources:  
- name: aws-pipeline-changes
  type: git
  source:
    uri: {{pipeline-git-repo}}
    branch: {{pipeline-branch}}
    paths:
    - environments/aws/templates/pipeline
    - environments/aws/templates/bin
    - environments/aws/templates/releases
    ignore_paths:
    - environments/aws/templates/pipeline/pipeline-inputs.yml
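
To make the effect of paths and ignore_paths concrete, here is a rough sketch of the decision the resource ends up making for a changed file. It is only an approximation using plain prefix comparison, and I am not claiming it matches how the git resource really does its path matching. The helper names has_prefix and would_trigger are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: approximates the paths / ignore_paths filter with prefix matching.

# Paths copied from the resource definition above, one per line.
WATCHED="environments/aws/templates/pipeline
environments/aws/templates/bin
environments/aws/templates/releases"
IGNORED="environments/aws/templates/pipeline/pipeline-inputs.yml"

# has_prefix LIST PATH: succeed if PATH equals, or sits under, any entry in LIST.
has_prefix() {
  local list="$1" path="$2" entry
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    case "$path" in
      "$entry"|"$entry"/*) return 0 ;;
    esac
  done <<EOF
$list
EOF
  return 1
}

# would_trigger PATH: succeed if a change to PATH should start a new build,
# meaning it is under a watched path and not under an ignored one.
would_trigger() {
  has_prefix "$WATCHED" "$1" && ! has_prefix "$IGNORED" "$1"
}
```

With this filter, would_trigger environments/aws/templates/pipeline/pipeline.yml succeeds, while the same call for pipeline-inputs.yml fails, which is the behavior we wanted: a commit that only touches the inputs file no longer kicks off another run.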

After some cautious optimism I ran the pipeline again. The good news: the original issue was fixed. The bad (ish?) news: it failed with a new error:

unexpected end of JSON input  

Welp, at least our bosh-deployments.yml file was fixed. Huzzah.

Fix One Bug, Find Another: The JSON Error

The JSON error appeared right at the build stage - before the pipeline would grab anything and do its magic. In the UI, both the stemcell-aws asset and the environment were in orange. When I clicked on stemcell-aws, I saw that it wasn't able to grab the stemcell - it was just dying.

Looking through the resources in pipeline.yml, the stemcell-aws resource was using bosh-io-stemcell. In Concourse itself, that resource is located at bosh-io-stemcell-resource. The assets/check file is where the curl command runs to grab the stemcell:

curl --retry 5 -s -f http://bosh.io/api/v1/stemcells/$name -o $stemcells  

So I ran this command on the jump box that hosts our pipeline, and it failed. As an important aside, it failed because of restrictions on our client's network: only HTTPS connections are allowed, and HTTPS connections are redirected before leaving the company intranet. The fix was as simple as switching the URL to HTTPS and adding -L so that curl follows the redirect:

curl --retry 5 -L -s -f https://bosh.io/api/v1/stemcells/$name -o $stemcells
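
If you want to see this kind of failure for yourself from a restricted box, a small probe helps, because the original command's -s and -f flags make curl fail without saying much. This is a minimal sketch with hypothetical helper names (stemcell_api_url, probe). It drops -f and discards the body so that only the final status code and URL are printed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the Concourse resource.

# stemcell_api_url SCHEME NAME: print the bosh.io API URL for a stemcell name.
stemcell_api_url() {
  printf '%s://bosh.io/api/v1/stemcells/%s\n' "$1" "$2"
}

# probe URL: print "<http status> <final URL>" after following any redirects.
# Needs network access; a status of 000 means no HTTP response came back.
probe() {
  curl --retry 5 -L -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$1"
}

# Example, with $name set to the stemcell name from pipeline.yml:
#   probe "$(stemcell_api_url http  "$name")"
#   probe "$(stemcell_api_url https "$name")"
```

Comparing the two probe lines shows whether plain HTTP is blocked and where the HTTPS request finally lands.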

After we made the pull request, someone pointed out that the bosh-io-release resource had a similar line of code, so it would probably run into the same problem eventually. To head that off, we submitted pull requests with the same fix for that resource as well.

Resolved!

After the Concourse team merged our pull request to fix the JSON error, we were able to definitively verify that our initial issue was resolved with a series of successful pipeline deployments. ✌.ʕʘ‿ʘʔ.✌