You are never gonna keep it down

Purple Rain got you down? Monit thrashing etcd? Just want to know if ETCD is healthy in your Cloud Foundry deployment?

Checking Health

Start by getting the list of etcd servers in your CF deployment:

bosh vms <your deployment> | grep etc  

Adjust the following script for your etcd hosts: change the values in {} to match the job/index values of your vms. Run it on any hm9000 server, since that vm will already have the necessary certs if you are using self-signed certs for etcd:

for etcd in etcd-{z1-0,z1-1,z2-0}; do  
  for role in self leader; do
    echo -n "${etcd} ${role}: "
    curl -k -s \
      --cacert /var/vcap/jobs/hm9000/config/certs/etcd_ca.crt \
      --cert /var/vcap/jobs/hm9000/config/certs/etcd_client.crt \
      --key /var/vcap/jobs/hm9000/config/certs/etcd_client.key \
      https://${etcd}.cf-etcd.service.cf.internal:4001/v2/stats/${role} | jq .
  done
  echo
done  

The script runs 2 curls against each of the etcd nodes: one for /self and one for /leader.

The output will tell you the following:

etcd-z1-0 self: {  
  "name": "etcd-z1-0",
  "id": "11e9f50c565d5b40",
  "state": "StateFollower",         #<< etcd_z1/0 says it is a follower
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.438615083s",
    "startTime": "2017-01-27T11:07:14.524076843Z"
  }...
}
etcd-z1-0 leader: {  
  "message": "not current leader"   #<< etcd_z1/0 says it isn't leader
}

etcd-z1-1 self: {  
  "name": "etcd-z1-1",
  "id": "795ba739b14eb9f4",
  "state": "StateFollower",         #<< etcd_z1/1 says it is a follower
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.474123185s",
    "startTime": "2017-01-27T11:07:14.526643444Z"
  }...
}
etcd-z1-1 leader: {  
  "message": "not current leader"   #<< etcd_z1/1 says it isn't leader
}

etcd-z2-0 self: {  
  "name": "etcd-z2-0",
  "state": "StateLeader",           #<< etcd_z2/0 says it is leader
  "leaderInfo": {
    "leader": "ef0d6a8fb314ed3a",
    "uptime": "7h43m21.520462761s",
    "startTime": "2017-01-27T11:07:14.529530067Z"
  }...
}
etcd-z2-0 leader: {  
  "leader": "ef0d6a8fb314ed3a",
  "followers": {
    "11e9f50c565d5b40": {           #<< etcd_z2/0 says it has a follower
      "latency": {                  #   corresponds to id of etcd_z1/0
        ...
      },
      "counts": {
        "fail": 0,
        "success": 8111881
      }
    },
    "795ba739b14eb9f4": {           #<< etcd_z2/0 says it has a follower
      "latency": {                  #   corresponds to id of etcd_z1/1
        ...
      },
      "counts": {
        "fail": 33,
        "success": 7876536
      }
    }
  }
}

In this 3 node cluster:
- etcd-z2-0 is the leader.
- etcd-z1-0 and etcd-z1-1 both report they are not the leader. etcd-z2-0, under its leader output, shows the ids of its followers. You can check the ids under the self calls of etcd-z1-0 and etcd-z1-1 to make sure they match.
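This cross-check can be scripted. A minimal sketch, using the ids from the sample output above as hard-coded values (in practice you would pull them out of the curl responses):

```shell
# Ids each follower reported about itself under /v2/stats/self
# (sample values from the output above).
z1_0_id="11e9f50c565d5b40"
z1_1_id="795ba739b14eb9f4"

# Ids the leader listed under "followers" in /v2/stats/leader.
leader_followers="11e9f50c565d5b40 795ba739b14eb9f4"

for id in "$z1_0_id" "$z1_1_id"; do
  case " $leader_followers " in
    *" $id "*) echo "$id: known to leader" ;;
    *)         echo "$id: MISSING from leader's follower list" ;;
  esac
done
```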

If more than one node reports that it is the leader, or one of your nodes shows a blank leader field under self, you have a split-brain or out-of-sync etcd cluster.
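The leader count itself is easy to script. A minimal sketch that counts StateLeader occurrences across saved /v2/stats/self responses — the sample files here are stand-ins created inline for illustration; in practice you would redirect each curl from the health-check script into its own file:

```shell
# Sketch: flag split brain (or a leaderless cluster) by counting how
# many nodes report StateLeader in their /v2/stats/self output.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in data; replace with real curl output per node.
printf '{ "name": "etcd-z1-0", "state": "StateFollower" }\n' > self-etcd-z1-0.json
printf '{ "name": "etcd-z1-1", "state": "StateFollower" }\n' > self-etcd-z1-1.json
printf '{ "name": "etcd-z2-0", "state": "StateLeader" }\n'  > self-etcd-z2-0.json

leaders=$(grep -l '"state": "StateLeader"' self-*.json | wc -l)
echo "leaders: $leaders"
if [ "$leaders" -ne 1 ]; then
  echo "WARNING: expected exactly 1 leader - split brain or out-of-sync cluster"
fi
```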

For split brain - 2 leaders

  • bosh ssh into each etcd vm
  • sudo -i and monit stop etcd
  • Verify etcd is down via ps -ef | grep etcd (the etcd metrics server is fine to leave running)
  • Once all nodes have etcd stopped, reset the etcd cluster db files by deleting the /var/vcap/store/etcd/member directory and all sub-directories and files
  • monit start etcd on the first node, wait for it to come up clean with tail -f /var/vcap/sys/log/etcd/*.log, then start the remaining nodes one at a time
  • Re-run the script to validate

For one or more nodes that are not leader but don't know who the leader is

  • bosh ssh into the leaderless vm
  • sudo -i and monit stop etcd
  • Verify etcd is down via ps -ef | grep etcd (the etcd metrics server is fine to leave running)
  • Delete the /var/vcap/store/etcd/member directory and all sub-directories and files
  • monit start etcd and wait for it to come up clean with tail -f /var/vcap/sys/log/etcd/*.log
  • Re-run the script to validate

You have 450+ runners and etcd runs out of file descriptors

This does happen on large deployments, since the default ulimit is 1024 on every stemcell we've looked at so far. CF v243 uses etcd-release v66, which doesn't handle ulimits correctly. This looks like it may be addressed in the newer etcd releases used by newer CF versions.
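You can see how close a process is to exhaustion by comparing its open descriptors against the soft limit in /proc. A sketch using the current shell's PID for illustration — on an etcd vm you would substitute the etcd PID found via ps -ef | grep etcd:

```shell
# Sketch: compare a process's open file descriptors to its soft limit.
pid=$$  # use the etcd PID on a real etcd vm
open=$(ls /proc/$pid/fd | wc -l)
limit=$(awk '/Max open files/ {print $4}' /proc/$pid/limits)
echo "pid $pid: $open fds open, soft limit $limit"
```

If `open` is creeping toward `limit`, the workaround below is worth applying before etcd falls over.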

To work around the current problem:

  • bosh ssh into each etcd vm then:
sudo -i  
monit stop etcd  

Verify etcd is down via the following (the etcd metrics server is fine to leave running):

ps -ef | grep etcd  

Once all nodes have etcd stopped, reset the etcd cluster db files by deleting the member directory and all sub-directories and files:

rm -rf /var/vcap/store/etcd/member  

Modify limits.conf:

vim /etc/security/limits.conf  

Add in the following:

* soft nofile 4096
* hard nofile 4096

Modify /var/vcap/jobs/etcd/bin/etcd_ctl around line 82 to add the ulimit just before calling the etcd executable:

...
ulimit -n 4096  # <=== Add this just before the existing line below

/var/vcap/packages/etcd/etcd ${etcd_sec_flags}  
...
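If you prefer to script that edit, a sed insert works too. This sketch is demonstrated on a temp copy — the real file is /var/vcap/jobs/etcd/bin/etcd_ctl, and the exact shape of the exec line may differ between etcd-release versions:

```shell
# Sketch: insert the ulimit line just above the etcd invocation.
f=$(mktemp)
cat > "$f" <<'EOF'
/var/vcap/packages/etcd/etcd ${etcd_sec_flags}
EOF

# GNU sed: insert before any line starting with the etcd binary path.
sed -i '\#^/var/vcap/packages/etcd/etcd#i ulimit -n 4096' "$f"
cat "$f"
```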

Reboot the vm. If it does not come up clean, attempt these steps:

su vcap  
ulimit -n 4096  
sudo monit start etcd  

Log in as root and wait for it to come up clean:

tail -f /var/vcap/sys/log/etcd/*.log  

Rinse and repeat with the remaining etcd nodes.

Etcd is a great tool, just needs a kick once in a while!

Lastly, this is a repost of documentation Chris McGowan created for one of our clients. It was full of goodies and needed to be shared. Everyone should have nice things!