logsearch-boshrelease is pretty cool when it's working. It's pretty easy to manage at a small scale, but as your deployment scales out and you start processing millions of records every few minutes, you might run into some issues. Below are some of the most common ones we've hit, and how we resolved them.
Have you ever run a simple logsearch deployment update and noticed that the elasticsearch nodes take a really long time (sometimes forever) to complete? This is due to the drain/undrain scripts, which are there to ensure the cluster stays available during deployments. There are typically two reasons these scripts will cause your deployment to hang.
Not enough data nodes
Each shard of each index in ElasticSearch has a primary and one or more replicas. A node cannot host more than one copy of the same shard, so it cannot be both the primary and a replica of that shard. While one node is down for an update, the remaining nodes still have to hold every copy of every shard before the cluster can go green. For that reason, we recommend at least 3 elasticsearch_persistent nodes in your deployment.
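To make the node-count reasoning concrete, here is a minimal Python sketch. The function name and the default of one replica per shard are assumptions for illustration, not something taken from logsearch-boshrelease. It only checks the "one copy per node" rule, so treat a True result as necessary rather than sufficient (disk space and allocation settings can still get in the way).

```python
def can_stay_green(total_nodes: int, replicas: int = 1, nodes_draining: int = 1) -> bool:
    """Return True if every copy of a shard can sit on its own node during an update.

    Elasticsearch never places two copies of the same shard on one node, so a
    shard with `replicas` replicas needs `replicas + 1` distinct nodes.
    """
    available = total_nodes - nodes_draining  # nodes still up while one is updated
    return available >= replicas + 1
```

With these assumptions, can_stay_green(2) is False, because the single remaining node cannot hold both the primary and the replica, while can_stay_green(3) is True.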
Bad cluster state
The drain/undrain scripts loop over cluster state, waiting for it to be GREEN. If that never happens, they never exit. This is a pretty good article that discusses cluster state and shards, and how to fix them. However, the script it offers for fixing shards has some issues when dealing with multiple nodes and indices, as well as with shard ordering. To address that, we've written a utility called esuf for bringing shards back online in an expedient and mostly-sane fashion.
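The wait-for-green loop is easy to picture with a short Python sketch. This is not the release's drain script; it is an illustration that polls Elasticsearch's _cluster/health endpoint and, unlike the behavior described above, gives up after a timeout. The function names, the timeout, and the polling interval are all assumptions.

```python
import json
import time
import urllib.request


def fetch_status(base_url: str) -> str:
    """Read the cluster status ("green", "yellow" or "red") from _cluster/health."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)["status"]


def wait_for_green(get_status, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll get_status() until it returns "green" or timeout_s has elapsed.

    Returns True once the cluster is green, False if the timeout is hit first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "green":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

You could call it as wait_for_green(lambda: fetch_status("http://<elasticsearch_ip>:9200")). Passing the status getter in as a function keeps the loop itself easy to try out without a live cluster.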
Failing ingestor nodes
Is your ingestor node routinely falling over with a full disk? Check to make sure that debug is disabled on the ingestor. Otherwise, it will log every message it receives to /var/vcap/sys/log, and that is likely the source of your ingestor failures.
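If you want a quick number for how much space the logs are taking, something like this Python sketch can help. The helper name is made up for illustration; only the /var/vcap/sys/log path comes from the text above.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all files under path, skipping files that vanish mid-walk."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # log rotation may delete a file between listing and stat
    return total
```

Running dir_size_bytes("/var/vcap/sys/log") a couple of times, a minute apart, gives a rough idea of how fast the directory is growing.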
Queues backing up
Another big issue we've run into is the queue node tipping over: it gets so backed up that it eats all available memory and swap, then crashes. When monit starts it back up, it crashes again as soon as it loads all the data back into memory.
To break this vicious cycle, first stop the ingestor jobs. Next, bosh ssh into the queue node, run monit stop all, ensure that there are no redis-server processes still hanging out, and delete all the files in redis's data directory. Then run monit start all to bring redis back online. Give the log parsers a minute or so to reconnect to redis, and then slowly bring your ingestors back online.
While the ingestors are operating, you can bosh ssh into the queue node and run
watch /var/vcap/packages/redis/bin/redis-cli llen logstash
to get an up-to-date count of how many records are sitting in the queue, waiting to be parsed. Typically, this should fluctuate a little bit up and down, but ultimately stay small, if not 0 most of the time. If you see it growing, your log_parser nodes are not keeping up with the data being put into the queue.
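If you would rather not eyeball watch output, the same check can be roughly automated. The Python sketch below shells out to the redis-cli path from the command above and compares the first and second half of a series of samples. The function names and the tolerance of 1000 records are arbitrary assumptions, so tune them to your own volumes.

```python
import subprocess

REDIS_CLI = "/var/vcap/packages/redis/bin/redis-cli"  # path from the watch command


def queue_length() -> int:
    """Return the current length of the logstash list, as reported by redis-cli."""
    out = subprocess.run(
        [REDIS_CLI, "llen", "logstash"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Take the last token so both "42" and "(integer) 42" parse.
    return int(out.split()[-1])


def is_backing_up(samples: list, tolerance: int = 1000) -> bool:
    """Return True if the later samples average more than `tolerance` above the earlier ones.

    Small up-and-down fluctuation is normal, so this compares the mean of the
    first half of the samples with the mean of the second half.
    """
    if len(samples) < 4:
        return False  # not enough data to call it a trend
    half = len(samples) // 2
    first = sum(samples[:half]) / half
    second = sum(samples[-half:]) / half
    return second - first > tolerance
```

Collect a sample with queue_length() every few seconds, then pass the list to is_backing_up(). A queue that bounces around near zero returns False; one that keeps climbing returns True.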
To determine what to do next, take a look at bosh vms --vitals for your logsearch deployment. If the log_parsers are looking pretty busy, you will likely need more instances to keep up with the demand. However, if they're not that busy, chances are the issue is in ElasticSearch. Make sure the cluster is in a green state with curl <elasticsearch_ip>:9200/_cat/health?pretty. If it isn't, try using esuf to get all the shards online. If that doesn't help at all, you might be running into I/O bottlenecks on the elasticsearch_* nodes. Take a look at the Wait column of the output from bosh vms --vitals. If that's more than 5% on any of the elasticsearch_* nodes, consider moving their persistent disks to solid-state storage if possible. It doesn't take much iowait to make the VMs crawl and back up the log parsing.
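The order of those checks can be summed up as a small decision function. This is only a sketch of the reasoning above in Python: the function name and return strings are invented, and whether the log_parsers count as "busy" is left as a yes/no input, since that is a judgment call you make from the vitals output.

```python
def next_step(parsers_busy: bool, cluster_status: str, max_es_iowait_pct: float) -> str:
    """Suggest what to look at next when the queue keeps growing.

    parsers_busy:       your read of the log_parser rows in `bosh vms --vitals`
    cluster_status:     "green", "yellow" or "red" from the health check
    max_es_iowait_pct:  highest Wait value across the elasticsearch_* nodes
    """
    if parsers_busy:
        return "add log_parser instances"
    if cluster_status.lower() != "green":
        return "bring shards back online (for example with esuf)"
    if max_es_iowait_pct > 5.0:
        return "move elasticsearch persistent disks to solid-state storage"
    return "consider scaling out elasticsearch nodes"
```

For example, next_step(False, "yellow", 1.0) points you at the shards before anything else, which matches the order the checks are described in.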
As a last-ditch effort, you can try scaling out your elasticsearch nodes, but once you have one node for each shard copy, scaling out more won't be helpful. For example, if you have 4 shards per index, there are typically 8 copies in total: a primary and a replica for each shard. Scaling out past 8 elasticsearch nodes won't provide additional returns, unless you also modify logstash on the log_parsers to create indices with more shards. However, we've been able to achieve some pretty high logging volumes with just 8 elasticsearch nodes, so creating more shards and scaling further should be something you try only after everything else.
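The arithmetic behind that ceiling is simple enough to write down. In this Python sketch the function names are made up, and one replica per shard is assumed by default, matching the example above.

```python
def max_useful_nodes(primary_shards: int, replicas: int = 1) -> int:
    """Number of shard copies per index: each primary plus its replicas."""
    return primary_shards * (1 + replicas)


def more_nodes_help(current_nodes: int, primary_shards: int, replicas: int = 1) -> bool:
    """True while there are fewer nodes than shard copies to spread across them."""
    return current_nodes < max_useful_nodes(primary_shards, replicas)
```

With 4 shards per index, max_useful_nodes(4) is 8, so more_nodes_help(6, 4) is True and more_nodes_help(8, 4) is False.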
Hopefully this has been somewhat helpful in troubleshooting, fixing, and scaling out your logsearch-boshrelease deployments. And don't forget to check out esuf when you're having trouble getting shards back online.