What can you do when your app times out connecting to the log server in your CF deployment?
The following error occurred when I pushed my app to CF in AWS.
timeout connecting to log server, no log will be shown
Starting app cf-env in org codex / space cf-app-testing as admin...
Error restarting application: StagerError
To get more information, I ran
CF_TRACE=true cf push, and the following message hung there for what felt like forever.
WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1
Authorization: [PRIVATE DATA HIDDEN]
Since the push failed while sending a request to the doppler server, I ran
bosh vms to check whether the doppler VMs were running. Next, I logged into the doppler server and ran
monit summary to check whether all the processes were running.
The output from running
monit summary is as follows:
Process 'doppler' running
Process 'syslog_drain_binder' running
Process 'metron_agent' running
Process 'toolbelt' running
System 'system_localhost' running
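With only a handful of processes it is easy to read the summary by eye, but on busier VMs a small filter can help. The following is a minimal sketch, assuming the output format shown above (one `Process 'name' status` line per process); the function name `not_running` is my own and is not part of monit.

```shell
#!/usr/bin/env bash
# Print the name of every monit-managed process whose status is anything
# other than "running". Reads `monit summary`-style text on stdin.
# Assumes lines shaped like:  Process 'doppler' running
not_running() {
  awk '$1 == "Process" && $NF != "running" { print $2 }' | tr -d "'"
}

# Hypothetical usage on the doppler VM (as root):
#   monit summary | not_running
# No output means every process reported "running".
```

An empty result only means monit considers the processes up. As this post goes on to show, a process can be "running" and still be failing in its own logs.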
Everything looked good, so I then dug through the logs on the doppler server, where I saw the following messages:
panic: sync cluster failed
goroutine 1 [running]:
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)
For some reason, the log cluster could not be synchronized. As recommended by one of my super coworkers, Geoff, I then tried the HM9000 disaster-recovery method, which can be summarized as:
monit stop etcd (on all nodes in etcd cluster)
rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
monit start etcd (one-by-one on each node in etcd cluster)
It did not solve my problem this time, but I think it is a good method to know, since it may come to the rescue when you deal with some other logging problem.
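The reset steps above could be wrapped in a couple of shell functions, roughly as sketched below. This is an illustration under my own assumptions, not an official procedure: the helper names (`run`, `stop_and_wipe_etcd`, `start_etcd`) and the `DRY_RUN` variable are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set, because the `rm` step destroys etcd's stored data.

```shell
#!/usr/bin/env bash
# Sketch of the etcd reset for ONE node. Run it as root on each etcd VM.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Phase 1: run on ALL etcd nodes before moving on to phase 2.
stop_and_wipe_etcd() {
  run monit stop etcd
  # `monit stop` returns before the process has exited; check
  # `monit summary` before wiping for real.
  run sh -c 'rm -rf /var/vcap/store/etcd/*'
}

# Phase 2: run on each etcd node, one node at a time.
start_etcd() {
  run monit start etcd
}
```

Splitting the work into two phases mirrors the order of the steps: every node is stopped and wiped first, and only then are the nodes started one by one.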
Since everything was running and listening properly and the HM9000 reset did not fix the problem, I went back to check my Security Group settings and Routing Tables in the AWS Console. Both are listed in the left column of the VPC dashboard. I found out that port 4443, used for the WebSocket connections, was not allowed in the Inbound Rules! So I enabled port 4443 for inbound traffic in my Security Group settings. As soon as I did this, running
cf push worked, and the app is now running.
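In hindsight, a quick reachability test from my workstation would have pointed at the Security Group much sooner. Here is a minimal sketch, assuming bash (for its `/dev/tcp` feature) and the coreutils `timeout` command; the function name `port_open` and the hostname `doppler.example.com` are placeholders of mine, so substitute the logging endpoint of your own deployment.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 5 seconds,
# non-zero otherwise (refused, filtered, or timed out).
port_open() {
  local host=$1 port=$2
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Hypothetical usage (the hostname is a placeholder):
#   if port_open doppler.example.com 4443; then
#     echo "port 4443 reachable"
#   else
#     echo "port 4443 blocked or closed; check the Security Group inbound rules"
#   fi
#
# I made the change in the AWS Console, but the same rule can be added with
# the AWS CLI. The group ID and CIDR below are placeholders:
#   aws ec2 authorize-security-group-ingress \
#     --group-id sg-xxxxxxxx --protocol tcp --port 4443 --cidr 0.0.0.0/0
```

A failed connection does not prove that the Security Group is at fault, since a stopped listener looks the same from the outside. Combined with a healthy monit summary on the server, though, it narrows the search to the network path.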