What can you do when your app times out connecting to the log server in your CF deployment?
The following error occurred when I pushed my app to CF in AWS.
timeout connecting to log server, no log will be shown
Starting app cf-env in org codex / space cf-app-testing as admin...
FAILED
Error restarting application: StagerError
To get more information, I ran CF_TRACE=true cf push, and the following message hung there for what felt like forever.
WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1
Host: wss://doppler.system.staging.xiujiaogao.com:4443
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: [HIDDEN]
Origin: http://localhost
Authorization: [PRIVATE DATA HIDDEN]
Since the request to the doppler server was what failed, I ran bosh vms to check whether the doppler VMs were running. I then logged into the doppler server and ran monit summary to check whether all the processes were running.
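The checks above can be sketched as a small runbook function. The instance name doppler/0 and the bosh ssh -c syntax are assumptions that vary by deployment and BOSH CLI version; the BOSH variable exists only so the commands can be stubbed or dry-run.

```shell
#!/bin/bash
# Sketch of the health checks, assuming a BOSH-deployed doppler job.
set -euo pipefail

check_doppler() {
  # Are the doppler (and other) VMs up?
  "${BOSH:-bosh}" vms
  # On a doppler instance: are all monit-managed processes running?
  "${BOSH:-bosh}" ssh doppler/0 -c 'sudo monit summary'
}
```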
The output of monit summary was as follows:
Process 'doppler' running
Process 'syslog_drain_binder' running
Process 'metron_agent' running
Process 'toolbelt' running
System 'system_localhost' running
Everything looked good, so I then dug through the logs on the doppler server, where I saw the following messages:
panic: sync cluster failed

goroutine 1 [running]:
panic(0xb0d3c0, 0xc8201460f0)
    /var/vcap/data/packages/golang1.6/85a489b7c0c2584aa9e0a6dd83666db31c6fc8e8.1-0ebd71019c0365d2608a6ec83f61e3bbee68493c/src/runtime/panic.go:464 +0x3e6
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)
    /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:58 +0x185
main.main()
    /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:92 +0x4f9
For some reason, the log cluster could not be synchronized. As recommended by one of my super coworkers, Geoff, I then tried the HM9000 disaster-recovery method, whose summarized steps are:
1. monit stop etcd (on all nodes in the etcd cluster)
2. rm -rf /var/vcap/store/etcd/* (on all nodes in the etcd cluster)
3. monit start etcd (one by one on each node in the etcd cluster)
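The steps above can be sketched as a script. This is a sketch, not the official procedure: the etcd instance names, the bosh ssh -c syntax, and passwordless sudo are all assumptions that depend on your deployment and BOSH CLI version, and the BOSH variable exists only so the commands can be stubbed or dry-run.

```shell
#!/bin/bash
# Sketch of the etcd cluster reset, assuming BOSH instances named etcd/0, etc.
set -euo pipefail

reset_etcd_cluster() {
  local nodes=("$@")
  local node

  # 1. Stop etcd everywhere first, so no node rebuilds state from a stale peer.
  for node in "${nodes[@]}"; do
    "${BOSH:-bosh}" ssh "$node" -c 'sudo monit stop etcd'
  done

  # 2. Wipe the persisted store on every node.
  for node in "${nodes[@]}"; do
    "${BOSH:-bosh}" ssh "$node" -c 'sudo rm -rf /var/vcap/store/etcd/*'
  done

  # 3. Restart etcd one node at a time so the cluster re-forms cleanly.
  for node in "${nodes[@]}"; do
    "${BOSH:-bosh}" ssh "$node" -c 'sudo monit start etcd'
  done
}
```

For a three-node cluster this would be invoked as reset_etcd_cluster etcd/0 etcd/1 etcd/2; the ordering matters, since restarting a node while another still holds stale state can re-poison the cluster.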
It did not solve my problem this time, but I think it is a good method to know, since it may come to the rescue when you deal with some other logging problem.
Since everything was running and listening properly, and the HM9000 reset did not fix the problem, I went back to check the Security Group settings and Routing Tables in my AWS console; both are listed in the left column of the VPC dashboard. There I found that port 4443, the port used for the websocket connections, was not allowed by the Inbound Rules! So I allowed inbound traffic on port 4443 in my Security Group settings. As soon as I did, cf push worked, and the app is now running.
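For reference, the same inbound rule can also be added from the command line with the AWS CLI. This is a sketch: the security group id below is a placeholder, and whether a wide-open 0.0.0.0/0 CIDR is appropriate depends on who needs to reach the doppler endpoint; the AWS variable exists only so the call can be stubbed.

```shell
#!/bin/bash
# Sketch: allow inbound TCP 4443 (the doppler websocket port) on a
# security group. The group id is a placeholder; narrow the CIDR if you can.
set -euo pipefail

open_doppler_port() {
  local sg_id="$1"
  "${AWS:-aws}" ec2 authorize-security-group-ingress \
    --group-id "$sg_id" \
    --protocol tcp \
    --port 4443 \
    --cidr 0.0.0.0/0
}
```

Usage would look like open_doppler_port sg-0123456789abcdef0, with your real security group id substituted in.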