Stark & Wayne
  • by Xiujiao Gao 高秀娇

What can you do when your app times out connecting to the log server in your CF deployment?

The following error occurred when I pushed my app to CF on AWS.

timeout connecting to log server, no log will be shown
Starting app cf-env in org codex / space cf-app-testing as admin...

Error restarting application: StagerError

To get more information I ran CF_TRACE=true cf push, and then the following message hung there for what felt like forever.

WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1
Host: wss://
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: [HIDDEN]
Origin: http://localhost
Authorization: [PRIVATE DATA HIDDEN]

Since the failure occurred when the CLI sent a request to the doppler server, I ran bosh vms to check whether the doppler VMs were running. I then logged into a doppler server and ran monit summary to check whether all the processes were running.

The output from running monit summary is as follows:

Process 'doppler'                   running
Process 'syslog_drain_binder'       running
Process 'metron_agent'              running
Process 'toolbelt'                  running
System 'system_localhost'           running
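When a VM runs many monit-managed jobs, scanning the summary by eye is error-prone. Here is a small sketch (the `check_monit` helper and the sample input are my own, assuming the standard `monit summary` output format) that prints any process not in the `running` state:

```shell
# Flag monit-managed processes that are not "running".
# Reads `monit summary` output on stdin; prints the offending process
# names and exits non-zero if any were found.
check_monit() {
  awk '/^Process/ && $NF != "running" { print $2; bad = 1 } END { exit bad }'
}

# The heredoc below stands in for `monit summary` on a doppler VM.
check_monit <<'EOF' && echo "all processes running"
Process 'doppler'                   running
Process 'syslog_drain_binder'       running
Process 'metron_agent'              running
Process 'toolbelt'                  running
System 'system_localhost'           running
EOF
```

In a live session you would pipe the real output instead: `monit summary | check_monit`.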

Everything looked good so I then dug through the logs on the doppler server. I saw the following messages in the /var/vcap/sys/log/doppler/doppler.stderr.log file.

panic: sync cluster failed

goroutine 1 [running]:
panic(0xb0d3c0, 0xc8201460f0)
        /var/vcap/data/packages/golang1.6/85a489b7c0c2584aa9e0a6dd83666db31c6fc8e8.1-0ebd71019c0365d2608a6ec83f61e3bbee68493c/src/runtime/panic.go:464 +0x3e6
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)
        /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:58 +0x185
        /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:92 +0x4f9

For some reason, the log cluster could not be synchronized. As recommended by one of my super coworkers, Geoff, I then tried the HM9000 disaster-recovery method, which can be summarized as:

monit stop etcd (on all nodes in etcd cluster)
rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
monit start etcd (one-by-one on each node in etcd cluster)
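The three steps above can be scripted from the BOSH director. Here is a dry-run sketch (the node names and the `bosh ssh` invocation style are assumptions; adjust them to your deployment, and clear DRY_RUN to actually execute):

```shell
# Dry-run sketch of the etcd wipe-and-restart recovery.
# NODES and the bosh ssh syntax are assumptions for illustration.
NODES="etcd_z1/0 etcd_z1/1 etcd_z2/0"
DRY_RUN=echo   # set to empty to really run the commands

# Stop etcd on all nodes first.
for node in $NODES; do
  $DRY_RUN bosh ssh "$node" 'sudo monit stop etcd'
done

# Wipe the etcd store on all nodes.
for node in $NODES; do
  $DRY_RUN bosh ssh "$node" 'sudo rm -rf /var/vcap/store/etcd/*'
done

# Restart one-by-one so the cluster can re-form cleanly.
for node in $NODES; do
  $DRY_RUN bosh ssh "$node" 'sudo monit start etcd'
done
```

With DRY_RUN set, the script only prints the commands it would run, which is a cheap way to review a destructive procedure before executing it.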

It did not solve my problem this time, but it is a good method to know since it may come to the rescue when you are dealing with other logging problems.

Since everything was running and listening properly and the HM9000 reset did not fix the problem, I went back to check the Security Group settings and Routing Tables in my AWS Console; both are listed in the left column of the VPC dashboard. I found that port 4443, which the WebSocket connections use, was not allowed in the Inbound Rules! So I enabled port 4443 for inbound traffic in my Security Group settings. As soon as I did this, cf push worked and the app is now running.
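After opening the port, you can confirm the WebSocket port is reachable before retrying the push. Here is a minimal sketch using bash's built-in /dev/tcp redirection (`doppler.example.com` is a placeholder for your own loggregator endpoint):

```shell
# Check TCP reachability of the loggregator WebSocket port (4443).
# Prints "open" or "closed"; the hostname below is a placeholder.
port_open() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

port_open doppler.example.com 4443
```

If this prints "closed" from the machine you push from, the problem is network-level (security group, routing, firewall) rather than anything inside the doppler job itself.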

For the Chinese version of this article, see: 当你不能连接到Cloud Foundry 中的日志服务器时怎么办?.