What should you do when your app times out connecting to the Cloud Foundry (CF) log server?

When I pushed my app to a CF deployment running on AWS, I got the following error:

timeout connecting to log server, no log will be shown  
Starting app cf-env in org codex / space cf-app-testing as admin...

FAILED  
Error restarting application: StagerError  

To get more detailed error output, I ran CF_TRACE=true cf push. The trace stalled at the following request and never moved past it:

WEBSOCKET REQUEST: [2016-08-17T19:45:38Z]  
GET /apps/e189be2e-770f-4d1c-94e2-d2168f2d292d/stream HTTP/1.1  
Host: wss://doppler.system.staging.xiujiaogao.com:4443  
Upgrade: websocket  
Connection: Upgrade  
Sec-WebSocket-Version: 13  
Sec-WebSocket-Key: [HIDDEN]  
Origin: http://localhost  
Authorization: [PRIVATE DATA HIDDEN]  

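Because the failure occurred while sending a request to the doppler server, I ran bosh vms to check that all doppler VMs were running. I then logged in to a doppler VM remotely and ran monit summary to check that all of its jobs were running.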

The output of monit summary was:

Process 'doppler'                   running  
Process 'syslog_drain_binder'       running  
Process 'metron_agent'              running  
Process 'toolbelt'                  running  
System 'system_localhost'           running  
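
With several doppler VMs to check, scanning this output by eye gets tedious. The following is a minimal sketch of a filter that prints only the entries that are not healthy. The function name not_running is hypothetical, and it assumes the line layout shown above (entry type, quoted name, status last); other monit versions may format the summary differently.

```shell
# Print every monit entry whose status is not "running".
# Reads `monit summary` output on stdin; prints nothing when all is healthy.
# Assumes lines shaped like: Process 'doppler'   running
not_running() {
  awk '/^(Process|System) / && $NF != "running" { print }'
}
```

On a doppler VM it could be used as `monit summary | not_running`. An empty result means every listed process reported running, which was the case here.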

Everything looked healthy, so I went on to the log files. In /var/vcap/sys/log/doppler/doppler.stderr.log I found the following error:

panic: sync cluster failed

goroutine 1 [running]:  
panic(0xb0d3c0, 0xc8201460f0)  
        /var/vcap/data/packages/golang1.6/85a489b7c0c2584aa9e0a6dd83666db31c6fc8e8.1-0ebd71019c0365d2608a6ec83f61e3bbee68493c/src/runtime/panic.go:464 +0x3e6
main.NewStoreAdapter(0xc82004bb00, 0x3, 0x4, 0xa, 0x0, 0x0)  
        /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:58 +0x185
main.main()  
        /var/vcap/data/compile/doppler/loggregator/src/doppler/main.go:92 +0x4f9
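
A panic like this is easy to miss in a long stderr log. As a small sketch, the helper below lists the panic lines in a log file together with their line numbers. The name find_panics is hypothetical, and it assumes panics start at the beginning of a line, as in the trace above.

```shell
# Print every line that starts with "panic:" from the given log file,
# prefixed with its line number. Returns non-zero if none are found.
find_panics() {
  grep -n '^panic:' "$1"
}
```

For the file in this post the call would be `find_panics /var/vcap/sys/log/doppler/doppler.stderr.log`.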

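For some reason, the log server cluster could not sync. My awesome colleague Geoff suggested trying the HM-9000 disaster recovery procedure below. Note that the second step deletes everything in the etcd store directory. In summary: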

monit stop etcd (on all nodes in etcd cluster)  
rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)  
monit start etcd (one-by-one on each node in etcd cluster)  

Unfortunately, this did not fix my problem. I still think it is worth sharing, because it may well resolve other, similar logging issues.

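Since everything appeared to be running fine and the HM9000 reset had not helped, I decided to check my Security Group settings and route tables. I logged in to the AWS Console; both are listed in the left-hand navigation of the VPC service. There I found that port 4443, which the WebSocket connection uses, was blocked in the Security Group associated with the log servers. Once I allowed inbound traffic on port 4443, my app pushed successfully!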
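
A quick way to confirm this kind of problem before opening the AWS Console is to test the port from the machine that runs cf push. The following is a rough sketch, not part of the original fix: the function name port_open is hypothetical, and it assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available. The commented AWS CLI call is the command-line counterpart of the Console change, with placeholder values you would have to fill in.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout
# (default 5 seconds), non-zero otherwise.
# Assumes bash with /dev/tcp support and the coreutils `timeout` command.
port_open() {
  local host="$1" port="$2" secs="${3:-5}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# If the port turns out to be blocked by a Security Group, an ingress rule
# can also be added with the AWS CLI. <sg-id> and <cidr> are placeholders:
#   aws ec2 authorize-security-group-ingress \
#     --group-id <sg-id> --protocol tcp --port 4443 --cidr <cidr>
```

With the host from the trace above, the check would be `port_open doppler.system.staging.xiujiaogao.com 4443 && echo reachable || echo blocked`. A blocked result points at the network path (Security Group, route table, or firewall) rather than at the doppler process itself.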
