Cloud Foundry Loggregator likes to break on me. This is going to be my collection of notes on how I (try to) fix Loggregator.

Problem 1

Background

I've just updated CF from v190 -> v191
CF Version: 191
Date of Problem: 10/30/2014

➜  ~  cf logs galaxy
FAILED  
Error dialing loggregator server: websocket: bad handshake.  
Please ask your Cloud Foundry Operator to check the platform configuration (loggregator endpoint is wss://loggregator.10.244.0.34.xip.io:4443).  

Bad handshake?! Wtf? Let's see what CF_TRACE gives me.
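(Aside: CF_TRACE is just an environment variable the cf CLI reads. A quick sketch of the two usual forms, with a stand-in trace file so the grep is runnable outside a real deployment:)

```shell
# CF_TRACE=true dumps every HTTP/websocket exchange to stdout;
# a file path writes the trace there instead of cluttering normal output:
#   CF_TRACE=true cf logs galaxy
#   CF_TRACE=/tmp/cf-trace.log cf logs galaxy
# Stand-in trace file (the real content comes from the run above):
cat > /tmp/cf-trace.log <<'EOF'
WEBSOCKET RESPONSE: [2014-10-30T17:27:48Z]
HTTP/1.1 404 Not Found
X-Cf-Routererror: unknown_route
EOF
# Pull the websocket response plus two header lines back out of the trace:
grep -A2 'WEBSOCKET RESPONSE' /tmp/cf-trace.log
```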

WEBSOCKET REQUEST: [2014-10-30T17:27:48Z]  
GET /tail/?app=1fd65ad8-4324-4fcd-9b2e-e8b0e0c6d172 HTTP/1.1  
Host: wss://loggregator.10.244.0.34.xip.io:4443  
Upgrade: websocket  
Connection: Upgrade  
Sec-WebSocket-Version: 13  
Sec-WebSocket-Key: [HIDDEN]  
Authorization: [PRIVATE DATA HIDDEN]  
Origin: http://localhost


WEBSOCKET RESPONSE: [2014-10-30T17:27:48Z]  
HTTP/1.1 404 Not Found  
Date: Thu, 30 Oct 2014 17:27:07 GMT  
Content-Length: 81  
Content-Type: text/plain; charset=utf-8  
X-Cf-Requestid: 0fe6dece-f166-4329-4074-dce4d793c116  
X-Cf-Routererror: unknown_route

FAILED  
Error dialing loggregator server: websocket: bad handshake.  
Please ask your Cloud Foundry Operator to check the platform configuration (loggregator endpoint is wss://loggregator.10.244.0.34.xip.io:4443).  
FAILED  

So it's not really my app that's missing: the router answered 404 with X-Cf-Routererror: unknown_route, meaning it has no route for the loggregator endpoint at all :(
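That unknown_route header comes straight from the gorouter: it has no entry for loggregator.10.244.0.34.xip.io in its route table. Routes are registered via router.register messages on NATS and carry a TTL, so if re-registration stops, the route simply expires. As a rough sketch, a register payload looks something like this (the exact field names are my reading of the gorouter's NATS protocol, so treat the shape as an assumption, not something captured from this deployment):

```shell
# Rough shape of a router.register NATS payload (an assumption):
# host/port of the backend plus the URIs the router should map to it.
MSG='{"host":"10.244.0.34","port":8080,"uris":["loggregator.10.244.0.34.xip.io"]}'
# Sanity check that the loggregator URI is present in the payload:
echo "$MSG" | grep -o 'loggregator[^"]*'
# prints: loggregator.10.244.0.34.xip.io
```

When messages like this stop arriving, the router forgets the route and starts answering 404 unknown_route, which is exactly the handshake failure above.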

Investigation

List of jobs upgraded

 Started updating job ha_proxy_z1 > ha_proxy_z1/0
  Started updating job router_z1 > router_z1/0
  Started updating job api_z1 > api_z1/0
  Started updating job nats_z1 > nats_z1/0
  Started updating job api_worker_z1 > api_worker_z1/0
  Started updating job runner_z1
  Started updating job runner_z1 > runner_z1/0
  Started updating job clock_global > clock_global/0
     Done updating job nats_z1 > nats_z1/0 (00:00:23)
     Done updating job api_worker_z1 > api_worker_z1/0 (00:00:29)
     Done updating job ha_proxy_z1 > ha_proxy_z1/0 (00:01:21)
     Done updating job router_z1 > router_z1/0 (00:01:25)
     Done updating job clock_global > clock_global/0 (00:01:31)
     Done updating job api_z1 > api_z1/0 (00:01:57)
     Done updating job runner_z1 > runner_z1/0 (00:03:53)
  Started updating job runner_z1 > runner_z1/1. Done (00:03:18)
  Started updating job runner_z1 > runner_z1/2. Done (00:02:53)
     Done updating job runner_z1 (00:10:04)

Nothing about loggregator...

I got lucky this time and looked at the right machine on the first try. The Loggregator traffic controller logs point to the culprit.

{"timestamp":1414694116.569226742,"process_id":1497,"source":"loggregator trafficcontroller","log_level":"error","message":"Publishing router.register failed: nats: Connection Closed","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/github.com/cloudfoundry/loggregatorlib/cfcomponent/registrars/routerregistrar/router_registrar.go","line":118,"method":"github.com/cloudfoundry/loggregatorlib/cfcomponent/registrars/routerregistrar.func·004"}
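The traffic controller logs are one JSON object per line, which makes them easy to filter. A small sketch for pulling out just the error-level messages (on the real box the log lives under /var/vcap/sys/log/ per the usual BOSH layout; a stand-in file is used here so the commands are runnable anywhere):

```shell
# Traffic controller logs are JSON-per-line; filter down to error entries.
# Stand-in log file; on the real box use the log under /var/vcap/sys/log/.
LOG=/tmp/trafficcontroller.stdout.log
cat > "$LOG" <<'EOF'
{"log_level":"info","message":"Startup: Setting up the loggregator traffic controller"}
{"log_level":"error","message":"Publishing router.register failed: nats: Connection Closed"}
EOF
# Show only error-level entries, extracting just the message field:
grep '"log_level":"error"' "$LOG" | sed 's/.*"message":"\([^"]*\)".*/\1/'
# prints: Publishing router.register failed: nats: Connection Closed
```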

Solution

# on loggregator traffic controller box
monit stop all  
monit start all  

So what happened here: it seems NATS was updated while the loggregator traffic controller was still connected, and over that stale connection it could no longer talk to NATS to register its route with the router, so the router eventually dropped the route. A simple restart of the traffic controller fixed it.
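If this bites again, the condition is easy to detect before anyone notices missing logs: the traffic controller keeps emitting that "Publishing router.register failed" line. A hypothetical watchdog sketch (the real log path under /var/vcap/sys/log/ and the monit invocation are assumptions; a stand-in log file keeps this runnable):

```shell
# Watchdog sketch (hypothetical): flag the traffic controller for restart
# when its log shows it has lost NATS. Stand-in log file for illustration.
LOG=/tmp/tc.log
echo '{"log_level":"error","message":"Publishing router.register failed: nats: Connection Closed"}' > "$LOG"
if grep -q 'nats: Connection Closed' "$LOG"; then
  echo "traffic controller lost NATS, restart needed"
  # on the real box: monit restart all   (or stop all / start all, as above)
fi
```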