Fixing bosh-lite packet loss

I recently had the misfortune of having my bosh-lite instance keel over with 80% packet loss (or higher). Turns out, I was suffering from a routing loop. Here’s how to see if you’re suffering from it, too.

Disable any routes for 10.244.x.x to 192.168.50.4

If you have configured any routes to send traffic to 10.244.x.x through 192.168.50.4, delete them immediately (you can always add them later).

It’ll look something like this (but note that the route command has wildly different syntax depending on what platform you’re on):

route delete -net 10.244.0.0/19 192.168.50.4
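
For example, on Linux with iproute2 the equivalent would look something like the line below (hedged, since your route may have been added differently), while on OS X the BSD-style route delete shown above, prefixed with sudo if needed, should do the job:

sudo ip route del 10.244.0.0/19 via 192.168.50.4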

Watch for outbound 10.244.0.0/19 traffic

vagrant ssh into the bosh-lite, and run a tcpdump, looking for any traffic going out of eth0 with a 10.244.0.0/19 address:

tcpdump -i eth0 dst net 10.244.0.0/19
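
If the output is noisy, tcpdump’s name resolution can make it harder to read; disabling lookups and capping the capture (both standard tcpdump flags) keeps things manageable:

tcpdump -i eth0 -nn -c 50 dst net 10.244.0.0/19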

If everything is happy, you shouldn’t see any traffic here, and the routing loop likely wasn’t your source for packet loss.

If you do see traffic, it indicates that there is a VM somewhere trying to talk to an IP that doesn’t exist, and the traffic is going out the default gateway, rather than directly to the interface warden added for the container. Time to proceed to the next step!
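
If you want to see that fall-through for yourself, the VM’s routing table tells the story. Assuming the usual bosh-lite layout, where warden adds a route for each container’s small subnet, anything under 10.244.0.0/19 that isn’t listed there has nowhere to go but the default route:

ip route show | grep 10.244    # container subnets that are routed locally
ip route show default          # where everything else (including the missing IP) ends up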

Find what’s talking to the wrong IP

Download all your bosh manifests (check out James Hunt’s bosh-sync), and grep them for the 10.244.x.x IPs found in the previous step. This will give you an idea of what VMs are trying to talk to things that don’t exist, and whether the missing VM needs to be created, or communications disabled. In my case, CloudFoundry was trying to send syslog messages to a VM that had not been created.
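
A minimal way to do the search, assuming the manifests are synced to a local directory (the path and 10.244.0.38 below are placeholders; substitute the IPs you saw in tcpdump):

grep -rn '10\.244\.0\.38' ~/bosh-manifests/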

Re-add your routes

Once all the offending communications have been halted, you can re-add your routes to access the bosh-lite VMs from outside vagrant:

route add -net 10.244.0.0/19 192.168.50.4
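
Once the route is back, a quick sanity check from the host is to ping one of the IPs that actually exists in a deployment (the address below is just a placeholder; grab a real one from bosh vms):

ping -c 3 10.244.0.34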

Rejoice!

A long-term fix has been requested in this GitHub issue, and hopefully a patch will be up soon to prevent this from affecting people down the road.
