Part One: Building bridges in the cloud
It started at the CF Summit Basel Hackathon in 2018, and the result is the screencast above.
What did you just look at? Let's start by explaining which stack we're using and what is happening, before explaining in depth what we see in the terminals.
The Cloud Foundry side:
The Kubernetes side:
- We deployed a CFCR based Kubernetes Cluster.
- We built wrappers around the Silk binaries and customized our CFCR deployment; the details are left out for now and will be covered in future blog posts.
- We deployed Kibosh into the CFCR Cluster.
What we realized: We have an easy and well-working solution to create services out of Helm Charts (Thank you, Kibosh), an easy solution to deploy Kubernetes (Thank you, Kubo/CFCR deployment repo), and Cloud Foundry (Thank you too, cf-deployment repo). We did not see an easy solution to consume these services from Cloud Foundry. Why is that?
Vanilla (CFCR) Kubernetes networking works perfectly fine as long as you do not leave your cluster. If your requests are generated from within the cluster, you can rely on Kube-Proxy and the ServiceIP range to access your Pods. Those ServiceIPs do not map to network interfaces on the Workers, but are "Ghost-IPs" that are destination-NATed to one of the corresponding Pod-IPs behind the Service. Since this relies on iptables/ipvs/userspace rules on the Kubernetes Workers, you cannot access those Ghost-IPs from outside the cluster (e.g. from an app running on a Diego-Cell).
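To make the Ghost-IP behaviour more concrete, here is a minimal sketch of the idea in Python. It is a toy model, not how Kube-Proxy is implemented: the addresses, the `SERVICE_TABLE` mapping and the `dnat` function are all made up for illustration, and the random backend choice is a simplification.

```python
import random

# Hypothetical rule table as it might exist on a Kube-Worker:
# virtual ServiceIP:port -> Pod endpoints behind the Service.
SERVICE_TABLE = {
    ("10.100.200.10", 80): [("10.200.1.5", 8080), ("10.200.2.7", 8080)],
}


def dnat(dst_ip, dst_port, table=SERVICE_TABLE, rng=random):
    """Rewrite a ServiceIP destination to one of its Pod endpoints.

    Returns None when the host has no rule for that destination,
    which models a host outside the cluster.
    """
    backends = table.get((dst_ip, dst_port))
    if not backends:
        return None
    return rng.choice(backends)


# On a Worker the rules exist, so the Ghost-IP resolves to a Pod endpoint.
print("worker:", dnat("10.100.200.10", 80))

# A Diego-Cell has no such rules, so nothing answers for the Ghost-IP.
print("diego-cell:", dnat("10.100.200.10", 80, table={}))
```

The second call returns `None` because the translation only exists where the rules are installed, which is the reason a ServiceIP is not reachable from a Diego-Cell.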
Once you need to access your Pods from outside the cluster, you will have to choose between: a) NodePort Service, b) LoadBalancer Service, and c) Ingress Controller.
Assume one of these scenarios: you created a service instance and now need to expose the endpoints to the consumer (Layer 4 connection from CF to Kubernetes), you deployed a HTTP Microservice to Kubernetes that needs to be made available for a Cloud Foundry app (Layer 7), or in general, you want bidirectional Layer 4/7 traffic between Cloud Foundry and Kubernetes workloads for performance (least amount of network hops) or security reasons (Mutual TLS).
If you are relying on NodePorts, the biggest issue is how to deal with potentially changing IPs on your Kube-Workers. This includes scaling up your Workers and thus creating new available endpoints, scaling down your Workers and removing in-use endpoints, or recreating your Workers and changing in-use endpoints. This could be solved by a Layer 4 load balancer in front of the Workers, but that just makes the problem someone else's responsibility and introduces dependencies on the load balancer system and troubles with Network Access Control. Lastly, there is the fact that you cannot grow your environment "indefinitely", as the total number of NodePorts in one Kubernetes Cluster (not just on one Worker) is limited.
If you are relying on a Service of type LoadBalancer, you need an IaaS layer that supports this. Examples of an IaaS without support for LoadBalancer Services include vSphere without NSX-T, OpenStack without an LBaaS engine, and a Kubernetes Cluster deployed on bare metal. Additionally, this approach will consume resources for every created service: load balancers are usually not free, and they add hops to every request.
If you are relying on Ingress Controllers for external exposure, depending on your Controller and its configuration, you might find the same issues as in (a), (b), or both.
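To put a rough number on the NodePort limit mentioned above, here is a small back-of-the-envelope sketch. It assumes the upstream Kubernetes default NodePort range of 30000-32767; a cluster can be configured with a different range, and the figure of three ports per service instance is purely hypothetical.

```python
# Upstream default NodePort range; a given cluster may override it.
DEFAULT_NODE_PORT_RANGE = (30000, 32767)


def node_port_capacity(port_range=DEFAULT_NODE_PORT_RANGE):
    """Number of NodePorts available to the whole cluster (range is inclusive)."""
    first, last = port_range
    return last - first + 1


def max_service_instances(ports_per_instance, port_range=DEFAULT_NODE_PORT_RANGE):
    """How many service instances fit if each one exposes a fixed number of NodePorts."""
    return node_port_capacity(port_range) // ports_per_instance


print(node_port_capacity())      # cluster-wide NodePort budget
print(max_service_instances(3))  # hypothetical: 3 exposed ports per instance
```

Adding more Workers does not raise this budget, since a NodePort is reserved cluster-wide.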
What to consider:
Looking at the architecture of the Cloud Foundry and Kubernetes container hosts, we realized that both Garden/runC and Kubernetes/Docker rely on CNI for container networking. By default, Cloud Foundry uses Silk and Kubernetes/CFCR uses Flannel. Silk and Flannel work similarly from a network architecture perspective.
Silk and Flannel:
- Rely on creating a layer 3 Bridge on each host.
- Assign a smaller (e.g. /24) subnet out of the whole (e.g. /16) container overlay range to a host.
- Let hosts deal with IP Address Management.
- Rely on the hosts for routing by creating NAT-rules for incoming/outgoing requests.
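The subnet-per-host scheme in the list above can be sketched with Python's standard `ipaddress` module. This only illustrates the addressing arithmetic, not how Silk or Flannel lease subnets; the overlay range and the host names are assumptions chosen for the example.

```python
import ipaddress

# Assumed example overlay range; the text only says "e.g. /16".
overlay = ipaddress.ip_network("10.255.0.0/16")


def carve_host_subnets(overlay_net, host_prefix=24):
    """Split the overlay range into per-host subnets of the given prefix length."""
    return list(overlay_net.subnets(new_prefix=host_prefix))


host_subnets = carve_host_subnets(overlay)

# Hypothetical hosts; each one gets its own subnet and hands out
# container addresses from it (host-local IP address management).
hosts = ["diego-cell-0", "diego-cell-1", "kube-worker-0"]
leases = dict(zip(hosts, host_subnets))

for host, subnet in leases.items():
    first_container_ip = next(subnet.hosts())
    print(host, subnet, first_container_ip)
```

Because every host owns a distinct slice of the same overlay range, a container address also tells you which host to route to, regardless of whether that host is a Diego-Cell or a Kube-Worker.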
From this point on, we could have gone two ways:
- Make Flannel work on Diego-Cells and lose Cloud Foundry C2C-Policies (Network Access Control).
- Make Silk work on Kubernetes and get API driven policies for free.
(Read up on the basket of nice that is Silk).
We found that Silk has a few more integration points to Cloud Foundry (mainly the Policy API) than Flannel has to Kubernetes, thus we decided to find out what it would take to make Silk play nice with Kube-Workers.
This will be our target setup:
For the love of Cthulhu, why?:
Having the same network available from either type of container would help dramatically with spanning workloads across Cloud Foundry and Kubernetes, in either direction, in a dynamic and responsive way. Additionally, our containers are blissfully ignorant of the fact that they're hosted on different underlying systems, which potentially live on different VM networks.
- We do not rely on deploying additional components (LoadBalancers/Ingress Controller).
- We do not rely on containers having knowledge about their host network (IP:Port combination of an app instance on a Diego-Cell or a Pod on a Kube-Worker).
- We are not unnecessarily exposing our apps/pods via their hosts: the containers are only accessible via the Silk bridge, and thus routing is only available to their hosts.
- We can use CF native C2C Service-Discovery to expose Pod/App IPs via hostnames, as opposed to making it the app's responsibility to do service discovery (e.g. by adding Eureka to our deployments) or forcing the app to have access to Cloud Foundry/Kubernetes APIs to do lookups.
- We do not introduce additional APIs for Cloud Foundry/Kubernetes to deal with as we're completely reusing existing Cloud Foundry functionality.
Finally, let's watch the screencast again to understand what is happening:
On the right, you see three panels. From top to bottom:
This panel runs `watch "kubectl get namespaces | grep -v Terminating"`. Our service Instance is created via Kibosh. Kibosh will deploy every Instance into its own Namespace. Once we trigger `cf create-service ...`, we see a new namespace pop up.
This panel runs `watch "kubectl get pods -o wide --all-namespaces | grep kibosh"`. Shortly after the Namespace is created, Kibosh starts to deploy the Pods into that namespace. We added `-o wide` to be able to see the created Pods' IP addresses.
This panel runs `dmesg -w -T | grep DENY`. It shows the CF Container to Container Policy system being applied to our Kubernetes Pods; Silk's Access Control is still working.
On the left side you see the commands we run. Let's go through it chronologically:
We start by running `cf create-service ...` to let the marketplace talk to Kibosh to create our Service Instance and wait for the creation to finish.
Once the creation has finished, we use `cf bind-service` to tell Cloud Foundry to inject the binding/credentials into our App on the next restage.
We run `cf restage` on our App so the Container gets recreated and injected with the Binding.
We run `cf ssh ...` to SSH into one of our App Instances and look at the VCAP_SERVICES ENV Var to find one of the Ports exposed by our Service Instance.
We try to run a curl against an HTTP endpoint the Service provides and see that the logs in the bottom-right panel show that the connection gets denied by CF-Policies.
We run enable_access.sh and disable_access.sh. These are just wrappers that run `cf curl` against the External Policy API. As you can see in the output of the script, it creates a Policy that allows incoming traffic on our Pod (or rather its parent entity) from our CF-App (or rather the set of containers considered our App). After running enable_access.sh, we see that the curl now succeeds. One of our later blog posts will elaborate on how we plan to apply the CF Policy system to Pods.
We repeat the process to enable access to the kube-dns Pod for our example app. Finally, we do a DNS lookup for google.de via the kube-dns Pod to finish our screencast.
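The contents of enable_access.sh are not shown in this post, so the following is only a sketch of the kind of request such a wrapper might build. It assumes the v1 External Policy API endpoint `/networking/v1/external/policies` and its `source`/`destination` payload shape; the GUIDs and the port are placeholders, and how a Pod maps to a destination ID is left open until a later post.

```python
import json
import shlex

# Assumed endpoint of the CF networking External Policy API (v1).
POLICY_ENDPOINT = "/networking/v1/external/policies"


def build_policy(source_guid, destination_guid, port, protocol="tcp"):
    """Build a payload allowing traffic from source to destination on one port."""
    return {
        "policies": [
            {
                "source": {"id": source_guid},
                "destination": {
                    "id": destination_guid,
                    "protocol": protocol,
                    "ports": {"start": port, "end": port},
                },
            }
        ]
    }


def cf_curl_command(payload, method="POST"):
    """Render the cf curl invocation as a shell-safe string (nothing is executed)."""
    body = shlex.quote(json.dumps(payload))
    return " ".join(["cf", "curl", POLICY_ENDPOINT, "-X", method, "-d", body])


# Placeholder identifiers; real values would come from the CF app and
# whatever entity represents the Pod on the policy side.
payload = build_policy("APP-GUID-PLACEHOLDER", "POD-ENTITY-GUID-PLACEHOLDER", 8080)
print(cf_curl_command(payload))
```

The sketch only prints the command, so it can be inspected without touching a real environment.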
Thank you for reading, and stay tuned to find out what we did to make it work in our upcoming blog posts:
Part One: Building bridges in the cloud
Part two: Approaching the problem without rewriting existing code
Part three: There is no better way to learn than trial and error