This one goes out to all the cluster operators.
It really isn't fair. All those application folks get to play with cool stuff like automatic traffic routing based on label matches, process resurrection, and more. All services provided by the clusters we run. We're still on the hook for figuring out how to spin those clusters, configure them, patch them, etc, but we don't get to use Kubernetes for that.
Or, do we?
I'm here to tell you that you can use Kubernetes to manage parts of Kubernetes itself.
No, I'm not talking about static pods for the control plane, although that is a great strategy. I'm talking about applying all those other layers of software, systems, and configurations that invariably accompany running anything in production.
We're going to use DaemonSets.
There's a ton of stuff we can do with the ability to run a single pod on each node in our cluster. To keep this blog post from approaching book-length, I'm going to focus in on three tasks I want to accomplish:
- Provision a named admin account for SSH access & troubleshooting
- Modify some kernel tuning parameters via
- Run ClamAV for scanning nodes for unsavory files
Provisioning SSH Access to Nodes
This one works especially well on managed Kubernetes offerings like Linode Kubernetes Engine (LKE), Amazon's EKS, etc.
--- apiVersion: apps/v1 kind: DaemonSet metadata: name: cluster-admin spec: selector: matchLabels: aikido: cluster-admin template: metadata: labels: aikido: cluster-admin spec: volumes: - name: hostfs hostPath: path: / initContainers: - name: init image: alpine command: - /bin/sh - -xc - | chroot /host getent passwd jhunt || chroot /host useradd -m jhunt \ && chroot /host chsh -s /bin/bash jhunt \ && echo "jhunt ALL=(ALL:ALL) NOPASSWD:ALL" > /host/etc/sudoers.d/jhunt \ && mkdir -p /host/home/jhunt/.ssh \ && chmod 0700 /host/home/jhunt/.ssh \ && chroot /host chown -R jhunt:jhunt /home/jhunt \ && echo "ssh-PUBLIC-KEY-HERE" > /host/home/jhunt/.ssh/authorized_keys \ && echo done volumeMounts: - name: hostfs mountPath: /host containers: - name: sleep image: filefrog/k8s-hacks:pause
That's a lot. Let's take it bit by bit.
We're setting up a DaemonSet, which means we'll get exactly one pod on each cluster node (kubelet) that we have. That holds true now, and in the future as we scale the cluster up or down. New nodes will automatically gain one of these
The pods themselves consist of two containers: an init container that does all of the work, and a regular container that just sleeps, to make Kubernetes happy. We have a volume, called
hostfs that represents the node filesystem (note the
path: / under the volume's
The init container mounts this volume, and it does so in read-write mode. This is risky, so if and when you do this for real, audit the images you are running!
The init container (an alpine base image) does most of its stuff via an explicit command. You can also do this via a ConfigMap, if you want. That command is a rough translation of this shell script:
# add `jhunt` to /etc/passwd, /etc/shadow, # et al, if the account doesn't already exist # if ! chroot /host getent passwd jhunt; then chroot /host useradd -m jhunt fi # he'll use bash thank you very much. chroot /host chsh -s /bin/bash jhunt # jhunt needs full sudo access... echo "jhunt ALL=(ALL:ALL) NOPASSWD:ALL" > /host/etc/sudoers.d/jhunt # set up for SSH key-based authentication. mkdir -p /host/home/jhunt/.ssh chmod 0700 /host/home/jhunt/.ssh chroot /host chown -R jhunt:jhunt /home/jhunt echo "ssh-PUBLIC-KEY-HERE" > /host/home/jhunt/.ssh/authorized_keys
Once the init container has done its thing, I should be able to access any cluster node via SSH, using my private key. Once on-box, I'll be comfortably at home in bash, with full root access to the box via sudo.
Most of the trickier parts of this configuration rely on knowing when to
chroot into the
/host mountpoint, which you'll recall is the actual root filesystem of each Kubernetes node. For example, when checking for the existence of the "jhunt" account, we have to
chroot /host getent ..., otherwise we'll be looking at the init container's user database.
After the init container winds down, Kubernetes is left with nothing to execute but the
filefrog/k8s-hacks:pause image. You can find the full code for that over at its GitHub repo, https://github.com/jhunt/k8s-hacks/tree/master/pause.
Tuning Node Kernel Parameters
We recently ran into a particularly nasty bug involving unbearable and inexplicable network slowdowns between Linux hosts. I won't bore you with the details in this blog post – although I do reserve the right to devote an entire, other blog post to that very subject – but at the end of the day, the culprit was
net.ipv4.tcp_tw_recycle having been enabled.
Unless you know precisely what you are doing, it's best to just turn it off.
We can use our DaemonSet pattern to do just that, on all nodes, for all time.
--- apiVersion: apps/v1 kind: DaemonSet metadata: name: sysctl spec: selector: matchLabels: aikido: sysctl template: metadata: labels: aikido: sysctl spec: hostNetwork: yes initContainers: - name: init image: alpine command: - /bin/sh - -c - sysctl net.ipv4.tcp_tw_recycle=0 securityContext: privileged: true containers: - name: sleep image: filefrog/k8s-hacks:pause
There's a few things to note about this example.
First and foremost, there is no
hostPath mount. We don't need one in this case, since we have access to
/proc, and all we are doing is running
sysctl to change kernel parameters.
Secondly, this is a more-privileged-than-normal configuration. The entire pod exists in the host network namespace (
hostNetwork: yes). It must, because we want to interact with the tuning parameters for the host's network interfaces (i.e., the real
eth0). The init container is also
privileged: yes, since it needs real root permission to change settings in the kernel.
A really great thing about this setup is that now, no matter what the underlying AMI, golden disk image, or BOSH stemcell, is doing, as soon as a node is put into this cluster, it won't be recycling TCP
Running ClamAV Against Nodes
Finally, we're going to use a DaemonSet to install a cluster-wide, per-node software addon: ClamAV antivirus.
The particulars of the actual implementation of ClamAV in OCI images is a bit beyond the scope of this blog post. I've packaged up the image in an upstream git repository, available at https://github.com/filefrog/clamav.
Here's a cut-down version of the Kubernetes YAML spec for this little hack, edited for brevity and clarity. The full version can be found here.
--- apiVersion: apps/v1 kind: DaemonSet metadata: name: clamd spec: selector: matchLabels: aikido: clamd template: metadata: labels: aikido: clamd spec: volumes: - name: host hostPath: path: / containers: - name: clamd image: filefrog/clamav volumeMounts: - name: host mountPath: /host readOnly: yes
There's a lot more to this in the real gist. It includes a second container for regularly updating the virus definitions database, properly initializes those databases via an init container, and provides some additional configuration to make scanning more resilient.
The easiest way to apply this to your cluster is via URL:
$ kubectl apply -f https://bit.ly/2vSOxwm configmap/clamav created daemonset.apps/clamd created $ kubectl get pods NAME READY STATUS RESTARTS AGE clamd-swgkb 2/2 Running 0 63s
(Note: you should see one container per cluster node; my demo environment is a single-node cluster, so I only have the one pod.)
Don't worry if it seems like it's taking forever to come up –
clamd has a lot of virus definitions to read in. Once both containers are in the Ready state, you'll be ready to start interacting with ClamAV.
The DaemonSet doesn't have any Service instances in front of it. While
clamd is listening on TCP/3310, on all interfaces, it is not in the
hostNetwork namespace, so the only way to reach it is via either port-forwarding or
(We explicitly do not expose the TCP/3310 port via a ClusterIP, NodePort, or LoadBalancer Service because ClamAV doesn't perform any authentication against clients connecting to it. This means anyone who can open a connection can ask clamd to scan whatever they want. This is a gaping security hole, and a very low-risk amplificaiton attack / denial of service. You can read more here.)
First, we'll try port-forwarding and running telnet locally.
$ kubectl port-forward clamd-swgkb 3310:3310 Forwarding from 127.0.0.1:3310 -> 3310 Forwarding from [::1]:3310 -> 3310 $ telnet 127.0.0.1 3310 Trying ::1... Connected to localhost. Escape character is '^]'.
The image we're using to run ClamAV has a copy of the EICAR test file, a benign payload that is explicitly recognized by most antivirus solutions (including ClamAV) and categorized as a positive payload match.
We can use the
SCAN command to test this out:
$ telnet 127.0.0.1 3310 Trying ::1... Connected to localhost. Escape character is '^]'. SCAN /var/lib/eicar/eicar.com /var/lib/eicar/eicar.com: Win.Test.EICAR_HDB-1 FOUND Connection closed by foreign host.
Next, we'll try to just exec clamdscan inside the container:
$ kubectl exec -it clamd-swgkb -c clamd \ -- clamdscan /var/lib/eicar/eicar.com /var/lib/eicar/eicar.com: Win.Test.EICAR_HDB-1 FOUND ----------- SCAN SUMMARY ----------- Infected files: 1 Time: 0.022 sec (0 m 0 s) command terminated with exit code 1
Now that we're satisfied that ClamAV is in fact running, and can detect malware and other unsavory and dangerous payloads, we can scan the whole
/host mount, looking for dangerous things on the cluster nodes themselves.
$ kubectl exec -it clamd-swgkb -c clamd \ -- clamdscan /host /host: OK ----------- SCAN SUMMARY ----------- Infected files: 0 Time: 291.040 sec (4 m 51 s)
Go Forth and Manage!
We can do a lot with Kubernetes to manage Kubernetes. If you can find or build an OCI image to package up the software, you can use namespaces, extra capabilities, and host filesystem mounts (if your Pod Security Policies allow it) to do all the things you normally resort to configuration management and other platform automation tools for.