Protecting Yourself with Pod Security Policies

I listen to a lot of folks talk about their Kubernetes strategy as a means of apportioning a finite, limited resource (compute) among a wide and varied set of people, usually application developers and operations nerds, with an eye toward isolation.

I have bad news for you.

Kubernetes isn’t about isolation, not in the security sense of the word anyway.

If you reduce containers down to their base essence (and I’m going to take a few liberties here, so bear with me), it’s about processes. Processes. Program binaries executing code in a virtually unique memory address space. Same kernel. Same user/group space. Sometimes, same filesystem and same PID space.

It’s all an elaborate set of carefully constructed smoke and mirrors that lets the Linux kernel provide different views of shared resources to different processes.
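You can poke at that machinery on any Linux box: the kernel exposes each process's current set of namespaces as symlinks under /proc. A quick sketch (Linux-only; nothing here is Kubernetes-specific):

```shell
# Each entry here (mnt, pid, net, uts, ipc, ...) is one of the "mirrors":
# a handle to the namespace this process currently lives in.
ls -l /proc/self/ns/

# The symlink target encodes the namespace's identity. Two processes that
# print the same inode here share the same mount namespace -- the same "view".
readlink /proc/self/ns/mnt
```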

This has been most handy when paired with the OCI image standard, and some best practices from the Docker ecosystem – every container gets its own PID namespace; every container brings its own root filesystem with it, etc.

But you don’t have to abide by those rules if you don’t want to.

To wit: kubectl r00t:

This little gem blows up the security charade of containers. Let’s unpack this, piece by piece.

#!/bin/bash
exec kubectl run r00t -it --rm \
  --restart=Never \
  --image nah \
  --overrides '{"spec":{"hostPID": true, "containers":[{"name":"x","image":"alpine","command":["nsenter","--mount=/proc/1/ns/mnt","--","/bin/bash"],"stdin": true,"tty":true,"securityContext":{"privileged":true}}]}}' "$@"

The kubectl run r00t -it --rm --restart=Never bits tell Kubernetes that we want to execute a single Pod (no Deployment here thank you very much), and when that Pod exits, we’re done. Think of it as an analog to docker run -it --rm.

The next bits --image nah and --overrides ... let us modify the generated YAML of the Pod resource. The kubectl run command requires that we specify an image to run and a name for the pod, but we’re just going to override those values with --overrides, so you can put (quite literally) anything you want here.

That brings us to the JSON overrides. For sanity’s sake, let’s reformat that blob of JSON to be a bit more readable, and turn it back into YAML via spruce:

spec:
  hostPID: true
  containers:
    - image: alpine
      name:  x
      stdin: true
      tty:   true
      securityContext:
        privileged: true
      command: [nsenter, --mount=/proc/1/ns/mnt, --, /bin/bash]

The first thing we do is pop this pod (and all of its containers) into the Kubernetes node’s PID namespace, via hostPID. By default, containers get new process ID namespaces inside the kernel – the first process executed becomes “PID 1”, and gets all the benefits that PID 1 normally gets – automatic inheritance of child processes, special signal delivery, etc. A side-effect of being in the Kubernetes node’s PID namespace is that /proc/1 refers to the actual init process of the VM / physical host – this will become exceedingly important in just a bit.
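To make the namespace-identity point concrete, here’s a quick check you can run in any ordinary Linux shell (no cluster needed; this just illustrates the mechanism):

```shell
# Two sibling processes spawned normally stay in the same PID namespace:
# their /proc/self/ns/pid symlinks name the same inode.
a=$(readlink /proc/self/ns/pid)
b=$(sh -c 'readlink /proc/self/ns/pid')
[ "$a" = "$b" ] && echo "same PID namespace"

# A normal container gets a *fresh* inode here; hostPID: true opts out of
# that, so the pod shares the node's namespace and /proc/1 is the real init.
```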

Next up, we start modifying the (only) container in the pod. We choose the alpine image, because it is small and likely to be present in the image cache already. We pick an arbitrary name for the container (x), turn on standard input attachment (stdin: true) and teletype terminal emulation (tty: true) so that we can run an interactive shell.

Then, we set the security context of the running container to be privileged – this provides us all of the normal Linux capabilities you’d come to expect from being the root user on a Linux box.

Finally, the coup de grâce: the command we want this container to execute is nsenter, a handy (and flexible!) little utility for munging and modifying our current Linux namespaces; the foundation on which, combined with cgroups, all this containerization stuff is built. We’re already in the host’s process ID namespace, but we are jailed inside of our own filesystem namespace. To get out we can take advantage of the fact that /proc/1 is the real Linux init (systemd) process, so /proc/1/ns/mnt is the outermost mount namespace, i.e. the real root filesystem!
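There’s no magic in the wrapper itself, either: nsenter is a small util-linux tool around the setns(2) syscall, and the escape can be typed by hand from inside the privileged, hostPID container (hypothetical session; what you go poking at afterward is up to you):

```shell
# Enter the mount namespace of PID 1 (the node's init) and run a shell there.
# Requires privilege (CAP_SYS_ADMIN) -- which our securityContext granted.
nsenter --mount=/proc/1/ns/mnt -- /bin/sh

# --mount alone is enough to see the node's real root filesystem; nsenter
# can also switch --pid, --net, --uts, and --ipc for a fuller takeover.
```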

Let’s give it a go:

$ kubectl r00t
If you don't see a command prompt, try pressing enter.
node/f8f9b380-f4a6-4b17-84d0-996962f7b106:/# ps -ef | grep ku[b]elet
root      6788     1  2 Feb12 ?        14:15:41 kubelet --config=...

There you have it. On my EKS cluster, this is the easiest and best way to pop a root shell and go snooping through kubelet configurations, changing things as I need to. Handy for me, but probably not something that would make the cluster operator sleep well at night.

Are you that cluster operator?

This exploit works because of several collaborating reasons:

  1. I was able to create a privileged: true Pod
  2. I was able to create a Pod in the hostPID namespace
  3. I was able to run a Pod as the root user (UID/GID of 0/0)
  4. I was able to run a Pod with stdin attached, and a controlling terminal.

If you take away any of those capabilities, the above attack vector stops working. Let’s take away as many of those capabilities as we can, using Pod Security Policies.

A Pod Security Policy lets you prohibit or allow certain types of workloads. They work with the Kubernetes role-based access control (RBAC) system to give you flexibility in what you allow and who you allow it to.

In the rest of this post, we’re going to create a namespace and a service account that can deploy to it. We’ll verify that the service account can do bad things first, before we implement a security policy that prohibits such shenanigans.

Here’s the YAML bits for creating our demo namespace and service account:

---
apiVersion: v1
kind: Namespace
metadata:
  name: psp-demo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: psp-demo
  name: psp-demo-sa

This gives us a namespace named psp-demo, and a service account in that namespace, named psp-demo-sa. We will be impersonating that service account later, when we attempt to live under the constraints of our security policy.

Next up, we need to set up some basic RBAC access to allow psp-demo-sa to deploy Pods. This is only because we want to demo Pod creation as the service account!

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: psp-sa
  namespace: psp-demo
rules:
  - apiGroups: ['']
    resources: [pods]
    verbs:     [get, list, watch, create, update, patch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-sa
  namespace: psp-demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind:     Role
  name:     psp-sa
subjects:
  - kind: ServiceAccount
    name: psp-demo-sa
    namespace: psp-demo

The new (namespace-bound) role psp-sa is bound to the psp-demo-sa service account and allows it to do pretty much anything with Pods. Note: this does preclude us from creating Deployments, StatefulSets, and the like. That’s solely by virtue of the role assignments, and has nothing to do with our Pod Security Policies.

Our first security policy is called privileged, and it encodes the most lax security we can specify. This will be reserved for people we trust with our lives (and our cluster!), and serves to show what happens when a user or service account can’t use a policy that exists.

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostIPC:     true
  hostPID:     true
  hostPorts: [{ min: 0, max: 65535 }]
  runAsUser:          { rule: RunAsAny }
  seLinux:            { rule: RunAsAny }
  supplementalGroups: { rule: RunAsAny }
  fsGroup:            { rule: RunAsAny }

The next policy is much more restricted. It’s even named restricted! It locks down almost everything we can:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged:               false
  allowPrivilegeEscalation: false
  requiredDropCapabilities: [ALL]
  readOnlyRootFilesystem:   false
  hostNetwork: false
  hostIPC:     false
  hostPID:     false
  runAsUser:
    # Require the container to run without root privileges.
    rule: MustRunAsNonRoot
  seLinux:
    # Assume nodes are using AppArmor rather than SELinux.
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges: [{ min: 1, max: 65535 }]
  fsGroup:
    rule: MustRunAs
    ranges: [{ min: 1, max: 65535 }]
  # Allow core volume types.
  volumes:
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim

That’s worth reading over a few times to make sure you’ve got it all. The salient bits (insofar as our attack vector is concerned) are thus:

  • We disallow hostPID Pods / containers
  • We don’t allow directories on the host to be bind-mounted into containers.
    (There’s no hostPath listed in the allowed volume types list)
  • Pods must specify users to run as, and those UIDs cannot be 0. No root!
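Flipping that around: to be admitted under restricted, a pod spec has to volunteer the right settings. A minimal sketch (the UID/GID of 1000 is an arbitrary non-root example, not something the policy specifically mandates):

```yaml
# Hypothetical pod spec fragment that would pass the `restricted` policy:
spec:
  securityContext:
    runAsUser: 1000        # any non-zero UID; MustRunAsNonRoot rejects 0
    runAsNonRoot: true
    fsGroup: 1000          # falls inside the MustRunAs range of 1-65535
  containers:
    - name: app
      image: alpine
      securityContext:
        allowPrivilegeEscalation: false
        capabilities: { drop: [ALL] }
```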

With those YAMLs applied to the cluster, we can list our policies:

$ kubectl get psp
NAME         PRIV   CAPS   SELINUX    RUNASUSER          FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
restricted   false         RunAsAny   MustRunAsNonRoot   MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
privileged   true   *      RunAsAny   RunAsAny           RunAsAny    RunAsAny    false            *

Right now, these policies are inert. No one is allowed to use them, which means that no one will be able to create any Pods. To activate these policies, we need to grant users and service accounts the use verb against the policy resources. For that, we’ll use a new Cluster Role and a Cluster Role Binding.

First, the Cluster Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: default-psp
rules:
  - apiGroups:     [policy]
    resources:     [podsecuritypolicies]
    resourceNames: []
    verbs:         [list, get]
  - apiGroups:     [policy]
    resources:     [podsecuritypolicies]
    resourceNames: [restricted]
    verbs:         [use]

This role is allowed to list and get all security policies, but only use the restricted policy.

Next, we bind the Cluster Role to all users (via the system:authenticated group) and all service accounts (via the system:serviceaccounts group):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default-psp
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind:     ClusterRole
  name:     default-psp
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind:     Group
    name:     system:authenticated # All authenticated users
  - apiGroup: rbac.authorization.k8s.io
    kind:     Group
    name:     system:serviceaccounts

Now, we need to impersonate our demo service account. For that, we can use the --as flag to kubectl:

$ kubectl --as=system:serviceaccount:psp-demo:psp-demo-sa get pods
No resources found in psp-demo namespace.

I hate typing. I hate making other people type. We’re going to alias that big --as flag as ku (which is way easier on the keyboard):

$ alias ku='kubectl --as=system:serviceaccount:psp-demo:psp-demo-sa'
$ ku get pods
No resources found in psp-demo namespace.

Now, we can explore with kubectl auth can-i:

$ ku auth can-i create pods
yes
$ ku get psp -o name
podsecuritypolicy.policy/privileged
podsecuritypolicy.policy/restricted
$ ku auth can-i use psp/privileged
no
$ ku auth can-i use psp/restricted
yes

Note: if you get warnings like Warning: resource 'podsecuritypolicies' is not namespace scoped in group 'policy', don’t worry. I get them too, and from what I’ve been able to tell from random Internet searches, they aren’t anything to worry about.

This tells us that we are able to use the restricted policy, but not the privileged policy; so our attempts at breaking in should no longer bear fruit:

$ kubectl r00t --as=system:serviceaccount:psp-demo:psp-demo-sa
Error from server (Forbidden): pods "r00t" is forbidden: unable to validate against any pod security policy: [spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]



Where To From Here?

Armed with your newfound expertise in Pod Security Policies, go forth and secure your Kubernetes clusters! A few things to try from here include:

  1. Letting actual cluster admins create privileged pods
  2. Allowing some capabilities to certain specific service accounts
  3. Auditing all of your service accounts and what they can do under your PSPs
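As a sketch of that first item (every name below is mine; adapt to taste), you could pair the privileged policy with a ClusterRole that only your admins are bound to:

```yaml
# Hypothetical: let cluster admins (and only them) use the `privileged` policy.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: admin-psp
rules:
  - apiGroups:     [policy]
    resources:     [podsecuritypolicies]
    resourceNames: [privileged]
    verbs:         [use]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-psp
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind:     ClusterRole
  name:     admin-psp
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind:     Group
    name:     system:masters
```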

Happy Hacking!
