cgroup: devices updates appear to be broken #2366

@cyphar

This affects both versions, but in quite different ways:

  • For cgroupv1, "Don't deny all devices when update cgroup resource" (#2205) highlighted that on device cgroup updates we temporarily block all devices. This results in spurious errors in the container (such as programs being unable to open /dev/null). We've seen this happen on customer systems under Kubernetes, so this is definitely a real issue.

    • This is more complicated than it first appears, because runc implements the spec incorrectly here -- technically runc is a black-list by default, and users have to explicitly convert it into a white-list. Aside from not following the spec, this is a worrying security stance.
  • For cgroupv2, devices cgroup updates are implemented by appending a new BPF program to the cgroup. This means that only new denials have an effect (an update can never re-allow a previously denied device), so updates are incorrectly implemented. (EDIT: This also means that we "leak" eBPF programs, and thus after applying updates 64+ times we start getting errors -- see "api, cgroupv2: skip setting the devices cgroup" (#2474).)

    • Unfortunately this is a bit complicated to fix, but I have figured out how to do it. We need to make an eBPF map of type BPF_MAP_TYPE_PROG_ARRAY and then tail-call into it from a small stub eBPF program which we attach to the actual cgroup. This will allow us to atomically update the devices cgroup rules (there is no way to atomically replace an eBPF program with BPF_F_ALLOW_MULTI -- and without any program, all device accesses would be permitted).
    • Ignore the above -- you cannot bpf_tail_call from cgroup programs. So we will instead need to implement it through an eBPF map (which we can atomically replace by mis-using BPF_MAP_TYPE_ARRAY_OF_MAPS).
    • This is all slightly complicated by the fact that the entire API is fd-based (and we don't have our own monitor process so we can't stash away the fd). But luckily there is a lookup-by-id system which we can use to get the file descriptor, though the ids can be recycled so we'll need to be careful to make sure we don't start touching the wrong eBPF map -- and unlike eBPF programs, eBPF maps don't store information about when they were created.
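
To make the cgroupv2 point concrete, here is a minimal Go sketch of why appending programs can only add denials. It is a toy model, not runc code and not real eBPF: the `deviceRule`, `program`, and `cgroup` types are hypothetical, and it assumes that a device access is permitted only if every attached program permits it, and that a cgroup with no programs permits everything.

```go
package main

import "fmt"

// deviceRule is a hypothetical, simplified device rule.
// A major or minor of -1 acts as a wildcard.
type deviceRule struct {
	major, minor int64
	allow        bool
}

// program models one attached device filter: the first matching
// rule decides, and anything unmatched is denied (white-list).
type program []deviceRule

func (p program) permits(major, minor int64) bool {
	for _, r := range p {
		majorOK := r.major == -1 || r.major == major
		minorOK := r.minor == -1 || r.minor == minor
		if majorOK && minorOK {
			return r.allow
		}
	}
	return false
}

// cgroup models a cgroup with several filters attached at once.
type cgroup struct {
	progs []program
}

// appendProgram mimics an "update" that attaches one more program
// without detaching the old ones.
func (c *cgroup) appendProgram(p program) {
	c.progs = append(c.progs, p)
}

// permits assumes the verdicts of all attached programs are ANDed.
// With no programs attached, every access is permitted.
func (c *cgroup) permits(major, minor int64) bool {
	for _, p := range c.progs {
		if !p.permits(major, minor) {
			return false
		}
	}
	return true
}

func main() {
	c := &cgroup{}
	fmt.Println("no programs, 1:3 permitted:", c.permits(1, 3))

	// Initial rules: only /dev/null (1:3) is allowed.
	c.appendProgram(program{{major: 1, minor: 3, allow: true}})
	fmt.Println("initial, 1:5 permitted:", c.permits(1, 5))

	// "Update" that tries to additionally allow /dev/zero (1:5).
	c.appendProgram(program{
		{major: 1, minor: 3, allow: true},
		{major: 1, minor: 5, allow: true},
	})
	fmt.Println("after update, 1:5 permitted:", c.permits(1, 5))
}
```

In this model the second program would allow 1:5 on its own, but the first program is still attached and still denies it, so the update has no effect. Detaching the old program first would avoid that, but would leave a window in which no program is attached and everything is permitted.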

Part of #2315.
