Description
This affects both versions, but in quite different ways:
- For cgroupv1, #2205 ("Don't deny all devices when update cgroup resource") highlighted that on device cgroup updates, we temporarily block all devices. This results in spurious errors in the container (such as programs being unable to open `/dev/null`). We've seen this happen on customer systems under Kubernetes, so this is definitely a real issue.
  - This is actually a more complicated issue than it first appears because `runc` actually incorrectly implements the spec here -- technically `runc` is a black-list by default and users have to convert `runc` to be a white-list. Aside from not following the spec, this is a worrying security stance.
- For cgroupv2, devices cgroup updates are implemented by appending a new BPF program to the cgroup. This means that only new denials have an effect, and thus it's incorrectly implemented. (EDIT: This also means that we "leak" eBPF programs and thus after 64+ applications we start getting errors -- see #2474, "api, cgroupv2: skip setting the devices cgroup".)

  Unfortunately this is a bit complicated to fix, but I have figured out how to do it. We need to make an eBPF map of type `BPF_MAP_TYPE_PROG_ARRAY` and then tail-call into it in a small stub eBPF program which we attach to the actual cgroup. This will allow us to atomically update the devices cgroup rules (there is no way to atomically replace an eBPF program with `BPF_F_ALLOW_MULTI` -- and without any program, all device accesses would be permitted).
  - Ignore the above -- you cannot `bpf_tail_call` from cgroup programs. So we will need to instead implement it through an eBPF map (which we can atomically replace by mis-using `BPF_MAP_TYPE_ARRAY_OF_MAPS`).
  - This is all slightly complicated by the fact that the entire API is fd-based (and we don't have our own monitor process, so we can't stash away the fd). Luckily there is a lookup-by-id system which we can use to get the file descriptor. However, the ids can be recycled, so we'll need to be careful not to start touching the wrong eBPF map -- and unlike eBPF programs, eBPF maps don't store information about when they were created.
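For the cgroupv1 case, here is a rough sketch of how an update could avoid the temporary deny-all window: compute only the difference between the old and new rule sets, write the additions to `devices.allow` first, and only then write the removals to `devices.deny`. This is not `runc`'s implementation. The `Key`, `Rules`, `Write` and `Transition` names are hypothetical. The sketch assumes the cgroup is already in white-list mode, that we know the previously applied rules, and that rules are compared by exact type/major/minor (overlapping wildcard rules are not merged, and the special `a` rule is not handled).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Key identifies a device rule. Hypothetical type, not runc's own.
// A Major or Minor of -1 stands for the wildcard "*".
type Key struct {
	Type         byte // 'c' or 'b'
	Major, Minor int64
}

// Rules maps each device to its allowed permissions (a subset of "rwm").
type Rules map[Key]string

// Write is one pending write to a cgroupv1 devices control file.
type Write struct {
	File string // "devices.allow" or "devices.deny"
	Data string
}

func num(n int64) string {
	if n < 0 {
		return "*"
	}
	return fmt.Sprint(n)
}

// line renders the rule in the "type major:minor perms" format.
func (k Key) line(perms string) string {
	return fmt.Sprintf("%c %s:%s %s", k.Type, num(k.Major), num(k.Minor), perms)
}

// minus returns the permissions present in a but not in b, in "rwm" order.
func minus(a, b string) string {
	out := ""
	for _, p := range "rwm" {
		if strings.ContainsRune(a, p) && !strings.ContainsRune(b, p) {
			out += string(p)
		}
	}
	return out
}

// Transition returns the writes needed to move from oldRules to newRules.
// All allows come before all denies, and permissions present in both sets
// are never touched, so they stay usable throughout the update.
func Transition(oldRules, newRules Rules) []Write {
	seen := map[Key]struct{}{}
	for k := range oldRules {
		seen[k] = struct{}{}
	}
	for k := range newRules {
		seen[k] = struct{}{}
	}
	keys := make([]Key, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	// Sort for deterministic output.
	sort.Slice(keys, func(i, j int) bool {
		a, b := keys[i], keys[j]
		if a.Type != b.Type {
			return a.Type < b.Type
		}
		if a.Major != b.Major {
			return a.Major < b.Major
		}
		return a.Minor < b.Minor
	})

	var allows, denies []Write
	for _, k := range keys {
		if add := minus(newRules[k], oldRules[k]); add != "" {
			allows = append(allows, Write{File: "devices.allow", Data: k.line(add)})
		}
		if del := minus(oldRules[k], newRules[k]); del != "" {
			denies = append(denies, Write{File: "devices.deny", Data: k.line(del)})
		}
	}
	return append(allows, denies...)
}

func main() {
	oldRules := Rules{
		{Type: 'c', Major: 1, Minor: 3}: "rwm", // /dev/null
		{Type: 'c', Major: 1, Minor: 5}: "rwm", // /dev/zero
	}
	newRules := Rules{
		{Type: 'c', Major: 1, Minor: 3}:    "r",
		{Type: 'c', Major: 10, Minor: 200}: "rwm", // /dev/net/tun
	}
	for _, w := range Transition(oldRules, newRules) {
		fmt.Printf("%s <- %q\n", w.File, w.Data)
	}
}
```

With this ordering, a device like `/dev/null` that is allowed both before and after the update never becomes inaccessible part-way through, which is the failure mode described in the cgroupv1 item above.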
Part of #2315.