I've noticed that the tests are very likely to fail on ARM machines when run in parallel with xdist. I was also able to reproduce the failures on an x86_64 machine by introducing some load. I could not reproduce them on either machine when running the tests serially, even under high load (stress -c 16 on a machine with 4 logical threads).
Please see NixOS/nixpkgs#91706 for more information.
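
For anyone who wants to try this locally, here is a rough sketch of the reproduction described above. It is only an illustration: the helper names `build_commands` and `run_under_load` are made up for this example, the xdist worker count of 4 is an assumption (the report only states the machine had 4 logical threads), and it assumes `stress` and pytest-xdist are installed and that it is run from the project's root directory.

```python
import shutil
import subprocess
import sys


def build_commands(stress_workers=16, xdist_workers=4, parallel=True):
    """Return (stress_cmd, pytest_cmd) as argument lists (hypothetical helper)."""
    # `stress -c N` spawns N CPU-bound workers; the report used N=16.
    stress_cmd = ["stress", "-c", str(stress_workers)]
    pytest_cmd = [sys.executable, "-m", "pytest"]
    if parallel:
        # `-n` is the pytest-xdist option that sets the number of worker
        # processes. The default of 4 here is an assumption, not from the report.
        pytest_cmd += ["-n", str(xdist_workers)]
    return stress_cmd, pytest_cmd


def run_under_load(stress_workers=16, xdist_workers=4, parallel=True):
    """Run the test suite while `stress` loads the CPU; return pytest's exit code."""
    stress_cmd, pytest_cmd = build_commands(stress_workers, xdist_workers, parallel)
    if shutil.which(stress_cmd[0]) is None:
        raise RuntimeError("the `stress` tool is not installed")
    load = subprocess.Popen(stress_cmd)  # start the background CPU load
    try:
        return subprocess.run(pytest_cmd).returncode
    finally:
        # Always stop the load generator, even if pytest is interrupted.
        load.terminate()
        load.wait()
```

Calling `run_under_load(parallel=True)` and then `run_under_load(parallel=False)` would compare the parallel and serial runs under the same load. Per the observations above, only the parallel run is expected to fail.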