The problem
Since the purpose of stress -c is to put real load on the specified number of CPU cores, the pipelines of those cores should not be stalled for the majority of the time. Yet:
perf stat stress -c 16 -t 5
tells us that the cores, although occupied, are stalled most of the time:
stress: info: [526148] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [526148] successful run completed in 5s
Performance counter stats for 'stress -c 16 -t 5':
79,580.45 msec task-clock:u # 15.910 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
309 page-faults:u # 3.883 /sec
418,716,815,425 cycles:u # 5.262 GHz
262,176,845,042 stalled-cycles-frontend:u # 62.61% frontend cycles idle
617,055,840,870 instructions:u # 1.47 insn per cycle
# 0.42 stalled cycles per insn
175,186,890,751 branches:u # 2.201 G/sec
269,450,686 branch-misses:u # 0.15% of all branches
5.001799550 seconds time elapsed
79.463002000 seconds user
0.007854000 seconds sys
I'd like to draw attention to 62.61% frontend cycles idle: for most cycles, the CPU cores' frontends could not make progress on the instruction stream. On hyperthreaded cores, for example, this means a worker does not fully use its "half-core", and therefore does not slow down the other half of the hyperthreaded core as intended (which is what I needed stress for).
Why does that happen?
Simple. The code is
while(1){sqrt(rand());}
which, contrary to what the man page claims, isn't actually "spinning on sqrt()". In fact, a compiler with floating-point exceptions disabled might notice that the result of sqrt is never used and not execute it at all. (That's not what happens in a default build, however.)
Instead, the bottleneck is rand(), which isn't even reentrant and should never have been used from multiple threads. In a perf record stress -c 16 -t 5 on a modern x86_64, you'll notice that the CPU spends about 99.8% of its time in __random(). That is hardly surprising, and it is the problem here: rand() modifies global state, and hence depends heavily on memory views being kept consistent between CPU cores.
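One way to avoid the global state is to give every worker its own generator state. The following is only a sketch under assumptions, not a patch against stress: xorshift64 and hog_rounds are hypothetical names, and xorshift64 is swapped in for rand() purely because its whole state fits in one caller-owned 64-bit word.

```c
#include <stdint.h>

/* Hypothetical replacement for rand(): one xorshift64 step.
 * The state is owned by the caller, so no global memory is touched.
 * The state must never be zero, or the generator stays at zero. */
static inline uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Hypothetical worker body: runs `iterations` generator steps on a
 * private state and returns the XOR of all outputs, so the work has
 * an observable result and cannot be optimised away. */
uint64_t hog_rounds(uint64_t seed, unsigned long iterations)
{
    uint64_t state = seed ? seed : 1;  /* map a zero seed to a valid state */
    uint64_t acc = 0;

    for (unsigned long i = 0; i < iterations; i++)
        acc ^= xorshift64(&state);
    return acc;
}
```

Each worker would call hog_rounds() with its own seed (for example, its worker index plus one), so no two workers ever read or write the same generator state.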
So it is a serious bug to use rand() here, and a minor bug to use sqrt() without doing anything with the result.