$ cmake .
$ make
$ ./reduction -m cg1
bandwidth 10.03 GB/s
$ ./reduction -m cg2
bandwidth 13.44 GB/s
$ ./reduction -m cg3
bandwidth 18.99 GB/s
$ ./reduction -m cg4
bandwidth 33.73 GB/s
$ ./reduction -m cg5
bandwidth 46.20 GB/s
$ ./reduction -m cg6
bandwidth 53.43 GB/s
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
https://devblogs.nvidia.com/faster-parallel-reductions-kepler/