Comparative Tests on KME BKM and VWF

Edmond Chow edited this page Oct 18, 2020 · 6 revisions

The performance of H2Pack strongly depends on the performance of the kernel evaluation functions, i.e., the Kernel Matrix Evaluation (KME) function and Bi-Kernel Matvec (BKM) function. The KME and BKM functions can be either vectorized manually using the provided Vector Wrapper Functions (VWF) or vectorized automatically by the C compiler.

The following numerical results show how these techniques affect the performance of H2-construction and H2-matvec.

Hardware and software configuration

  • 2 * Intel Xeon Gold 6226 CPU @ 2.7GHz (2 * 12 cores, 2 * 12 * 2 threads, hyperthreading disabled)
  • 6 * 32 GB DDR4 memory
  • Red Hat Enterprise Linux 7.6 (kernel 3.10.0-957.12.1.el7)
  • Intel Parallel Studio Cluster version 2019.5
  • ICC optimization flags: -O3 -xHost
  • OpenMP environment variables
    • OMP_NUM_THREADS=24
    • OMP_PLACES=cores
    • OMP_PROC_BIND=close

Test settings

  • Point sets: uniformly and randomly distributed points in a 3D unit ball
  • Running mode: JIT
  • Relative error threshold: 1e-6
  • Kernel: 3D Gaussian with
  • Comparison of kernel implementations:
    • no vectorization ("no-vec")
    • ICC automatic vectorization ("auto-vec")
    • manual vectorization by VWF ("wrap-vec")

Numerical Results (timings in seconds)

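| Operation | Kernel function | 100,000 points | 400,000 points | 1,600,000 points |
|---|---|---|---|---|
| H2-construction | KME no-vec | 0.022 | 0.083 | 0.440 |
| H2-construction | KME auto-vec | 0.020 | 0.092 | 0.448 |
| H2-construction | KME wrap-vec | 0.023 | 0.084 | 0.442 |
| H2-matvec | KME no-vec | 0.120 | 0.313 | 0.745 |
| H2-matvec | KME auto-vec | 0.038 | 0.101 | 0.260 |
| H2-matvec | KME wrap-vec | 0.028 | 0.081 | 0.233 |
| H2-matvec | BKM no-vec | 0.161 | 0.369 | 0.908 |
| H2-matvec | BKM auto-vec | 0.031 | 0.091 | 0.265 |
| H2-matvec | BKM wrap-vec | 0.020 | 0.056 | 0.156 |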

Notes:

  • Computation in H2-construction is dominated by the column-pivoted QR, so it gains little from vectorizing the KME functions.
  • Both automatic and manual vectorization of the KME and BKM functions make H2-matvec about 3x to 8x faster than the unvectorized versions in the timings above, and manual vectorization is about 1.1x to 1.7x as fast as automatic vectorization.
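
The speedups discussed in these notes can be recomputed from the H2-matvec rows of the results table. The following is a small sketch of that arithmetic in C. The array and function names are hypothetical, and the timings are copied from the table with columns in the order 100,000 / 400,000 / 1,600,000 points.

```c
#include <stddef.h>

/* H2-matvec timings in seconds, copied from the results table.
 * Columns: 100,000 / 400,000 / 1,600,000 points. */
const double kme_no_vec[3]   = {0.120, 0.313, 0.745};
const double kme_auto_vec[3] = {0.038, 0.101, 0.260};
const double kme_wrap_vec[3] = {0.028, 0.081, 0.233};
const double bkm_no_vec[3]   = {0.161, 0.369, 0.908};
const double bkm_auto_vec[3] = {0.031, 0.091, 0.265};
const double bkm_wrap_vec[3] = {0.020, 0.056, 0.156};

/* Ratio t_ref / t_new: a value above 1 means t_new is faster. */
double speedup(double t_ref, double t_new)
{
    return t_ref / t_new;
}

/* Smallest and largest speedup over n paired timings (n must be >= 1). */
void speedup_range(const double *t_ref, const double *t_new, size_t n,
                   double *lo, double *hi)
{
    *lo = *hi = speedup(t_ref[0], t_new[0]);
    for (size_t i = 1; i < n; i++) {
        const double s = speedup(t_ref[i], t_new[i]);
        if (s < *lo) *lo = s;
        if (s > *hi) *hi = s;
    }
}
```

For example, `speedup(bkm_no_vec[0], bkm_wrap_vec[0])` is about 8.05, and `speedup_range` over the KME no-vec and auto-vec rows gives a range of roughly 2.87 to 3.16.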