Fused-Multiply-Add #214

ashvardanian · 2024-10-19T08:47:27Z

SimSIMD is expanding and moving closer to a fully-fledged BLAS library. It only covers BLAS level 1 for now, but it's a start! SimSIMD will prioritize mixed and low-precision vector math, favoring modern AI workloads. For image & media processing workloads, the new `fma` and `wsum` kernels approach 65 GB/s per core on Intel Sapphire Rapids. That's roughly 100x faster than the serial code for u8 inputs with f32 scaling and accumulation.

Contains the following element-wise operations:

WSum or Weighted-Sum: $R_i = \alpha \cdot A_i + \beta \cdot B_i$
FMA or Fused-Multiply-Add: $R_i = \alpha \cdot A_i \cdot B_i + \beta \cdot C_i$

In NumPy terms:

import numpy as np

def wsum(A: np.ndarray, B: np.ndarray, Alpha: float, Beta: float) -> np.ndarray:
    # Weighted sum, cast back to the input type
    assert A.dtype == B.dtype, "Input types must match and determine the output type"
    return (Alpha * A + Beta * B).astype(A.dtype)

def fma(A: np.ndarray, B: np.ndarray, C: np.ndarray, Alpha: float, Beta: float) -> np.ndarray:
    # Scaled element-wise product plus a scaled addend, cast back to the input type
    assert A.dtype == B.dtype and A.dtype == C.dtype, "Input types must match and determine the output type"
    return (Alpha * A * B + Beta * C).astype(A.dtype)

This tiny set of operations is enough to implement a wide range of algorithms:

To scale a vector by a scalar, just call WSum with $\beta = 0$.
To sum two vectors, just call WSum with $\alpha = \beta = 1$.
To average two vectors, just call WSum with $\alpha = \beta = 0.5$.
To multiply vectors element-wise, just call FMA with $\beta = 0$.
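The special cases listed above can be checked against the NumPy reference definitions. The snippet below is a minimal sketch that restates those reference functions (it does not call SimSIMD itself) and assumes $\alpha = 1$ for the element-wise product, which the list leaves implicit.

```python
import numpy as np

def wsum(a, b, alpha, beta):
    # NumPy reference: weighted sum, cast back to the input type
    return (alpha * a + beta * b).astype(a.dtype)

def fma(a, b, c, alpha, beta):
    # NumPy reference: scaled product plus scaled addend, cast back to the input type
    return (alpha * a * b + beta * c).astype(a.dtype)

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 5.0, 6.0], dtype=np.float32)

# Scale by a scalar: beta = 0, so the second operand does not contribute
scaled = wsum(a, a, 2.0, 0.0)
# Sum: alpha = beta = 1
summed = wsum(a, b, 1.0, 1.0)
# Average: alpha = beta = 0.5
averaged = wsum(a, b, 0.5, 0.5)
# Element-wise product: alpha = 1 (assumed), beta = 0, so the addend does not contribute
product = fma(a, b, a, 1.0, 0.0)
```

When a coefficient is zero, the corresponding operand still has to be passed, so any array of the right shape and type can stand in for it.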

Benchmarks

On Intel Sapphire Rapids:

Run on (16 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 2048 KiB (x8)
  L3 Unified 61440 KiB (x1)
Load Average: 0.79, 0.75, 0.56
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
fma_f64_haswell<1536d>/min_time:10.000/threads:1         1344 ns         1344 ns     10391897 abs_delta=0 bytes=27.4208G/s pairs=743.836k/s relative_error=0
wsum_f64_haswell<1536d>/min_time:10.000/threads:1        1040 ns         1040 ns     13465261 abs_delta=0 bytes=23.6376G/s pairs=961.815k/s relative_error=0
fma_f32_haswell<1536d>/min_time:10.000/threads:1          651 ns          651 ns     21534450 abs_delta=23.597n bytes=28.3033G/s pairs=1.53555M/s relative_error=47.0002n
wsum_f32_haswell<1536d>/min_time:10.000/threads:1         392 ns          392 ns     36225731 abs_delta=19.6436n bytes=31.3326G/s pairs=2.54985M/s relative_error=54.2672n
fma_f16_haswell<1536d>/min_time:10.000/threads:1          188 ns          188 ns     74334715 abs_delta=9.24044u bytes=49.1302G/s pairs=5.33097M/s relative_error=18.3975u
wsum_f16_haswell<1536d>/min_time:10.000/threads:1         130 ns          129 ns    106997523 abs_delta=12.015u bytes=47.4441G/s pairs=7.72203M/s relative_error=33.1896u
fma_bf16_haswell<1536d>/min_time:10.000/threads:1         225 ns          225 ns     62443286 abs_delta=1.91338m bytes=41.0221G/s pairs=4.45118M/s relative_error=3.81108m
wsum_bf16_haswell<1536d>/min_time:10.000/threads:1        161 ns          161 ns     86471812 abs_delta=1.36093m bytes=38.1318G/s pairs=6.20635M/s relative_error=3.75961m
fma_u8_sapphire<1536d>/min_time:10.000/threads:1         70.9 ns         70.9 ns    197232316 abs_delta=9.2812 bytes=64.9867G/s pairs=14.103M/s relative_error=2.45142m
wsum_u8_sapphire<1536d>/min_time:10.000/threads:1        50.6 ns         50.6 ns    276672248 abs_delta=8.89144 bytes=60.6775G/s pairs=19.7518M/s relative_error=3.28203m
fma_i8_sapphire<1536d>/min_time:10.000/threads:1         94.0 ns         94.0 ns    149003863 abs_delta=10.1192 bytes=49.0403G/s pairs=10.6424M/s relative_error=6.98359m
wsum_i8_sapphire<1536d>/min_time:10.000/threads:1        70.4 ns         70.4 ns    198873173 abs_delta=9.76862 bytes=43.613G/s pairs=14.197M/s relative_error=9.3472m
fma_f64_skylake<1536d>/min_time:10.000/threads:1         1340 ns         1340 ns     10460553 abs_delta=39.3003a bytes=27.5182G/s pairs=746.479k/s relative_error=78.2836a
wsum_f64_skylake<1536d>/min_time:10.000/threads:1        1036 ns         1036 ns     13484768 abs_delta=28.4608a bytes=23.717G/s pairs=965.047k/s relative_error=78.6298a
fma_f32_skylake<1536d>/min_time:10.000/threads:1          626 ns          626 ns     22261554 abs_delta=25.3818n bytes=29.4286G/s pairs=1.5966M/s relative_error=50.5553n
wsum_f32_skylake<1536d>/min_time:10.000/threads:1         386 ns          386 ns     35032887 abs_delta=19.7444n bytes=31.8146G/s pairs=2.58908M/s relative_error=54.5454n
fma_bf16_skylake<1536d>/min_time:10.000/threads:1         188 ns          188 ns     74667249 abs_delta=415.805u bytes=48.9511G/s pairs=5.31154M/s relative_error=827.962u
wsum_bf16_skylake<1536d>/min_time:10.000/threads:1        147 ns          147 ns     95128759 abs_delta=269.793u bytes=41.8834G/s pairs=6.81696M/s relative_error=745.331u
fma_f16_serial<1536d>/min_time:10.000/threads:1           900 ns          900 ns     15592180 abs_delta=2.97965u bytes=10.2444G/s pairs=1.11159M/s relative_error=5.93995u
wsum_f16_serial<1536d>/min_time:10.000/threads:1          821 ns          821 ns     17058449 abs_delta=1.11521u bytes=7.48594G/s pairs=1.21841M/s relative_error=3.07961u
fma_u8_serial<1536d>/min_time:10.000/threads:1           6692 ns         6692 ns      2089290 abs_delta=1.66854 bytes=688.583M/s pairs=149.432k/s relative_error=440.882u
wsum_u8_serial<1536d>/min_time:10.000/threads:1          5577 ns         5577 ns      2508971 abs_delta=2.32787 bytes=550.797M/s pairs=179.296k/s relative_error=859.403u
fma_i8_serial<1536d>/min_time:10.000/threads:1           6874 ns         6874 ns      2039761 abs_delta=5.14013 bytes=670.367M/s pairs=145.479k/s relative_error=3.54862m
wsum_i8_serial<1536d>/min_time:10.000/threads:1          5851 ns         5851 ns      2394538 abs_delta=6.36953 bytes=525.018M/s pairs=170.904k/s relative_error=6.09231m
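The headline numbers can be re-derived from the log above. The following sketch copies a few rows by hand and assumes that the `bytes` counter covers input bytes only (three input vectors for FMA, two for WSum); that reading is inferred from the numbers, not from the benchmark source.

```python
DIMENSIONS = 1536  # from the <1536d> suffix in the benchmark names

# (kernel, bytes per element, input vectors per call, calls per second, reported bytes per second)
throughput_rows = [
    ("fma_u8_sapphire", 1, 3, 14.103e6, 64.9867e9),
    ("wsum_u8_sapphire", 1, 2, 19.7518e6, 60.6775e9),
    ("fma_f64_haswell", 8, 3, 743.836e3, 27.4208e9),
]

def input_bytes_per_second(element_size, inputs, calls_per_second):
    # Assumption: only bytes read from the input vectors are counted
    return DIMENSIONS * element_size * inputs * calls_per_second

throughput_errors = {
    name: abs(input_bytes_per_second(size, inputs, rate) - reported) / reported
    for name, size, inputs, rate, reported in throughput_rows
}

# Time per call in nanoseconds: (serial, Sapphire Rapids)
timings_ns = {
    "fma_u8": (6692, 70.9),
    "wsum_u8": (5577, 50.6),
    "fma_i8": (6874, 94.0),
    "wsum_i8": (5851, 70.4),
}
speedups = {name: serial / simd for name, (serial, simd) in timings_ns.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x faster than serial")
```

Under that assumption the recomputed throughput agrees with the reported counters, and the u8 kernels come out at about 94x (FMA) and 110x (WSum) faster than their serial counterparts.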


The Python binding has been rewritten to take advantage of the fast calling convention, similar to the `cdist` method. Most functions now accept an `out=` keyword argument to avoid memory allocations and reuse existing buffers. `cdist` now computes only half of the values when the two input matrices map to the same memory region. Python tests now cover more invalid inputs.
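The two ideas in that paragraph, writing into a caller-provided buffer and skipping half of a symmetric distance matrix, can be illustrated in plain NumPy. This is a hypothetical sketch, not SimSIMD's implementation: the helper names `same_buffer` and `pairwise_sqeuclidean` are invented here, and squared Euclidean distance is chosen only as an example metric.

```python
import numpy as np

def same_buffer(a, b):
    # True when both arrays view the same memory with the same layout and type
    return (
        a.shape == b.shape
        and a.strides == b.strides
        and a.dtype == b.dtype
        and a.__array_interface__["data"][0] == b.__array_interface__["data"][0]
    )

def pairwise_sqeuclidean(a, b, out=None):
    rows, cols = a.shape[0], b.shape[0]
    if out is None:
        # Allocate only when the caller did not provide a buffer
        out = np.empty((rows, cols), dtype=np.float64)
    symmetric = same_buffer(a, b)
    evaluations = 0
    for i in range(rows):
        # For identical inputs, start at the diagonal and mirror the result
        for j in range(i if symmetric else 0, cols):
            diff = a[i].astype(np.float64) - b[j]
            out[i, j] = diff @ diff
            evaluations += 1
            if symmetric:
                out[j, i] = out[i, j]
    return out, evaluations

points = np.arange(12, dtype=np.float32).reshape(4, 3)
buffer = np.empty((4, 4), dtype=np.float64)

# Same memory region: only the upper triangle (with the diagonal) is evaluated
half, half_count = pairwise_sqeuclidean(points, points, out=buffer)
# A copy lives in different memory, so every pair is evaluated
full, full_count = pairwise_sqeuclidean(points, points.copy())
```

For four rows, the shared-memory call evaluates 10 pairs instead of 16 and returns the buffer it was given, so no new matrix is allocated.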


ashvardanian added 15 commits October 17, 2024 17:06

Add: u8 kernels & FMA

679a813

Add: L2 comparisons

b79d45a

Merge branch 'main-dev' of https://github.com/ashvardanian/SimSIMD in…

dc82525

…to main-dev

Merge branch 'main-dev' of https://github.com/ashvardanian/SimSIMD in…

f64f748

…to main-dev

Docs: Position as BLAS alternative

8e9edb5

Make: Compiling with BLAS on MacOS

683fde9

Fix: Detect compatible kernels in bench.cxx

e2c7c7c

Add: FMA kernels

4f55e27

Merge branch 'main-dev' of https://github.com/ashvardanian/SimSIMD in…

af47302

…to main-dev

Docs: Reorganize Python doc-strings

d666f55

Fix: Float rounding in FMA

ea24713

Fix: Missing spdot symbols

b6fb1d3

Add: FMA on Sapphire Rapids

5a56a25

Make: Explicit control over f16 builds with NEON

708d034

Fix: Overriding SIMD feature flags on Arm

a0b7f6a

ashvardanian force-pushed the main-dev branch from c46d472 to a0b7f6a on October 21, 2024 04:12

ashvardanian added 9 commits October 25, 2024 10:26

Add: FMA & WSum tests

f4077fa

Fix: Signed saturation on Sapphire Rapids

5c29b81

Fix: Signed saturation in serial i8 FMA

fe87c41

Add: FMA dynamic dispatch

295d830

Add: FMA & WSum Py binding

8aa74ab

Improve: C/C++ formatting

a4022a9

This commit affects a huge number of lines across the repo and should be ignored by Git Blame tools.

Improve: Ignore formatting commits

f098b39

Fix: Missing astype on NoneType

65f8b13

ashvardanian merged commit 7cea703 into main Oct 27, 2024
35 checks passed
