Performance
Cross Time DSP is encode limited. The extent of this varies with the audio formats in use but, in the typical case of processing 16 bit FLAC for 24 bit FLAC output, single core decode and DSP rates exceed 30 and 40 MS/s, respectively, on Intel 4th generation hardware (Haswell) whereas quad core encoding proceeds in the 8-9 MS/s range. Decode and encode are performed by Windows Media Foundation and are typically 90% of processing time on stereo tracks. The remaining 10% is spent in Cross Time DSP code unpacking decoded samples, DSP, and repacking samples for encoding. Microsoft, in effect, controls Cross Time DSP's performance.
Optimization in Cross Time DSP prioritizes processing of stereo using doubles (<engine precision="Double"> in config) as this is the most common case. Mono or multichannel audio uses general purpose code which is 4-5 times slower. Fixed point processing with Q31 is around 10% faster than equivalently optimized double precision but Q31_32x64 and Q31_64x64 are several times slower as they do not use SSE. Cross Time DSP's fixed point additionally favors accuracy over speed and may be slower than implementations which trade accuracy for speed. In typical double precision processing about 80% of the time spent in Cross Time DSP code goes to DSP. Conversion from 16 bit input samples to doubles accounts for 6% and conversion from doubles to 24 bit output samples for the remaining 14%.
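The two conversion steps can be pictured with a scalar sketch. This is not Cross Time DSP's code: the scaling convention (integer full scale mapped to ±1.0), the rounding and clamping, the little endian byte order, and the function names are all assumptions made for illustration. The measured SSE code converts several samples per instruction.

```cpp
#include <cmath>
#include <cstdint>

// Assumed convention: 16 bit full scale maps to +/-1.0 as a double.
double int16ToDouble(std::int16_t sample)
{
    return static_cast<double>(sample) / 32768.0;
}

// Converts a double to a 24 bit sample and writes it as three bytes,
// least significant byte first (assumed little endian PCM packing).
void doubleToInt24(double value, std::uint8_t* destination)
{
    double scaled = std::round(value * 8388608.0);  // 2^23
    // Clamp so that +1.0 and any overshoot do not wrap around.
    if (scaled > 8388607.0) { scaled = 8388607.0; }
    if (scaled < -8388608.0) { scaled = -8388608.0; }

    // Go through an unsigned type so the byte extraction is well defined
    // for negative samples.
    std::uint32_t bits = static_cast<std::uint32_t>(static_cast<std::int32_t>(scaled));
    destination[0] = static_cast<std::uint8_t>(bits & 0xFF);
    destination[1] = static_cast<std::uint8_t>((bits >> 8) & 0xFF);
    destination[2] = static_cast<std::uint8_t>((bits >> 16) & 0xFF);
}
```

The output side does more work per sample (scale, round, clamp, split into bytes) than the input side (one widening conversion and a scale), which is at least consistent with output conversion taking the larger share of time.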
Some of the more interesting findings from measuring the 10% of time spent in Cross Time DSP code are listed below.
SIMD Behavior
- IIR DSP is primarily register bound. Direct form II biquads touch nine coefficients and state variables per sample, direct form II first order filters five. One biquad, one first order filter, and their associated accumulator therefore fill the 16 xmm registers available in 64 bit SSE. Higher order filters exceed xmm space, requiring multiple loads per sample (and stores, if state variables are moved out of register) and reducing throughput.
- IIR DSP is secondarily load and store bound. This is typical of most algorithms on current Intel architectures (Haswell, Broadwell, Skylake) and also reflects the relatively light kernels imposed by the limited number of registers. Peak throughput is therefore obtained at maximum register use, and filter arrangements which don't pack optimally onto the ALU run correspondingly slower.
- With ideal data arrangement AVX 256 offers 65+% increase in throughput over AVX 128 (VEX encoded SSE) for sequential biquads. Multichannel can take advantage of this to varying extent but stereo audio lacks the width to fully utilize 256 bit and wider SIMD. On Haswell (and, by extension, Broadwell and Skylake) the cost of permutes needed to move data and filter state into the additional bits exceeds the benefit. AVX 256 is therefore 16 to 26% slower than AVX 128 depending on the filters in use and implementation of the permutes.
- Simulation suggests stereo filters would be 45-70% faster than in AVX 128 in a seamless 256 bit SIMD implementation removing lane, gather, and scatter limitations. Cross lane extracts, inserts, or permutes easily degrade Haswell throughput by a factor of two, converting gains from AVX 256 into 20% degradations.
- Parallel IIR may avoid enough permutes for greater width to improve throughput. Cross Time DSP presently lacks support for parallel filtering, however, and pipelining from filter block interleaving may prove more effective.
- Use of a 64 bit accumulator gives 32 bit fixed point the same width as a double. Stereo on AVX2 therefore suffers similar width difficulties relative to 128 bit filtering with SSE4.1 instructions (VEX encoded or otherwise).
- Haswell presents a 6-7% AVX2 penalty on sample conversion relative to use of SSSE3 shuffle and SSE4.1 blend and conversion. This is consistent with clock speed reduction and indicates negligible kernel advantage to AVX2 over SSE.
- Haswell presents at best no penalty for IIR with 128 bit FMA. In efficient kernel forms throughput degrades because the latency added by the dependencies FMA introduces across accumulator terms exceeds the latency removed by fusing multiplies and adds. Use of FMA can yield up to 10% benefit in permute limited AVX 256 kernels, but AVX 128 remains faster.
- Use of _mm_load_pd() and _mm_store_pd() when managed memory alignment permits offers no advantage over _mm_loadu_pd() and _mm_storeu_pd(). Aligned load and store when using _aligned_malloc() is 1-3% faster than loadu/storeu. This is largely insensitive to 16 versus 32 byte data alignment, though 32 byte alignment contributes a small speedup (1-2%) on Haswell.
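For orientation, the kind of kernel these findings refer to can be sketched in scalar form. This is not Cross Time DSP's kernel: the type and function names are hypothetical, and the sketch does not try to reproduce the register allocation described above, which depends on how the kernel is arranged. In a 128 bit stereo kernel each coefficient would be broadcast to both lanes of an xmm register and each state variable would hold the left and right channels side by side.

```cpp
#include <cstddef>

// Coefficients of one biquad, normalized so that a0 = 1.
struct BiquadCoefficients
{
    double b0, b1, b2, a1, a2;
};

// Direct form II state for a stereo pair: two delay elements per channel.
struct StereoBiquadState
{
    double w1[2] = { 0.0, 0.0 };
    double w2[2] = { 0.0, 0.0 };
};

// Filters interleaved stereo samples (L R L R ...) in place.
void filterStereoInPlace(const BiquadCoefficients& c, StereoBiquadState& s,
                         double* samples, std::size_t frames)
{
    for (std::size_t frame = 0; frame < frames; ++frame)
    {
        // The two channels correspond to the two lanes of one 128 bit register.
        for (int channel = 0; channel < 2; ++channel)
        {
            double* sample = samples + 2 * frame + channel;
            // w[n] = x[n] - a1 w[n-1] - a2 w[n-2]
            double w0 = *sample - c.a1 * s.w1[channel] - c.a2 * s.w2[channel];
            // y[n] = b0 w[n] + b1 w[n-1] + b2 w[n-2]
            *sample = c.b0 * w0 + c.b1 * s.w1[channel] + c.b2 * s.w2[channel];
            s.w2[channel] = s.w1[channel];
            s.w1[channel] = w0;
        }
    }
}
```

Each output sample depends on the state left by the previous one, so consecutive samples of a channel cannot simply be spread across wider registers. This is why stereo supplies only two independent lanes and why moving data into the upper half of a 256 bit register requires the permutes discussed above.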
Compilation Behavior
- VEX encoded SSE (/arch:AVX) is 6-8% faster than original SSE instruction codes. This is mainly removal of xmm0 contention and will be available from Cross Time DSP 0.7.0.0.
- C++/CLI compilation with maximum optimization (/Ox), intrinsics enabled (/Oi), or favoring fast code (/Ot) results in code a few percent slower than simply setting maximize speed (/O2). Hand written SSE intrinsics are roughly 3.5 times faster than what the compiler produces on its own.
- C++ and C++/CLI compilation for AVX and AVX2 (/arch) results in code a few percent slower than compilation limited to SSE2 on a precise floating point model (/fp:precise). Fast floating point (/fp:fast) AVX2 is 0-2% faster than compiler generated fast floating point SSE2.
- C# compilation responds well to index offsets known at compile time, producing DSP code about 50% faster. C++/CLI and C++ compilation exhibit similar behavior. This is consistent with experience elsewhere.
- C# compilation for x64 results in DSP code about 10% faster than compilation for AnyCPU.
- Visual Studio 2015 Update 3's C# compiler generates DSP code of speed equivalent to comparable code in C++. C# is typically about 5% slower than equivalent C++.
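The point about index offsets known at compile time can be illustrated in C++ as below. The function names and the gain example are hypothetical, and the sketch only shows the two code shapes; it makes no claim to reproduce the measured difference.

```cpp
#include <cstddef>

// Channel count supplied at run time: stride and offsets are variables the
// compiler has to carry through the loop.
void applyGainRuntime(double* samples, std::size_t frames,
                      std::size_t channels, const double* gains)
{
    for (std::size_t frame = 0; frame < frames; ++frame)
    {
        for (std::size_t channel = 0; channel < channels; ++channel)
        {
            samples[frame * channels + channel] *= gains[channel];
        }
    }
}

// Channel count fixed at compile time: the inner loop has a constant trip
// count and every offset within a frame is a constant the compiler can fold
// into the addressing.
template <std::size_t Channels>
void applyGainFixed(double* samples, std::size_t frames, const double* gains)
{
    for (std::size_t frame = 0; frame < frames; ++frame)
    {
        double* frameStart = samples + frame * Channels;
        for (std::size_t channel = 0; channel < Channels; ++channel)
        {
            frameStart[channel] *= gains[channel];
        }
    }
}
```

A stereo specialization would be called as `applyGainFixed<2>(samples, frames, gains)`, in line with the stereo-specific code paths described earlier.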
Common Assumptions Contraindicated by Measurement
- Unrolling biquad and first order filter loops 2x or 4x is 1-2% slower than not unrolling. This is consistent across multiple compilers and unrolling methods. Specification of loop trip counts also results in slower code on the compilers tested.
- Code of the form *(pointer + offset) is no slower than *(pointer); pointer += stride. It may be 1-2% faster.
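The two addressing forms compared in the last item look like this in a minimal strided sum. The function names are hypothetical, and the buffer is assumed to hold at least `count * stride` elements so that neither loop forms a pointer beyond one past its end.

```cpp
#include <cstddef>

// Form 1: *(pointer + offset). The base pointer stays fixed and the offset advances.
double sumByOffset(const double* samples, std::size_t count, std::size_t stride)
{
    double sum = 0.0;
    for (std::size_t offset = 0; offset < count * stride; offset += stride)
    {
        sum += *(samples + offset);
    }
    return sum;
}

// Form 2: *(pointer); pointer += stride. The pointer itself advances.
double sumByIncrement(const double* samples, std::size_t count, std::size_t stride)
{
    double sum = 0.0;
    const double* pointer = samples;
    for (std::size_t index = 0; index < count; ++index)
    {
        sum += *pointer;
        pointer += stride;
    }
    return sum;
}
```

Both functions read the same elements in the same order and return the same value, so any difference between them is purely in the code the compiler generates.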