Performance
Cross Time DSP is encode limited. The extent of this varies with the audio formats in use but, in the typical case of processing 16 bit FLAC for 24 bit FLAC output, single core decode and DSP rates exceed 30 and 40 MS/s, respectively, on Intel 4th generation hardware (Haswell) whereas quad core encoding proceeds in the 8-9 MS/s range. Decode and encode are performed by Windows Media Foundation and are typically 90% of processing time on stereo tracks. The remaining 10% is spent in Cross Time DSP code unpacking decoded samples, DSP, and repacking samples for encoding. Microsoft, in effect, controls Cross Time DSP's performance.
Optimization in Cross Time DSP prioritizes processing of stereo using doubles (<engine precision="Double"> in config) as this is the most common case. Mono or multichannel audio uses general purpose code which is 4-5 times slower. Fixed point processing with Q31 is around 10% faster than equivalently optimized double precision but Q31_32x64 and Q31_64x64 are several times slower as they do not use SSE. Cross Time DSP's fixed point additionally favors accuracy over speed and may be slower than implementations which prioritize speed. In typical double precision processing about 80% of the time spent in Cross Time DSP code goes to DSP. Conversion from 16 bit input samples to doubles accounts for 6% and conversion from doubles to 24 bit output samples the remaining 14%.
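As a rough illustration of the two conversion steps mentioned above, the following scalar C++ sketch widens 16 bit samples to doubles and packs doubles back to 24 bit samples. The function names, the multiply by 256 scaling, and the little endian three byte layout are assumptions made for this example and are not taken from Cross Time DSP's code, which uses SSE for these steps.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: widen 16 bit PCM samples to doubles.
// Assumed scaling: multiply by 256 so 16 bit input spans the 24 bit output range.
std::vector<double> UnpackInt16(const std::vector<std::int16_t>& input)
{
    std::vector<double> samples(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        samples[i] = 256.0 * static_cast<double>(input[i]);
    }
    return samples;
}

// Hypothetical helper: round, clamp, and pack doubles as 24 bit PCM.
// Assumed layout: three bytes per sample, least significant byte first.
std::vector<std::uint8_t> PackInt24(const std::vector<double>& samples)
{
    const double min24 = -8388608.0; // -2^23
    const double max24 = 8388607.0;  // 2^23 - 1
    std::vector<std::uint8_t> output(3 * samples.size());
    for (std::size_t i = 0; i < samples.size(); ++i)
    {
        double clamped = std::fmin(std::fmax(std::round(samples[i]), min24), max24);
        std::uint32_t bits = static_cast<std::uint32_t>(static_cast<std::int32_t>(clamped));
        output[3 * i + 0] = static_cast<std::uint8_t>(bits & 0xFF);
        output[3 * i + 1] = static_cast<std::uint8_t>((bits >> 8) & 0xFF);
        output[3 * i + 2] = static_cast<std::uint8_t>((bits >> 16) & 0xFF);
    }
    return output;
}
```

Rounding and clamping on the output side are part of why the pack step in this sketch does more work per sample than the unpack step.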
Some of the more interesting findings from measuring the 10% time spent in Cross Time DSP code are listed below.
SIMD Behavior
- IIR DSP is primarily register bound. Direct form II biquads touch nine coefficients and state variables per sample, direct form II first order filters five. One biquad, one first order filter, and their associated accumulator therefore fill the 16 xmm registers available in 64 bit SSE. Higher order filters exceed xmm space, requiring multiple loads per sample (and stores, if state variables are moved out of register) and reducing throughput.
- IIR DSP is secondarily load and store bound. This is typical of most algorithms on current Intel architectures (Haswell, Broadwell, Skylake) and also reflects the relatively light kernels imposed by the limited number of registers. Peak throughput is therefore obtained at maximum register use and filter arrangements which don't pack optimally onto the ALU run accordingly slower.
- Haswell presents a 17% AVX penalty on biquads relative to VEX encoded SSE4.1 despite a tighter kernel and limited overhead for feedback terms. 6-7% of this is attributable to clock speed reduction associated with ymm register use. Insight into the remaining 10% is unavailable due to IACA limitations and VTune cost.
- Haswell presents a 22% AVX penalty relative to VEX SSE4.1 on sixth order filters using all of ymm register space. 2% of this appears to be due to Visual Studio 2015 Update 3's less efficient use of ymm registers, resulting in three coefficient loads per iteration which are avoided by SSE code. Clock speeds are reduced an additional 0.5% from AVX biquad cases.
- Haswell presents a 6-7% AVX2 penalty on sample conversion relative to use of SSSE3 shuffle and SSE4.1 blend and conversion. This is consistent with clock speed reduction and indicates negligible kernel advantage to AVX2 over SSE.
- Haswell presents at best no penalty for IIR with both 128 and 256 bit FMA. In most kernel forms performance degrades because the latency added by the dependencies FMA introduces across accumulator terms exceeds the latency removed by fusing multiplies and adds.
- Use of _mm_load_pd() and _mm_store_pd() when managed memory alignment permits offers no advantage over _mm_loadu_pd() and _mm_storeu_pd(). Aligned load and store when using _aligned_malloc() is 1-3% faster than loadu/storeu. This is largely insensitive to 16 versus 32 byte data alignment, though 32 byte alignment contributes a small speedup (1-2%) on Haswell.
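To make the kind of kernel discussed in this list concrete, here is a minimal sketch of a stereo direct form II biquad using SSE2 intrinsics, with the left and right samples of a frame sharing one xmm register. The StereoBiquad type, its member names, and the in-place interleaved layout are hypothetical and chosen for illustration. The sketch is not Cross Time DSP's kernel and its register layout is not necessarily the accounting used in the first bullet.

```cpp
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical stereo direct form II biquad:
//   w[n] = x[n] - a1 * w[n-1] - a2 * w[n-2]
//   y[n] = b0 * w[n] + b1 * w[n-1] + b2 * w[n-2]
struct StereoBiquad
{
    __m128d b0, b1, b2, a1, a2; // coefficients broadcast to both lanes
    __m128d w1, w2;             // delayed state: left in lane 0, right in lane 1

    StereoBiquad(double b0In, double b1In, double b2In, double a1In, double a2In)
        : b0(_mm_set1_pd(b0In)), b1(_mm_set1_pd(b1In)), b2(_mm_set1_pd(b2In)),
          a1(_mm_set1_pd(a1In)), a2(_mm_set1_pd(a2In)),
          w1(_mm_setzero_pd()), w2(_mm_setzero_pd())
    {
    }

    // Filters interleaved stereo doubles (L, R, L, R, ...) in place.
    void Filter(double* samples, std::size_t frames)
    {
        for (std::size_t frame = 0; frame < frames; ++frame)
        {
            double* current = samples + 2 * frame;
            // Unaligned load: no alignment requirement on the buffer.
            __m128d x = _mm_loadu_pd(current);
            __m128d feedback = _mm_add_pd(_mm_mul_pd(a1, w1), _mm_mul_pd(a2, w2));
            __m128d w0 = _mm_sub_pd(x, feedback);
            __m128d y = _mm_add_pd(_mm_mul_pd(b0, w0),
                                   _mm_add_pd(_mm_mul_pd(b1, w1), _mm_mul_pd(b2, w2)));
            w2 = w1;
            w1 = w0;
            _mm_storeu_pd(current, y);
        }
    }
};
```

Swapping _mm_loadu_pd() and _mm_storeu_pd() for _mm_load_pd() and _mm_store_pd() is only valid when the buffer is 16 byte aligned, for example when it comes from _aligned_malloc(samples, 16).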
Compilation Behavior
- VEX encoded SSE (/arch:AVX) is 6-8% faster than original SSE instruction codes. This is mainly removal of xmm0 contention and will be available from Cross Time DSP 0.7.0.0.
- C++/CLI compilation with maximum optimization (/Ox), intrinsics enabled (/Oi), or favoring fast code (/Ot) results in code a few percent slower than simply setting maximize speed (/O2). Programmer use of SSE intrinsics is roughly 3.5 times more performant than what the compiler can provide on its own.
- C++ and C++/CLI compilation for AVX and AVX2 (/arch) results in code a few percent slower than compilation limited to SSE2 on a precise floating point model (/fp:precise). Fast floating point (/fp:fast) AVX2 is 0-2% faster than compiler generated fast floating point SSE2.
- C# compilation responds well to index offsets known at compile time, producing DSP code about 50% faster. C++/CLI and C++ compilation exhibit similar behavior. This is consistent with experience elsewhere.
- C# compilation for x64 results in DSP code about 10% faster than compilation for AnyCPU.
- Visual Studio 2015 Update 3's C# compiler generates DSP code of speed equivalent to comparable code in C++. C# is typically about 5% slower than equivalent C++.
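One way to read the point about index offsets known at compile time is sketched below in C++, using a simple gain stage as a stand-in for DSP. Both functions and their names are hypothetical and are not drawn from Cross Time DSP. The first computes each index from a runtime channel count, while the second is specialized to stereo so the offsets 0 and 1 are literal constants.

```cpp
#include <cstddef>

// General purpose form: the channel count is a runtime value, so each
// index is computed from variables the compiler cannot fold away.
void ApplyGainGeneral(double* samples, std::size_t frames, std::size_t channels, double gain)
{
    for (std::size_t frame = 0; frame < frames; ++frame)
    {
        for (std::size_t channel = 0; channel < channels; ++channel)
        {
            samples[frame * channels + channel] *= gain;
        }
    }
}

// Stereo form: the per-frame offsets are the constants 0 and 1.
void ApplyGainStereo(double* samples, std::size_t frames, double gain)
{
    for (std::size_t sample = 0; sample < 2 * frames; sample += 2)
    {
        samples[sample + 0] *= gain;
        samples[sample + 1] *= gain;
    }
}
```

The two forms produce identical results on stereo input. Any speed difference depends on the compiler and would need to be measured.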
Common Assumptions Contraindicated by Measurement
- Biquad and first order filter loop unrolling of 2 and 4x is 1-2% slower than not unrolling.
- Code of the form *(pointer + offset) is no slower than *(pointer); pointer += stride. It may be 1-2% faster.
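The two addressing forms compared in the last item can be written side by side as in the sketch below. The function names are hypothetical, and a strided sum stands in for the real work. For simplicity the sketch assumes count is a multiple of stride so the advancing pointer stops exactly at the end of the buffer.

```cpp
#include <cstddef>

// Form 1: fixed base pointer plus a changing offset, *(pointer + offset).
double SumWithOffset(const double* samples, std::size_t count, std::size_t stride)
{
    double sum = 0.0;
    for (std::size_t offset = 0; offset < count; offset += stride)
    {
        sum += *(samples + offset);
    }
    return sum;
}

// Form 2: dereference then advance, *(pointer); pointer += stride.
// Assumes count is a multiple of stride.
double SumWithIncrement(const double* samples, std::size_t count, std::size_t stride)
{
    double sum = 0.0;
    const double* pointer = samples;
    const double* end = samples + count;
    while (pointer != end)
    {
        sum += *pointer;
        pointer += stride;
    }
    return sum;
}
```

Both functions visit the same elements in the same order, so they return the same sum. Which is faster is a property of the compiler and hardware and can only be settled by timing.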