Performance
Cross Time DSP is encode limited. The extent of this varies with the audio formats in use but, in the typical case of processing 16 bit FLAC for 24 bit FLAC output, single core decode and DSP rates exceed 30 and 40 MS/s, respectively, on Intel 4th generation hardware (Haswell) whereas quad core encoding proceeds in the 8-9 MS/s range. Decode and encode are performed by Windows Media Foundation and are typically 90% of processing time on stereo tracks. The remaining 10% is spent in Cross Time DSP code unpacking decoded samples, DSP, and repacking samples for encoding. Microsoft, in effect, controls Cross Time DSP's performance.
Optimization effort in Cross Time DSP prioritizes processing of stereo using doubles (<engine precision="Double"> in config) as this is the most common case. Mono and multichannel audio rely on slower, general purpose code. Fixed point processing using the various Q31 engine precisions is also slower, as Cross Time DSP's fixed point implementation favors accuracy over speed. In typical double precision processing, about 80% of the time spent in Cross Time DSP code goes to DSP. Conversion from 16 bit input samples to doubles accounts for 6% and conversion from doubles to 24 bit output samples for the remaining 14%.
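The two conversion steps can be sketched in scalar C++ as below. This is a minimal illustration, not Cross Time DSP's code: the function names are hypothetical, and it assumes that doubles carry samples on a 24 bit integer scale and that 24 bit output is packed as three little-endian bytes per sample.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: widen 16 bit samples to doubles on a 24 bit integer
// scale (assumed convention: multiply by 2^8).
std::vector<double> Int16ToDouble(const std::vector<std::int16_t>& input)
{
    std::vector<double> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        output[i] = 256.0 * static_cast<double>(input[i]);
    }
    return output;
}

// Hypothetical helper: round to the nearest integer, clamp to the 24 bit
// range, and pack each sample as three little-endian bytes.
std::vector<std::uint8_t> DoubleToInt24(const std::vector<double>& input)
{
    const double min24 = -8388608.0; // -2^23
    const double max24 = 8388607.0;  // 2^23 - 1
    std::vector<std::uint8_t> output(3 * input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        double clamped = std::min(std::max(std::nearbyint(input[i]), min24), max24);
        std::uint32_t bits = static_cast<std::uint32_t>(static_cast<std::int32_t>(clamped));
        output[3 * i + 0] = static_cast<std::uint8_t>(bits & 0xFF);
        output[3 * i + 1] = static_cast<std::uint8_t>((bits >> 8) & 0xFF);
        output[3 * i + 2] = static_cast<std::uint8_t>((bits >> 16) & 0xFF);
    }
    return output;
}
```

The output side does more work per sample (round, clamp, and three byte stores), which is at least consistent with output conversion costing more than input conversion in the measurements above.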
Findings from measuring the 10% spent in Cross Time DSP code include:
SIMD Behavior
- IIR DSP is primarily register bound. Direct form II biquads touch nine coefficients and state variables per sample, direct form II first order filters five. One biquad, one first order filter, and their associated accumulator therefore fill the 16 xmm registers available in 64 bit SSE. Making the compute kernel wider requires multiple loads per sample (and stores, if state variables are moved out of register) and reduces throughput.
- IIR DSP is secondarily load and store (cache bandwidth) bound. This is typical of most algorithms on current Intel architectures (2017) and reflects the relatively light kernels imposed by the limited number of registers. Making the compute kernel narrower than the registers available therefore also reduces throughput as ALUs are underutilized.
- Haswell presents a 6-7% AVX2 penalty on sample conversion relative to SSSE3 shuffle and SSE4.1 blend and conversion. This somewhat underperforms Intel's guidance (slide 50) on load to store ratios and compute kernel density. Specific insight is unavailable due to IACA limitations, but SSE likely finds more parallelism and pipelining within Haswell ALUs. Cross Time DSP relies on intrinsics and doesn't attempt programmer optimized assembly.
- Use of _mm_load_pd() and _mm_store_pd() when managed memory alignment permits offers no advantage over _mm_loadu_pd() and _mm_storeu_pd(). Aligned load and store when using _aligned_malloc() is 1-3% faster than loadu/storeu. This is largely insensitive to 16 versus 32 byte data alignment, though 32 byte alignment contributes a small speedup (1-2%) on Haswell.
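For readers unfamiliar with the recurrence behind the register pressure described above, here is a scalar sketch of a direct form II biquad applied to interleaved stereo. The class and member names are hypothetical, and the code does not reproduce Cross Time DSP's SSE kernel or its register allocation. It only shows that every coefficient and state variable is touched once per sample, which is why keeping them all in registers matters.

```cpp
#include <cstddef>

// Hypothetical scalar direct form II biquad for interleaved stereo.
// Coefficients are shared by both channels; each channel has its own state.
class StereoBiquad
{
public:
    StereoBiquad(double b0, double b1, double b2, double a1, double a2)
        : b0_(b0), b1_(b1), b2_(b2), a1_(a1), a2_(a2)
    {
        w1_[0] = w1_[1] = 0.0;
        w2_[0] = w2_[1] = 0.0;
    }

    // Filters frames * 2 interleaved samples in place.
    void Process(double* samples, std::size_t frames)
    {
        for (std::size_t frame = 0; frame < frames; ++frame)
        {
            for (std::size_t channel = 0; channel < 2; ++channel)
            {
                double x = samples[2 * frame + channel];
                // w[n] = x[n] - a1 w[n-1] - a2 w[n-2]
                double w = x - a1_ * w1_[channel] - a2_ * w2_[channel];
                // y[n] = b0 w[n] + b1 w[n-1] + b2 w[n-2]
                double y = b0_ * w + b1_ * w1_[channel] + b2_ * w2_[channel];
                w2_[channel] = w1_[channel];
                w1_[channel] = w;
                samples[2 * frame + channel] = y;
            }
        }
    }

private:
    double b0_, b1_, b2_, a1_, a2_; // feedforward and feedback coefficients
    double w1_[2];                  // w[n-1] per channel
    double w2_[2];                  // w[n-2] per channel
};
```

In an SSE version a stereo frame fits in one xmm register as two doubles, so the inner channel loop would collapse into packed multiplies and adds.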
Compilation Behavior
- VEX encoded SSE (/arch:AVX) is 6-8% faster than original SSE instruction codes. However, Cross Time DSP ships with VEX disabled to support processors predating Sandy Bridge.
- C++/CLI compilation with maximum optimization (/Ox), intrinsics enabled (/Oi), or favoring fast code (/Ot) results in code a few percent slower than simply setting maximize speed (/O2). Programmer use of SSE intrinsics is roughly 3.5 times faster than the code the compiler generates on its own.
- C++ and C++/CLI compilation for AVX and AVX2 (/arch) results in code a few percent slower than compilation limited to SSE2 on a precise floating point model (/fp:precise). Fast floating point (/fp:fast) AVX2 is 0-1% faster than compiler generated fast floating point SSE2.
- Unrolling biquad and first order filter loops by 2x or 4x is 1-2% slower than not unrolling.
- C# compilation responds well to index offsets known at compile time, producing DSP code about 50% faster. C++/CLI and C++ compilation exhibit similar behavior. This is consistent with experience elsewhere.
- C# compilation for x64 results in DSP code about 10% faster than compilation for AnyCPU.
- Visual Studio 2015 Update 3's C# compiler generates DSP code of speed equivalent to comparable code in C++. C# is typically about 5% slower than equivalent C++.
- Code of the form *(pointer + offset) is no slower than *(pointer); pointer += stride. It may be 1-2% faster.
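The two addressing forms compared above can be sketched as follows. The function names and the gain operation are hypothetical stand-ins for a real DSP kernel. The first form uses offsets from a base pointer that are known at compile time; the second dereferences and then advances the pointer.

```cpp
#include <cstddef>

// Form 1: *(pointer + offset) with constant offsets 0 and 1 per stereo frame.
void ScaleStereoOffsets(double* samples, std::size_t frames, double gainLeft, double gainRight)
{
    double* end = samples + 2 * frames;
    for (double* frame = samples; frame != end; frame += 2)
    {
        *(frame + 0) *= gainLeft;
        *(frame + 1) *= gainRight;
    }
}

// Form 2: *(pointer); pointer += stride, advancing after each access.
void ScaleStereoIncrement(double* samples, std::size_t frames, double gainLeft, double gainRight)
{
    double* sample = samples;
    for (std::size_t frame = 0; frame < frames; ++frame)
    {
        *sample *= gainLeft;
        sample += 1;
        *sample *= gainRight;
        sample += 1;
    }
}
```

Both functions compute the same result; any speed difference depends on the compiler and target, so the percentages above should be re-measured rather than assumed for other toolchains.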