Possible performance loss with f32 arithmetic #91447
After a quick godbolt I find that My hypothesis is that this codegen issue is probably caused by the clang (trunk with
|
I can confirm that the problem here is the return type. Writing the result through an out-parameter instead:
pub fn sum(a: &Stats, b: &Stats, res: &mut Stats)
{
    res.x = a.x + b.x;
    res.y = a.y + b.y;
    res.z = a.z + b.z;
}
produces:
example::sum:
movss xmm0, dword ptr [rdi]
addss xmm0, dword ptr [rsi]
movss dword ptr [rdx], xmm0
movss xmm0, dword ptr [rdi + 4]
addss xmm0, dword ptr [rsi + 4]
movss dword ptr [rdx + 4], xmm0
movss xmm0, dword ptr [rdi + 8]
addss xmm0, dword ptr [rsi + 8]
movss dword ptr [rdx + 8], xmm0
ret
LLVM IR at -C opt-level=0:
%Stats = type { float, float, float }
define void @_ZN7example3sum17h258bf76ba01aa602E(%Stats* align 4 dereferenceable(12) %a, %Stats* align 4 dereferenceable(12) %b, %Stats* align 4 dereferenceable(12) %res) unnamed_addr #0 !dbg !6 {
%0 = bitcast %Stats* %a to float*, !dbg !10
%_4 = load float, float* %0, align 4, !dbg !10
%1 = bitcast %Stats* %b to float*, !dbg !11
%_5 = load float, float* %1, align 4, !dbg !11
%2 = bitcast %Stats* %res to float*, !dbg !12
%3 = fadd float %_4, %_5, !dbg !12
store float %3, float* %2, align 4, !dbg !12
%4 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 1, !dbg !13
%_6 = load float, float* %4, align 4, !dbg !13
%5 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 1, !dbg !14
%_7 = load float, float* %5, align 4, !dbg !14
%6 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 1, !dbg !15
%7 = fadd float %_6, %_7, !dbg !15
store float %7, float* %6, align 4, !dbg !15
%8 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 2, !dbg !16
%_8 = load float, float* %8, align 4, !dbg !16
%9 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 2, !dbg !17
%_9 = load float, float* %9, align 4, !dbg !17
%10 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 2, !dbg !18
%11 = fadd float %_8, %_9, !dbg !18
store float %11, float* %10, align 4, !dbg !18
ret void, !dbg !19
} |
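For readers who want to reproduce the comparison, here is a minimal sketch of the two shapes being discussed. The `Stats` layout (three `f32` fields) is inferred from the `%Stats = type { float, float, float }` line above; the function names `sum_by_value` and `sum_into` are hypothetical and only serve to keep both variants in one file. Which one generates better code depends on the compiler version, so treat this as something to paste into godbolt rather than as a recommendation.

```rust
// Assumed layout: three f32 fields, inferred from the LLVM IR type
// `%Stats = type { float, float, float }` shown in this thread.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stats {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

// Hypothetical name. Returns the result by value, which is the shape
// the thread reports as producing the questionable return type.
pub fn sum_by_value(a: &Stats, b: &Stats) -> Stats {
    Stats {
        x: a.x + b.x,
        y: a.y + b.y,
        z: a.z + b.z,
    }
}

// Hypothetical name. Writes the result through an out-parameter,
// mirroring the `sum(a, b, res)` variant quoted in the comment above.
pub fn sum_into(a: &Stats, b: &Stats, res: &mut Stats) {
    res.x = a.x + b.x;
    res.y = a.y + b.y;
    res.z = a.z + b.z;
}
```

Both functions compute the same values; only the way the result leaves the function differs, which is what the assembly listings in this thread are comparing.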
The question then becomes: why is rustc forcing an i96 as the return type? |
This seems to be the same regression as #85265 |
I think I just got hit by that in a real app: with nalgebra to perform a Gaussian blur, I noticed a slowdown. This really needs to be fixed; otherwise, using Rust for real-time computer graphics is a no-go! |
This should be fixed in #94570, can we confirm that? |
When will this PR be available in Nightly or is it already available? What about stable? |
It should be on nightly as of this exact moment. |
I've tested on the playground version with assembly output, and it still writes the returned values to the stack, whereas the C++ variant leaves them in xmm registers. It does, however, look a bit better than godbolt, which uses an even older nightly.
Here is the current assembly output from the playground (with 2022-03-04):
playground::sum: # @playground::sum
# %bb.0:
movss xmm0, dword ptr [rsi] # xmm0 = mem[0],zero,zero,zero
movss xmm1, dword ptr [rsi + 4] # xmm1 = mem[0],zero,zero,zero
addss xmm0, dword ptr [rdx]
addss xmm1, dword ptr [rdx + 4]
movss xmm2, dword ptr [rsi + 8] # xmm2 = mem[0],zero,zero,zero
addss xmm2, dword ptr [rdx + 8]
mov rax, rdi
movss dword ptr [rdi], xmm0
movss dword ptr [rdi + 4], xmm1
movss dword ptr [rdi + 8], xmm2
ret
# -- End function
Here is the C++ assembly for comparison (obtained from godbolt):
sum(Stats const&, Stats const&): # @sum(Stats const&, Stats const&)
movsd xmm1, qword ptr [rdi] # xmm1 = mem[0],zero
movsd xmm0, qword ptr [rsi] # xmm0 = mem[0],zero
addps xmm0, xmm1
movss xmm1, dword ptr [rdi + 8] # xmm1 = mem[0],zero,zero,zero
addss xmm1, dword ptr [rsi + 8]
ret |
Ah, the memory-passing thing is a known bug with its own issues tracking its nuances, so if this has been reduced to that, it is now a duplicate of those and I shall close this. In the meantime, it is usually a concern that inlining significantly reduces. Thank you for investigating, however! |
Here's a quick demonstration that the extra
.LBB1_4: # =>This Inner Loop Header: Depth=1
movups xmm1, xmmword ptr [rsi + rdx]
addps xmm0, xmm1
add rdx, 16
cmp rcx, rdx
jne .LBB1_4 |
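The point about inlining can be tried with a sketch like the following. It is not the code behind the loop shown above (that source is not in the thread); `Stats` is assumed to be three `f32` fields, and `sum` and `total` are hypothetical names. The idea is that once `sum` is inlined into a caller that loops over many values, the by-value return no longer has to cross a function boundary. Whether the compiler actually inlines or vectorizes this depends on the toolchain version and flags.

```rust
// Assumed layout: three f32 fields, as in the earlier comments.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stats {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

// By-value sum. `#[inline]` is only a hint; the compiler may or may
// not honor it.
#[inline]
pub fn sum(a: &Stats, b: &Stats) -> Stats {
    Stats {
        x: a.x + b.x,
        y: a.y + b.y,
        z: a.z + b.z,
    }
}

// Hypothetical caller: folds a slice into one accumulated value.
// If `sum` is inlined here, no struct is returned across a call
// boundary inside the loop.
pub fn total(items: &[Stats]) -> Stats {
    let zero = Stats { x: 0.0, y: 0.0, z: 0.0 };
    items.iter().fold(zero, |acc, s| sum(&acc, s))
}
```

Comparing the assembly of `total` with and without the `#[inline]` hint (or with `#[inline(never)]`) is one way to see how much of the cost comes from the call boundary.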
I've tried, out of curiosity, a floating point arithmetic test and found quite a big difference between C++ and Rust.
The code used in Rust
The code used in C++
Here is a link to a godbolt for side-by-side comparison of assembly output: https://godbolt.org/z/dqc4b74rv
Rust seems to absolutely want the floats back in e* registers instead of keeping them in xmm registers, whereas C++ leaves them in the xmm registers. In some cases it might be more advantageous to leave the floats in xmm registers for future operations on them rather than passing them back into the e* registers.