Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad codegen for cloning a boxed array #41160

Closed
arthurprs opened this issue Apr 8, 2017 · 13 comments
Closed

Bad codegen for cloning a boxed array #41160

arthurprs opened this issue Apr 8, 2017 · 13 comments
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@arthurprs
Copy link
Contributor

For this code https://is.gd/JbfHAI

#![crate_type = "lib"]
pub extern fn clone(bits: &mut Box<[u64; 1024]>) -> Box<[u64; 1024]> {
  Box::new(**bits)
}

It copies data to the stack first, before allocating and copying again to the heap.

.cfi_offset rbx, -16
	mov	rsi, qword ptr [rdi]
	lea	rdi, [rsp]
	mov	edx, 8192
	call	memcpy@PLT
	mov	edi, 8192
	mov	esi, 8
	call	__rust_allocate@PLT
	mov	rbx, rax
	test	rbx, rbx
	je	.LBB0_2
	lea	rsi, [rsp]
	mov	edx, 8192
	mov	rdi, rbx
	call	memcpy@PLT
	mov	rax, rbx
	add	rsp, 8192
	pop	rbx
	ret
@nagisa
Copy link
Member

nagisa commented Apr 8, 2017

Seems like inlining does not work good enough. box **bits produces good enough code.

@arthurprs arthurprs changed the title Bad code codegen for cloning a boxed array Bad codegen for cloning a boxed array Apr 8, 2017
@oyvindln
Copy link
Contributor

oyvindln commented May 11, 2017

EDIT: Not very relevant since mir optimisations are not used at the moment anyhow. So I tried to look into this with my limited knowledge of the compiler internals, both with the clone function from the first post in this issue, and a simple test function that simply returns a boxed array:

https://gist.github.com/oyvindln/bda3c6efb9c428c8bcf12acee8f3f223

Using -Z mir-opt-level=3, which enables mir inlining does seem to get rid of one memcpy call with my simplified test code, but not the second. If I understand the assembly correctly, it looks like without mir inlining, there is actually reserved space for 2 arrays on the stack (.cfi_def_cfa_offset is 416, vs 216), and one is memcpied to the other one. So, even though it looks to be inlined by LLVM, LLVM isn't able to get rid of the temporaries. The function using box syntax instead avoids the temporary stack usage entirely.

For the test code in the first post in this issue, the assembly output seems identical with and without mir inlining, though the mir is a bit different. The mir output seems to suggest that the copy propagation pass doesn't manage to turn this into this:

DEBUG:rustc_mir::transform::copy_prop: Considering destination local: _1
DEBUG:rustc_mir::transform::copy_prop: Can't copy-propagate local: source use is not an assignment

log of the pass

@oyvindln
Copy link
Contributor

A workaround for this issue, while box syntax is still unstable, would be for the compiler to special case Box::new(x) to box x. This may also have some use to avoid stack overflows with large boxed values for non-optimised builds.

@oyvindln
Copy link
Contributor

For the rust side, copy propagation of arrays in MIR is not happening simply because it is not implemented for the "Repeat" (array?) rvalue mir type. As for the inlining pass, I don't know if more should be done.

I don't know what's keeping LLVM from optimising this properly.

@arthurprs
Copy link
Contributor Author

Is it possible to fix this by using mir copy propagation for now?

@Mark-Simulacrum Mark-Simulacrum added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jun 23, 2017
@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017
@oyvindln
Copy link
Contributor

oyvindln commented Aug 30, 2017

So I found that adding *& like this (forces a move?):

pub fn create_boxed_array_deref() -> Box<[u16; 100]> {
    Box::new(*&[0u16; 100])
}

evades the extra stack copy. (This doesn't work with references.)
Using a const value as the parameter also avoids it:

pub fn create_boxed_array_deref() -> Box<[u16; 100]> {
    const A: [u16; 100] = [0;100];
    Box::new(A)
}

With an struct containing an array (at least a simple one), it there's some similar behavoiur:

pub struct WrappedArray {
    pub val: [u16; 100]
}

impl WrappedArray {
    fn new() -> WrappedArray {
        WrappedArray {
            val: [0; 100],
        }
    }
}

impl Default for WrappedArray {
    #[inline(always)]
    fn default() -> WrappedArray {
        WrappedArray {
            val: [100; 100],
        }
    }
}

impl Copy for WrappedArray {}

impl Clone for WrappedArray {
    fn clone(&self) -> WrappedArray { *self }
}

pub fn boxed_struct() -> Box<WrappedArray> {
    // Makes stack copies:
    Box::new(WrappedArray::new())
    Box::new(WrappedArray::new().clone())
    Box::new(WrappedArray { val: [0; 100] })

   // Does not make stack copies:
   const A: WrappedArray = WrappedArray { val: [0; 100] };
   Box::new(A)
   Box::new(WrappedArray { val: [0; 100] }.clone())
   Box::new(WrappedArray::default())
   Box::new(*&WrappedArray { val: [0; 100] })
   // And anything with box syntax
}

So there seems to be some oddness with copy-propagation of array types. I couldn't get any of this to work when copying from a reference like in the original example (other than doing box::clone()).

@nagisa
Copy link
Member

nagisa commented Aug 30, 2017

So I found that adding *& like this (forces a move?):

This promotes the array to a static variable rather than initialising it on the stack.

@oyvindln
Copy link
Contributor

oyvindln commented Sep 2, 2017

There seems to be some issues with eliminating array temporaries in general. Something as simple as this:

example
#[inline]
fn write_v(val: [u64;30], to: &mut [u64;30]) {
    *to = val;
}

pub fn write_to_array(to: &mut [u64;30]) {
    write_v([55;30], to);
    // This optimizes fine on the other hand:
    // *to = [55;30];
}
is enough to give the optimizer issues and results in a stack temporary.

EDIT: Even simpler:

pub fn write_to_array(to: &mut [u64;30]) {
    let v = [55;30];
    *to = v;
}

@oyvindln
Copy link
Contributor

oyvindln commented Sep 3, 2017

Suspecting it's the loop that fills the array that's causing issues: -Z llvm-debug gives


Can't analyze slices for alloca:   %v = alloca [30 x i64], align 8
  A pointer to this alloca escaped by:
    %6 = icmp eq i64* %5, %3

Running the llvm optimizer with -O3 directly on llvm-ir ouput from rustc with -C opt-level=1 get's rid of the stack temporary completely, so maybe there is a pass ordering issue here?
EDIT: Seems like that is due to the loop being unrolled.

@oyvindln
Copy link
Contributor

oyvindln commented Sep 9, 2017

It looks like both clang and gcc have issues with similar things in C++:
https://godbolt.org/g/dAs2dr

I would think making Box::new an intrinsic/compiler-built in might make sense here, as that would help avoid blowing up the stack in unoptimized code (otherwise there will be 3x the size of the type reserved on the stack.) though that doesn't solve the underlying issue.

@isegal
Copy link

isegal commented Dec 16, 2018

C++ does not have an issue when returning large arrays from functions due to copy elision (https://en.cppreference.com/w/cpp/language/copy_elision):
Ex: https://gcc.godbolt.org/z/Ci6jWs

Using an assignment loop would cause an extra memset in the beginning of the function:
https://gcc.godbolt.org/z/KKRjes

Rust unfortunately has the extra memcpy regardless:
https://gcc.godbolt.org/z/JapD8c

True copy elision would definitely be a killer feature in Rust because one would no longer have to worry about having to box anything for the sole purpose of avoiding the extra memcpy. Suppose such feature were to be implemented, it would be interesting to see the performance boost of existing Rust applications.

@isegal
Copy link

isegal commented Dec 16, 2018

Update: The above issue appears to be specific to the array initialization shorthand in Rust. I have filed it separately #56882

@jonas-schievink jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 31, 2020
@nikic
Copy link
Contributor

nikic commented Mar 13, 2021

https://rust.godbolt.org/z/dab87z Optimized well on nightly as a result of #82806.

@nikic nikic closed this as completed Mar 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants