Optimize large array creation in const-eval #120069
Conversation
r? @cjgillot (rustbot has picked a reviewer for you, use r? to override)

Some changes occurred to the CTFE / Miri engine cc @rust-lang/miri
Force-pushed from `036954b` to `4106d71`
```rust
// For particularly large arrays (where this is perf-sensitive) it's common that
// we're writing a single byte repeatedly. So, optimize that case to a memset.
if size_in_bytes == 1 && num_copies >= 1 {
    let value = *src_bytes;
```
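The excerpt above is truncated; a minimal sketch of how the fast path plausibly completes, assuming `dest_bytes` is a raw `*mut u8` into the destination allocation (illustrative, not the literal merged diff):

```rust
if size_in_bytes == 1 && num_copies >= 1 {
    // SAFETY (sketch): the single source byte is initialized, per the
    // `init.no_bytes_init()` check discussed below.
    let value = unsafe { *src_bytes };
    // One memset replaces `num_copies` single-byte memcpys.
    unsafe { dest_bytes.write_bytes(value, num_copies) };
}
```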
I think this is OK. We've checked above that the bytes are initialized (`if !init.no_bytes_init()`), so reading the byte should be safe.
@bors try @rust-timer queue
☀️ Try build successful - checks-actions
Finished benchmarking commit (7e0886b): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 663.454s -> 664.806s (0.20%)
```rust
let size_in_bytes = size.bytes_usize();
// For particularly large arrays (where this is perf-sensitive) it's common that
// we're writing a single byte repeatedly. So, optimize that case to a memset.
if size_in_bytes == 1 && num_copies >= 1 {
```
Can you generalize this to handle arrays of other types by extending this `size_in_bytes == 1` to any value where all bytes are equal? It would be nice if large arrays of `usize` or `u32`, or even an enum, could also benefit.
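A sketch of what such a generalization might check, using a hypothetical helper (this ignores the padding concern raised in the reply below):

```rust
/// Hypothetical: a value of any size can be memset if every byte of its
/// representation is identical (e.g. 0u32, 0usize, or [0xAB; 8]).
fn memset_byte(src: &[u8]) -> Option<u8> {
    let (&first, rest) = src.split_first()?;
    rest.iter().all(|&b| b == first).then_some(first)
}

// memset_byte(&0u32.to_ne_bytes()) == Some(0)
// memset_byte(&1u32.to_ne_bytes()) == None (the bytes differ)
```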
Most data-less enums would already benefit (since they are 1 byte in size)... and if they have data it seems super unlikely the bytes are all the same.
For the cases where this could work (e.g., `0u32`), I'm wary of #120069 (comment) -- not sure if there's a good, cheap way to check "is there padding in there?"

It might also be worth noting that anything significantly fancier here probably merits looking at replicating the algorithm in `slice::repeat` -- this was pointed out in earlier optimization work in this code, but not implemented at the time. For larger types I think that generalizes much better than trying to detect memset being possible...
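For reference, the doubling strategy in `slice::repeat` looks roughly like this (a standalone sketch over plain byte buffers, not the interpreter code):

```rust
/// Build `n` copies of `elem` by repeatedly doubling the filled prefix,
/// so the number of bulk copies is O(log n) instead of O(n).
fn repeat_bytes(elem: &[u8], n: usize) -> Vec<u8> {
    let total = elem.len() * n;
    let mut buf = Vec::with_capacity(total);
    if total == 0 {
        return buf;
    }
    buf.extend_from_slice(elem); // seed with one copy
    while buf.len() < total {
        // Copy as much of the existing prefix as still fits.
        let copy_len = buf.len().min(total - buf.len());
        buf.extend_from_within(..copy_len);
    }
    buf
}
```

For larger element types this does a logarithmic number of big memcpys instead of one memcpy per element, which is why it generalizes better than detecting whether a memset is possible.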
> not sure if there's a good, cheap way to check "is there padding in there?"

🤷 That probably deserves a comment that we've checked in a way that only works if the value is a single byte. If I'm doing the logic right, as-written, it's "are any bytes init", not "are all bytes init". I don't think I'd argue against doing that, specifically because this is one byte, but it's subtle.

`InitMask::is_range_initialized` seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth it to split the function into two, so that the one error path that wants to know where there are uninit bytes can call some new function.
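A toy illustration of the suggested split, with invented names and a trivially simple backing representation (the real `InitMask` is a compressed bitmask):

```rust
/// Toy stand-in for the interpreter's InitMask, only to illustrate the API split.
struct InitMask {
    bits: Vec<bool>, // true = byte is initialized
}

impl InitMask {
    /// Cheap yes/no query, for the many callers that only need a boolean.
    fn all_initialized(&self, start: usize, end: usize) -> bool {
        self.first_uninit(start, end).is_none()
    }

    /// Reports *where* the first uninit byte is, for the one caller
    /// that wants to build a detailed error.
    fn first_uninit(&self, start: usize, end: usize) -> Option<usize> {
        (start..end).find(|&i| !self.bits[i])
    }
}
```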
Yeah, a comment seems like a good idea. I've added that.
it's "are any bytes init" not "are all bytes init"
Yes, but if there's only one byte, that seems like the same thing, right?
InitMask::is_range_initialized seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth it to split the function into two, so that the one error path that wants to know where there are uninit bytes can call some new function.
Yeah, that seems like the way to go for a more complex change here. I'm still worried that adding extra checks is going to lead to regressions -- or at least more code to ship for little value. But 🤷 , I could see it being a win. This PR already makes all cases a very tight loop around the memcpy/memmove call for each value, which is probably pretty fast for not excessively large things, and those are almost always byte-like I suspect.
This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory.
Force-pushed from `4106d71` to `e68f303`
@bors r+ rollup=never. Yeah, let's try follow-up optimizations in follow-up PRs, not this one.
💔 Test failed - checks-actions

@bors retry

☀️ Test successful - checks-actions
Finished benchmarking commit (16fadb3): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count: This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 664.049s -> 663.017s (-0.16%)
```rust
if size_in_bytes == 1 && num_copies >= 1 {
    // SAFETY: `src_bytes` would be read from anyway by copies below (num_copies >= 1).
    // Since size_in_bytes = 1, then the `init.no_bytes_init()` check above guarantees
    // that this read at type `u8` is OK -- it must be an initialized byte.
```
This comment seems confused. SAFETY comments are about the safety of the interpreter itself. The bytes in question are always initialized in that sense. The check above is about whether they are initialized in the emulated interpreter state, i.e., whether the interpreted program initialized them.
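A toy illustration of that distinction, assuming nothing about the real types: the host-side buffer is plain `u8` (always initialized as far as Rust is concerned), while a separate mask tracks what the *interpreted* program has written:

```rust
// Host side: every u8 here is initialized Rust memory; reading it is always fine.
let bytes: Box<[u8]> = vec![0u8; 16].into_boxed_slice();

// Emulated side: a separate mask records which bytes the *guest* program wrote.
// Reading bytes[0] is safe for the interpreter, but if the mask says "uninit",
// the guest reading that byte is UB *in the emulated program*.
let init_mask = vec![false; 16];
assert_eq!(bytes[0], 0);
assert!(!init_mask[0]);
```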
Happy to edit it! I didn't find any obvious guarantee that the interpreter always initializes backing allocations before passing them around (at minimum, profiling suggests that's not happening with the destination buffer, so it would be bad to read that at type `u8` here?).
I would have expected that the emulated interpreter state having the bytes initialized implies the interpreter's state also has them initialized.
The `u8` buffer backing interpreter allocations is always initialized. It has type `u8` rather than `MaybeUninit<u8>`, so documenting that seems redundant.
Profiling seems to be misleading then. I think we are calling `calloc` (via `try_new_zeroed_slice`), which might do the zero-init via efficient copy-on-write tricks or so.
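That would explain the profile: with `calloc`, zeroed pages can come straight from the OS as copy-on-write mappings, so no byte is physically touched until it is first written. A small stable-Rust illustration of the same effect (in practice `vec![0u8; n]` lowers to `alloc_zeroed`, i.e. `calloc` on most platforms):

```rust
// Appears "allocated" instantly: the OS can hand back copy-on-write zero pages.
let n = 1usize << 30; // 1 GiB
let mut buf = vec![0u8; n];

// Writing is what actually faults pages in and consumes physical memory,
// which is why profiling may show the bytes as never touched until here.
buf[0] = 1;
assert_eq!(buf[0], 1);
```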
Perhaps! I'm 99% sure we're not touching every byte until we hit this code. I'll look into changing some things here then.
New allocations get created here (`compiler/rustc_middle/src/mir/interpret/allocation.rs`, lines 293 to 333 at 2ea7a37):
```rust
fn uninit_inner<R>(size: Size, align: Align, fail: impl FnOnce() -> R) -> Result<Self, R> {
    // We raise an error if we cannot create the allocation on the host.
    // This results in an error that can happen non-deterministically, since the memory
    // available to the compiler can change between runs. Normally queries are always
    // deterministic. However, we can be non-deterministic here because all uses of const
    // evaluation (including ConstProp!) will make compilation fail (via hard error
    // or ICE) upon encountering a `MemoryExhausted` error.
    let bytes = Bytes::zeroed(size, align).ok_or_else(fail)?;

    Ok(Allocation {
        bytes,
        provenance: ProvenanceMap::new(),
        init_mask: InitMask::new(size, false),
        align,
        mutability: Mutability::Mut,
        extra: (),
    })
}

/// Try to create an Allocation of `size` bytes, failing if there is not enough memory
/// available to the compiler to do so.
pub fn try_uninit<'tcx>(size: Size, align: Align) -> InterpResult<'tcx, Self> {
    Self::uninit_inner(size, align, || {
        ty::tls::with(|tcx| tcx.dcx().delayed_bug("exhausted memory during interpretation"));
        InterpError::ResourceExhaustion(ResourceExhaustionInfo::MemoryExhausted).into()
    })
}

/// Try to create an Allocation of `size` bytes, panics if there is not enough memory
/// available to the compiler to do so.
///
/// Example use case: To obtain an Allocation filled with specific data,
/// first call this function and then call write_scalar to fill in the right data.
pub fn uninit(size: Size, align: Align) -> Self {
    match Self::uninit_inner(size, align, || {
        panic!("Allocation::uninit called with panic_on_fail had allocation failure");
    }) {
        Ok(x) => x,
        Err(x) => x,
    }
}
```
That ends up calling this (`compiler/rustc_middle/src/mir/interpret/allocation.rs`, lines 48 to 53 at 2ea7a37):
```rust
fn zeroed(size: Size, _align: Align) -> Option<Self> {
    let bytes = Box::<[u8]>::try_new_zeroed_slice(size.bytes_usize()).ok()?;
    // SAFETY: the box was zero-allocated, which is a valid initial value for Box<[u8]>
    let bytes = unsafe { bytes.assume_init() };
    Some(bytes)
}
```
I am 100% sure that all these `u8` are initialized.
I filed a follow-up to fix the comments: #120387
This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory. It also optimizes the element-by-element copies to have a tighter loop; I'm pretty sure the old code was actually doing a multiply within each loop iteration.

For an 8GB array (`static SLICE: [u8; SIZE] = [0u8; 1 << 33];`) this takes us from ~23 seconds to ~6 seconds locally, which is spent roughly 50/50 in (a) memset to zero and (b) memcpy of the original place into a new place when popping the stack frame. The latter seems hard to avoid but is a big memcpy (since we're copying the type rather than initializing a region, so it's pretty fast), and the first is as good as it's going to get without special-casing constant-valued arrays.

Closes #55795. (That issue's references to lint checking don't appear true anymore, but I think this closes that case as something that is slow due to *time* pretty fully. An 8GB array taking only 6 seconds feels reasonable enough to not merit further tracking.)
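As an aside, the "tighter loop" claim above amounts to stepping a destination pointer instead of recomputing it with a multiply on each iteration. A sketch under assumed names, not the literal diff:

```rust
use std::ptr;

/// Sketch: copy `size` bytes from `src` into `num_copies` consecutive slots.
unsafe fn repeat_copy(src: *const u8, dest: *mut u8, size: usize, num_copies: usize) {
    // Old shape: a multiply inside every iteration.
    // for i in 0..num_copies {
    //     ptr::copy_nonoverlapping(src, dest.add(i * size), size);
    // }

    // New shape: step the pointer instead.
    let mut dest = dest;
    for _ in 0..num_copies {
        // SAFETY: caller guarantees both regions are valid and non-overlapping.
        unsafe {
            ptr::copy_nonoverlapping(src, dest, size);
            dest = dest.add(size);
        }
    }
}
```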