Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize large array creation in const-eval #120069

Merged
merged 1 commit into from
Jan 19, 2024

Conversation

Mark-Simulacrum
Copy link
Member

This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory. It also optimizes the element-by-element copies to have a tighter loop; I'm pretty sure the old code was actually doing a multiply within each loop iteration.

For an 8GB array (static SLICE: [u8; SIZE] = [0u8; 1 << 33];) this takes us from ~23 seconds to ~6 seconds locally, which is spent roughly 50/50 in (a) memset to zero and (b) memcpy of the original place into a new place, when popping stack frame. The latter seems hard to avoid but is a big memcpy (since we're copying the type rather than initializing a region, so it's pretty fast), and the first is as good as it's going to get without special casing constant-valued arrays.

Closes #55795. (That issue's references to lint checking don't appear true anymore, but I think this closes that case as something that is slow due to time pretty fully. An 8GB array taking only 6 seconds feels reasonable enough to not merit further tracking).

@rustbot
Copy link
Collaborator

rustbot commented Jan 17, 2024

r? @cjgillot

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 17, 2024
@rustbot
Copy link
Collaborator

rustbot commented Jan 17, 2024

Some changes occurred to the CTFE / Miri engine

cc @rust-lang/miri

// For particularly large arrays (where this is perf-sensitive) it's common that
// we're writing a single byte repeatedly. So, optimize that case to a memset.
if size_in_bytes == 1 && num_copies >= 1 {
let value = *src_bytes;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK. We've checked above that the bytes are initialized (if !init.no_bytes_init() {), so reading the byte should be safe.

@Mark-Simulacrum
Copy link
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 17, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 17, 2024
Optimize large array creation in const-eval

This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory. It also optimizes the element-by-element copies to have a tighter loop; I'm pretty sure the old code was actually doing a multiply within each loop iteration.

For an 8GB array (`static SLICE: [u8; SIZE] = [0u8; 1 << 33];`) this takes us from ~23 seconds to ~6 seconds locally, which is spent roughly 50/50 in (a) memset to zero and (b) memcpy of the original place into a new place, when popping stack frame. The latter seems hard to avoid but is a big memcpy (since we're copying the type rather than initializing a region, so it's pretty fast), and the first is as good as it's going to get without special casing constant-valued arrays.

Closes rust-lang#55795. (That issue's references to lint checking don't appear true anymore, but I think this closes that case as something that is slow due to *time* pretty fully. An 8GB array taking only 6 seconds feels reasonable enough to not merit further tracking).
@bors
Copy link
Contributor

bors commented Jan 17, 2024

⌛ Trying commit 4106d71 with merge 7e0886b...

@bors
Copy link
Contributor

bors commented Jan 17, 2024

☀️ Try build successful - checks-actions
Build commit: 7e0886b (7e0886b50c05ca6ece6a7f27e7ce7adf54c50b37)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (7e0886b): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-1.0% [-1.1%, -0.9%] 6
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-1.2% [-1.6%, -0.8%] 3
All ❌✅ (primary) - - 0

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
7.0% [7.0%, 7.0%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) - - 0

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-0.5% [-0.5%, -0.5%] 1
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -0.5% [-0.5%, -0.5%] 1

Bootstrap: 663.454s -> 664.806s (0.20%)
Artifact size: 308.30 MiB -> 308.35 MiB (0.02%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 17, 2024
let size_in_bytes = size.bytes_usize();
// For particularly large arrays (where this is perf-sensitive) it's common that
// we're writing a single byte repeatedly. So, optimize that case to a memset.
if size_in_bytes == 1 && num_copies >= 1 {
Copy link
Member

@saethlin saethlin Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you generalize this to handle arrays of other types by extending this size_in_bytes == 1 to any value where all bytes are equal? It would be nice if large arrays of usize or u32, or even an enum could also benefit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most data-less enums would already benefit (since they are 1 byte in size)... and if they have data it seems super unlikely the bytes are all the same.

For the cases where this could work (e.g., 0u32), I'm wary of #120069 (comment) -- not sure if there's a good, cheap way to check "is there padding in there?"

It might also be worth noting that anything significantly fancier here probably merits looking at replicating the algorithm in slice::repeat -- this was pointed out in earlier optimization work in this code, but not implemented at the time. For larger types I think that generalizes much better than trying to detect memset being possible...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if there's a good, cheap way to check "is there padding in there?"

🤷 That probably deserves a comment that we've checked in a way that only works if the value is a single byte. If I'm doing the logic right, as-written, it's "are any bytes init" not "are all bytes init". I don't think I'd argue against doing that specifically because this is one byte but it's subtle.

InitMask::is_range_initialized seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth it to split the function into two, so that the one error path that wants to know where there are uninit bytes can call some new function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a comment seems like a good idea. I've added that.

it's "are any bytes init" not "are all bytes init"

Yes, but if there's only one byte, that seems like the same thing, right?

InitMask::is_range_initialized seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth it to split the function into two, so that the one error path that wants to know where there are uninit bytes can call some new function.

Yeah, that seems like the way to go for a more complex change here. I'm still worried that adding extra checks is going to lead to regressions -- or at least more code to ship for little value. But 🤷 , I could see it being a win. This PR already makes all cases a very tight loop around the memcpy/memmove call for each value, which is probably pretty fast for not excessively large things, and those are almost always byte-like I suspect.

This changes repeated memcpy's to a memset for the case that we're
propagating a single byte into a region of memory.
@oli-obk
Copy link
Contributor

oli-obk commented Jan 18, 2024

@bors r+ rollup=never

yea let's try follow up optimizations in follow up PRs, not this one

@bors
Copy link
Contributor

bors commented Jan 18, 2024

📌 Commit e68f303 has been approved by oli-obk

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 18, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 19, 2024
Optimize large array creation in const-eval

This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory. It also optimizes the element-by-element copies to have a tighter loop; I'm pretty sure the old code was actually doing a multiply within each loop iteration.

For an 8GB array (`static SLICE: [u8; SIZE] = [0u8; 1 << 33];`) this takes us from ~23 seconds to ~6 seconds locally, which is spent roughly 50/50 in (a) memset to zero and (b) memcpy of the original place into a new place, when popping stack frame. The latter seems hard to avoid but is a big memcpy (since we're copying the type rather than initializing a region, so it's pretty fast), and the first is as good as it's going to get without special casing constant-valued arrays.

Closes rust-lang#55795. (That issue's references to lint checking don't appear true anymore, but I think this closes that case as something that is slow due to *time* pretty fully. An 8GB array taking only 6 seconds feels reasonable enough to not merit further tracking).
@bors
Copy link
Contributor

bors commented Jan 19, 2024

⌛ Testing commit e68f303 with merge 2617746...

@bors
Copy link
Contributor

bors commented Jan 19, 2024

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jan 19, 2024
@rust-log-analyzer
Copy link
Collaborator

A job failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

@saethlin
Copy link
Member

Error: fatal: unable to access 'https://github.com/rust-lang-ci/rust/': Could not resolve host: github.com

@bors retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 19, 2024
@bors
Copy link
Contributor

bors commented Jan 19, 2024

⌛ Testing commit e68f303 with merge 16fadb3...

@bors
Copy link
Contributor

bors commented Jan 19, 2024

☀️ Test successful - checks-actions
Approved by: oli-obk
Pushing 16fadb3 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jan 19, 2024
@bors bors merged commit 16fadb3 into rust-lang:master Jan 19, 2024
12 checks passed
@rustbot rustbot added this to the 1.77.0 milestone Jan 19, 2024
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (16fadb3): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
5.3% [5.3%, 5.3%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-0.7% [-0.7%, -0.7%] 1
All ❌✅ (primary) - - 0

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 664.049s -> 663.017s (-0.16%)
Artifact size: 308.24 MiB -> 308.29 MiB (0.01%)

if size_in_bytes == 1 && num_copies >= 1 {
// SAFETY: `src_bytes` would be read from anyway by copies below (num_copies >= 1).
// Since size_in_bytes = 1, then the `init.no_bytes_init()` check above guarantees
// that this read at type `u8` is OK -- it must be an initialized byte.
Copy link
Member

@RalfJung RalfJung Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems confused. SAFETY comments are about the safety of the interpreter itself. The bytes in question are always initialized in that sense. The check above is about whether they are initialized in the emulated interpreter state, i.e., whether the interpreted program initialized them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to edit it! I didn't obviously find any guarantees that the interpreter always initializes backing allocations before passing them around (seems like at minimum based on profiling that's not happening with the destination buffer, so it would be bad to read that at type u8 here?).

I would have expected that the emulated interpreter state having the bytes initialized implies the interpreter's state also has them initialized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The u8 buffer backing interpreter allocations is always initialized. It has type u8 rather than MaybeUninit<u8>, so documenting that seems redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Profiling seems to be misleading then. I think we are calling calloc (via try_new_zeroed_slice) which might do the zero-init via efficient copy-on-write tricks or so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps! I'm 99% sure we're not touching every byte until we hit this code. I'll look into changing some things here then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New allocations get created here:

fn uninit_inner<R>(size: Size, align: Align, fail: impl FnOnce() -> R) -> Result<Self, R> {
// We raise an error if we cannot create the allocation on the host.
// This results in an error that can happen non-deterministically, since the memory
// available to the compiler can change between runs. Normally queries are always
// deterministic. However, we can be non-deterministic here because all uses of const
// evaluation (including ConstProp!) will make compilation fail (via hard error
// or ICE) upon encountering a `MemoryExhausted` error.
let bytes = Bytes::zeroed(size, align).ok_or_else(fail)?;
Ok(Allocation {
bytes,
provenance: ProvenanceMap::new(),
init_mask: InitMask::new(size, false),
align,
mutability: Mutability::Mut,
extra: (),
})
}
/// Try to create an Allocation of `size` bytes, failing if there is not enough memory
/// available to the compiler to do so.
pub fn try_uninit<'tcx>(size: Size, align: Align) -> InterpResult<'tcx, Self> {
Self::uninit_inner(size, align, || {
ty::tls::with(|tcx| tcx.dcx().delayed_bug("exhausted memory during interpretation"));
InterpError::ResourceExhaustion(ResourceExhaustionInfo::MemoryExhausted).into()
})
}
/// Try to create an Allocation of `size` bytes, panics if there is not enough memory
/// available to the compiler to do so.
///
/// Example use case: To obtain an Allocation filled with specific data,
/// first call this function and then call write_scalar to fill in the right data.
pub fn uninit(size: Size, align: Align) -> Self {
match Self::uninit_inner(size, align, || {
panic!("Allocation::uninit called with panic_on_fail had allocation failure");
}) {
Ok(x) => x,
Err(x) => x,
}
}

That ends up calling this:

fn zeroed(size: Size, _align: Align) -> Option<Self> {
let bytes = Box::<[u8]>::try_new_zeroed_slice(size.bytes_usize()).ok()?;
// SAFETY: the box was zero-allocated, which is a valid initial value for Box<[u8]>
let bytes = unsafe { bytes.assume_init() };
Some(bytes)
}

I am 100% sure that all these u8 are initialized.

@Mark-Simulacrum Mark-Simulacrum deleted the fast-memcpy branch January 22, 2024 12:30
@RalfJung
Copy link
Member

I filed a follow-up to fix the comments: #120387

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

static array of zeroes can take minutes to lint check
9 participants