-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[loader-v2] Fixing global cache reads & read-before-write on publish #15285
Conversation
⏱️ 12h 18m total CI duration on this PR
🚨 1 job on the last run was significantly faster/slower than expected
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
0f7f5b2
to
3b8ce5c
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
enum ModuleRead<DC, VC, S> { | ||
/// Read from the cross-block module cache. | ||
GlobalCache, | ||
GlobalCache(Arc<ModuleCode<DC, VC, S>>), | ||
/// Read from per-block cache ([SyncCodeCache]) used by parallel execution. | ||
PerBlockCache(Option<(Arc<ModuleCode<DC, VC, S>>, Option<TxnIndex>)>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain why do we distinguish reads here based on where we got the data from? also what is Option<TxnIndex>
in the PerBlockCache ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option - module does not exist (in StateView even).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different reads - different validations. We need to check that global reads are still valid, and per-block reads have the same version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stupid formatting, didn't show I was referring to TxnIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - None is a storage version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different validation paths: for global cache read we need to check if the read is still valid in cache. For per-block we go to MVHashMap. Now, the question is about storage read: we issue it only when there is a cache miss in per-block cache, so it gets validated there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically "storage version" can be later drained into global cache, but otherwise exists only in per-block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so from validation perspective - there is no distinction
distinction is ONLY there to make updating global cache (i.e. draining to it) be faster/cheaper by skipping things that are already there.
is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually could be an useful thing to add as a brief comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
59e4942
to
dc8af3f
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
enum ModuleRead<DC, VC, S> { | ||
/// Read from the cross-block module cache. | ||
GlobalCache, | ||
GlobalCache(Arc<ModuleCode<DC, VC, S>>), | ||
/// Read from per-block cache ([SyncCodeCache]) used by parallel execution. | ||
PerBlockCache(Option<(Arc<ModuleCode<DC, VC, S>>, Option<TxnIndex>)>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so from validation perspective - there is no distinction
distinction is ONLY there to make updating global cache (i.e. draining to it) be faster/cheaper by skipping things that are already there.
is that correct?
@@ -661,7 +658,7 @@ where | |||
} | |||
|
|||
self.module_reads.iter().all(|(key, read)| match read { | |||
ModuleRead::GlobalCache => global_module_cache.contains_valid(key), | |||
ModuleRead::GlobalCache(_) => global_module_cache.contains_valid(key), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this whole match be equivalent to:
self.module_reads.iter().all(|(key, read)| {
let previous_version = match read {
ModuleRead::GlobalCache(_) => None, // i.e. storage version
ModuleRead::PerBlockCache(previous) => previous.as_ref().map(|(_, version)| *version);
};
let current_version = per_block_module_cache.get_module_version(key);
current_version == previous_version
})
why do we need to update GlobalCache at all while executing a block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do if we read first from it (to know if entry is overridden or not). An alternative is to check lower level cache first, but this means performance penalty due to locking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code can be somewhat equivalent, but:
let current_version = per_block_module_cache.get_module_version(key);
causes a prefetch of storage version by default. We would need to special case validation to not do it. An we also end up locking the cache (shard, worst case), instead of checking an atomic bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is because we may publish a module that invalidates the global cache that's being read I think
} | ||
|
||
// Otherwise, it is a miss. Check global cache. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we check global cache before checking state.versioned_map.module_cache ?
on rolling commit - are we updating GlobalCache itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We update global cache at rolling commit - if published keys exist in global cache, we mark them as invalid. So reads to them results in a cache miss and we fallback to MVHashMap where we have placed the write at commit time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can check versioned before, but then you end up acquiring a lock for potentially non-republished module (publish is rare). If 32 threads do this for aptos-framework, this is bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So instead, we lookup in global first, but check an atomic bool flag there (better than a lock), so we optimize for read case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, then I would rename PerBlockCache to UnfinalizedBlockCache or something like that - to make it clear it only ever refers to things before rolling commit, and GlobalCache is global and updated within the block itself.
(you can do that in separate PR of course :) )
(cherry picked from commit 18fb506fa23b166d524d91ed212644330e11444f)
8b61f06
to
1ed9b80
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…15285) - Enforces read-before-write for module publishes. - Records all module reads in captured reads, not just per-block. - Adds a workload + test to publish and call modules. Co-authored-by: Igor <[email protected]> (cherry picked from commit 0a16e9e)
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation and see the Github Action logs for details |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
✅ Forge suite
|
✅ Forge suite
|
✅ Forge suite
|
…15285) (#15298) - Enforces read-before-write for module publishes. - Records all module reads in captured reads, not just per-block. - Adds a workload + test to publish and call modules. Co-authored-by: Igor <[email protected]> (cherry picked from commit 0a16e9e) Co-authored-by: George Mitenkov <[email protected]>
Description
How Has This Been Tested?
To test, use
RUST_MIN_STACK=104857600 cargo test --release --package aptos-executor-benchmark --lib tests::test_publish_transaction
Commenting out
causes panic on not satisfying read-before-write. To test the captured read parts, inserting a panic after a single transaction is executed (instead of logging
"[aptos_vm] Transaction breaking invariant violation ... "
) is no longer triggerred. Increased the number of runs of a test to 10 to ensure we catch those cases.Key Areas to Review
Type of Change
Which Components or Systems Does This Change Impact?
Checklist