
[loader-v2] Fixing global cache reads & read-before-write on publish #15285

Merged: 5 commits merged into main from george/loader-fixes on Nov 18, 2024

Conversation

georgemitenkov (Contributor) commented Nov 15, 2024

Description

  • Capture global cache reads as well. Resolve first to captured reads (per transaction), then to the global cache, then to the per-block cache, and finally to the state view (a short sketch of this lookup order follows below).
  • Issue read-before-write for modules at commit.
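
A minimal, self-contained sketch of the lookup order described above; the types and function names here are illustrative placeholders, not the actual aptos-core API:

// Illustrative model of the read resolution order: captured reads (per
// transaction) -> global cache -> per-block cache -> state view, capturing
// the result so later reads and validation see it.
use std::collections::HashMap;
use std::sync::Arc;

type Key = String;
type Module = Arc<Vec<u8>>; // stand-in for Arc<ModuleCode<DC, VC, S>>

struct CapturedReads(HashMap<Key, Module>);
struct GlobalCache(HashMap<Key, Module>);
struct PerBlockCache(HashMap<Key, Module>);
struct StateView(HashMap<Key, Module>);

fn resolve_module(
    captured: &mut CapturedReads,
    global: &GlobalCache,
    per_block: &PerBlockCache,
    state_view: &StateView,
    key: &Key,
) -> Option<Module> {
    // 1. Reads already captured by this transaction win.
    if let Some(m) = captured.0.get(key) {
        return Some(m.clone());
    }
    // 2. Cross-block (global) cache, 3. per-block cache, 4. state view.
    let found = global
        .0
        .get(key)
        .or_else(|| per_block.0.get(key))
        .or_else(|| state_view.0.get(key))
        .cloned();
    // Capture the read so it can be validated later.
    if let Some(m) = &found {
        captured.0.insert(key.clone(), m.clone());
    }
    found
}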

How Has This Been Tested?

To test, use RUST_MIN_STACK=104857600 cargo test --release --package aptos-executor-benchmark --lib tests::test_publish_transaction

Commenting out

// Read-before-write enforcement: read the module's state value before it is
// written at commit; commenting this out makes the test below panic.
self.remote.read_state_value(&state_key).map_err(|err| {
    let msg = format!(
        "Error when enforcing read-before-write for module {}::{}: {:?}",
        addr, name, err
    );
    PartialVMError::new(StatusCode::STORAGE_ERROR).with_message(msg)
})?;

causes a panic because read-before-write is no longer satisfied. To test the captured-read changes, a panic was inserted after a single transaction is executed (in place of logging "[aptos_vm] Transaction breaking invariant violation ... "); it is no longer triggered. The number of test runs was increased to 10 to make sure we catch these cases.

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Nov 15, 2024

⏱️ 12h 18m total CI duration on this PR
Slowest 15 Jobs Cumulative Duration Recent Runs
execution-performance / single-node-performance 7h 3m 🟩🟩🟩🟩🟩 (+6 more)
execution-performance / test-target-determinator 49m 🟩🟩🟩🟩🟩 (+6 more)
test-target-determinator 32m 🟩🟩🟩 (+4 more)
check 28m 🟩🟩🟩 (+5 more)
rust-images / rust-all 19m 🟥🟩
check-dynamic-deps 18m 🟩🟩🟩🟩🟩 (+8 more)
rust-cargo-deny 16m 🟩🟩🟩🟩 (+5 more)
rust-move-tests 13m 🟩
rust-move-tests 13m 🟩
fetch-last-released-docker-image-tag 13m 🟩🟩🟩 (+4 more)
rust-move-tests 12m 🟩
rust-move-tests 12m 🟩
rust-move-tests 12m 🟩
rust-move-tests 10m
rust-move-tests 8m

🚨 1 job on the last run was significantly faster/slower than expected

Job Duration vs 7d avg Delta
execution-performance / single-node-performance 38m 16m +142%


georgemitenkov (Contributor, Author) commented Nov 15, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@georgemitenkov georgemitenkov marked this pull request as ready for review November 15, 2024 02:39
@georgemitenkov georgemitenkov requested review from msmouse and igor-aptos and removed request for sasha8 and danielxiangzl November 15, 2024 02:39
@georgemitenkov georgemitenkov added the CICD:run-e2e-tests, CICD:run-execution-performance-test, and CICD:run-execution-performance-full-test labels Nov 15, 2024


Comment on lines 298 to 305

 enum ModuleRead<DC, VC, S> {
     /// Read from the cross-block module cache.
-    GlobalCache,
+    GlobalCache(Arc<ModuleCode<DC, VC, S>>),
     /// Read from per-block cache ([SyncCodeCache]) used by parallel execution.
     PerBlockCache(Option<(Arc<ModuleCode<DC, VC, S>>, Option<TxnIndex>)>),
igor-aptos (Contributor) commented Nov 15, 2024:

can you explain why we distinguish reads here based on where we got the data from? also, what is the Option<TxnIndex> in the PerBlockCache?

georgemitenkov (Contributor, Author):

The Option is for when the module does not exist (not even in StateView).

georgemitenkov (Contributor, Author):

Different reads need different validations: we need to check that global cache reads are still valid, and that per-block reads still have the same version.

Contributor:

stupid formatting, didn't show I was referring to TxnIndex

georgemitenkov (Contributor, Author):

Ah, None is the storage version.

georgemitenkov (Contributor, Author):

Different validation paths: for global cache read we need to check if the read is still valid in cache. For per-block we go to MVHashMap. Now, the question is about storage read: we issue it only when there is a cache miss in per-block cache, so it gets validated there.

georgemitenkov (Contributor, Author):

Basically "storage version" can be later drained into global cache, but otherwise exists only in per-block

Contributor:

so from validation perspective - there is no distinction

distinction is ONLY there to make updating global cache (i.e. draining to it) be faster/cheaper by skipping things that are already there.

is that correct?

Contributor:

actually, this could be a useful thing to add as a brief comment

georgemitenkov (Contributor, Author):

Added a comment
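
For reference, a sketch of the kind of comment this could be (illustrative wording only, not a quote from the PR; types as in the excerpt above), summarizing the semantics discussed in this thread:

enum ModuleRead<DC, VC, S> {
    /// Read that went to the cross-block (global) module cache. Validated by
    /// checking that the cached entry is still marked valid (an atomic flag),
    /// not by consulting the MVHashMap; version-wise this corresponds to the
    /// storage version.
    GlobalCache(Arc<ModuleCode<DC, VC, S>>),
    /// Read that went to the per-block cache ([SyncCodeCache]) used by parallel
    /// execution. The outer Option is None if the module does not exist (not
    /// even in the StateView). The inner Option<TxnIndex> is None for the
    /// storage version and Some(idx) for a module published by transaction idx
    /// in this block. Validated against the MVHashMap.
    PerBlockCache(Option<(Arc<ModuleCode<DC, VC, S>>, Option<TxnIndex>)>),
}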


@georgemitenkov georgemitenkov force-pushed the george/loader-fixes branch 2 times, most recently from 59e4942 to dc8af3f on November 15, 2024 12:22
@georgemitenkov georgemitenkov changed the title Loader fixes [loader-v2] Fixing global cache reads & read-before-write on publish Nov 15, 2024



@@ -661,7 +658,7 @@ where
}

self.module_reads.iter().all(|(key, read)| match read {
-   ModuleRead::GlobalCache => global_module_cache.contains_valid(key),
+   ModuleRead::GlobalCache(_) => global_module_cache.contains_valid(key),
Contributor:

should this whole match be equivalent to:

        self.module_reads.iter().all(|(key, read)| {
            let previous_version = match read {
              ModuleRead::GlobalCache(_) => None, // i.e. storage version
              ModuleRead::PerBlockCache(previous) => previous.as_ref().map(|(_, version)| *version),
            };
            let current_version = per_block_module_cache.get_module_version(key);
            current_version == previous_version
        })

why do we need to update GlobalCache at all while executing a block?

georgemitenkov (Contributor, Author):

We do if we read from it first (to know whether the entry is overridden or not). An alternative is to check the lower-level cache first, but that means a performance penalty due to locking.

georgemitenkov (Contributor, Author):

The code can be somewhat equivalent, but:

let current_version = per_block_module_cache.get_module_version(key);

causes a prefetch of the storage version by default. We would need to special-case validation to avoid it. And we also end up locking the cache (a shard, in the worst case) instead of checking an atomic bool.

Contributor:

this is because we may publish a module that invalidates the global cache entry that's being read, I think

}

// Otherwise, it is a miss. Check global cache.
Contributor:

why do we check the global cache before checking state.versioned_map.module_cache?

on rolling commit - are we updating GlobalCache itself?

georgemitenkov (Contributor, Author):

We update the global cache at rolling commit: if published keys exist in the global cache, we mark them as invalid. Reads to them then result in a cache miss, and we fall back to the MVHashMap, where we have placed the write at commit time.
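
A self-contained sketch of that invalidation pattern under these assumptions (all names except contains_valid, which appears in the diff above, are hypothetical):

use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};

struct Entry<M> {
    module: M,
    valid: AtomicBool, // cheap validity flag checked instead of taking a lock
}

struct GlobalModuleCacheSketch<M> {
    entries: HashMap<String, Entry<M>>,
}

impl<M> GlobalModuleCacheSketch<M> {
    /// At rolling commit of a transaction that published modules, mark the
    /// published keys invalid; subsequent reads miss here and fall back to
    /// the MVHashMap, where the commit has placed the write.
    fn mark_invalid(&self, published_keys: &[String]) {
        for key in published_keys {
            if let Some(entry) = self.entries.get(key) {
                entry.valid.store(false, Ordering::Release);
            }
        }
    }

    /// Used when validating captured GlobalCache reads.
    fn contains_valid(&self, key: &str) -> bool {
        self.entries
            .get(key)
            .map_or(false, |e| e.valid.load(Ordering::Acquire))
    }
}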

georgemitenkov (Contributor, Author):

You can check the versioned map first, but then you end up acquiring a lock for a potentially non-republished module (publishing is rare). If 32 threads do this for aptos-framework, this is bad.

georgemitenkov (Contributor, Author):

So instead, we look up in the global cache first, but check an atomic bool flag there (better than a lock), so we optimize for the read case.
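
A minimal sketch of the read ordering being described, assuming a lock-free global cache guarded by an atomic validity flag and a lock-protected per-block cache (placeholder types, not the actual implementation):

use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Global, cross-block cache: a read only needs an atomic flag check.
struct GlobalEntry {
    code: Arc<Vec<u8>>,
    valid: AtomicBool,
}
type Global = HashMap<String, GlobalEntry>;

// Per-block cache: mutated during parallel execution, so it sits behind a
// lock (a shard lock in the real implementation).
type PerBlock = Mutex<HashMap<String, Arc<Vec<u8>>>>;

fn read_module(global: &Global, per_block: &PerBlock, key: &str) -> Option<Arc<Vec<u8>>> {
    // Fast path: for a module that was not republished (the common case),
    // many threads can read concurrently without touching the lock.
    if let Some(entry) = global.get(key) {
        if entry.valid.load(Ordering::Acquire) {
            return Some(entry.code.clone());
        }
    }
    // Slow path: entry invalidated (e.g. republished in this block) or absent;
    // only now pay for the lock on the per-block cache.
    per_block.lock().unwrap().get(key).cloned()
}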

Contributor:

I see, then I would rename PerBlockCache to UnfinalizedBlockCache or something like that - to make it clear it only ever refers to things before rolling commit, and GlobalCache is global and updated within the block itself.

(you can do that in separate PR of course :) )


@georgemitenkov georgemitenkov merged commit 0a16e9e into main Nov 18, 2024
76 of 99 checks passed
@georgemitenkov georgemitenkov deleted the george/loader-fixes branch November 18, 2024 04:24
github-actions bot pushed a commit that referenced this pull request Nov 18, 2024
…15285)

- Enforces read-before-write for module publishes.
- Records all module reads in captured reads, not just per-block.
- Adds a workload + test to publish and call modules.

Co-authored-by: Igor <[email protected]>
(cherry picked from commit 0a16e9e)
Contributor:

💚 All backports created successfully

Branch: aptos-release-v1.24

Questions? Please refer to the Backport tool documentation and see the GitHub Action logs for details.


Contributor:

✅ Forge suite realistic_env_max_load success on 1ed9b8012565a6779542044695db775941506e20

two traffics test: inner traffic : committed: 14235.58 txn/s, latency: 2794.71 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5412740
two traffics test : committed: 99.90 txn/s, latency: 1484.26 ms, (p50: 1400 ms, p70: 1500, p90: 1600 ms, p99: 1700 ms), latency samples: 1800
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.007, avg: 1.552", "ConsensusProposalToOrdered: max: 0.335, avg: 0.300", "ConsensusOrderedToCommit: max: 0.400, avg: 0.384", "ConsensusProposalToCommit: max: 0.697, avg: 0.684"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.89s no progress at version 2809632 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.70s no progress at version 2809630 (avg 8.70s) [limit 15].
Test Ok

Contributor:

✅ Forge suite framework_upgrade success on 2bb2d43037a93d883729869d65c7c6c75b028fa1 ==> 1ed9b8012565a6779542044695db775941506e20

Compatibility test results for 2bb2d43037a93d883729869d65c7c6c75b028fa1 ==> 1ed9b8012565a6779542044695db775941506e20 (PR)
Upgrade the nodes to version: 1ed9b8012565a6779542044695db775941506e20
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1326.01 txn/s, submitted: 1328.00 txn/s, failed submission: 1.99 txn/s, expired: 1.99 txn/s, latency: 2389.52 ms, (p50: 2100 ms, p70: 2400, p90: 3900 ms, p99: 5400 ms), latency samples: 119780
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1350.34 txn/s, submitted: 1353.03 txn/s, failed submission: 2.69 txn/s, expired: 2.69 txn/s, latency: 2274.82 ms, (p50: 2100 ms, p70: 2400, p90: 3300 ms, p99: 4600 ms), latency samples: 120480
5. check swarm health
Compatibility test for 2bb2d43037a93d883729869d65c7c6c75b028fa1 ==> 1ed9b8012565a6779542044695db775941506e20 passed
Upgrade the remaining nodes to version: 1ed9b8012565a6779542044695db775941506e20
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1474.30 txn/s, submitted: 1476.58 txn/s, failed submission: 2.28 txn/s, expired: 2.28 txn/s, latency: 2187.59 ms, (p50: 2100 ms, p70: 2400, p90: 3300 ms, p99: 4400 ms), latency samples: 129080
Test Ok

Contributor:

✅ Forge suite compat success on 2bb2d43037a93d883729869d65c7c6c75b028fa1 ==> 1ed9b8012565a6779542044695db775941506e20

Compatibility test results for 2bb2d43037a93d883729869d65c7c6c75b028fa1 ==> 1ed9b8012565a6779542044695db775941506e20 (PR)
1. Check liveness of validators at old version: 2bb2d43037a93d883729869d65c7c6c75b028fa1
compatibility::simple-validator-upgrade::liveness-check : committed: 14614.13 txn/s, latency: 1974.12 ms, (p50: 1800 ms, p70: 1900, p90: 2200 ms, p99: 5700 ms), latency samples: 559680
2. Upgrading first Validator to new version: 1ed9b8012565a6779542044695db775941506e20
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7722.40 txn/s, latency: 3563.27 ms, (p50: 3700 ms, p70: 4100, p90: 4900 ms, p99: 5300 ms), latency samples: 140560
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7780.68 txn/s, latency: 4143.45 ms, (p50: 4300 ms, p70: 4500, p90: 6000 ms, p99: 6200 ms), latency samples: 255280
3. Upgrading rest of first batch to new version: 1ed9b8012565a6779542044695db775941506e20
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7472.70 txn/s, latency: 3675.21 ms, (p50: 4100 ms, p70: 4500, p90: 4800 ms, p99: 5000 ms), latency samples: 135820
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7379.51 txn/s, latency: 4315.75 ms, (p50: 4500 ms, p70: 4600, p90: 6600 ms, p99: 6800 ms), latency samples: 245020
4. upgrading second batch to new version: 1ed9b8012565a6779542044695db775941506e20
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 12147.19 txn/s, latency: 2317.41 ms, (p50: 2600 ms, p70: 2600, p90: 2800 ms, p99: 2900 ms), latency samples: 208120
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6101.60 txn/s, submitted: 6101.79 txn/s, expired: 0.19 txn/s, latency: 2667.60 ms, (p50: 2600 ms, p70: 2800, p90: 3000 ms, p99: 3500 ms), latency samples: 386948
5. check swarm health
Compatibility test for 2bb2d43037a93d883729869d65c7c6c75b028fa1 ==> 1ed9b8012565a6779542044695db775941506e20 passed
Test Ok

ibalajiarun pushed a commit that referenced this pull request Nov 18, 2024
…15285) (#15298)

- Enforces read-before-write for module publishes.
- Records all module reads in captured reads, not just per-block.
- Adds a workload + test to publish and call modules.

Co-authored-by: Igor <[email protected]>
(cherry picked from commit 0a16e9e)

Co-authored-by: George Mitenkov <[email protected]>
Labels: CICD:run-e2e-tests, CICD:run-execution-performance-full-test, CICD:run-execution-performance-test, v1.24