Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial support for parallel Soroban phase XDR. #4364

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dmkozh
Copy link
Contributor

@dmkozh dmkozh commented Jun 26, 2024

Description

I've tried to minimize the scope of the changes; specifically this doesn't contain any actual logic for the parallel execution (such as data dependency validation and building parallel stages). However, there is still some refactoring that needed to happen in order to support new, more complex tx sets.

I've also removed some legacy surge pricing logic and the corresponding tests - we should no longer need the surge pricing logic that covers more than one source account per tx set.

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@dmkozh dmkozh force-pushed the parallel_txset2 branch from 073814b to 1ca680d Compare June 26, 2024 16:41
Copy link
Contributor

@sisuresh sisuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an initial review

src/herder/TxSetFrame.cpp Outdated Show resolved Hide resolved
src/herder/TxSetFrame.cpp Outdated Show resolved Hide resolved
src/herder/TxSetFrame.cpp Show resolved Hide resolved
src/herder/TxSetFrame.h Outdated Show resolved Hide resolved
src/herder/TxSetFrame.h Outdated Show resolved Hide resolved
src/main/main.cpp Outdated Show resolved Hide resolved
@marta-lokhova marta-lokhova self-requested a review June 28, 2024 23:23
Copy link
Contributor

@marta-lokhova marta-lokhova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, parallel application of Soroban txs is exciting! I did an initial preliminary review. My main question is around scope and motivation:

  • It's a bit hard to imagine how the refactored tx set is going to be used, because ledger close logic isn't wired up yet. As a result, I can't tell if the access/usage patterns are right in the context of multi-threading. Is there a reason to introduce apply-order changes before parallel application?
  • (related) can you clarify the scope of this PR? if it's just refactoring tx set to be more general, then it'd be easier to make TxSetFrame more generalized internally, but continue exposing old API to make it easy for consumers. Changes to application order should probably be an isolated protocol change. This should help ship this work incrementally.
  • There are some changes that I think should be factored out into a separate PR (surge pricing, test cleanup). There is no reason to block those changes on tx set refactor.

src/herder/TxSetUtils.cpp Outdated Show resolved Hide resolved
@@ -154,11 +179,10 @@ TxSetUtils::buildAccountTxQueues(TxSetTransactions const& txs)
return queues;
}

TxSetTransactions
TxSetUtils::getInvalidTxList(TxSetTransactions const& txs, Application& app,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here that one transaction per account is expected (or potentially pass AccountID->TxFrame map)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not the expectation here (yet), but probably that should be changed. I'll work on this in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the cleanup we only support nomination of 1 tx/account, so I don't think the comment is necessary in this particular place.

src/herder/TxSetFrame.cpp Outdated Show resolved Hide resolved
auto const& txSetV1 = txSet.v1TxSet();
// There was no protocol with 1 phase, so checking for 2 phases only
if (txSetV1.phases.size() != static_cast<size_t>(TxSetPhase::PHASE_COUNT))
// There was no protocol with 1 phase, so checking for 2 perPhaseTxs only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phases was a bit more clear than perPhaseTxs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's just find-and-replace gone too far, sorry.

FWIW concerning the variable naming, perPhaseTxs is semantically different from phases. Specifically we have transactions that may be included into a phase and we also have the 'phase' itself (TxSetPhaseFrame) and phase might have additional internal structure, additional data etc.


Iterator(std::variant<TxFrameList, TxStageFrameList> const& txs,
size_t stageIndex, size_t txIndex);
std::variant<TxFrameList, TxStageFrameList> const& mTxs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a variant here? Legacy main-thread application is just a special case of multi-thread parallel application (i.e. it's a single stage).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not quite the same conceptually, but I think you're right and we could just generalize the implementation (it helps that the phase frame is immutable and we shouldn't need to make the runtime assertions about the phase structure). I'll give it a shot after taking out the surge pricing changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to not use variant.

}

bool
validateNonParallelPhaseXDRStructure(TransactionPhase const& phase)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit on naming: we're introducing "parallel" concepts when there's no parallelization yet, which is kind of confusing for the reader. Can we make terminology implementation-agnostic? Really the concept we want here is conflict-free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to name that TBH. I guess I could just say 'v0' phase, would that be more readable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v0 is fine here, but my comment was more about the naming throughout the code. Might be nice to replace all mentions of parallel with conflict-free, and also add a comment that there isn't any parallelization yet to make it clear for the readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's not quite conflict-free. I'm fine with using something like v0 or maybe 'sequential' for the existing phases, but I'm not sure if there is a better term than 'parallel' for the new phase, because it is designed to be applicable in parallel (unlike the existing phase). I'm not sure we need to obfuscate that fact just because there is no followup work merged yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used 'sequential' for v0 phase. Still not sure if there is a better way to name the parallel phase.

@@ -519,15 +519,15 @@ closeLedgerOn(Application& app, uint32 ledgerSeq, TimePoint closeTime,
}
else
{
TxSetTransactions classic;
TxSetTransactions soroban;
TxFrameList classic;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding XDR changes:

  • I feel uneasy aboutledgerMaxParallelThreads: it's baking an implementation detail into the protocol (there's already enough complexity in config settings). I'm not certain how we plan to enforce this: a validator can bypass this check. Also, there are other ways to fall off the network even with this requirement satisfied (e.g. using CPUs that aren't powerful enough, etc). I think node quality shouldn't be enforced via protocol like this, but rather via qsets/social pressure.
  • Where is ParallelTxExecutionStage defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel uneasy aboutledgerMaxParallelThreads: it's baking an implementation detail into the protocol

It is not an implementation detail, but a part of the protocol. We have to define the lowest number of threads that we should be sure that every tier 1 validator can afford running in parallel, otherwise there is really not much point to all the parallelization work.

We indeed can't (and don't need to) enforce the number of threads to use for applying the tx set (less can definitely be used and more threads could be used depending on tx set composition, e.g. to speedup catchup). This is just the number of 'logical' threads in the tx set. Transactions in different 'logical' threads may not have data dependencies on each other and no logical thread can exceed the ledger-wide resource limit.

FWIW the actual config plumbing will be in a separate PR; this should be an XDR-only change for now.

Where is ParallelTxExecutionStage defined?

All the txset-related changes are defined in next branch of stellar-ledger.h: https://github.com/stellar/stellar-xdr/blob/619d33e14013712415e5218ca7f5fadb5524c948/Stellar-ledger.x#L210

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm still pretty confused about this parameter.

It is not an implementation detail, but a part of the protocol. We have to define the lowest number of threads that we should be sure that every tier 1 validator can afford running in parallel, otherwise there is really not much point to all the parallelization work.

What I mean is that we're adding something that is an implementation detail to the protocol. There are ways Tier1 can lose sync due to wrong hardware today. This is why we have recommended SKUs and validators collectively decide what hardware makes sense for them.

We indeed can't (and don't need to) enforce the number of threads to use for applying the tx set (less can definitely be used and more threads could be used depending on tx set composition, e.g. to speedup catchup). This is just the number of 'logical' threads in the tx set. Transactions in different 'logical' threads may not have data dependencies on each other and no logical thread can exceed the ledger-wide resource limit,

So it this config about hardware or tx set properties? You mention above that the config is needed to make sure nodes can support parallelization and not fall out of sync, but also the config isn't really enforced and it's only supposed to indicate the number of conflict-free lanes in a tx set (and the tx set itself can be applied with more or less threads anyway)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that we're adding something that is an implementation detail to the protocol.

This is as much of an implementation detail as a transaction count limit or per-ledger instruction limit for Soroban. Of course, tier 1 validators can lose sync if they're under the necessary spec, but that doesn't mean that we can apply arbitrarily large tx sets.

This is why we have recommended SKUs and validators collectively decide what hardware makes sense for them.

Right, that's why this is a config setting - network votes on the minimum necessary parallelization support (similarly to how it can vote on the operation count/instruction limits).

So it this config about hardware or tx set properties?

The config defines the number of logical threads in tx set. We could name it in a more obfuscated fashion, but I don't see much value in that, because in the end the value still maps to the minimal number of physical threads a validator must support.

but also the config isn't really enforced

It will be enforced - this PR is already quite large and config plumbing is not a part of it, as well as building tx sets for nomination.

the tx set itself can be applied with more or less threads anyway)

Sure, but the default implementation that we expect validators to use is to spawn as many threads as we have in the tx set and these threads should be able to run on the physical cores. Again, in theory a validator can sleep for 100 milliseconds after applying every transaction, but we do expect every reasonable validator to try and apply transactions as fast as possible. Threads are similar - slow hardware will still be able to apply the tx sets, but it will be lagging behind the network and there is really no way around that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as much of an implementation detail as a transaction count limit or per-ledger instruction limit for Soroban. Of course, tier 1 validators can lose sync if they're under the necessary spec, but that doesn't mean that we can apply arbitrarily large tx sets.

Sorry, I don't agree with this at all. ledger-wide limits are about bounding the amount of work needed to produce a single ledger, and preventing things like inserting an infinite loop into your contract. Number of threads is a performance tuning knob. It's not any different from any other performance knobs we have (do overlay in the background, do evictions faster, etc), and can cause nodes to be faster or slower. So the question is why should it be in the protocol?

It will be enforced - this PR is already quite large and config plumbing is not a part of it, as well as building tx sets for nomination.

How do we plan to enforce this? I don't quite understand how validators are going to prove that they meet the hardware requirements. I'm also not sure what happens if there's a protocol vote and some validators don't satisfy the new requirement (but that validator is still in everyone's quorum)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ledger-wide limits are about bounding the amount of work needed to produce a single ledger

Exactly, the thread limit has the same purpose of bounding the amount of work per-ledger on a reasonable validator. Currently 'reasonable' means 'reasonably modern CPU', with parallelization we obviously have to require a certain number of CPU cores as well (otherwise there is really no point in doing that).

Number of threads is a performance tuning knob.

It is not a performance tuning knob, it's a ledger capacity tuning knob that works in the same way as ledger-wide instruction limit.

So the question is why should it be in the protocol?

Because we need to be able to bound the ledger capacity at the protocol level. The point of parallelization is not to improve the performance by an unknown factor, it is about increasing the ledger instruction capacity without (significantly) changing the apply (and thus ledger close) time.

How do we plan to enforce this?

This configuration setting regulates the tx set validation - a valid tx set has to have no more than that many logical threads of independent transactions.

I don't quite understand how validators are going to prove that they meet the hardware requirements.

They don't need to prove that per-se, but we'll need to figure out the lowest common value that everyone thinks is reasonable. That value will need to be accounted for when setting the minimal hardware requirements for a validator. Again, there is no point in any parallelization work if we expect majority of the network to run on a single CPU. But we don't think that's the case, do we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the confusion is around why this parameter is about bounding work. Tuning the number of threads means tuning how work is executed, but not changing the amount of it. So depending on this number, you can execute a block faster or slower. For tx set construction, I suppose the limit is about bounding total parallel work threads? In practice, this doesn't really mean much, does it? We can't enforce it on validators. And there are all the weird scenarios I was asking about earlier where the network changes the parameter, and now some Tier1 nodes aren't configured "correctly" (another scenario is a tier1 node being rebuilt on a different instance type transparently, for example).

Again, there is no point in any parallelization work if we expect majority of the network to run on a single CPU

The way I think about this is that of course the network will eventually get to a place where most nodes are running the right hardware to take full advantage of parallelization (otherwise, they get removed from quorum or lose sync). It just feels wrong (and not really enforceable) to have these things in the protocol.

Also, these types of things should be discussed and debated at avenues like the protocol meeting, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the confusion is around why this parameter is about bounding work.

Yeah, the parameter is not bounding work directly, but it's bounding the amount of modelled instructions per ledger given a machine with at least that many physical cores. From the protocol standpoint it's definition is 'tx set should have at most that many data-independent logical threads', i.e. a value of N guarantees that a machine with N cores can always execute a tx set in such a way that every core executes at most ledgerMaxInstructions modelled instructions. It does not (and can not) guarantee the real apply time, but the expectation is that given a machine with sufficient number of cores the apply time will always be within a low-variance factor of the 'modelled' instruction count. Machines with lower number of cores may sometimes have a factor that is 2+ times larger than the modelled time, which likely will result in 2+ times longer apply time than in machines with sufficient number of cores.

In order for parallelization to work properly, majority of nodes will need to have at least a few (2-3) physical cores more than this setting specifies. If the network votes for a value that is too high, the ledger close time will go up and likely will result in re-vote for a lower value, which is why it would make sense to reach consensus on the minimum requirement for a number of validator cores offline before voting for this.

Also, these types of things should be discussed and debated at avenues like the protocol meeting, right?

Indeed, but this is an experiment/prototype. If it is successful, we'll present the CAP as well as the initial results. If not, we'll keep iterating on this. We'll still have plenty of time for protocol discussions as this is not planned for protocol 22.

@@ -647,10 +647,6 @@ testTxSetWithFeeBumps(uint32 protocolVersion)

TEST_CASE("txset", "[herder][txset]")
{
SECTION("protocol 13")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just checking: this is fine to remove because the test is only testing block construction (so none of this logic is used in the replay)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we shouldn't need to create tx sets for protocol 13 anymore, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't create new protocol 13 tx sets, but we need to be careful as to not accidentally remove replay test coverage (replay also uses TxSetFrame and the whole ledgerClose flow). That being said, I agree all tests covering surge pricing and block construction can be removed because replay just uses ready tx sets directly from the archives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra tests should have been removed in the cleanup PR, I only removed one helper here that is no longer used.

src/ledger/LedgerManagerImpl.cpp Show resolved Hide resolved
// to determine a stable index of every transaction in the phase, even if
// the phase is parallel and can have certain transaction applied in
// arbitrary order.
class Iterator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit premature: ledger close isn't yet hooked up to do anything in parallel, so it's hard to tell how this is going to be used and whether any synchronization needs to be taken into account. I'm curious about the motivation and scope of these changes, given that they seem to be preceding actual parallel application changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changeset in general has to precede the actual application changes, because in order to apply a tx set one needs to have an appropriate schedule. Moreover, the purpose of iterator is to actually hide the internal structure of the phase for the scenarios where we don't care about the parallel execution at all (e.g. when we just want to check if every tx in the phase is valid etc.).

@marta-lokhova
Copy link
Contributor

(I should mention that this is my first time tapping into soroban parallelization work, so please bear with me if I'm missing context)

@dmkozh
Copy link
Contributor Author

dmkozh commented Jul 10, 2024

It's a bit hard to imagine how the refactored tx set is going to be used, because ledger close logic isn't wired up yet. As a result, I can't tell if the access/usage patterns are right in the context of multi-threading. Is there a reason to introduce apply-order changes before parallel application?

This PR actually has to go first, because we can't properly wire up the parallel application logic if we don't have a proper schedule to use. FWIW there is no multi-threading here at all; the only meaningful interface change is that we'll expose the 'parallel' application schedule from the respective phase (instead of a transaction sequence). I probably mentioned this elsewhere, but to be clear we have to build a tx set in a proper fashion in order to be able to apply it in parallel (specifically we need to run the transactions with data dependencies in the same thread or different stages).

(related) can you clarify the scope of this PR? if it's just refactoring tx set to be more general, then it'd be easier to make TxSetFrame more generalized internally, but continue exposing old API to make it easy for consumers. Changes to application order should probably be an isolated protocol change. This should help ship this work incrementally.

The idea here is to provide support for a new phase type in the TxSetFrame; I've tried to limit the API changes as much as possible and basically the main change is that instead of exposing all the transactions at once we expose the phases separately (that just required an additional loop in a few places).

I realize that some stuff could be moved to separate PRs; I'll at least move all the surge-pricing related stuff. That said, I'm not sure the interface changes would make much sense without at least the stub for supporting the new XDR (otherwise it's hard to tell what's the motivation of the changes and whether they look reasonable). I could try separating that as well, but I'm not sure it will help reviewing the design decisions.

@dmkozh dmkozh force-pushed the parallel_txset2 branch 4 times, most recently from f38272f to fcf19f2 Compare August 9, 2024 15:17
Copy link
Contributor Author

@dmkozh dmkozh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rebased and updated the PR. There is no rush to merge it into the minor release, so I suppose it'll need to wait more.

@@ -154,11 +179,10 @@ TxSetUtils::buildAccountTxQueues(TxSetTransactions const& txs)
return queues;
}

TxSetTransactions
TxSetUtils::getInvalidTxList(TxSetTransactions const& txs, Application& app,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the cleanup we only support nomination of 1 tx/account, so I don't think the comment is necessary in this particular place.

src/herder/TxSetUtils.cpp Outdated Show resolved Hide resolved

Iterator(std::variant<TxFrameList, TxStageFrameList> const& txs,
size_t stageIndex, size_t txIndex);
std::variant<TxFrameList, TxStageFrameList> const& mTxs;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to not use variant.

}

bool
validateNonParallelPhaseXDRStructure(TransactionPhase const& phase)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used 'sequential' for v0 phase. Still not sure if there is a better way to name the parallel phase.

@@ -647,10 +647,6 @@ testTxSetWithFeeBumps(uint32 protocolVersion)

TEST_CASE("txset", "[herder][txset]")
{
SECTION("protocol 13")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra tests should have been removed in the cleanup PR, I only removed one helper here that is no longer used.

@dmkozh dmkozh force-pushed the parallel_txset2 branch 3 times, most recently from ac3b7db to 93c9541 Compare September 18, 2024 17:41
@dmkozh dmkozh force-pushed the parallel_txset2 branch 4 times, most recently from 8bd3d36 to b930f32 Compare November 6, 2024 22:34
@dmkozh
Copy link
Contributor Author

dmkozh commented Nov 7, 2024

This PR should be rebased on top of the most recent changes, including the tx set nomination tests that should add some confidence in the refactoring.

@dmkozh dmkozh force-pushed the parallel_txset2 branch 3 times, most recently from 8ce0064 to 6afbc17 Compare November 27, 2024 19:30
I've tried to minimize the scope of the changes; specifically this doesn't contain any actual logic for the parallel execution (such as data dependency validation and building parallel stages). However, there is still some refactoring that needed to happen in order to support new, more complex tx sets.
src/main/main.cpp Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants