Reduce peer message traffic for ledger data #5126

Open
wants to merge 21 commits into
base: develop
from

Conversation

ximinez
Collaborator

@ximinez ximinez commented Sep 11, 2024

High Level Overview of Change

Several changes to help reduce message traffic and improve logging and visibility.

  • Suppress duplicate TMGetLedger and TMLedgerData messages, reducing the overhead of processing those messages.
  • Reduce the number of those messages sent to peers.
  • Improve logging related to those messages, as well as a few other areas.
  • Introduce a new protocol version to gate a new feature on the TMLedgerData message, which allows multiple identical requests to be answered with one message.

These changes are split into several commits, each covering one logical, functional change. They can be merged as-is, or squashed.

Context of Change

Analysis of the issue that led to #5115 identified heavy TMGetLedger request and TMLedgerData response traffic between nodes leading up to the syncing incidents. It was later determined that those messages were more a symptom of the problem, and not the root cause. However, leading up to identification of the root cause, these changes were being implemented to cut down on those messages, detect duplicates, etc. That reduction in unnecessary traffic is still valuable, so it's being included here.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance (increase or change in throughput and/or latency)

API Impact

None.

Test Plan

  1. Verify that nodes are at least as reliable at staying in sync as they are now.
  2. Verify that the indicated messages are sent less frequently by upgraded nodes.
  3. Verify that peers that both have this change send fewer ledger messages to each other.

Future Tasks

I am still working on a follow-up to #4764 that makes use of these changes and other improvements to reduce the number of requests initiated by a given node in the first place.

@ximinez ximinez added the Perf Attn Needed Attention needed from RippleX Performance Team label Sep 11, 2024

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 23.00469% with 328 lines in your changes missing coverage. Please review.

Project coverage is 77.6%. Comparing base (47b0543) to head (75fcca5).

Files with missing lines Patch % Lines
src/xrpld/overlay/detail/PeerImp.cpp 0.0% 215 Missing ⚠️
src/xrpld/overlay/detail/ProtocolMessage.h 0.0% 32 Missing ⚠️
src/xrpld/app/misc/NetworkOPs.cpp 21.4% 22 Missing ⚠️
src/xrpld/overlay/detail/PeerSet.cpp 20.8% 19 Missing ⚠️
src/xrpld/app/ledger/detail/InboundLedgers.cpp 67.9% 17 Missing ⚠️
src/xrpld/app/misc/HashRouter.cpp 50.0% 6 Missing ⚠️
src/xrpld/app/ledger/detail/InboundLedger.cpp 54.5% 5 Missing ⚠️
src/xrpld/app/ledger/InboundLedger.h 60.0% 4 Missing ⚠️
src/xrpld/app/ledger/detail/LedgerMaster.cpp 0.0% 3 Missing ⚠️
include/xrpl/basics/CanProcess.h 88.9% 2 Missing ⚠️
... and 2 more
Additional details and impacted files


@@            Coverage Diff            @@
##           develop   #5126     +/-   ##
=========================================
- Coverage     77.9%   77.6%   -0.2%     
=========================================
  Files          784     785      +1     
  Lines        66681   67006    +325     
  Branches      8157    8299    +142     
=========================================
+ Hits         51923   52020     +97     
- Misses       14758   14986    +228     
Files with missing lines Coverage Δ
include/xrpl/basics/base_uint.h 96.8% <100.0%> (+<0.1%) ⬆️
include/xrpl/protocol/LedgerHeader.h 100.0% <ø> (ø)
src/xrpld/app/ledger/detail/TimeoutCounter.cpp 88.4% <100.0%> (+1.3%) ⬆️
src/xrpld/app/ledger/detail/TimeoutCounter.h 100.0% <ø> (ø)
src/xrpld/app/misc/NetworkOPs.h 100.0% <ø> (ø)
src/xrpld/overlay/Peer.h 100.0% <ø> (ø)
src/xrpld/overlay/detail/PeerImp.h 13.6% <ø> (ø)
src/xrpld/overlay/detail/ProtocolVersion.cpp 86.4% <ø> (ø)
src/xrpld/app/consensus/RCLConsensus.cpp 65.4% <0.0%> (ø)
include/xrpl/basics/CanProcess.h 88.9% <88.9%> (ø)
... and 10 more

... and 9 files with indirect coverage changes


@ximinez ximinez force-pushed the pr/getledger branch 3 times, most recently from 09c4156 to c69b443 Compare September 11, 2024 14:48
ximinez and others added 5 commits September 11, 2024 11:09
* Allow a retry after 30s in case of peer or network congestion.
* Addresses RIPD-1870
* (Changes levelization. That is not desirable, and will need to be
  fixed.)
* Allow a retry after 15s in case of peer or network congestion.
* Collate duplicate TMGetLedger requests:
  * The requestCookie is ignored when computing the hash, thus increasing
    the chances of detecting duplicate messages.
  * With duplicate messages, keep track of the different requestCookies
    (or lack of cookie). When work is finally done for a given request,
    send the response to all the peers that are waiting on the request,
    sending a separate message for each requestCookie.
* Addresses RIPD-1871
* Addresses RIPD-1869

---------

Co-authored-by: Valentin Balaschenko <[email protected]>
Co-authored-by: Ed Hennis <[email protected]>
* When work is done for a given TMGetLedger request, send the
  response to all the peers that are waiting on the request,
  sending one message per peer, including all the cookies and
  a "directResponse" flag indicating the data is intended for the
  sender, too.
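
The collation described in the commit notes above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `PendingRequests`, `PeerId`, the string request key, and both method names are hypothetical. The only ideas taken from the notes are that the cookie is left out of the key used to detect duplicates, and that each waiting peer ends up with one reply carrying all of its cookies.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-ins; none of these names come from the PR itself.
using PeerId = std::uint32_t;
using Cookie = std::optional<std::uint64_t>;  // nullopt: request had no cookie
using PeerCookies = std::map<PeerId, std::set<Cookie>>;

class PendingRequests
{
    // Keyed by the request content only. The cookie is deliberately not
    // part of the key, so requests differing only by cookie collapse.
    std::map<std::string, PeerCookies> pending_;

public:
    // Records the waiter. Returns true only for the first request with
    // this key, meaning the caller should actually start the work.
    bool
    add(std::string const& key, PeerId peer, Cookie cookie)
    {
        auto const [it, inserted] = pending_.try_emplace(key);
        it->second[peer].insert(cookie);
        return inserted;
    }

    // Removes and returns everyone waiting on this key. The caller can
    // then send one reply per peer, listing all of that peer's cookies.
    PeerCookies
    finish(std::string const& key)
    {
        auto node = pending_.extract(key);
        return node.empty() ? PeerCookies{} : std::move(node.mapped());
    }
};
```

A caller would start the work only when `add` returns true, and iterate over the result of `finish` once the data is ready. Locking is omitted here; a real implementation shared between threads would need it.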
@ximinez ximinez changed the title Reduce ledger protocol message traffic Reduce peer message traffic for ledger data Sep 11, 2024
@ximinez ximinez added this to the 2.3.0 (August 2024) milestone Sep 11, 2024
@ximinez ximinez marked this pull request as ready for review September 11, 2024 20:45
insert()
{
std::unique_lock<Mutex> lock_(mtx_);
bool exists = collection_.contains(item_);
Collaborator

@Bronek Bronek Sep 13, 2024

This would avoid extra lookup

auto [_, inserted] = collection_.insert(item_);
return inserted;

Collaborator Author

Good catch. Fixed.

* Avoid an unnecessary lookup in CanProcess
* upstream/develop:
  Set version to 2.3.0-b4
  feat(SQLite): allow configurable database pragma values (5135)
  refactor: re-order PRAGMA statements (5140)
  fix(book_changes): add "validated" field and reduce RPC latency (5096)
  chore: fix typos in comments (5094)
  Set version to 2.2.3
  Update SQLite3 max_page_count to match current defaults (5114)
@vlntb vlntb assigned vlntb and unassigned vlntb Oct 1, 2024
@vlntb vlntb self-requested a review October 1, 2024 10:57
Collaborator

@vlntb vlntb left a comment

Only minor comments. Happy to approve once addressed.

@@ -623,6 +623,13 @@ to_string(base_uint<Bits, Tag> const& a)
return strHex(a.cbegin(), a.cend());
}

template <std::size_t Bits, class Tag>
inline std::string
to_short_string(base_uint<Bits, Tag> const& a)
Collaborator

[nit]: Adding checks for to_short_string in base_uint_test, next to the existing to_string cases, would be good.

Collaborator Author

I wouldn't call that a nit. Missing test coverage is pretty significant. Thanks for catching it. Fixed.

return ledger;
}

JLOG(p_journal_.trace())
Collaborator

[nit]: Should this be a warn instead of trace? Not having a peer to relay the request may indicate some configuration or environment issues.

Collaborator Author

Not necessarily, though. It could mean that the node has already sent the request to all its peers. Also note that the original message is trace.

But there's something odd here. It looks like this code block was somehow duplicated! I must have messed up resolving a conflict when I rebased from master to develop. I've removed the duplicate.

@@ -2936,7 +3078,9 @@ getPeerWithLedger(
void
PeerImp::sendLedgerBase(
std::shared_ptr<Ledger const> const& ledger,
-    protocol::TMLedgerData& ledgerData)
+    protocol::TMLedgerData& ledgerData,
+    std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>> const&
Collaborator

std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>> is mentioned in four places. It would be more readable to define an alias:
using PeerCookieMap = std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>>;

Collaborator Author

Good suggestion! Fixed.

* Add unit tests for to_short_string(base_uint
* Remove duplicated code
* Use type aliases for cookie maps
* That's what I get for rushing to push
@ximinez ximinez requested a review from Bronek October 15, 2024 16:54
* upstream/develop:
  Expand Error Message for rpcInternal (4959)
  docs: clean up API-CHANGELOG.md (5064)
* upstream/develop:
  Consolidate definitions of fields, objects, transactions, and features (5122)
  Ignore reformat when blaming
  Reformat code with clang-format-18
  Update pre-commit hook
  Update clang-format settings
  Update clang-format workflow
* upstream/develop:
  Add hubs.xrpkuwait.com to bootstrap (5169)
  docs: Add protobuf dependencies to linux setup instructions (5156)
  fix: reject invalid markers in account_objects RPC calls (5046)
  Update RELEASENOTES.md (5154)
  Introduce MPT support (XLS-33d): (5143)
* upstream/develop:
  Add AMMClawback Transaction (XLS-0073d) (5142)
* upstream/develop:
  Fix unity build (5179)
* upstream/develop:
  Set version to 2.3.0-rc1
  Replace Uint192 with Hash192 in server_definitions response (5177)
  Fix potential deadlock (5124)
  Introduce Credentials support (XLS-70d): (5103)
  Fix token comparison in Payment (5172)
  Add fixAMMv1_2 amendment (5176)
@ximinez ximinez removed the request for review from mtrippled November 12, 2024 16:13
* upstream/develop:
  fix: include `index` in `server_definitions` RPC (5190)
  Fix ledger_entry crash on invalid credentials request (5189)
* upstream/develop:
  Set version to 2.3.0-rc2
* upstream/develop:
  Set version to 2.3.0
  refactor(AMMClawback): move tfClawTwoAssets check (5201)
  Add a new serialized type: STNumber (5121)
  fix: check for valid ammID field in amm_info RPC (5188)
Collaborator

@Bronek Bronek left a comment

Several notes for now. I also noticed a bunch of "Added line not covered by tests" annotations from codecov; perhaps worth checking these.


#ifndef RIPPLE_BASICS_CANPROCESS_H_INCLUDED
#define RIPPLE_BASICS_CANPROCESS_H_INCLUDED

Collaborator

Needs #include <mutex> for std::unique_lock

Suggested change
#include <mutex>

return canProcess_;
}

operator bool() const
Collaborator

Please use explicit here; it would be consistent with std::optional and most other operator bool inside the project.
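
For illustration, here is a small sketch of what `explicit` changes in practice. `GuardSketch` and `runIfAllowed` are made-up names, not types from this PR; the point is only that the `if` with initializer form used elsewhere in the change keeps compiling, while accidental implicit conversions do not.

```cpp
#include <type_traits>

// Hypothetical guard type, for illustration only.
class GuardSketch
{
    bool ok_;

public:
    explicit GuardSketch(bool ok) : ok_(ok)
    {
    }

    // `explicit` still permits contextual conversion (if, while, !, &&, ||)
    // but blocks silent conversions such as `int n = guard;`.
    explicit operator bool() const
    {
        return ok_;
    }
};

// The if-with-initializer form keeps working unchanged.
inline int
runIfAllowed(bool allowed)
{
    if (GuardSketch guard{allowed})
        return 1;
    return 0;
}

// No implicit conversion to bool, but direct initialization is allowed.
static_assert(!std::is_convertible_v<GuardSketch, bool>);
static_assert(std::is_constructible_v<bool, GuardSketch>);
```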

}

bool
canProcess() const
Collaborator

Given that it is not covered by tests, perhaps we do not need this function?

if (canProcess_)
{
std::unique_lock<Mutex> lock_(mtx_);
collection_.erase(item_);
Collaborator

@Bronek Bronek Dec 2, 2024

How about replacing the (likely) O(log N) element lookup with an O(1) erase through an iterator? Specifically, the iterator returned from insert, which is currently ignored. In that case the Item wouldn't have to be stored inside the CanProcess object, so that's one less template parameter and probably a smaller object (hashes are larger than iterators, I guess).

Could even go one step further and replace all data members with std::function<void()> cleanup_ which would capture all that it needs if insert succeeded, or is empty if insert failed. In this case no template parameters would be needed at all.
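
A rough sketch of the second idea, under a few assumptions: `CanProcessSketch` is a hypothetical name, the collection is assumed to be a node-based container such as `std::set` (so the stored iterator stays valid while other elements come and go, which is not guaranteed for hash containers that may rehash), and the mutex and collection are assumed to outlive the guard.

```cpp
#include <functional>
#include <mutex>
#include <set>

// Hypothetical rewrite of the guard; not the class from this PR.
class CanProcessSketch
{
    // Holds the erase action if the insert succeeded, empty otherwise.
    std::function<void()> cleanup_;

public:
    template <class Mutex, class Collection, class Item>
    CanProcessSketch(Mutex& mtx, Collection& collection, Item const& item)
    {
        std::unique_lock<Mutex> lock(mtx);
        auto const result = collection.insert(item);
        if (result.second)
        {
            // Keep the iterator so the destructor can erase without a lookup.
            auto const it = result.first;
            cleanup_ = [&mtx, &collection, it]() {
                std::unique_lock<Mutex> eraseLock(mtx);
                collection.erase(it);
            };
        }
    }

    ~CanProcessSketch()
    {
        if (cleanup_)
            cleanup_();
    }

    CanProcessSketch(CanProcessSketch const&) = delete;
    CanProcessSketch&
    operator=(CanProcessSketch const&) = delete;

    explicit operator bool() const
    {
        return static_cast<bool>(cleanup_);
    }
};
```

The trade-off is that `std::function` may allocate for the captured state, so whether this ends up smaller or faster than storing the item would need measuring.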

return true;
return false;
}();
assert(
Collaborator

This looks like unnecessary repetition of the condition used to initialise shouldAcquire but there is also a subtle bug here. We are reading getOPs().isNeedNetworkLedger() more than once, but this function returns an atomic (I mean its implementation in NetworkOPs.cpp) so in principle each time we call it, we might get a different response. I am not certain this condition can actually happen, but at the very least it is brittle code.

Also note the third call to getOPs().isNeedNetworkLedger() inside the logging below. The same applies there, and it could produce confusing logs.
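
One way to address this is to read the value once and reuse the snapshot for the decision, the assert, and the log line. The sketch below is illustrative only: `OpsSketch`, `AcquireDecision`, `decideAcquire`, and `otherCondition` are hypothetical, and the condition shown is a placeholder rather than the real one from InboundLedgers.cpp.

```cpp
#include <atomic>
#include <string>

// Hypothetical stand-in for the NetworkOPs accessor; only the
// atomic-backed getter mirrors what the comment above describes.
struct OpsSketch
{
    std::atomic<bool> needNetworkLedger{false};

    bool
    isNeedNetworkLedger() const
    {
        return needNetworkLedger.load();
    }
};

struct AcquireDecision
{
    bool shouldAcquire;
    std::string logLine;
};

// `otherCondition` is a placeholder for the rest of the real condition,
// which is not reproduced here.
inline AcquireDecision
decideAcquire(OpsSketch const& ops, bool otherCondition)
{
    // Read the atomic exactly once; every later use sees the same value,
    // so the decision and the log cannot disagree.
    bool const needNetworkLedger = ops.isNeedNetworkLedger();
    bool const shouldAcquire = !needNetworkLedger || otherCondition;

    std::string logLine = "needNetworkLedger=";
    logLine += needNetworkLedger ? "true" : "false";
    logLine += shouldAcquire ? ", acquiring" : ", skipping";
    return {shouldAcquire, logLine};
}
```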

// it so that it can jump ahead and get caught up.
LedgerIndex const validSeq =
app_.getLedgerMaster().getValidLedgerIndex();
constexpr std::size_t lagLeeway = 20;
Collaborator

Is this the best location for this constant? I am asking, not judging.

<< ": " << e.what();
}
catch (...)
if (CanProcess check{acquiresMutex_, pendingAcquires_, hash})
Collaborator

Looking good 👍

* upstream/develop:
  test: Add more test cases for Base58 parser (5174)
  test: Check for some unlikely null dereferences in tests (5004)
  Add Antithesis intrumentation (5042)