Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrates to the new contact-info #823

Merged
merged 2 commits into from
Apr 24, 2024

Conversation

behzadnouri
Copy link

@behzadnouri behzadnouri commented Apr 15, 2024

Problem

move on to new contact-info

Summary of Changes

replaced uses of legacy contact-info with the new one.

@behzadnouri behzadnouri force-pushed the new-contact-info branch 3 times, most recently from ad56c18 to d03af0b Compare April 18, 2024 23:21
@codecov-commenter
Copy link

codecov-commenter commented Apr 19, 2024

Codecov Report

Attention: Patch coverage is 95.17241% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 81.9%. Comparing base (b745dc9) to head (d6d95f9).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #823   +/-   ##
=======================================
  Coverage    81.9%    81.9%           
=======================================
  Files         853      853           
  Lines      231955   231919   -36     
=======================================
- Hits       189984   189970   -14     
+ Misses      41971    41949   -22     

@behzadnouri behzadnouri force-pushed the new-contact-info branch 2 times, most recently from 5072b36 to 592b730 Compare April 19, 2024 18:11
Comment on lines 5705 to 5708
#[test]
#[ignore]
#[serial]
fn test_invalid_forks_persisted_on_restart() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AshwinSekar this test fails with this commit.
It passes if I make below change to push contact-infos out more frequently.

diff --git a/gossip/src/cluster_info.rs b/gossip/src/cluster_info.rs
index 2707a3ea0b..02aa95c1ea 100644
--- a/gossip/src/cluster_info.rs
+++ b/gossip/src/cluster_info.rs
@@ -1875,7 +1875,7 @@ impl ClusterInfo {
                     entrypoints_processed = entrypoints_processed || self.process_entrypoints();
                     //TODO: possibly tune this parameter
                     //we saw a deadlock passing an self.read().unwrap().timeout into sleep
-                    if start - last_push > CRDS_GOSSIP_PULL_CRDS_TIMEOUT_MS / 2 {
+                    {
                         self.push_self();
                         self.refresh_push_active_set(
                             &recycler,

Another way to make it pass is to use the new contact-infos in pull-requests as below, which would also help to propagate contact-infos (but we cannot do this yet because of compatibility issues).

diff --git a/gossip/src/cluster_info.rs b/gossip/src/cluster_info.rs
index 2707a3ea0b..c59a8713cb 100644
--- a/gossip/src/cluster_info.rs
+++ b/gossip/src/cluster_info.rs
@@ -1564,11 +1564,7 @@ impl ClusterInfo {
             .map(|(_, filters)| filters.len())
             .sum::<usize>() as u64;
         self.stats.new_pull_requests_count.add_relaxed(num_requests);
-        // TODO: Use new ContactInfo once the cluster has upgraded to:
-        // https://github.com/anza-xyz/agave/pull/803
-        let self_info = LegacyContactInfo::try_from(&self.my_contact_info())
-            .map(CrdsData::LegacyContactInfo)
-            .expect("Operator must spin up node with valid contact-info");
+        let self_info = CrdsData::ContactInfo(self.my_contact_info());
         let self_info = CrdsValue::new_signed(self_info, &self.keypair());
         let pulls = pulls
             .into_iter()

So I am guessing the test relies on contact-infos being propagated by a certain time or something. I have ignored it for now not to block this change because that is technically not a problem with this code. Can you please take a look?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks for looking.

Copy link

@gregcusack gregcusack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few questions/comments. Also, do we still want to boost LegacyContactInfo when generating Pull Responses? https://github.com/behzadnouri/solana/blob/d6d95f92fbd311efabf51514062427d4dcb479fc/gossip/src/cluster_info.rs#L2124

@@ -155,7 +155,7 @@ impl CrdsData {
// TODO: Assign ranges to each arm proportional to their frequency in
// the mainnet crds table.
match kind {
0 => CrdsData::LegacyContactInfo(LegacyContactInfo::new_rand(rng, pubkey)),
0 => CrdsData::ContactInfo(ContactInfo::new_rand(rng, pubkey)),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we haven't removed LegacyContactInfo completely from the network, should we make ContactInfo an additional field rather than in place of LegacyContactInfo? will help ensure backwards compatability until LegacyContactInfo is fully gone

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine.
This commit is reducing the uses of LegacyContactInfo to pull-requests.
We are not going to backport this code, so by the time this code is about to hit mainnet, the cluster has already upgraded to #803 and we can remove legacy contact-info from pull-requests as well.

@@ -52,7 +52,7 @@ impl_crds_entry!(CrdsValue, |entry| Some(&entry?.value));
impl_crds_entry!(VersionedCrdsValue, |entry| entry);

// Lookup by Pubkey.
impl_crds_entry!(LegacyContactInfo, CrdsData::LegacyContactInfo(node), node);
impl_crds_entry!(ContactInfo, CrdsData::ContactInfo(node), node);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we going to run into an issue where we cannot lookup LegacyContactInfo from the crds.table?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were any instances of that, we would get compile-time error.

Comment on lines +1567 to 1571
// TODO: Use new ContactInfo once the cluster has upgraded to:
// https://github.com/anza-xyz/agave/pull/803
let self_info = LegacyContactInfo::try_from(&self.my_contact_info())
.map(CrdsData::LegacyContactInfo)
.expect("Operator must spin up node with valid contact-info");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do not understand why this is incompatable. looks like we're able to handle the ContactInfo on the receive side of a PullRequest in handle_batch_pull_requests(): https://github.com/behzadnouri/solana/blob/d6d95f92fbd311efabf51514062427d4dcb479fc/gossip/src/cluster_info.rs#L1967

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but that was only done in #803 which is not backported to v1.17 yet. So mainnet node still expect legacy contact-info in the pull request and otherwise drop the pull request.
To keep master branch compatible with mainnet, we cannot yet switch to the new contact-info here until mainnet has upgraded to #803 .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh good to know we need to keep master compatible with mainnet. sounds good

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the same reason we keep adding LegacyContactInfo to PullResponses with higher probability: https://github.com/behzadnouri/solana/blob/d6d95f92fbd311efabf51514062427d4dcb479fc/gossip/src/cluster_info.rs#L2124?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, since the cluster is still relying on LegayContactInfo we need to keep them prioritized for now.

@@ -5706,6 +5703,7 @@ fn test_randomly_mixed_block_verification_methods_between_bootstrap_and_not() {

/// Forks previous marked invalid should be marked as such in fork choice on restart
#[test]
#[ignore]
#[serial]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scares me a little that this test is failing just from switching to ContactInfo. Makes me think it's not getting handled properly somewhere...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but looks like if I make either of patches mentioned in https://github.com/anza-xyz/agave/pull/823/files#r1575382916 it passes. so looks like it relies on contact-infos propagation by a certain time which is not guaranteed anyways.

Nonetheless, we are not going to backport this patch, and as soon as mainnet has upgraded to #803 we can use the new contact-info in the pull requests which makes the test pass.

But agreed that we should investigate this test and figure out what assumptions are failing. Ashwin has opened #982 to keep track of this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya good point. the frequency fix is interesting. Wondering why that affects the ContactInfo but not LegacyContactInfo? I'd think it wouldn't have an affect...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the frequency fix is interesting. Wondering why that affects the ContactInfo but not LegacyContactInfo?

Previously, in addition to push and pull-responses, one more way that contact-infos were propagated across cluster was through the "caller" in pull-requests:
https://github.com/anza-xyz/agave/blob/e7452b768/gossip/src/cluster_info.rs#L2076-L2081
https://github.com/anza-xyz/agave/blob/e7452b768/gossip/src/crds_gossip_pull.rs#L296-L307
process_pull_requests is terribly named but basically what it does it upserts CRDS table with the caller's contact-info attached in pull-requests.

With this commit we are changing what struct counts as "contact-info" by switching from LegacyContactInfo to ContactInfo. So that caller's (legacy) contact-info attached to pull-requests is no longer relevant (because it is not a ContactInfo, but a Legacy...).

So basically with this commit we are killing that extra propagation path for contact-infos I mentioned earlier above. Increasing the frequency of "push_self" seems to compensate for that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhhh yes got it. that makes sense

@behzadnouri behzadnouri requested a review from gregcusack April 24, 2024 14:16
Copy link

@gregcusack gregcusack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. just need to keep an eye on #982

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants