Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polygon/sync: handle bad blocks on chain tip #12320

Merged
merged 36 commits into from
Oct 17, 2024
Merged

polygon/sync: handle bad blocks on chain tip #12320

merged 36 commits into from
Oct 17, 2024

Conversation

taratorio
Copy link
Member

@taratorio taratorio commented Oct 15, 2024

closes #11533

High level:

  • when we get a ErrForkChoiceUpdateBadBlock on the chain tip due to a invalid root hash/receipts mismatch there are 2 scenarios - either we are wrong (due to a bug in our execution or us running on a wrong fork) or the peer which sent us the new block/new block hashes event is wrong
  • when we get a ErrForkChoiceUpdateBadBlock when handling a new milestone that means we are the one who is wrong

With this in mind a way to handle ErrForkChoiceUpdateBadBlock on tip is to:

  • if the error happens when handing a new milestone mismatch and/or when in "sync to tip mode" using checkpoints & milestones on initial start - then terminate the node if we have a fork choice bad block err since we know the issue is with us and there is no point in looping
  • if the error happens while we are in "chain tip mode" and are building our canonical chain after the latest milestone has been correctly processed then we do not know if it us who is at fault or the block our peer has sent - in this case we mark the block as bad in an lru cache and penalise the peer (if this is after a NewBlockHashes event) and move on. Any future blocks that link to bad blocks in the lru get discarded with warn messages. The idea behind this is that 1) if it was indeed our peer who was wrong then we did the right thing & 2) if our peer was correct but we were wrong then we will eventually penalise all our peers and will stop receiving events from p2p - when this happens we will eventually get a new milestone event from heimdall, our canonical chain tip wouldn't match, we will try to download all blocks for the milestone from any peer that connects to us afterwards, will execute those blocks and we will again fail with a bad block err in which case this will be treated as a terminal failure and the node will exit unexpectedly

Whenever we handle a bad block err we make sure to unwind the bridge back to the last known valid block height, so that it is in a consistent state for the next valid block and after shutdown for a clean restarts that would not require manual intervention.

As part of this PR I noticed a few gaps in the logic around unwinding the bridge upon change of forks. Addressed that as part of this change so that we can close the "handle unwinding at tip" ticket.

Base automatically changed from astrid-block-announcements to main October 15, 2024 19:58
@taratorio taratorio marked this pull request as ready for review October 16, 2024 21:56
@taratorio taratorio changed the title polygon/sync: remember bad blocks on chain tip polygon/sync: handle bad blocks on chain tip Oct 17, 2024
@taratorio taratorio requested review from shohamc1 and mh0lt October 17, 2024 16:34
@taratorio taratorio enabled auto-merge (squash) October 17, 2024 16:36
@taratorio taratorio merged commit b46552d into main Oct 17, 2024
11 checks passed
@taratorio taratorio deleted the astrid-bad-blocks branch October 17, 2024 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

handle unwinds via astrid
2 participants