fix(miner): ignore lastWork when selecting the best mining candidate + tests #12659

Stebalien · 2024-10-30T15:35:27Z

Related Issues

Proposed Changes

Ignore lastWork when selecting the best mining candidate.

Previously, we only took the new head if it was heavier than the last head. Unfortunately, this meant that F3 finalization wasn't properly propagated to the miner.

In terms of impact:

It seems likely that this check was simply defensive as, prior to F3, the new head should never have a lower weight (unless you're talking to multiple lotus nodes, I guess...).
The `lastWork` field is mostly used to track null blocks. Worst-case scenario, if we switch heads, we'll attempt to re-mine previous heights. However, that should be relatively fast and, due to the slash filter, we won't attempt to re-broadcast any of those blocks.
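
To make the change concrete, here is a minimal sketch of the two selection rules. It is not the Lotus implementation: `tipSet`, `miningBase`, `selectOld`, and `selectNew` are hypothetical names, and the weight is simplified to an `int64`.

```go
package main

import "fmt"

// tipSet is a hypothetical stand-in for a chain head: an identifier plus
// its chain weight.
type tipSet struct {
	key    string
	weight int64
}

// miningBase models the idea behind lastWork: the tipset being mined on
// plus the number of null rounds counted on top of it.
type miningBase struct {
	ts         tipSet
	nullRounds uint64
}

// selectOld models the pre-fix rule: keep the previous base unless the new
// head is strictly heavier.
func selectOld(last *miningBase, head tipSet) *miningBase {
	if last != nil {
		if last.ts.key == head.key || head.weight <= last.ts.weight {
			return last
		}
	}
	return &miningBase{ts: head}
}

// selectNew models the fix: always follow the node's current head, and
// reuse the previous base (with its null-round count) only when the head
// is unchanged.
func selectNew(last *miningBase, head tipSet) *miningBase {
	if last != nil && last.ts.key == head.key {
		return last
	}
	return &miningBase{ts: head}
}

func main() {
	last := &miningBase{ts: tipSet{key: "heavy-fork", weight: 100}, nullRounds: 2}
	finalized := tipSet{key: "f3-finalized", weight: 90}

	fmt.Println("old rule mines on:", selectOld(last, finalized).ts.key)
	fmt.Println("new rule mines on:", selectNew(last, finalized).ts.key)
}
```

In this sketch the old rule stays on the heavier fork even though the node has moved to a lighter, finalized head, while the new rule follows the node and resets the null-round count.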

Checklist

Before you mark the PR ready for review, please make sure that:

Commits have a clear commit message.
PR title conforms with contribution conventions
Update CHANGELOG.md or signal that this change does not need it per contribution conventions
New features have usage guidelines and / or documentation updates in
- Lotus Documentation
- Discussion Tutorials
Tests exist for new functionality or change in behavior
CI is green

Stebalien · 2024-10-30T15:39:20Z

NOTE: This is the smallest possible fix for this bug. An ideal fix would have some other properties:

Ideally we'd keep the heaviest head unless the new head was finalized by F3, but that's a larger fix and is accounting for a case where something is already going wrong.
Ideally we wouldn't track "nulls" but would instead track the last height we mined on irrespective of the last base we worked on. However, that's a much larger change and not something I feel comfortable shipping in a late RC. Worst-case scenario, we try to re-mine heights when catch-up mining, but that's fine because we already do that.

In other words, this fix should be no worse than the current code.
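
For comparison, the first "ideal" rule described above (keep the heaviest head unless the new head was finalized by F3) could look roughly like the following. This is only an illustration under assumptions: the `head` type and its `f3Finalized` flag are hypothetical, and the sketch does not cover the case where the current head is itself finalized.

```go
package main

import "fmt"

// head is a hypothetical view of a candidate chain head.
type head struct {
	key         string
	weight      int64
	f3Finalized bool // assumed flag: set when F3 has finalized this head
}

// chooseHead keeps the heavier head unless the new head was finalized by
// F3, in which case finality wins regardless of weight.
func chooseHead(last, next head) head {
	if next.f3Finalized {
		return next
	}
	if next.weight > last.weight {
		return next
	}
	return last
}

func main() {
	last := head{key: "heavy", weight: 100}
	lightFinal := head{key: "light-finalized", weight: 90, f3Finalized: true}
	lightPlain := head{key: "light", weight: 90}

	fmt.Println(chooseHead(last, lightFinal).key)
	fmt.Println(chooseHead(last, lightPlain).key)
}
```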

masih

LGTM based on standup discussion. As far as I can tell the previous implementation is due to a whole bunch of historical reasons that no longer apply.

Stebalien · 2024-10-30T16:19:45Z

We looked into the history a bit more; it looks like this check was part of the very first WIP commit and just stuck around the entire way.

Stebalien · 2024-10-30T21:09:16Z

Well, it kind of works. But our slash-filter handling is bonkers. If we hit the slash filter, we'll keep trying to re-mine the same height until someone wins, which obviously won't work. So now I'm trying to fix that.

Stebalien · 2024-10-30T22:50:11Z

Hm. So, I don't think that'll work because mining on a "different fork" would be slashable, IIRC. In practice this isn't an issue because, while we can't mine until someone else does, someone else will eventually mine on a different height.

So I'm just going to set up a test with an additional miner.

Stebalien · 2024-10-31T03:51:59Z

Ok, the handling of null blocks with respect to slashed blocks broke everything (that and our syncing logic has some issues). I've pushed a WIP commit with a test and hacky fixes that make the test "pass", but I don't think they're 100% correct.

Issues:

We don't update our base after submitting a block (will cause catch-up issues).
We don't update the null blocks counter on the base after skipping a block due to the slash filter (prevents progress in the test).
We don't wait after handling the slash filter (my current patch doesn't wait either... not sure what we should do here? does it matter?).
The checkpoint sync logic can't handle the specific type of fork I'm using in the test (fork to a tipset with fewer blocks). I think my hacky fix is correct but we'll need verification on that.
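
The second issue in the list above (not bumping the null blocks counter after a slash-filter skip) can be illustrated with a toy model. Everything here is an assumption made for illustration: `miningBase`, `slashFilter`, and `attempt` are hypothetical, and the real slash filter checks more than "already mined at this height".

```go
package main

import "fmt"

// miningBase is a hypothetical base: the parent height plus the number of
// null rounds counted on top of it.
type miningBase struct {
	height     uint64
	nullRounds uint64
}

// targetHeight is the height the next block would be mined at.
func (b miningBase) targetHeight() uint64 { return b.height + b.nullRounds + 1 }

// slashFilter is a toy filter that only remembers heights already mined at.
type slashFilter map[uint64]bool

// attempt tries to mine on the base. If the filter rejects the target
// height, it counts that height as a null round so the next attempt moves
// on instead of retrying the same height forever.
func attempt(b miningBase, f slashFilter) (miningBase, bool) {
	h := b.targetHeight()
	if f[h] {
		b.nullRounds++
		return b, false
	}
	f[h] = true
	return b, true
}

func main() {
	base := miningBase{height: 10}
	filter := slashFilter{11: true, 12: true}
	for {
		next, mined := attempt(base, filter)
		if mined {
			fmt.Println("mined at height", next.targetHeight())
			break
		}
		fmt.Println("skipped height", base.targetHeight())
		base = next
	}
}
```

Without the `b.nullRounds++` line, the loop in `main` would retry height 11 indefinitely, which is the "prevents progress" symptom described above.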

This test still doesn't pass and I'm not sure why:

Sync is a bit broken because it can't figure out that we already have all the data locally.
But with 2 nodes, it should work. Except that, if I add some logging, I see that sync works until the libp2p nodes just flat-out disconnect from each other.

Stebalien · 2024-11-01T11:57:20Z

Ok, I think my fix for the mining loop might be ok, but someone should look at it.

However, we still need to fix the sync issue to get the tests to pass:

I tried some simple fixes, but the real issue is that syncs that require forks always download 900 blocks and then check them against the current head.
But... it should work anyways. As written, with 2 nodes, sync should "just work". And it does until suddenly a request fails because the nodes disconnect (in libp2p) for some reason.

So I think the sync issue is ultimately some libp2p strangeness.

Kubuxu · 2024-11-08T06:13:51Z

node/hello/hello.go

-		log.Debugf("Got new tipset through Hello: %s from %s", ts.Cids(), s.Conn().RemotePeer())
-		hs.syncer.InformNewHead(s.Conn().RemotePeer(), ts)
-	}
+	// don't bother informing about genesis


Suggested change

// don't bother informing about genesis

rvagg · 2024-11-26T09:37:29Z

I think this is replaced by the now merged #12690

masih · 2024-11-26T09:45:59Z

> I think this is replaced by the now merged #12690

We still need to pull checkpointing work out of this PR, which I believe @Kubuxu is working on already.

Stebalien requested review from masih, Kubuxu and magik6k October 30, 2024 15:39

Stebalien force-pushed the steb/fix-best-mining-candidate branch from 704855e to e1a0572 Compare October 30, 2024 15:44

masih approved these changes Oct 30, 2024

rjan90 added the release/backport label Oct 30, 2024

rjan90 mentioned this pull request Oct 30, 2024

Lotus Node & Miner Release v1.30.0 (nv24) #12480

Closed

Stebalien force-pushed the steb/fix-best-mining-candidate branch from 246f88c to 0c0b158 Compare October 31, 2024 03:53

rjan90 mentioned this pull request Oct 31, 2024

build: backport changes to release/v1.30.0 branch #12663

Merged

18 tasks

Stebalien added 2 commits November 1, 2024 20:52

fix(miner): continue mining if we fail to submit a block

c80f32d

Stebalien force-pushed the steb/fix-best-mining-candidate branch from 0c0b158 to 4404614 Compare November 1, 2024 11:54

github-advanced-security bot found potential problems Nov 1, 2024

Stebalien added 3 commits November 5, 2024 11:01

fix(miner): check the slash filter with the correct parent height

065df6a

We also perform this check inside `SyncSubmitBlock` so we did have an effective filter, but this was still wrong.

fix(exchange): avoid adding ourselves as an exchange peer

d5a45ba

We have some cases where we submit a tipset to ourselves, from ourselves and end up calling `AddPeer` with our own ID. Ignore this case.

fix(hello): submit new blocks from peers even if we're at genesis

475eda5

A side-effect of InformNewHead is to record the peer for future chain-sync sessions. If we don't pass blocks to InformNewHead here, we can have some difficulty bootstrapping networks.

rjan90 removed the release/backport label Nov 6, 2024

Kubuxu reviewed Nov 8, 2024

make gen

8d4a555

Kubuxu changed the title ~~fix(miner): ignore lastWork when selecting the best mining candidate~~ fix(miner): ignore lastWork when selecting the best mining candidate + tests Nov 12, 2024

This was referenced Nov 12, 2024

fix(miner): ignore lastWork when selecting the best mining candidate #12690

Merged

Fix checkpoint sync in Lotus and merge mining test filecoin-project/go-f3#746

Open

rvagg closed this Nov 26, 2024
