
Ability to sync blockheight among multiple servers for consensus tests. #87

Open
msgilligan opened this issue Jun 29, 2015 · 26 comments

@msgilligan
Member

Currently the Spock consensus tests use BitcoinClient.waitForServer() and BitcoinClient.waitForBlock() to make sure the Omni Core server being tested (usually on localhost) is warmed up and in sync with the MainNet blockchain before the tests begin. The consensus test Specs contain an explicit comparison of the blockheight of the local server (leftHeight) against that of the remote server (rightHeight), which will fail if the remote server is on a different block. No attempt is made to synchronize the local and remote servers, so there are frequent failures of this blockheight comparison as well as false positives in the Omni account balance comparisons due to blockheight differences.
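For illustration, here is a minimal Spock-style sketch of the kind of explicit blockheight comparison described above. This is not the actual OmniJ Spec; the heightOf() stubs are hypothetical stand-ins for the getblockcount RPC calls on the two servers.

```groovy
import spock.lang.Specification

// Minimal sketch only -- not the real OmniJ consensus Spec. The two
// "servers" are stubbed with closures standing in for getblockcount RPCs.
class BlockHeightComparisonSpec extends Specification {

    def localServer  = [heightOf: { -> 363000 }]   // hypothetical stub
    def remoteServer = [heightOf: { -> 363000 }]   // hypothetical stub

    def "local (left) and remote (right) servers are on the same block"() {
        when:
        def leftHeight  = localServer.heightOf()
        def rightHeight = remoteServer.heightOf()

        then: "any height mismatch fails the comparison and taints later balance checks"
        leftHeight == rightHeight
    }
}
```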

Because the Jenkins tests are typically started by commits to the omnicore or OmniJ repositories, and the consensus tests run against multiple remote servers, some of which are occasionally many blocks behind the current MainNet height, I have not considered it practical to try to sync the block heights between the different servers -- at least in the context of a Jenkins job -- or even in the case of a developer running the Spock consensus tests (via command-line or IDE).

I have given serious thought to writing a continuously-running consensus-validating server that could cache consensus data from multiple Omni server implementations and do comparisons with matching blockheights whenever each server reaches a new block. However, this consensus-validating server is not in my near-term plans.

I created this issue to address @dexX7's request for this capability in a comment on Issue #74 and to give us a place to discuss possible solutions including the idea of a consensus-validating server.

@dexX7
Member

dexX7 commented Jun 29, 2015

Thanks for creating this issue.

I have not considered it practical to try to sync the block heights between the different servers

Just to be clear: not all servers have to be synchronized, only those being compared (e.g. local Omni Core vs. Chest).

I see it this way: the consensus tests must not fail unless there is a consensus difference. And if that is the case, then it is a serious problem, and all alarms should be ringing.

@msgilligan
Member Author

I see it this way: the consensus tests must not fail unless there is a consensus difference. And if that is the case, then it is a serious problem, and all alarms should be ringing.

I agree with you in principle. The current false alarms are very frustrating.

The problem is that the JUnit test runner (for which Spock is a plugin) does not allow us to (easily) run tests in parallel or control the order of the tests. If in one test we wait for a remote server to advance to the current block, we could find that by that time the local server has moved on to the next block. Sometimes one of the remote servers lags behind for hours, and so the whole test run would be delayed.

Ultimately, in my opinion, the objective of running consensus tests via JUnit/Spock and/or Jenkins is to test a new, local build of the server against (assumed-good) remote servers.

I just don't think the JUnit/Spock runner with or without Jenkins is the right way to do consensus checking on continuously running production or staging servers.

I could be wrong, and maybe I need to give more thought to how the tests could be restructured in order to do this. One possibility would be to separate out the consensus tests against each remote server into separate Jenkins jobs and/or Gradle tasks.

It's also worth mentioning that a production server that is out of consensus because it lags behind by even one block is, from a user's perspective, out of consensus.

@dexX7
Member

dexX7 commented Jun 29, 2015

Sometimes one of the remote servers lags behind for hours

Such a server shouldn't be used as a reference.

I just don't think the JUnit/Spock runner with or without Jenkins is the right way to do consensus checking on continuously running production or staging servers.

So what do you suggest instead?

@msgilligan
Member Author

I have imagined something like a Spring Boot or Grails server, backed by a SQL database that could keep a table of registered Omni servers, with their type and connection info. It could (in parallel) pull consensus information from each server, cache the last n consensus snapshots in the database, and compare results between servers using the cached data as soon as a valid comparison can be made. It could have an HTML dashboard that shows blockheight and consensus information for each server and send e-mail or other alerts when something falls out of consensus.

I have designed the ConsensusSnapshot, ConsensusComparison, etc. classes with such a server in mind. I think a prototype could probably be built in a day or two of coding. Making it production-ready would be another few days, I'd think.

As an added bonus the server could offer an API to get recent historical snapshots for reference servers that JUnit/Jenkins-driven tests could compare against.
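A rough sketch of the caching-and-comparison core of such a server is shown below. Class and method names here are illustrative only, not the actual ConsensusSnapshot/ConsensusComparison API; the idea is simply to cache the last N snapshots per server and compare two servers at the newest blockheight they have in common.

```groovy
import groovy.transform.Canonical

// Rough sketch only: cache the last N consensus snapshots per server and
// compare two servers at the highest blockheight they have in common.
// Names and structures are illustrative, not the OmniJ classes.
@Canonical
class Snapshot {
    long blockHeight
    Map<String, BigDecimal> balances   // address -> balance for one property
}

class SnapshotCache {
    static final int MAX_SNAPSHOTS = 50

    // serverId -> (blockHeight -> snapshot)
    Map<String, Map<Long, Snapshot>> cache = [:].withDefault { [:] }

    void record(String serverId, Snapshot snapshot) {
        def byHeight = cache[serverId]
        byHeight[snapshot.blockHeight] = snapshot
        while (byHeight.size() > MAX_SNAPSHOTS) {     // evict the oldest entries
            byHeight.remove(byHeight.keySet().min())
        }
    }

    /** Returns mismatched balances at the newest shared height, or null if no shared height yet. */
    Map compare(String serverA, String serverB) {
        def common = cache[serverA].keySet().intersect(cache[serverB].keySet())
        if (!common) {
            return null                                // no valid comparison possible yet
        }
        long height = common.max()
        def left  = cache[serverA][height].balances
        def right = cache[serverB][height].balances
        def mismatched = (left.keySet() + right.keySet()).findAll { left[it] != right[it] }
        mismatched.collectEntries { [(it): [left: left[it], right: right[it]]] }
    }
}
```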

@msgilligan
Member Author

Such a server shouldn't be used as a reference.

The worst offender was Chest, but @zathras-crypto moved it to a bigger instance and it is now only intermittently 1 block behind. It is the only public Chest implementation and the tests are run more to check Chest than anything else. I suppose this could be viewed as an argument for splitting off the consensus tests of local Omni vs. remote Chest into a separate task/job.

@msgilligan
Member Author

Another solution we could consider is programmatically ignoring the remaining tests when the block height doesn't match. That might be an easy way to prevent these alarming/distracting "false positives".
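One hedged sketch of how that could look using Spock's built-in conditional skipping is shown below. The sameBlockHeight() precondition is hypothetical; a real version would query both servers before the Spec runs.

```groovy
import spock.lang.Requires
import spock.lang.Specification

// Hypothetical precondition helper; a real version would query both servers
// (e.g. via getblockcount) and compare the results.
class ConsensusPrecondition {
    static boolean sameBlockHeight() {
        true
    }
}

// Sketch only: when the closure evaluates to false, Spock skips the whole
// Spec instead of reporting a (false-positive) consensus failure.
@Requires({ ConsensusPrecondition.sameBlockHeight() })
class OmniBalanceConsensusSpec extends Specification {

    def "Omni balances match between local and remote server"() {
        expect: "the real balance comparisons would go here"
        true
    }
}
```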

@dexX7
Member

dexX7 commented Jun 29, 2015

cache the last n consensus snapshots in the database, and compare results between servers using the cached data as soon as a valid comparison can be made

I like where this is going, but I don't think the current "let's fetch some balances and compare them" approach is ideal. Or let me rather say: I think it would be the wrong approach to store historical snapshots in some external DB and use those to compare consensus.

  • if possible, most of the work (whatever that might be) should be done by Omni Core
  • comparing all balances (for a subset of properties) seems inefficient

Even though I'd like to get away from file-based persistence in the long run, after each block some state information is persisted in ~/.bitcoin/MP_persist and kept for the last 50 blocks. As part of this, a checksum is added at the end of each file (for each block, and state information), which is created based on the persisted data. It would be pretty easy to collect and store those checksums, and provide an RPC command to retrieve them for any given block.

While this doesn't provide a way to pin down the actual difference in state/consensus, it could be used to confirm two clients/implementations/... have the same state at block n.
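As a sketch of what consuming such a command could look like, see below. The RPC name "getstatechecksum" and the send() signature are made up here; no such command exists yet, it is just the proposal above expressed as code.

```groovy
// Sketch only. "getstatechecksum" is a hypothetical RPC name, and send()
// stands in for any thin JSON-RPC wrapper taking (method, params).
class ChecksumComparator {

    static boolean sameStateAt(def leftRpc, def rightRpc, long height) {
        String left  = leftRpc.send('getstatechecksum', [height])
        String right = rightRpc.send('getstatechecksum', [height])
        left == right      // equal checksums imply equal persisted state at that block
    }
}
```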

@dexX7
Member

dexX7 commented Jun 29, 2015

That might be an easy way to prevent these alarming/distracting "false positives".

This is a bit risky in my opinion. If there are consensus differences, it would be unclear whether they are caused by unsynced servers, and if there are no differences, it might just be luck, because "state at block x" is similar to "state at block y".

@msgilligan
Member Author

As part of this, a checksum is added at the end of each file (for each block, and state information), which is created based on the persisted data. It would be pretty easy to collect and store those checksums, and provide an RPC command to retrieve them for any given block.

I wrote the Omni Consensus Hashing Proposal with a similar idea, though I viewed this as more of a Phase 2 feature with regard to the server described above.

@msgilligan
Member Author

... If there are consensus differences, it would be unclear whether they are caused by unsynced servers, and if there are no differences, it might just be luck ...

The idea was that if the blockheight didn't match the entire test Spec would be skipped.

... or maybe we should just move the OmniChest tests to a separate job.

@dexX7
Member

dexX7 commented Jul 2, 2015

@zathras-crypto: how come the Omni Chest node is always a bit behind? Is the server overloaded, or is your node very, very badly connected? You could run addnode "104.131.211.252:8333" "onetry"; that node is usually listed as a top-500 node according to bitnodes.io and fuels the faucet. I upgraded it to Omni Core 0.0.10-dev yesterday.

It runs with minrelaytxfee=0.00001 and datacarriersize=200 though, and I'm not sure if this could result in a ban if it relays transactions that other nodes consider "non-standard".

@zathras-crypto

@zathras-crypto: how come the Omni Chest node is always a bit behind?

It should never be more than 5 minutes behind. Note the data is being pulled for consensus testing from Omni Chest's DB and not the Omni Core node backing it.

Long story short, Omni Chest Engine (which reads data from Omni Core and populates Omni Chest's DB) runs on a rolling 5-minute scheduler, which is why there can be a minute or two between a new block coming in and that data being available on Omni Chest (including the API).

FYI, I'm doing background work on a new version of Omni Chest to support 0.0.10-based features like MetaDEx. That new instance runs via blocknotify instead of a scheduler, so DB updates should be almost immediate.

Hope that provides some context :)

@msgilligan
Member Author

Omni Chest Engine (which reads data from Omni Core and populates Omni Chest's DB) runs on a rolling 5-minute scheduler

Why not run the scheduler more often, say every 1 minute?

@zathras-crypto

Yeah, I guess - probably just habit (repeat every 0.5 times the expected rate of change, i.e. the 10-minute block average) hehe. As I say, the next version is moving to blocknotify, which is the more appropriate way to schedule updates. The current iteration simply doesn't have the backend API call to allow scripts from Omni Core (a different server) to trigger an update on Chest's DB server when a block comes in, but that's fixed for the next version :)

@msgilligan
Member Author

If changing the scheduler interval is an easy change, it will surely reduce the number of false positives in the consensus tests.

Good to see the next version will use blocknotify.

@dexX7
Member

dexX7 commented Jul 7, 2015

By the way, NACK on kicking OmniChest, or disabling consensus tests, under some circumstances.

I'm not sure which version OmniWallet runs, but ideally we have at least one server with 0.0.9 and another one with 0.0.10. As far as I know Chest still uses 0.0.9, which is great to compare the 0.0.10-dev branch against.

As a temporary workaround, maybe we could loop and fetch consensus data every 15-30 (?) seconds until the block heights are equal or more than 5 minutes have passed?
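A sketch of that workaround is below. heightOf() is a placeholder for the real getblockcount call, and the intervals simply mirror the numbers suggested above.

```groovy
// Sketch only: poll both servers until their block heights match, or give up
// after the timeout. heightOf() is a placeholder for a getblockcount RPC call.
class HeightSync {

    static boolean waitForMatchingHeights(def left, def right,
                                          long pollMillis = 20000,       // ~15-30 s
                                          long timeoutMillis = 300000) { // 5 minutes
        long deadline = System.currentTimeMillis() + timeoutMillis
        while (System.currentTimeMillis() < deadline) {
            if (left.heightOf() == right.heightOf()) {
                return true
            }
            Thread.sleep(pollMillis)
        }
        false   // heights never matched within the timeout
    }
}
```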

@zathras-crypto

Just as an FYI guys, OmniChest will always use the latest release of Omni Core, but never a development version. So it's running 0.0.9.1 right now and will move to 0.0.9.2 shortly (though I'll probably keep 0.0.9.1 running on a separate clone just to see how it behaves if there are any more forks).

OmniChest will move to 0.0.10 only once it's released, and that will be when the new version of OmniChest (which has stuff like MetaDEx support etc etc) will make an appearance.

@dexX7
Member

dexX7 commented Jul 7, 2015

I assumed it was actually 0.0.9.?-dev, based on the footer, but this is great to know.

@zathras-crypto

I assumed it was actually 0.0.9.?-dev, based on the footer, but this is great to know.

Good eye dude! Hehe :)

Actually, specifically, it's slightly modified from the release code because of the extra bits Chest uses (getDecodedPayload(), getPayloadSize(), etc.), but the base should be there :)

@msgilligan
Member Author

NACK on kicking OmniChest

What do you mean by "kicking"?

@dexX7
Member

dexX7 commented Jul 8, 2015

Hehe, sorry - "kicking" as of "removing", "disabling" - I was referring to the idea of skipping consensus tests, if the block heights (of Chest) don't match.

@msgilligan
Member Author

Because I'd like to see @zathras-crypto "kick" Chest more often -- as in checking for a new block once every 1 minute rather than every 5 -- which should really cut down on the number of false positives. How hard is that to do, @zathras-crypto?

@zathras-crypto

How hard is that to do, @zathras-crypto?

Slightly more complex question sadly hehe :)

TL;DR: I've reduced the scheduler to every 2 minutes. The longer version is that Chest v5 is a kind of double-layered beast, with some throwback stuff from when Chest used the masterchest library for parsing and some newer stuff to migrate to Core for parsing. Without going into too much detail, there are a couple of safeguards against problems here:

The first is that even if the Chest code muddles a transaction in the block explorer component, Chest always pulls the latest state (properties, DEx offers, all address balances for every property, etc.) into temp tables at the end of every run and overwrites the corresponding production tables with that data. The plus is that, regardless of Chest's transaction history processing, addresses will always have the right balance according to Omni Core - the downside is that it's horribly inefficient haha.

The second is that Chest's reorg protection has a second layer that rolls back 5 blocks on every run (though ironically, as the BIP66 episode proved, even 5 blocks isn't enough for a safe assumption).

How this relates to the scheduler is that the update process is lengthy, and 1 minute is probably too aggressive. We'll see how 2 minutes goes :)

@dexX7
Member

dexX7 commented Jul 8, 2015

Do you have a simple script to collect the full state of the system? This would actually be quite handy.

The second is that Chest's reorg protection has a second layer that rolls back 5 blocks ...

Hehe, so what happened during the 6-block-deep reorg? ;)

@msgilligan
Member Author

Do you have a simple script to collect the full state of the system?

I just updated the build files (0cc2ab7) and wrote some documentation (db07123) for the Omni Consensus Tool. It can currently compare all properties or output the state of a single property to the screen or a file. With some additional work it could be made to dump the balances of all properties (or compare just a single property).

@zathras-crypto

Hehe, so what happened during the 6-block-deep reorg? ;)

Nothing; there were no Omni transactions in the reorg. If there had been, and they were 5 or fewer blocks deep, they would have been wiped from the transactions table as part of normal scanning, but any deeper than that and there is a risk of bad info being displayed if the new chain doesn't contain the transaction(s) or they appear in a different order.

It used to be 10 blocks; I probably shouldn't have changed it haha. The whole BIP66 thing gave me a bit of a wake-up call that 6+ block reorgs are still feasible with Bitcoin, so for the next (MetaDEx) version of OmniChest I'm going to port over the reorg protection from the old-skool Masterchest Wallet. That used to store the last n blockhashes along with their blockheights, verify the last blockhash hadn't changed since the last scan, and if it had, roll back blocks until the blockhash & height in the DB matched the data from RPC, and then start from there. Should be pretty straightforward.
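For reference, a minimal sketch of that blockhash-based approach is below. getBlockHash() is a placeholder for the bitcoind getblockhash RPC, and the in-memory map stands in for the DB table of stored hashes.

```groovy
// Sketch only: keep the last N (height -> blockhash) pairs and, on each scan,
// walk backwards until the stored hash matches what the node reports.
// getBlockHash() is a placeholder for the bitcoind getblockhash RPC.
class ReorgDetector {

    Map<Long, String> storedHashes = [:]   // stand-in for the DB table

    /** Returns the first height that needs (re)scanning. */
    long findResumeHeight(def rpc, long lastScannedHeight) {
        long height = lastScannedHeight
        while (height > 0 && storedHashes.containsKey(height) &&
               storedHashes[height] != rpc.getBlockHash(height)) {
            height--   // stored hash is no longer on the best chain; roll back
        }
        height + 1     // resume just above the last block whose stored hash still matches
    }
}
```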
