
Problem: usecase/syncstrategy/window.go one failure API request fails all API requests #639

Open
ysong42 opened this issue Dec 7, 2021 · 2 comments

ysong42 commented Dec 7, 2021

@allthatjazzleo reported that sometimes our explorer lags behind.

Sometimes, some of our internal nodes (the block data source for the indexing server) lag behind. The indexing server then fails to pull the latest blocks, so the information shown on the explorer is not the latest.

At the same time, some users were directed to the explorer to check their transaction status (possibly through the DeFi Wallet), and they could not find their transactions there, which is expected given that the indexing server lagged behind.

Problem

When we use the Window SyncStrategy (usecase/syncstrategy/window.go), we pull block data in batches.

However, if any of the blockchain nodes we are requesting lags behind, the pulling gets stuck.

Imagine we have two blockchain nodes behind the load balancer:

  • Node-1 is healthy: API requests to it always return 200.
  • Node-2 lags behind: API requests to it return 5XX errors.

Say the window size is 3. Then one round of indexing will try to fetch 3 blocks.

It is possible that some requests are directed to the lagging blockchain node, which returns a 5XX error:

  • Request-to-block-0: sent to Node-1, 200
  • Request-to-block-1: sent to Node-2, 5XX
  • Request-to-block-2: sent to Node-1, 200

Although block-0 and block-2 are fetched successfully, the Window strategy ignores them and returns an error.

In our current setting, we have 3 blockchain nodes and the window size is 50. When one blockchain node is down or lagging behind, it is very likely that at least one of the 50 requests will be sent to the lagging node, so the pulling of block data, and therefore the projection, gets blocked.

Sometimes it can take a few minutes for the nodes to come back. During that time, users are unable to see the latest tx data on our explorer; their transactions may even show up as 404.
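
For illustration, here is a minimal sketch (in Go) of the all-or-nothing behavior described above. The pullWindow, fetchBlock, and Block names are placeholders made up for this example, not the actual window.go code:

```go
package sketch

import "sync"

// Block is a placeholder for the indexer's block model, not the real
// chain-indexing type.
type Block struct {
	Height int64
}

// pullWindow is a hypothetical illustration of the current all-or-nothing
// behavior: every height in the window is fetched concurrently, but a single
// failed request discards the whole batch. fetchBlock stands in for the real
// RPC call to a blockchain node.
func pullWindow(start, windowSize int64, fetchBlock func(height int64) (*Block, error)) ([]*Block, error) {
	blocks := make([]*Block, windowSize)
	errs := make([]error, windowSize)

	var wg sync.WaitGroup
	for i := int64(0); i < windowSize; i++ {
		wg.Add(1)
		go func(i int64) {
			defer wg.Done()
			blocks[i], errs[i] = fetchBlock(start + i)
		}(i)
	}
	wg.Wait()

	// One 5XX from a lagging node is enough to drop block-0 and block-2,
	// even though they were fetched successfully.
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return blocks, nil
}
```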

Proposal

Maybe we could change the implementation in usecase/syncstrategy/window.go.

Still using the above example: if block-1 fails to be retrieved, we return only block-0.

If we have 50 requests and the block-10 request fails, we return the data for block-0 ~ block-9.
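
A minimal sketch of the proposed change, reusing the placeholder Block and fetchBlock from the sketch above (again hypothetical names, not the actual window.go signatures). The window is still fetched concurrently, but instead of failing the whole round on the first error, we return the longest contiguous prefix of successfully fetched blocks:

```go
// getContiguousBlocks sketches the proposed behavior: fetch the window
// concurrently as before, then return the longest contiguous run of
// successfully fetched blocks starting from the head of the window.
// If block-10 fails, block-0 ~ block-9 are still returned and indexed.
func getContiguousBlocks(start, windowSize int64, fetchBlock func(height int64) (*Block, error)) ([]*Block, error) {
	blocks := make([]*Block, windowSize)
	errs := make([]error, windowSize)

	var wg sync.WaitGroup
	for i := int64(0); i < windowSize; i++ {
		wg.Add(1)
		go func(i int64) {
			defer wg.Done()
			blocks[i], errs[i] = fetchBlock(start + i)
		}(i)
	}
	wg.Wait()

	// Cut the window at the first failure instead of discarding everything.
	for i := int64(0); i < windowSize; i++ {
		if errs[i] != nil {
			if i == 0 {
				// Nothing usable this round; surface the error so the
				// caller can retry the same window.
				return nil, errs[i]
			}
			return blocks[:i], nil
		}
	}
	return blocks, nil
}
```

The next round of syncing could then start from the first failed height, so a lagging node only delays the blocks it actually failed to serve.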

ysong42 changed the title from "Problem: usecase/syncstrategy/window.go one failure fails all API request" to "Problem: usecase/syncstrategy/window.go one failure API request fails all API requests" on Dec 8, 2021

tomtau commented Dec 8, 2021

Not sure if it helps with this issue, but one potential thing to consider is the "new" PostgreSQL option in the Tendermint configuration: https://github.com/tendermint/tendermint/blob/db6e031a16e25f9f957c03618bfb5b4b98b42c0c/docs/app-dev/indexing-transactions.md#postgresql
With that, I assume the model can be more "push-based" instead of "pull-based", i.e. the full node would directly write source events into chain-indexing's DB instead of chain-indexing calling the node's JSON-RPC to retrieve them.


ysong42 commented Dec 14, 2021

After another discussion with Leo, here is more context on this issue:

This issue hasn't received any complaints from users yet. It is more of a potential issue that may cause confusion on the user side.

Leo agreed that the root cause is not in the indexing server. The root cause on the blockchain node side is still unknown at the moment. The DevOps team is now adding more machines to the internal nodes.

The interesting thing is that when using only one node, it never seems to lag behind; only when using 2 or 3 nodes do some of them lag. To bring a lagging node back, the DevOps team needs to either restart it manually (which takes a few minutes) or wait for it to recover by itself (not sure how long that takes).

Let's see how it goes. If the issue still exists after more machines are added to the internal nodes, we can have another round of discussion.
