
Bug: Slow compactor cleanups can cause consistency check errors on queriers #9924

jhalterman opened this issue Nov 15, 2024 · 3 comments

@jhalterman
Member

jhalterman commented Nov 15, 2024

What is the bug?

Queriers currently provide a 45m grace period (3 * the 15m default `bucket-store.sync-interval`) for store gateways to discover blocks before those blocks are expected to be queryable. But the grace period is measured from when the block was uploaded, not from when it became discoverable via a bucket index. When bucket index updates lag, a block can become queryable before store gateways have had an opportunity to discover it. In the worst case, where a bucket index isn't updated for 45m, any new blocks become immediately queryable once the bucket index is loaded by a querier.

Proposed fix

Queriers should base their grace period, in whole or in part, on when the bucket index was created, since that's when a block becomes discoverable by store gateways. For example, we could use 2 * `bucket-store.sync-interval` as the grace period relative to the bucket index creation time, and optionally also require that the existing 3 * `bucket-store.sync-interval` grace period relative to the block upload time be satisfied, which would give store gateways a bit more time to discover blocks in some cases.

How to reproduce it?

This came up when heavy OOO block creation caused a lot of compaction activity which slowed down bucket cleanup runs, leading to delayed bucket index creations.

What did you think would happen?

`err-mimir-store-consistency-check-failed` should not occur.

What was your environment?

K8s

Any additional context to share?

No response

@charleskorn
Contributor

A couple of thoughts on this:

  • How would a querier know when a block first appeared in the bucket index? Would it need to examine all historical bucket index versions to discover this? Or would compactors set the time in the bucket index?
  • Should we do the same thing for block deletion times in the bucket index? Otherwise, queriers could stop querying a compacted block (based on its deletion time, not on when the deletion first appeared in the bucket index) well before its replacement block is required to be queried (based on when the replacement first appeared in the bucket index, which may be well after the original block was deleted). See #9224 ("Fix issue where both recently compacted blocks and their source blocks can be skipped during querying") for an example of what I mean here.

@jhalterman
Member Author

> How would a querier know when a block first appeared in the bucket index? Would it need to examine all historical bucket index versions to discover this? Or would compactors set the time in the bucket index?

Having to read multiple versions of a bucket index doesn't sound good. Compactors including an approximate time that each block was indexed in the index itself would lighten the burden for queriers. This could probably even (eventually) replace the existing `updated_at` field that's in the index, which might not be needed anymore.

Another alternative is the block metas could be updated to include the time at which each block was first indexed, but that's potentially a lot of writes.

> Should we do the same thing for deletion times for blocks in the bucket index?

Yes, possibly. Do you think that's worthy of a separate issue?

@charleskorn
Contributor

> Having to read multiple versions of a bucket index doesn't sound good. Compactors including an approximate time that each block was indexed in the index itself would lighten the burden for queriers. This could probably even (eventually) replace the existing `updated_at` field that's in the index, which might not be needed anymore.

Sounds good to me. I'd be in favour of replacing the updated_at field in the index rather than introducing a new field, unless we really need both - I worry that otherwise different components would use different fields and that will lead to other problems.

>> Should we do the same thing for deletion times for blocks in the bucket index?
>
> Yes, possibly. Do you think that's worthy of a separate issue?

I think we have to do both at the same time - otherwise we might end up in a situation like #9224. Imagine a block A is compacted into a new block N at time T. If the compactor doesn't update the bucket index with the new block N for an hour, then the deletion time for block A will be T, but the `updated_at` for N will be T+1h, and queriers would ignore block N (because it is too new) while also not querying A (because it was deleted too long ago).
