Bulk API return document level 500 when new shards are being allocated #5565

Open
esatterwhite opened this issue Nov 26, 2024 · 1 comment · May be fixed by #5566
Labels
bug Something isn't working


@esatterwhite
Collaborator

Describe the bug
Under certain conditions when using the Elasticsearch-compatible `_bulk` API, particularly during spikes in ingestion traffic or when an index is first created, Quickwit rejects documents with a document-level status of 500, an error type of `internal_exception`, and the reason `no shards available`. A 500 tends to indicate that something has gone wrong on the server and that the document cannot be retried.

```js
{
  status: 500
, error: {
    type: 'internal_exception'
  , reason: 'no shards available'
  }
}
```

This causes problems when using existing Elasticsearch client libraries. Many of them implement logic for handling retries and document-level errors from the bulk API, but that retry logic generally only kicks in when the document-level status is 429. Applications that rely on this retry logic are affected: with the current Quickwit behavior, documents are generally dropped because the client assumes the error is terminal, when it is really a transient warm-up problem.
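To illustrate why the status code matters, here is a minimal sketch of the kind of per-item handling many bulk clients perform. It assumes the standard Elasticsearch `_bulk` response shape (an `items` array where each entry is keyed by its action name, e.g. `index`, and carries a `status` and an optional `error`). `partitionBulkItems` is a hypothetical helper, not part of any particular client library, and the sample response is illustrative data only.

```js
// Hypothetical helper: split a _bulk response into retryable and terminal failures.
// Only a document-level 429 is treated as retryable, mirroring the client behavior described above.
function partitionBulkItems(response, documents) {
  const retryable = [];
  const failed = [];
  response.items.forEach((item, i) => {
    // Each item is keyed by its action name, e.g. { index: { status, error } }.
    const action = Object.values(item)[0];
    if (action.status < 300) return; // accepted, nothing to do
    if (action.status === 429) {
      retryable.push(documents[i]); // transient: resend later
    } else {
      failed.push({ doc: documents[i], error: action.error }); // treated as terminal and dropped
    }
  });
  return { retryable, failed };
}

// Illustrative response: one success, one 500 "no shards available", one 429.
const response = {
  errors: true,
  items: [
    { index: { status: 201 } },
    { index: { status: 500, error: { type: 'internal_exception', reason: 'no shards available' } } },
    { index: { status: 429, error: { type: 'es_rejected_execution_exception', reason: 'rejected' } } },
  ],
};
const docs = [{ id: 1 }, { id: 2 }, { id: 3 }];
const { retryable, failed } = partitionBulkItems(response, docs);
console.log(retryable.length, failed.length); // 1 1
```

With the current behavior, the document that hit `no shards available` lands in `failed` and is never resent, even though a retry a moment later would likely succeed.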

Expected behavior
The bulk API document errors should use status 429 when there are no shards available. It may also help to return an error type that is more indicative of the problem than `internal_exception`:

```js
{
  status: 429
, error: {
    type: 'no_shard_available_action_exception' // elasticsearch has this error code, but it may mean something else in that context.
  , reason: 'no shards available'
  }
}
```
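A minimal sketch of the proposed mapping, written against a hypothetical internal failure object. The `kind` values (`'no_shards_available'`, anything else) and the `toBulkItemError` name are assumptions for illustration only; Quickwit itself is written in Rust and its actual error types differ.

```js
// Hypothetical mapping from an internal ingest failure to a _bulk item error.
// The 'no_shards_available' kind is an illustrative label, not Quickwit's actual error enum.
function toBulkItemError(failure) {
  switch (failure.kind) {
    case 'no_shards_available':
      // Transient: shards are still being allocated, so tell the client to back off and retry.
      return {
        status: 429,
        error: { type: 'no_shard_available_action_exception', reason: 'no shards available' },
      };
    default:
      // Everything else remains a server-side error.
      return {
        status: 500,
        error: { type: 'internal_exception', reason: failure.message },
      };
  }
}

console.log(toBulkItemError({ kind: 'no_shards_available' }).status); // 429
```

Keeping the transient case separate from the default branch means genuinely unexpected failures still surface as 500s, so clients only retry the condition that is known to resolve itself.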
@esatterwhite esatterwhite added the bug Something isn't working label Nov 26, 2024
@rdettai rdettai linked a pull request Nov 27, 2024 that will close this issue
@fulmicoton
Contributor

One trouble is that we actually don't want you to retry right away in that case. Maybe we should set an informative retry_after header? (500ms maybe)
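A minimal client-side sketch of "don't retry right away": wait for a server-provided hint when one exists, otherwise fall back to capped exponential backoff with full jitter. `retryDelayMs` and the `retryAfterMs` hint are hypothetical; note that the standard HTTP `Retry-After` header only carries whole seconds or an HTTP date, so a 500 ms hint would either round up to 1 second or need to be conveyed some other way.

```js
// Hypothetical: how long to wait before resending documents that got a 429.
// retryAfterMs is an assumed server hint (e.g. 500 ms); when absent, use capped
// exponential backoff with full jitter so many clients don't retry in lockstep.
function retryDelayMs(attempt, retryAfterMs, { baseMs = 100, capMs = 10000, random = Math.random } = {}) {
  if (typeof retryAfterMs === 'number' && retryAfterMs >= 0) {
    return retryAfterMs; // trust the server's hint
  }
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling); // uniform in [0, ceiling)
}

console.log(retryDelayMs(0, 500)); // 500
```

Jitter matters here because a burst of rejected documents from many writers would otherwise all come back at the same moment, right while shards are still being allocated.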
