Logstash gets stuck in infinite retry loop writing to Time Series Data Streams #15832
Comments
I've added this to the elastic-agent pipeline so metrics data older than 1 day gets dropped.
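Presumably something along these lines; a minimal sketch assuming the logstash-filter-age plugin is installed, with the `[data_stream][type]` condition and one-day threshold as illustrative choices rather than the poster's exact config:

```
filter {
  # The age filter computes the event age in seconds (now minus @timestamp)
  # and stores it in [@metadata][age] by default.
  age {}

  # Drop metrics events older than one day so they never target
  # read-only TSDS backing indices.
  if [data_stream][type] == "metrics" and [@metadata][age] > 86400 {
    drop {}
  }
}
```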
This is fine as a workaround until Logstash handles TSDS 403 Forbidden responses properly, i.e. stops retrying them until it topples over.
We want to avoid using dead letter queues as we still want to see other indexing errors in the Logstash logs. It's just the metrics data/TSDS indices which are a problem when a backing index is in read-only mode, resulting in Elasticsearch returning a 403 Forbidden error.
Another option would be to add some code to the metrics ingest pipelines which drops events older than X days.
@mbudge thanks for the report.
There are two scenarios here: (1) a genuine permissions problem, where up to 100% of all incoming data is blocked from ingestion, and (2) only late events being rejected because they target TSDS backing indices that are no longer writable. In case 1 it's better for the pipelines to stop ingesting altogether, otherwise Logstash would drop 100% of the data received, causing major data loss. We don't have a good way of knowing which scenario we're facing by looking at the error from ES, especially without further domain knowledge being injected by the user, like your solution here where you tell Logstash "I want anything older than 24h to be dropped, since I know it may otherwise land in indices that aren't writable anymore".
I don't see how a limited number of retries would help here, since nothing would change between the time the first error happens and the last retry, which would only be a few seconds later (e.g. with 3 retries: 0.1 seconds for the first retry, 0.5s for the second, 1 second for the third). Typically for a 403, action needs to be taken either immediately (e.g. drop/DLQ) or you retry forever so an operator/administrator can take action and fix the permissions.
Given that in your domain you expect data to land in places where writing is forbidden, maybe just setting dlq_custom_codes to include 403 on the elasticsearch output would be enough.
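A minimal sketch of what that could look like (hosts and other settings are illustrative; dead_letter_queue.enable must also be set to true in logstash.yml for events to actually reach the DLQ):

```
output {
  elasticsearch {
    hosts            => ["https://elasticsearch:9200"]  # illustrative
    data_stream      => true
    # Route 403 responses to the dead letter queue instead of retrying them forever
    dlq_custom_codes => [403]
  }
}
```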
With the setting above only 403s would be DLQ'ed; everything else would still behave as before. I don't see a way to make this the default, as it would easily upset other users if a permission error led to all their data being happily dropped on the floor until someone noticed. Maybe a path forward is to have TSDS raise a specific kind of error for data not being allowed into downsampled indices, so clients like Logstash can differentiate between this scenario and any other kind of forbidden/403 error.
dlq_custom_codes are in addition to 400 and 404, which is why I want to stay away from the DLQ, so we can keep monitoring for 400 and 404 errors in the Logstash logs on the host and in Elastic:

"List single-action error codes from Elasticsearch's Bulk API that are considered valid to move the events into the dead letter queue. This list is an addition to the ordinary error codes considered for this feature, 400 and 404."
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-dlq_custom_codes

We can have two elasticsearch outputs configured in Logstash, one for logs and one for metrics data (see the sketch at the end of this comment). On the elasticsearch output for metrics data we would use max retries to prevent Logstash retrying late metrics data indefinitely, preventing an outage, as eventually all the late metrics events would be dropped (and customers decide when that happens).

Adding a drop processor to the metrics ingest pipelines does sound like a good way to protect against this too. I think the Fleet team are adding a global component template to all the Fleet-managed logs data streams. If you get them to add the global@custom ingest pipeline to the metrics data streams as well, it becomes easier for customers to add logic that drops late metrics data across all the metrics ingest pipelines.
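A rough sketch of that two-output split; the `[data_stream][type]` conditional and hosts are assumptions rather than the poster's actual config, and the max-retries behaviour described above is the feature being requested in this issue, so it is not shown:

```
output {
  if [data_stream][type] == "metrics" {
    # Metrics get their own output so their retry/DLQ policy can differ from logs
    elasticsearch {
      hosts       => ["https://elasticsearch:9200"]  # illustrative
      data_stream => true
    }
  } else {
    # Logs keep the default behaviour so 400/404 errors stay visible in the Logstash logs
    elasticsearch {
      hosts       => ["https://elasticsearch:9200"]  # illustrative
      data_stream => true
    }
  }
}
```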
We can close this. Fleet is going to support multiple outputs, which will allow us to send metrics data directly to Elasticsearch. System log data can go to Logstash.
Hi,
We've had several severe outages over the last 6 months due to Logstash getting stuck in an infinite retry loop writing to TSDS.
This is because TSDS backing indices have a start time and an end time, so if data arrives late, TSDS routes the documents to older backing indices.
This happens when those older indices have become read-only, for example after moving to the frozen tier.
When this happens we find 403 Forbidden errors in the Logstash logs.
The only way to fix it is to delete all the frozen metrics indexes then restart Logstash.
The metrics data is on warm nodes for 7 days before moving to frozen nodes.
We need max retries so Logstash doesn't get stuck and lose all our log data. We work in a regulated environment where we are audited on data loss.
See support ticket #01543981 for more details.