
Logstash gets stuck in infinite retry loop writing to Time Series Data Streams #15832

Closed · mbudge opened this issue Jan 22, 2024 · 6 comments

mbudge commented Jan 22, 2024

Hi,

We've had several severe outages over the last 6 months due to Logstash getting stuck in an infinite retry loop writing to TSDS.

This is because TSDS have a start time and an end time, so if data arrives late, TSDS routes the documents to older backing indices.

This happens when:

  • Logstash crashes, and when we restart it, Logstash starts writing metrics data to read-only indices which have already rolled over.
  • A host with a misconfigured clock causes Logstash to write to a read-only index.

When this happens, we find errors like this in the Logstash logs:

cluster_block_exception", "reason"=>"index [partial-.ds-metrics-system.diskio-default-2023.11.24-000047] blocked by: [FORBIDDEN/8/index write (api)];"}}

Retrying individual bulk actions that failed or were rejected by the previous bulk request {:count=>2}

The only way to fix it is to delete all the frozen metrics indices and then restart Logstash.

The metrics data is on warm nodes for 7 days before moving to frozen nodes.

We need a maximum number of retries so Logstash doesn't get stuck and lose all our log data. We work in a regulated environment where we are audited on data loss.

See support ticket #01543981 for more details.

mbudge commented Jan 29, 2024

I've added this to the Elastic Agent pipeline so metrics data older than 1 day gets dropped.

if [data_stream][type] == "metrics" {
  ruby {
    init => "require 'time'"
    # Cancel (drop) any metrics event whose @timestamp is more than
    # 86400 seconds (24 hours) in the past.
    code => '
      if LogStash::Timestamp.new(event.get("@timestamp") + 86400) < LogStash::Timestamp.now
        event.cancel
      end
    '
  }
}
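
With this in place, metrics events more than 24 hours old are cancelled in the filter stage, so they never reach the elasticsearch output and can no longer trigger the read-only-index errors described above.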

This is fine as a workaround until Logstash handles TSDS 403 Forbidden responses properly, i.e. stops retrying them indefinitely until it topples over.

mbudge commented Feb 12, 2024

We want to avoid using dead letter queues, as we still want to see other indexing errors in the Logstash logs.

It's just the metrics data/TSDS indices that are a problem when the index is in read-only mode, resulting in Elasticsearch returning a 403 Forbidden error.

mbudge commented Feb 12, 2024

Another option would be to add some code to the metrics ingest pipelines which drops events older than X days.

jsvd commented Feb 15, 2024

@mbudge thanks for the report.
The challenge of deciding when to retry or not has been a back and forth for ages in Logstash, with arguments on both sides.
The struggle with this particular case is that the error is about authorization, and it may happen because:

  1. a user or api key was given insufficient roles to be able to write to the desired index
  2. the dynamic index feature allows data to land in places the user doesn't want it to

In case 1, we're talking about up to 100% of all incoming data being blocked from ingestion, and in this case it's better for the pipeline to stop ingesting altogether; otherwise it would drop 100% of the data received, causing major data loss.
In case 2 we're talking about a small portion of data that is likely unwanted anyway, so it could be discarded.

We don't have a good way of knowing which scenario we're facing by looking at the error from ES, especially without further domain knowledge being injected by the user, like your solution here where you tell Logstash "I want anything older than 24h to be dropped, since I know otherwise it may land in indices that aren't writable anymore".

> We need a maximum number of retries so Logstash doesn't get stuck and lose all our log data. We work in a regulated environment where we are audited on data loss.

I don't see how a retry limit here would help, since nothing would change between the time the first error happens and the last retry, which would likely be only a few seconds later (e.g. 3 retries: 0.1 seconds for the first retry, 0.5s for the second, 1 second for the third). Typically, for a 403, action needs to be taken either immediately (e.g. drop/DLQ) or the request retried forever so an operator/administrator can step in and fix permissions.

> This is fine as a workaround until Logstash handles TSDS 403 Forbidden responses properly, i.e. stops retrying them indefinitely until it topples over.

Given that in your domain you expect data to land in places where writing is forbidden, maybe just setting dlq_custom_codes for 403 would be enough, thus indicating that a 403 at the document level is fine to either drop or send to the DLQ:

output {
  elasticsearch {
    # existing hosts/auth/data_stream settings stay as they are;
    # only the custom DLQ code for 403 responses is added
    dlq_custom_codes => [ 403 ]
  }
}
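
(Note: the DLQ itself is enabled separately via dead_letter_queue.enable: true in logstash.yml; dlq_custom_codes only extends the set of response codes, 400 and 404 by default, that are treated this way.)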

> We want to avoid using dead letter queues, as we still want to see other indexing errors in the Logstash logs.

With the setting above, only 403s would be DLQ'ed; everything else would still behave as before.

I don't see a way to make this the default, as it would easily upset other users to find that a permission error led to all their data being happily dropped on the floor until someone noticed.

Maybe a path forward is to have TSDS raise a specific kind of error of data not being allowed in downsampled indices, so clients like Logstash can differentiate between this scenario and any other kind of forbidden/403 error.

mbudge commented Feb 15, 2024

dlq_custom_codes are in addition to 400 and 404, which is why I want to stay away from the DLQ, so we can monitor for 400 and 404 errors in the Logstash logs on the host and in Elastic.

"List single-action error codes from Elasticsearch’s Bulk API that are considered valid to move the events into the dead letter queue. This list is an addition to the ordinary error codes considered for this feature, 400 and 404." https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-dlq_custom_codes

We can have two elasticsearch outputs configured in Logstash: one for logs and one for metrics data. On the elasticsearch output for metrics data we would use max retries to prevent Logstash retrying late metrics data indefinitely, preventing an outage, as eventually all the late metrics events would be dropped (and customers decide when that happens); a rough sketch of such a split is shown below. Adding a drop processor to the metrics ingest pipelines does sound like a good way to protect against this too.
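
A minimal sketch of that two-output split, keyed on [data_stream][type]; the hosts value is a placeholder, and the per-output retry cap requested in this issue is not an existing elasticsearch output option today, so the comments only mark where metrics-specific handling would go:

output {
  if [data_stream][type] == "metrics" {
    elasticsearch {
      # metrics-only output: metrics-specific handling such as the retry cap
      # requested in this issue (not an existing option yet) or
      # dlq_custom_codes => [ 403 ] would be configured here
      hosts       => ["https://elasticsearch.example:9200"]   # placeholder
      data_stream => true
    }
  } else {
    elasticsearch {
      # logs and everything else keep the default retry behaviour
      hosts       => ["https://elasticsearch.example:9200"]   # placeholder
      data_stream => true
    }
  }
}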

I think the Fleet team are adding a global component template to all the Fleet-managed logs data streams. If you get them to add the global@custom ingest pipeline to the metrics data streams as well, it becomes easier for customers to add logic that drops late metrics data across all the metrics ingest pipelines.

elastic/kibana#149484

mbudge commented May 11, 2024

We can close this. Fleet is going to support multiple outputs, which will allow us to send metrics data directly to Elasticsearch. System logs data can go to Logstash.
