
Logstash gets stuck in infinite retry loop writing to Time Series Data Streams #15832

Closed · mbudge opened this issue Jan 22, 2024 · 6 comments

mbudge commented Jan 22, 2024

Hi,

We've had several severe outages over the last 6 months due to Logstash getting stuck in an infinite retry loop writing to TSDS.

This is because TSDS have a start time and an end time, so if data arrives late, TSDS routes the documents to older backing indices.

This happens when:

  • Logstash crashes, and when we restart it, Logstash starts writing metrics data to read-only indices which have already rolled over.
  • A host with a misconfigured clock causes Logstash to write to a read-only index.

When this happens, we find errors like this in the Logstash logs:

cluster_block_exception", "reason"=>"index [partial-.ds-metrics-system.diskio-default-2023.11.24-000047] blocked by: [FORBIDDEN/8/index write (api)];"}}

Retrying individual bulk actions that failed or were rejected by the previous bulk request {:count=>2}

The only way to fix it is to delete all the frozen metrics indices and then restart Logstash.

The metrics data is on warm nodes for 7 days before moving to frozen nodes.

We need a maximum number of retries so Logstash doesn't get stuck and lose all our log data. We work in a regulated environment where we are audited on data loss.

See support ticket #01543981 for more details.

mbudge commented Jan 29, 2024

I've added this to the Elastic Agent pipeline so metrics data older than 1 day gets dropped.

if [data_stream][type] == "metrics" {
  ruby {
    init => "require 'time'"
    # Cancel (drop) any metrics event whose @timestamp is more than
    # 86400 seconds (24 hours) in the past.
    code => '
      if LogStash::Timestamp.new(event.get("@timestamp") + 86400) < LogStash::Timestamp.now
        event.cancel
      end
    '
  }
}
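
With this in place, metrics events more than 24 hours old are cancelled in the filter stage, so they never reach the elasticsearch output and can no longer trigger the read-only-index errors described above.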

This is fine as a workaround until Logstash handles TSDS 403 Forbidden responses properly, i.e. stops retrying them indefinitely until it topples over.

mbudge commented Feb 12, 2024

We want to avoid using dead letter queues, as we still want to see other indexing errors in the Logstash logs.

It's just the metrics data/TSDS indices that are a problem when the index is in read-only mode, resulting in Elasticsearch returning a 403 Forbidden error.

mbudge commented Feb 12, 2024

Another option would be to add some code to the metrics ingest pipelines which drops events older than X days.

jsvd commented Feb 15, 2024

@mbudge thanks for the report.
The challenge of deciding when to retry or not has been a back and forth for ages in Logstash, with arguments on both sides.
The struggle with this particular case is that the error is about authorization, and it may happen because:

  1. a user or api key was given insufficient roles to be able to write to the desired index
  2. the dynamic index feature allows data to land in places the user doesn't want it to

In case 1, we're talking about up to 100% of all incoming data being blocked from ingestion, and in this case it's better for the pipeline to stop ingesting altogether; otherwise it would drop 100% of the data received, causing major data loss.
In case 2 we're talking about a small portion of data that is likely unwanted anyway, so it could be discarded.

We don't have a good way of knowing which scenario we're facing by looking at the error from ES, especially without further domain knowledge being injected by the user, like your solution here where you tell Logstash "I want anything older than 24h to be dropped, since I know otherwise it may land in indices that aren't writable anymore".

> We need a maximum number of retries so Logstash doesn't get stuck and lose all our log data. We work in a regulated environment where we are audited on data loss.

I don't see how a retry limit here would help, since nothing would change between the time the first error happens and the last retry, which would likely be only a few seconds later (e.g. 3 retries: 0.1 seconds for the first retry, 0.5s for the second, 1 second for the third). Typically, for a 403, action needs to be taken either immediately (e.g. drop/DLQ) or the request retried forever so an operator/administrator can step in and fix permissions.

> This is fine as a workaround until Logstash handles TSDS 403 Forbidden responses properly, i.e. stops retrying them indefinitely until it topples over.

Given that in your domain you expect data to land in places where writing is forbidden, maybe just setting dlq_custom_codes for 403 would be enough, thus indicating that a 403 at the document level is fine to either drop or send to the DLQ:

output {
  elasticsearch {
    # existing hosts/auth/data_stream settings stay as they are;
    # only the custom DLQ code for 403 responses is added
    dlq_custom_codes => [ 403 ]
  }
}
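
(Note: the DLQ itself is enabled separately via dead_letter_queue.enable: true in logstash.yml; dlq_custom_codes only extends the set of response codes, 400 and 404 by default, that are treated this way.)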

> We want to avoid using dead letter queues, as we still want to see other indexing errors in the Logstash logs.

With the setting above, only 403s would be DLQ'ed; everything else would still behave as before.

I don't see a way to make this the default, as it would easily upset other users to find that a permission error led to all their data being happily dropped on the floor until someone noticed.

Maybe a path forward is to have TSDS raise a specific kind of error of data not being allowed in downsampled indices, so clients like Logstash can differentiate between this scenario and any other kind of forbidden/403 error.

mbudge commented Feb 15, 2024

dlq_custom_codes are in addition to 400 and 404, which is why I want to stay away from the DLQ, so we can monitor for 400 and 404 errors in the Logstash logs on the host and in Elastic.

"List single-action error codes from Elasticsearch’s Bulk API that are considered valid to move the events into the dead letter queue. This list is an addition to the ordinary error codes considered for this feature, 400 and 404." https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-dlq_custom_codes

We can have two elasticsearch outputs configured in Logstash: one for logs and one for metrics data. On the elasticsearch output for metrics data we would use max retries to prevent Logstash retrying late metrics data indefinitely, preventing an outage, as eventually all the late metrics events would be dropped (and customers decide when that happens); a rough sketch of such a split is shown below. Adding a drop processor to the metrics ingest pipelines does sound like a good way to protect against this too.
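
A minimal sketch of that two-output split, keyed on [data_stream][type]; the hosts value is a placeholder, and the per-output retry cap requested in this issue is not an existing elasticsearch output option today, so the comments only mark where metrics-specific handling would go:

output {
  if [data_stream][type] == "metrics" {
    elasticsearch {
      # metrics-only output: metrics-specific handling such as the retry cap
      # requested in this issue (not an existing option yet) or
      # dlq_custom_codes => [ 403 ] would be configured here
      hosts       => ["https://elasticsearch.example:9200"]   # placeholder
      data_stream => true
    }
  } else {
    elasticsearch {
      # logs and everything else keep the default retry behaviour
      hosts       => ["https://elasticsearch.example:9200"]   # placeholder
      data_stream => true
    }
  }
}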

I think the Fleet team are adding a global component template to all the Fleet-managed logs data streams. If you get them to add the global@custom ingest pipeline to the metrics data streams as well, it becomes easier for customers to add logic that drops late metrics data across all the metrics ingest pipelines.

elastic/kibana#149484

mbudge commented May 11, 2024

We can close this. Fleet is going to support multiple outputs, which will allow us to send metrics data directly to Elasticsearch. System logs data can go to Logstash.
