Failed config reloads in Logstash causes resources leaks #16202

edmocosta · 2024-06-06T16:57:56Z

Logstash information:

Please include the following information:

Logstash version: 8.14.0
Logstash installation source: zip archive
How is Logstash being run: Via command line

Description of the problem including expected versus actual behavior:
Logstash is not properly handling failed configuration reloads.

When Logstash tries to reload the pipeline configuration and it fails (due to some issue, e.g database down), subsequent reloads causes resource leaks, as no attempt to stop the input plugins is done. It also keeps the old pipeline's plugin in the metric store, filling the memory and returning outdated resources on the stats API.

The issues seems to be on the java_pipeline.rb#shutdown logic. When the reload fails, the finished_execution? is true, so it skips the stop_inputs, wait_for_shutdown and clear_pipeline_metrics methods execution, leaking those resources and increasing them linearly.

For example, given theIf pipeline:

input {
  jdbc {
    id => "db_1"
    type => "jdbc"
     ...
   }

   jdbc {
     id => "db_2"
     type => "jdbc"
      ...
    }
}

If the pipeline configuration reload fails due to db_2 being unavailable, the db_1 connections won't be closed, and it will duplicate the resource usage (number of connections) in every failed reload attempt. It will also keep both pipelines "old" and "new" plugins on the metric store, increasing the number of elements per retry.

Steps to reproduce:

Configure a pipeline with 2 JDBC inputs
Start Logstash with auto reload enabled
Check the number of DB's connections (e.g mysql show processlist) and plugins metrics:
curl -s http://localhost:9600/_node/stats | jq -r '.pipelines."<pipeline_name>".plugins.codecs | length'
Change the pipeline in a way that 1 plugins fails (wrong username for example).
Change the pipeline again (empty space, enter, etc).
Repeat the step 3 and compare the results.

The text was updated successfully, but these errors were encountered:

yaauie · 2024-06-11T16:02:17Z

Fly-by thoughts:

JavaPipeline#clear_pipeline_metrics is only called for JavaPipeline#shutdown, and does not occur when a pipeline fails to start. IIRC we want for a pipeline to leave its metrics in-place when it stops "normally" or exceptionally, but we want for them to be cleared when the pipeline has been shut down.

I think that the "easiest" way to handle this would be to also clear them as a part of starting up, but we would need to be careful to ensure that if a pipeline has overlap during a restart that we keep ordering correct.

Alternatively, we could change the default id generation algorithm to somehow fingerprint the source of the plugin.

edmocosta · 2024-06-11T16:19:09Z

Hey @yaauie, thank you for the input! I'm wondering if changing the JavaPipeline#shutdown method, adding a reloading parameter to avoid the shutdown early return would solve this issue?, E.g:

def shutdown(reloading: false)
 return if finished_execution? && !reloading
end

Then on the PipelineAction::Reload#execute:

old_pipeline.shutdown(reloading: true)

My initial tests seems to be working with this code change, but I'm still unsure if it can have any other impact on the reloading process. WDYT?

yaauie · 2024-06-11T16:52:18Z

🤔 The comments lead me to believe that we guard on JavaPipeline#finished_execution? to ensure we don't have a race with two threads starting and stopping the pipeline simultaneously, and that we have a chance to block until a running pipeline has fully shut down. But because it bails from the shutdown method entirely it fails to do the cleanup that needs to happen whether or not the to-be-shutdown pipeline has already completed execution.

I think that if we change the scope of the guard a little, we can get the behaviour we want in both cases without needing to add the reloading control variable:

diff --git a/logstash-core/lib/logstash/java_pipeline.rb b/logstash-core/lib/logstash/java_pipeline.rb
index 85f182fdb..0e4bd5c44 100644
--- a/logstash-core/lib/logstash/java_pipeline.rb
+++ b/logstash-core/lib/logstash/java_pipeline.rb
@@ -446,15 +446,16 @@ module LogStash; class JavaPipeline < AbstractPipeline
   # this method is intended to be called from outside the pipeline thread
   # and will block until the pipeline has successfully shut down.
   def shutdown
-    return if finished_execution?
-    # shutdown can only start once the pipeline has completed its startup.
-    # avoid potential race condition between the startup sequence and this
-    # shutdown method which can be called from another thread at any time
-    sleep(0.1) while !ready?
-
-    # TODO: should we also check against calling shutdown multiple times concurrently?
-    stop_inputs
-    wait_for_shutdown
+    unless finished_execution?
+      # shutdown can only start once the pipeline has completed its startup.
+      # avoid potential race condition between the startup sequence and this
+      # shutdown method which can be called from another thread at any time
+      sleep(0.1) while !ready?
+
+      # TODO: should we also check against calling shutdown multiple times concurrently?
+      stop_inputs
+      wait_for_shutdown
+    end
     clear_pipeline_metrics
   end # def shutdown

edmocosta · 2024-06-17T09:26:41Z

Hi @yaauie, If I'm not mistaken, It would only solve the pipeline metrics issue, given the clear_pipeline_metrics was moved out of the guard, but the resources leak problem would still happen, as successfully started input plugins wouldn't be closed on subsequent failed reloads.

For example:

Let's assume the following pipeline started successfully and both plugins are running:

input {
  jdbc {
    id => "db_1"
    type => "jdbc"
     ...
   }

   jdbc {
     id => "db_2"
     type => "jdbc"
      ...
    }
}

Then, the user changes the jdbc { id => "db_2" } settings, making it invalid (wrong password, for example).
On the first reload attempt, the shutdown method gets finished_execution? => false, cleaning the resources and metrics properly.

But then, the user changes it again, but the configuration is still invalid. This time, when Logstash tries to reload the pipeline configuration, the shutdown method will get finished_execution? => true, as the pipeline being reloaded/shut down is the same that previously failed to start, skipping the resources cleaning up, and letting the jdbc { id => "db_1" } connections open (as no stop_inputs are invoked).

jsvd · 2024-06-25T15:39:59Z

The scenario I was able to reproduce consistently has to do with resources created during the register method of input plugins.
This didn't reproduce when resources were created by filters, or if they were created after register during start.

To reproduce I took simple inputs like heartbeat and generator and added a resource creation (connection) during register, and added an conditional exception to one of them. With this I could trigger the resource pile up.

The proposed solution fixes this scenario, and I don't see any problems it could create. In this case we're sort of abusing stop_inputs to perform the cleanup, even though inputs never ended up starting in the first place, but it works. A more structured fix would be to differentiate this situation and call close on each input (as opposed to stop).

jsvd · 2024-06-25T15:40:46Z

In the meantime the fix proposed for cleaning up metrics resources is still valid and can be done regardless of the fix for the input register leaks

edmocosta added bug status:needs-triage int-shortlist Team:Logstash labels Jun 6, 2024

edmocosta assigned edmocosta and unassigned edmocosta Jun 11, 2024

flexitrev assigned edmocosta Jun 24, 2024

edmocosta mentioned this issue Jun 26, 2024

Ensure pipeline metrics are cleared on the shutdown #16264

Merged

5 tasks

edmocosta closed this as completed in #16264 Jun 28, 2024

jsvd removed the status:needs-triage label Jun 28, 2024

github-actions bot mentioned this issue Jul 3, 2024

Backport PR #16264 to 8.14: Ensure pipeline metrics are cleared on the shutdown #16291

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failed config reloads in Logstash causes resources leaks #16202

Failed config reloads in Logstash causes resources leaks #16202

edmocosta commented Jun 6, 2024 •

edited

Loading

yaauie commented Jun 11, 2024

edmocosta commented Jun 11, 2024

yaauie commented Jun 11, 2024

edmocosta commented Jun 17, 2024 •

edited

Loading

jsvd commented Jun 25, 2024

jsvd commented Jun 25, 2024

Failed config reloads in Logstash causes resources leaks #16202

Failed config reloads in Logstash causes resources leaks #16202

Comments

edmocosta commented Jun 6, 2024 • edited Loading

yaauie commented Jun 11, 2024

edmocosta commented Jun 11, 2024

yaauie commented Jun 11, 2024

edmocosta commented Jun 17, 2024 • edited Loading

jsvd commented Jun 25, 2024

jsvd commented Jun 25, 2024

edmocosta commented Jun 6, 2024 •

edited

Loading

edmocosta commented Jun 17, 2024 •

edited

Loading