Fix termination deadlock when the output is retrying whole failed bulk requests #1117
Conversation
If we return without raising (and crashing the plugin/worker), the subsequent normal completion of a batch that was sourced from the PQ will cause it to be acknowledged and marked eligible for immediate deletion, so a restarted pipeline would NOT include the batch. I consider this to be a data-loss regression, even though it is less bad for the memory-queue case, or for cases where some amount of data loss is acceptable as the cost of keeping things moving.
The current behaviour is an unfortunate deadlock, true, but a force-shutdown followed by a restart currently will re-emit the unacknowledged batch for reprocessing.
Should we instead raise an exception that crashes the worker and prevents the batch from being acknowledged?
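The distinction can be sketched in a few lines of Ruby. This is an illustrative sketch only: `AbortedBatchError` and `retry_with_backoff` are hypothetical names, not the plugin's actual API. It shows why raising out of the retry loop (rather than returning) keeps the batch unacknowledged, so a restarted pipeline re-emits it from the persistent queue.

```ruby
# Hypothetical error type standing in for whatever the plugin would raise.
class AbortedBatchError < StandardError; end

# Simplified stand-in for a bulk-request retry loop. `shutting_down` is a
# callable returning true once a pipeline shutdown has been requested.
def retry_with_backoff(shutting_down, max_attempts: 3)
  attempts = 0
  loop do
    # Raising here propagates to the worker thread, so the in-flight batch
    # is never acknowledged or deleted from the persistent queue.
    raise AbortedBatchError, "shutdown requested; abandoning batch" if shutting_down.call

    attempts += 1
    # Pretend the bulk request eventually succeeds; returning normally is
    # what allows the batch to be acknowledged.
    return :acked if attempts >= max_attempts
  end
end

puts retry_with_backoff(-> { false }).inspect  # normal completion: batch acked
begin
  retry_with_backoff(-> { true })
rescue AbortedBatchError => e
  puts e.message  # worker crashes; restart will re-emit the unacked batch
end
```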
Thanks a ton for chiming in @yaauie !
Do you mean that we could replace the changed …
```diff
@@ -154,12 +154,17 @@ def successful_connection?
   !!maximum_seen_major_version && alive_urls_count > 0
 end

+def shutting_down?
+  @stopping.true? || (!execution_context.nil? && !execution_context.pipeline.nil? && execution_context.pipeline.shutdown_requested? && !execution_context.pipeline.worker_threads_draining?)
+end
```
To avoid the cascading `nil` checks we could use the `&.` safe navigation operator. That would simplify the condition to:

```ruby
@stopping.true? || (execution_context&.pipeline&.shutdown_requested? && !execution_context&.pipeline&.worker_threads_draining?)
```
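A minimal sketch of why `&.` collapses the explicit `nil` checks: `a&.b` evaluates to `nil` instead of raising `NoMethodError` when `a` is `nil`, and `nil` is falsey, so the whole chain short-circuits safely. The `Pipeline` and `ExecutionContext` structs below are hypothetical stand-ins, not the real Logstash classes.

```ruby
# Stand-in objects exposing the two predicates the condition consults.
Pipeline = Struct.new(:shutdown, :draining) do
  def shutdown_requested?; shutdown; end
  def worker_threads_draining?; draining; end
end
ExecutionContext = Struct.new(:pipeline)

# Free-function version of the predicate; `stopping` plays the role of
# `@stopping.true?`. The !! normalizes nil (from a short-circuited &.
# chain) to false.
def shutting_down?(stopping, execution_context)
  !!(stopping ||
     (execution_context&.pipeline&.shutdown_requested? &&
      !execution_context&.pipeline&.worker_threads_draining?))
end

puts shutting_down?(false, nil)  # false: &. chain yields nil, no NoMethodError
puts shutting_down?(false, ExecutionContext.new(Pipeline.new(true, false)))  # true
puts shutting_down?(false, ExecutionContext.new(Pipeline.new(true, true)))   # false
```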
Superseded by #1119
Closes: #1116
Detailed explanation of the problem: #1116 (comment)
Break the termination deadlock when whole bulk requests are failing in a retry loop during pipeline terminations