
Add Karafka integration #4147

Open · nvh0412 wants to merge 7 commits into base: master
Conversation

@nvh0412 (Contributor) commented Nov 22, 2024

What does this PR do?

Fixed #1660

In this PR, we introduce a tracing integration for the Karafka gem. It includes:

  1. Distributed tracing by utilizing message.metadata.headers
  2. Traces for worker.process and for each message executor inside the batch, so they can be linked to the originating trace when distributed tracing is on
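Conceptually, item 1 works by carrying trace context in the message headers. A minimal, self-contained plain-Ruby sketch (header names mirror Datadog's conventions but are assumptions here, not this PR's code): the producer injects identifiers into the headers, and the consumer extracts them to link its span to the originating trace.

```ruby
# Toy illustration of header-based trace propagation; not the dd-trace-rb API.
# The 'x-datadog-*' header names follow Datadog conventions (assumed).

def inject_trace_context(headers, trace_id:, parent_id:)
  headers.merge(
    'x-datadog-trace-id'  => trace_id.to_s,
    'x-datadog-parent-id' => parent_id.to_s
  )
end

def extract_trace_context(headers)
  trace_id  = headers['x-datadog-trace-id']
  parent_id = headers['x-datadog-parent-id']
  return nil unless trace_id && parent_id

  { trace_id: Integer(trace_id), parent_id: Integer(parent_id) }
end

# Producer side: attach context before publishing.
headers = inject_trace_context({}, trace_id: 123, parent_id: 456)

# Consumer side: recover context (or nil when headers carry no trace data).
context = extract_trace_context(headers)
```

When extraction returns nil, the consumer simply starts a fresh trace instead of continuing one.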

Motivation:

We use Karafka in our system and want proper distributed tracing with Datadog, but Karafka lacks an official integration. This integration will also enable distributed tracing whenever the message headers include trace context.

Distributed tracing will help us create a proper service map, connecting Kafka producers and consumers.

Change log entry

Yes. Add Karafka integration for distributed tracing.

(Added by @ivoanjo)

Additional Notes:

How to test the change?

(screenshot attached: 2024-11-22)

@github-actions bot added the integrations (Involves tracing integrations) and tracing labels Nov 22, 2024
@nvh0412 nvh0412 marked this pull request as ready for review November 22, 2024 06:07
@nvh0412 nvh0412 requested review from a team as code owners November 22, 2024 06:07

option :service_name

option :distributed_tracing, default: false, type: :bool
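Given this option definition, a user would opt in via the usual dd-trace-rb configuration pattern. A hedged sketch (the `:karafka` instrument key follows this PR's naming and is assumed here):

```ruby
# Sketch of enabling the option added above; assumes the integration
# registers under the :karafka key, as this PR does.
require 'datadog'

Datadog.configure do |c|
  c.tracing.instrument :karafka, distributed_tracing: true # default: false
end
```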
Member:
I'm curious, for your use case, wouldn't you prefer that distributed_tracing is enabled by default?

Contributor Author:

I'm actually in limbo with it. I disabled it by default for the same reason Sidekiq's distributed_tracing is disabled by default: traces can easily blow up and get out of control if it's turned on for everyone.

@drichards-87

Created a Jira card for Docs Team editorial review.

@drichards-87 added the editorial review (Waiting for a review from the docs team) label Nov 22, 2024
Comment on lines 16 to 60
::Karafka.monitor.subscribe 'worker.process' do |event|
  # Start a trace
  span = Tracing.trace(Ext::SPAN_WORKER_PROCESS, **span_options)

  job = event[:job]
  job_type = fetch_job_type(job.class)
  consumer = job.executor.topic.consumer
  topic = job.executor.topic.name

  action = case job_type
           when 'Periodic', 'PeriodicNonBlocking' then 'tick'
           when 'Shutdown' then 'shutdown'
           when 'Revoked', 'RevokedNonBlocking' then 'revoked'
           when 'Idle' then 'idle'
           when 'Eofed', 'EofedNonBlocking' then 'eofed'
           else 'consume'
           end

  span.resource = "#{consumer}##{action}"
  span.set_tag(Ext::TAG_TOPIC, topic) if topic

  if action == 'consume'
    span.set_tag(Ext::TAG_MESSAGE_COUNT, job.messages.count)
    span.set_tag(Ext::TAG_PARTITION, job.executor.partition)
    span.set_tag(Ext::TAG_OFFSET, job.messages.first.metadata.offset)
  end

  span
end

::Karafka.monitor.subscribe 'worker.completed' do |event|
  Tracing.active_span&.finish
end
Member:

Having separate locations for span creation (Tracing.trace) and span conclusion (Tracing.active_span&.finish) is always a possible source of hard-to-debug errors and span leakage (and thus memory leaks). We normally only do it when it's impossible to use Tracing.trace { do_work_here }.

Also, Tracing.trace { do_work_here } takes care of error handling, properly tagging the current span with error information.

In this case, we have the event worker.processed that looks like it's just what we need:
https://github.com/karafka/karafka/blob/ab4f9bcd3620f46adb8c0d158b5396b245619ed3/lib/karafka/processing/worker.rb#L58-L78

Except that it doesn't call the event listeners when an error is raised by the job. There are no error handlers in this method that call assigned_listeners: https://github.com/karafka/karafka-core/blob/a1425725d275796673424c1cd9be517d06518ec9/lib/karafka/core/monitoring/notifications.rb#L120

I opened a PR to Karafka to address this, but even if it's approved, it won't affect users of older versions of the library: karafka/karafka-core#145

That being said, I still lean towards using a single Tracing.trace { do_work_here }, simply because not having to worry about leaky spans is too advantageous.
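The safety argument can be seen in a toy stand-in for the block form (plain Ruby, not the dd-trace-rb API): the ensure clause guarantees the span is finished and the error recorded even when the job raises, which is exactly what separate create/finish call sites cannot guarantee.

```ruby
# ToySpan and toy_trace are illustrative stand-ins, not dd-trace-rb classes.
class ToySpan
  attr_accessor :error
  attr_reader :finished

  def finish
    @finished = true
  end
end

def toy_trace
  span = ToySpan.new
  begin
    yield span
  rescue StandardError => e
    # The block form can tag the span with error details before re-raising.
    span.error = e
    raise
  ensure
    # Runs on both success and failure: the span can never leak unfinished.
    span.finish
  end
end

leaked = nil
begin
  toy_trace do |span|
    leaked = span
    raise 'job blew up'
  end
rescue RuntimeError
  # Caller still observes the error; the span was finished regardless.
end
```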

Contributor Author:

That's precisely what I'd like to hear from the Datadog team. My initial concern with Tracing.active_span was that I couldn't be certain the span I'm operating on is the one I want to finish. Given this and your insights, would it be feasible to remove this tracing event? For now, I'd prefer to keep only the Ext::SPAN_MESSAGE_CONSUME event in this integration, as it meets my requirement: enabling distributed tracing and ensuring each message span is linked to the root trace.


> I opened a PR to Karafka to address this, but even if it's approved, it won't affect users of older versions of the library

This PR is not needed to achieve the expected wrapping because it is already done for example for OpenTelemetry:

ref1: https://karafka.io/docs/Monitoring-and-Logging/#opentelemetry
ref2: https://karafka.io/docs/Monitoring-and-Logging/#monitor-wrapping-and-replacement

@@ -193,6 +193,7 @@
gem 'concurrent-ruby'
gem 'dalli', '>= 3.0.0'
gem 'grpc'
gem 'karafka'


karafka no longer supports Ruby 2.7, based on: https://karafka.io/docs/Versions-Lifecycle-and-EOL/

@@ -115,6 +115,7 @@
gem 'concurrent-ruby'
gem 'dalli', '>= 3.0.0'
gem 'grpc', '>= 1.38.0', platform: :ruby # Minimum version with Ruby 3.0 support
gem 'karafka'


karafka no longer supports Ruby 3.0, based on: https://karafka.io/docs/Versions-Lifecycle-and-EOL/

include Karafka::Event

def self.subscribe!
::Karafka.monitor.subscribe 'worker.process' do |event|


I would not recommend using this layer for instrumentation of that type and would advise you to reconsider something "closer" to the actual execution of work.

@nvh0412 (Contributor Author) commented Nov 22, 2024:

Could you explain what "closer" means here? :) I based this on what you did in instrumentation/vendors/datadog/logger_listener.rb, but I'm reconsidering whether we need to subscribe to this event at all, because our goal is to have message traces for each message inside the messages enumerator, so we can link them to the distributed traces. I can remove this event if it's getting out of control here.


Sure. Karafka has several layers of reporting when a "job" is executed. The worker level is the highest, and I think of it as conceptually "distant" from the end-user code execution layer. In between, there are coordinators, executors, and more. The closest to user code is the one that has the consumer.* events; while I myself use the worker level once in a while, I generally do not recommend it and recommend using the one mentioned above. At some point, I will probably migrate the ones I wrote myself. There is nothing fundamentally wrong with using the worker one, but as mentioned, there's a lot in between.

> because our goal is to have message traces for each message inside messages enumerator

But this is not the only way users process data. You need to keep in mind users that do batch operations as well; thus, you want to trace around all the operational code too.

@codecov-commenter

codecov-commenter commented Nov 22, 2024

Codecov Report

Attention: Patch coverage is 80.58252% with 20 lines in your changes missing coverage. Please review.

Project coverage is 97.76%. Comparing base (ce4393e) to head (fbbbcb0).

Files with missing lines Patch % Lines
lib/datadog/tracing/contrib/karafka/patcher.rb 51.61% 15 Missing ⚠️
lib/datadog/tracing/contrib/karafka.rb 80.00% 4 Missing ⚠️
lib/datadog/tracing/contrib/karafka/integration.rb 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4147      +/-   ##
==========================================
- Coverage   97.78%   97.76%   -0.02%     
==========================================
  Files        1353     1358       +5     
  Lines       81817    81920     +103     
  Branches     4145     4150       +5     
==========================================
+ Hits        80001    80086      +85     
- Misses       1816     1834      +18     


@mensfeld

FYI feel free to ping me once remarks are done. I will be happy to help and maybe in the future retire my own instrumentation in favour of the DD one ;)

SPAN_MESSAGE_CONSUME = 'karafka.consume'
SPAN_WORKER_PROCESS = 'worker.process'

TAG_TOPIC = 'kafka.topic'


This is a side note, because maybe it is already done: please keep in mind that Karafka supports multi-consumer-group operations in one Karafka process, thus the CG (consumer group) always needs to be reported alongside metrics.
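A plain-Ruby sketch of the point: when several consumer groups run in one process, topic alone is ambiguous, so the tag set should carry the CG too. The tag name and the data shape below are illustrative assumptions, not this PR's code.

```ruby
# TAG_TOPIC mirrors the PR's Ext constant; TAG_CONSUMER_GROUP is an assumed
# addition for illustration.
TAG_TOPIC          = 'kafka.topic'
TAG_CONSUMER_GROUP = 'kafka.consumer_group'

def span_tags(topic:, consumer_group:)
  { TAG_TOPIC => topic, TAG_CONSUMER_GROUP => consumer_group }
end

# The same topic consumed by two groups in one process stays distinguishable:
billing   = span_tags(topic: 'orders', consumer_group: 'billing')
analytics = span_tags(topic: 'orders', consumer_group: 'analytics')
```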

end

def each(&block)
@messages_array.each do |message|


I mentioned this once but I will say it again here: this is not the only way users process data. Several high-scale users use batch operations. The Karafka messages API also lets you fetch deserialized payloads and more, making this implementation only partial (it does not cover batch-processing users).

Contributor Author:

Regarding what you wrote here, @mensfeld (https://karafka.io/docs/Consuming-Messages/#in-batches): this messages enumerator is supposed to be the only API interface that users use for batch operations, am I right? I can't see any alternative in that doc, unless you're saying these high-scale users delegate the whole message batch to another process by saving it to storage (e.g., Event.insert_all messages) without calling .each?


> this messages enumerator is supposed to be the only API interface that our users use for batch operations, am I right?

No.

> except you're saying these high-scale users delegate the whole messages to another process by saving them to a storage (e.g., like Event.insert_all messages) without calling .each ?

Yes. Alongside that, you can use Messages#payloads directly, omitting the headers when they're not needed. Messages#each is a popular use case but not the only one. While Karafka provides primitives for both, it by itself makes no assumptions about the nature of the processing; that's why #each is not instrumented on my side. The moment I yield control to the user, it is on the user to define the nature of the processing.
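The two styles can be contrasted with a plain-Ruby stand-in (toy classes, not Karafka's implementation): #each yields per-message boundaries that an instrumentation patch can hook, while #payloads hands the batch back wholesale.

```ruby
# Toy stand-ins for Karafka's message objects, for illustration only.
ToyMessage = Struct.new(:payload, :headers)

class ToyMessages
  include Enumerable

  def initialize(msgs)
    @msgs = msgs
  end

  # Style 1: per-message iteration -- the hook point this PR instruments.
  def each(&block)
    @msgs.each(&block)
  end

  # Style 2: batch access -- returns payloads directly; in this toy it does
  # not route through ToyMessages#each, so a patch on #each alone would not
  # observe per-message boundaries here.
  def payloads
    @msgs.map(&:payload)
  end
end

batch = ToyMessages.new([ToyMessage.new('a', {}), ToyMessage.new('b', {})])

seen = []
batch.each { |m| seen << m.payload } # per-message processing
bulk = batch.payloads                # batch processing
```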

Contributor Author:

Thanks @mensfeld, I see your point, but honestly this seems like the most suitable place for me to patch to enable distributed tracing. If users (who also use Datadog, like me) want to use #payloads for their batching implementation, then losing distributed tracing in that case would be reasonable.

I understand that the #each method in the Karafka gem isn't instrumented, since neither you nor the gem itself needs it to be. However, this distributed tracing is specific to the datadog gem and aligns perfectly with my motivation (see the PR description). I'd love to hear thoughts on this from the Datadog team as well. @marcotc

I'm willing to address this if there is another place we should patch for distributed tracing instead.

> willing to address if there is another place that we should patch to have our distributed tracing instead.

The only thing here is the philosophy of tracing. If we could support both and let users go with whichever approach they take for processing their data, that would be ideal.

FYI I'm not sure if I mentioned this but feel free to ping me if you need any more help. Also happy to help with this work via Slack/etc if needed.

@drichards-87 left a comment:

Left a couple of very small suggestions from Docs and approved the PR.

docs/GettingStarted.md — 2 suggestions (outdated, resolved)
Use Tracing.trace wrapper to add consumer trace
Use Instrumentation::Monitor to instrument and have a proper trace wrapper
@nvh0412
Copy link
Contributor Author

nvh0412 commented Nov 27, 2024

Hi team and @marcotc

Let's settle this. My gut feeling about this PR is that we do have some limitations in how tracing can be integrated into the Karafka gem. I've switched to using Tracing.trace with a block rather than starting and finishing traces manually. However, the core of this PR is the distributed tracing within the message block.

This implementation fits perfectly with my system and has been working well with an internal patch. That said, let me know your thoughts on this integration. I’m fine if we can't settle on it, as I can always maintain the internal patch if needed.

Labels: editorial review (Waiting for a review from the docs team), integrations (Involves tracing integrations), tracing

Successfully merging this pull request may close these issues:

Distributed tracing support through Kafka integration

5 participants