Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inactive Apm Tracer starts an invalid span with zeroes in the trace id #1500

Open
renta opened this issue Aug 28, 2023 · 9 comments
Open

Inactive Apm Tracer starts an invalid span with zeroes in the trace id #1500

renta opened this issue Aug 28, 2023 · 9 comments
Labels

Comments

@renta
Copy link

renta commented Aug 28, 2023

Describe the bug
Inactive APM Tracer produces invalid spans for OpenTelemetry Tracing bridge. The bug lives inside
go.elastic.co/apm/module/apmotel/[email protected]/tracer.go:105 in this code:
s.tx = t.provider.apmTracer.StartTransactionOptions(spanName, "", tranOpts) It returns an APM transaction with invalid trace and span ids (all the zero bytes).

To Reproduce
Steps to reproduce the behavior:

  1. Use Elastic APM as a bridge to OpenTelemetry API as described here https://www.elastic.co/guide/en/apm/agent/go/current/opentelemetry.html I've taken an existent working ELK Apm instance and have created a tracing provider like it described in the docs.
  2. Try to create a new span with the tracer with the code (I've done so in a webserver middleware):
ctx, span := ot.tracer.Start(
	handlerContext.GetContext(),
	serverRequestName(handlerContext.GetHandler()),
)
defer span.End()
  1. This span will have invalid trace and span ids. See the screenshot from my debug session as proof.
    Screenshot from 2023-08-28 15-54-14

Expected behavior
I suppose that in any case span should have a correct state (or there should be something like Noop span with message details that tracer is not in active state right now).

@dmathieu
Copy link
Member

Do you have the content of tranOpts? Are Trace and Span set in there?

@dmathieu
Copy link
Member

Also (because I expect tranOpts doesn't have Trace and Span set), it seems an inactive tracer, even without apmotel would still be creating those traces/spans with 0 IDs, which would make this behavior the expected one.

@renta
Copy link
Author

renta commented Aug 29, 2023

Do you have the content of tranOpts? Are Trace and Span set in there?

No, it's just a very primitive code in grpc and http middlewares that just creates a new span on each request. So there is no trace from the client (I'm testing it with Postman and BloomRPC and I do not know any way to create span from them or simulate this creation (some random 16-chars string that mimic W3C Trace header?)).

Also (because I expect tranOpts doesn't have Trace and Span set), it seems an inactive tracer, even without apmotel would still be creating those traces/spans with 0 IDs, which would make this behavior the expected one.

If I understand you right, it's not a behavior of APM Otel bridge but Opentelemetry library inself? If so, it's a little strange from my point of view. On one side I understand that tracing is a deep infrastructure layer of a system and it should try to create as little problems as it could. And Opentracing interface does not allow to return error to signal that something is broken or unset:

type Tracer interface {
	Start(ctx context.Context, spanName string, opts ...SpanStartOption) (context.Context, Span)
}

But as a developer in this case I have no explicit way to understand that span was created with problem (except try to examine that SpanContext is valid) and need to debug to try to figure out the reason. I think that two things should be in the docs (I suppose that docs of Opentelemetry Go library):

  1. Do not try to create span with inactive tracer.
  2. Always examine created span (with SpanContext) that it's valid.
    In other case the app will work but tracing system won't work without errors and you need to go deep in the guts to figure out what is wrong.

Also I've figured out yesterday that GRPC integration https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/ will create invalid spans if global tracer is unavailable (same situation with zeroes filled trace and span ids as in my first message). So I had to put a TracerProvider to the interceptors with otelgrpc.WithTracerProvider() to make them work.

@renta
Copy link
Author

renta commented Aug 29, 2023

And also more "5 cents" against invalid spans. For example we can use trace id for logs enrichment to be able to investigate with grep'ing or ELK. With inactive tracer (or some misconfiguration in case of OtelGRPC integration) span will be invalid. All this will be in a total silent (no errors). I suppose that span MUST have valid trace and span ids in any case. We could have no errors, no span gathering but zero trace id could be destructive.

@dmathieu
Copy link
Member

dmathieu commented Aug 29, 2023

If I understand you right, it's not a behavior of APM Otel bridge but Opentelemetry library inself?

No, it's a behavior of the APM agent. The tracing bridge implements its own OpenTelemetry SDK, it doesn't use the otel one (which is why it should only be used to migrate, not long-term).

Also I've figured out yesterday that GRPC integration https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/ will create invalid spans if global tracer is unavailable (same situation with zeroes filled trace and span ids as in my first message). So I had to put a TracerProvider to the interceptors with otelgrpc.WithTracerProvider() to make them work.

That makes sense, every otel instrumentation will behave that way, and has no relation to this repository. This is how OpenTelemetry works.
By default, it relies on the global TracerProvider, but you can specify your own.

What invalid behavior are you seeing with those zeroed spans? Yes, they have zero values, but since that's how the APM agent works, and apmotel is its own implementation of the OpenTelemetry SDK (and therefore cannot be used with an OTLP exporter for example), we should be able to rely on the APM agent behavior.

@renta
Copy link
Author

renta commented Aug 29, 2023

What invalid behavior are you seeing with those zeroed spans?

He rely on trace id in these scenarios:

  1. Logs enrichment and incident analysis in the ELK.
  2. Trace id used as a part of cache key.
  3. We return a trace id as a part of error message for user to be able to investigate easier in the future.

All these subsystem will be broken with zero-values trace id.

@dmathieu
Copy link
Member

I don't understand. Since the APM Tracer is invalid, there won't be any data to see in Kibana.

Also, the zero-value trace ID is already the behavior of the APM agent, even without apmotel.
How did your system work before setting up apmotel?

@renta
Copy link
Author

renta commented Aug 29, 2023

Since the APM Tracer is invalid, there won't be any data to see in Kibana.

To be more clear: system works with Elastic APM as a tracing system but now I migrate all the parts from APM itself to Opentelemetry and use Apm Otel Bridge for translating otel data to current ELK Apm (mainly to hide an infrastructure part of Elastic APM behind the Opentelemetry as an interface and common library for tracing). I've figured out this behavior with zero trace ids on my local machine during the tests of the migration. And I haven't pay an attention on such a behavior for clean APM case. For me it was a moment of surprise that created span could be invalid. I've worked with Jaeger and Opentracing as the tracing tools and thought that span MUST have a valid trace id in any case because somehow it was generated.

@dmathieu
Copy link
Member

OpenTelemetry doesn't generate valid trace IDs if no tracer were configured.
You can see its behavior in the specs: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#behavior-of-the-api-in-the-absence-of-an-installed-sdk

And in the API.
No tracer provider being defined means you'll get a noopSpan, which gives an empty SpanContext: https://github.com/open-telemetry/opentelemetry-go/blob/main/trace/noop.go#L62
That'll give us an empty [16]byte, which will essentially be 0.

So again, this may not be how Opentracing and Jaeger work, but the behavior is consistent with how our agent works, as well as with how OpenTelemetry works. I don't think we should change it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants