The 'upstream connect error or disconnect/reset before headers. reset reason: connection failure' error when using javaagent in OpenShift infrastructure #6104
Replies: 2 comments 1 reply
-
(transferred to opentelemetry-java repo where the OTLP exporter lives)
-
In our case, a simple restart of the application under the opentelemetry javaagent fixes the situation. Meaning, when the jaeger collector bounces, the javaagent isn't able to reconnect and continuously complains:

```
[otel.javaagent 2024-01-02 19:17:12:687 +0000] [OkHttp http://jaeger-collector-headless.qa-app-monitoring.svc:4317/...] ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: upstream connect error or disconnect/reset before headers. reset reason: connection failure
```

Then, without any other action in OpenShift, just restarting the app brings up a new instance under the javaagent, and the connection is established. So basically, when the jaeger collector bounces, some event triggers the "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error in the javaagent that keeps running alongside the application. Understanding its possible cause, compared to the regular "Failed to connect to < dns >/< ipv6 >:4317" error, would give a hint about the issue we are dealing with.
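One unconfirmed hypothesis worth checking here: with a headless service, DNS resolves directly to the collector pod's IP, and when that pod is rescheduled with a new IP, the JVM may keep serving the old address from its DNS cache. Below is a minimal sketch, assuming that stale-DNS hypothesis (nothing in this thread confirms it), of shortening the JVM's address caches so a later export attempt can re-resolve the service; the 30-second value is illustrative:

```java
// Sketch under the stale-DNS assumption: shorten the JVM's DNS caches so a
// reconnect attempt re-resolves jaeger-collector-headless and can pick up
// the collector's new pod IP. Must run before the first lookup; the flag
// -Dsun.net.inetaddr.ttl=30 on the java command line is an alternative.
import java.security.Security;

public class ShortDnsCache {
    public static void main(String[] args) {
        Security.setProperty("networkaddress.cache.ttl", "30");         // cache successful lookups for 30s
        Security.setProperty("networkaddress.cache.negative.ttl", "0"); // do not cache failed lookups
        // ... start the rest of the application after this point
    }
}
```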
-
We are running a jaeger collector (1.48.1) in OpenShift, and we send telemetry from our Java Spring applications using the opentelemetry javaagent, version 1.28.0. We are using service endpoints in OpenShift, as in http://jaeger-collector-headless.qa-app-monitoring.svc:4317, and the issue is that when the collector pod restarts, the javaagent can't reconnect. What happens instead is that the javaagent reports the error below continuously:

```
Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: upstream connect error or disconnect/reset before headers. reset reason: connection failure
```

What could be the cause of the issue, in terms of how this error maps to networking/infrastructure problems?
If I simulate the situation locally, i.e. I run the local jaeger-all-in-one container, then drop it and restart it, the error is different, and the javaagent restores the connection successfully. This is the error I get locally:

```
Failed to connect to localhost/0:0:0:0:0:0:0:1:4317
```

It's straightforward: it can't connect while the container is intentionally down, then it restores the connection as soon as I restart it.
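In case it helps anyone reproduce the local test, here is a small hypothetical probe (not part of the agent or the repro itself) that attempts a TCP connect to the OTLP gRPC port once a second; while the jaeger-all-in-one container is down it prints the refusal, and after a restart it shows the recovery, mirroring what the javaagent sees:

```java
// Hypothetical watcher for the local repro: poll the OTLP gRPC port so the
// container's down/up transitions are visible from the application's side.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class OtlpPortProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress("localhost", 4317), 1000);
                System.out.println("connect ok");
            } catch (IOException e) {
                // Locally this surfaces as a plain socket error, analogous to the
                // agent's "Failed to connect to localhost/0:0:0:0:0:0:0:1:4317".
                System.out.println("connect failed: " + e.getMessage());
            }
            Thread.sleep(1000);
        }
    }
}
```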
What is different about the "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error compared to the "Failed to connect to localhost/0:0:0:0:0:0:0:1:4317" error?
PS: We are passing the following to the javaagent: OTEL_TRACES_EXPORTER=otlp OTEL_METRICS_EXPORTER=none OTEL_EXPORTER_OTLP_ENDPOINT=http://our-openshift.svc:4317 OTEL_EXPORTER_OTLP_PROTOCOL=grpc
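For reference, a minimal sketch of roughly what those variables end up configuring; this is hand-written SDK setup, not the javaagent's internal wiring, and the endpoint is the placeholder value from above:

```java
// Sketch of an OTLP/gRPC span export pipeline equivalent to the
// OTEL_EXPORTER_OTLP_* variables above (placeholder endpoint, not a real host).
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtlpConfigSketch {
    public static void main(String[] args) {
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://our-openshift.svc:4317") // OTEL_EXPORTER_OTLP_ENDPOINT
                .build();                                     // OTLP/gRPC is this exporter's protocol

        // OTEL_TRACES_EXPORTER=otlp: batch and export spans over OTLP;
        // OTEL_METRICS_EXPORTER=none: no meter provider is configured here.
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetrySdk.builder().setTracerProvider(tracerProvider).build();
    }
}
```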