The 'upstream connect error or disconnect/reset before headers. reset reason: connection failure' error when using javaagent in OpenShift infrastructure #6104
Replies: 2 comments 1 reply
-
(transferred to opentelemetry-java repo where the OTLP exporter lives)
-
In our case, a simple restart of the application under the opentelemetry javaagent fixes the situation. Meaning, when the jaeger collector bounces, the javaagent isn't able to reconnect and continuously complains:

```
[otel.javaagent 2024-01-02 19:17:12:687 +0000] [OkHttp http://jaeger-collector-headless.qa-app-monitoring.svc:4317/...] ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: upstream connect error or disconnect/reset before headers. reset reason: connection failure
```

Then, without any other action in OpenShift, just restarting the app brings up a new instance under the javaagent, and the connection is established. So basically, when the jaeger collector bounces, some event triggers the "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error in the javaagent that keeps running alongside the application. Understanding its possible cause, compared to the regular "Failed to connect to < dns >/< ipv6 >:4317" error, would give a hint about the issue we are dealing with.
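One unconfirmed hypothesis worth checking here: with a headless service, DNS resolves directly to the collector pod's IP, and when that pod is rescheduled with a new IP, the JVM may keep serving the old address from its DNS cache. Below is a minimal sketch, assuming that stale-DNS hypothesis (nothing in this thread confirms it), of shortening the JVM's address caches so a later export attempt can re-resolve the service; the 30-second value is illustrative:

```java
// Sketch under the stale-DNS assumption: shorten the JVM's DNS caches so a
// reconnect attempt re-resolves jaeger-collector-headless and can pick up
// the collector's new pod IP. Must run before the first lookup; the flag
// -Dsun.net.inetaddr.ttl=30 on the java command line is an alternative.
import java.security.Security;

public class ShortDnsCache {
    public static void main(String[] args) {
        Security.setProperty("networkaddress.cache.ttl", "30");         // cache successful lookups for 30s
        Security.setProperty("networkaddress.cache.negative.ttl", "0"); // do not cache failed lookups
        // ... start the rest of the application after this point
    }
}
```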
-
We are running a jaeger collector (1.48.1) in OpenShift, and we send telemetry from our Java Spring applications using the opentelemetry javaagent, version 1.28.0. We are using service endpoints in OpenShift, as in http://jaeger-collector-headless.qa-app-monitoring.svc:4317, and the issue is that when the collector pod restarts, the javaagent can't reconnect. What happens instead is that the javaagent reports the error below continuously:

```
Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message: upstream connect error or disconnect/reset before headers. reset reason: connection failure
```

What could be the cause of the issue, in terms of how this error maps to networking/infrastructure problems?
If I simulate the situation locally, i.e. I run the local jaeger-all-in-one container, then drop it and restart it, the error is different, and the javaagent restores the connection successfully. This is the error I get locally:

```
Failed to connect to localhost/0:0:0:0:0:0:0:1:4317
```

It's straightforward: it can't connect while the container is intentionally down, then it restores the connection as soon as I restart it.
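In case it helps anyone reproduce the local test, here is a small hypothetical probe (not part of the agent or the repro itself) that attempts a TCP connect to the OTLP gRPC port once a second; while the jaeger-all-in-one container is down it prints the refusal, and after a restart it shows the recovery, mirroring what the javaagent sees:

```java
// Hypothetical watcher for the local repro: poll the OTLP gRPC port so the
// container's down/up transitions are visible from the application's side.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class OtlpPortProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress("localhost", 4317), 1000);
                System.out.println("connect ok");
            } catch (IOException e) {
                // Locally this surfaces as a plain socket error, analogous to the
                // agent's "Failed to connect to localhost/0:0:0:0:0:0:0:1:4317".
                System.out.println("connect failed: " + e.getMessage());
            }
            Thread.sleep(1000);
        }
    }
}
```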
What is different about the "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error compared to the "Failed to connect to localhost/0:0:0:0:0:0:0:1:4317" error?
PS: We are passing the following to the javaagent: OTEL_TRACES_EXPORTER=otlp OTEL_METRICS_EXPORTER=none OTEL_EXPORTER_OTLP_ENDPOINT=http://our-openshift.svc:4317 OTEL_EXPORTER_OTLP_PROTOCOL=grpc
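For reference, a minimal sketch of roughly what those variables end up configuring; this is hand-written SDK setup, not the javaagent's internal wiring, and the endpoint is the placeholder value from above:

```java
// Sketch of an OTLP/gRPC span export pipeline equivalent to the
// OTEL_EXPORTER_OTLP_* variables above (placeholder endpoint, not a real host).
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtlpConfigSketch {
    public static void main(String[] args) {
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://our-openshift.svc:4317") // OTEL_EXPORTER_OTLP_ENDPOINT
                .build();                                     // OTLP/gRPC is this exporter's protocol

        // OTEL_TRACES_EXPORTER=otlp: batch and export spans over OTLP;
        // OTEL_METRICS_EXPORTER=none: no meter provider is configured here.
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetrySdk.builder().setTracerProvider(tracerProvider).build();
    }
}
```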