sql_exporter is losing metrics if compute is very busy #9960

Open
Bodobolero opened this issue Dec 2, 2024 · 4 comments
Labels
a/observability Area: related to observability a/performance Area: relates to performance of the system c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@Bodobolero
Contributor

Bodobolero commented Dec 2, 2024

Steps to reproduce

run ingest benchmark
doc

Expected result

We see metrics collected by sql_exporter for the complete run

Actual result

We are losing metrics, most likely because sql_exporter is exceeding its scrape_timeout.

We observe this especially when there is a large amount of backpressure from the pageserver (PS) to compute.

Environment

staging

Logs, links

https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from[…]ge_tenant_endpoint_id=ep-misty-river-w2vdg495&viewPanel=19

first reported here

another observation of this - probably related

https://neondb.slack.com/archives/C04DGM6SMTM/p1731526874214679

@Bodobolero Bodobolero added a/observability Area: related to observability a/performance Area: relates to performance of the system c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug labels Dec 2, 2024
@ololobus
Member

ololobus commented Dec 2, 2024

Previous thread re this problem https://neondb.slack.com/archives/C04DGM6SMTM/p1731526874214679

Ultimately, on each scrape sql_exporter runs all the SQL queries specified in the metrics config. So if compute is loaded, the queries become slower, scrapes time out, and we see these gaps.

So what are the options we have?

  1. Try to identify the most heavy queries and optimize them
  2. Untie collection and reporting flows (there are tools that do that, e.g. Telegraf, iiuc). In this case, collection will be super fast, scrapes can have much longer timeouts, but instead of gaps we may see stale metrics (up to the configured interval). See my comment about Telegraf (not 100% sure I understand it correctly), @mickael-carl was against this approach
  3. Switch as many metrics as possible from collection via SQL to maintaining counters/histograms online, and report them from a prometheus endpoint in our extension, for example. This may also be combined with 1. I'm not sure that this path is totally realistic, though. Some Postgres statistics are maintained in the catalog, or inside Postgres shared memory structures, so we would need a lot of core patching, and it's still not clear that it will work well. The database size is one example. I think we need to discuss that, maybe there are some low-hanging fruits
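
To make option 2 concrete, here is a minimal sketch of decoupling collection from reporting. It is not how sql_exporter or Telegraf work internally; `CachedCollector`, the `collect` callback, and the `collector_snapshot_age_seconds` metric name are all hypothetical. A background thread runs the slow SQL on its own schedule, and the scrape path only reads the last snapshot, so a busy compute produces stale values rather than gaps.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class CachedCollector:
    """Runs slow collection in the background; scrapes only read the cache."""

    def __init__(self, collect: Callable[[], Dict[str, float]], interval_s: float):
        self._collect = collect  # hypothetical callback that runs the SQL queries
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._snapshot: Optional[Tuple[float, Dict[str, float]]] = None
        self._stop = threading.Event()

    def refresh_once(self) -> None:
        values = self._collect()  # may be slow when compute is busy
        with self._lock:
            self._snapshot = (time.monotonic(), values)

    def run_forever(self) -> None:
        while not self._stop.is_set():
            try:
                self.refresh_once()
            except Exception:
                pass  # keep serving the previous snapshot on failure
            self._stop.wait(self._interval_s)

    def start(self) -> threading.Thread:
        thread = threading.Thread(target=self.run_forever, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()

    def scrape(self) -> Dict[str, float]:
        """Fast path: never touches the database."""
        with self._lock:
            snapshot = self._snapshot
        if snapshot is None:
            return {}  # nothing collected yet
        collected_at, values = snapshot
        out = dict(values)
        # Expose staleness so dashboards can tell old data from fresh data.
        out["collector_snapshot_age_seconds"] = time.monotonic() - collected_at
        return out
```

Exporting the snapshot age alongside the values is one way to keep the trade-off visible: a gap becomes a growing age rather than silently old data.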

@ololobus
Member

ololobus commented Dec 3, 2024

Moved to backlog because we don't have any good ideas for how to fix it, except exploring another tool like Telegraf

@tristan957 suggests that we can bump the sql_exporter version

@ololobus
Member

ololobus commented Dec 3, 2024

Thread about timeout issues, looks like we currently scrape every 10s, so we cannot bump the timeout significantly
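
A rough sketch of why the 10s interval caps the timeout, assuming a Prometheus-compatible scraper and the rule described in the sql_exporter README: the effective timeout is the smaller of the configured `scrape_timeout` and the scraper's own timeout (sent in the `X-Prometheus-Scrape-Timeout-Seconds` header) minus `scrape_timeout_offset`. The function name and the numbers below are illustrative, not our actual config.

```python
from typing import Optional


def effective_scrape_timeout(
    configured_timeout_s: float,
    scraper_timeout_s: Optional[float],
    offset_s: float,
) -> float:
    """Time budget for all SQL in one scrape, under the assumed rule above."""
    if scraper_timeout_s is None:
        # No timeout header from the scraper: only the configured value applies.
        return configured_timeout_s
    return min(configured_timeout_s, scraper_timeout_s - offset_s)


# Prometheus rejects a scrape timeout longer than the scrape interval, so with
# a 10s interval the scraper-side timeout is at most 10s. Raising the
# configured timeout to 30s therefore changes nothing (0.5s offset assumed).
budget = effective_scrape_timeout(30.0, 10.0, 0.5)
print(budget)  # 9.5
```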

@ololobus
Member

ololobus commented Dec 3, 2024

Another piece of info from Tristan, sql_exporter seems to have its own metrics

Only metrics defined by collectors are exported on the /metrics endpoint. SQL Exporter process metrics are exported at /sql_exporter_metrics.
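
If we want to look at those process metrics while a benchmark runs, a small sketch like the following could poll the endpoint and parse the Prometheus text format. The host and port in the usage comment are assumptions, and the parser is deliberately simplified (one sample per line, optional trailing timestamp ignored). Which metric names the endpoint actually exposes needs to be checked against a running exporter.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {series: value}."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "}" in line:
            # Series with labels: everything up to the last '}' is the key.
            idx = line.rindex("}") + 1
            series, rest = line[:idx], line[idx:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line without a value
        samples[series] = float(fields[0])  # fields[1], if any, is a timestamp
    return samples


def fetch_exporter_metrics(url: str, timeout_s: float = 5.0) -> Dict[str, float]:
    """Fetch and parse one snapshot from the exporter's own metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Usage (address is an assumption, adjust to the actual compute):
# metrics = fetch_exporter_metrics("http://localhost:9399/sql_exporter_metrics")
```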
