memory leak on system with 128 x86_64 cores #11760

Open
jcpunk opened this issue Nov 26, 2024 · 0 comments
Labels
bug Something isn't working


Describe the bug
I've got an x86_64 system with 128 cores. The otel collector adds about 5 MiB to its working memory every time it scrapes a metrics endpoint. Eventually it runs up against the memory_limiter, but garbage collection never seems to make real headway and eventually fails to reclaim enough memory.

My identically configured systems with 8 or 16 x86_64 cores do not appear to leak in this manner.
My aarch64 system with a similar config and 64 cores does also appear to leak in this manner.

Steps to reproduce
Run the otel-collector on a system with a lot of processing cores

What did you expect to see?
Memory usage eventually stabilizes

What did you see instead?
Memory usage grows to fill the space allotted - tested up to 4 GiB (takes 6 days)
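
For scale, the figure above implies a fairly slow average rate of retained growth. A rough back-of-the-envelope sketch, assuming the growth is approximately linear over the 6 days and starts near zero (neither is confirmed in this report); the 45 s interval is the `scrape_interval` from the config in this issue:

```python
# Rough average growth rate implied by "4 GiB in 6 days".
# Assumption: growth is roughly linear and starts near zero.
MIB_PER_GIB = 1024

total_mib = 4 * MIB_PER_GIB      # 4 GiB expressed in MiB
elapsed_s = 6 * 24 * 3600        # 6 days in seconds
scrape_interval_s = 45           # scrape_interval from the config

mib_per_hour = total_mib / (elapsed_s / 3600)
mib_per_interval = total_mib / (elapsed_s / scrape_interval_s)

print(f"{mib_per_hour:.1f} MiB/hour")
print(f"{mib_per_interval:.2f} MiB per scrape interval")
```

Under those assumptions the retained growth averages well under 1 MiB per scrape interval, which would mean most of the roughly 5 MiB allocated per scrape is being reclaimed and only a small remainder accumulates.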

What version did you use?
otelcol-contrib version 0.114.0 (memory code is probably in the base collector)

What config did you use?

---
processors:
  batch: {}
  transform/hostname:
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["nodename"], "host.fnal.gov")
      - set(resource.attributes["nodename"], "host.fnal.gov")
  memory_limiter:
    check_interval: 30s
    limit_mib: 384
exporters:
  prometheus:
    endpoint: "[::]:9299"
    enable_open_metrics: true
    metric_expiration: 2m
service:
  telemetry:
    metrics:
      level: none
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors:
      - memory_limiter
      - transform/hostname
      - batch
      exporters:
      - prometheus
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: node-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9100
          labels:
            instance: host.fnal.gov:9100
      - job_name: systemd-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9558
          labels:
            instance: host.fnal.gov:9558
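
For reference, the soft limit that the memory_limiter log messages refer to can be derived from this config. A minimal sketch, assuming the documented default of `spike_limit_mib` = 20% of `limit_mib` applies because `spike_limit_mib` is not set here; the function name is illustrative and not part of the collector:

```python
from typing import Optional


def soft_limit_mib(limit_mib: float, spike_limit_mib: Optional[float] = None) -> float:
    """Soft limit = limit_mib - spike_limit_mib.

    Assumption: when spike_limit_mib is unset it defaults to 20% of limit_mib.
    """
    if spike_limit_mib is None:
        spike_limit_mib = limit_mib / 5
    return limit_mib - spike_limit_mib


print(f"{soft_limit_mib(384):.1f} MiB")
```

With `limit_mib: 384` that comes to roughly 307 MiB, which lines up with the log in this issue: data is still accepted while post-GC usage is 306 MiB and is refused once post-GC usage reaches 308 MiB.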

Environment
OS: AlmaLinux 9
Platform: podman
Podman Quadlet file: /etc/containers/systemd/otel-collector.container

# THIS FILE IS MANAGED BY PUPPET
[Service]
TimeoutStartSec=900
TimeoutStopSec=30
TasksMax=4096
CPUWeight=30
MemoryMax=512M
IOSchedulingClass=best-effort
IOSchedulingPriority=7
IOWeight=30
Restart=always

[Container]
AutoUpdate=registry
DropCapability=ALL
User=5219
Group=8247
HostName=%H
LogDriver=journald
NoNewPrivileges=true
Pull=missing
ReadOnly=true
PodmanArgs=--stop-signal=SIGKILL
Volume=/etc/otel-collector:/etc/otel-collector:ro,rslave,z
Environment=GOMAXPROCS=4
Environment=GOMEMLIMIT=384MiB
Exec=--config /etc/otel-collector/otel-config.yaml

Image=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest
Network=host
PublishPort=[::]:9299:9299

[Install]
WantedBy=default.target
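
To make the limits in this unit easier to compare: systemd parses the `M` suffix on `MemoryMax=` as base 1024, and `GOMEMLIMIT` takes `KiB`/`MiB`/`GiB` suffixes. A small sketch that converts both to bytes and reports the headroom; `to_bytes` is a hypothetical, deliberately permissive helper and not a full parser for either format:

```python
import re

# Binary multipliers, keyed by the leading unit letter.
_UNITS = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def to_bytes(value: str) -> int:
    """Parse sizes such as '512M' (systemd) or '384MiB' (GOMEMLIMIT) as base 1024."""
    m = re.fullmatch(r"(\d+)([KMGT]?)(?:iB|B)?", value.strip())
    if m is None:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]


memory_max = to_bytes("512M")      # MemoryMax= from [Service]
gomemlimit = to_bytes("384MiB")    # GOMEMLIMIT from [Container]
headroom_mib = (memory_max - gomemlimit) // 2**20

print(f"{headroom_mib} MiB")
```

That leaves 128 MiB between the Go runtime's soft memory limit and the cgroup limit. Since `GOMEMLIMIT` is a soft target, the process can still exceed it and reach `MemoryMax`.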

Additional context
endpoints:

[root@host ~]#  curl -s localhost:9558/metrics |wc -l
5380
[root@host ~]#  curl -s localhost:9100/metrics |wc -l
6317
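
Note that `wc -l` also counts the `# HELP` and `# TYPE` comment lines, so the numbers above overstate the number of samples per scrape. A minimal sketch that counts only sample lines in Prometheus text-format output; the function name and the input text are illustrative and not taken from the hosts in this report:

```python
def count_samples(exposition_text: str) -> int:
    """Count non-empty, non-comment lines (one per sample) in text-format metrics."""
    return sum(
        1
        for line in exposition_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


# Illustrative input only.
sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="1",mode="idle"} 1200.1
"""

print(count_samples(sample))  # 3 sample lines out of 7 total
```

Running the same count against the two endpoints on hosts with different core counts would show how much the per-scrape sample volume actually differs between them.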

logs

Nov 26 09:21:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:21:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 327}
Nov 26 09:21:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:21:44.643Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 289}
Nov 26 09:22:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:22:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 327}
Nov 26 09:22:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:22:44.635Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 289}
Nov 26 09:23:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:23:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 328}
Nov 26 09:23:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:23:14.638Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 290}
Nov 26 09:24:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:24:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 327}
Nov 26 09:24:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:24:14.618Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 289}
Nov 26 09:24:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:24:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 327}
Nov 26 09:24:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:24:44.625Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 289}
Nov 26 09:25:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:25:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 328}
Nov 26 09:25:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:25:44.624Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 290}
Nov 26 09:26:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:26:14.557Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 329}
Nov 26 09:26:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:26:14.628Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 290}
Nov 26 09:27:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:27:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 328}
Nov 26 09:27:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:27:14.634Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 287}
Nov 26 09:27:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:27:44.557Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 326}
Nov 26 09:27:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:27:44.634Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 287}
Nov 26 09:28:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:28:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 326}
Nov 26 09:28:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:28:44.636Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 290}
Nov 26 09:29:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:29:14.557Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 328}
Nov 26 09:29:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:29:14.637Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 289}
Nov 26 09:30:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:30:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 328}
Nov 26 09:30:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:30:14.639Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 290}
Nov 26 09:30:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:30:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 329}
Nov 26 09:30:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:30:44.630Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 290}
Nov 26 09:31:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:31:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 329}
Nov 26 09:31:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:31:44.640Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 289}
Nov 26 09:32:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:32:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 329}
Nov 26 09:32:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:32:14.640Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 292}
Nov 26 09:33:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:33:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 332}
Nov 26 09:33:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:33:14.641Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 294}
Nov 26 09:33:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:33:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 334}
Nov 26 09:33:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:33:44.642Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 296}
Nov 26 09:34:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:34:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 336}
Nov 26 09:34:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:34:44.642Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 299}
Nov 26 09:35:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:35:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 338}
Nov 26 09:35:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:35:14.637Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 301}
Nov 26 09:36:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:36:14.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 340}
Nov 26 09:36:14 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:36:14.636Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 306}
Nov 26 09:36:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:36:44.556Z        info        [email protected]/memorylimiter.go:203        Memory usage is above soft limit. Forcing a GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 345}
Nov 26 09:36:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:36:44.645Z        info        [email protected]/memorylimiter.go:173        Memory usage after GC.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 308}
Nov 26 09:36:44 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:36:44.645Z        warn        [email protected]/memorylimiter.go:210        Memory usage is above soft limit. Refusing data.        {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 308}
Nov 26 09:37:18 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:37:18.677Z        error        scrape/scrape.go:1298        Scrape commit failed        {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "node-exporter", "target": "http://localhost:9100/metrics", "error": "data refused due to high>
Nov 26 09:37:18 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
Nov 26 09:37:18 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1298
Nov 26 09:37:18 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
Nov 26 09:37:18 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1376
Nov 26 09:37:18 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
Nov 26 09:37:18 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1253
Nov 26 09:37:23 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:37:23.687Z        error        scrape/scrape.go:1298        Scrape commit failed        {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "systemd-exporter", "target": "http://localhost:9558/metrics", "error": "data refused due to h>
Nov 26 09:37:23 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
Nov 26 09:37:23 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1298
Nov 26 09:37:23 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
Nov 26 09:37:23 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1376
Nov 26 09:37:23 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
Nov 26 09:37:23 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1253
Nov 26 09:38:03 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:38:03.673Z        error        scrape/scrape.go:1298        Scrape commit failed        {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "node-exporter", "target": "http://localhost:9100/metrics", "error": "data refused due to high>
Nov 26 09:38:03 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
Nov 26 09:38:03 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1298
Nov 26 09:38:03 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
Nov 26 09:38:03 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1376
Nov 26 09:38:03 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
Nov 26 09:38:03 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1253
Nov 26 09:38:08 host.fnal.gov systemd-otel-collector[1202537]: 2024-11-26T15:38:08.670Z        error        scrape/scrape.go:1298        Scrape commit failed        {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "systemd-exporter", "target": "http://localhost:9558/metrics", "error": "data refused due to h>
Nov 26 09:38:08 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
Nov 26 09:38:08 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1298
Nov 26 09:38:08 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
Nov 26 09:38:08 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1376
Nov 26 09:38:08 host.fnal.gov systemd-otel-collector[1202537]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
Nov 26 09:38:08 host.fnal.gov systemd-otel-collector[1202537]:         github.com/prometheus/[email protected]/scrape/scrape.go:1253
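
To make the trend in the log easier to see, the before/after values of each forced GC can be paired up. A minimal sketch, assuming the two message strings and the `cur_mem_mib` field appear as in the lines above; the sample lines are abridged from this log (journald prefix and caller field dropped), and `gc_pairs` is a hypothetical helper:

```python
import re

# Matches the ISO timestamp, one of the two memory_limiter messages,
# and the trailing cur_mem_mib field.
PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).*?"
    r"(?P<msg>Forcing a GC|Memory usage after GC).*?"
    r'"cur_mem_mib":\s*(?P<mib>\d+)'
)


def gc_pairs(log_text):
    """Yield (timestamp, before_mib, after_mib) for each forced GC."""
    before = None
    for line in log_text.splitlines():
        m = PATTERN.search(line)
        if not m:
            continue
        mib = int(m["mib"])
        if m["msg"] == "Forcing a GC":
            before = (m["ts"], mib)
        elif before is not None:
            yield before[0], before[1], mib
            before = None


# Abridged from the log above.
sample = """\
2024-11-26T15:21:44.556Z info Memory usage is above soft limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 327}
2024-11-26T15:21:44.643Z info Memory usage after GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 289}
2024-11-26T15:36:44.556Z info Memory usage is above soft limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 345}
2024-11-26T15:36:44.645Z info Memory usage after GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 308}
"""

for ts, before_mib, after_mib in gc_pairs(sample):
    print(ts, before_mib, after_mib, before_mib - after_mib)
```

Across the full excerpt each forced GC reclaims roughly 34 to 41 MiB, while the post-GC floor sits at 287 to 290 MiB until 09:31 and then climbs to 308 MiB by 09:36, at which point data is refused.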
jcpunk added the bug label Nov 26, 2024