pmm-agent.service causes Too many open files crash on MongoDB and memory leaks on MariaDB #3262

Open
1 task done
Bobzemob opened this issue Oct 22, 2024 · 4 comments
Labels
bug Bug report


Bobzemob commented Oct 22, 2024

Description

MongoDB

On a MongoDB cluster, pmm-agent.service opened 40,000+ pages, causing MongoDB to crash with a "Too many open files" error.
This behavior occurred on a MongoDB server cluster running version 6.0.18.

MariaDB

On several MariaDB servers, pmm-agent.service caused errors that coincided with the MySQL process grabbing multiple GB of memory and not releasing it until the MySQL process was restarted. After the restart, the memory growth continued on servers that had pmm-agent.service active and stopped on the ones that had it disabled (see logs below).
Affected servers were using MariaDB version 10.3.39.

Expected Results

pmm-agent should not cause crashes or memory leaks.

Actual Results

PMM agent opened too many pages on MongoDB servers causing the MongoDB process to crash.
Disabling pmm-agent caused the number of open pages to drop from 40,000+ to around 500.
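
One way to track the numbers above over time is to count the entries under /proc/<pid>/fd for the mongod process and compare them with its soft "Max open files" limit. The following is only a minimal sketch, assuming a Linux host and assuming the "open pages" figure refers to open file descriptors; the function names and the `proc_root` parameter are illustrative and not part of PMM or MongoDB.

```python
from pathlib import Path
from typing import Optional


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count open file descriptors of a process (Linux procfs only)."""
    fd_dir = Path(proc_root) / str(pid) / "fd"
    # Each directory entry corresponds to one open descriptor.
    return sum(1 for _ in fd_dir.iterdir())


def soft_nofile_limit(pid: int, proc_root: str = "/proc") -> Optional[int]:
    """Return the soft 'Max open files' limit, or None if unlimited/absent."""
    limits = (Path(proc_root) / str(pid) / "limits").read_text()
    for line in limits.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None
```

Reading another process's fd directory usually requires running as the same user or as root. Calling `count_open_fds(pid)` with the mongod PID before and after stopping pmm-agent would show whether the count really drops as described.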

PMM agent caused MariaDB to allocate multiple GBs of memory without releasing it, eventually leading to a crash.
Disabling the pmm-agent service stopped this behavior from occurring.
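
To put numbers on the memory growth, the resident set size of the MySQL process can be sampled from /proc/<pid>/status with the agent enabled and disabled. This is a rough sketch under the assumption of a Linux host; the helper names are hypothetical and this is not how PMM itself measures memory.

```python
import re
from pathlib import Path
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vm_rss_kib(pid: int, proc_root: str = "/proc") -> Optional[int]:
    """Read the current resident set size of a process, in kB."""
    status = (Path(proc_root) / str(pid) / "status").read_text()
    return parse_vm_rss_kib(status)
```

Sampling `read_vm_rss_kib(pid)` at a fixed interval on one server with pmm-agent running and one with it stopped would give a like-for-like comparison of the behavior described above.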

Version

PMM Server 2.43.1
PMM Agent 2.43.1-6

Steps to reproduce

No response

Relevant logs

------- MariaDB-DB1 -------

Oct 19 20:26:09 mariadb-1 pmm-agent[7740]: time="2024-10-19T20:26:09.537-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'binlog_expire_logs_seconds
'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runn
er.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/r
untime/asm_amd64.s:1695" component=runner id=/action_id/3ee7b8a6-c981-4a8c-8588-9386a2f2f49d type=mysql-query-select

Oct 19 20:26:48 mariadb-2 pmm-agent[7740]: time="2024-10-19T20:26:48.992-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.global_variables' doesn't
 exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent
/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/
src/runtime/asm_amd64.s:1695" component=runner id=/action_id/c5a00902-0466-48bd-9737-e96481ba336a type=mysql-query-select

Oct 19 20:27:08 mariadb-1 pmm-agent[7740]: time="2024-10-19T20:27:08.295-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'
\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runne
r.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/ru
ntime/asm_amd64.s:1695" component=runner id=/action_id/872cc710-2347-406a-9ff4-2dc860098e1d type=mysql-query-select

Oct 19 20:27:08 mariadb-1 pmm-agent[7740]: time="2024-10-19T20:27:08.303-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'
\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runne
r.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/ru
ntime/asm_amd64.s:1695" component=runner id=/action_id/4c68b175-81ee-477a-9843-c8bca1b99c1b type=mysql-query-select

------- MariaDB-DB2 -------

Oct 19 20:26:03 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:26:03.262-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'binlog_expire_logs_seconds
'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runn
er.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/r
untime/asm_amd64.s:1695" component=runner id=/action_id/edf95dab-c4af-4f03-9ffb-aa33a4a0b7dc type=mysql-query-select


Oct 19 20:26:42 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:26:42.635-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.global_variables' doesn't
 exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent
/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/
src/runtime/asm_amd64.s:1695" component=runner id=/action_id/8881d5c2-e578-456c-8743-6b3a7078a7c7 type=mysql-query-select


Oct 19 20:27:01 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:27:01.915-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'
\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runne
r.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/ru
ntime/asm_amd64.s:1695" component=runner id=/action_id/a4bca5b4-b0a7-412c-985f-2be52b6f1cbc type=mysql-query-select


Oct 19 20:27:01 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:27:01.923-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'
\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runne
r.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/ru
ntime/asm_amd64.s:1695" component=runner id=/action_id/374783d2-aeec-495d-a68b-9a521217d66c type=mysql-query-select

------- MariaDB-DB3 -------

Oct 19 20:17:32 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:17:32.657-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.replication_connection_con
figuration' doesn't exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.co
m/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexi
t\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/403df0eb-ac90-419d-9c38-0bf8e7d1e261 type=mysql-query-select

Oct 19 20:22:34 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:22:34.654-04:00" level=warning msg="Action terminated with error: context deadline exceeded\ngithub.com/percona/pmm/agent/runner/actions.(
*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:81\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/g
ithub.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/ac
tion_id/8b49d229-5348-4ed3-8c79-12aa241e8dfa type=mysql-query-select

Oct 19 20:26:13 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:26:13.820-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'binlog_expire_logs_seconds'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/01bdc6f9-7c8b-4045-b495-d27e255656d8 type=mysql-query-select

Oct 19 20:26:53 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:26:53.193-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.global_variables' doesn't exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/e8d6680c-8af9-4cf5-8cfb-7f7f0ce07e57 type=mysql-query-select

Oct 19 20:27:12 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:27:12.498-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/fff68a5f-1185-441d-944a-94eaa8521a55 type=mysql-query-select

Oct 19 20:27:12 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:27:12.506-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/80935cc1-dc5a-4352-b8b3-c3d6b9894e98 type=mysql-query-select

------- MongoDB-DB1 -------

Oct 20 23:28:08 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:08.156+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, c
urrent topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID]
 component=agent-builtin db=local type=qan_mongodb_profiler_agent
Oct 20 23:28:08 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:08.156+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, c
urrent topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID]
 component=agent-builtin db=DatabaseProd type=qan_mongodb_profiler_agent
Oct 20 23:28:08 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:08.156+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, c
urrent topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID]
 component=agent-builtin db=Database type=qan_mongodb_profiler_agent
Oct 20 23:28:09 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:09.479+00:00" level=error msg="time=\"2024-10-20T23:28:09Z\" level=error msg=\"Registry - Cannot get node type to check if this is
a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: c
onnect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.010+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Registry - Cannot get node type to check if this is
a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: c
onnect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.157+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, c
urrent topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID]
 component=agent-builtin db=DatabaseProd type=qan_mongodb_profiler_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.157+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, c
urrent topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID]
 component=agent-builtin db=Database type=qan_mongodb_profiler_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.157+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, c
urrent topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID]
 component=agent-builtin db=local type=qan_mongodb_profiler_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.384+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"error while checking mongodb connection: server sele
ction error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused },
] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.401+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Cannot get node type: server selection error: contex
t canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }\" component=dia
gnosticDataCollector" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.477+00:00" level=info msg="2024-10-20T23:28:10.476Z\twarn\tVictoriaMetrics/lib/promscrape/scrapework.go:387\tcannot scrape targe
t \"http://[IP_ADDRESS]/metrics?collect%5B%5D=diagnosticdata&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" ({agent_id=\"/agent_id/[AGENT_ID]\",agent_type=\"mongo
db_exporter\",cluster=\"[CLUSTER_NAME]\",instance=\"/agent_id/[AGENT_ID]\",job=\"mongodb_exporter_agent_id_[AGENT_ID]_hr\",machin..node_id=\"/node_id/c0273607-9363-4359-b2f9-d29b6ffc082d\",node_name=\"mongo-db1\",node_type=\"generic\",replication_set=\"[REP_SET]\",service_id=\"/service_id/[SERVICE_ID]\",service_name=\"mongo-db1-mongodb\",service_type=\"mongodb\"}) 1 out of 1 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: error when scraping \"http://[IP_ADDRESS]/metrics?collect%5B%5D=diagnosticdata&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" with timeout 4.5s: timeout" agentID=/agent_id/[AGENT_ID] component=agent-process type=vm_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.478+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"error while checking mongodb connection: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.478+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Cannot get node type: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }\" component=diagnosticDataCollector" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.972+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Registry - Cannot get node type to check if this is a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:11 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:11.972+00:00" level=error msg="time=\"2024-10-20T23:28:11Z\" level=error msg=\"error while checking mongodb connection: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:12 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:12.158+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=DatabaseProd type=qan_mongodb_profiler_agent
Oct 20 23:28:12 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:12.158+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=local type=qan_mongodb_profiler_agent
Oct 20 23:28:12 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:12.158+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=Database type=qan_mongodb_profiler_agent
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.460+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"Registry - Cannot get node type to check if this is
a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: c
onnect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.477+00:00" level=info msg="2024-10-20T23:28:30.477Z\twarn\tVictoriaMetrics/lib/promscrape/scrapework.go:387\tcannot scrape targe
t \"http://[IP_ADDRESS]/metrics?collect%5B%5D=diagnosticdata&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" ({agent_id=\"/agent_id/[AGENT_ID]\",agent_type=\"mongo
db_exporter\",cluster=\"[CLUSTER_NAME]\",instance=\"/agent_id/[AGENT_ID]\",job=\"mongodb_exporter_agent_id_[AGENT_ID]_hr\",machin..node_id=\"/node_id/c
0273607-9363-4359-b2f9-d29b6ffc082d\",node_name=\"mongo-db1\",node_type=\"generic\",replication_set=\"[REP_SET]\",service_id=\"/service_id/16332cef-07e6-424e-b998-7aa6884471ba\",service_name=\"[SERVICE_NAME]
\",service_type=\"mongodb\"}) 1 out of 1 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: error when scraping \"http://127.0.0.1:42000/metrics?collect%5B%5D=diagnostic
data&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" with timeout 4.5s: timeout" agentID=/agent_id/[AGENT_ID] component=agent-process type=vm_agent
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.656+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"error while checking mongodb connection: server sele
ction error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused },
] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.673+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"error while checking mongodb connection: server sele
ction error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused },
] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.690+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"Cannot get node type: server selection error: contex
t canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }\" component=dia
gnosticDataCollector" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
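
The MariaDB warnings above repeat a small set of messages ("Unknown system variable ..." and "Table ... doesn't exist"). A short script can tally which variables and tables the server reported as missing. This is a sketch only: it assumes each log entry has been joined back into a single line (several entries above are wrapped), and the function and label names are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns taken from the error text in the pmm-agent warnings above.
PATTERNS = {
    "unknown_variable": re.compile(r"Unknown system variable '([^']+)'"),
    "missing_table": re.compile(r"Table '([^']+)' doesn't exist"),
}


def summarize_warnings(lines: Iterable[str]) -> Counter:
    """Count (kind, name) pairs found in pmm-agent log lines."""
    counts: Counter = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            for name in pattern.findall(line):
                counts[(kind, name)] += 1
    return counts
```

Feeding it the output of journalctl for the pmm-agent unit would show at a glance which queries fail on this MariaDB version. Whether those failing queries are related to the memory growth is not established by the logs.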

Code of Conduct

  • I agree to follow Percona Community Code of Conduct
Bobzemob added the "bug" (Bug report) label on Oct 22, 2024
Bobzemob changed the title from "pmm-agent.service causes Too many open file crash on MongoDB and memory leaks on and MariaDB" to "pmm-agent.service causes Too many open files crash on MongoDB and memory leaks on MariaDB" on Oct 22, 2024
@BupycHuk
Member

Hi, we are working on the MongoDB memory leak. Regarding MariaDB, we need to investigate the cause of that problem.


wreiske commented Oct 29, 2024

This is a major issue for us, causing multiple replica set members to crash. We had to stop pmm-agent on all of our MongoDB replica set servers due to this bug.

[image]

[image]

Also seeing this on MariaDB:

[image]

@BupycHuk
Member

@wreiske we are fixing the MongoDB issue in 2.43.2 and releasing it today. Please recheck MariaDB after the upgrade. If the problem persists, please create a task in our Jira at jira.percona.com.


wreiske commented Oct 30, 2024

Get:3 http://repo.percona.com/percona/apt bullseye/main amd64 pmm2-client amd64 2.42.0-6.bullseye [88.0 MB]

Just ran an apt update, still latest available is:

pmm-agent --version
ProjectName: pmm-agent
Version: 2.42.0
PMMVersion: 2.42.0
Timestamp: 2024-06-06 15:28:56 (UTC)
FullCommit: 74e57527735bd062c4bd37adbd89c31bb14ebc15

I'll check back later today.
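
For checking a whole fleet, the `pmm-agent --version` output shown above can be parsed and compared against 2.43.2, the release the maintainer named for the MongoDB fix. This is a sketch that assumes the output keeps the "Version: x.y.z" line format shown in this comment; the function names are illustrative.

```python
import re
from typing import Optional, Tuple

# Release named in the maintainer's comment as carrying the MongoDB fix.
FIXED_VERSION = (2, 43, 2)


def parse_agent_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from the 'Version:' line, if present."""
    match = re.search(r"^Version:\s*(\d+)\.(\d+)\.(\d+)", output, flags=re.MULTILINE)
    return tuple(int(part) for part in match.groups()) if match else None


def has_mongodb_fix(output: str) -> bool:
    """True when the reported agent version is at least FIXED_VERSION."""
    version = parse_agent_version(output)
    return version is not None and version >= FIXED_VERSION
```

With the output pasted above (Version: 2.42.0), `has_mongodb_fix` returns False, which matches the observation that the repository had not yet served the fixed package.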
