Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Validator regularly OOMs on Kusama #2276

Closed
Lederstrumpf opened this issue Nov 17, 2024 · 4 comments · Fixed by #2278
Closed

[Bug]: Validator regularly OOMs on Kusama #2276

Lederstrumpf opened this issue Nov 17, 2024 · 4 comments · Fixed by #2278
Labels
bug Something isn't working

Comments

@Lederstrumpf
Copy link
Contributor

Lederstrumpf commented Nov 17, 2024

Bug Summary

Validator regularly runs out of memory on Kusama

Bug Description

If in active set, Kusama validator regularly runs out of memory.
This happens both on current master (153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1) and on the refactor/notifications (b5834724b3f18b1d694051cae1afff06d1ee7652, PR #2269) branches, although it happens less regularly on the former (~ every hour or two) than the latter (~ every 10 minutes), but it also accumulates far fewer era points on the former (~ dozens per era), than on the latter (~ hundreds per era), so this may be due to the latter resolving networking issues and thereby enabling the validator to engage in consensus more actively.

This happens on a host with 64 GB RAM. If this is insufficient for operating a Kagome validator, please close this issue.

Steps to Reproduce

Mode: Validator
number of nodes: 1
Command: kagome --chain kusama -d [...] --validator --listen-addr [...] --public-addr [...] --name [...] --rpc-port [...] --telemetry-url [...] --telemetry-url [...] --node-key-file [...]

Effects of the Bug

The node gets culled by the OOM killer.

An example log from current master (153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1):

Nov 17 06:44:37 v1-oslo kagome[557803]: 24.11.17 06:44:37.445020  worker.4         Info      BlockStorage  Added block #25815332 (0x17f2…2868) as child of #25815331 (0x2280…3aec)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.301360  main_runner      Info      BlockExecutor  Imported block #25815332 (0x17f2…2868) within 1159 ms. (lag 2301 ms.)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.301406  main_runner      Info      Timeline  Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.305339  statement-distr  Info      StatementDistribution  Send my view. (peers_count=737)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.578040  main_runner      Info      Timeline  Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.616411  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.657594  main_runner      Info      Timeline  Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:40 v1-oslo kagome[557803]: 24.11.17 06:44:40.682522  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:41 v1-oslo kagome[557803]: 24.11.17 06:44:41.300233  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:42 v1-oslo kagome[557803]: 24.11.17 06:44:42.575388  worker.23        Info      BlockStorage  Added block #25815333 (0xa039…b617) as child of #25815332 (0x17f2…2868)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.511382  main_runner      Info      BlockExecutor  Imported block #25815333 (0xa039…b617) within 1196 ms. (lag 1511 ms.)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.530897  statement-distr  Info      StatementDistribution  Send my view. (peers_count=753)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.801411  main_runner      Info      Timeline  Caught up block #25815333 (0xa039…b617)
Nov 17 06:44:45 v1-oslo kagome[557803]: 24.11.17 06:44:45.851714  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:48 v1-oslo kagome[557803]: 24.11.17 06:44:48.891030  worker.4         Info      BlockStorage  Added block #25815334 (0xf69e…46cc) as child of #25815333 (0xa039…b617)
Nov 17 06:44:49 v1-oslo kagome[557803]: 24.11.17 06:44:49.808740  main_runner      Info      BlockExecutor  Imported block #25815334 (0xf69e…46cc) within 1419 ms. (lag 1808 ms.)
Nov 17 06:44:49 v1-oslo kagome[557803]: 24.11.17 06:44:49.816217  statement-distr  Info      StatementDistribution  Send my view. (peers_count=773)
Nov 17 06:44:50 v1-oslo kagome[557803]: 24.11.17 06:44:50.217742  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:44:50 v1-oslo kagome[557803]: 24.11.17 06:44:50.901453  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:52 v1-oslo kagome[557803]: 24.11.17 06:44:52.180790  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:55 v1-oslo kagome[557803]: 24.11.17 06:44:55.830957  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:55 v1-oslo kagome[557803]: 24.11.17 06:44:55.845407  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:56 v1-oslo kagome[557803]: 24.11.17 06:44:56.081477  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:56 v1-oslo kagome[557803]: 24.11.17 06:44:56.281898  worker.25        Info      BlockStorage  Added block #25815335 (0xe1ab…0d05) as child of #25815334 (0xf69e…46cc)
Nov 17 06:44:57 v1-oslo kagome[557803]: 24.11.17 06:44:57.291405  main_runner      Info      BlockExecutor  Imported block #25815335 (0xe1ab…0d05) within 1314 ms. (lag 3291 ms.)
Nov 17 06:44:57 v1-oslo kagome[557803]: 24.11.17 06:44:57.308741  statement-distr  Info      StatementDistribution  Send my view. (peers_count=880)
Nov 17 06:44:58 v1-oslo kagome[557803]: 24.11.17 06:44:58.313495  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:44:59 v1-oslo kagome[557803]: 24.11.17 06:44:59.728689  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:45:00 v1-oslo kagome[557803]: 24.11.17 06:45:00.631230  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:45:00 v1-oslo kagome[557803]: 24.11.17 06:45:00.756659  worker.14        Info      BlockStorage  Added block #25815336 (0x1c59…59ec) as child of #25815335 (0xe1ab…0d05)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.609773  main_runner      Info      BlockExecutor  Imported block #25815336 (0x1c59…59ec) within 1320 ms. (lag 1609 ms.)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.617510  statement-distr  Info      StatementDistribution  Send my view. (peers_count=814)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.853707  main_runner      Info      Timeline  Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:02 v1-oslo kagome[557803]: 24.11.17 06:45:02.015478  main_runner      Info      Timeline  Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:02 v1-oslo kagome[557803]: 24.11.17 06:45:02.028220  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:45:06 v1-oslo kagome[557803]: 24.11.17 06:45:06.464939  main_runner      Info      Timeline  Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:06 v1-oslo kagome[557803]: 24.11.17 06:45:06.951102  worker.2         Info      BlockStorage  Added block #25815337 (0x415a…0681) as child of #25815336 (0x1c59…59ec)
Nov 17 06:45:08 v1-oslo kagome[557803]: 24.11.17 06:45:08.013042  main_runner      Info      BlockExecutor  Imported block #25815337 (0x415a…0681) within 1530 ms. (lag 2012 ms.)
Nov 17 06:45:08 v1-oslo kagome[557803]: 24.11.17 06:45:08.030263  statement-distr  Info      StatementDistribution  Send my view. (peers_count=903)
Nov 17 06:45:10 v1-oslo kagome[557803]: 24.11.17 06:45:10.150924  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:45:10 v1-oslo systemd[1]: kagosama.service: A process of this unit has been killed by the OOM killer.
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Main process exited, code=killed, status=9/KILL
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Failed with result 'oom-kill'.
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Consumed 1h 7min 16.322s CPU time, received 35.8G IP traffic, sent 2.5G IP traffic.
Nov 17 06:45:42 v1-oslo systemd[1]: kagosama.service: Scheduled restart job, restart counter is at 10.

An example log from current refactor/notifications (b5834724b3f18b1d694051cae1afff06d1ee7652):

Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.777435  statement-distr  Info      StatementDistribution  Fetch attested candidate success. (relay parent=0x8eeb…7431, candidate=0xe3e0…9547, group index=38, statements=4)
Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.777466  statement-distr  Info      StatementDistribution  Not importable. (relay parent=0x8eeb…7431, candidate=0xe3e0…9547, group index=38)
Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.778373  main_runner      Info      ParachainProcessorImpl  Candidate backed.(candidate=0xe3e0…9547, para id=11, relay_parent=0x8eeb…7431)
Nov 15 13:55:27 v1-oslo kagome[518209]: 24.11.15 13:55:27.379774  main_runner      Info      BlockTree  Finalized block #25791090 (0xb5c5…9ce1)
Nov 15 13:55:29 v1-oslo kagome[518209]: 24.11.15 13:55:29.378418  grandpa          Info      GrandpaEnvironment  Found best chain is longer than approved: #25791092 (0x8eeb…7431) > #25791090 (0xb5c5…9ce1); truncate it
Nov 15 13:55:31 v1-oslo kagome[518209]: 24.11.15 13:55:31.269221  worker.9         Info      BlockStorage  Added block #25791093 (0xcb20…2797) as child of #25791092 (0x8eeb…7431)
Nov 15 13:55:35 v1-oslo kagome[518209]: 24.11.15 13:55:35.104468  main_runner      Info      BlockExecutor  Imported block #25791093 (0xcb20…2797) within 4589 ms. (lag 5103 ms.)
Nov 15 13:55:37 v1-oslo kagome[518209]: 24.11.15 13:55:37.459163  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:38.246090  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:39.634165  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:41.287427  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:43.217174  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:44.695502  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:45.461195  main_runner      Info      BlockTree  Finalized block #25791091 (0xd078…f4c2)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:46.080193  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:46.866109  grandpa          Info      GrandpaEnvironment  Found best chain is longer than approved: #25791093 (0xcb20…2797) > #25791091 (0xd078…f4c2); truncate it
Nov 15 13:55:49 v1-oslo kagome[518209]: 24.11.15 13:55:48.796014  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:50.125748  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:52.024332  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:52.081410  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:55.035874  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:57.375167  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:58.591698  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:00.915122  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:02.325781  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0x5ac1…f9fc, group index=45, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:05.281818  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:06.386996  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0x7ded…3ca9, group index=47, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:08.579577  main_runner      Info      BlockTree  Finalized block #25791092 (0x8eeb…7431)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:09.463514  grandpa          Info      GrandpaEnvironment  Found best chain is longer than approved: #25791093 (0xcb20…2797) > #25791092 (0x8eeb…7431); truncate it
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:11.829982  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0xb80f…dd69, group index=54, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:18.836390  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0xf22e…772f, group index=48, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Main process exited, code=killed, status=9/KILL
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Failed with result 'signal'.
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Consumed 7min 144ms CPU time, 57.8G memory peak, 0B memory swap peak, read 30.5G from disk, written 18.8G to disk, received 1.7G IP traffic, sent 1.0G IP traffic.

Expected Behavior

The validator does not OOM? ^^

System Information

NixOS 24.5 with kernel 6.11.5
Compiler: gcc 13.2.0
CMake: cmake version 3.25.3

Built using flake from #2257

Relevant for this issue: host has 64 GB RAM.

Additional Context

Issue does not occur when running without --validator flag.

@Lederstrumpf Lederstrumpf added the bug Something isn't working label Nov 17, 2024
@kamilsa
Copy link
Contributor

kamilsa commented Nov 18, 2024

Thanks for reporting! We have few ideas what to check

@kamilsa
Copy link
Contributor

kamilsa commented Nov 19, 2024

Hey @Lederstrumpf can you please check if #2278 improves situation?

@Lederstrumpf
Copy link
Contributor Author

Hey @Lederstrumpf can you please check if #2278 improves situation?

Thanks switched to it, but #2280 blocks this atm (happens both on master and #2278).

@Lederstrumpf
Copy link
Contributor Author

Lederstrumpf commented Nov 19, 2024

Hey @Lederstrumpf can you please check if #2278 improves situation?

related: To debug #2280, I'm currently resyncing from a db backup that's trailing by 14 days atm. This also OOMs (on the same 64GB RAM node) in non-validator mode:

Nov 19 14:09:54 v1-oslo kagome[960975]: 24.11.19 14:09:54.656397  worker.27        Info      BlockStorage  Added block #25635906 (0x487d…f77b) as child of #25635905 (0x1a7b…e7d5)
Nov 19 14:09:54 v1-oslo kagome[960975]: 24.11.19 14:09:54.656690  main_runner      Info      BlockExecutor  Imported block #25635906 (0x487d…f77b) within 397 ms. (lag 14 days)
Nov 19 14:09:55 v1-oslo kagome[960975]: 24.11.19 14:09:55.059686  worker.7         Info      BlockStorage  Added block #25635907 (0x55ad…5076) as child of #25635906 (0x487d…f77b)
Nov 19 14:09:55 v1-oslo kagome[960975]: 24.11.19 14:09:55.065598  main_runner      Info      BlockExecutor  Imported block #25635907 (0x55ad…5076) within 408 ms. (lag 14 days)
Nov 19 14:09:55 v1-oslo kagome[960975]: 24.11.19 14:09:55.472307  worker.5         Info      BlockStorage  Added block #25635908 (0x1c2a…5ff0) as child of #25635907 (0x55ad…5076)
Nov 19 14:09:55 v1-oslo kagome[960975]: 24.11.19 14:09:55.480323  main_runner      Info      BlockExecutor  Imported block #25635908 (0x1c2a…5ff0) within 413 ms. (lag 14 days)
Nov 19 14:10:02 v1-oslo kagome[960975]: 24.11.19 14:10:02.375701  worker.13        Info      BlockStorage  Added block #25635909 (0x22e9…5ca8) as child of #25635908 (0x1c2a…5ff0)
Nov 19 14:10:06 v1-oslo kagome[960975]: 24.11.19 14:10:06.415148  main_runner      Info      BlockExecutor  Imported block #25635909 (0x22e9…5ca8) within 10931 ms. (lag 14 days)
Nov 19 14:10:27 v1-oslo kagome[960975]: 24.11.19 14:10:27.394828  worker.21        Info      BlockStorage  Added block #25635910 (0x323e…66f7) as child of #25635909 (0x22e9…5ca8)
Nov 19 14:10:36 v1-oslo kagome[960975]: 24.11.19 14:10:36.520396  main_runner      Info      BlockExecutor  Imported block #25635910 (0x323e…66f7) within 24076 ms. (lag 14 days)
Nov 19 14:11:49 v1-oslo systemd[1]: kagosama.service: A process of this unit has been killed by the OOM killer.
Nov 19 14:11:52 v1-oslo systemd[1]: kagosama.service: Main process exited, code=killed, status=9/KILL
Nov 19 14:11:52 v1-oslo systemd[1]: kagosama.service: Failed with result 'oom-kill'.
Nov 19 14:11:52 v1-oslo systemd[1]: kagosama.service: Consumed 12min 29.566s CPU time, received 3.1G

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants