
RONDB-820: Added parallelism in invalidating node from LCP in table #610

Open
wants to merge 9 commits into base: 24.10-main
Conversation

mronstro
Collaborator

With support for hundreds of thousands of tables, it is important to speed up the handling of many tables by parallelising parts of node recovery. This patch parallelises the invalidation of a failed node in the table files. This phase happens as part of node recovery in the live node, before the failed node is permitted to continue its start process.
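As a rough illustration of the idea, the per-table invalidation work can be split across several independent chains, each striding over a disjoint subset of table ids. This is a minimal sketch with illustrative names (`kParallelChains`, `invalidate_node_in_table`), not the actual RonDB/DBDIH identifiers:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: invalidate a failed node's entries across many
// table files using several parallel "chains" instead of a single
// sequential pass. Names are illustrative, not RonDB's own.
static const unsigned kParallelChains = 4;

// Stand-in for the per-table invalidation work on the table file.
inline void invalidate_node_in_table(std::vector<bool>& table_has_node,
                                     unsigned table_id) {
  table_has_node[table_id] = false;  // drop failed node from this table
}

// Chain c handles table ids c, c + P, c + 2P, ... so the chains never
// touch the same table and could safely run concurrently.
inline void invalidate_failed_node(std::vector<bool>& table_has_node) {
  for (unsigned chain = 0; chain < kParallelChains; chain++) {
    for (unsigned t = chain; t < table_has_node.size();
         t += kParallelChains) {
      invalidate_node_in_table(table_has_node, t);
    }
  }
}
```

The strided partitioning means no coordination is needed between chains beyond joining at the end, which fits the signal-driven execution model described in the commits below.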

@mronstro mronstro force-pushed the RONDB-820 branch 2 times, most recently from 2f41a2c to 888e999 Compare December 25, 2024 19:57
- Added debugging for understanding table handling during recovery
- Added parallelism in invalidating node from LCP in table


Added parallelism for remove node from table at NF

This patch adds parallelism to the code that removes a failed node from the table files. This code is executed every time a node fails, and parallelising it decreases the wait for our node id to be ready for restart.

- Enabled removal of massive amounts of log output
- Local optimisation of COPY_TABREQ handling
- Simplified result of rondb_big test case
- Added restart log message about multi transporter setup
- Moved back to 300 tables with 11 indexes in rondb_big.ndb_many_tables

A major part of node recovery is waiting for the LCP, so the LCP should not be delayed while node recovery is ongoing. Removed delays in LCP processing when node recovery is ongoing and optimised writing tables into pages.

Ensured table was initialised
Improved sending DIH metadata during NR

During NR we read the metadata in the master DIH and copy it over to the
starting node using the COPY_TABREQ signal. This was previously done one
table at a time. Now this is parallelised, sending over 8 tables in
parallel. Older nodes can also handle parallel COPY_TABREQs, but
only 4 at a time.

Don't start new COPY_TABREQs when enough are already outstanding
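The cap on outstanding COPY_TABREQs can be pictured as a simple send window: send until the cap is reached, and only send the next table when a confirmation comes back. This is a sketch under assumed, illustrative names (`CopyTabWindow`, `pump`, `on_conf`), not the real DBDIH signal-handling code:

```cpp
#include <cassert>
#include <queue>

// Sketch of capping outstanding COPY_TABREQ signals. The cap is 8 for
// new nodes and 4 for older nodes, per the description above; all other
// names here are illustrative.
struct CopyTabWindow {
  unsigned max_outstanding;      // 8 for new nodes, 4 for older nodes
  unsigned outstanding = 0;      // COPY_TABREQs in flight
  std::queue<unsigned> pending;  // table ids still to copy

  explicit CopyTabWindow(unsigned cap) : max_outstanding(cap) {}

  // Start more COPY_TABREQs up to the cap; returns how many were sent.
  unsigned pump() {
    unsigned sent = 0;
    while (outstanding < max_outstanding && !pending.empty()) {
      pending.pop();  // "send" COPY_TABREQ for this table id
      outstanding++;
      sent++;
    }
    return sent;
  }

  // Called on COPY_TABCONF: decrement only after all handling of the
  // confirmation is done, so the counter never reads zero while a
  // signal is still being processed.
  void on_conf() {
    assert(outstanding > 0);
    outstanding--;
    pump();
  }

  bool done() const { return outstanding == 0 && pending.empty(); }
};
```

Tracking every outstanding signal in one counter, and decrementing it late, is what lets the loop decide safely when the whole copy phase has finished rather than quitting while signals are still in flight.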

Fixes to avoid having outstanding signals after the outstanding counter has already been decremented
Ensured the finishing loop in copyNodeLab does not finish an outstanding request
Added more debugging and jam around queued LCP write info in DBDIH
Needed to loop until c_end_tab_queued
Initialised variables controlling delay of LCPs
Delayed the decrement of the outstanding counter to avoid race conditions
Needed to track all outstanding signals to ensure we don't quit too early

Fix for compiling on GCC 13
Added a missed call to unreservePages when there is no need to remove the table from the node; added a bit of debugging
Fixed test case ndbinfo_plans
Fixed compiler warning

Added debugging around Pause LCP and LCP ongoing flag

- Fixed problem with c_lcp_id_while_copy_meta_data

With parallel copying of tables using COPY_TABREQ it is no longer
correct to use a single shared variable c_lcp_id_while_copy_meta_data to
keep track of the current LCP id for the COPY_TABREQ. By moving this
variable to the table record we ensure that we can handle any
parallelism change that might occur in the future.

Disable debugging
Fixed optimisation of the check of LCP completion

- Only print logs about individual fragments if full restart logs are enabled
- Optimise checkSchemaStatus
- More printouts on stages in restarts