HDDS-11780. Increase client write retry when SCM is in safe mode #7470

Merged
merged 2 commits into from
Nov 25, 2024

@ArafatKhan2198 ArafatKhan2198 commented Nov 22, 2024

What changes were proposed in this pull request?

Root Cause:

  • DataNode Registration Delays: Each DataNode requires approximately 30 seconds to register with the leader SCM due to the heartbeat interval.
  • SCM Restart and Leadership Retention: In scenarios where the restarted SCM retains leadership after an election, the SCM loses its in-memory state. DataNodes and pipelines must re-register, leading to delays in exiting safe mode.
  • Dependency on HealthyPipelineSafeModeRule: This rule requires DataNodes to report pipeline health, which can be delayed due to slow DataNode registration, network latency, or the time needed for pipelines to stabilize.
  • These factors combined caused the SCM to take slightly over a minute to exit safe mode, impacting write operations during this transition.

Current Mechanism:

  • The handleSubmitRequestAndSCMSafeModeRetry method manages write requests (e.g., block allocation or key creation) during SCM safe mode by:
      • Catching the "SCM in safe mode" exception.
      • Retrying the operation after a defined wait interval.
      • Allowing a limited number of retries while waiting for the SCM to exit safe mode.
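
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the Ozone implementation: the class `SafeModeRetrySketch`, the exception type `ScmSafeModeException`, and the method `submitWithSafeModeRetry` are hypothetical names, and the real handleSubmitRequestAndSCMSafeModeRetry method may detect the safe-mode condition and count attempts differently.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a safe-mode retry loop; not the Ozone code. */
final class SafeModeRetrySketch {

  /** Stand-in for the "SCM in safe mode" error (hypothetical type). */
  static final class ScmSafeModeException extends Exception {
    ScmSafeModeException(String message) {
      super(message);
    }
  }

  /**
   * Runs the request and retries only when it fails with the safe-mode
   * error. retryCount is the number of retries allowed after the first
   * attempt; waitMs is the pause before each retry.
   */
  static <T> T submitWithSafeModeRetry(
      Callable<T> request, int retryCount, long waitMs) throws Exception {
    int retriesLeft = retryCount;
    while (true) {
      try {
        return request.call();
      } catch (ScmSafeModeException e) {
        if (retriesLeft <= 0) {
          throw e; // retries exhausted: surface the failure to the caller
        }
        retriesLeft--;
        Thread.sleep(waitMs); // give the SCM time to exit safe mode
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] attempts = {0};
    // Simulated request: fails twice with the safe-mode error, then succeeds.
    String result = submitWithSafeModeRetry(() -> {
      if (++attempts[0] < 3) {
        throw new ScmSafeModeException("SCM in safe mode");
      }
      return "allocated";
    }, 5, 10L);
    System.out.println(result + " after " + attempts[0] + " attempts");
  }
}
```

Any exception other than the safe-mode error propagates immediately in this sketch, so only the safe-mode case consumes the retry budget.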

Proposed Change:

  • Config Update: Increase BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS to 1000 ms and BLOCK_ALLOCATION_RETRY_COUNT to 90.
  • This extends the total retry wait time from ~25–30 seconds to ~90 seconds (90 retries × 1000 ms).
  • Impact: Write operations should no longer fail prematurely when the SCM takes longer to exit safe mode, which improves client resilience.
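
As a quick check of the arithmetic behind the new values, the sketch below computes the retry wait budget as retry count times wait interval. The class `RetryBudgetSketch` and the method `sleepBudget` are hypothetical; the two constants are local stand-ins that only mirror the values proposed here, and the calculation ignores the time spent in each RPC attempt.

```java
import java.time.Duration;

/** Hypothetical sketch; the constants below are local stand-ins. */
final class RetryBudgetSketch {

  // Values proposed in this PR, copied here for illustration only.
  static final long BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS = 1000L;
  static final int BLOCK_ALLOCATION_RETRY_COUNT = 90;

  /** Upper bound on the time spent sleeping between retries. */
  static Duration sleepBudget(int retryCount, long waitMs) {
    return Duration.ofMillis(Math.multiplyExact((long) retryCount, waitMs));
  }

  public static void main(String[] args) {
    Duration budget = sleepBudget(
        BLOCK_ALLOCATION_RETRY_COUNT, BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS);
    // 90 retries x 1000 ms = 90 seconds of waiting at most.
    System.out.println("Retry sleep budget: " + budget.getSeconds() + " s");
  }
}
```

A budget of 90 seconds sits above the "slightly over a minute" safe-mode exit time reported in the root cause, which is the margin this change relies on.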

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11780

How was this patch tested?

@ArafatKhan2198 ArafatKhan2198 changed the title from "HDDS-11780 Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes" to "HDDS-11780. Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes." Nov 22, 2024
@adoroszlai adoroszlai changed the title from "HDDS-11780. Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes." to "HDDS-11780. Increase client write retry when SCM is in safe mode" Nov 22, 2024

@sumitagrawl sumitagrawl left a comment

LGTM

@sumitagrawl sumitagrawl marked this pull request as ready for review November 25, 2024 03:48
@sumitagrawl sumitagrawl merged commit b090312 into apache:master Nov 25, 2024
53 checks passed