HDDS-11780. Increase client write retry when SCM is in safe mode #7470

ArafatKhan2198 · 2024-11-22T06:46:23Z

What changes were proposed in this pull request?

Root Cause:

DataNode Registration Delays: Each DataNode requires approximately 30 seconds to register with the leader SCM due to the heartbeat interval.
SCM Restart and Leadership Retention: In scenarios where the restarted SCM retains leadership after an election, the SCM loses its in-memory state. DataNodes and pipelines must re-register, leading to delays in exiting safe mode.
Dependency on HealthyPipelineSafeModeRule: This rule requires DataNodes to report pipeline health, which can be delayed due to slow DataNode registration, network latency, or the time needed for pipelines to stabilize.
These factors combined caused the SCM to take slightly over a minute to exit safe mode, impacting write operations during this transition.

Current Mechanism:

The handleSubmitRequestAndSCMSafeModeRetry method manages write requests (e.g., block allocation or key creation) during SCM safe mode by:
Catching the "SCM in safe mode" exception.
Retrying the operation after a defined wait interval.
Allowing limited retries to wait for the SCM to exit safe mode.
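
The catch, wait, and retry steps above can be sketched roughly as follows. This is an illustrative sketch only, not the actual body of handleSubmitRequestAndSCMSafeModeRetry: the class SafeModeRetrySketch, the exception type ScmInSafeModeException, and the method submitWithSafeModeRetry are hypothetical names, and the assumption that the retry count excludes the first attempt is not confirmed by this PR.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; not the real OzoneManagerProtocolClientSideTranslatorPB code.
class SafeModeRetrySketch {

  /** Stand-in (assumed) for the "SCM in safe mode" error seen by the client. */
  static class ScmInSafeModeException extends Exception {
    ScmInSafeModeException(String message) {
      super(message);
    }
  }

  /**
   * Runs the request. On a safe-mode error it sleeps for waitMs and tries again,
   * allowing up to retryCount retries after the first attempt.
   */
  static <T> T submitWithSafeModeRetry(Callable<T> request, int retryCount, long waitMs)
      throws Exception {
    int retriesLeft = retryCount;
    while (true) {
      try {
        return request.call();
      } catch (ScmInSafeModeException e) {
        if (retriesLeft <= 0) {
          throw e; // retries exhausted: surface the safe-mode error to the caller
        }
        retriesLeft--;
        Thread.sleep(waitMs); // fixed wait before the next attempt
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] attempts = {0};
    // Simulated write request: fails three times with the safe-mode error, then succeeds.
    String result = submitWithSafeModeRetry(() -> {
      if (++attempts[0] <= 3) {
        throw new ScmInSafeModeException("SCM in safe mode");
      }
      return "allocated";
    }, 5, 10L);
    System.out.println(result + " after " + attempts[0] + " attempts");
  }
}
```

Any other exception propagates immediately in this sketch; only the safe-mode error is retried.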

Proposed Change:

Config Update: Increase BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS to 1000 ms and BLOCK_ALLOCATION_RETRY_COUNT to 90.
With 90 retries at 1000 ms each, the total retry wait grows from ~25–30 seconds to ~90 seconds.
Impact: Write operations are less likely to fail prematurely when SCM needs slightly over a minute to exit safe mode, improving client resilience.
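
As a quick check of the arithmetic, the retry budget is simply the retry count multiplied by the wait interval. The sketch below uses the two values from this PR; the class RetryBudgetSketch, its helper methods, and the 65-second safe-mode exit time (standing in for "slightly over a minute") are assumptions made for illustration. It ignores the latency of the requests themselves, so the real elapsed time would be somewhat longer.

```java
// Hypothetical helper; only the two constant values come from this PR.
class RetryBudgetSketch {
  static final long BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS = 1000L;
  static final int BLOCK_ALLOCATION_RETRY_COUNT = 90;

  /** Total time spent sleeping between attempts, in milliseconds. */
  static long totalWaitMs(int retryCount, long waitMs) {
    return Math.multiplyExact((long) retryCount, waitMs);
  }

  /** True if the retry budget is at least as long as the safe-mode exit time. */
  static boolean covers(long safeModeExitMs, int retryCount, long waitMs) {
    return totalWaitMs(retryCount, waitMs) >= safeModeExitMs;
  }

  public static void main(String[] args) {
    long budgetMs = totalWaitMs(BLOCK_ALLOCATION_RETRY_COUNT, BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS);
    System.out.println("retry budget (ms): " + budgetMs);
    // Assumed exit time of 65 s, approximating "slightly over a minute".
    boolean ok = covers(65_000L, BLOCK_ALLOCATION_RETRY_COUNT, BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS);
    System.out.println("covers a 65 s safe-mode exit: " + ok);
  }
}
```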

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11780

How was this patch tested?

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

sumitagrawl

LGTM

HDDS-11780. Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes.

85bac10

ArafatKhan2198 requested a review from sumitagrawl November 22, 2024 06:46

ArafatKhan2198 added scm client labels Nov 22, 2024

ArafatKhan2198 changed the title ~~HDDS-11780 Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes~~ HDDS-11780. Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes. Nov 22, 2024

adoroszlai changed the title ~~HDDS-11780. Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes.~~ HDDS-11780. Increase client write retry when SCM is in safe mode Nov 22, 2024

sumitagrawl reviewed Nov 22, 2024

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java Outdated

Fixed review comments

b69240e

sumitagrawl approved these changes Nov 25, 2024

sumitagrawl marked this pull request as ready for review November 25, 2024 03:48

sumitagrawl merged commit b090312 into apache:master Nov 25, 2024
53 checks passed

ArafatKhan2198 commented Nov 22, 2024 • edited by sumitagrawl