
HDDS-11750. LegacyReplicationManager#notifyStatusChanged should submit Ratis request asynchronously #7459

Open
wants to merge 4 commits into master

Conversation

@ivandika3 (Contributor) commented Nov 20, 2024

What changes were proposed in this pull request?

We encountered an issue where the SCM gets stuck after a leadership transfer, causing all client requests (including those from OMs) to time out.

We saw that the SCM was throwing TimeoutException in StateMachineUpdater (the thread in charge of applying Raft logs and completing user requests), causing all SCM request processing to get stuck.

2024-11-18 15:54:50,182 [daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater] ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception while cleaning up excess replicas.
java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
        at com.sun.proxy.$Proxy19.completeMove(Unknown Source)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.deleteSrcDnForMove(ReplicationManager.java:1610)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.lambda$onLeaderReadyAndOutOfSafeMode$36(ReplicationManager.java:2364)
        at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.onLeaderReadyAndOutOfSafeMode(ReplicationManager.java:2342)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.notifyStatusChanged(ReplicationManager.java:2103)
        at org.apache.hadoop.hdds.scm.ha.SCMServiceManager.notifyStatusChanged(SCMServiceManager.java:53)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:338)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1650)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:239)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:182)
        at java.lang.Thread.run(Thread.java:748)
2024-11-18 15:54:50,183 [daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater] INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: can not remove source replica after successfully replicated to target datanode
2024-11-18 15:54:50,745 [EventQueue-CloseContainerForCloseContainerEventHandler] ERROR org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on execution message #8764007
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy16.updateContainerState(Unknown Source)
        at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:332)
        at org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:82)
        at org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:51)
        at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
        ... 8 more
2024-11-18 15:54:50,746 [EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager: Could not commit delete block transactions: []
java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
        at com.sun.proxy.$Proxy17.removeTransactionsFromDB(Unknown Source)
        at org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager.commitTransactions(SCMDeletedBlockTransactionStatusManager.java:527)
        at org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:384)
        at org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:73)
        at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748) 

We found that the root cause is the following call chain.

  • StateMachine#notifyTermIndexUpdated (triggered by the leadership transfer)
    • ReplicationManager#notifyStatusChanged
      • LegacyReplicationManager#notifyStatusChanged
        • LegacyReplicationManager#onLeaderReadyAndOutOfSafeMode
          • LegacyReplicationManager#deleteSrcDnForMove
            • LegacyReplicationManager.MoveScheduler#completeMove (annotated for replication, so it submits a Ratis request)
              • SCMHAInvocationHandler#invokeRatis
                • SCMRatisServerImpl#submitRequest
                  • RaftServerImpl#submitClientRequestAsync

We should never submit a Ratis request from ReplicationManager#notifyStatusChanged, since this causes a deadlock with the Ratis StateMachineUpdater. When ReplicationManager#notifyStatusChanged calls MoveScheduler#completeMove, it submits a Ratis request to the Raft server and waits until the log entry associated with it is applied by the StateMachineUpdater. However, since ReplicationManager#notifyStatusChanged itself runs on the StateMachineUpdater thread, the wait blocks the StateMachineUpdater, so the Raft log entry for the request submitted by MoveScheduler#completeMove can never be applied: a deadlock. As a result, the StateMachineUpdater gets stuck and most SCM client requests time out.
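
To make the deadlock concrete, here is a minimal standalone sketch (illustrative only, not Ozone code; the class and variable names are made up) of the pattern: the thread standing in for the StateMachineUpdater blocks on a future that only it can complete, so the wait can only end with the TimeoutException seen in the log above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified illustration: the "state machine updater" thread submits a
// request whose completion requires that same thread to apply a later log
// entry, then blocks waiting for it.
public class StateMachineDeadlockDemo {
  public static void main(String[] args) throws Exception {
    // Single thread standing in for Ratis' StateMachineUpdater.
    ExecutorService stateMachineUpdater = Executors.newSingleThreadExecutor();

    // Completed only when the updater thread "applies" the log entry.
    CompletableFuture<Void> applied = new CompletableFuture<>();

    stateMachineUpdater.submit(() -> {
      // Inside notifyStatusChanged(): submit a "Ratis request" and block on it.
      try {
        applied.get(3, TimeUnit.SECONDS);   // waits for the updater thread...
      } catch (TimeoutException e) {
        System.out.println("TimeoutException: the updater is waiting on itself");
      } catch (Exception ignored) {
      }
      // ...but the apply step below can only run after this task finishes,
      // so the wait above can never succeed.
      applied.complete(null);
    });

    stateMachineUpdater.shutdown();
    stateMachineUpdater.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```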

Currently, one possible fix might be to remove the onLeaderReadyAndOutOfSafeMode implementation altogether, in the hope that the inflight moves will be picked up by the main ReplicationManager thread.

Note: The issue should STILL happen after HDDS-10690. Although StateMachine#notifyTermIndexUpdated will no longer trigger ReplicationManager#notifyStatusChanged, StateMachine#notifyLeaderReady will trigger ReplicationManager#notifyStatusChanged instead, and StateMachine#notifyLeaderReady is still called on the StateMachineUpdater thread through RaftServerImpl#applyLogToStateMachine.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11750

How was this patch tested?

Integration test to reproduce the bug and verify the fix.

Failure before the fix: https://github.com/ivandika3/ozone/actions/runs/11965356725/job/33359496259

Success after fix: https://github.com/ivandika3/ozone/actions/runs/11965440504

@ivandika3 (Contributor, Author) commented Nov 20, 2024

@siddhantsangwan @lokeshj1703 @JacksonYao287 Please let me know what you think and whether it is safe to remove onLeaderReadyAndOutOfSafeMode altogether. This bug caused our SCM service to get stuck, resulting in cluster-wide unavailability.

cc: @xichen01 @symious @ferhui

@sodonnel (Contributor) commented:

I am not sure about this change. All that logic is there for some reason, so just removing it seems risky. What version are you currently running? Can you switch off the LegacyRM and use the new one?

@ivandika3 (Contributor, Author) commented Nov 20, 2024

@sodonnel Thanks for taking a look at this.

> All that logic is there for some reason, so just removing it seems risky

I'm not really well-versed in the design decisions for the container move in https://issues.apache.org/jira/browse/HDDS-5253. I'm currently looking at the possible reasons why this particular code was introduced and what the impact of removing it would be. I'm open to suggestions from any community members who are familiar with this.

> What version are you currently running?

Our main cluster version is based on 1.2.1, but we have backported a significant number of features and fixes. We are planning to upgrade our main cluster to 1.4.1 very soon.

However, even in 1.4.1 this particular issue might still happen (albeit very rarely). Therefore, I believe we need to address this, since LegacyReplicationManager is still technically supported.

> Can you switch off the LegacyRM and use the new one?

Yes, due to this bug we are now strongly considering switching to the new ReplicationManager, which does not persistently keep track of inflight moves (and will not submit a Ratis request in notifyStatusChanged). Can I verify whether there are any known regressions in moving from LegacyReplicationManager to ReplicationManager? I remember asking quite a while ago whether we can just upgrade from LegacyReplicationManager to ReplicationManager without regressions, and it seems there might not be any.

@@ -1896,9 +1896,6 @@ public int getContainerInflightDeletionLimit() {
   }

   protected void notifyStatusChanged() {
-    //now, as the current scm is leader and it`s state is up-to-date,
-    //we need to take some action about replicated inflight move options.
-    onLeaderReadyAndOutOfSafeMode();
@adoroszlai (Contributor) commented Nov 20, 2024:

I think we can fix it by deferring onLeaderReadyAndOutOfSafeMode to another thread.

@ivandika3 (Contributor, Author) commented Nov 20, 2024:

Thanks for the idea, that's another possible solution I can think of. Let me try this direction.

This should require synchronizing the containers tracked in the inflight move map.
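
For reference, a rough sketch of that direction (the executor name follows the later discussion in this thread; everything else is an assumption, not the final patch):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the deferral idea, not the actual patch: hand the
// blocking work to a dedicated single-thread executor so notifyStatusChanged()
// returns immediately and never blocks the Ratis StateMachineUpdater thread.
class AsyncNotifyStatusChangedSketch {

  private final ExecutorService inflightMoveScannerExecutor =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "InflightMoveScanner");
        t.setDaemon(true);
        return t;
      });

  void notifyStatusChanged() {
    inflightMoveScannerExecutor.submit(() -> {
      try {
        // May submit Ratis requests and block, which is now safe because it
        // no longer runs on the StateMachineUpdater thread.
        onLeaderReadyAndOutOfSafeMode();
      } catch (Throwable t) {
        // Never let an exception escape back into the caller.
        System.err.println("Failed to process inflight moves: " + t);
      }
    });
  }

  private void onLeaderReadyAndOutOfSafeMode() {
    // Placeholder for the real logic that scans inflight moves; access to the
    // shared inflight move state would need to be synchronized, as noted above.
  }
}
```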

@sodonnel (Contributor) commented:

There are no regressions that we know of when moving from the Legacy to the new RM.

There is also a suggestion in the 2.0 release plan to remove the code for Legacy RM as it is not our intention to use or support it going forward.

@xichen01 (Contributor) left a comment:

@ivandika3 Thanks for the patch. Overall LGTM. Just a nit for you.

@@ -333,6 +341,11 @@ public LegacyReplicationManager(final ConfigurationSource conf,
         .setDBTransactionBuffer(scmhaManager.getDBTransactionBuffer())
         .setRatisServer(scmhaManager.getRatisServer())
         .setMoveTable(moveTable).build();
+
+    inflightMoveScannerExecutor = Executors.newSingleThreadExecutor(
Contributor:
We should stop the thread in ReplicationManager#stop as well.

@ivandika3 (Contributor, Author) commented Nov 22, 2024:

Thanks for the review. Let me think about it. My current idea is to shut down the ExecutorService when stopping the ReplicationManager and reinitialize it in start. Let me write more extensive tests to check this, since notifyStatusChanged runs inside Ratis, and if any uncaught exception related to the ExecutorService (e.g. NullPointerException or RejectedExecutionException) is thrown, it might stop the whole StateMachineUpdater.
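
To illustrate that concern (a hedged sketch with made-up names, not the actual change): any submit done from notifyStatusChanged would need a guard along these lines so a stopped or missing executor cannot take down the StateMachineUpdater.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical guard, not the patch itself: if the executor was shut down in
// ReplicationManager#stop (or not yet re-created in start), submitting from
// notifyStatusChanged() must not let NullPointerException or
// RejectedExecutionException propagate into the Ratis StateMachineUpdater.
final class SafeSubmit {

  static void submitQuietly(ExecutorService executor, Runnable task) {
    try {
      if (executor != null) {
        executor.submit(task);
      }
    } catch (RejectedExecutionException e) {
      // Executor already shut down: skip this scan instead of crashing Ratis.
      System.err.println("Skipping inflight move scan: " + e);
    }
  }
}
```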

@ivandika3 (Contributor, Author) commented Nov 22, 2024:

I think another issue is that ReplicationManager#stop is not replicated to the other SCMs, so one SCM might have the running flag set to false while another has it set to true. If, for example, we transfer leadership from the SCM with the stopped RM to the one with the running RM, notifyStatusChanged will still trigger the inflightMoveScannerExecutor.

In my opinion, for now we can let the inflightMoveScannerExecutor send some replication commands even if the RM has been stopped. The inflightMoveScannerExecutor should just stay alive until the whole SCM stops, at which point the executor is garbage collected and shut down through its finalize mechanism.

Contributor:

Yes, we can stop the thread only when the SCM stops.

Contributor:

ReplicationManager#stop will be called only when the SCM shuts down or when the user sends the "stop replicationManager" command, right?
So maybe we can stop the inflightMoveScannerExecutor in ReplicationManager#stop.

Contributor Author:

Yes, ReplicationManager#stop is called either on SCM shutdown (through SCMServiceManager#stop) or through the "stop replicationManager" command.

The reason I decided not to shut down the inflightMoveScannerExecutor is that once the ExecutorService is shut down, we cannot submit any new tasks to it, so we would need to recreate the ExecutorService every time in start. However, managing the ExecutorService lifecycle across the different Raft states has edge cases and is error prone (it might cause RejectedExecutionException or NullPointerException), which could terminate the StateMachineUpdater. Therefore, I tried to keep it simple and just let the ExecutorService stop when the SCM shuts down (through the ExecutorService finalizer mechanism).
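
For contrast, a hypothetical sketch (made-up class, not proposed code) of the recreate-on-start lifecycle being argued against, which is where the error-prone window comes from:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical recreate-on-start lifecycle: stop() shuts the executor down and
// start() builds a new one. The risk is the window where the field is null or
// already shut down when notifyStatusChanged() runs, producing
// NullPointerException or RejectedExecutionException.
class RestartableExecutorSketch {

  private volatile ExecutorService inflightMoveScannerExecutor;

  synchronized void start() {
    inflightMoveScannerExecutor = Executors.newSingleThreadExecutor();
  }

  synchronized void stop() throws InterruptedException {
    ExecutorService executor = inflightMoveScannerExecutor;
    if (executor != null) {
      executor.shutdown();
      executor.awaitTermination(5, TimeUnit.SECONDS);
      inflightMoveScannerExecutor = null;
    }
  }
}
```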

Contributor:

OK, understood.

@ivandika3 ivandika3 marked this pull request as ready for review November 22, 2024 05:41
@ivandika3 ivandika3 changed the title HDDS-11750. LegacyReplicationManager#notifyStatusChanged should not submit Ratis request HDDS-11750. LegacyReplicationManager#notifyStatusChanged should Ratis request asynchronously Nov 25, 2024
@ivandika3 ivandika3 changed the title HDDS-11750. LegacyReplicationManager#notifyStatusChanged should Ratis request asynchronously HDDS-11750. LegacyReplicationManager#notifyStatusChanged should submit Ratis request asynchronously Nov 25, 2024
@siddhantsangwan (Contributor) commented:

I haven't reviewed the pull request, but the "Container Move.pdf" attachment in https://issues.apache.org/jira/browse/HDDS-4656 could describe the design decisions behind this.
