
HDDS-11750. LegacyReplicationManager#notifyStatusChanged should submit Ratis request asynchronously #7459

Open
wants to merge 4 commits into master

Conversation

@ivandika3 (Contributor) commented Nov 20, 2024

What changes were proposed in this pull request?

We encountered an issue where the SCM gets stuck after a leadership transfer, causing all client requests (including those from OMs) to time out.

We saw that the SCM was throwing TimeoutException in StateMachineUpdater (the thread in charge of applying Raft logs and completing user requests), causing all SCM request processing to get stuck.

2024-11-18 15:54:50,182 [daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater] ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception while cleaning up excess replicas.
java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
        at com.sun.proxy.$Proxy19.completeMove(Unknown Source)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.deleteSrcDnForMove(ReplicationManager.java:1610)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.lambda$onLeaderReadyAndOutOfSafeMode$36(ReplicationManager.java:2364)
        at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.onLeaderReadyAndOutOfSafeMode(ReplicationManager.java:2342)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.notifyStatusChanged(ReplicationManager.java:2103)
        at org.apache.hadoop.hdds.scm.ha.SCMServiceManager.notifyStatusChanged(SCMServiceManager.java:53)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:338)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1650)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:239)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:182)
        at java.lang.Thread.run(Thread.java:748)
2024-11-18 15:54:50,183 [daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater] INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: can not remove source replica after successfully replicated to target datanode
2024-11-18 15:54:50,745 [EventQueue-CloseContainerForCloseContainerEventHandler] ERROR org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on execution message #8764007
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy16.updateContainerState(Unknown Source)
        at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:332)
        at org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:82)
        at org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:51)
        at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
        ... 8 more
2024-11-18 15:54:50,746 [EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager: Could not commit delete block transactions: []
java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
        at com.sun.proxy.$Proxy17.removeTransactionsFromDB(Unknown Source)
        at org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager.commitTransactions(SCMDeletedBlockTransactionStatusManager.java:527)
        at org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:384)
        at org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:73)
        at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748) 

We found that the root cause is the following call chain.

  • StateMachine#notifyTermIndexUpdated (triggered by the leadership transfer)
    • ReplicationManager#notifyStatusChanged
      • LegacyReplicationManager#notifyStatusChanged
        • LegacyReplicationManager#onLeaderReadyAndOutOfSafeMode
          • LegacyReplicationManager#deleteSrcDnForMove
            • LegacyReplicationManager.MoveScheduler#completeMove (annotated for replication, so it submits a Ratis request)
              • SCMHAInvocationHandler#invokeRatis
                • SCMRatisServerImpl#submitRequest
                  • RaftServerImpl#submitClientRequestAsync

We should never submit a Ratis request from ReplicationManager#notifyStatusChanged, since this causes a deadlock with the Ratis StateMachineUpdater. When ReplicationManager#notifyStatusChanged calls MoveScheduler#completeMove, it submits a Ratis request to the Raft server and waits until the log entry associated with it is applied by the StateMachineUpdater. However, since ReplicationManager#notifyStatusChanged itself runs on the StateMachineUpdater thread, the wait blocks the StateMachineUpdater, so the Raft log entry for the request submitted by MoveScheduler#completeMove can never be applied: a deadlock. As a result, the StateMachineUpdater gets stuck and most SCM client requests time out.
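
To make the deadlock concrete, here is a minimal standalone sketch (illustrative only, not Ozone code; the class and variable names are made up) of the pattern: the thread standing in for the StateMachineUpdater blocks on a future that only it can complete, so the wait can only end with the TimeoutException seen in the log above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified illustration: the "state machine updater" thread submits a
// request whose completion requires that same thread to apply a later log
// entry, then blocks waiting for it.
public class StateMachineDeadlockDemo {
  public static void main(String[] args) throws Exception {
    // Single thread standing in for Ratis' StateMachineUpdater.
    ExecutorService stateMachineUpdater = Executors.newSingleThreadExecutor();

    // Completed only when the updater thread "applies" the log entry.
    CompletableFuture<Void> applied = new CompletableFuture<>();

    stateMachineUpdater.submit(() -> {
      // Inside notifyStatusChanged(): submit a "Ratis request" and block on it.
      try {
        applied.get(3, TimeUnit.SECONDS);   // waits for the updater thread...
      } catch (TimeoutException e) {
        System.out.println("TimeoutException: the updater is waiting on itself");
      } catch (Exception ignored) {
      }
      // ...but the apply step below can only run after this task finishes,
      // so the wait above can never succeed.
      applied.complete(null);
    });

    stateMachineUpdater.shutdown();
    stateMachineUpdater.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```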

Currently, one possible fix might be to remove the onLeaderReadyAndOutOfSafeMode implementation altogether, in the hope that the inflight moves will be picked up by the main ReplicationManager thread.

Note: The issue should STILL happen after HDDS-10690. Although StateMachine#notifyTermIndexUpdated will no longer trigger ReplicationManager#notifyStatusChanged, StateMachine#notifyLeaderReady will trigger ReplicationManager#notifyStatusChanged instead, and StateMachine#notifyLeaderReady is still called on the StateMachineUpdater thread through RaftServerImpl#applyLogToStateMachine.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11750

How was this patch tested?

Integration test to reproduce the bug and verify the fix.

Failure before the fix: https://github.com/ivandika3/ozone/actions/runs/11965356725/job/33359496259

Success after fix: https://github.com/ivandika3/ozone/actions/runs/11965440504

@ivandika3 (Contributor, Author) commented Nov 20, 2024

@siddhantsangwan @lokeshj1703 @JacksonYao287 Please let me know what you think and whether it is safe to remove onLeaderReadyAndOutOfSafeMode altogether. This bug caused our SCM service to get stuck, resulting in cluster-wide unavailability.

cc: @xichen01 @symious @ferhui

@sodonnel (Contributor) commented:

I am not sure about this change. All that logic is there for some reason, so just removing it seems risky. What version are you currently running? Can you switch off the LegacyRM and use the new one?

@ivandika3 (Contributor, Author) commented Nov 20, 2024

@sodonnel Thanks for taking a look at this.

> All that logic is there for some reason, so just removing it seems risky

I'm not really well-versed in the design decisions for the container move in https://issues.apache.org/jira/browse/HDDS-5253. I'm currently looking at the possible reasons why this particular code was introduced and what the impact of removing it would be. I'm open to suggestions from any community members who are familiar with this.

> What version are you currently running?

Our main cluster version is based on 1.2.1, but we have backported a significant number of features and fixes. We are planning to upgrade our main cluster to 1.4.1 very soon.

However, even in 1.4.1 this particular issue might still happen (albeit very rarely). Therefore, I believe we need to address this, since LegacyReplicationManager is still technically supported.

> Can you switch off the LegacyRM and use the new one?

Yes, due to this bug we are now strongly considering switching to the new ReplicationManager, which does not persistently keep track of inflight moves (and will not submit a Ratis request in notifyStatusChanged). Can I verify whether there are any known regressions in moving from LegacyReplicationManager to ReplicationManager? I remember asking quite a while ago whether we can just upgrade from LegacyReplicationManager to ReplicationManager without regressions, and it seems there might not be any.

@@ -1896,9 +1896,6 @@ public int getContainerInflightDeletionLimit() {
   }

   protected void notifyStatusChanged() {
-    //now, as the current scm is leader and it`s state is up-to-date,
-    //we need to take some action about replicated inflight move options.
-    onLeaderReadyAndOutOfSafeMode();
@adoroszlai (Contributor) commented Nov 20, 2024:

I think we can fix it by deferring onLeaderReadyAndOutOfSafeMode to another thread.

@ivandika3 (Contributor, Author) commented Nov 20, 2024:

Thanks for the idea, that's another possible solution I can think of. Let me try this direction.

This should require synchronizing the containers tracked in the inflight move map.
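
For reference, a rough sketch of that direction (the executor name follows the later discussion in this thread; everything else is an assumption, not the final patch):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the deferral idea, not the actual patch: hand the
// blocking work to a dedicated single-thread executor so notifyStatusChanged()
// returns immediately and never blocks the Ratis StateMachineUpdater thread.
class AsyncNotifyStatusChangedSketch {

  private final ExecutorService inflightMoveScannerExecutor =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "InflightMoveScanner");
        t.setDaemon(true);
        return t;
      });

  void notifyStatusChanged() {
    inflightMoveScannerExecutor.submit(() -> {
      try {
        // May submit Ratis requests and block, which is now safe because it
        // no longer runs on the StateMachineUpdater thread.
        onLeaderReadyAndOutOfSafeMode();
      } catch (Throwable t) {
        // Never let an exception escape back into the caller.
        System.err.println("Failed to process inflight moves: " + t);
      }
    });
  }

  private void onLeaderReadyAndOutOfSafeMode() {
    // Placeholder for the real logic that scans inflight moves; access to the
    // shared inflight move state would need to be synchronized, as noted above.
  }
}
```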

@sodonnel (Contributor) commented:

There are no regressions that we know of when moving from the Legacy to the new RM.

There is also a suggestion in the 2.0 release plan to remove the code for Legacy RM as it is not our intention to use or support it going forward.

@xichen01 (Contributor) left a comment:

@ivandika3 Thanks for the patch. Overall LGTM. Just a nit for you.

@@ -333,6 +341,11 @@ public LegacyReplicationManager(final ConfigurationSource conf,
         .setDBTransactionBuffer(scmhaManager.getDBTransactionBuffer())
         .setRatisServer(scmhaManager.getRatisServer())
         .setMoveTable(moveTable).build();
+
+    inflightMoveScannerExecutor = Executors.newSingleThreadExecutor(
Contributor:
We should stop the thread in ReplicationManager#stop as well.

@ivandika3 (Contributor, Author) commented Nov 22, 2024:

Thanks for the review. Let me think about it. My current idea is to shut down the ExecutorService when stopping the ReplicationManager and reinitialize it in start. Let me write more extensive tests to check this, since notifyStatusChanged runs inside Ratis, and if any uncaught exception related to the ExecutorService (e.g. NullPointerException or RejectedExecutionException) is thrown, it might stop the whole StateMachineUpdater.
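
To illustrate that concern (a hedged sketch with made-up names, not the actual change): any submit done from notifyStatusChanged would need a guard along these lines so a stopped or missing executor cannot take down the StateMachineUpdater.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical guard, not the patch itself: if the executor was shut down in
// ReplicationManager#stop (or not yet re-created in start), submitting from
// notifyStatusChanged() must not let NullPointerException or
// RejectedExecutionException propagate into the Ratis StateMachineUpdater.
final class SafeSubmit {

  static void submitQuietly(ExecutorService executor, Runnable task) {
    try {
      if (executor != null) {
        executor.submit(task);
      }
    } catch (RejectedExecutionException e) {
      // Executor already shut down: skip this scan instead of crashing Ratis.
      System.err.println("Skipping inflight move scan: " + e);
    }
  }
}
```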

@ivandika3 (Contributor, Author) commented Nov 22, 2024:

I think another issue is that ReplicationManager#stop is not replicated to the other SCMs, so one SCM might have the running flag set to false while another has it set to true. If, for example, we transfer leadership from the SCM with the stopped RM to the one with the running RM, notifyStatusChanged will still trigger the inflightMoveScannerExecutor.

In my opinion, for now we can let the inflightMoveScannerExecutor send some replication commands even if the RM has been stopped. The inflightMoveScannerExecutor should just stay alive until the whole SCM stops, at which point the executor is garbage collected and shut down through its finalize mechanism.

Contributor:

Yes, we can stop the thread only when the SCM stops.

Contributor:

ReplicationManager#stop will be called only when the SCM shuts down or when the user sends the "stop replicationManager" command, right?
So maybe we can stop the inflightMoveScannerExecutor in ReplicationManager#stop.

Contributor Author:

Yes, ReplicationManager#stop is called either on SCM shutdown (through SCMServiceManager#stop) or through the "stop replicationManager" command.

The reason I decided not to shut down the inflightMoveScannerExecutor is that once the ExecutorService is shut down, we cannot submit any new tasks to it, so we would need to recreate the ExecutorService every time in start. However, managing the ExecutorService lifecycle across the different Raft states has edge cases and is error prone (it might cause RejectedExecutionException or NullPointerException), which could terminate the StateMachineUpdater. Therefore, I tried to keep it simple and just let the ExecutorService stop when the SCM shuts down (through the ExecutorService finalizer mechanism).
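
For contrast, a hypothetical sketch (made-up class, not proposed code) of the recreate-on-start lifecycle being argued against, which is where the error-prone window comes from:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical recreate-on-start lifecycle: stop() shuts the executor down and
// start() builds a new one. The risk is the window where the field is null or
// already shut down when notifyStatusChanged() runs, producing
// NullPointerException or RejectedExecutionException.
class RestartableExecutorSketch {

  private volatile ExecutorService inflightMoveScannerExecutor;

  synchronized void start() {
    inflightMoveScannerExecutor = Executors.newSingleThreadExecutor();
  }

  synchronized void stop() throws InterruptedException {
    ExecutorService executor = inflightMoveScannerExecutor;
    if (executor != null) {
      executor.shutdown();
      executor.awaitTermination(5, TimeUnit.SECONDS);
      inflightMoveScannerExecutor = null;
    }
  }
}
```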

Contributor:

OK, understood.

@ivandika3 ivandika3 marked this pull request as ready for review November 22, 2024 05:41
@ivandika3 ivandika3 changed the title HDDS-11750. LegacyReplicationManager#notifyStatusChanged should not submit Ratis request HDDS-11750. LegacyReplicationManager#notifyStatusChanged should Ratis request asynchronously Nov 25, 2024
@ivandika3 ivandika3 changed the title HDDS-11750. LegacyReplicationManager#notifyStatusChanged should Ratis request asynchronously HDDS-11750. LegacyReplicationManager#notifyStatusChanged should submit Ratis request asynchronously Nov 25, 2024
@siddhantsangwan (Contributor) commented:

I haven't reviewed the pull request, but the "Container Move.pdf" attachment in https://issues.apache.org/jira/browse/HDDS-4656 could describe the design decisions behind this.
