HDDS-11750. LegacyReplicationManager#notifyStatusChanged should submit Ratis request asynchronously #7459
base: master
Conversation
…ubmit Ratis request
@siddhantsangwan @lokeshj1703 @JacksonYao287 Please let me know what you think and whether it is safe to remove the onLeaderReadyAndOutOfSafeMode call from notifyStatusChanged.
I am not sure about this change. All that logic is there for some reason, so just removing it seems risky. What version are you currently running? Can you switch off the LegacyRM and use the new one?
@sodonnel Thanks for taking a look at this.
I'm not really well-versed in the design decisions for container move in https://issues.apache.org/jira/browse/HDDS-5253. Currently, I'm looking at the possible reasons why this particular code was introduced and what the impact of removing it would be. I'm open to suggestions from any community members who are familiar with this.
Our main cluster version is based on 1.2.1, but we have backported a significant number of features and fixes. We are planning to upgrade our main cluster to 1.4.1 very soon. However, even in 1.4.1 this particular issue can still happen (albeit very rarely). Therefore, I believe we need to address it, since LegacyReplicationManager is technically still supported.
Yes, due to this bug we are now strongly considering using the new ReplicationManager, which does not persistently keep track of inflight moves (and will not submit a Ratis request in notifyStatusChange). Can I verify whether there are any known regressions in moving from LegacyReplicationManager to ReplicationManager? I remember asking quite a while ago whether we could upgrade from LegacyReplicationManager to ReplicationManager without regressions, and it seemed there might not be any.
@@ -1896,9 +1896,6 @@ public int getContainerInflightDeletionLimit() {
  }

  protected void notifyStatusChanged() {
    //now, as the current scm is leader and it`s state is up-to-date,
    //we need to take some action about replicated inflight move options.
    onLeaderReadyAndOutOfSafeMode();
I think we can fix it by deferring onLeaderReadyAndOutOfSafeMode to another thread.
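A minimal sketch of that idea, assuming a dedicated single-thread executor; the class, field, and method bodies below are placeholders, not the actual LegacyReplicationManager members:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only, not the real LegacyReplicationManager. The point is that
// notifyStatusChanged() hands the work to another thread instead of running it on the
// StateMachineUpdater thread.
class DeferredStatusChangeSketch {

  private final ExecutorService inflightMoveScannerExecutor =
      Executors.newSingleThreadExecutor();

  void notifyStatusChanged() {
    // Returns immediately; onLeaderReadyAndOutOfSafeMode() runs on the executor thread,
    // so any Ratis request it submits can still be applied by the StateMachineUpdater.
    inflightMoveScannerExecutor.submit(this::onLeaderReadyAndOutOfSafeMode);
  }

  private void onLeaderReadyAndOutOfSafeMode() {
    // In the real code this resumes or cancels pending container moves, which may
    // submit Ratis requests.
  }
}
```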
Thanks for the idea, that's another possible solution I had in mind. Let me try this direction. It should require synchronizing the containers in inflight move.
There are no regressions that we know of in moving from the Legacy to the new RM. There is also a suggestion in the 2.0 release plan to remove the code for the Legacy RM, as it is not our intention to use or support it going forward.
…ld not submit Ratis request" This reverts commit b2b2cbf.
@ivandika3 Thanks for the patch. Overall LGTM. Just a few nits for you.
@@ -333,6 +341,11 @@ public LegacyReplicationManager(final ConfigurationSource conf,
        .setDBTransactionBuffer(scmhaManager.getDBTransactionBuffer())
        .setRatisServer(scmhaManager.getRatisServer())
        .setMoveTable(moveTable).build();

    inflightMoveScannerExecutor = Executors.newSingleThreadExecutor(
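The last line of the hunk is truncated above; such a single-thread executor is often created with a named daemon thread, roughly as sketched below. The thread factory and thread name are assumptions for illustration, not necessarily what the patch uses:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ExecutorInitSketch {
  // Hypothetical continuation of the truncated line: a single worker thread with a
  // descriptive name so it is easy to spot in thread dumps.
  private final ExecutorService inflightMoveScannerExecutor =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "InflightMoveScanner");
        t.setDaemon(true);
        return t;
      });
}
```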
We had better stop the thread in ReplicationManager#stop too.
Thanks for the review. Let me think about it. My current idea is to shut down the ExecutorService when stopping the ReplicationManager and to reinitialize the ExecutorService in start. Let me write more extensive tests to check this, since notifyStatusChanged runs in Ratis, and if any uncaught exception related to the ExecutorService (e.g. NPE or RejectedExecutionException) is thrown, it might stop the whole StateMachineUpdater.
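A rough sketch of that direction, with a guard so that a RejectedExecutionException cannot propagate into the StateMachineUpdater. The class, fields, and start/stop hooks are illustrative; in the real code the lifecycle would live in ReplicationManager:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

class ExecutorLifecycleSketch {

  private volatile ExecutorService inflightMoveScannerExecutor;

  synchronized void start() {
    // A shut-down ExecutorService rejects all new tasks, so it must be recreated
    // on every start.
    if (inflightMoveScannerExecutor == null || inflightMoveScannerExecutor.isShutdown()) {
      inflightMoveScannerExecutor = Executors.newSingleThreadExecutor();
    }
  }

  synchronized void stop() {
    if (inflightMoveScannerExecutor != null) {
      inflightMoveScannerExecutor.shutdown();
    }
  }

  void notifyStatusChanged() {
    ExecutorService executor = inflightMoveScannerExecutor;
    if (executor == null || executor.isShutdown()) {
      return;
    }
    try {
      executor.submit(this::onLeaderReadyAndOutOfSafeMode);
    } catch (RejectedExecutionException e) {
      // Log and drop rather than letting the exception reach the StateMachineUpdater,
      // which would stop Raft log application.
    }
  }

  private void onLeaderReadyAndOutOfSafeMode() {
    // Resumes or cancels pending container moves in the real code.
  }
}
```

Even with the guard, there is a window between the isShutdown() check and submit(), which is part of why the discussion below leans toward keeping the executor alive for the whole SCM lifetime.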
I think another issue is that ReplicationManager#stop will not be replicated to the other SCMs, meaning that one SCM might have the running flag set to false while another has the running flag set to true. If, for example, we transfer leadership from the one with the stopped RM to the one with the running RM, notifyStatusChanged will still trigger the inflightMoveScannerExecutor.
In my opinion, for now we can let inflightMoveScannerExecutor send some replicate commands even if the RM is stopped. inflightMoveScannerExecutor should just stay on until the whole SCM is stopped and the executor is garbage collected and shut down via its finalize mechanism.
Yes, we can stop the thread only when the SCM stops.
ReplicationManager#stop will be called only when the SCM shuts down or the user sends the "stop replicationManager" command, right?
So maybe we can let the inflightMoveScannerExecutor stop in ReplicationManager#stop.
Yes, ReplicationManager#stop is called either on SCM shutdown (through SCMServiceManager#stop) or through the "stop replicationManager" command.
The reason I decided not to shut down the inflightMoveScannerExecutor is that once the ExecutorService is shut down, we cannot submit any new tasks to it, so we would need to recreate the ExecutorService every time in start. However, managing the ExecutorService's lifecycle across the different Raft states might have edge cases and be error prone (which might cause a RejectedExecutionException or NullPointerException), and that might terminate the StateMachineUpdater. Therefore, I tried to keep it simple and just let the ExecutorService stop when the SCM shuts down (through the ExecutorService finalizer mechanism).
OK, understood.
I haven't reviewed the pull request, but the "Container Move.pdf" attachment in https://issues.apache.org/jira/browse/HDDS-4656 could describe the design decisions behind this.
What changes were proposed in this pull request?
We encountered an issue where the SCM got stuck after a leadership transfer, causing the SCM leader to hang and all client requests (including those from OMs) to time out.
We saw that the SCM was throwing a TimeoutException in the StateMachineUpdater (the thread in charge of applying Raft logs and completing user requests), causing all SCM request processing to be stuck.
We found that the root cause is the call chain described below.
We should never send a Ratis request from ReplicationManager#notifyStatusChanged, since this causes a deadlock with the Ratis StateMachineUpdater. When ReplicationManager#notifyStatusChanged calls MoveScheduler#completeMove, it sends a Ratis request to the Raft server and waits until the log entry associated with it is applied by the StateMachineUpdater. However, since ReplicationManager#notifyStatusChanged itself runs on the StateMachineUpdater, it blocks the StateMachineUpdater, meaning the Raft log entry associated with the request sent by MoveScheduler#completeMove will never be applied and there is a deadlock. This causes the StateMachineUpdater to get stuck and most SCM client requests to time out.
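The shape of this deadlock can be reproduced in isolation with a plain single-thread executor. The following is only an analogy (not Ozone code): the single worker thread stands in for the StateMachineUpdater and waits on a task that only it can execute:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SelfDeadlockDemo {
  public static void main(String[] args) throws Exception {
    // Stands in for the StateMachineUpdater: a single thread that applies "log entries".
    ExecutorService stateMachineUpdater = Executors.newSingleThreadExecutor();

    Future<?> outer = stateMachineUpdater.submit(() -> {
      // "notifyStatusChanged" running on the updater thread submits more work
      // (analogous to the Ratis request from MoveScheduler#completeMove) ...
      Future<?> inner = stateMachineUpdater.submit(
          () -> System.out.println("move-complete log entry applied"));
      try {
        // ... and blocks waiting for it, but the only thread that could run it
        // is the one doing the waiting, so this always times out.
        inner.get(2, TimeUnit.SECONDS);
      } catch (TimeoutException e) {
        System.out.println("deadlock: the inner task was never applied");
        inner.cancel(false);
      } catch (InterruptedException | ExecutionException e) {
        Thread.currentThread().interrupt();
      }
    });

    outer.get();
    stateMachineUpdater.shutdownNow();
  }
}
```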
Currently, one possible fix might be to remove the onLeaderReadyAndOutOfSafeMode implementation altogether, in the hope that the inflight moves will be picked up by the main ReplicationManager thread.
Note: the issue should STILL happen after HDDS-10690. Although StateMachine#notifyTermIndexUpdated will no longer trigger ReplicationManager#notifyStatusChanged, StateMachine#notifyLeaderReady will trigger ReplicationManager#notifyStatusChanged instead, and StateMachine#notifyLeaderReady is still called in the StateMachineUpdater through RaftServerImpl#applyLogToStateMachine.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11750
How was this patch tested?
Integration test to reproduce the bug and verify the fix.
Failure before the fix: https://github.com/ivandika3/ozone/actions/runs/11965356725/job/33359496259
Success after fix: https://github.com/ivandika3/ozone/actions/runs/11965440504