Cluster Unavailability During Node Removal #4260
Comments
It's hard to tell without the current and previous placement (e.g. debug bundles) and graphs from the time of the issue. Since a 4-node cluster can only tolerate a single node failure, another node bootstrapping or restarting at the time of the delete would cause write and/or query failures. Was the cluster created with RF=3?
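To make the single-failure point concrete, here is a rough sketch of the majority-quorum arithmetic. It assumes RF=3 and a majority consistency level, neither of which is confirmed in this thread; the function names are illustrative and not part of M3.

```python
def majority(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def tolerated_unavailable(replication_factor: int) -> int:
    """Replicas of one shard that can be unavailable while a majority remains."""
    return replication_factor - majority(replication_factor)


# Assumption: RF=3 with majority consistency.
rf = 3
print(majority(rf))               # 2
print(tolerated_unavailable(rf))  # 1
```

Under these assumptions each shard can lose one replica. If a node is being removed while a second node holding the same shard is restarting or bootstrapping, that shard is left with a single ready replica and majority requests for it would fail.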
Thank you very much for your response! My cluster had 5 nodes, with a placement like this: when I ran this command: the placement then looked like this: and when I then called the query endpoint host:7201/api/v1/query, it returned an error. What other information do you need?
"Is this a bug? Or is it intended to be designed this way? Or did I do something wrong?"
What state is shard 22 in across nodes? What's the query consistency level? It might be that a single shard isn't bootstrapped across all replicas. There are no known issues with this sort of configuration and node operations.
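One way to answer the shard-state question is to look the shard up in the placement JSON. The sketch below assumes the response of a GET on the placement endpoint has the shape `placement.instances.<id>.shards[]`, with `id` and `state` fields on each shard; the node names and sample data are hypothetical and are not taken from this cluster.

```python
from typing import Dict


def shard_states(placement: dict, shard_id: int) -> Dict[str, str]:
    """Map instance ID -> state of the given shard on that instance."""
    states = {}
    instances = placement.get("placement", {}).get("instances", {})
    for instance_id, instance in instances.items():
        for shard in instance.get("shards", []):
            if shard.get("id") == shard_id:
                states[instance_id] = shard.get("state", "UNKNOWN")
    return states


def count_available(states: Dict[str, str]) -> int:
    """Count replicas reported as AVAILABLE (a conservative simplification)."""
    return sum(1 for state in states.values() if state == "AVAILABLE")


# Hypothetical placement snapshot taken while a node is being removed.
sample = {
    "placement": {
        "instances": {
            "node-a": {"shards": [{"id": 22, "state": "AVAILABLE"}]},
            "node-b": {"shards": [{"id": 22, "state": "INITIALIZING"}]},
            "node-c": {"shards": [{"id": 22, "state": "LEAVING"}]},
            "node-d": {"shards": [{"id": 7, "state": "AVAILABLE"}]},
        }
    }
}

states = shard_states(sample, 22)
print(states)
print(count_available(states))
```

Counting only `AVAILABLE` replicas is deliberately pessimistic. How M3 treats `LEAVING` or `INITIALIZING` replicas for a given consistency level is not modeled here, so a low count is a prompt to look closer, not proof of a failure.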
Thanks again!
I don't see anything wrong with the config; the bootstrappers and consistency level are correct for this setup.
This is roughly 1,100 commits behind the latest release, so definitely try upgrading first.
Okay, then I'll upgrade the version and try again.
Hello M3DB Community,
I am encountering an issue with my M3DB cluster during a node removal operation. My cluster initially had 5 nodes, and I needed to scale down to 4 nodes. To do this, I used the following command:
curl -X DELETE <M3_COORDINATOR_HOST_NAME>:<M3_COORDINATOR_PORT(default 7201)>/api/v1/services/m3db/placement/<NODE_ID>
After executing this command, the cluster began rebalancing the shard data as expected. However, I faced an issue where the cluster became unavailable during this process. Here are some details:
Cluster Size Before Removal: 5 nodes
Node Removal Process: Using the curl command above
Observed Issue: Cluster became unavailable during shard rebalancing
I followed the operational guidelines available at M3DB Operational Guide, but I am unsure what might have gone wrong. My expectation was that the cluster should remain available during a scale-down operation.
Could you please help me understand the following:
What are the common causes for a cluster becoming unavailable during a node removal process?
Are there any specific configurations or precautions that need to be taken to ensure cluster availability during such operations?
Is there any known issue or limitation with the version of M3DB that might affect the node removal process?
Any insights or guidance would be greatly appreciated. I am happy to provide more details if needed.
Thank you in advance for your help!