[Per Partition Automatic Failover] Concurrent Detection of Write Regions During Failover #4858

kundadebdatta · 2024-10-28T20:00:11Z

Acceptance Criteria:

During a partition-level failover through the .NET v3 SDK, it has often been observed that detecting the new write region takes a long time, sometimes more than a minute. There are two primary reasons for this:

The backend takes at least a minute to fail the offending partition over to another region.
The SDK's current write-region detection logic works in a round-robin fashion: the SDK loops through the regions in the account topology one at a time until it finds the new write region.
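To make the cost of the round-robin approach concrete, here is a minimal Python simulation. It is not the .NET SDK's code: the region names, latencies, `probe_region`, and `WriteRejected` are hypothetical stand-ins for a write attempt that a non-write region rejects. Because regions are tried one after another, the worst-case detection time is the sum of every rejected attempt plus the successful one.

```python
import asyncio
import time
from typing import Dict, List, Optional, Tuple

# Hypothetical regions: name -> (simulated latency in seconds, is_write_region).
REGIONS: Dict[str, Tuple[float, bool]] = {
    "East US": (0.2, False),
    "West US": (0.2, False),
    "North Europe": (0.2, True),
}


class WriteRejected(Exception):
    """Simulated rejection from a region that is not the current write region."""


async def probe_region(region: str) -> str:
    # Hypothetical stand-in for a write attempt against a single region.
    latency, is_write_region = REGIONS[region]
    await asyncio.sleep(latency)
    if not is_write_region:
        raise WriteRejected(region)
    return region


async def detect_round_robin(regions: List[str]) -> Optional[str]:
    # Try one region at a time; every rejection costs its full latency
    # before the next region is even contacted.
    for region in regions:
        try:
            return await probe_region(region)
        except WriteRejected:
            continue
    return None


if __name__ == "__main__":
    start = time.perf_counter()
    found = asyncio.run(detect_round_robin(list(REGIONS)))
    # Elapsed time is at least the sum of the three simulated latencies.
    print(found, f"{time.perf_counter() - start:.2f}s")
```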

Although optimizing the backend failover time is beyond the scope of the SDK, the SDK's own detection logic can be made faster. This design proposes detecting the write region in parallel by issuing concurrent hedging requests to all available regions in the account topology.
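A minimal sketch of concurrent detection, again as a Python `asyncio` simulation rather than the SDK's implementation: one hedged probe is sent to every region at once, the first region that accepts is taken as the write region, and probes still in flight are cancelled. The region table, `probe_region`, `WriteRejected`, and `detect_concurrently` are hypothetical. The sketch only models detection; whether a hedged write could be applied more than once, and how the real SDK would prevent that, is not shown here.

```python
import asyncio
import time
from typing import Dict, List, Optional, Tuple

# Hypothetical regions: name -> (simulated latency in seconds, is_write_region).
REGIONS: Dict[str, Tuple[float, bool]] = {
    "East US": (0.2, False),
    "West US": (0.2, False),
    "North Europe": (0.2, True),
}


class WriteRejected(Exception):
    """Simulated rejection from a region that is not the current write region."""


async def probe_region(region: str) -> str:
    # Hypothetical stand-in for a hedged write attempt against one region.
    latency, is_write_region = REGIONS[region]
    await asyncio.sleep(latency)
    if not is_write_region:
        raise WriteRejected(region)
    return region


async def detect_concurrently(regions: List[str]) -> Optional[str]:
    # Issue one probe per region at the same time.
    tasks = [asyncio.create_task(probe_region(r)) for r in regions]
    try:
        # Inspect probes in completion order; the first success wins.
        for next_done in asyncio.as_completed(tasks):
            try:
                return await next_done
            except WriteRejected:
                continue  # not the write region; keep waiting for others
        return None
    finally:
        # Cancel probes still in flight and wait for them to settle.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    start = time.perf_counter()
    found = asyncio.run(detect_concurrently(list(REGIONS)))
    print(found, f"{time.perf_counter() - start:.2f}s")
```

With the probes running in parallel, detection time is roughly the write region's own response time instead of the sum of every attempt, which is the gain this design targets.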

kundadebdatta added the PerPartitionAutomaticFailover label Oct 28, 2024

kundadebdatta self-assigned this Oct 28, 2024

kundadebdatta added this to Azure Cosmos SDKs Oct 28, 2024

kundadebdatta moved this to In Progress in Azure Cosmos SDKs Oct 28, 2024

kundadebdatta commented Oct 28, 2024

[Per Partition Automatic Failover] Concurrent Detection of Write Regions During Failover #4858

[Per Partition Automatic Failover] Concurrent Detection of Write Regions During Failover #4858

Comments

kundadebdatta commented Oct 28, 2024

Acceptance Criteria: