-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Atomic Distributed Transactions #16245
Comments
This is very well written. Everything I could think of on a first read has already been addressed. |
Just some thoughts and questions - With respect to For For |
Could you explain case 4? Why do we rollback the transactions on RM1 and RM2? There was no transaction preparation done before this, right? |
We still have open transactions so we should roll them back otherwise they will remain holding locks till transaction timeout is not achieved. |
Me and @harshit-gangal were looking at what we would need to do to integrate atomic transactions into SwitchTraffic and OnlineDDL workflows. We found that both SwitchTraffic and OnlineDDL use the Query Rules stored in the Changes to these query rules govern how vttablet would run incoming queries. There is one gotcha here though, currently query rules don't impact the queries that have already run and are part of a transaction. So, when these transactions are committed, there is a possibility of going against the query rules even after they've been updated. This is a problem for atomic transactions because once we have prepared a transaction and the transaction manager decides to commit it, then we cannot refuse. So, for example, if a transaction is in a prepared state, and then online ddl changes the rules and proceeds with the cutover, the transaction might not be able to commit because of a structural change to the underlying table. Because both the operations are relying on query rules to communicate with the query server on what kind of queries to allow and deny, we think it's a good idea to use them for the transactions going into the prepared state too. Here is a timing diagram with the order of operations that we think will ensure that no atomic transaction interferes with the workflows - sequenceDiagram
participant SwitchTraffic
participant vttablet
participant onlineDDL
Note over SwitchTraffic: Update Denied<br/>Tables in Vschema
Note over onlineDDL: Register new Query Rules
SwitchTraffic-)vttablet: RefreshState
Note over vttablet: Refresh ShardInfo<br/>From Vschema
Note over vttablet: Update Query Rules
Note over vttablet: All new transactions while<br/>being Prepared will consult<br/>new query rules
vttablet-)SwitchTraffic: Success
SwitchTraffic-)vttablet: WaitForPrepared
onlineDDL-)vttablet: WaitForPrepared
Note over vttablet: Wait for all current set<br/>of prepared transactions<br/>to succeed.
Note over vttablet: New transactions are<br/>guaranteed to be using<br/>the new query rules.
vttablet-)SwitchTraffic: Success
vttablet-)onlineDDL: Success
Note over SwitchTraffic: Proceed
Note over onlineDDL: Proceed
|
Atomic Transactions with PRS and MySQL RestartsCurrent state and the problemsI have been looking at the code that tries to redo the prepared transactions in case of a MySQL restart or PRS, etc, and have found a few shortcomings.
Because we turn off MySQL read-only and open the query engine first, we can potentially have a write that goes through before the transaction engine has had a chance to try and redo the prepared transactions. This can cause the prepare to fail since the write might be incompatible with the ones that the prepared transaction is running. This problem is also present when MySQL restarts.
Solutions that were tried but didn't workMe and @harshit-gangal have talked about this issue and I've thought off and tested a few different solutions. These are some of the solutions I tried but didn't work (If this is not interesting to read, then you can skip this segment and go to the proposed solution) -
mysql [localhost:8033] {msandbox} (performance_schema) > set session read_only = 'false';
ERROR 1229 (HY000): Variable 'read_only' is a GLOBAL variable and should be set with SET GLOBAL
mysql [localhost:8033] {msandbox} (test) > INSERT INTO t1 values (234);
ERROR 1223 (HY000): Can't execute the query because you have a conflicting read lock
So even if we acquire the lock, we can't start the transaction to restore the state. Also we can't acquire the table lock after we start the transaction because again as per docs -
Proposed solutions that should work
Of the 2 proposed solutions, we like the second one more. It is less hacky and it feels like we are less likely to run into any unexpected issues with this fix. We still need to discuss what should we do in case the redoing of transactions fails. It shouldn't happen with proposed fixes ☝️, but if it does, then what should we do? Should we panic the vttablet to prevent any writes? Or should we just setup metrics for this so that the user is alerted but allow the vttablets and other writes to function? EDIT: We decided to go ahead with solution 2. The expectation is that the default mode that Vitess uses on replicas (super_read_only) is not being overridden. Atomic transactions are reliant on the guarantees provided by super_read_only and assume that all writes are being done using an In case of a failure, we want to allow vttablets to continue to take writes. We will add metrics around this so that users can get notified of these failures. We want to build vtctld commands to be able to see and resolve atomic transactions. These should then be used to implement the same functionality in vtadmin so that users can resolve transactions using the UI as well. Task List
|
Atomic Transactions with OnlineDDLWhat Online DDL does right nowI have looked at the Online DDL code and it is only the cutover step that needs to coordinate with atomic transactions. As it stands now, Online DDL does the following steps during a cutover (More details available at https://vitess.io/blog/2022-04-06-online-ddl-vitess-cut-over/) -
From the safety of online-ddl standpoint is concerned, the buffering and then acquiring locks on writes is sufficient to guarantee that we don't lose any writes that happen on the original table when we swap. Proposed changesBasically after we start the buffering and before we kill the transactions, we'll just need to wait for the prepared pool to have finished all the transactions present at that point in time. And we need to do something special for transactions trying to enter the prepared state. We have two alternates described below. Then we should be safe to kill all the remaining transactions (atomic or not). I'm hoping this will work quite nicely in terms of timing, because prepared transactions should really be committed/rollbacked pretty quickly. Transactions are in the prepared state only after the user has already run the commit command on vtgate, and we have completed the first part of 2pc. So if things are not broken (mostly they won't be), then it should be a matter of less than a second for it to receive the result of the decision (commit/rollback). So, unless something catastrophic happens that leave a transaction in the prepared state and we need the resolution to happen by a transaction watcher, the transactions in prepared state would be very short lived. Waiting is different for a transaction that the user has currently "open" (in the sense they haven't committed it), so the user can run more DMLs. We don't need to wait for them. They can be killed. Atomic transaction guarantee is based on the premise that prepared transactions can't fail, and we only enter the prepared state after the transaction has reached a conclusion from the user perspective (they have issued commit). Another thing that we need to consider apart from waiting for all prepared transactions to go through, is whether we allow other transactions to enter the prepared state -
Other Online DDL variants (ghost and pt-osc)The above discussion was all limited to online DDL being run using Vreplication. This is the only mode in which we have any control over the cutover phase. Both ghost and pt-osc are running as separate binaries, so it is going to be very hard to collaborate with either of them. We need to ensure that neither of them is killing transactions during their cutover phase. If they are, then it is going to be impossible to guarantee atomic transactions with them. Even if they aren't killing transactions, it might be very hard for them to find a window for running the cutover if the system is under enough write load because they won't be able to acquire the locks on the table. UPDATE: ghost has been deprecated and pt-osc has always been experimental. #15692 We can just state that their limitations with atomic transactions and move on. UPDATE: We decided to go ahead with the second solution. We will need to make sure we handle all races correctly because it could happen that the check for query rules in prepare state happens before we update the rules, but it takes some time for the transaction to show up in the prepared pool, so we could potentially have a prepared transaction not in the list of transactions that we wait for, but still using the table in question. To handle this, we need to be defensive in what we kill in the online DDL segment. We need to consult the prepared transactions pool to make sure we don't kill something that is prepared. Task List
|
Atomic Transactions with MoveTables and ReshardOrder of operations for MoveTables and ReshardingThe only operation that we are concerned about wrt to atomic transactions is the switch write. The following is the order of operations for Switch Write -
Proposed ChangesWe propose a similar fix to the one proposed for onlineDDL in #16245 (comment). On the vttablets we can always tell if a reshard or moveTables is in progress because of the fields that they set (DisableQueryService, and DeniedTables respectively). So, after we have updated the said field, we'll just need to wait for the prepared pool to have finished all the transactions present at that point in time. And like before, we will have to prevent new prepared transactions from being prepared by consulting these fields -
The same race conditions that we need to handle for online DDL stand here too. One thing that is different though, is that we'll have to introduce a new RPC for getting vtctld to wait for the prepared transactions to finish on the vttablet side. In the case of OnlineDDL the executor was also running on the vttablet so no RPC was required. But in this case we will need to implement something like Another important consideration is that the way that the code is written currently, for Update: We will go ahead with the proposed fix. We discussed a few other considerations that we might run into with Update: We went with an alternate way of making atomic transactions work with Resharding. Instead of doing the changes proposed ☝️, we instead went ahead and changed Task List
|
UPDATE: We don't need to do anything for query rules changes. Special waiting logic was required only for Reshard, MoveTables and OnlineDDL, not because they were adding query rules, but because they were going to do an operation that would fail COMMIT calls on prepared transactions. This is not the case for query rules changes made by the users. So, we are not doing anything right now. |
User Experience for Unresolved Transactions via vtctld and VTAdmin.We have already added the basics for the users to see the unresolved transactions in #16793, and to conclude them in #16834. However, we can definitely make a few more improvements to the information the users see that will help them debug the transaction. First off, in the list of participants that the user sees from the I and @harshit-gangal discussed ☝️ and there is a pretty easy for this, where we can add the metadata manager shard to the list of unresolved transactions from the vtctld server before returning. Once the users have seen that a certain transaction is unable to be resolved and is failing on certain shards, there currently is no way for the user to see what are the exact DML queries that the shard is running. They have to look at the What all information that RPC/page will hold is something we need to decide on. We have a few options. Let's consider we have 3 shards, -40, 40-80, and 80- and the user runs a multi-insert such that each shard ends up with 1 insert query as part of the distributed transaction. Let us also assume that 80- is the only shard that fails. Let
Decision
|
Introduction
This document focuses on reintroducing the atomic distributed transaction implementation and addressing the shortcomings with improved and robust support.
Background
Existing System Overview
Vitess has three transaction modes; those are Single, Multi and TwoPC.
In Single Mode, any transaction that spans more than one shard is rolled back immediately. This mode keeps the transaction to a single shard and provides ACID-compliant transactions.
In Multi Mode, a commit on a multi-shard transaction is handled with a best-effort commit. Any commit failure on a shard rolls back the non-committed transactions. The previously committed shard transactions and the failure shard need application-side handling.
In TwoPC Mode, a commit on a multi-shard transaction follows a sequence of steps to achieve an atomic distributed commit. The existing design document is extensive and explains all the component interactions needed to support it. It also highlights the different failure scenarios and how they should be handled.
Existing Implementation
A Two-Phase commit protocol requires a Transaction Manager (TM) and Resource Managers (RMs).
Resource Managers are the participating VTTablets for the transaction. Their role is to prepare the transaction and return a success or failure response. During the prepare phase, RMs store all the queries executed on that transaction in recovery logs as statements. If an RM fails, upon coming back online, it prepares all the transactions using the transaction recovery logs by executing the statements before accepting any further transactions or queries.
The Transaction Manager role is handled by VTGate. On commit, VTGate creates a transaction record and stores it in one of the participating RMs, designating it as the Metadata Manager (MM). VTGate then issues a prepare request to the other involved RMs. If any RM responds with a failure, VTGate decides to roll back the transaction and stores this decision in the MM. VTGate then issues a rollback prepared request to all the involved RMs.
If all RMs respond successfully, VTGate decides to commit the transaction. It issues a start commit to the MM, which commits the ongoing transaction and stores the commit decision in the transaction record. VTGate then issues a commit prepared request to the other involved RMs. After committing on all RMs, VTGate concludes by removing the transaction record from the MM.
All MMs have a watcher service that monitors unresolved transactions and transmits them to the TM for resolution.
Benefits of the Existing Approach:
Problem Statement
The existing implementation of atomic distributed commit is a modified version of the Two-Phase Commit (2PC) protocol that addresses its inherent issues while making practical trade-offs. This approach efficiently handles single-shard transactions and adopts a realistic method for managing transactions across multiple shards. However, there are issues with the watchdog design, as well as other reliability concerns. Additionally, there are workflow improvements and performance enhancements that need to be addressed. This document will highlight these issues and provide solutions with the rework.
Existing Issues and Proposal
1. Distributed Transaction Identifier (DTID) Generation
The Transaction Manager (TM) designates the first participant of the transaction as MM. It generates the DTID using MM’s shard prefix and the transaction ID. This method ensures uniqueness across shards, it introduces potential conflicts due to the auto-increment transaction ID being reset upon a VTTablet restart.
Impact:
Proposals:
Conclusion:
Proposal 1 is good but it adds a dependency on a new system to provide the DTID. Proposal 2 reduces that dependency by having TM generate the DTID, but it risks generating duplicate DTID which might fail on Create Transaction Record API, leading to transaction rollback. Proposal 3 ensures the DTID is unique but results in a long DTID key. Proposal 4 also risks DTID collisions, causing transaction rollback on the Create Transaction Record API call.
Proposals 1 & 2 can map the DTID to non-participating RMs, making it the MM. These additional network calls will increase the system’s commit latency. Proposals 3 & 4 avoid this extra hop but significantly increase the DTID size. The larger DTID size outweighs the efficiency gains from using one of the participating RMs as MM in the overall commit process.
Proposal 3 looks like the most balanced and reliable option here.
2. Transaction Resolution Design
The MM is currently being provided with a fixed IP address of the TM on startup to invoke TM ResolveTransaction API for unresolved transactions.
Impact:
Proposals:
Conclusion:
Proposal 1 is the more practical choice as it utilizes existing infrastructure, which is proven and already used for other purposes like real-time stats and schema tracking. Unlike Proposal 2, which requires full-fledged development of a VTGate service discovery system.
3. Connection Settings
The current implementation does not store changes in the connection settings in the transaction recovery logs. Its omission risks the integrity and consistency of the distributed transaction during a failure recovery scenario.
Impact:
Proposal: Along with redo statement logs, the connections settings as set statements will be stored in the sequence of when they were executed. On recovery, the same sequence will be used to prepare the transaction.
4. Prepared Transactions Connection Stability
The current implementation assumes a stable MySQL connection after preparing a transaction on an RM. Any connection disruption will roll back the transaction and may cause data inconsistency due to modifications by other concurrent transactions.
Impact:
Proposals:
Conclusion:
Proposal 1 is recommended for immediate adoption to enhance connection stability and prevent unreliable TCP connections. If testing identifies issues with Unix socket stability, Proposal 2 will be implemented to leverage MySQL's XA protocol for transactional integrity and recovery.
5. Transaction Recovery Logs Application Reliability
The current implementation stores the transaction recovery logs as DML statements. On transaction recovery, while applying the statements from these logs it is not expected to fail as the current shutdown and startup workflow ensure that no other DMLs leak into the database. Still, there remains a risk of statement failure during the redo log application, potentially resulting in lost modifications without clear tracking of modified rows.
Impact:
Proposals:
Conclusion:
Currently, neither proposal will be implemented, as the expectation is that redo log applications should not fail during recovery. Should any recovery tests fail due to redo log application issues, Proposal 2 will be prioritized for its inherent advantages over Proposal 1.
6. Unsupported Consistent Lookup Vindex
The current implementation disallows the use of consistent lookup vindexes and upfront rejects any distributed transaction commit involving them.
Impact:
Proposal: Allow the consistent lookup vindex to continue. The pre-transaction will continue to work as-is. Any failure on the pre-transaction commit will roll back the main transaction. The post-transaction will only continue once the distributed transaction is completed. Otherwise, the post-transaction will be rolled back.
7. Resharding, Move Tables and Online Schema Change not Accounted
The current implementation has not handled the complications of running a resharding workflow, a move tables workflow, or an online schema change workflow in parallel with in-flight prepared distributed transactions.
Impact:
Proposals:
Conclusion:
Proposal 1 is relatively easier to argue about the expectation. All workflows will use the same strategy. The new API can be extended to be used for other flows as well.
Exploratory Work
MySQL XA was considered as an alternative to having RMs manage the transaction recovery logs and hold up the row locks until a commit or rollback occurs.
There are currently 20 open bugs on XA. On MySQL 8.0.33, reproduction steps were followed for all the bugs, and 8 still persist. Out of these 8 bugs, 4 have patches attached that resolve the issues when applied. For the remaining 4 issues, changes will need to be made either in the code or the workflow to ensure they are resolved.
MySQL’s XA seems a probable candidate if we encounter issues with our implementation of handling distributed transactions that XA can resolve. XA’s usage is currently neither established nor ruled out in this design.
Rework Design
Commit Phase Interaction
The Component interaction for different cases.
Any error in the commit phase is indicated to the application with a warning flag. If an application's transaction receives a warning signal, it can execute a
show warnings
to know the distributed transaction ID for that transaction. It can watch the transaction status withshow transaction status for <dtid>
.Case 1: All components respond with success.
Case 2: When the Commit Prepared Transaction from the RM responds with an error. In this case, the watcher service needs to resolve the transaction and commit the pending prepared transactions.
Case 3: When the Commit Descision from MM responds with an error. In this case, the watcher service needs to resolve the transaction as it is not certain whether the commit decision persisted or not.
Case 4: When a Prepare Transaction fails. TM will decide to roll back the transaction. If any rollback returns a failure, the watcher service will resolve the transaction.
Case 5: When Create Transaction Record fails. TM will roll back the transaction.
Transaction Resolution Watcher
If there are long pending distributed transactions in the MM. This watcher service will ensure that TM is invoked to resolve them.
Improvements and Enhancements
show transaction status for <dtid>
command.Implementation Plan
Task Breakdown:
[ ] Implement new DTID generator logicTesting Strategy
This is the most important piece to ensure all cases are covered, and APIs are tested thoroughly to ensure correctness and determine scalability.
Test Plan
Basic Tests
Commit or rollback of transactions, and handling prepare failures leading to transaction rollbacks.
Resilient Tests
Handling failures of components like VTGate, VTTablet, or MySQL during the commit or recovery steps.
The failure on MM and RM includes the VTTablet and MySQL interuption cases.
System Tests
Tests Involving multiple moving parts such as distributed transactions with Reparenting (PRS & ERS), Resharding, OnlineDDL, and MoveTables.
Stress Tests
Tests will run conflicting transactions (single and multi-shard), and validate on error metrics related to distributed transaction failure.
Reliability tests
A continuous stream of transactions (single and distributed) will be executed, with all successful commits recorded along with the expected rows. The binary log events will be streamed continuously and validated against the ordering of the change stream and the successful transactions.
This test should run over an extended period, potentially lasting a few days or a week, and must endure various scenarios including:
Deployment Plan
The existing implementation has remained experimental therefore no compatibility guarantees will be maintained with the new design changes.
Monitoring
The existing monitoring support will continue as per the old design.
Future Enhancements
1. Read Isolation Guarantee
The current system lacks isolation guarantees, placing the burden on the application to manage it. Implementing read isolation will enable true cross-shard ACID transactions.
2. Distributed Deadlock Avoidance
The current system can encounter cross-shard deadlocks, which are only resolved when one of the transactions times out and is rolled back. Implementing distributed deadlock avoidance will address this issue more efficiently.
The text was updated successfully, but these errors were encountered: