[Task]: Extending Ballerina's Transaction Support to Include Transaction Recovery #42031
Comments
Changes and New Additions
Recovery Pass
Update 23/01/2024
Recovery Process
The recovery process involves retrieving failed transactions from the XAResources using xa_recover(). This would return a list of XIDs (transaction identifiers) for transactions that were in progress but failed to complete in that specific resource. Once we have these XIDs, we search for corresponding log records to determine the decision (commit/abort) that was previously made by the coordinator for each transaction and then act on it accordingly. This typically involves either committing or aborting the transaction, depending on the decision recorded in the logs. If there are mixed/hazard outcomes, the user is warned of those outcomes and those need to be manually handled.
Update 11/02/24
As discussed, retrieving prepared transactions from the database and matching them with corresponding log records to act based on the coordinator's decision was deemed unnecessary overhead. Instead, we'll broadcast the coordinator's decision (commit/abort) to all resources. Resources without active or failed transactions for that XID will respond with
Update 12/02/2024
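The broadcast approach from the 11/02/24 update can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Ballerina runtime's implementation: `RecoverableResource`, `Decision`, `RecoveryPass`, and `InMemoryResource` are hypothetical names, XIDs are simplified to strings, and a resource that does not know an XID is modelled as returning `false` (a real runtime would go through `javax.transaction.xa.XAResource` instead).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Coordinator decision as recorded in the transaction log (hypothetical type). */
enum Decision { COMMIT, ABORT }

/** Hypothetical stand-in for an XA resource. */
interface RecoverableResource {
    String name();
    /** Returns false when the resource has no transaction for this XID. */
    boolean commit(String xid) throws Exception;
    boolean rollback(String xid) throws Exception;
}

final class RecoveryPass {
    /**
     * Broadcasts every logged decision to every resource.
     * Resources that do not know the XID are skipped silently;
     * failures are collected as warnings for manual follow-up.
     */
    static List<String> run(Map<String, Decision> loggedDecisions,
                            List<? extends RecoverableResource> resources) {
        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, Decision> entry : loggedDecisions.entrySet()) {
            String xid = entry.getKey();
            for (RecoverableResource resource : resources) {
                try {
                    if (entry.getValue() == Decision.COMMIT) {
                        resource.commit(xid);
                    } else {
                        resource.rollback(xid);
                    }
                } catch (Exception e) {
                    warnings.add(xid + " on " + resource.name() + ": " + e.getMessage());
                }
            }
        }
        return warnings;
    }
}

/** In-memory fake used only to exercise the sketch. */
final class InMemoryResource implements RecoverableResource {
    private final String name;
    private final Set<String> prepared = new HashSet<>();
    final List<String> applied = new ArrayList<>();

    InMemoryResource(String name, String... preparedXids) {
        this.name = name;
        for (String xid : preparedXids) {
            prepared.add(xid);
        }
    }

    public String name() { return name; }
    public boolean commit(String xid) { return finish(xid, "commit"); }
    public boolean rollback(String xid) { return finish(xid, "rollback"); }

    private boolean finish(String xid, String operation) {
        if (!prepared.remove(xid)) {
            return false; // unknown XID: nothing to do on this resource
        }
        applied.add(operation + ":" + xid);
        return true;
    }
}
```

In this sketch the coordinator never asks the resources which transactions they hold; it replays the logged decisions and lets each resource ignore XIDs it does not recognise, which is the trade-off the update describes.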
Description
Ballerina has no native support for recovery in distributed transactions. Recovery is currently available only for database transactions, through the Atomikos library's transaction manager; there is no recovery support for transactional microservices or other XA resources. The goal of this task is to extend Ballerina's transaction support with native recovery for distributed transactions, following the XA specification, which eliminates the need for the Atomikos library. The aim is to mitigate the risks of network failures, resource manager issues, and application errors, and so preserve data consistency, fault tolerance, and overall application reliability in distributed transactions.
Describe your task(s)
[Phase 1] Recovery for Direct XA Resource Transactions
Implement a component that manages an in-memory log to track transaction status dynamically at runtime. This log stores the state of ongoing transactions.
Implement a file-based log manager that persists transaction logs, so that transaction information is durably saved and can be recovered after a system restart or crash.
Identify and add logs where necessary to capture required transaction state information, to be used in the recovery processes.
Implement a recovery process that runs during startup to perform an initial pass and recover any incomplete transactions or states resulting from a previous system crash before returning to normal operation. Tracked in Implement Transaction Recovery for XA Resources #42080
Develop a mechanism to handle recovery from transaction failures at runtime, preserving data integrity and reducing hazard outcomes. Two improvements were created for this: [Improvement]: Set Transaction Timeout Functionality for Ballerina Transactions #42061 and [Improvement]: Improve Ballerina transactions with XA Resource connection check/refresh before commit, rollback operations #41933.
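As a rough illustration of how the in-memory log and the file-based log manager from Phase 1 could fit together, here is a minimal sketch. Everything in it is an assumption made for the example: the class name `TransactionLog`, the `xid|state` line format, and the state names (`COMMITTING`, `ABORTING`, `COMMITTED`, `ABORTED`) are hypothetical and are not taken from the Ballerina runtime.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

/** Append-only file log with an in-memory view of each transaction's latest state. */
final class TransactionLog {
    private final Path file;
    // In-memory log: XID -> latest recorded state.
    private final Map<String, String> latestState = new LinkedHashMap<>();

    /** Opens the log and replays any existing records, e.g. after a crash. */
    TransactionLog(Path file) {
        this.file = file;
        if (!Files.exists(file)) {
            return;
        }
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                // Assumed record format: "<xid>|<state>"; XIDs must not contain '|'.
                String[] parts = line.split("\\|", 2);
                if (parts.length == 2) {
                    latestState.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the record to disk first, then updates the in-memory view. */
    void record(String xid, String state) {
        String line = xid + "|" + state + System.lineSeparator();
        try {
            Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE,
                    StandardOpenOption.APPEND,
                    StandardOpenOption.SYNC); // request synchronous writes to storage
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        latestState.put(xid, state);
    }

    /** Transactions whose latest state is not terminal; candidates for the startup recovery pass. */
    Map<String, String> unfinished() {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : latestState.entrySet()) {
            String state = entry.getValue();
            if (!state.equals("COMMITTED") && !state.equals("ABORTED")) {
                result.put(entry.getKey(), state);
            }
        }
        return result;
    }
}
```

Writing to the file before updating the map means the in-memory view never claims a state that was not persisted. A startup recovery pass could construct the log from the existing file and act on whatever `unfinished()` returns.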
[Phase 2] Coordinator-Participant Recovery
Design and implement a recovery mechanism for both coordinators and participant nodes. This mechanism should allow communication between two nodes and gracefully recover from failures, ensuring that transactions can either be completed or rolled back consistently.
Related area
-> Compilation
Related issue(s) (optional)
No response
Suggested label(s) (optional)
No response
Suggested assignee(s) (optional)
No response