[Task]: Extending Ballerina's Transaction Support to Include Transaction Recovery #42031

Open
3 of 5 tasks
dsplayerX opened this issue Jan 22, 2024 · 3 comments
Labels
Lang/Transactions Ballerina Transaction and its implementation related issued Team/CompilerFE All issues related to Language implementation and Compiler, this exclude run times. Type/Task

Comments


dsplayerX commented Jan 22, 2024

Description

Ballerina has no native support for recovery in distributed transactions. It offers recovery only for database transactions, through the Atomikos library's transaction manager, and has no recovery support for transactional microservices or other XA resources. The goal of this task is to extend Ballerina's transaction support with native recovery for distributed transactions, following the XA spec, which removes the need for the Atomikos library. The aim is to mitigate the risks posed by network failures, resource manager issues, and application errors, and so to preserve data consistency, fault tolerance, and overall application reliability in distributed transactions.

Describe your task(s)

[Phase 1] Recovery for Direct XA Resource Transactions

[Phase 2] Coordinator-Participant Recovery

  • Design and Implement Coordinator-Participant Recovery Mechanism:
    Design and implement a recovery mechanism for both coordinators and participant nodes. This mechanism should allow communication between two nodes and gracefully recover from failures, ensuring that transactions can either be completed or rolled back consistently.

Related area

-> Compilation

Related issue(s) (optional)

No response

Suggested label(s) (optional)

No response

Suggested assignee(s) (optional)

No response

@dsplayerX dsplayerX added Type/Task Team/CompilerFE All issues related to Language implementation and Compiler, this exclude run times. Lang/Transactions Ballerina Transaction and its implementation related issued labels Jan 22, 2024
@dsplayerX dsplayerX self-assigned this Jan 22, 2024
@dsplayerX dsplayerX moved this to In Progress in Ballerina Team Main Board Jan 22, 2024

dsplayerX commented Jan 23, 2024

Changes and New Additions

  • A recovery manager instance was integrated into the transaction manager, alongside the log manager, which handles both in-memory and file-based logging.
  • XIDGenerator was changed to use the transactionId as the gtrid and the transactionBlockId as the bqual. The default_format value is obtained by combining the ASCII values of the characters 'B', 'A', and 'L'. This value is unique to Ballerina transactions but the same for every transaction.
  • RecoveryStates enum was introduced and used to manage and identify different states of transactions during logging and recovery.
  • TransactionLogRecord was introduced to hold the information of a transaction: the transaction id, the transaction state and the log time. This is the current set of fields; more information will be needed for coordinator-participant recovery.
  • RecoveryLogManager was introduced to manage transaction recovery logs by combining a file-based FileRecoveryLog with an in-memory InMemoryRecoveryLog. It supports adding and retrieving transaction log records.
    • The in-memory log manager is used for dynamic tracking of transaction status during runtime.
    • The file-based log manager is responsible for persistently storing transaction logs, ensuring recovery after system restarts or crashes.
    • The file log can be customized by changing the directory, filename, checkpoint interval and whether or not to delete old logs. Configuration values for recoveryLogName, recoveryLogDir, checkpointInterval (use "-1" to disable checkpointing), deleteOldLogs can be provided under [ballerina.lang.transaction].
  • RecoveryManager was introduced which is responsible for recovering transactions.
    • As of now, it manages a collection of pending transaction log records and a set of XA resources to be recovered.
    • The recovery process involves checking the transaction state, retrieving prepared XIDs from the XA resources, and replaying commit or rollback operations based on the transaction state.
    • It also handles heuristic terminations of transactions in XA resources.
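The XID layout described above can be sketched in Java as follows. This is a minimal illustration, not the actual XIDGenerator: the class name `SketchXid` is hypothetical, the issue does not say how the three ASCII values are combined (packing them into one int is an assumption), and treating default_format as the value returned by `getFormatId()` is also an assumption.

```java
import java.nio.charset.StandardCharsets;
import javax.transaction.xa.Xid;

// Hypothetical Xid implementation; the name and the encoding are assumptions.
final class SketchXid implements Xid {
    // Assumed encoding: the ASCII codes of 'B', 'A' and 'L' packed into one int.
    static final int FORMAT_ID = ('B' << 16) | ('A' << 8) | 'L';

    private final byte[] gtrid; // global transaction id, from transactionId
    private final byte[] bqual; // branch qualifier, from transactionBlockId

    SketchXid(String transactionId, String transactionBlockId) {
        this.gtrid = transactionId.getBytes(StandardCharsets.UTF_8);
        this.bqual = transactionBlockId.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public int getFormatId() {
        return FORMAT_ID; // same value for every transaction
    }

    @Override
    public byte[] getGlobalTransactionId() {
        return gtrid.clone(); // defensive copy
    }

    @Override
    public byte[] getBranchQualifier() {
        return bqual.clone(); // defensive copy
    }
}
```

Note that `Xid.MAXGTRIDSIZE` and `Xid.MAXBQUALSIZE` cap both byte arrays at 64 bytes, so a real generator would have to keep the encoded ids within that limit.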

Recovery Pass

  • XA resources to recover are collected through an exposed addXAResourceToRecover method, which libraries call while initializing their resources. Transaction recovery is not supported for resources created or initialized inside the transaction block.
  • The transaction manager has a startupCrashRecovery method that initiates recovery after a system crash.
  • A boolean flag in the transaction manager tracks whether the initial recovery pass succeeded.
  • The recovery process runs during system startup, performing an initial pass to recover incomplete transactions left by a previous crash before normal operation resumes. Failed transactions (prepared and in-doubt) are identified from the saved log and recovered during this pass, which releases any locks they still hold and lets the system operate normally.
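As a rough illustration of how the startup pass might identify incomplete transactions from the saved log, the sketch below keeps the last recorded state per transaction and drops those that already reached a terminal state. Everything here is hypothetical: the line format `<transactionId>|<STATE>|<epochMillis>`, the `SketchState` values, and the class names are assumptions, since the issue does not describe the log format or the members of RecoveryStates.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical state names; the real RecoveryStates enum may differ.
enum SketchState {
    PREPARING, COMMITTING, ABORTING, COMMITTED, ABORTED;

    boolean isTerminal() {
        return this == COMMITTED || this == ABORTED;
    }
}

final class PendingLogScan {
    // Each line is assumed to look like "<transactionId>|<STATE>|<epochMillis>".
    // Returns the latest state of every transaction that never reached a terminal state.
    static Map<String, SketchState> pending(List<String> logLines) {
        Map<String, SketchState> latest = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] parts = line.split("\\|");
            if (parts.length != 3) {
                continue; // skip malformed or partially written lines
            }
            // Later records overwrite earlier ones for the same transaction.
            latest.put(parts[0], SketchState.valueOf(parts[1]));
        }
        // Transactions that finished need no recovery.
        latest.values().removeIf(SketchState::isTerminal);
        return latest;
    }
}
```

The transactions that remain in the returned map are the ones the recovery pass would still have to commit or roll back.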

Update 23/01/2024


dsplayerX commented Feb 12, 2024

Recovery Process

The recovery process retrieves failed transactions from the XAResources using xa_recover(). This returns a list of XIDs (transaction identifiers) for transactions that were in progress on that resource but failed to complete. For each XID, we look up the corresponding log record to find the decision (commit/abort) the coordinator previously made, and then commit or abort the transaction accordingly. If there are mixed/hazard outcomes, the user is warned about them and must handle them manually.
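In Java, xa_recover() is exposed as `XAResource.recover(int)`. The scan-and-match flow described above could look roughly like the following sketch. The class and method names are hypothetical, the decision map stands in for the real log lookup, and decoding the gtrid as a UTF-8 transaction id is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Hypothetical helper, not the actual Ballerina RecoveryManager.
final class RecoverScanSketch {

    // commitDecisions maps a transaction id to the logged coordinator decision:
    // true = commit, false = abort. Returns how many branches were completed.
    static int recoverResource(XAResource resource, Map<String, Boolean> commitDecisions)
            throws XAException {
        // A single call that both starts and ends the recovery scan.
        Xid[] inDoubt = resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
        if (inDoubt == null) {
            return 0; // some resources return null instead of an empty array
        }
        int completed = 0;
        for (Xid xid : inDoubt) {
            // Assumption: the gtrid holds the UTF-8 bytes of the transaction id.
            String txId = new String(xid.getGlobalTransactionId(), StandardCharsets.UTF_8);
            Boolean commit = commitDecisions.get(txId);
            if (commit == null) {
                continue; // no logged decision for this XID; leave it untouched
            }
            try {
                if (commit) {
                    resource.commit(xid, false); // second phase of two-phase commit
                } else {
                    resource.rollback(xid);
                }
                completed++;
            } catch (XAException e) {
                if (e.errorCode == XAException.XA_HEURMIX
                        || e.errorCode == XAException.XA_HEURHAZ) {
                    // Mixed/hazard outcome: warn and leave it for manual handling.
                    System.err.println("Heuristic outcome for transaction " + txId
                            + ", manual handling required (code " + e.errorCode + ")");
                } else {
                    throw e;
                }
            }
        }
        return completed;
    }
}
```

XIDs without a logged decision are skipped here rather than rolled back, because `recover` can also return branches that belong to other transaction managers.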

Update 11/02/24


As discussed, retrieving prepared transactions from the database and matching them with corresponding log records to act based on the coordinator's decision was deemed unnecessary overhead.

Instead, we'll broadcast the coordinator's decision (commit/abort) to all resources. Resources without an active or failed transaction for that XID will respond with XAER_INVAL or XAER_NOTA, indicating that the XID is no longer known to the resource and that the transaction has already terminated through a concurrent commit or rollback. This approach streamlines the process and minimizes unnecessary calls.
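A minimal Java sketch of the broadcast approach follows, assuming the decision and the XID have already been read from the recovery log. The class and method names are hypothetical. XAER_NOTA and XAER_INVAL are treated as "nothing to do on this resource", as described above, and any other error is propagated.

```java
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Hypothetical helper, not the actual Ballerina implementation.
final class BroadcastDecisionSketch {

    // Sends the logged decision to every registered resource.
    // Returns how many resources actually completed a branch for this XID.
    static int broadcast(Xid xid, boolean commit, List<XAResource> resources)
            throws XAException {
        int completed = 0;
        for (XAResource resource : resources) {
            try {
                if (commit) {
                    resource.commit(xid, false); // second phase of two-phase commit
                } else {
                    resource.rollback(xid);
                }
                completed++;
            } catch (XAException e) {
                if (e.errorCode == XAException.XAER_NOTA
                        || e.errorCode == XAException.XAER_INVAL) {
                    // The resource does not know this XID: the branch never
                    // existed there or was already completed.
                    continue;
                }
                throw e; // anything else needs real error handling
            }
        }
        return completed;
    }
}
```

Compared with the scan-based approach, this needs no `recover` call per resource, at the cost of one commit or rollback call to each resource that never took part in the transaction.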

Update 12/02/2024
