make transaction notification and slot status update in order #5644
Conversation
If this PR represents a change to the public RPC API:
Thank you for keeping the RPC clients in sync with the server API, @lijunwangs.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5644      +/-   ##
==========================================
  Coverage    83.0%    83.0%
==========================================
  Files         828      830       +2
  Lines      375675   375813     +138
==========================================
+ Hits       311845   311994     +149
+ Misses      63830    63819      -11
```
Attachment: test_solana_transaction.py.txt
Code looks good to me. Are there any tests we could add to check that the ordering is happening properly? Of course, we wouldn't want to add a racy test.
I will look into adding a test in transaction_status_service.rs.
Added a test verifying notification order.
My wife doesn't like having a coffee with lunch; she prefers it to come afterward.

What we have here is a `SandwichGrill` and a `CoffeeMachine` that run in different threads. Even if the coffee is always ordered after the sandwich, you might get the coffee first because of the way that threading works.

If I understand this PR correctly, here's what it does:

- If no sandwich has been ordered, send the coffee order straight to the `CoffeeMachine`. No race!
- Otherwise, put an espresso machine on the grill, and send both sandwich orders and coffee orders to the `SandwichGrill`. Since the grill operator can only make one thing at a time, you're guaranteed to get your coffee and your sandwich in the expected order (see the sketch after this list).
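A minimal sketch of that funneling idea, with hypothetical names: one consumer on one channel necessarily processes messages in send order.

```rust
use std::sync::mpsc;
use std::thread;

// When both order types are enabled they share one channel, so a single
// consumer handles them strictly in send order.
enum Order {
    Sandwich(&'static str), // stands in for TransactionStatusBatch
    Coffee(&'static str),   // stands in for BankNotification::Frozen
}

fn main() {
    let (tx, rx) = mpsc::channel::<Order>();

    let grill = thread::spawn(move || {
        // One consumer => FIFO processing; the coffee can never overtake
        // the sandwich ordered before it.
        for order in rx {
            match order {
                Order::Sandwich(s) => println!("grilled {s}"),
                Order::Coffee(c) => println!("brewed {c}"),
            }
        }
    });

    tx.send(Order::Sandwich("blt")).unwrap();
    tx.send(Order::Coffee("espresso")).unwrap();
    drop(tx); // close the channel so the grill thread exits
    grill.join().unwrap();
}
```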
This works, but it blends responsibilities into the `SandwichGrill` in a way that will most certainly confuse future developers who work on this code. Indeed, you had to do some refactoring to make it all fit together!
What if, instead, we created a coordination mechanism for the grillmaster and the barista? Admittedly, all of this has, as its root cause, the fact that events don't have sequence numbers. Without solving that, though, what if we created the notion of receiver synchronization?

- Create a global `tokio::sync::Mutex` shared by all senders.
- When sending a message (e.g. `Sender<BankNotification>#send`), obtain an async lock and include it in the notification.
- When receiving that notification (e.g. `bank_notification_receiver.recv()`), do all the work except for broadcasting it (e.g. `slot_status_notifier#notify_slot_confirmed`)…
- …`.await` the lock…
- …then broadcast the notification…
- …then release the lock.

Locks will be granted in the order in which they were created, which will guarantee that the notifications are broadcast in the order in which they were sent. (A sketch of this receiver-synchronization idea follows.)
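A minimal sketch of the receiver-synchronization idea. Since a tokio lock future does not begin acquiring until it is polled (as a test later in this thread suggests), this sketch swaps in a `tokio::sync::oneshot` baton carried by each notification instead of a shared mutex; all names here are hypothetical.

```rust
use tokio::sync::oneshot;

// Each notification carries a baton from its predecessor; only the final
// broadcast step is serialized.
struct Sequenced<T> {
    payload: T,
    // Completed when the previous notification has been broadcast;
    // None for the very first notification.
    prev_done: Option<oneshot::Receiver<()>>,
    // Signalled after this notification has been broadcast.
    done: oneshot::Sender<()>,
}

async fn handle<T>(msg: Sequenced<T>, broadcast: impl FnOnce(T)) {
    // ...do all the CPU-intensive preparation work here, in parallel...
    if let Some(prev) = msg.prev_done {
        let _ = prev.await; // wait for the previous broadcast
    }
    broadcast(msg.payload);
    let _ = msg.done.send(()); // release the next notification in line
}
```

On the sending side, you would create a fresh `oneshot::channel()` per message, put its sender in `done`, and hand its receiver to the next message as `prev_done`.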
Another solution that does not involve passing locks around would be to create two global `AtomicUsize` sequence numbers: one that says how many notifications have been sent and one that says how many have been broadcast. You would `fetch_add()` to get the next sequence number and include that number in the notification; then, on the receiver side, you would loop until the number of notifications broadcast was `>= your_seq_number - 1` before broadcasting the notification. Instead of spinlocking while that is not the case, you could `thread::park()` and then `unpark()` the next time the sequence number changes, to recheck it.
What do you think about that?
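A minimal sketch of that two-counter scheme, with illustrative names; it spins with `yield_now` for brevity where `park()`/`unpark()` would avoid busy-waiting.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// NEXT_TICKET counts notifications sent; BROADCAST_COUNT counts
// notifications broadcast.
static NEXT_TICKET: AtomicUsize = AtomicUsize::new(0);
static BROADCAST_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Called on the sending side; the ticket rides along in the notification.
fn take_ticket() -> usize {
    NEXT_TICKET.fetch_add(1, Ordering::SeqCst)
}

/// Called on the receiving side, possibly from different threads.
fn broadcast_in_order(ticket: usize, broadcast: impl FnOnce()) {
    // Wait until every earlier notification has been broadcast.
    while BROADCAST_COUNT.load(Ordering::SeqCst) != ticket {
        thread::yield_now();
    }
    broadcast();
    BROADCAST_COUNT.fetch_add(1, Ordering::SeqCst);
}

fn main() {
    // Two notifications ticketed in order on the sender side...
    let t0 = take_ticket();
    let t1 = take_ticket();
    // ...broadcast from different threads, but always in ticket order.
    let a = thread::spawn(move || broadcast_in_order(t1, || println!("Frozen")));
    let b = thread::spawn(move || broadcast_in_order(t0, || println!("TxBatch")));
    a.join().unwrap();
    b.join().unwrap();
}
```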
```diff
-    bank::Bank, bank_forks::BankForks, prioritization_fee_cache::PrioritizationFeeCache,
+    bank::Bank,
+    bank_forks::BankForks,
+    // bank_notification::{BankNotification, BankNotificationReceiver},
```
👀
I think we want to eliminate taking locks if we can avoid it, as the synchronization still creates coupling and will likely slow down the TSS.
Give it a try and measure it!
I have a simple test -- the lock without await (polling) does not work; it does not block the other thread from proceeding:

```rust
use std::sync::Arc;
use std::thread;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    let shared = Arc::new(Mutex::new(0));

    // Thread A: regular (non-async) thread, calls lock() but does NOT await
    let shared_a = shared.clone();
    let thread_a = thread::spawn(move || {
        let _future = shared_a.lock(); // creating the future does NOT acquire the lock
        println!("Thread A called lock() but did NOT await");
        std::thread::sleep(std::time::Duration::from_millis(20000)); // hold the future for 20 seconds
        println!("Thread A finished");
    });

    // Give Thread A a little time to start
    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;

    // Task B: async task that actually acquires the lock
    let shared_b = shared.clone();
    let task_b = tokio::spawn(async move {
        println!("Task B is trying to acquire the lock...");
        let mut guard = shared_b.lock().await;
        println!("Task B acquired the lock!");
        *guard += 1;
        println!("Task B incremented the value");
    });

    // Wait for everything to finish
    tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
    let final_val = shared.lock().await;
    println!("Final value: {}", *final_val); // should be 1

    thread_a.join().unwrap(); // wait for Thread A to finish
    task_b.await.unwrap(); // wait for Task B to finish
}
```
Here's a POC using the sequence number technique, with thread parking so that you don't have to spinlock waiting for the sequence number to come into range: https://gist.github.com/steveluscher/d0504640c6c57246b0f3df7a64c928dd What do you think?
@lijunwangs is making good progress with a sequence-number-based solution. What does everyone think about solving this problem in general right now? That is to say: give every notification a sequence number at send time, let each receiver do all of its work, and have it wait for its turn only immediately before broadcasting.

That way we're only blocking at the very last step, after all of the CPU-intensive work of preparing the notification has been done. Might there be anything wrong with that approach?
I have evaluated the sequence-number scheme with two different mechanisms: one using `Condvar` and a second using a lock-free `AtomicUsize` with `yield_now` in the thread that cares about ordering.

https://github.com/lijunwangs/test_event_sequencing/blob/master/src/main.rs : Condvar
https://github.com/lijunwangs/test_event_sequencing/blob/lock_free/src/main.rs : lock-free

In the Condvar case, I compared the performance of handling 10 million events with and without this synchronization, and found that the Condvar adds about 100 ms of overhead. The lock-free one takes half the time of the Condvar implementation.

Strangely, the yield_now version is even faster than running without any synchronization.

[2025-04-17T18:04:34Z INFO tokio_mutex_test] Using sync: false
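For reference, a minimal sketch of what the Condvar variant looks like; the names are illustrative, not the code in the linked repo. Receivers block on a condition variable until the broadcast counter reaches their ticket, instead of spinning.

```rust
use std::sync::{Condvar, Mutex};

struct Sequencer {
    broadcast_count: Mutex<usize>,
    turn: Condvar,
}

impl Sequencer {
    fn new() -> Self {
        Self { broadcast_count: Mutex::new(0), turn: Condvar::new() }
    }

    /// Block until every notification with an earlier ticket has been broadcast.
    fn wait_for_turn(&self, ticket: usize) {
        let mut count = self.broadcast_count.lock().unwrap();
        while *count != ticket {
            count = self.turn.wait(count).unwrap();
        }
    }

    /// Record that this notification was broadcast and wake the waiters.
    fn finish_turn(&self) {
        *self.broadcast_count.lock().unwrap() += 1;
        self.turn.notify_all();
    }
}
```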
Problem

There is a reported issue, #3074, and complaints from the community that, when a geyser plugin receives SlotStatus::Frozen, it might not yet have received all the transaction notifications for that slot. The root cause of both issues is the following:

TransactionStatusBatch and BankNotification::Frozen are sent serially in the replay stage, with transactions sent earlier than bank notifications. However, they are sent to different intermediate receivers running in different threads: TransactionStatusBatch events are sent to the TransactionStatusService, and BankNotification::Frozen events are sent to the OCBT (optimistically confirmed bank tracker). This creates a race condition in delivery order, as the illustration below shows.

The downside for RPC clients is that they have to keep retrying, or add a non-deterministic sleep interval, when they try to get transaction details after they see the slot events. This either increases the RPC traffic load or increases the latency of obtaining the data.
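A minimal, self-contained illustration of this race, with hypothetical names: one thread sends both events in order, but they travel over two channels to two consumer threads, so the "later" event can be observed first.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx_txs, rx_txs) = mpsc::channel::<&str>();   // -> TransactionStatusService
    let (tx_bank, rx_bank) = mpsc::channel::<&str>(); // -> OCBT

    let tss = thread::spawn(move || {
        // Simulate TSS being busy (e.g., with blockstore writes) before it
        // handles the transaction batch.
        thread::sleep(std::time::Duration::from_millis(10));
        println!("TSS handled: {}", rx_txs.recv().unwrap());
    });
    let ocbt = thread::spawn(move || {
        println!("OCBT handled: {}", rx_bank.recv().unwrap());
    });

    // The replay stage sends serially, transactions first...
    tx_txs.send("TransactionStatusBatch for slot 42").unwrap();
    tx_bank.send("BankNotification::Frozen(42)").unwrap();
    // ...yet "Frozen" is typically observed first anyway.

    tss.join().unwrap();
    ocbt.join().unwrap();
}
```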
Summary of Changes

The change reroutes BankNotifications via the TransactionStatusService, so as to ensure the transactions have been written to the blockstore before the BankNotification is delivered. This is done only when both transaction notifications and bank notifications are turned on. If only bank notifications are turned on, without transaction notifications, there is no real change: the bank notification is sent directly to the OCBT.

Due to the following crate dependencies:

- RPC —> Ledger —> runtime
- Core —> RPC (replay_stage is in core)
- RPC —> runtime

the following changes were made:

- TransactionNotification, previously defined in ledger, is moved to runtime.
- BankNotification, previously defined in rpc, is moved to runtime.
- BankNotificationSenderConfig is enhanced to support configuring direct (to the OCBT) or indirect (via the transaction status service (TSS)) delivery, as sketched below.
- TSS is enhanced to support forwarding the received BankNotifications.
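A simplified sketch of that routing decision; the names are illustrative, not the exact items in the PR.

```rust
// Where a BankNotification should be delivered, based on which
// notification types are enabled.
enum Route {
    /// Straight to the OCBT thread (only bank notifications are enabled).
    Direct,
    /// Via the TransactionStatusService, which forwards the notification
    /// only after the slot's transaction statuses have been written to
    /// the blockstore (both notification types are enabled).
    ViaTransactionStatusService,
}

fn route_for(bank_enabled: bool, transaction_enabled: bool) -> Option<Route> {
    match (bank_enabled, transaction_enabled) {
        (false, _) => None, // no bank notifications at all
        (true, true) => Some(Route::ViaTransactionStatusService),
        (true, false) => Some(Route::Direct),
    }
}
```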
Tested using the method mentioned in #3074. Using the attached Python script, I verified that all transactions are obtained -- no null responses.
Fixes #3074