
What about: seqlocks, load-release/store-acquire? #323

Open

comex opened this issue Mar 24, 2022 · 10 comments

@comex

comex commented Mar 24, 2022

Continuing a discussion in #321 that starts around here and ends here.

I think that this proposal would work, but it is essentially equivalent to the read-dont-modify-write approach.

I agree it's equivalent to read-dont-modify-write. But phrasing it the way I did provides insight about how it can be efficiently implemented. I claim that a load-release can be implemented on most architectures as a fence instruction followed by a load instruction, and store-acquire can be implemented as a store instruction followed by a fence instruction. Notably:

  • This matches how Linux actually implements seqlocks: "load-release", "store-acquire".
    • edit: no it doesn't, it requires a stronger fence than what Linux uses, see below.
  • It puts the fence on the opposite side of the memory access compared to load-acquire and store-release.
  • However, it may require a different type of fence instruction depending on which instructions the architecture offers.

It implies, for example, that readers also synchronize with each other, i.e. R3 -sw> R1', which linearizes the accesses and corresponds to bouncing an exclusive cache line around and a loss of the unlimited read-side scaling property that is the whole point of seqlocks.

I could have made a mistake, but this is intended to match seqlocks' implementation, and to not require exclusive cache line access on typical CPUs – unlike a naive implementation of read-dont-modify-write in terms of compare_exchange or fetch_add, which would. The idea is that on those typical CPUs, memory accesses actually are implemented in a way that provides full sequential consistency, in terms of when the accesses actually occur. It's just that the order of "when the accesses actually occur" is different from program order thanks to reordering. (But this is just a motivation, not an attempt to prove correctness. The proof of correctness would come from the memory models in each architecture manual.)
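To make that concrete, here's a rough sketch in Rust inline assembly (AArch64 flavor). The function names are mine, and this is an architecture-level construction that the C11 model can't directly express – per the edit above, a full dmb ish is needed, not a weaker barrier:

use core::arch::asm;

// Hypothetical "load-release": a full barrier, then a plain load.
// The barrier orders all prior accesses before the load that follows it.
#[cfg(target_arch = "aarch64")]
unsafe fn load_release(ptr: *const u64) -> u64 {
    let value: u64;
    asm!(
        "dmb ish",        // order prior loads and stores first
        "ldr {v}, [{p}]", // the load we care about
        v = out(reg) value,
        p = in(reg) ptr,
        options(nostack),
    );
    value
}

// Hypothetical "store-acquire": a plain store, then a full barrier.
// The barrier orders the store before all subsequent accesses.
#[cfg(target_arch = "aarch64")]
unsafe fn store_acquire(ptr: *mut u64, value: u64) {
    asm!(
        "str {v}, [{p}]", // the store we care about
        "dmb ish",        // order it before subsequent loads and stores
        v = in(reg) value,
        p = in(reg) ptr,
        options(nostack),
    );
}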

@comex
Author

comex commented Mar 24, 2022

As further justification for my claims that load-release and store-acquire are (a) operations worth providing and (b) implementable the way I said:

I did a bit of research into architecture-level memory ordering guarantees. In the past, architecture manuals tended to talk more explicitly about instruction reordering, whereas these days they tend to be worded more in terms of abstract guarantees, similar to the C11 memory model. Even so, both the ARM and RISC-V specifications do still provide considerably stronger guarantees than the C11 model: they both require a single global ordering of all accesses in terms of, as I phrased it in my last post, "when they actually occur" (a more accurate phrase would be "when they complete"). Then there are various ways to ensure one access comes before another in that ordering, one of which is to put a suitable fence/barrier instruction between them.

I haven't looked at x86 but I'm pretty sure its memory model is strictly stronger.

Thus, going from C11's total modification order to a 'total access order' would still leave us with a conservative subset of the guarantees that most architectures provide.

If some other architecture does not naturally provide the needed guarantees, at worst, load-release and store-acquire could be implemented as read-dont-modify-write, e.g. as foo.fetch_add(0, SomeOrdering).

...

Of course, adding new memory orderings would not be easy; at minimum it would require significant work on LLVM's side. But it would be nice to do.

In the meantime, I contend that inline assembly – or FFI to out-of-line assembly – is in fact a sound way to take advantage of the stronger architectural guarantees.

@digama0

digama0 commented Mar 24, 2022

It implies, for example, that readers also synchronize with each other, i.e. R3 -sw> R1', which linearizes the accesses and corresponds to bouncing an exclusive cache line around and a loss of the unlimited read-side scaling property that is the whole point of seqlocks.

I could have made a mistake, but this is intended to match seqlocks' implementation, and to not require exclusive cache line access on typical CPUs – unlike a naive implementation of read-dont-modify-write in terms of compare_exchange or fetch_add, which would. The idea is that on those typical CPUs, memory accesses actually are implemented in a way that provides full sequential consistency, in terms of when the accesses actually occur. It's just that the order of "when the accesses actually occur" is different from program order thanks to reordering. (But this is just a motivation, not an attempt to prove correctness. The proof of correctness would come from the memory models in each architecture manual.)

I'm not 100% sure, but I think you might be falling into the same trap that the author talks about in section 7:

Ideally, it would be possible to address this by having the compiler treat read-dont-modify-write operations, such as fetch_add(0, memory_order_release), specially, translating them to precisely an atomic load operation plus whatever memory fences etc. are required to enforce the required ordering. Unfortunately, that is not correct.[3]

[3]: ... in spite of my earlier claims in the previously mentioned mailing list discussion.

Can you tell whether the proposed counterexample also applies to your version?

@bjorn3
Member

bjorn3 commented Mar 24, 2022

Can't hardware memory models be stricter because you can rely on data dependencies, while software optimizations can easily break data dependencies?

@comex
Author

comex commented Mar 24, 2022

I'm not 100% sure, but I think you might be falling into the same trap that the author talks about in section 7:
[…]
Can you tell whether the proposed counterexample also applies to your version?

Sigh… I did fall into that trap, sort of.

It's not that my proposed load-release is unsound or not equivalent to read-dont-modify-write. But it implies a stronger barrier than is needed for seqlocks.

A release operation – C++11's release stores and release fences, as well as release loads under my proposed definition – waits for both prior loads and prior stores to complete. Formally speaking, if you have a -sb> b -rf> c -sb> d*, where b is a release store and c is an acquire load, then a happens-before d, regardless of whether a is a load or store (or whether d is a load or store). C++11 has no concept of a barrier that applies to "all stores before this" or "all loads before this", only "all accesses before this".
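In code, this is just the classic message-passing pattern. A minimal sketch, where a is the data write, b the release store, c the acquire load, and d the data read:

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

fn main() {
    let t = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);   // a: plain data write
        FLAG.store(true, Ordering::Release); // b: release store
    });
    while !FLAG.load(Ordering::Acquire) {}   // c: acquire load; eventually reads from b
    assert_eq!(DATA.load(Ordering::Relaxed), 42); // d: a happens-before d, so this sees 42
    t.join().unwrap();
}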

In RISC-V terms, a store-release is fence rw,w (order prior reads and writes before subsequent writes) followed by a store (the 'subsequent write' that we care about). Similarly, my proposed load-release would be fence rw,r (order prior reads and writes before subsequent reads) followed by a load (the 'subsequent read' that we care about).

But a seqlock reader performs only loads, so it only needs fence r,r.

In Linux terms, this is roughly the difference between smp_mb() and smp_rmb() (although smp_mb() is actually stronger than necessary, equivalent to fence rw,rw). In ARM terms, it's the difference between dmb and dmb ld. And in x86 terms, it's the difference between a locked access to any address, and nothing at all (i.e. fence r,r comes for free).
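For concreteness, here's the reader shape this all corresponds to, written with what Rust offers today. A sketch, assuming the protected data is two machine words read with relaxed loads; the acquire fence is the closest existing construct to fence r,r (on RISC-V it lowers to fence r,rw, slightly stronger than needed):

use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct SeqLock {
    seq: AtomicUsize,
    data: [AtomicUsize; 2],
}

impl SeqLock {
    fn read(&self) -> [usize; 2] {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // odd: a writer is in progress, retry
            }
            let d0 = self.data[0].load(Ordering::Relaxed);
            let d1 = self.data[1].load(Ordering::Relaxed);
            // Order the data loads above before the re-check below;
            // only read-to-read ordering is actually needed here.
            fence(Ordering::Acquire);
            if self.seq.load(Ordering::Relaxed) == s1 {
                return [d0, d1]; // sequence number unchanged: snapshot is consistent
            }
        }
    }
}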

I'm not sure how much performance difference the stronger fence would make in practice. It's still just a fence; the current CPU core doesn't need to take exclusive ownership of the cache line containing the atomic object or anything like that. (On x86, you need a locked access, but that access can be to any random address; it doesn't have to be to the atomic object itself.)

However, if we're talking about adding new types of atomic accesses, it's definitely questionable to have it be stronger than needed for the motivating use case.

Hmm.

* English translation: one thread does a followed by b, while a second thread does c followed by d, and c reads from b

@comex
Author

comex commented Mar 24, 2022

Can't hardware memory models be stricter because you can rely on data dependencies, while software optimizations can easily break data dependencies?

Data dependencies are another case where the Linux kernel tries to use cheaper and subtler synchronization primitives than what C++11 atomics expose, but they're not relevant to seqlocks as far as I know.

@ibraheemdev
Member

Worth noting that LLVM already has optimizations in place for fetch_add(0, ordering). For example on x86:

use std::sync::atomic::{AtomicUsize, Ordering};

pub fn load_release(x: &AtomicUsize) -> usize {
    x.fetch_add(0, Ordering::Release)
}

// Generated x86-64 assembly:
// example::load_release:
//         mfence
//         mov     rax, qword ptr [rdi]
//         ret

@m-ou-se
Member

m-ou-se commented Jul 3, 2022

C++'s P1478 proposes to add atomic_load_per_byte_memcpy (and atomic_store_per_byte_memcpy) to solve the seqlock issue. Instead of making the final load of the sequence number a "release load" (which doesn't really make sense), the proposed per-byte-load operation makes it possible to make the loading of the data itself an acquire-load. This works for data of arbitrary size, because the proposed function only guarantees that the load is atomic for each individual byte, without any further guarantees. (It's basically just a memcpy that's allowed to race with other (atomic) operations, which is exactly what seqlock needs.)

We should just add the same functions to Rust. LLVM will need to support them for C++ anyway, and they fit nicely in the current memory model.
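For illustration, a seqlock read built on such an operation might look like this. The first function is a hypothetical stand-in for the proposed API; the actual shape in the RFC may differ:

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for the proposed operation: copy the value out
// atomically per byte, with acquire ordering; races with writers are
// allowed. Not a real std API.
unsafe fn atomic_load_per_byte_memcpy<T>(_src: *const T, _order: Ordering) -> T {
    unimplemented!()
}

struct SeqLock<T> {
    seq: AtomicUsize,
    data: UnsafeCell<T>,
}

impl<T: Copy> SeqLock<T> {
    fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // writer in progress; retry
            }
            // The data load itself is the acquire operation, and it is
            // allowed to race with a writer because it's atomic per byte.
            let value = unsafe {
                atomic_load_per_byte_memcpy(self.data.get(), Ordering::Acquire)
            };
            if self.seq.load(Ordering::Relaxed) == s1 {
                return value; // no writer intervened; the copy is consistent
            }
        }
    }
}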


@RalfJung
Member

RalfJung commented Jul 4, 2022

I agree, those operations make a lot of sense.

@m-ou-se
Member

m-ou-se commented Aug 14, 2022

I wrote an RFC for this today: rust-lang/rfcs#3301
