Experiment with token-based API #2
base: main
The new version will be a fresh start. Reusing this repo for now as the name is suitable for the intent, and the old version was only of historical interest; it was never usable.
Getting a few types and methods implemented.
Uses u16 vector and inline asm. A lot of that will go away when f16 is stabilized in Rust.
Putting impls on the types doesn't work with dispatch, so moving them to plain functions in modules.
Treating the last commit as a checkpoint. Adding a bunch of types, casting operators, etc.
It generates the versions but doesn't work well in practice, as we really need conditional compilation.
No platform specific methods are implemented on SIMD types now.
A proc macro that spits out multiple instances of a function, and generates a dispatcher to pick one at runtime. Delete the old procedural macro, as it couldn't work.
Just enough methods to demonstrate dispatch macro.
The srgb example runs on both neon and fallback. Added a bunch more functions to support it.
Just enough of avx2 to run the examples, including srgb.
An implementation of the fast feature detection idea, also made available as an option to the simd_dispatch macro.
Implement std::ops, allow dispatch. Seems worth exploring.
Most of the old implementation is disabled. Starting new implementation in terms of tokens. Chunks of this, especially low level support, are copied from pulp. Attribution is needed. Remove target_feature 1.1, at least for now.
Runs srgb example on Avx2 now.
We'll very likely bring lots of it back (especially Neon fp16 stuff), but the right way to do that is to fish it out of the `next` branch.
This reduces the boilerplate of the WithSimd trait implementation. Also some fixes so the fallback case works on arch other than aarch64 and x86_64.
Add a `vectorize` method to the SIMD trait which runs the provided function with CPU features enabled. Also clean up examples a bit, adding `#[inline(always)]` where needed and using `simd_dispatch` for the srgb example.
Create a core_arch submodule with low-level access to intrinsics. This pattern follows pulp, and is needed to support layered levels. This commit makes the change for aarch64 only.
This has a separate struct for each SIMD capability level. That will make it easier to split off lower levels if needed, and also a good basis for higher levels.
Adds the fp16 types, some intrinsics, some std::ops, and a few other methods. Not carefully exercised. The methods are implemented directly on f16x4 and f16x8. If we are going to support fp16 on AVX-512, then we will want to switch those over to a trait (maybe a subtrait of `Simd`).
Add unpremultiply example. That's mostly using intrinsics rather than wrapped types, but it does use the vectorize technique.
This is x86_64 only (with fallback).
Add a Bytes trait for SIMD types, which enables a bitcast method.
Now compiles on aarch64. The 256 bit wide operations are emulated.
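The `vectorize` method mentioned in the commits above (run a closure with CPU features enabled) could be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: the trait shape, the `Fallback` and `Avx2` token names, the `try_new` constructor, and the `sum_squares` kernel are all hypothetical. The idea is that the token's `vectorize` forwards the closure into a `#[target_feature]` function, so that an `#[inline(always)]` kernel called from the closure can be compiled with those features.

```rust
/// Hypothetical stand-in for the crate's SIMD token trait.
pub trait Simd: Copy {
    /// Run `f` in a context where this level's CPU features are enabled.
    fn vectorize<R>(self, f: impl FnOnce() -> R) -> R;
}

/// Portable fallback: no extra features, so just call the closure.
#[derive(Clone, Copy, Debug)]
pub struct Fallback;

impl Simd for Fallback {
    #[inline(always)]
    fn vectorize<R>(self, f: impl FnOnce() -> R) -> R {
        f()
    }
}

/// Hypothetical token; holding one is meant to prove that detection succeeded.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy, Debug)]
pub struct Avx2 {
    _private: (),
}

#[cfg(target_arch = "x86_64")]
impl Avx2 {
    /// Mint a token only if the CPU reports the features used below.
    pub fn try_new() -> Option<Self> {
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            Some(Avx2 { _private: () })
        } else {
            None
        }
    }
}

#[cfg(target_arch = "x86_64")]
impl Simd for Avx2 {
    #[inline]
    fn vectorize<R>(self, f: impl FnOnce() -> R) -> R {
        // The closure is called inside a function compiled with avx2 + fma.
        #[target_feature(enable = "avx2,fma")]
        unsafe fn inner<R>(f: impl FnOnce() -> R) -> R {
            f()
        }
        // SAFETY: an `Avx2` value only exists after successful detection.
        unsafe { inner(f) }
    }
}

/// The kernel is `#[inline(always)]` so it can be inlined into `inner`.
#[inline(always)]
fn sum_squares_impl(data: &[f32]) -> f32 {
    data.iter().map(|x| x * x).sum()
}

/// Generic entry point: works with any token.
pub fn sum_squares<S: Simd>(simd: S, data: &[f32]) -> f32 {
    simd.vectorize(|| sum_squares_impl(data))
}
```

Calling `sum_squares(Fallback, &data)` always works; on x86_64 a caller would first try `Avx2::try_new()` and pass the resulting token instead.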
I marked this as ready for review, as I'm inclined to merge it. I think it's enough to serve as a discussion point for what SIMD should look like, and I'm not likely to do significantly more work at the moment.
I've taken a quick glance; I like the token-based approach in general (as you know).
Something that may or may not be nice (I haven't worked out the ergonomics yet) is that it will be possible to encode the features at the type level once `adt_const_params` lands (I believe it is close to stabilization). The token type could then be const-generic over a struct with boolean fields, one for each feature, with specializations for specific blessed levels.
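To make the suggestion concrete, here is a rough sketch that compiles on stable Rust. Since `adt_const_params` is not stable, it approximates "const-generic over a struct with boolean fields" with one `const bool` parameter per feature; the `Token`, `Scalar`, and `V3` names and the choice of features are hypothetical, not part of this PR.

```rust
/// Hypothetical token whose type records which features were detected.
/// With `adt_const_params`, the two bools could become fields of a single
/// const-generic struct parameter instead.
#[derive(Clone, Copy, Debug)]
pub struct Token<const AVX2: bool, const FMA: bool> {
    _private: (),
}

/// "Blessed" levels as type aliases (names are illustrative).
pub type Scalar = Token<false, false>;
pub type V3 = Token<true, true>;

impl Token<false, false> {
    /// The featureless token is always safe to construct.
    pub const fn new() -> Self {
        Token { _private: () }
    }
}

#[cfg(target_arch = "x86_64")]
impl Token<true, true> {
    /// Specialization for one blessed level: mint only after detection.
    pub fn try_new() -> Option<Self> {
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            Some(Token { _private: () })
        } else {
            None
        }
    }
}

impl<const AVX2: bool, const FMA: bool> Token<AVX2, FMA> {
    pub const HAS_AVX2: bool = AVX2;
    pub const HAS_FMA: bool = FMA;

    /// The branch depends only on a const parameter, so each instantiation
    /// keeps just one arm after optimization.
    pub fn mul_add(self, a: f32, b: f32, c: f32) -> f32 {
        if FMA {
            a.mul_add(b, c)
        } else {
            a * b + c
        }
    }
}
```

Generic code can then bound on the parameters it needs, for example `fn kernel<const AVX2: bool>(t: Token<AVX2, true>)` for anything that requires FMA.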
src/half_assed.rs
/// intrinsics, which allows it to be `const`. [`from_f32`][Self::from_f32] should be preferred
/// in any non-`const` context.
///
/// This operation is lossy. If the 32-bit value is to large to fit in 16-bits, ±∞ will result.
- /// This operation is lossy. If the 32-bit value is to large to fit in 16-bits, ±∞ will result.
+ /// This operation is lossy. If the 32-bit value is too large to fit in 16-bits, ±∞ will result.
README.md
[target_feature 1.1]: https://github.com/rust-lang/rfcs/pull/2396
[Towards fearless SIMD]: https://raphlinus.github.io/rust/simd/2018/10/19/fearless-simd.html
[fearless_simd 0.1.1]: https://crates.io/crates/fearless_simd/0.1.1
[half]: https://crates.io/crates/pulp
- [half]: https://crates.io/crates/pulp
+ [half]: https://crates.io/crates/half
src/fallback.rs
// Discussion question: what's the natural width? Most other implementations
// implementations default to 1, but perhaps we want to express the idea
- // Discussion question: what's the natural width? Most other implementations
- // implementations default to 1, but perhaps we want to express the idea
+ // Discussion question: what's the natural width? Most other implementations
+ // default to 1, but perhaps we want to express the idea
src/x86_64/avx2.rs
/// The SIMD token for the "avx2" level.
///
/// This is short for the "x86-64-v3 microarchitecture level". In this level, the
/// following target_features are enabled: "avx2", "bmi2", "f16c", "fma", "lzcnt".
The docs are here twice.
/// The SIMD token for the "avx2" level.
///
/// This is short for the "x86-64-v3 microarchitecture level". In this level, the
/// following target_features are enabled: "avx2", "bmi2", "f16c", "fma", "lzcnt".
src/x86_64/avx2.rs
/// This is short for the "x86-64-v3 microarchitecture level". In this level, the
/// following target_features are enabled: "avx2", "bmi2", "f16c", "fma", "lzcnt".
Is authoritative documentation of CPU feature hierarchies available somewhere?
That's a good question, and when you look at other implementations you see immense confusion and potential bugs that will impact fringe x86 clones (I linked at least one such bug from my irlo thread).
I'm generally going off Wikipedia for the v3 etc microarchitecture levels. In terms of the graph of implied features, my main source is rustc.
Probably this should be captured in a wiki; there's a ton of lore and I'm not aware of any great authoritative sources.
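One way to sidestep part of the hierarchy question at detection time is to check every feature the level enables rather than inferring the rest from "avx2" alone. The sketch below does that for the five features named in the doc comment under review; the function and constant names are hypothetical, and the feature list is copied from that comment rather than from any authoritative source.

```rust
/// Features listed in the doc comment for the "avx2" level (taken from the
/// PR text, not from an authoritative specification).
pub const AVX2_LEVEL_FEATURES: [&str; 5] = ["avx2", "bmi2", "f16c", "fma", "lzcnt"];

/// Returns true only if the CPU reports every feature of the level.
/// The macro needs string literals, so each feature is spelled out.
#[cfg(target_arch = "x86_64")]
pub fn has_avx2_level() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("bmi2")
        && std::arch::is_x86_feature_detected!("f16c")
        && std::arch::is_x86_feature_detected!("fma")
        && std::arch::is_x86_feature_detected!("lzcnt")
}

/// On other architectures the level is never available.
#[cfg(not(target_arch = "x86_64"))]
pub fn has_avx2_level() -> bool {
    false
}
```

Checking each feature explicitly costs a few extra cached lookups, but it avoids relying on an implication graph that an unusual x86 implementation might not honor.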
This is another exploration, largely to see what the API would look like. In this one, the vector types look like `f32x4<S: Simd>`, where `Simd` is a trait impl'd by zero-sized tokens that witness the CPU capabilities. The vector types are 2-element structs with the corresponding `[f32; 4]` array and the token. Like the original fearless_simd, because they witness feature detection, std::ops can be implemented safely on them.

This design is actually quite a bit more similar to the original fearless_simd idea, and also borrows lots from pulp, including pretty much the entire safe delegation of intrinsics. As far as I can tell, this basically removes the need for target_feature 1.1.
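The vector-with-witness layout described above can be sketched as follows. This is a minimal illustration under assumptions, not the PR's code: the field names, the `Fallback` token, and the scalar `Add` body are hypothetical, and a real implementation would delegate to intrinsics per level.

```rust
use core::ops::Add;

/// Hypothetical marker trait for zero-sized capability tokens.
pub trait Simd: Copy {}

/// Token for the portable fallback; always safe to construct.
#[derive(Clone, Copy, Debug)]
pub struct Fallback;
impl Simd for Fallback {}

/// Four f32 lanes plus the token that witnesses the CPU capabilities.
/// The token is zero-sized, so the struct is as large as `[f32; 4]`.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug)]
pub struct f32x4<S: Simd> {
    pub val: [f32; 4],
    pub simd: S,
}

impl<S: Simd> f32x4<S> {
    /// Broadcast one value to all lanes; requires a token to construct.
    pub fn splat(simd: S, x: f32) -> Self {
        f32x4 { val: [x; 4], simd }
    }
}

/// Because every value carries its witness, `+` can be a safe operation.
/// This body is a scalar stand-in for a per-level intrinsic.
impl<S: Simd> Add for f32x4<S> {
    type Output = Self;

    fn add(self, rhs: Self) -> Self {
        f32x4 {
            val: core::array::from_fn(|i| self.val[i] + rhs.val[i]),
            simd: self.simd,
        }
    }
}
```

For example, `f32x4::splat(Fallback, 1.5) + f32x4::splat(Fallback, 2.0)` yields a vector whose lanes are all `3.5`.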
I think one thing that works well is the `as_neon()` method on the level enum, which can also be gotten from the token. This is a way to get downcasting, so you can write code completely generic on the Simd level (this is the main motor for multiversioning) while also

One downside of the vector-with-witness approach is that you can't do safe transmute on it (i.e., they can't implement the bytemuck traits), otherwise that would let you synthesize a token not actually supported by the CPU.
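The `as_neon()` downcast described above might look roughly like this. The sketch assumes a `level()` method on the token trait and a `Level` enum whose variants carry tokens; those shapes, and the `hsum` kernel, are hypothetical. On aarch64 the generic function takes a Neon-specific path when the downcast succeeds; elsewhere it falls through to portable code.

```rust
#[derive(Clone, Copy, Debug)]
pub struct Fallback;

/// Hypothetical Neon token; holding one means detection succeeded.
#[cfg(target_arch = "aarch64")]
#[derive(Clone, Copy, Debug)]
pub struct Neon {
    _private: (),
}

#[derive(Clone, Copy, Debug)]
pub enum Level {
    Fallback(Fallback),
    #[cfg(target_arch = "aarch64")]
    Neon(Neon),
}

impl Level {
    /// Safe runtime detection.
    pub fn new() -> Self {
        #[cfg(target_arch = "aarch64")]
        {
            if std::arch::is_aarch64_feature_detected!("neon") {
                return Level::Neon(Neon { _private: () });
            }
        }
        Level::Fallback(Fallback)
    }

    /// Downcast: recover the concrete token if this level is Neon.
    #[cfg(target_arch = "aarch64")]
    pub fn as_neon(self) -> Option<Neon> {
        match self {
            Level::Neon(neon) => Some(neon),
            _ => None,
        }
    }
}

pub trait Simd: Copy {
    fn level(self) -> Level;
}

impl Simd for Fallback {
    fn level(self) -> Level {
        Level::Fallback(self)
    }
}

#[cfg(target_arch = "aarch64")]
impl Simd for Neon {
    fn level(self) -> Level {
        Level::Neon(self)
    }
}

/// Generic over the level, with an optional Neon-specific fast path.
#[inline(always)]
pub fn hsum<S: Simd>(simd: S, x: [f32; 4]) -> f32 {
    #[cfg(target_arch = "aarch64")]
    {
        if let Some(_neon) = simd.level().as_neon() {
            use core::arch::aarch64::{vaddvq_f32, vld1q_f32};
            // SAFETY: the token witnesses Neon, and `x` holds four valid f32s.
            // Note the lane summation order may differ from the scalar path.
            return unsafe { vaddvq_f32(vld1q_f32(x.as_ptr())) };
        }
    }
    let _ = simd; // unused on targets without a specialized path
    x.iter().sum()
}
```

When `hsum` is instantiated with a concrete token and inlined, the `level()` and `as_neon()` calls are on known values, so the compiler has the opportunity to remove the branch.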
Detection is safe. For performance, the preferred pattern is to mint a level enum at a high level (for example, when instantiating a renderer), then pass it around. When inlined, the dispatch should go away. When not inlined, it should be a cheap conditional branch. This also solves the question of how to take the address of a multiversioned function: take the address of the non-generic version that takes the level enum, rather than one that's polymorphic in `<S: Simd>`.

As of the first draft, very few types and operations are implemented, not much more than needed to run the srgb example.
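The "mint a level once, then pass it around" pattern described above could look something like the following sketch. All names here (`Level`, `Avx2`, `scale`, `Renderer`) are hypothetical stand-ins rather than this PR's API. The non-generic `scale` takes the level enum, so it has a single address that can be stored as a plain function pointer.

```rust
/// Hypothetical token; in a real crate it would live in its own module so
/// that only the detection code can construct it.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy, Debug)]
pub struct Avx2 {
    _private: (),
}

#[derive(Clone, Copy, Debug)]
pub enum Level {
    Fallback,
    #[cfg(target_arch = "x86_64")]
    Avx2(Avx2),
}

impl Level {
    /// Safe runtime detection; intended to be called once, high up.
    pub fn new() -> Self {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                return Level::Avx2(Avx2 { _private: () });
            }
        }
        Level::Fallback
    }
}

/// Shared kernel body, inlined into each version below.
#[inline(always)]
fn scale_impl(data: &mut [f32], k: f32) {
    for x in data.iter_mut() {
        *x *= k;
    }
}

/// Same body, compiled with AVX2 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(data: &mut [f32], k: f32) {
    scale_impl(data, k)
}

/// Non-generic entry point: one cheap branch on the level.
pub fn scale(level: Level, data: &mut [f32], k: f32) {
    match level {
        Level::Fallback => scale_impl(data, k),
        #[cfg(target_arch = "x86_64")]
        // SAFETY: the token inside the variant witnesses AVX2 support.
        Level::Avx2(_token) => unsafe { scale_avx2(data, k) },
    }
}

/// Detection happens once at construction; the level is then reused.
pub struct Renderer {
    level: Level,
    kernel: fn(Level, &mut [f32], f32),
}

impl Renderer {
    pub fn new() -> Self {
        Renderer { level: Level::new(), kernel: scale }
    }

    pub fn brighten(&self, pixels: &mut [f32]) {
        (self.kernel)(self.level, pixels, 2.0)
    }
}
```

Storing `scale` in the `kernel` field illustrates the address-taking point: there is no need to pick a monomorphized `<S: Simd>` instance, because the enum-taking function already covers every level.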