Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pluggable output backends #2943

Open
adetaylor opened this issue Sep 27, 2024 · 5 comments
Open

Pluggable output backends #2943

adetaylor opened this issue Sep 27, 2024 · 5 comments

Comments

@adetaylor
Copy link
Contributor

I was chatting with @pvdrz at the Rust for Linux conference about some bold ideas for bindgen and he thought I should write them up, so here goes.

The problem space

cxx is a pretty popular C++ interop tool these days. Its biggest advantage is that Rust code manipulates C++ data using friendly vocabulary types such as cxx::UniquePtr<T> and cxx::CxxString. These make FFI much safer in practice than bindgen-style bindings, because for example there's no need to track object lifetimes, nor string lengths, etc. Irrespective of the specifics of the policy relating to the unsafe keyword, cxx-style interop is much less error prone in practice than bindgen-style FFI.

However, cxx can't generate such bindings from pre-existing C++ headers. The full language boundary needs to be declared in an IDL (which looks a lot like Rust code but actually isn't quite).

For years now, the Chromium project has wanted to combine the ergonomic advantages of cxx with bindgen-style generation from headers:

The idea

The idea is: bindgen can output different "flavors" of bindings. Perhaps this is pluggable, perhaps there are multiple flavors built in.

  • Basic flavor: current bindgen output. Lots of pointers.
  • cxx-like flavor: detects things like std::unique_ptr and substitutes them with cxx::UniquePtr, generating both C++ and Rust side code to make this possible. This is basically what autocxx does but it's really pretty complicated since it's a post-processor, a lot of the complexity would drop away if it were built into bindgen itself.
  • Future flavors could use a CppRef<T> type for C++ references, etc.

What this would require

It would require:

  • bindgen gains the ability to generate C++ side shim functions as well as Rust side code.
  • Some amount of modularity so that these output backends are not very invasive into the bindgen code. To give an indication of what this takes, autocxx has had to fork bindgen primarily to add extra information about the generated items. The fork is here, with a partial list of the reasons here.

Simplest first step

All this sounds vague. A hypothetical first step could be a --emit-cxx-crate-strings which:

  • Spots all use of const &std::string. If a C++ function takes such a parameter,
  • A synthetic Rust-side function is created which takes a &cxx::CxxString and then calls the real function

This is just a small step, but users of both cxx and bindgen could immediately benefit because they can create such strings in Rust.

Simplest second step

The next step after that would be to support std::string by value, which would require either cunning to store self-referential types on the Rust stack or, more likely, wrapping the std::string in a std::unique_ptr. This would require a small C++ shim function to be generated to unwrap it then pass it into the original C++ API.

@pvdrz
Copy link
Contributor

pvdrz commented Sep 27, 2024

Hey 👋

I support this proposal for several reasons. The first comes from a "human-cost" perspective and its based around the idea that we should join efforts so we can have more eyes in fewer tools instead of having more tools being maintained in isolation.

The second one is that even though this sounds like a pretty big change as a whole, most of the bold things stated here are already done by bindgen in one way or another:

  • We already generate C code for the --wrap-static-fns, so generating C++ code wouldn't be that much of a leap.
  • We already generate Rust code using third party crates such as libloading.

The general idea of having gated outputs sounds reasonable for me, specially since it doesn't affect the default behavior of bindgen and it won't break other people's code. Relying on cxx types doesn't worry me as the crate itself is stable so we shouldn't expect any breaking changes in the future.

My only remaining question here is how do we start? Would anyone be willing to submit some patches with this? or do you consider we should have some discussions before to iron out some kinks?

@emilio
Copy link
Contributor

emilio commented Sep 27, 2024

Yeah, not opposed to something like this, but FWIW you can already do stuff like "replace std::string with my custom type" (same for unique_ptr and pretty much anything).

E.g., for a header like:

// t.hpp
#include <string>

struct Something {
  std::string m_member;
};

And a bindgen invocation like:

$ ./target/debug/bindgen t.hpp \
    --enable-cxx-namespaces \
    --allowlist-item "Something" \
    --blocklist-item "std::.*" --blocklist-item "__gnu_cxx::.*" \
    --module-raw-line "root::std" "pub use cxx::CxxString as string;" 

You get this (cleaned up a little bit):

/* automatically generated by rust-bindgen 0.70.1 */

#[allow(non_snake_case, non_camel_case_types, non_upper_case_globals)]
pub mod root {
    #[allow(unused_imports)]
    use self::super::root;
    pub mod std {
        #[allow(unused_imports)]
        use self::super::super::root;
        pub use cxx::CxxString as string;
    }
    #[repr(C)]
    pub struct Something {
        pub m_member: root::std::string,
    }
    #[allow(clippy::unnecessary_operation, clippy::identity_op)]
    const _: () = {
        ["Size of Something"][::std::mem::size_of::<Something>() - 32usize];
        ["Alignment of Something"]
            [::std::mem::align_of::<Something>() - 8usize];
        ["Offset of field: Something::m_member"]
            [::std::mem::offset_of!(Something, m_member) - 0usize];
    };
}

We use that extensively in Firefox FWIW.

@adetaylor
Copy link
Contributor Author

Thanks for the positive response!

@emilio yep! The prelude and command-line which autocxx feeds to bindgen is quite long :) However cxx is also opinionated about what it does not support, and one of those cases is std::string as a member variable, since it can't be safely moved by Rust. More generally, most C++ types are represented in cxx as zero sized types so there can't be overlapping reference UB. (Alternative approaches are to wrap everything in UnsafeCell and MaybeUninitialized or, as noted before, to never allow Rust references to C++ types).

So that's why I'm first broaching this as a kind of "philosophy of bindgen" discussion: these hypothetical pluggable backends might make quite different decisions to what bindgen does now:

  • don't necessarily attempt to represent everything, but be opinionated about rejecting things which can't be used ergonomically (ergonomically = basically, without pointers being involved)
  • ban some things in some contexts
  • do not use raw pointers
  • ensure all C++ types can never be accessed by value
  • always call through to C++ for any field access
    etc.

@pvdrz

My only remaining question here is how do we start? Would anyone be willing to submit some patches with this? or do you consider we should have some discussions before to iron out some kinks?

I guess one way forward would be for me to write a document or something along the lines of "if autocxx+cxx had been implemented as a bindgen plugin, what would it look like?" and then y'all can see how revolted you are :) Does that sound like a plan? I am unlikely to have time to actually reimplement autocxx (it's 44 KLOC, and I'm not sure that's the right move anyway) but perhaps such a document can move the discussion forward? Like you, I'm really interested in joining efforts to reduce duplication.

@pvdrz
Copy link
Contributor

pvdrz commented Oct 2, 2024

Thanks for the positive response!

@emilio yep! The prelude and command-line which autocxx feeds to bindgen is quite long :) However cxx is also opinionated about what it does not support, and one of those cases is std::string as a member variable, since it can't be safely moved by Rust. More generally, most C++ types are represented in cxx as zero sized types so there can't be overlapping reference UB. (Alternative approaches are to wrap everything in UnsafeCell and MaybeUninitialized or, as noted before, to never allow Rust references to C++ types).

So that's why I'm first broaching this as a kind of "philosophy of bindgen" discussion: these hypothetical pluggable backends might make quite different decisions to what bindgen does now:

* don't necessarily attempt to represent everything, but be opinionated about rejecting things which can't be used ergonomically (ergonomically = basically, without pointers being involved)

* ban some things in some contexts

* do not use raw pointers

* ensure all C++ types can never be accessed by value

* always call through to C++ for any field access
  etc.

I think this is fine and it might be an actual advantage as having a smaller surface of code that's actually accepted might make the whole task more approachable.

I'm just shooting at the air here but maybe we could expose a limited subset of bindgen internals so we can write these backends as separate crates, where the existing backend is just one of these crates.

@pvdrz

My only remaining question here is how do we start? Would anyone be willing to submit some patches with this? or do you consider we should have some discussions before to iron out some kinks?

I guess one way forward would be for me to write a document or something along the lines of "if autocxx+cxx had been implemented as a bindgen plugin, what would it look like?" and then y'all can see how revolted you are :) Does that sound like a plan? I am unlikely to have time to actually reimplement autocxx (it's 44 KLOC, and I'm not sure that's the right move anyway) but perhaps such a document can move the discussion forward? Like you, I'm really interested in joining efforts to reduce duplication.

That sounds like a good way to kickstart the discussion 🚀

adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 14, 2025
This is a major change to autocxx-bindgen which aims to improve maintainability
by reducing divergence from upstream bindgen. Specifically, it:

* Significantly reduces the textual diffence
* Makes the remaining changes less invasive, such that merge conflicts should
  be easier to resolve
* Redoes the changes such that they're more attuned to the current
  evolution of upstream bindgen, so it may be possible to upstream some or
  ideally all of these changes. (The ultimate goal is to unfork bindgen!)

See google/autocxx#124 and
rust-lang#2943 for the background here.

Specifically this change:
* Removes the #[bindgen_semantic_attributes] which were added to all
  sorts of items. Instead,
* Much more is communicated via the existing ParseCallbacks mechansim.
* In some cases, it's still necessary to annotate individual types -
  in this case we generate a newtype wrapper instead of attributes.

This commit also re-enables the bindgen test suite. It does not yet add
tests for all the above new functionality; that's yet to come.
@adetaylor
Copy link
Contributor Author

An update here: since I wrote this, it seems that bindgen's ParseCallbacks type has become a bit more versatile. With some enhancements it may be able to fill this need. Over in google/autocxx#124 I'm making progress in that direction.

adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 20, 2025
This PR reports the memory layout information (size, alignment, packed-ness)
for each struct into the ParseCallbacks.

The use-case for this is fairly complex. Postprocessors such as autocxx
currently do not necessarily pass all bindgen's output through: autocxx is
opinionated in deciding what C++ types are "reasonable" to expose to Rust. For
example types containing C++ strings aren't represented to Rust
by-value. Instead, these types are represented as "opaque" types in the cxx
sense (https://cxx.rs/extern-c++.html#opaque-c-types). As far as Rust is
concerned, these types have no fields and can only be poked at using methods.

In order to make these allocable on the stack, autocxx _does_ need to know
the layout of these structs, so it can make an "opaque" fake struct in lieu
of the real one. That's what is enabled by this callback. autocxx and other
tools can post-process the bindgen output to replace bindgen's generated
type with an opaque data-only struct like this:

  #[repr(align(16),packed)]
  pub struct TheType {
    _data: UnsafeCell<MaybeUninit<[u8;4]>> // etc.
  } // the actual struct is quite a bit more complex

The extra information provided in the callback within this PR allows that.

For completeness, there are multiple ways that the whole stack could
be rearchitected to avoid the need for this.

1. We could teach bindgen all the rules for when a struct should be
   "opaque", then bindgen could generate it that way in the first place.
   This suggests major extensibility of bindgen to allow highly
   opinionated backends, along the lines of rust-lang#2943.

2. That decision could be delegated to the ParseCallbacks with a new
      fn make_struct_opaque(&self, some_struct_info) -> bool
   call. Although this sounds simple, the decisions required here are
   very complex and depend (for instance) upon all the different
   constructors, ability to represent template params, etc., so
   in practice this would require post-processors to run bindgen twice,
   once to gather all that information and then again to give
   truthful answers to that callback.

3. Post-processors such as autocxx could instead generate opaque types
   like this:
     #[repr(transparent)]
     pub struct TheType {
       _hidden_field:
         UnsafeCell<MaybeUninit<TheTypeAsGeneratedByBindgen>>
     } // it's more complex in reality
   or similar. This turns out to be very complex because
   TheTypeAsGeneratedByBindgen might include complex STL structures
   such as std::string which autocxx currently instructs bindgen to
   replace with a simpler type, precisely so it can avoid all this
   complexity.

Any of these options are months of work either in bindgen or autocxx
or both, so I'm hoping that it's OK to simply emit the layout
information which bindgen already knows.

Compatibility: this is a further compatibility break to the
DiscoveredItem enum. However, it seems to be accepted that we're going
to grow that enum over time. If the compatibility break is a concern
this information could be reported via a separate callback.

Part of google/autocxx#124
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants