Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove support for not creating local MRs #796

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bwbarrett
Copy link
Contributor

Commits 27548d4 and a1ac0f5 fixed providers that required memory registrations when FI_MR_LOCAL was set, but also broke providers that clear FI_MR_LOCAL (such as HPE's provider), because I did not account for the mr_local handling in the send/recv transport.

We don't have a great way to test that case from AWS, the vast majority of transfers will either be HMEM (which creates an MR anyway) or control messages (which have a freelist which creates an MR), and the Libfabric specification allows passing an MR descriptor to a local operation even if the provider clears the FI_MR_LOCAL bit. Therefore, the best path forward seems to be removing the code to skip registration if FI_MR_LOCAL is cleared, and always creating an MR.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Commits 27548d4 and a1ac0f5 fixed providers that required memory
registrations when FI_MR_LOCAL was set, but also broke providers that
clear FI_MR_LOCAL (such as HPE's provider), because I did not account
for the mr_local handling in the send/recv transport.

We don't have a great way to test that case from AWS, the vast majority
of transfers will either be HMEM (which creates an MR anyway) or control
messages (which have a freelist which creates an MR), and the Libfabric
specification allows passing an MR descriptor to a local operation even
if the provider clears the FI_MR_LOCAL bit.  Therefore, the best path
forward seems to be removing the code to skip registration if
FI_MR_LOCAL is cleared, and always creating an MR.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
@bwbarrett bwbarrett requested a review from a team as a code owner February 26, 2025 23:20
@bwbarrett
Copy link
Contributor Author

Created as an alternative to #793, so that we don't trip over this sharp edge again.

@ryanhankins
Copy link
Contributor

Thank you for doing that. It works and is a bit improvement over #793.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants