Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wire O_DIRECT also to Uncached I/O #17218

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

amotin
Copy link
Member

@amotin amotin commented Apr 4, 2025

Before Direct I/O was implemented, I've implemented lighter version I called Uncached I/O. It uses normal DMU/ARC data path with some optimizations, but evicts data from caches as soon as possible and reasonable. Originally I wired it only to a primarycache property, but now completing the integration all the way up to the VFS.

While Direct I/O has the lowest possible memory bandwidth usage, it also has a significant number of limitations. It require I/Os to be page aligned, does not allow speculative prefetch, etc. The Uncached I/O does not have those limitations, but instead require additional memory copy, though still one less than regular cached I/O. As such it should fill the gap in between. Considering this I've disabled annoying EINVAL errors on misaligned requests, adding a tunable for those who wants to test their applications.

To pass the information between the layers I had to change a number of APIs. But as side effect upper layers can now control not only the caching, but also speculative prefetch. I haven't wired it to VFS yet, since it require looking on some OS specifics. But while there I've implemented speculative prefetch of indirect blocks for Direct I/O, controllable via all the same mechanisms.

Fixes #17027

How Has This Been Tested?

Basic read/write tests with and without O_DIRECT, observing proper cache behavior.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin requested a review from bwatkinson April 4, 2025 19:41
@amotin
Copy link
Member Author

amotin commented Apr 4, 2025

The change is quite invasive on DMU/DBUF APIs. In some places it was not obvious what is better or where to stop, so I am open to comments on whether we'd like to change more, while already there, or less, to keep some parts of compatibility if that is even possible at all with so big change.

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Apr 4, 2025
@tonyhutter
Copy link
Contributor

Would it makes sense to activate this when the user does a fnctl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_SHORT)? I believe we could key off the the file's inode->i_write_hint value on the kernel side.

RWF_UNCACHED also sounds similar to this concept, but I don't think it's merged into the kernel yet.

@amotin
Copy link
Member Author

amotin commented Apr 4, 2025

@tonyhutter I am not familiar with that fcntl, but sounds possible as a next step. I was thinking about posix_fadvise() or something like it.

@tonyhutter
Copy link
Contributor

Ah yes, posix_fadvise() makes sense too.

@amotin amotin force-pushed the direct branch 3 times, most recently from 742a1cb to 7579173 Compare April 7, 2025 01:20
Before Direct I/O was implemented, I've implemented lighter version
I called Uncached I/O.  It uses normal DMU/ARC data path with some
optimizations, but evicts data from caches as soon as possible and
reasonable.  Originally I wired it only to a primarycache property,
but now completing the integration all the way up to the VFS.

While Direct I/O has the lowest possible memory bandwidth usage,
it also has a significant number of limitations.  It require I/Os
to be page aligned, does not allow speculative prefetch, etc.  The
Uncached I/O does not have those limitations, but instead require
additional memory copy, though still one less than regular cached
I/O.  As such it should fill the gap in between.  Considering this
I've disabled annoying EINVAL errors on misaligned requests, adding
a tunable for those who wants to test their applications.

To pass the information between the layers I had to change a number
of APIs.  But as side effect upper layers can now control not only
the caching, but also speculative prefetch.  I haven't wired it to
VFS yet, since it require looking on some OS specifics.  But while
there I've implemented speculative prefetch of indirect blocks for
Direct I/O, controllable via all the same mechanisms.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Fixes openzfs#17027
@amotin
Copy link
Member Author

amotin commented Apr 7, 2025

F_SET_RW_HINT and inode->i_write_hint seem about on-disk life time, not saying anything about caching. May be they could be used for some allocation policies to reduce pool fragmentation, but not for this PR. RWF_UNCACHED indeed looks very similar.

*/
ioflag &= ~O_DIRECT;
if (os->os_direct == ZFS_DIRECT_ALWAYS) {
/* Force either direct of uncached I/O. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"direct or uncached"

@bwatkinson
Copy link
Contributor

So, I am not particularly in favor of a strict being disabled by default. I think we already murky the waters enough when a write I/O is PAGE_SIZE aligned, but not recordsize aligned and we just issue it through the ARC with direct=standard. The strict alignment should be observed in general. Users asked for direct I/O with O_DIRECT and they should get an EINVAL returned living by the only requirement we request, which is PAGE_SIZE alignment.

My concern is people are expecting one thing and get another transparently. This leads to confusion and makes ZFS seem like it is not doing what they strictly asked it to do (do not make any copies of my data).

I also don't like the idea that prefetching can happen if a user explicitly requested O_DIRECT for data blocks. When a user informs us of their intent with O_DIRECT they are seriously saying (or should be), they have no desire to have FS caching take place for data. If we disregard this, then prefetching becomes a nightmare for reads.

I think if we decided to add another dataset property setting for direct such as maybe relaxed, that might make sense with this? However, the default should still be standard in my mind. Truth be told, we only added direct=always for Lustre.

I think we don't need to complicate things more in my opinion. This should be an opt in feature if anything with maybe a new dataset value for direct.

@amotin
Copy link
Member Author

amotin commented Apr 11, 2025

I think we already murky the waters enough when a write I/O is PAGE_SIZE aligned, but not recordsize aligned and we just issue it through the ARC with direct=standard.

My patch actually improves this case, evicting the data from DBUF and ARC caches as soon as possible, reducing cache effects as user asked.

The strict alignment should be observed in general. Users asked for direct I/O with O_DIRECT and they should get an EINVAL returned living by the only requirement we request, which is PAGE_SIZE alignment.

I don't think it has to be strict other than for software testing. The man page on Linux says: "The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O.", so relaxed behavior is not a violation.

My concern is people are expecting one thing and get another transparently. This leads to confusion and makes ZFS seem like it is not doing what they strictly asked it to do (do not make any copies of my data).

We've already found several examples of software having no idea about alignment, but using O_DIRECT, including so general as systemd extensions. Considering they are "broken", I bet more typical Linux file systems to do not enforce alignment. And for us being strict just increase suffering. I am not interested to fix all software in the world, but happy to provide them a testing tool in a shape of a tunable.

I also don't like the idea that prefetching can happen if a user explicitly requested O_DIRECT for data blocks. When a user informs us of their intent with O_DIRECT they are seriously saying (or should be), they have no desire to have FS caching take place for data. If we disregard this, then prefetching becomes a nightmare for reads.

The data prefetch will only activate on misaligned I/Os. So obviously user already does not know what he is doing. It is trivial to actually disable it now, if you insist. But I considered that somebody might use O_DIRECT with Direct I/O disabled via module parameter, just to reduce cache trashing on some parts of workload, and prefetch really give performance improvement in many cases even with NVMe pools if application queue depth is insufficient.

I think if we decided to add another dataset property setting for direct such as maybe relaxed, that might make sense with this? However, the default should still be standard in my mind.

I was actually thinking about additional relaxed value for direct property, but present code asserts on any unknown property value, so addition of any new one would be a pain with a new pool features added, etc. I personally could live with zfs_dio_strict=1 by default, overriding it locally, but I can't agree that it is productive for the mass market.

Truth be told, we only added direct=always for Lustre.

It is likely the only way to use Direct I/O with SMB and NFS also, since they have no concepts of alignment in protocol. So it seems the world is somewhat less "perfect" than we'd like. ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

systemd-sysext fails to install extensions with Direct I/O
3 participants