zfs-check: efficient handling of sparse files #225
After studying the code I thought to make a simple change first, and changed the hash from `sha1` to `xxh64`. So, I followed the Development wiki page and made the following change: […]

You can see in the following quick sample that […]

As another quick sanity check, to test the performance difference of […]

So it's clear to see that […]

My conclusion right now is that […]

Something else that can make a significant performance difference is the […]

Comments and critique welcome.
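For anyone wanting to reproduce this kind of comparison, a rough single-threaded timing sketch is below; it is not part of zfs-check and assumes the optional `python-xxhash` package is installed:

```python
# Rough single-threaded hash throughput comparison: sha1 vs xxh64.
# Illustrative sketch only -- results will vary per CPU.
import hashlib
import time

import xxhash  # optional third-party package: pip install xxhash

CHUNK = b"\0" * (1024 * 1024)  # 1 MiB of zeros
TOTAL_MIB = 4096               # hash 4 GiB worth of chunks


def throughput(make_hasher):
    start = time.perf_counter()
    for _ in range(TOTAL_MIB):
        h = make_hasher()
        h.update(CHUNK)
        h.hexdigest()
    elapsed = time.perf_counter() - start
    return TOTAL_MIB / elapsed  # MiB/s


print("sha1 : %8.1f MiB/s" % throughput(hashlib.sha1))
print("xxh64: %8.1f MiB/s" % throughput(xxhash.xxh64))
```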
sounds interesting. i'm all for changing the default hash algo to something like xxh64, if that's not a too obscure library. we're still in development/beta anyway, and zfs-check isn't widely used, so we can make big changes.
same goes for a different default block-size and count, i didn't do much testing to choose those. (just guesstimated values)
it's very cool actually, this is very useful to test huge volumes :)
Hey. Acknowledged. It might be useful to offer the user control of the hash class via an option, and perhaps default to a (currently) secure but faster hash like sha512, for example offering good options like […]

Unrelated to sparse files, from a security perspective it might be desirable to ask the user for a hash "pepper", which would make the computed hash lists more secure and disconnected from the contents of the original blocks. And/or perhaps warn users about the potential risks of non-salted or non-peppered hashes.

Thinking out loud, it would be possible to make further optimisations for reading sparse blocks, similar to the sparse logic I outlined in post 1. Hole chunks have identical checksums, so computation of the zeros can be skipped and replaced with pre-computed values. This might be overly complicated, however, and be a source of bugs, so it's perhaps a risk vs. reward decision.
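To illustrate the pepper idea, a minimal sketch (purely hypothetical; no such option exists in zfs-check today, and `PEPPER` here is just an example value):

```python
# Illustrative sketch of a "peppered" chunk hash as described above.
# A secret, user-supplied pepper is mixed into every chunk hash so that a
# published hash list cannot be matched against guessed chunk contents.
import xxhash  # optional third-party package: pip install xxhash

PEPPER = b"example-secret-value"  # hypothetical; would come from user input/config


def peppered_chunk_hash(chunk: bytes) -> str:
    h = xxhash.xxh64()
    h.update(PEPPER)
    h.update(chunk)
    return h.hexdigest()
```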
i think you're confusing password-hash security with block-hash security. even with a chunk size of only 1MB, it would be impossible to guess the original file from a hash i think? yes, it might be possible to create a hash collision if you use an insecure hash function, which is bad in the case of password security and authentication. but a collision is no problem in our case.
(so i don't think we need to make it configurable yet)
My thinking certainly comes from the security of password hashes and content-token hashes, for example content tokens used to tokenise GDPR-sensitive data sets. From an InfoSec best-practice perspective, I thought it might be good to increase the improbability and decouple the chunk hashes from the actual chunk bytes. Yes, the probability of collisions is low for a given call to […]

Back on to the main topic: I ran my tweaked `zfs-check` against both snapshots. The jobs took approx. 6.08 and 5.51 hours respectively. Here are some samples from the left job
Here are some samples from the right job
So it's clear to see that […]

Good news (for my data at least 😉): both snapshots appear to be intact, which is nice given my recent events losing a disk etc. The pools and datasets have slightly different configuration, specifically around the […]
A note on HOLE chunks, which in this test all hash to an identical, pre-computable value: […]
Sanity check on the HOLE chunk checksum:
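(The captured output isn't reproduced here, but as a sketch the pre-computed digest of an all-zero chunk can be regenerated like this, assuming the 1024000-byte chunk size discussed below and the xxh64/sha1 hashes:)

```python
# Sanity check: digest of an all-zero ("HOLE") chunk.
# Every hole chunk of a given size hashes to the same value, so it can be
# computed once and reused.
import hashlib
import xxhash  # optional third-party package: pip install xxhash

CHUNK_SIZE = 1024000          # the (slightly off) "1 MiB" used in this test
zero_chunk = b"\0" * CHUNK_SIZE

print("xxh64:", xxhash.xxh64(zero_chunk).hexdigest())
print("sha1 :", hashlib.sha1(zero_chunk).hexdigest())
```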
Math bug: I used duckduckgo or google to convert 1 MiB to bytes, didn't verify the answer was correct, and for some reason ended up with 1024000 bytes (rounding issue?). The correct result is 1 MiB = 1048576 bytes (1024^2), so the value I got and used was missing 24 KiB ¯\(ツ)/¯. It's not a deal-breaker for this testing and doesn't invalidate anything. One potential negative is sub-optimal, unaligned block-read I/O, but in theory the ZFS I/O scheduler should smooth most of that out anyway. netgraph showed the average I/O operation size was consistently ~1MiB: […]

Next steps?

So where do we go from here? I'm not sure I have a nice environment to run the docker test stuff without having to create something. Would you prefer to take ownership and make the changes to offer a `--hash` option? I could offer to create an lxc to run docker on my lab, try to get the test stuff working, and then make a PR. This might get delayed because, tbh, messing around with lxc and docker isn't my favourite pastime! It's been a PITA in the past. How are you running docker and the tests in your development setup?
ah i understand where you're coming from in that case. for now, let's not add salts, to keep things simple. just create a pull request that changes the default hash. if you really want a --hash option, feel free to add it. don't worry about the tests too much. also: github will run the testsuite when you create a pull request.
Acknowledged. I'll put a PR together in the near future.

A quick note from further testing: it would appear 4.21 GiB/s is the upper CPU boundary on my system for a single hashing thread […]. Hypothetically […]
BlockHasher.py

* hash_class is sourced from cli args instead of hardcoding it.
* hash_factory() lays the groundwork to support arbitrary hash libs.
* Detection and use of the xxhash lib.

ZfsCheck.py

* Implement new cli arg --hash. The choices for the arg are generated based on what is detected in the python env.
* The input to --hash is validated against the arg choices.
* Implemented helper method determine_algorithms_available(). This tries to pick a performant default with a fallback to sha1.
* Detection and use of the xxhash lib.
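For readers skimming the thread, a rough sketch of what those two helpers could look like (illustrative only, not the actual PR code; it assumes the optional `python-xxhash` package):

```python
# Illustrative sketch of the helpers described above -- not the actual PR code.
import hashlib


def determine_algorithms_available():
    """Return supported hash names, preferred/fastest first, sha1 as fallback."""
    algorithms = []
    try:
        import xxhash  # noqa: F401  (imported only to detect availability)
        algorithms.append("xxh64")
    except ImportError:
        pass
    algorithms.append("sha1")  # always available via hashlib
    return algorithms


def hash_factory(name):
    """Return a zero-argument callable that creates a new hasher object."""
    if name == "xxh64":
        import xxhash
        return xxhash.xxh64
    return lambda: hashlib.new(name)
```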
Greetings,
This is a feature/enhancement request for `zfs-check`, and a braindump on a possible implementation.

There has been discourse on this topic in the past, in June 2022 on reddit. Screenshot for quick reference and here is a link to the thread:
Having recently restored some datasets from backups, I had reason to do some checksums on various parts of the data, and `zfs-check` came back to my mind as a nice tool for this.

I checked the latest code in the repo and didn't see any support for sparse file handling yet, so I thought I'd make this request and provide some details. If I get the time I will make a PR for it myself, but in the meantime here is the request/info.
To illustrate what I'm referring to, take for example a raw GPT partition file with XFS partition(s) stored on a ZFS dataset. Example creation command: `qemu-img create -f raw test.raw 4T`, then the usual `gdisk` commands to create the GUID partition table, and `mkfs.xfs -m crc=1,reflink=1` to create a filesystem partition.

This set of commands would create a sparse 4TiB file (in my case stored on a zfs dataset), create the partition table, and format a partition with XFS; this raw disk would then be provisioned on a kvm, for example running on proxmox.
Today when `zfs-check` checks such a file, it will chunk up and iterate over all the bytes in the sparse file. It will not treat the sparse blocks differently from normal blocks; therefore it will actually read 4TiB of bytes and compute the checksums for all the chunks. This was covered in the reddit post from June 2022 (see above).

What is a sparse file? A sparsely allocated file is a file that contains one or more HOLE(s). A HOLE is a sequence of zeros that (normally) has not been allocated in the underlying file storage. Cite: https://linux.die.net/man/2/lseek
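As a quick illustration (a sketch using only the standard library; `test.raw` is just the example file from above), sparseness can be observed by comparing the apparent size with the allocated blocks, and the first hole can be located with `os.lseek`:

```python
# Sketch: detect sparseness and find the first HOLE of a file.
import os

path = "test.raw"  # example file from the qemu-img command above

st = os.stat(path)
apparent = st.st_size            # logical size in bytes
allocated = st.st_blocks * 512   # st_blocks is in 512-byte units
print("apparent: %d bytes, allocated: %d bytes, sparse: %s"
      % (apparent, allocated, allocated < apparent))

fd = os.open(path, os.O_RDONLY)
try:
    # SEEK_HOLE/SEEK_DATA are supported on Linux and by OpenZFS.
    first_hole = os.lseek(fd, 0, os.SEEK_HOLE)
    print("first hole starts at offset", first_hole)
finally:
    os.close(fd)
```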
As a sanity check for the post from June 2022, I checked how long it takes to "optimally read" the sparse `store1` 4TiB raw disk, with its normal allocation of ~2.6TiB and the remainder sparsely allocated, versus a `zfs-check` read: […]

So, by my calculation there is definitely room for optimisation in `zfs-check` in this regard.

An example of common utilities handling sparse files efficiently is as follows: consider using `dd` to read a 4TiB sparse file in the following way, and `xxh64sum` to create the checksum: […]

OK, nothing special on face value, BUT when we study the details during the operation using `pv` to monitor the relevant file descriptor, the output format is `$(date) %{Current data transfer rate} %{Average data transfer rate} %{Bytes transferred so far}`: […]

What you see here is a human-readable log containing 2-minute samples of the progress and data rates for the `dd` operation.

Note the 3rd column, the `%{Average data transfer rate}`. You'll notice how sometimes a sample is a few hundred MiB/s, which is determined by the physical limits of the underlying vdev(s) read speed when reading normally allocated blocks, and sometimes you'll see ~2GiB/s, which is obviously an order of magnitude faster. The high-speed samples document where `dd`, its libraries, the kernel and OpenZFS have conspired together to efficiently process the sparse blocks and sections of the sparse 4TiB raw partition file.

NB: In my example I used `xxHash` by Yann Collet because, as per the name, it is an "Extremely fast non-cryptographic hash algorithm" (github link): it's modern, maintained, and it can provide "near RAM speed" hashing. `zfs-check` defaults to `sha1`, which has limited performance and is perhaps not best suited for sparse file handling. Note the hash performance comparison table on the xxHash GitHub README.

The sparse allocation optimisation is also reflected in the 4th column, `%{Bytes transferred so far}`: in the samples that illustrate reading of the sparsely allocated blocks, the amount of logical data read per sample is naturally a lot more than in the samples for normally allocated blocks.

Outcome: progress in reading through the file is much quicker (than `zfs-check`) because the sparse allocations/sections are handled efficiently.

A quick note on the system in my examples. It's an entry-level enterprise storage node running proxmox 7.x, which is Debian bookworm based, running the Ubuntu 6.2.x kernel, with 12 physical Xeon CPU cores over two sockets and 128 GB RAM.
For my `dd` example, what I've not yet understood is which code (or code combination) in the chain is responsible for the sparse file reading optimisation. From glancing at the `dd` code it doesn't seem that the optimisation is happening at that level, and you can see from my `dd` example that I'm not giving it special arguments to handle sparse files. I think the optimisation is more likely to be located in the lower-level `read()` and `seek()` functions, in combination with the filesystem's support for sparse file allocation. OpenZFS definitely supports efficient sparse file reads, as demonstrated in the `dd` example provided.

On Debian the `coreutils` package, which contains the `dd` source, might have some clues: https://sources.debian.org/src/coreutils/9.1-1/. I took a quick look but didn't find a smoking gun; in fact the `dd` code is primarily using `lseek (fd, offset, SEEK_CUR)`, which means I missed something, or the optimisation is actually happening in the kernel/filesystem code (for example fs/read_write.c here) or in OpenZFS code. NB: Proxmox uses the Ubuntu kernel.

For reference, and perhaps part of the solution for `zfs-check`, for python there is a useful (but probably not exactly what is needed) answer here: https://stackoverflow.com/a/62312222. This approach seeks through the entire file first to determine the ranges of DATA and HOLE(s).
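In that spirit, a minimal sketch of enumerating a file's DATA and HOLE ranges with `os.lseek` (illustrative only; error handling kept to a minimum):

```python
# Sketch: enumerate DATA/HOLE ranges of a file using SEEK_DATA / SEEK_HOLE.
import os


def data_hole_ranges(path):
    """Yield (kind, start, end) tuples, kind being "DATA" or "HOLE"."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            # The end of the current DATA run is the start of the next hole.
            hole = os.lseek(fd, offset, os.SEEK_HOLE)
            if hole > offset:
                yield ("DATA", offset, hole)
            if hole >= size:
                break
            try:
                data = os.lseek(fd, hole, os.SEEK_DATA)
            except OSError:  # ENXIO: no more data, the file ends in a hole
                data = size
            yield ("HOLE", hole, data)
            offset = data
    finally:
        os.close(fd)


for kind, start, end in data_hole_ranges("test.raw"):
    print("%4s %16d %16d" % (kind, start, end))
```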
From what I can tell, `zfs_autobackup/BlockHasher.py` will need a logic update, something like the following (a rough code sketch follows after this list):

- Use `lseek(fd, offset, SEEK_HOLE)` to determine if the offset is the end of the file.
- If the chunk at the current offset of `fd` is normal DATA - the existing logic can be used.

SPARSE LOGIC:

- Use `lseek(fd, offset, SEEK_DATA)` until a HOLE chunk is detected (a chunk that is a sequence of zeros).
- Use `lseek(fd, offset, SEEK_HOLE)` to iterate forward, seeking n chunks, to find the end of the HOLE.
- While the returned `offset` keeps advancing, we are still in a contiguous hole; count it in `nr_contiguous_hole_chunks`.
- When the returned `offset` stops advancing and is != EOF, there is some more DATA to process.
- Emit the pre-computed hash for `zeros` x `nr_contiguous_hole_chunks` x `chunk_size`.
- Move `fd` to resume processing DATA at the end offset of the last detected contiguous HOLE chunk. The logic then iterates over the remaining DATA.

It would be interesting to find examples of other code that has implemented similar logic, to sanity check this logic and check for any inefficiencies or flaws.
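A minimal Python sketch of that flow (illustrative only, not zfs-check code; it assumes a fixed 1 MiB chunk size, SEEK_DATA/SEEK_HOLE support in the kernel/filesystem, and the optional `python-xxhash` package):

```python
# Sketch of hole-aware chunk hashing: DATA chunks are read and hashed,
# HOLE chunks reuse one pre-computed digest of an all-zero chunk.
import os
import xxhash  # optional third-party package: pip install xxhash

CHUNK_SIZE = 1024 * 1024  # 1 MiB; zfs-check's real chunking/count options differ


def chunk_hashes(path, chunk_size=CHUNK_SIZE):
    """Yield (chunk_nr, hexdigest) for every chunk of the file."""
    zero_digest = xxhash.xxh64(b"\0" * chunk_size).hexdigest()
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            chunk_nr = offset // chunk_size
            # Where does the next hole start? (EOF counts as a hole.)
            next_hole = os.lseek(fd, offset, os.SEEK_HOLE)
            if next_hole >= offset + chunk_size:
                # The whole chunk is DATA: read and hash it normally.
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, chunk_size)
                yield chunk_nr, xxhash.xxh64(data).hexdigest()
            else:
                # The chunk contains a hole. Find where data resumes...
                try:
                    next_data = os.lseek(fd, offset, os.SEEK_DATA)
                except OSError:      # ENXIO: nothing but holes until EOF
                    next_data = size
                if next_data >= offset + chunk_size:
                    # ...and if it is beyond this chunk, the chunk is all
                    # zeros: skip the read and reuse the pre-computed digest.
                    yield chunk_nr, zero_digest
                else:
                    # Partially sparse chunk: fall back to a normal read
                    # (short reads at EOF are hashed as-is for simplicity).
                    os.lseek(fd, offset, os.SEEK_SET)
                    data = os.read(fd, chunk_size)
                    yield chunk_nr, xxhash.xxh64(data).hexdigest()
            offset += chunk_size
    finally:
        os.close(fd)
```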
💡 IDEA: Maybe it's enough to switch the `zfs_autobackup/BlockHasher.py` from using `fh.seek(pos)` to using `os.lseek(fd, pos, whence)`, and to use `xxhash` rather than the slower `sha1`, and the optimisation would be automatically performed at a lower level? As per the `dd` example, which is using `lseek (fd, offset, SEEK_CUR)` with no mention of `SEEK_DATA` or `SEEK_HOLE`. This might be worth a try before trying to invent new logic per my brain dump herein.

python ref for `os.lseek()`: https://docs.python.org/3/library/os.html#os.lseek
python ref for `fd.seek()`: https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

Thanks for reading, I look forward to any comments or critique.
Cheers
Kyle