Support lz4 page images compression #5913
Conversation
Some points:
pageserver/src/walingest.rs
// compression of WAL is not yet supported: fall back to storing the original WAL record
&& !postgres_ffi::bkpimage_is_compressed(blk.bimg_info, self.timeline.pg_version)?
// only lz4 compression of WAL is now supported; for other compression algorithms fall back to storing the original WAL record
&& (!postgres_ffi::bkpimage_is_compressed(blk.bimg_info, self.timeline.pg_version)? ||
Postgres supports zstd compression since v15 (at least it was mentioned for basebackup, but I'm not sure about WAL).
Yes, PG15 supports lz4 and zstd WAL compression.
Should we include tests showing that some simple workload works with a compute created with this Postgres option enabled?
Looked at lz4_flex; at least fuzzing and Miri testing are in place. I don't think the missing async-compression support is an issue; it is questionable to use that crate anyway.
Why? Could you expand? We use it in many places.
Compression is a CPU-intensive task, so there is not much reason to make it async. Async can be useful if we have an existing async stream (file, socket, ...) and want to add compression to it. But, for example, in the case of WAL compression (or more precisely, compression of FPIs in WAL records) we can just use
CPU-intensive tasks also need to be made async, because you must yield to the executor regularly, whether you are waiting on I/O or doing CPU work. Unless we use
We could maybe get away with it thanks to the fact that lz4 is so extremely fast that it might not matter much.
This is true in general, but it looks like this decompression call has a bounded runtime because it handles only one page at a time. I would suggest just adding a tokio::task::yield_now() after the decompression, so that if we have a WAL record with many blocks, we are not hogging the executor for the entire decode of the WAL record, but only for one block's decompression work at a time.
I would like to see at least one test in this PR before merging it.
Please note that we are decompressing just one 8kB page, and the decompressor is not reading this page from disk; it is already present in memory. So there is no sense in making this code async.
Hmm... It is not clear what you want to check. Once enabled, decompression will be used in ALL tests, so I do not see any sense in writing a particular test specifically for decompression. The only reason might be to estimate the reduction in WAL size from compression, and that greatly depends on the workload.
I'm missing the part where it is enabled; is that something implicit on the Postgres side? What I was expecting in a test was to run Postgres with some particular config that enables compression, but I don't know the mechanism by which Postgres decides to use it.
I suppose where the WAL is being created, i.e. on the endpoint?
At least for
So this was by hand, right? What's the story for automated tests: are our tests running with or without compression?
Yes, I just updated postgresql.conf with
Right now, certainly without. But it is quite trivial to switch on compression in the Postgres config for tests.
It would be great to have at least one test that runs basic pgbench with compression enabled.
pgbench test:
With or without compression, I would like to use the "image-with-a-hole" format throughout the system. Let's introduce a new ImageWithHole struct and use it in most places instead of a plain Bytes to represent a page image, and let's introduce a new repository::Value variant for that.
We could send compressed and image-with-a-hole to the Postgres client, too. That would save on network bandwidth, and push the decompression work from the pageserver to Postgres. I think that's a better tradeoff. But that can be a separate PR after this.
With compression, one annoying thing is that we need to decompress a page in order to set the page LSN, and then recompress it. I think that could be optimized though. For example with LZ4, the LSN at the beginning of the block is almost certainly stored literally, so you could patch it in the compressed representation without decompressing the whole block. Or instead of recompressing, you could store the new LSN separately, and set it after you decompress.
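The "store the new LSN separately and set it after you decompress" option could look roughly like this minimal Rust sketch. All names here (CompressedPage, lsn_override, materialize) are hypothetical, the decompressor is passed in as a closure to keep the sketch self-contained, and the pd_lsn layout (first 8 bytes of the page header, split into 32-bit halves, little-endian assumed) is an assumption rather than code from the PR:

```rust
// Hypothetical sketch: defer the LSN update so the compressed image
// never needs to be recompressed.
pub struct CompressedPage {
    /// lz4-compressed 8kB page image.
    pub compressed: Vec<u8>,
    /// New page LSN recorded at write time instead of recompressing.
    pub lsn_override: Option<u64>,
}

impl CompressedPage {
    /// Decompress and patch the deferred LSN into the page header.
    /// In the Postgres page layout, pd_lsn occupies the first 8 bytes,
    /// stored as { xlogid: u32, xrecoff: u32 } (little-endian assumed here).
    pub fn materialize(&self, decompress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
        let mut page = decompress(&self.compressed);
        if let Some(lsn) = self.lsn_override {
            page[0..4].copy_from_slice(&((lsn >> 32) as u32).to_le_bytes());
            page[4..8].copy_from_slice(&(lsn as u32).to_le_bytes());
        }
        page
    }
}
```

The decompress-then-patch path costs one 8-byte write per page; the patch-inside-the-compressed-stream optimization mentioned above would avoid even the decompression, but depends on the LSN bytes being stored as literals by lz4.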
let value = if is_rel_data_key(key) && blob.len() < BLCKSZ as usize {
    let decompressed = lz4_flex::block::decompress(&blob, BLCKSZ as usize)?;
    Bytes::from(decompressed)
} else {
    Bytes::from(blob)
};
To distinguish compressed and uncompressed images, let's add an explicit tag byte for that instead of this hack with the size, and bump up the format_version number.
In order to be able to roll back the release if something goes wrong, let's deploy this in two stages:
- Add support for reading the new format
- Once 1. has been deployed for a while, start writing the new format
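A minimal sketch of the tag-byte framing being proposed; the tag values, function names, and pluggable (de)compressor closures are all illustrative assumptions, not the PR's actual code:

```rust
// Hypothetical on-disk framing with an explicit tag byte.
const TAG_UNCOMPRESSED: u8 = 0;
const TAG_LZ4: u8 = 1;

/// Frame a page image: keep the compressed form only when it is strictly
/// smaller, and record which form was chosen in the leading tag byte.
fn encode_blob(img: &[u8], compress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
    let compressed = compress(img);
    let (tag, payload) = if compressed.len() < img.len() {
        (TAG_LZ4, compressed.as_slice())
    } else {
        (TAG_UNCOMPRESSED, img)
    };
    let mut out = Vec::with_capacity(1 + payload.len());
    out.push(tag);
    out.extend_from_slice(payload);
    out
}

/// Read the tag byte and dispatch; unknown tags are a format error.
fn decode_blob(blob: &[u8], decompress: impl Fn(&[u8]) -> Vec<u8>) -> Result<Vec<u8>, String> {
    match blob.split_first() {
        Some((&TAG_LZ4, rest)) => Ok(decompress(rest)),
        Some((&TAG_UNCOMPRESSED, rest)) => Ok(rest.to_vec()),
        Some((&other, _)) => Err(format!("unknown blob tag {other}")),
        None => Err("empty blob".to_string()),
    }
}
```

Rejecting unknown tags on read is what makes the two-stage rollout safe: readers deployed in stage 1 understand both tags before any writer starts emitting TAG_LZ4.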
With or without compression, I would like to use the "image-with-a-hole" format throughout the system.
Please note that only regular pages may have a hole; SLRU pages don't have them.
Frankly speaking, I do not think that supporting punched images is such a good idea.
It significantly complicates the code and gains almost nothing if pages are compressed.
To distinguish compressed and uncompressed images, let's add an explicit tag byte for that instead of this hack with the size. And bump up the format_version number.
I agree with this proposal. It gives us better safety and flexibility.
But there is one stupid problem for which I do not know a good solution: how to store this tag.
This is how a page is stored in an image layer now:
if is_rel_data_key(key) {
    let compressed = lz4_flex::block::compress(img);
    if compressed.len() < img.len() {
        off = self.blob_writer.write_blob(&compressed).await?;
    } else {
        off = self.blob_writer.write_blob(img).await?;
    }
} else {
    off = self.blob_writer.write_blob(img).await?;
}
So we just call write_blob, which appends a sequence of bytes to the file.
Now assume that we also want to store a tag (1 byte).
There are two possibilities:
- Add a blob_writer.write_byte() method to write this tag separately
- Prepend it to the stored image (blob). But this requires allocation and copying of ~8kB. Maybe not so expensive compared with the compression overhead... but it looks ugly.
Are there some better alternatives?
@hlinnaka do you think we need to add this extra byte for each page image, or just one byte in the summary?
There seems to be no sense in supporting different compression algorithms for different pages.
Certainly we need in any case to distinguish compressed and uncompressed pages. But what's wrong with looking at the BLOB size? We can enforce that compression is performed only for BLCKSZ page images and that the compressed version always has a smaller size than the original (otherwise we just do not store the compressed image).
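The size-based scheme being argued for here can be pinned down as a short sketch (names and the compressor closure are illustrative). The key invariant is that only whole BLCKSZ page images are ever compressed and the compressed form is kept only when strictly smaller, so on read "length < BLCKSZ" reliably means "compressed":

```rust
const BLCKSZ: usize = 8192;

/// Decide what to store for a full page image: the compressed form only
/// if it is strictly smaller than BLCKSZ, otherwise the original bytes.
fn store_form(img: &[u8], compress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
    assert_eq!(img.len(), BLCKSZ, "only whole page images are compressed");
    let compressed = compress(img);
    if compressed.len() < BLCKSZ {
        compressed
    } else {
        img.to_vec()
    }
}

/// With the invariant above, the stored length alone tells the reader
/// whether decompression is needed.
fn is_compressed(stored: &[u8]) -> bool {
    stored.len() < BLCKSZ
}
```

This captures both sides of the debate: the scheme works, but its correctness rests entirely on the writer never emitting a compressed blob of exactly BLCKSZ bytes, which is the fragility the explicit tag byte avoids.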
Looking at the size is complicated and error prone. You have to be careful that the compressed size never happens to be equal to BLCKSZ, for example.
I'd suggest adding a variant of write_blob that writes one byte followed by the buffer. Or an array of buffers, like writev(2) does, and hope that the compiler inlines it into something fast.
Yet another approach would be to compress all blobs in the file; then you don't need a header byte either. I think a header byte is better, though.
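A sketch of such a write_blob variant, assuming a plain std::io::Write sink; the function name is made up and the real blob_writer API (async, offset-returning) may differ:

```rust
use std::io::{self, Write};

/// Hypothetical variant of write_blob: emit a one-byte tag followed by
/// the payload, without first concatenating them into a new allocation.
fn write_tagged_blob<W: Write>(w: &mut W, tag: u8, buf: &[u8]) -> io::Result<()> {
    // Two small writes; with a buffered writer the tag byte and the
    // payload typically land in the same underlying write, similar in
    // spirit to a writev(2) of [tag, buf].
    w.write_all(&[tag])?;
    w.write_all(buf)
}
```

A vectored write (Write::write_vectored with IoSlice) would map even more directly onto writev(2), at the cost of handling partial writes by hand.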
Looking at the size is complicated and error prone. You have to be careful that the compressed size never happens to be equal to BLCKSZ, for example.
Hmmm...
Not sure that I am convinced by this argument. Comparing sizes is no more complex than any other comparison, for example comparing the format version. Concerning corner cases, i.e. BLCKSZ: we want compression to give some profit, right? So if it returns exactly the same size, why would we store the compressed image at all?
So from my point of view this check is quite obvious and safe.
I have already implemented a solution that stores the compression algorithm in the layer metadata (summary).
I can now try the alternative solution of storing an extra byte for each BLOB and check whether there is some overhead. I do not expect any noticeable overhead, but from my point of view it is a much more complicated solution than just looking at the size, and it gives us no advantages (extra flexibility, better compression, ...).
I'd suggest adding a variant of write_blob that writes one byte followed by the buffer. Or an array of buffers, like writev(2) does, and hope that the compiler inlines it into something fast.
There are fewer problems with write than with read: in the case of write I can call write_all for this tag.
But BlobCursor has no such method. Maybe move decompression to blob_io? In that case there is no need to add extra methods...
Ok, I have implemented the approach with a header byte.
I have also added compression of pages sent by the pageserver to compute.
is reduced from 17264 msec to 12234 msec
Did you test with network overhead, or just a local setup without network delays?
Just a local setup. Certainly with real network connections the effect should be more noticeable.
All GetPage responses.
So overall this is very promising. Lots of different things are being discussed; I'll try to dissect this into separate features to have a clear path forward.
I suggest the following steps forward:
Right now this PR supports compression of
Case 1) doesn't require any API or format changes. In case 4) we really need a special tag for compressed images (and it is added by this PR).
I wonder if it should be part of TenantConfig or implemented somehow else. Should we enable/disable compression for a particular tenant or region? Right now there is no TenantConfig available in blob_io, where compression is performed.
I do not think that we should support different compression algorithms, i.e. lz4 and zstd. From my experiments lz4 provides the best size/speed ratio: it is much faster than competitors, and the difference in compressed size is not so large. Also, I do not see much sense in splitting this PR into 3: separating 1) and 5)
After thinking a little bit more, and observing problems with building pg14 (which doesn't support lz4 compression)
This PR has been automatically marked as Draft because it was last updated 304 days ago. Please consider closing it if it's not needed anymore.
This was a slightly different approach to #7091, and I think given that we have #5431 now as well as #9821. cc @VladLazar
Problem
Reduce WAL size by enabling compression of FPI.
Summary of changes
Support lz4 compression of FPIs in WAL records in the pageserver WAL decoder
Checklist before requesting a review
Checklist before merging