
Support lz4 page images compression #5913

Closed
wants to merge 9 commits into from

Conversation

knizhnik
Contributor

Problem

Reduce WAL size by enabling compression of FPI.

Summary of changes

Support lz4 compression of FPI in WAL records in the pageserver WAL decoder

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners November 24, 2023 08:49
@knizhnik knizhnik requested review from shanyp and conradludgate and removed request for a team November 24, 2023 08:49

github-actions bot commented Nov 24, 2023

No tests were run or test report is not available

Test coverage report is not available

The comment gets automatically updated with the latest test results
42d0b04 at 2024-03-11T15:37:31.960Z :recycle:

@knizhnik knizhnik changed the title Support lx4 WAL compression Support lz4 WAL compression Nov 24, 2023
@arpad-m
Member

arpad-m commented Nov 24, 2023

Some points:

// compression of WAL is not yet supported: fall back to storing the original WAL record
&& !postgres_ffi::bkpimage_is_compressed(blk.bimg_info, self.timeline.pg_version)?
// only lz4 compression of WAL is now supported, for other compression algorithms fall back to storing the original WAL record
&& (!postgres_ffi::bkpimage_is_compressed(blk.bimg_info, self.timeline.pg_version)? ||
Contributor

Postgres supports zstd compression since v15 (at least it was mentioned for basebackup, but I'm not sure about WAL).

Contributor Author

@knizhnik knizhnik Nov 24, 2023

Yes, PG15 supports lz4 and zstd WAL compression.

@vadim2404 vadim2404 requested review from jcsp and koivunej November 29, 2023 09:29
Member

@koivunej koivunej left a comment

Should we include tests showing that some simple workload works with compute created with this postgres option enabled?

I looked at lz4_flex; at least fuzzing and Miri testing are in place. I don't think the missing async-compression support is an issue, it is questionable to use that crate anyways.

@arpad-m
Member

arpad-m commented Dec 4, 2023

it is questionable to use that crate anyways.

why? could you expand? we use it in many places.

@knizhnik
Contributor Author

knizhnik commented Dec 4, 2023

it is questionable to use that crate anyways.

why? could you expand? we use it in many places.

Compression is a CPU-intensive task, so there is not much reason to make it async. Async can be useful if we have an existing async stream (file, socket, ...) and want to add compression to it. But in the case of WAL compression (or, more precisely, compression of FPI in WAL records) we can just use the block compress/decompress functions; no async API is needed here.
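
For reference, a minimal sketch of the synchronous block API being referred to here, using the lz4_flex calls that appear elsewhere in this PR (the helper names and the 8192-byte page size constant are illustrative):

    // Minimal sketch: plain blocking lz4 (de)compression of a single 8 KB page.
    // Helper names are illustrative; the lz4_flex calls match ones used in this PR.
    const BLCKSZ: usize = 8192;

    fn compress_page(page: &[u8]) -> Vec<u8> {
        lz4_flex::block::compress(page)
    }

    fn decompress_page(blob: &[u8]) -> Result<Vec<u8>, lz4_flex::block::DecompressError> {
        // The caller knows the uncompressed size (one page), so no size prefix is needed.
        lz4_flex::block::decompress(blob, BLCKSZ)
    }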

@arpad-m
Member

arpad-m commented Dec 4, 2023

CPU intensive tasks also need to be made async because you must yield to the executor regularly, whether it's due to waiting on I/O or due to CPU work, unless we use spawn_blocking. I wonder, is the size of the compressed data bounded by some value? @koivunej has explained to me that tokio has two tasks on each executor thread and work stealing can't steal that second task, so if you don't yield regularly, you will stall that other task indefinitely.
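
For illustration, a hedged sketch of the spawn_blocking route mentioned above (the function name is hypothetical; tokio::task::spawn_blocking is the real API):

    // Sketch: run CPU-bound compression on tokio's blocking thread pool so the
    // async worker thread never stalls. The function name is illustrative only.
    async fn compress_page_off_executor(page: Vec<u8>) -> Vec<u8> {
        tokio::task::spawn_blocking(move || lz4_flex::block::compress(&page))
            .await
            .expect("compression task panicked")
    }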

@arpad-m
Member

arpad-m commented Dec 4, 2023

We could maybe get away with it thanks to the fact that lz4 is so extremely fast that it might not matter much.

@jcsp
Contributor

jcsp commented Dec 5, 2023

CPU intensive tasks also need to be made async because you must yield to the executor regularly,

This is true in general, but it looks like this decompression call has a bounded runtime because it's only handling one page at a time. I would suggest just adding a tokio::task::yield_now() after the decompression, so that if we have a WAL record with many blocks, we are not hogging the executor for the entire decode of the WAL record, but just for one block's decompression work at a time.
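
A minimal sketch of that suggestion (the input shape and the 8192-byte page size are illustrative, not the PR's actual decoder types):

    // Sketch: decompress one block at a time and yield back to the executor
    // between blocks, so a WAL record with many FPIs does not hog the worker.
    async fn decompress_record_blocks(
        compressed_blocks: Vec<Vec<u8>>,
    ) -> Result<Vec<Vec<u8>>, lz4_flex::block::DecompressError> {
        let mut pages = Vec::with_capacity(compressed_blocks.len());
        for blk in compressed_blocks {
            pages.push(lz4_flex::block::decompress(&blk, 8192)?);
            // Bounded work per iteration, then let other tasks on this worker run.
            tokio::task::yield_now().await;
        }
        Ok(pages)
    }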

@jcsp
Contributor

jcsp commented Dec 5, 2023

I would like to see at least one test in this PR before merging it.

@knizhnik
Contributor Author

knizhnik commented Dec 5, 2023

CPU intensive tasks also need to be made async because you must yield to the executor regularly, whether it's due to waiting on I/O or due to CPU work.

Please notice that we are decompressing just one 8 KB page, and the decompressor is not reading this page from disk: it is already present in memory. So there is no point in making this code async.

I would like to see at least one test in this PR before merging it.

Hmm... It is not clear what you want to check. Once enabled, decompression will be used in ALL tests, so I do not see much point in writing a particular test specifically for decompression. The only reason might be to estimate the reduction in WAL size due to compression, and that greatly depends on the workload.

@jcsp
Contributor

jcsp commented Dec 5, 2023

Hmm... It is not clear what you want to check. Once enabled, decompression will be used in ALL tests, so I do not see much point in writing a particular test specifically for decompression.

I'm missing the part where it is enabled -- is that something implicit on the postgres side? What I was expecting in a test was to run postgres with some particular config that enables compression, but I don't know the mechanism by which postgres decides to use it.

@arpad-m
Member

arpad-m commented Dec 5, 2023

I'm missing the part where it is enabled

I suppose where the WAL is being created, i.e. on the endpoint?

@knizhnik
Contributor Author

knizhnik commented Dec 5, 2023

At least for pgbench -i -s 500, WAL compression doesn't make a noticeable difference in size:

wal_compression    wal size    storage size
no                 6.3G        6.7G
lz4                5.9G        6.2G

@jcsp
Contributor

jcsp commented Dec 6, 2023

At least for pgbench -i -s 500, WAL compression doesn't make a noticeable difference in size:

So this was by hand, right? What's the story for automated tests: are our tests running with or without compression?

@knizhnik
Contributor Author

knizhnik commented Dec 6, 2023

At least for pgbench -i -s 500, WAL compression doesn't make a noticeable difference in size:

So this was by hand, right?

Yes, I just updated postgresql.conf with wal_compression=lz4

What's the story for automated tests: are our tests running with or without compression?

Right now, certainly without. But it is quite trivial to turn on compression in the Postgres config for tests.
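
For reference, turning it on for a test is a one-line postgresql.conf change (wal_compression is a standard PostgreSQL GUC; lz4 requires a Postgres build with lz4 support):

    # postgresql.conf on the compute (or ALTER SYSTEM SET wal_compression = 'lz4')
    wal_compression = lz4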

@arpad-m
Member

arpad-m commented Dec 6, 2023

It would be great to have at least one test that runs basic pgbench with compression enabled.

@knizhnik
Contributor Author

knizhnik commented Dec 6, 2023

pgbench test:

pgbench -i -s 500 "port=55432 user=cloud_admin dbname=postgres" 
pgbench -t 100000 -P 1 -c 24 -N -M prepared "port=55432 user=cloud_admin dbname=postgres"  
wal_compression    wal size    storage size    time (s)    TPS
no                 30G         45G             1048        2288
lz4                11G         13G             887         2703

@knizhnik knizhnik force-pushed the wal_lz4_compression branch from 05a8ee0 to c548333 Compare December 7, 2023 07:09
@knizhnik knizhnik changed the title Support lz4 WAL compression Support lz4 page images compression Dec 7, 2023
Contributor

@hlinnaka hlinnaka left a comment

With or without compression, I would like to use the "image-with-a-hole" format throughout the system. Let's introduce a new ImageWithHole struct, and use that in most places instead of a plain Bytes, to represent a page image. Let's introduce a new repository::Value variant for that.

We could send compressed and image-with-a-hole to the Postgres client, too. That would save on network bandwidth, and push the decompression work from the pageserver to Postgres. I think that's a better tradeoff. But that can be a separate PR after this.

With compression, one annoying thing is that we need to decompress a page in order to set the page LSN, and then recompress it. I think that could be optimized though. For example with LZ4, the LSN at the beginning of the block is almost certainly stored literally, so you could patch it in the compressed representation without decompressing the whole block. Or instead of recompressing, you could store the new LSN separately, and set it after you decompress.
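
A rough sketch of what such a representation could look like (the struct name comes from the comment above; the fields and the Value enum shape are hypothetical, not this PR's code):

    // Hypothetical sketch of an "image with a hole" page representation, mirroring
    // Postgres FPIs where the unused space between pd_lower and pd_upper is elided.
    // Field names and the enum shape are illustrative, not from this PR.
    use bytes::Bytes;

    pub struct ImageWithHole {
        /// Page bytes before the hole.
        pub prefix: Bytes,
        /// Length of the elided (all-zeros) region; the hole starts at prefix.len().
        pub hole_length: u16,
        /// Page bytes after the hole.
        pub suffix: Bytes,
    }

    // A new repository::Value variant could then sit next to the existing
    // full-image variant, roughly:
    //
    // pub enum Value {
    //     Image(Bytes),
    //     ImageWithHole(ImageWithHole),
    //     WalRecord(NeonWalRecord),
    // }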

Comment on lines 442 to 472
// If the blob for a relation data key is shorter than a full page, it was
// stored compressed: decompress it back to a BLCKSZ-byte page image.
let value = if is_rel_data_key(key) && blob.len() < BLCKSZ as usize {
    let decompressed = lz4_flex::block::decompress(&blob, BLCKSZ as usize)?;
    Bytes::from(decompressed)
} else {
    Bytes::from(blob)
};
Contributor

To distinguish compressed and uncompressed images, let's add an explicit tag byte for that instead of this hack with the size. And bump up the format_version number.

In order to be able to roll back the release if something goes wrong, let's deploy this in two stages:

  1. Add support for reading the new format
  2. Once 1. has been deployed for a while, start writing the new format

Contributor Author

With or without compression, I would like to use the "image-with-a-hole" format throughout the system.

Please notice that only regular pages may have a hole; SLRU pages don't have them.
Frankly speaking, I do not think that supporting punched images is such a good idea.
It significantly complicates the code and gains almost nothing if pages are compressed.

Contributor Author

To distinguish compressed and uncompressed images, let's add an explicit tag byte for that instead of this hack with the size. And bump up the format_version number.

I agree with this proposal. It gives us better safety and flexibility.
But there is one stupid problem for which I do not know a good solution: how to store this tag.
This is how a page is stored in an image layer now:

        // Compress only relation data pages, and keep the compressed blob only
        // if it is actually smaller than the original image.
        if is_rel_data_key(key) {
            let compressed = lz4_flex::block::compress(img);
            if compressed.len() < img.len() {
                off = self.blob_writer.write_blob(&compressed).await?;
            } else {
                off = self.blob_writer.write_blob(img).await?;
            }
        } else {
            off = self.blob_writer.write_blob(img).await?;
        }

So we just call write_blob, which appends a sequence of bytes to the file.
Now assume that we also want to store some tag (1 byte).
There are two possibilities:

  1. Add a blob_writer.write_byte() method to write this tag separately
  2. Prepend it to the stored image (blob). But that requires allocating and copying ~8 KB. Maybe not so expensive compared with the compression overhead... but it looks ugly.

Are there some better alternatives?

Contributor Author

@hlinnaka do you think that we need to add this extra byte for each page image or just one byte in summary?
There seems to be no point in supporting different compression algorithms for different pages.
Certainly we need in any case to distinguish compressed and uncompressed pages. But what's wrong with looking at the BLOB size? We can enforce that compression is performed only for BLCKSZ page images and that the compressed version is always smaller than the original (otherwise we just do not store the compressed image).

Contributor

Looking at the size is complicated and error prone. You have to be careful that the compressed size never happens to be equal to BLCKSZ, for example.

I'd suggest adding a variant of write_blob that writes one byte followed by the buffer. Or an array of buffers, like writev(2) does, and hope that the compiler inlines it into something fast.

Yet another approach would be to compress all blobs in the file. Then you also don't need a header byte. I think a header byte is better, though.
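
A rough sketch of the tagged write path being discussed (the tag constants and helper name are hypothetical; only lz4_flex::block::compress matches code in this PR):

    // Hypothetical sketch of the suggested write_blob variant: write one header
    // byte (compression tag) followed by the buffer, without gluing them into a
    // single allocation. Names are illustrative, not the PR's actual API.
    use std::io::Write;

    const TAG_UNCOMPRESSED: u8 = 0;
    const TAG_LZ4: u8 = 1;

    fn write_tagged_page_image<W: Write>(out: &mut W, img: &[u8]) -> std::io::Result<()> {
        let compressed = lz4_flex::block::compress(img);
        let (tag, payload): (u8, &[u8]) = if compressed.len() < img.len() {
            (TAG_LZ4, &compressed)
        } else {
            (TAG_UNCOMPRESSED, img)
        };
        out.write_all(&[tag])?; // one-byte header
        out.write_all(payload)  // then the blob itself, no extra copy
    }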

Contributor Author

Looking at the size is complicated and error prone. You have to be careful that the compressed size never happens to be equal to BLCKSZ, for example.

Hmmm...
Not sure that I am convinced by this argument. Comparing the size is no more complex than any other comparison, for example comparing the format version. Concerning the corner case, i.e. BLCKSZ: we want compression to give some benefit, right? So if it returns exactly the same size, why would we need to store the compressed image?

So from my point of view this check is quite obvious and safe.

I have already implemented a solution that stores the compression algorithm in the layer metadata (summary).
I can now try to implement the alternative solution of storing an extra byte for each BLOB and check whether there is any overhead. I do not expect any noticeable overhead, but from my point of view it is a much more complicated solution than just looking at the size, and it gives us no advantages (extra flexibility, better compression, ...).

Contributor Author

I'd suggest adding a variant of write_blob that writes one byte followed by the buffer. Or an array of buffers, like writev(2) does, and hope that the compiler inlines it into something fast.

There are fewer problems with write than with read. For writes I can call write_all for this tag.
But BlobCursor has no such method. Maybe move decompression into blob_io? In that case there is no need to add extra methods...

Contributor Author

OK, I have implemented the approach with a header byte.
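
For symmetry, a hedged sketch of what the read side of such a header byte could look like (the constants mirror the hypothetical write sketch above; the PR's actual BlobCursor code may differ):

    // Hypothetical read-side counterpart: the first byte of the stored blob is a
    // compression tag, the rest is the payload. Illustrative only.
    use bytes::Bytes;

    const BLCKSZ: usize = 8192;
    const TAG_UNCOMPRESSED: u8 = 0;
    const TAG_LZ4: u8 = 1;

    fn decode_page_image(blob: &[u8]) -> anyhow::Result<Bytes> {
        match blob.split_first() {
            Some((&TAG_UNCOMPRESSED, payload)) => Ok(Bytes::copy_from_slice(payload)),
            Some((&TAG_LZ4, payload)) => {
                let page = lz4_flex::block::decompress(payload, BLCKSZ)?;
                Ok(Bytes::from(page))
            }
            _ => anyhow::bail!("unknown page image tag"),
        }
    }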

@knizhnik
Contributor Author

knizhnik commented Dec 7, 2023

I have also added compression of pages sent by the page server to compute.
The time of the following query:

 select sum(abalance) from pgbench_accounts;

is reduced from 17264 msec to 12234 msec

@vadim2404
Contributor

I have also added compression of pages sent by the page server to compute. At least for a local Neon installation it doesn't have a noticeable impact on performance. The time of the following query:

 select sum(abalance) from pgbench_accounts;

is reduced from 17264 msec to 16155 msec

Did you test with network overhead, or just a local setup w/o network delays?
Did you compress all the requests? Because I was thinking about having a threshold above which it makes sense to compress.

@knizhnik
Contributor Author

knizhnik commented Dec 8, 2023

Did you test with network overhead, or just a local setup w/o network delays?

Just a local setup. Certainly with real network connections the effect should be more noticeable.

Did you compress all the requests? Because I was thinking about having a threshold above which it makes sense to compress.

All GetPage responses.

@knizhnik knizhnik force-pushed the wal_lz4_compression branch from 7b23751 to d7dcd23 Compare December 8, 2023 14:29
@hlinnaka
Contributor

So overall this is very promising. Lots of different things are being discussed; I'll try to dissect this into separate features to have a clear path forward:

  1. Supporting WAL compression, where PostgreSQL compresses full-page images being written to WAL. This is controlled by the wal_compression GUC. The pageserver does not currently support digesting such compressed records, so it will fall back to storing the record as is, and use Postgres wal-redo process to WAL redo such pages. That's inefficient especially if one WAL record modifies multiple pages, because we will duplicate the whole record, with images for all of the modified pages, for each of the pages being modified. I think that can arise on CREATE INDEX for example. It would be nice to craft a test for that worst case scenario.

    Even though we don't set wal_compression in the computes today, it would be good to support it. Firstly, we could then consider setting it. Secondly, as we implement support for streaming WAL from an external system to neon, the external system might have wal_compression set, whether we like it or not.

    For this, we should support both LZ4 and zstd, because we don't necessarily control what the wal_compression has been set to in the originating system. But we don't necessarily need to store the pages in a compressed format, we could just decompress the pages as they're digested.

  2. Compress page images stored in image layers. The benefit is obvious: compressed data requires less space.

    Compression is always a tradeoff between CPU and disk space. We can choose to use LZ4 or zstd. I'm leaning towards zstd, it's a good and flexible algorithm that allows tuning for different tradeoffs. And I'd love to do dictionary mode compression, because it's cool technology :-). You did some earlier experiments with that at Implement compression of image and delta layers #1595. Seriously speaking, I'm not sure if the dictionary mode is worth the complexity. And any compression is better than nothing, really.

  3. Support "image+hole" format of storing and passing around pages. This would save some disk space without implementing full-blown compression. Probably doesn't gain much if we do compression anyway. But still it would be nice to have such a format in the pageserver, and in the pageserver-compute protocol, to save some memory and network bandwidth.

  4. Support storing pages in the same compressed format that Postgres uses with wal_compression, without having to recompress them. Recompressing might be OK in practice, but it sure seems silly. The main problem is that PostgreSQL compresses the image before updating the LSN on the page, but we'd like to store the image with the new LSN. One solution is to store the new LSN in addition to the compressed image. After the image is decompressed, update the LSN (see the small sketch after this list).

  5. Compressing delta layers. Similar benefits to compressing image layers. Not part of this patch, but would probably be beneficial just like compressing image layers. Can be done separately later.

  6. Compressing images sent over the wire. Nice feature. Orthogonal to the other parts though. It is included in this PR, but I'd suggest to leave it out, to reduce the scope and focus on the more important parts.

  7. Migration path. We need to be able to roll back deployments, so this needs to have feature flags so that we can deploy the code without writing anything in the new format yet. Later, when we're confident that we don't need to roll back, we can enable the feature for some subset of tenants, or all.
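
For point 4, a minimal sketch of "store the new LSN separately and patch it after decompression"; it assumes the standard Postgres page layout where pd_lsn occupies the first 8 bytes of the page header, stored as two native-endian u32 halves (illustrative code, not from this PR):

    // Illustrative sketch: after decompressing a page that was stored in its
    // original wal_compression form, overwrite pd_lsn (first 8 bytes of the page
    // header) with the LSN kept alongside the compressed image. Assumes a
    // little-endian platform, matching how Postgres stores the header natively.
    fn patch_page_lsn(page: &mut [u8], lsn: u64) {
        let xlogid = (lsn >> 32) as u32; // high half of the LSN
        let xrecoff = lsn as u32;        // low half of the LSN
        page[0..4].copy_from_slice(&xlogid.to_le_bytes());
        page[4..8].copy_from_slice(&xrecoff.to_le_bytes());
    }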

I suggest the following steps forward:

  1. A PR to support digesting WAL pages compressed in PostgreSQL with wal_compression. Decompress them in the pageserver and digest. This doesn't require any storage format changes, so there's no worry of backwards-compatibility. For this, I'd like to see some tests to demonstrate the benefit in the best case, like at a CREATE INDEX that emits lots of full-page images in a single record.

  2. A PR to implement zstd or lz4 compression of image layers. It needs to include a pageserver option so that it can be turned on and off, on a per-tenant basis. Default to off for now. Pay careful attention to the storage format changes, because we'll be stuck with the new format for a long time. It's OK to not include points 3. and 4. above, but let's pay close attention to the storage format, so that those will be straightforward to add on later. Include performance tests to demonstrate the effect on disk size, but also on CPU usage. How much CPU will we burn on the compression? I believe our pageservers currently have plenty of spare CPU, but let's measure it.

  3. More PRs for the points 3-6 above. They're not urgent.

@knizhnik knizhnik force-pushed the wal_lz4_compression branch from 8dc9987 to 42d0b04 Compare March 11, 2024 15:34
@knizhnik
Contributor Author

knizhnik commented Mar 11, 2024

Right now this PR supports compression of

  1. WAL FPI
  2. Page images in inmemory layer
  3. Page images in delta layer
  4. Image_layer pages
  5. get_page response

Case 1) doesn't require any API or format changes.
Cases 2) and 3) require adding CompressedImage to the Value enum, because these layers are serialized using serde.
I am not sure if we need image compression for these layers:
On the one hand, they are stored on disk in the PS and in S3, and are transferred from the PS to S3.
On the other hand, these layers are somewhat intermediate, and most likely we have to extract and decompress the image in order to apply WAL records to it and perform page reconstruction.

In case 4) we really need a special tag for compressed images (and it is added by this PR).

It needs to include a pageserver option so that it can be turned on and off, on a per-tenant basis.

I wonder if it should be part of TenantConfig or implemented in some other way. Should we enable/disable compression for a particular tenant or region? Right now there is no TenantConfig available in blob_io, where compression is performed,
and there is no nice way to pass it there. I do not see much sense in enabling/disabling compression for a particular tenant, so maybe a global feature flag that can be used to turn compression on/off for a particular region is a better choice?

A PR to implement zstd or lz4 compression of image layers.

I do not think that we should support different compression algorithms, i.e. both lz4 and zstd. From my experiments lz4 provides the best size/speed ratio: it is much faster than the competitors, and the difference in compressed size is not that large.

Also I do not see much sense in splitting this PR into three by separating 1) and 5).

@knizhnik
Contributor Author

After thinking a little bit more and observing problems with building pg14 (which doesn't support lz4 compression),
I decided to create a separate PR for compressing only image layers:
#7091

@ololobus
Member

This PR has been automatically marked as Draft because it was last updated 304 days ago. Please, consider closing it if it's not needed anymore.

@ololobus ololobus marked this pull request as draft January 10, 2025 13:12
@arpad-m
Member

arpad-m commented Jan 10, 2025

This was a slightly different approach to #7091, and I think it can be closed given that we have #5431 now as well as #9821. cc @VladLazar

@arpad-m arpad-m closed this Jan 10, 2025