
Support lz4 page images compression #5913

Closed
wants to merge 9 commits into from

Conversation

knizhnik
Contributor

Problem

Reduce WAL size by enabling compression of FPI.

Summary of changes

Support lz4 compression of FPI in WAL records in the pageserver WAL decoder

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners November 24, 2023 08:49
@knizhnik knizhnik requested review from shanyp and conradludgate and removed request for a team November 24, 2023 08:49

github-actions bot commented Nov 24, 2023

No tests were run or test report is not available

Test coverage report is not available

The comment gets automatically updated with the latest test results
42d0b04 at 2024-03-11T15:37:31.960Z :recycle:

@knizhnik knizhnik changed the title Support lx4 WAL compression Support lz4 WAL compression Nov 24, 2023
@arpad-m
Member

arpad-m commented Nov 24, 2023

Some points:

// compression of WAL is not yet supported: fall back to storing the original WAL record
&& !postgres_ffi::bkpimage_is_compressed(blk.bimg_info, self.timeline.pg_version)?
// only lz4 compression of WAL is now supported, for other compression algorithms fall back to storing the original WAL record
&& (!postgres_ffi::bkpimage_is_compressed(blk.bimg_info, self.timeline.pg_version)? ||
Contributor

Postgres supports zstd compression since v15 (at least it was mentioned for basebackup, but I'm not sure about WAL).

Contributor Author

@knizhnik knizhnik Nov 24, 2023

Yes, PG15 supports lz4 and zstd WAL compression.

@vadim2404 vadim2404 requested review from jcsp and koivunej November 29, 2023 09:29
Member

@koivunej koivunej left a comment

Should we include tests showing that some simple workload works with compute created with this postgres option enabled?

I looked at lz4_flex; at least fuzzing and Miri testing are in place. I don't think the missing async-compression support is an issue, it is questionable to use that crate anyways.

@arpad-m
Member

arpad-m commented Dec 4, 2023

it is questionable to use that crate anyways.

why? could you expand? we use it in many places.

@knizhnik
Contributor Author

knizhnik commented Dec 4, 2023

it is questionable to use that crate anyways.

why? could you expand? we use it in many places.

Compression is a CPU-intensive task, so there is not much reason to make it async. Async can be useful if we have an existing async stream (file, socket, ...) and want to add compression to it. But in the case of WAL compression (or, more precisely, compression of FPI in WAL records) we can just use the block compress/decompress functions; no async API is needed here.
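
For reference, a minimal sketch of the synchronous block API being referred to here, using the lz4_flex calls that appear elsewhere in this PR (the helper names and the 8192-byte page size constant are illustrative):

    // Minimal sketch: plain blocking lz4 (de)compression of a single 8 KB page.
    // Helper names are illustrative; the lz4_flex calls match ones used in this PR.
    const BLCKSZ: usize = 8192;

    fn compress_page(page: &[u8]) -> Vec<u8> {
        lz4_flex::block::compress(page)
    }

    fn decompress_page(blob: &[u8]) -> Result<Vec<u8>, lz4_flex::block::DecompressError> {
        // The caller knows the uncompressed size (one page), so no size prefix is needed.
        lz4_flex::block::decompress(blob, BLCKSZ)
    }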

@arpad-m
Member

arpad-m commented Dec 4, 2023

CPU intensive tasks also need to be made async because you must yield to the executor regularly, whether it's due to waiting on I/O or due to CPU work, unless we use spawn_blocking. I wonder, is the size of the compressed data bounded by some value? @koivunej has explained to me that tokio has two tasks on each executor thread and work stealing can't steal that second task, so if you don't yield regularly, you will stall that other task indefinitely.
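
For illustration, a hedged sketch of the spawn_blocking route mentioned above (the function name is hypothetical; tokio::task::spawn_blocking is the real API):

    // Sketch: run CPU-bound compression on tokio's blocking thread pool so the
    // async worker thread never stalls. The function name is illustrative only.
    async fn compress_page_off_executor(page: Vec<u8>) -> Vec<u8> {
        tokio::task::spawn_blocking(move || lz4_flex::block::compress(&page))
            .await
            .expect("compression task panicked")
    }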

@arpad-m
Member

arpad-m commented Dec 4, 2023

We could maybe get away with it thanks to the fact that lz4 is so extremely fast that it might not matter much.

@jcsp
Contributor

jcsp commented Dec 5, 2023

CPU intensive tasks also need to be made async because you must yield to the executor regularly,

This is true in general, but it looks like this decompression call has a bounded runtime because it's only handling one page at a time. I would suggest just adding a tokio::task::yield_now() after the decompression, so that if we have a WAL record with many blocks, we are not hogging the executor for the entire decode of the WAL record, but just for one block's decompression work at a time.
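
A minimal sketch of that suggestion (the input shape and the 8192-byte page size are illustrative, not the PR's actual decoder types):

    // Sketch: decompress one block at a time and yield back to the executor
    // between blocks, so a WAL record with many FPIs does not hog the worker.
    async fn decompress_record_blocks(
        compressed_blocks: Vec<Vec<u8>>,
    ) -> Result<Vec<Vec<u8>>, lz4_flex::block::DecompressError> {
        let mut pages = Vec::with_capacity(compressed_blocks.len());
        for blk in compressed_blocks {
            pages.push(lz4_flex::block::decompress(&blk, 8192)?);
            // Bounded work per iteration, then let other tasks on this worker run.
            tokio::task::yield_now().await;
        }
        Ok(pages)
    }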

@jcsp
Contributor

jcsp commented Dec 5, 2023

I would like to see at least one test in this PR before merging it.

@knizhnik
Contributor Author

knizhnik commented Dec 5, 2023

CPU intensive tasks also need to be made async because you must yield to the executor regularly, whether it's due to waiting on I/O or due to CPU work.

Please notice that we are decompressing just one 8 KB page, and the decompressor is not reading this page from disk: it is already present in memory. So there is no point in making this code async.

I would like to see at least one test in this PR before merging it.

Hmm... It is not clear what you want to check. Once enabled, decompression will be used in ALL tests, so I do not see much point in writing a particular test specifically for decompression. The only reason might be to estimate the reduction in WAL size due to compression, and that greatly depends on the workload.

@jcsp
Contributor

jcsp commented Dec 5, 2023

Hmm... It is not clear what you want to check. Once enabled, decompression will be used in ALL tests, so I do not see much point in writing a particular test specifically for decompression.

I'm missing the part where it is enabled -- is that something implicit on the postgres side? What I was expecting in a test was to run postgres with some particular config that enables compression, but I don't know the mechanism by which postgres decides to use it.

@arpad-m
Member

arpad-m commented Dec 5, 2023

I'm missing the part where it is enabled

I suppose where the WAL is being created, i.e. on the endpoint?

@knizhnik
Contributor Author

knizhnik commented Dec 5, 2023

At least for pgbench -i -s 500, WAL compression doesn't make a noticeable difference in size:

wal_compression    wal size    storage size
no                 6.3G        6.7G
lz4                5.9G        6.2G

@jcsp
Contributor

jcsp commented Dec 6, 2023

At least for pgbench -i -s 500, WAL compression doesn't make a noticeable difference in size:

So this was by hand, right? What's the story for automated tests: are our tests running with or without compression?

@knizhnik
Contributor Author

knizhnik commented Dec 6, 2023

At least for pgbench -i -s 500, WAL compression doesn't make a noticeable difference in size:

So this was by hand, right?

Yes, I just updated postgresql.conf with wal_compression=lz4

What's the story for automated tests: are our tests running with or without compression?

Right now, certainly without. But it is quite trivial to turn on compression in the Postgres config for tests.
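
For reference, turning it on for a test is a one-line postgresql.conf change (wal_compression is a standard PostgreSQL GUC; lz4 requires a Postgres build with lz4 support):

    # postgresql.conf on the compute (or ALTER SYSTEM SET wal_compression = 'lz4')
    wal_compression = lz4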

@arpad-m
Member

arpad-m commented Dec 6, 2023

It would be great to have at least one test that runs basic pgbench with compression enabled.

@knizhnik
Contributor Author

knizhnik commented Dec 6, 2023

pgbench test:

pgbench -i -s 500 "port=55432 user=cloud_admin dbname=postgres" 
pgbench -t 100000 -P 1 -c 24 -N -M prepared "port=55432 user=cloud_admin dbname=postgres"  
wal_compression    wal size    storage size    time (s)    TPS
no                 30G         45G             1048        2288
lz4                11G         13G             887         2703

@knizhnik knizhnik force-pushed the wal_lz4_compression branch from 05a8ee0 to c548333 Compare December 7, 2023 07:09
@knizhnik knizhnik changed the title Support lz4 WAL compression Support lz4 page images compression Dec 7, 2023
Contributor

@hlinnaka hlinnaka left a comment

With or without compression, I would like to use the "image-with-a-hole" format throughout the system. Let's introduce a new ImageWithHole struct, and use that in most places instead of a plain Bytes, to represent a page image. Let's introduce a new repository::Value variant for that.

We could send compressed and image-with-a-hole to the Postgres client, too. That would save on network bandwidth, and push the decompression work from the pageserver to Postgres. I think that's a better tradeoff. But that can be a separate PR after this.

With compression, one annoying thing is that we need to decompress a page in order to set the page LSN, and then recompress it. I think that could be optimized though. For example with LZ4, the LSN at the beginning of the block is almost certainly stored literally, so you could patch it in the compressed representation without decompressing the whole block. Or instead of recompressing, you could store the new LSN separately, and set it after you decompress.
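
A rough sketch of what such a representation could look like (the struct name comes from the comment above; the fields and the Value enum shape are hypothetical, not this PR's code):

    // Hypothetical sketch of an "image with a hole" page representation, mirroring
    // Postgres FPIs where the unused space between pd_lower and pd_upper is elided.
    // Field names and the enum shape are illustrative, not from this PR.
    use bytes::Bytes;

    pub struct ImageWithHole {
        /// Page bytes before the hole.
        pub prefix: Bytes,
        /// Length of the elided (all-zeros) region; the hole starts at prefix.len().
        pub hole_length: u16,
        /// Page bytes after the hole.
        pub suffix: Bytes,
    }

    // A new repository::Value variant could then sit next to the existing
    // full-image variant, roughly:
    //
    // pub enum Value {
    //     Image(Bytes),
    //     ImageWithHole(ImageWithHole),
    //     WalRecord(NeonWalRecord),
    // }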

Comment on lines 442 to 472
// If the blob for a relation data key is shorter than a full page, it was
// stored compressed: decompress it back to a BLCKSZ-byte page image.
let value = if is_rel_data_key(key) && blob.len() < BLCKSZ as usize {
    let decompressed = lz4_flex::block::decompress(&blob, BLCKSZ as usize)?;
    Bytes::from(decompressed)
} else {
    Bytes::from(blob)
};
Contributor

To distinguish compressed and uncompressed images, let's add an explicit tag byte for that instead of this hack with the size. And bump up the format_version number.

In order to be able to roll back the release if something goes wrong, let's deploy this in two stages:

  1. Add support for reading the new format
  2. Once 1. has been deployed for a while, start writing the new format

Contributor Author

With or without compression, I would like to use the "image-with-a-hole" format throughout the system.

Please notice that only regular pages may have a hole; SLRU pages don't have them.
Frankly speaking, I do not think that supporting punched images is such a good idea.
It significantly complicates the code and gains almost nothing if pages are compressed.

Contributor Author

To distinguish compressed and uncompressed images, let's add an explicit tag byte for that instead of this hack with the size. And bump up the format_version number.

I agree with this proposal. It gives us better safety and flexibility.
But there is one stupid problem for which I do not know a good solution: how to store this tag.
This is how a page is stored in an image layer now:

        // Compress only relation data pages, and keep the compressed blob only
        // if it is actually smaller than the original image.
        if is_rel_data_key(key) {
            let compressed = lz4_flex::block::compress(img);
            if compressed.len() < img.len() {
                off = self.blob_writer.write_blob(&compressed).await?;
            } else {
                off = self.blob_writer.write_blob(img).await?;
            }
        } else {
            off = self.blob_writer.write_blob(img).await?;
        }

So we just call write_blob, which appends a sequence of bytes to the file.
Now assume that we also want to store some tag (1 byte).
There are two possibilities:

  1. Add a blob_writer.write_byte() method to write this tag separately
  2. Prepend it to the stored image (blob). But that requires allocating and copying ~8 KB. Maybe not so expensive compared with the compression overhead... but it looks ugly.

Are there some better alternatives?

Contributor Author

@hlinnaka do you think that we need to add this extra byte for each page image or just one byte in summary?
There seems to be no point in supporting different compression algorithms for different pages.
Certainly we need in any case to distinguish compressed and uncompressed pages. But what's wrong with looking at the BLOB size? We can enforce that compression is performed only for BLCKSZ page images and that the compressed version is always smaller than the original (otherwise we just do not store the compressed image).

Contributor

Looking at the size is complicated and error prone. You have to be careful that the compressed size never happens to be equal to BLCKSZ, for example.

I'd suggest adding a variant of write_blob that writes one byte followed by the buffer. Or an array of buffers, like writev(2) does, and hope that the compiler inlines it into something fast.

Yet another approach would be to compress all blobs in the file. Then you also don't need a header byte. I think a header byte is better, though.
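
A rough sketch of the tagged write path being discussed (the tag constants and helper name are hypothetical; only lz4_flex::block::compress matches code in this PR):

    // Hypothetical sketch of the suggested write_blob variant: write one header
    // byte (compression tag) followed by the buffer, without gluing them into a
    // single allocation. Names are illustrative, not the PR's actual API.
    use std::io::Write;

    const TAG_UNCOMPRESSED: u8 = 0;
    const TAG_LZ4: u8 = 1;

    fn write_tagged_page_image<W: Write>(out: &mut W, img: &[u8]) -> std::io::Result<()> {
        let compressed = lz4_flex::block::compress(img);
        let (tag, payload): (u8, &[u8]) = if compressed.len() < img.len() {
            (TAG_LZ4, &compressed)
        } else {
            (TAG_UNCOMPRESSED, img)
        };
        out.write_all(&[tag])?; // one-byte header
        out.write_all(payload)  // then the blob itself, no extra copy
    }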

Contributor Author

Looking at the size is complicated and error prone. You have to be careful that the compressed size never happens to be equal to BLCKSZ, for example.

Hmmm...
Not sure that I am convinced by this argument. Comparing the size is no more complex than any other comparison, for example comparing the format version. Concerning the corner case, i.e. BLCKSZ: we want compression to give some benefit, right? So if it returns exactly the same size, why would we need to store the compressed image?

So from my point of view this check is quite obvious and safe.

I have already implemented a solution that stores the compression algorithm in the layer metadata (summary).
I can now try to implement the alternative solution of storing an extra byte for each BLOB and check whether there is any overhead. I do not expect any noticeable overhead, but from my point of view it is a much more complicated solution than just looking at the size, and it gives us no advantages (extra flexibility, better compression, ...).

Contributor Author

I'd suggest adding a variant of write_blob that writes one byte followed by the buffer. Or an array of buffers, like writev(2) does, and hope that the compiler inlines it into something fast.

There are fewer problems with write than with read. For writes I can call write_all for this tag.
But BlobCursor has no such method. Maybe move decompression into blob_io? In that case there is no need to add extra methods...

Contributor Author

OK, I have implemented the approach with a header byte.
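
For symmetry, a hedged sketch of what the read side of such a header byte could look like (the constants mirror the hypothetical write sketch above; the PR's actual BlobCursor code may differ):

    // Hypothetical read-side counterpart: the first byte of the stored blob is a
    // compression tag, the rest is the payload. Illustrative only.
    use bytes::Bytes;

    const BLCKSZ: usize = 8192;
    const TAG_UNCOMPRESSED: u8 = 0;
    const TAG_LZ4: u8 = 1;

    fn decode_page_image(blob: &[u8]) -> anyhow::Result<Bytes> {
        match blob.split_first() {
            Some((&TAG_UNCOMPRESSED, payload)) => Ok(Bytes::copy_from_slice(payload)),
            Some((&TAG_LZ4, payload)) => {
                let page = lz4_flex::block::decompress(payload, BLCKSZ)?;
                Ok(Bytes::from(page))
            }
            _ => anyhow::bail!("unknown page image tag"),
        }
    }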

@knizhnik
Contributor Author

knizhnik commented Dec 7, 2023

I have also added compression of pages sent by the page server to compute.
The time of the following query:

 select sum(abalance) from pgbench_accounts;

is reduced from 17264 msec to 12234 msec

@vadim2404
Contributor

I have also added compression of pages sent by the page server to compute. At least for a local Neon installation it doesn't have a noticeable impact on performance. The time of the following query:

 select sum(abalance) from pgbench_accounts;

is reduced from 17264 msec to 16155 msec

Did you test with network overhead, or just a local setup w/o network delays?
Did you compress all the requests? Because I was thinking about having a threshold above which it makes sense to compress.

@knizhnik
Contributor Author

knizhnik commented Dec 8, 2023

Did you test with network overhead, or just a local setup w/o network delays?

Just a local setup. Certainly with real network connections the effect should be more noticeable.

Did you compress all the requests? Because I was thinking about having a threshold above which it makes sense to compress.

All GetPage responses.

@knizhnik knizhnik force-pushed the wal_lz4_compression branch from 7b23751 to d7dcd23 Compare December 8, 2023 14:29
@hlinnaka
Contributor

So overall this is very promising. Lots of different things are being discussed; I'll try to dissect this into separate features to have a clear path forward:

  1. Supporting WAL compression, where PostgreSQL compresses full-page images being written to WAL. This is controlled by the wal_compression GUC. The pageserver does not currently support digesting such compressed records, so it will fall back to storing the record as is, and use Postgres wal-redo process to WAL redo such pages. That's inefficient especially if one WAL record modifies multiple pages, because we will duplicate the whole record, with images for all of the modified pages, for each of the pages being modified. I think that can arise on CREATE INDEX for example. It would be nice to craft a test for that worst case scenario.

    Even though we don't set wal_compression in the computes today, it would be good to support it. Firstly, we could then consider setting it. Secondly, as we implement support for streaming WAL from an external system to neon, the external system might have wal_compression set, whether we like it or not.

    For this, we should support both LZ4 and zstd, because we don't necessarily control what the wal_compression has been set to in the originating system. But we don't necessarily need to store the pages in a compressed format, we could just decompress the pages as they're digested.

  2. Compress page images stored in image layers. The benefit is obvious: compressed data requires less space.

    Compression is always a tradeoff between CPU and disk space. We can choose to use LZ4 or zstd. I'm leaning towards zstd, it's a good and flexible algorithm that allows tuning for different tradeoffs. And I'd love to do dictionary mode compression, because it's cool technology :-). You did some earlier experiments with that at Implement compression of image and delta layers #1595. Seriously speaking, I'm not sure if the dictionary mode is worth the complexity. And any compression is better than nothing, really.

  3. Support "image+hole" format of storing and passing around pages. This would save some disk space without implementing full-blown compression. Probably doesn't gain much if we do compression anyway. But still it would be nice to have such a format in the pageserver, and in the pageserver-compute protocol, to save some memory and network bandwidth.

  4. Support storing pages in the same compressed format that Postgres uses with wal_compression, without having to recompress them. Recompressing might be OK in practice, but it sure seems silly. The main problem is that PostgreSQL compresses the image before updating the LSN on the page, but we'd like to store the image with the new LSN. One solution is to store the new LSN in addition to the compressed image. After the image is decompressed, update the LSN (see the small sketch after this list).

  5. Compressing delta layers. Similar benefits to compressing image layers. Not part of this patch, but would probably be beneficial just like compressing image layers. Can be done separately later.

  6. Compressing images sent over the wire. Nice feature. Orthogonal to the other parts though. It is included in this PR, but I'd suggest to leave it out, to reduce the scope and focus on the more important parts.

  7. Migration path. We need to be able to roll back deployments, so this needs to have feature flags so that we can deploy the code without writing anything in the new format yet. Later, when we're confident that we don't need to roll back, we can enable the feature for some subset of tenants, or all.
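
For point 4, a minimal sketch of "store the new LSN separately and patch it after decompression"; it assumes the standard Postgres page layout where pd_lsn occupies the first 8 bytes of the page header, stored as two native-endian u32 halves (illustrative code, not from this PR):

    // Illustrative sketch: after decompressing a page that was stored in its
    // original wal_compression form, overwrite pd_lsn (first 8 bytes of the page
    // header) with the LSN kept alongside the compressed image. Assumes a
    // little-endian platform, matching how Postgres stores the header natively.
    fn patch_page_lsn(page: &mut [u8], lsn: u64) {
        let xlogid = (lsn >> 32) as u32; // high half of the LSN
        let xrecoff = lsn as u32;        // low half of the LSN
        page[0..4].copy_from_slice(&xlogid.to_le_bytes());
        page[4..8].copy_from_slice(&xrecoff.to_le_bytes());
    }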

I suggest the following steps forward:

  1. A PR to support digesting WAL pages compressed in PostgreSQL with wal_compression. Decompress them in the pageserver and digest. This doesn't require any storage format changes, so there's no worry of backwards-compatibility. For this, I'd like to see some tests to demonstrate the benefit in the best case, like at a CREATE INDEX that emits lots of full-page images in a single record.

  2. A PR to implement zstd or lz4 compression of image layers. It needs to include a pageserver option so that it can be turned on and off, on a per-tenant basis. Default to off for now. Pay careful attention to the storage format changes, because we'll be stuck with the new format for a long time. It's OK to not include points 3. and 4. above, but let's pay close attention to the storage format, so that those will be straightforward to add on later. Include performance tests to demonstrate the effect on disk size, but also on CPU usage. How much CPU will we burn on the compression? I believe our pageservers currently have plenty of spare CPU, but let's measure it.

  3. More PRs for the points 3-6 above. They're not urgent.

@knizhnik knizhnik force-pushed the wal_lz4_compression branch from 8dc9987 to 42d0b04 Compare March 11, 2024 15:34
@knizhnik
Contributor Author

knizhnik commented Mar 11, 2024

Right now this PR supports compression of

  1. WAL FPI
  2. Page images in inmemory layer
  3. Page images in delta layer
  4. Image_layer pages
  5. get_page response

Case 1) doesn't require any API or format changes.
Cases 2) and 3) require adding CompressedImage to the Value enum, because these layers are serialized using serde.
I am not sure if we need image compression for these layers:
On the one hand, they are stored on disk in the PS and in S3, and are transferred from the PS to S3.
On the other hand, these layers are somewhat intermediate, and most likely we have to extract and decompress the image in order to apply WAL records to it and perform page reconstruction.

In case 4) we really need a special tag for compressed images (and it is added by this PR).

It needs to include a pageserver option so that it can be turned on and off, on a per-tenant basis.

I wonder if it should be part of TenantConfig or implemented in some other way. Should we enable/disable compression for a particular tenant or region? Right now there is no TenantConfig available in blob_io, where compression is performed,
and there is no nice way to pass it there. I do not see much sense in enabling/disabling compression for a particular tenant, so maybe a global feature flag that can be used to turn compression on/off for a particular region is a better choice?

A PR to implement zstd or lz4 compression of image layers.

I do not think that we should support different compression algorithms, i.e. both lz4 and zstd. From my experiments lz4 provides the best size/speed ratio: it is much faster than the competitors, and the difference in compressed size is not that large.

Also I do not see much sense in splitting this PR into three by separating 1) and 5).

@knizhnik
Contributor Author

After thinking a little bit more and observing problems with building pg14 (which doesn't support lz4 compression),
I decided to create a separate PR for compressing only image layers:
#7091

@ololobus
Member

This PR has been automatically marked as Draft because it was last updated 304 days ago. Please, consider closing it if it's not needed anymore.

@ololobus ololobus marked this pull request as draft January 10, 2025 13:12
@arpad-m
Member

arpad-m commented Jan 10, 2025

This was a slightly different approach to #7091, and I think it can be closed given that we have #5431 now as well as #9821. cc @VladLazar

@arpad-m arpad-m closed this Jan 10, 2025