Add total event, unencrypted message, and e2ee event counts to stats reporting (v2) #18371
Conversation
Add total event, unencrypted message, and e2ee event counts to stats reporting (#18260) Co-authored-by: Eric Eastwood <erice@element.io>
This was the original reason the PR was reverted, see #18346

> The `RETURNING` syntax has been supported by SQLite since version 3.35.0 (2021-03-12).
>
> -- https://www.sqlite.org/lang_returning.html

Synapse supports...

> The oldest supported version of SQLite is the version [provided](https://packages.debian.org/bullseye/libsqlite3-0) by [Debian oldstable](https://wiki.debian.org/DebianOldStable).
>
> -- https://element-hq.github.io/synapse/latest/deprecation_policy.html

which currently is https://packages.debian.org/bullseye/sqlite3 -> `3.34.1-3+deb11u1`

We have `self.db_pool.engine.supports_returning` to detect whether we can use `RETURNING`.
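As a standalone illustration of that version gate and the two code paths it implies, here is a rough sketch using the stdlib `sqlite3` module. The single-row `event_stats` table and its `total_event_count` column are assumptions for the example, not necessarily the PR's actual schema:

```python
import sqlite3

def supports_returning() -> bool:
    # RETURNING landed in SQLite 3.35.0 (2021-03-12).
    return sqlite3.sqlite_version_info >= (3, 35, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_stats (total_event_count INTEGER NOT NULL)")
conn.execute("INSERT INTO event_stats VALUES (0)")

if supports_returning():
    # One round trip on new enough SQLite.
    row = conn.execute(
        "UPDATE event_stats SET total_event_count = total_event_count + 1 "
        "RETURNING total_event_count"
    ).fetchone()
else:
    # Older SQLite (e.g. Debian bullseye's 3.34.1): update, then read back.
    conn.execute("UPDATE event_stats SET total_event_count = total_event_count + 1")
    row = conn.execute("SELECT total_event_count FROM event_stats").fetchone()

print(row[0])  # -> 1
```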
```python
            tok=user_2_token,
        )

    def test_concurrent_event_insert(self) -> None:
```
I could use a few pointers on how best to architect this test. I'd like to start a transaction that involves the `event_stats` table, then try to insert a new event to reproduce the concurrency error: `psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update` (#18349). The problem with the current test is that the transaction just blocks everything: `asdf1` is printed but `asdf2` never is. I'm not sure how best to release control back to the GIL to run other tasks and continue the test.

I'm assuming the real-life scenario that caused this error happens when you have multiple stream writer workers for the `events` stream (multiple `event_persister`s), since Synapse "experimentally supports having multiple writer workers, where load is sharded between them by room ID. Each writer is called an event persister." (source). You then send events into different rooms at the same time, handled by separate workers.
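Outside of Synapse's test harness, the failure mode itself is easy to provoke with two plain `psycopg2` connections. This is only a sketch of the mechanics, assuming a local Postgres with a single-row `event_stats(total_event_count bigint)` table; both the DSN and the schema are hypothetical:

```python
import threading

import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

DSN = "dbname=test"  # hypothetical connection string

def bump_counter(barrier: threading.Barrier) -> None:
    conn = psycopg2.connect(DSN)
    conn.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
    try:
        with conn, conn.cursor() as cur:
            # Take the transaction's snapshot...
            cur.execute("SELECT total_event_count FROM event_stats")
            barrier.wait()  # ...and make sure both transactions overlap.
            # Both transactions update the same row: the second committer
            # raises "could not serialize access due to concurrent update".
            cur.execute(
                "UPDATE event_stats SET total_event_count = total_event_count + 1"
            )
    except psycopg2.errors.SerializationFailure as e:
        print("reproduced:", e)
    finally:
        conn.close()

barrier = threading.Barrier(2)
threads = [threading.Thread(target=bump_counter, args=(barrier,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```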
I think part of the problem is that `ThreadedMemoryReactorClock` "doesn't really use threads" and uses `make_fake_db_pool(...)`, which "runs db queries synchronously [...] on the test reactor's main thread".
Lines 699 to 705 in d59bbd8

```python
    """Wrapper for `make_pool` which builds a pool which runs db queries synchronously.

    For more deterministic testing, we don't use a regular db connection pool: instead
    we run all db queries synchronously on the test reactor's main thread. This function
    is a drop-in replacement for the normal `make_pool` which builds such a connection
    pool.
    """
```
Lines 638 to 641 in d59bbd8

```python
    # [1]: we replace the threadpool backing the db connection pool with a
    # mock ThreadPool which doesn't really use threads; but we still use
    # reactor.callFromThread to feed results back from the db functions to the
    # main thread.
```
So no matter how much I try to split things in threads for parallel execution, it just blocks.
Other references:
- `defer_to_thread(...)` (thanks for pointing this one out @anoadragon453; sketched below)
- `self.reactor.callFromThread(...)`
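For reference, a minimal sketch of how these primitives behave outside the test harness, using Twisted's underlying `deferToThread` with a real reactor; `blocking_db_query` is a stand-in for a synchronous DB call, not Synapse code:

```python
from twisted.internet import reactor
from twisted.internet.threads import deferToThread

def blocking_db_query() -> int:
    # Stand-in for a synchronous database call.
    return 42

def on_result(result: int) -> None:
    # deferToThread delivers the result back on the reactor thread
    # (via callFromThread under the hood).
    print("query returned", result)
    reactor.stop()

d = deferToThread(blocking_db_query)
d.addCallback(on_result)
reactor.run()
```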
I still don't have a solution for this.

I've also tried using a `BaseMultiWorkerStreamTestCase` with sharded "event_persisters" (multiple workers for the `events` stream) to mimic the real-life scenario that I think could reproduce this error (see `test_sharded_event_persisters`), but only the first event in each room seems to send (requires more investigation). This kind of test doesn't guarantee a reproduction either, as the stars could align and events could send fine. Additionally, as `BaseMultiWorkerStreamTestCase` is implemented right now, I don't think it will work: all of the workers share the same reactor, so at best they send one at a time without concurrency conflicts.
```python
if isinstance(txn.database_engine, Sqlite3Engine):
    txn.execute(
        """
        CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
```
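(The snippet above is truncated at the comment anchor. For context, a trigger along these lines would look roughly like the following sketch; the trigger body and column name are illustrative guesses, not the PR's actual definition:)

```python
txn.execute(
    """
    CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
    AFTER INSERT ON events
    BEGIN
        -- Keep a running total in the single-row event_stats table
        -- (column name assumed for illustration).
        UPDATE event_stats SET total_event_count = total_event_count + 1;
    END;
    """
)
```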
I'm concerned that `TRIGGER`s may not be a viable option overall because of the concurrency problems (`psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update`) with the default `IsolationLevel.READ_COMMITTED` for database transactions.

When we insert events, we have pretty giant transactions that also update all of the associated tables in one go. I'm concerned that changing the isolation level of all of the transactions where we insert/delete `events` will unacceptably degrade performance (we'd probably need `IsolationLevel.SERIALIZABLE`; need to investigate whether `IsolationLevel.REPEATABLE_READ` is sufficient).

Any good alternatives, or words of encouragement just to try it?
We can set the isolation level per transaction apparently:
```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- your queries here
COMMIT;
```
I'm not sure if we have control over that, but that may be preferable to changing the whole database over?
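A rough sketch of that per-transaction approach from Python with plain `psycopg2` (the DSN and table are hypothetical), relying on the fact that `SET TRANSACTION` applies to the implicitly opened transaction when it is the first statement:

```python
import psycopg2

conn = psycopg2.connect("dbname=test")  # hypothetical DSN
with conn.cursor() as cur:
    # psycopg2 sends BEGIN implicitly before the first statement, so
    # SET TRANSACTION here configures just this transaction.
    cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    cur.execute("SELECT total_event_count FROM event_stats")  # assumed table
    print(cur.fetchone())
conn.commit()
conn.close()
```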
@anoadragon453 We can definitely set the isolation level on a per-transaction basis, but the problem is that the triggers run as part of the transaction inserting the `events`. So wherever we insert/delete `events`, we need to change the isolation level, and the transactions involving `events` appear to be pretty large, so using `IsolationLevel.SERIALIZABLE` (the results of the transactions come out the other side as if they were run sequentially) concerns me, as they may end up waiting on each other to finish.
Example of changing isolation level:
synapse/synapse/storage/databases/main/purge_events.py
Lines 345 to 356 in 5ab05e7
```python
# This first runs the purge transaction with READ_COMMITTED isolation level,
# meaning any new rows in the tables will not trigger a serialization error.
# We then run the same purge a second time without this isolation level to
# purge any of those rows which were added during the first.
logger.info("[purge] Starting initial main purge of [1/2]")
await self.db_pool.runInteraction(
    "purge_room",
    self._purge_room_txn,
    room_id=room_id,
    isolation_level=IsolationLevel.READ_COMMITTED,
)
```
FYI the default transaction isolation level in Synapse is repeatable read:
synapse/synapse/storage/engines/postgres.py
Lines 73 to 75 in 2965c99
```python
self.default_isolation_level = (
    psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)
```
Re-introduces #18260, which was reverted.
Adds total message counts to homeserver stats reporting. This is primarily to comply with the TI-Messenger spec, which requires each homeserver to report "Number of message events as an integer, cumulative".
Dev notes
Test with Postgres
`RETURNING` in SQLite

Synapse supports...

which currently is https://packages.debian.org/bullseye/sqlite3 -> `3.34.1-3+deb11u1`

We have `self.db_pool.engine.supports_returning` to detect whether we can use `RETURNING`.
."could not serialize access due to concurrent update"
Context #18349
Relevant error:
The default isolation in Postgres is
IsolationLevel.READ_COMMITTED
but we've set the default to beIsolationLevel.REPEATABLE_READ
in Synapse:synapse/synapse/storage/engines/postgres.py
Lines 73 to 75 in 2965c99
Docs: https://www.postgresql.org/docs/current/transaction-iso.html
In SQLite, all transactions are SERIALIZABLE by default.
Related:
Todo

- [ ] `RETURNING` not being compatible with old versions of SQLite (the original reason the PR was reverted), see Revert "Add total event, unencrypted message, and e2ee event counts to stats reporting" #18346
- [ ] Clear out the `event_stats` table when we add the `TRIGGER` so we always get the correct count regardless of how many times the background update is run, see Add total event, unencrypted message, and e2ee event counts to stats reporting #18260 (comment)

Pull Request Checklist

- [ ] Pull request is based on the develop branch
- [ ] Pull request includes a changelog file. The entry should:
  - Be a short description of your change which makes sense to users. ("Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.")
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
- [ ] Code style is correct (run the linters)