Add total event, unencrypted message, and e2ee event counts to stats reporting (v2) #18371

Draft · wants to merge 9 commits into `develop`
Conversation

@MadLittleMods (Contributor) commented Apr 29, 2025:

Re-introduce #18260, which was reverted.

Adds total message counts to homeserver stats reporting. This is primarily to comply with the TI-Messenger spec, which requires each homeserver to report "Number of message events as an integer, cumulative".
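For illustration only, the shape of the addition implied by the PR title might look something like the sketch below; the key names and store methods here are assumptions, not taken from the actual diff:

```python
# Hypothetical sketch of the new phone-home stats fields. The key names and
# the store methods are guesses based on the PR title, not the real diff.
stats["total_event_count"] = await store.count_total_events()
stats["unencrypted_message_count"] = await store.count_unencrypted_messages()
stats["e2ee_event_count"] = await store.count_e2ee_events()
```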

Dev notes

```sh
poetry run trial tests.metrics.test_phone_home_stats
poetry run trial tests.storage.test_event_stats
```

Test with Postgres

```sh
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres poetry run trial tests.metrics.test_phone_home_stats
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres poetry run trial tests.storage.test_event_stats
```

`RETURNING` in SQLite

> The `RETURNING` syntax has been supported by SQLite since version 3.35.0 (2021-03-12).
>
> *-- https://www.sqlite.org/lang_returning.html*

Synapse supports...

> The oldest supported version of SQLite is the version [provided](https://packages.debian.org/bullseye/libsqlite3-0) by [Debian oldstable](https://wiki.debian.org/DebianOldStable).
>
> *-- https://element-hq.github.io/synapse/latest/deprecation_policy.html*

which currently is https://packages.debian.org/bullseye/sqlite3 -> `3.34.1-3+deb11u1`

We have `self.db_pool.engine.supports_returning` to detect whether we can use `RETURNING`.
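As a sketch of how that flag might be used (not the PR's actual code; `_increment_and_fetch_total_txn` is a hypothetical helper, though `LoggingTransaction`, `database_engine.supports_returning`, and the `event_stats` table/column names are real):

```python
from synapse.storage.database import LoggingTransaction

def _increment_and_fetch_total_txn(txn: LoggingTransaction) -> int:
    """Hypothetical helper: bump the counter and return the new total."""
    if txn.database_engine.supports_returning:
        # SQLite >= 3.35.0 and Postgres can hand back the updated value directly.
        txn.execute(
            "UPDATE event_stats SET total_event_count = total_event_count + 1"
            " RETURNING total_event_count"
        )
    else:
        # Older SQLite (e.g. Debian oldstable's 3.34.1): update, then read the
        # value back within the same transaction.
        txn.execute(
            "UPDATE event_stats SET total_event_count = total_event_count + 1"
        )
        txn.execute("SELECT total_event_count FROM event_stats")
    row = txn.fetchone()
    assert row is not None
    return row[0]
```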

"could not serialize access due to concurrent update"

Context: #18349

Relevant error:

```
psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update
CONTEXT:  SQL statement "UPDATE event_stats SET total_event_count = total_event_count + 1"
PL/pgSQL function event_stats_increment_counts() line 5 at SQL statement
```

The default isolation level in Postgres is `IsolationLevel.READ_COMMITTED`, but we've set the default to `IsolationLevel.REPEATABLE_READ` in Synapse:

```python
self.default_isolation_level = (
    psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)
```

Docs: https://www.postgresql.org/docs/current/transaction-iso.html

In SQLite, all transactions are SERIALIZABLE by default.
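To make the failure mode concrete, here is a minimal standalone reproduction sketch, not part of the PR, using two raw psycopg2 connections at REPEATABLE READ (the DSN is a placeholder):

```python
import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

DSN = "dbname=synapse user=postgres"  # placeholder connection string

conn_a = psycopg2.connect(DSN)
conn_b = psycopg2.connect(DSN)
conn_a.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
conn_b.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)

with conn_a.cursor() as cur_a, conn_b.cursor() as cur_b:
    # Both transactions take their snapshot of the same row...
    cur_a.execute("SELECT total_event_count FROM event_stats")
    cur_b.execute("SELECT total_event_count FROM event_stats")

    # ...then the first one updates it and commits.
    cur_a.execute("UPDATE event_stats SET total_event_count = total_event_count + 1")
    conn_a.commit()

    try:
        # The row now differs from conn_b's snapshot, so under REPEATABLE READ
        # this raises "could not serialize access due to concurrent update".
        cur_b.execute("UPDATE event_stats SET total_event_count = total_event_count + 1")
        conn_b.commit()
    except errors.SerializationFailure:
        conn_b.rollback()  # a real caller would retry the whole transaction
```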

Related:

Todo

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct
    (run the linters)

This was the original reason the PR was reverted (see #18346): `RETURNING` requires SQLite 3.35.0, which is newer than the `3.34.1` that Debian oldstable ships, as covered in the dev notes above.

So we always get the correct count regardless of how many times the background update is run.
```python
        tok=user_2_token,
    )

def test_concurrent_event_insert(self) -> None:
```
@MadLittleMods (Contributor, Author) commented Apr 30, 2025:

I could use a few pointers on how best to architect this test. I'd like to start a transaction that involves the `event_stats` table, then try to insert a new event to reproduce the concurrency error `psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update` (#18349). The problem with the current test is that the transaction just blocks everything: `asdf1` is printed but not `asdf2`. I'm not sure how best to yield control back so other tasks can run and the test can continue.

I'm assuming the real-life scenario that caused this error happens when you have multiple stream writer workers for the `events` stream (multiple `event_persister`s), since Synapse "experimentally supports having multiple writer workers, where load is sharded between them by room ID. Each writer is called an event persister." (source). And then you just send events into different rooms at the same time that are handled by separate workers.

@MadLittleMods (Contributor, Author) commented Apr 30, 2025:

I think part of the problem is that `ThreadedMemoryReactorClock` "doesn't really use threads" and uses `make_fake_db_pool(...)`, which "runs db queries synchronously [...] on the test reactor's main thread".

synapse/tests/server.py, lines 699-705 (at d59bbd8):

```python
"""Wrapper for `make_pool` which builds a pool which runs db queries synchronously.

For more deterministic testing, we don't use a regular db connection pool: instead
we run all db queries synchronously on the test reactor's main thread. This function
is a drop-in replacement for the normal `make_pool` which builds such a connection
pool.
"""
```

synapse/tests/server.py, lines 638-641 (at d59bbd8):

```python
# [1]: we replace the threadpool backing the db connection pool with a
# mock ThreadPool which doesn't really use threads; but we still use
# reactor.callFromThread to feed results back from the db functions to the
# main thread.
```

So no matter how much I try to split things in threads for parallel execution, it just blocks.


Other references (see the sketch after this list):

  • `defer_to_thread(...)` (thanks for pointing this one out @anoadragon453)
  • `self.reactor.callFromThread(...)`
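For reference, the rough shape of that attempt inside a test method (a sketch only: `_send_event_from_thread` is a hypothetical helper, the tokens come from the diff context above, and, as described, the fake db pool serializes everything onto the reactor thread, so this does not actually race):

```python
from twisted.internet import defer

from synapse.logging.context import defer_to_thread

def _send_event_from_thread(token: str) -> None:
    # Hypothetical helper: run a single event-insert transaction.
    # Body elided; only the threading shape matters here.
    ...

# Kick off two inserts on (supposedly) separate threads and wait for both.
# Under the fake test reactor these still run serially, which is exactly
# the blocking behaviour described above.
d1 = defer_to_thread(self.reactor, _send_event_from_thread, user_1_token)
d2 = defer_to_thread(self.reactor, _send_event_from_thread, user_2_token)
self.get_success(defer.gatherResults([d1, d2]))
```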

@MadLittleMods (Contributor, Author) commented Apr 30, 2025:

I still don't have a solution for this.

I've also tried using a `BaseMultiWorkerStreamTestCase` with sharded event persisters (multiple workers for the `events` stream) to mimic the real-life scenario that I think could reproduce this error (see `test_sharded_event_persisters`), but only the first event in each room seems to send (requires more investigation). This kind of test doesn't guarantee a reproduction either, as the stars could align and events could send fine. Additionally, as `BaseMultiWorkerStreamTestCase` is implemented right now, I don't think it will work, as all of the workers share the same reactor, so at best they send one at a time without concurrency conflicts.

```python
if isinstance(txn.database_engine, Sqlite3Engine):
    txn.execute(
        """
        CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
```
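The diff context is cut off above; a complete trigger of this shape might look roughly like the following sketch (the `AFTER INSERT` body is an assumption, not the PR's actual trigger; `Sqlite3Engine` comes from `synapse.storage.engines`):

```python
# Hypothetical completion of the truncated block above: bump the running
# total every time a row is inserted into `events`. The trigger body is a
# guess; only the CREATE TRIGGER name comes from the diff.
if isinstance(txn.database_engine, Sqlite3Engine):
    txn.execute(
        """
        CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
        AFTER INSERT ON events
        BEGIN
            UPDATE event_stats SET total_event_count = total_event_count + 1;
        END;
        """
    )
```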
@MadLittleMods (Contributor, Author) commented Apr 30, 2025:

I'm concerned that `TRIGGER`s may not be a viable option overall because of the concurrency problems (`psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update`) with the default `IsolationLevel.READ_COMMITTED` for database transactions.

When we insert events, we have pretty giant transactions that also update all of the associated tables in one go. I'm concerned that changing the isolation level of all of the transactions where we insert/delete events will unacceptably degrade performance (we would probably need `IsolationLevel.SERIALIZABLE`; I need to investigate whether `IsolationLevel.REPEATABLE_READ` is sufficient).

Any good alternatives or words of encouragement just to try it?

@anoadragon453 (Member) commented:

We can set the isolation level per transaction apparently:

```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- your queries here
COMMIT;
```

I'm not sure if we have control over that, but that may be preferable to changing the whole database over?

@MadLittleMods (Contributor, Author) commented Apr 30, 2025:

@anoadragon453 We can definitely set the isolation level on a per-transaction basis, but the problem is that the triggers run as part of the transaction inserting the events. So wherever we insert/delete events, we would need to change the isolation level, and the transactions involving events appear to be pretty large, so using `IsolationLevel.SERIALIZABLE` (where the results of concurrent transactions come out the other side as if they were run sequentially) concerns me, as they may end up waiting on each other to finish.

Example of changing isolation level:

```python
# This first runs the purge transaction with READ_COMMITTED isolation level,
# meaning any new rows in the tables will not trigger a serialization error.
# We then run the same purge a second time without this isolation level to
# purge any of those rows which were added during the first.
logger.info("[purge] Starting initial main purge of [1/2]")
await self.db_pool.runInteraction(
    "purge_room",
    self._purge_room_txn,
    room_id=room_id,
    isolation_level=IsolationLevel.READ_COMMITTED,
)
```

@erikjohnston (Member) commented May 1, 2025:

FYI the default transaction isolation level in Synapse is repeatable read:

```python
self.default_isolation_level = (
    psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)
```
