Add total event, unencrypted message, and e2ee event counts to stats reporting (v2) #18371
Conversation
Add total event, unencrypted message, and e2ee event counts to stats reporting (#18260) Co-authored-by: Eric Eastwood <erice@element.io>
This was the original reason the PR was reverted, see #18346

> The `RETURNING` syntax has been supported by SQLite since version 3.35.0 (2021-03-12).
>
> -- https://www.sqlite.org/lang_returning.html

Synapse supports...

> The oldest supported version of SQLite is the version [provided](https://packages.debian.org/bullseye/libsqlite3-0) by [Debian oldstable](https://wiki.debian.org/DebianOldStable).
>
> -- https://element-hq.github.io/synapse/latest/deprecation_policy.html

which currently is https://packages.debian.org/bullseye/sqlite3 -> `3.34.1-3+deb11u1`

We have `self.db_pool.engine.supports_returning` to detect whether we can use `RETURNING`.
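As a standalone illustration of that version gate and the two code paths it implies, here is a rough sketch using the stdlib `sqlite3` module. The single-row `event_stats` table and its `total_event_count` column are assumptions for the example, not necessarily the PR's actual schema:

```python
import sqlite3

def supports_returning() -> bool:
    # RETURNING landed in SQLite 3.35.0 (2021-03-12).
    return sqlite3.sqlite_version_info >= (3, 35, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_stats (total_event_count INTEGER NOT NULL)")
conn.execute("INSERT INTO event_stats VALUES (0)")

if supports_returning():
    # One round trip on new enough SQLite.
    row = conn.execute(
        "UPDATE event_stats SET total_event_count = total_event_count + 1 "
        "RETURNING total_event_count"
    ).fetchone()
else:
    # Older SQLite (e.g. Debian bullseye's 3.34.1): update, then read back.
    conn.execute("UPDATE event_stats SET total_event_count = total_event_count + 1")
    row = conn.execute("SELECT total_event_count FROM event_stats").fetchone()

print(row[0])  # -> 1
```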
```python
            tok=user_2_token,
        )

    def test_concurrent_event_insert(self) -> None:
```
I could use a few pointers on how best to architect this test. I'd like to start a transaction that involves the `event_stats` table, then try to insert a new event to reproduce the concurrency error: `psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update` (#18349). The problem with the current test is that the transaction just blocks everything: `asdf1` is printed but `asdf2` never is. I'm not sure how best to release control back to the GIL to run other tasks and continue the test.

I'm assuming the real-life scenario that caused this error happens when you have multiple stream writer workers for the `events` stream (multiple `event_persister`s), since Synapse "experimentally supports having multiple writer workers, where load is sharded between them by room ID. Each writer is called an event persister." (source). You then send events into different rooms at the same time, handled by separate workers.
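Outside of Synapse's test harness, the failure mode itself is easy to provoke with two plain `psycopg2` connections. This is only a sketch of the mechanics, assuming a local Postgres with a single-row `event_stats(total_event_count bigint)` table; both the DSN and the schema are hypothetical:

```python
import threading

import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

DSN = "dbname=test"  # hypothetical connection string

def bump_counter(barrier: threading.Barrier) -> None:
    conn = psycopg2.connect(DSN)
    conn.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
    try:
        with conn, conn.cursor() as cur:
            # Take the transaction's snapshot...
            cur.execute("SELECT total_event_count FROM event_stats")
            barrier.wait()  # ...and make sure both transactions overlap.
            # Both transactions update the same row: the second committer
            # raises "could not serialize access due to concurrent update".
            cur.execute(
                "UPDATE event_stats SET total_event_count = total_event_count + 1"
            )
    except psycopg2.errors.SerializationFailure as e:
        print("reproduced:", e)
    finally:
        conn.close()

barrier = threading.Barrier(2)
threads = [threading.Thread(target=bump_counter, args=(barrier,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```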
I think part of the problem is that `ThreadedMemoryReactorClock` "doesn't really use threads" and uses `make_fake_db_pool(...)`, which "runs db queries synchronously [...] on the test reactor's main thread".
Lines 699 to 705 in d59bbd8

```python
    """Wrapper for `make_pool` which builds a pool which runs db queries synchronously.

    For more deterministic testing, we don't use a regular db connection pool: instead
    we run all db queries synchronously on the test reactor's main thread. This function
    is a drop-in replacement for the normal `make_pool` which builds such a connection
    pool.
    """
```
Lines 638 to 641 in d59bbd8

```python
    # [1]: we replace the threadpool backing the db connection pool with a
    # mock ThreadPool which doesn't really use threads; but we still use
    # reactor.callFromThread to feed results back from the db functions to the
    # main thread.
```
So no matter how much I try to split things in threads for parallel execution, it just blocks.
Other references:
- `defer_to_thread(...)` (thanks for pointing this one out @anoadragon453; sketched below)
- `self.reactor.callFromThread(...)`
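For reference, a minimal sketch of how these primitives behave outside the test harness, using Twisted's underlying `deferToThread` with a real reactor; `blocking_db_query` is a stand-in for a synchronous DB call, not Synapse code:

```python
from twisted.internet import reactor
from twisted.internet.threads import deferToThread

def blocking_db_query() -> int:
    # Stand-in for a synchronous database call.
    return 42

def on_result(result: int) -> None:
    # deferToThread delivers the result back on the reactor thread
    # (via callFromThread under the hood).
    print("query returned", result)
    reactor.stop()

d = deferToThread(blocking_db_query)
d.addCallback(on_result)
reactor.run()
```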
I still don't have a solution for this.

I've also tried using a `BaseMultiWorkerStreamTestCase` with sharded "event_persisters" (multiple workers for the `events` stream) to mimic the real-life scenario that I think could reproduce this error (see `test_sharded_event_persisters`), but only the first event in each room seems to send (requires more investigation). This kind of test doesn't guarantee a reproduction either, as the stars could align and events could send fine. Additionally, as `BaseMultiWorkerStreamTestCase` is implemented right now, I don't think it will work: all of the workers share the same reactor, so at best they send one at a time without concurrency conflicts.
```python
if isinstance(txn.database_engine, Sqlite3Engine):
    txn.execute(
        """
        CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
```
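(The snippet above is truncated at the comment anchor. For context, a trigger along these lines would look roughly like the following sketch; the trigger body and column name are illustrative guesses, not the PR's actual definition:)

```python
txn.execute(
    """
    CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
    AFTER INSERT ON events
    BEGIN
        -- Keep a running total in the single-row event_stats table
        -- (column name assumed for illustration).
        UPDATE event_stats SET total_event_count = total_event_count + 1;
    END;
    """
)
```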
I'm concerned that `TRIGGER`s may not be a viable option overall because of the concurrency problems (`psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update`) with the default `IsolationLevel.READ_COMMITTED` for database transactions.

When we insert events, we have pretty giant transactions that also update all of the associated tables in one go. I'm concerned that changing the isolation level of all of the transactions where we insert/delete `events` will unacceptably degrade performance (we'd probably need `IsolationLevel.SERIALIZABLE`; need to investigate whether `IsolationLevel.REPEATABLE_READ` is sufficient).

Any good alternatives, or words of encouragement just to try it?
We can set the isolation level per transaction apparently:
```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- your queries here
COMMIT;
```
I'm not sure if we have control over that, but that may be preferable to changing the whole database over?
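A rough sketch of that per-transaction approach from Python with plain `psycopg2` (the DSN and table are hypothetical), relying on the fact that `SET TRANSACTION` applies to the implicitly opened transaction when it is the first statement:

```python
import psycopg2

conn = psycopg2.connect("dbname=test")  # hypothetical DSN
with conn.cursor() as cur:
    # psycopg2 sends BEGIN implicitly before the first statement, so
    # SET TRANSACTION here configures just this transaction.
    cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    cur.execute("SELECT total_event_count FROM event_stats")  # assumed table
    print(cur.fetchone())
conn.commit()
conn.close()
```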
@anoadragon453 We can definitely set the isolation level on a per-transaction basis, but the problem is that the triggers run as part of the transaction inserting the `events`. So wherever we insert/delete `events`, we need to change the isolation level, and the transactions involving `events` appear to be pretty large, so using `IsolationLevel.SERIALIZABLE` (the results of the transactions come out the other side as if they were run sequentially) concerns me, as they may end up waiting on each other to finish.
Example of changing isolation level:
synapse/synapse/storage/databases/main/purge_events.py
Lines 345 to 356 in 5ab05e7
```python
# This first runs the purge transaction with READ_COMMITTED isolation level,
# meaning any new rows in the tables will not trigger a serialization error.
# We then run the same purge a second time without this isolation level to
# purge any of those rows which were added during the first.
logger.info("[purge] Starting initial main purge of [1/2]")
await self.db_pool.runInteraction(
    "purge_room",
    self._purge_room_txn,
    room_id=room_id,
    isolation_level=IsolationLevel.READ_COMMITTED,
)
```
FYI the default transaction isolation level in Synapse is repeatable read:
synapse/synapse/storage/engines/postgres.py
Lines 73 to 75 in 2965c99
```python
self.default_isolation_level = (
    psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)
```
Re-introduces #18260, which was reverted.
Adds total message counts to homeserver stats reporting. This is primarily to comply with the TI-Messenger spec, which requires each homeserver to report "Number of message events as an integer, cumulative".
Dev notes
Test with Postgres
`RETURNING` in SQLite

Synapse supports...

which currently is https://packages.debian.org/bullseye/sqlite3 -> `3.34.1-3+deb11u1`

We have `self.db_pool.engine.supports_returning` to detect whether we can use `RETURNING`.
."could not serialize access due to concurrent update"
Context #18349
Relevant error:
The default isolation in Postgres is
IsolationLevel.READ_COMMITTED
but we've set the default to beIsolationLevel.REPEATABLE_READ
in Synapse:synapse/synapse/storage/engines/postgres.py
Lines 73 to 75 in 2965c99
Docs: https://www.postgresql.org/docs/current/transaction-iso.html
In SQLite, all transactions are SERIALIZABLE by default.
Related:
Todo

- [ ] `RETURNING` not being compatible with old versions of SQLite (the original reason the PR was reverted), see Revert "Add total event, unencrypted message, and e2ee event counts to stats reporting" #18346
- [ ] Clear out the `event_stats` table when we add the `TRIGGER` so we always get the correct count regardless of how many times the background update is run, see Add total event, unencrypted message, and e2ee event counts to stats reporting #18260 (comment)

Pull Request Checklist

- [ ] Pull request is based on the develop branch
- [ ] Pull request includes a changelog file. The entry should:
  - Be a short description of your change which makes sense to users. ("Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.")
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
- [ ] Code style is correct (run the linters)