Make use of temp tables #24
base: master
Conversation
Closes #23

If this goes according to plan, then we can just slightly patch powa-archivist to DELETE FROM ONLY so as not to waste time on those temp tables. They'll just get purged at disconnection.

The only issue I can see this bringing is that we'll have lots of distinct queryids on the collector itself, if we decide to manage it via powa.

We could make disconnection from the database optional so that we can keep the temp tables active, but I really think this temp table mechanism isn't a big price to pay for the dramatic IO reduction it should bring.

Anyway, comments are welcome.
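To illustrate the suggested powa-archivist patch: when the per-server temp tables INHERIT from the parent history tables, adding ONLY to the purge makes it skip the inheritance children entirely (they get dropped at disconnection anyway). A minimal sketch, assuming a hypothetical `purge_sql()` helper (not actual powa code):

```python
# Hypothetical helper showing the DELETE FROM ONLY idea: with ONLY,
# PostgreSQL deletes from the named parent table but not from any
# inheritance children, i.e. the per-server temp tables are left alone.
def purge_sql(parent_table, only=True):
    """Build the purge statement; ONLY restricts it to the parent."""
    target = ("ONLY " if only else "") + parent_table
    return "DELETE FROM %s WHERE ts < now() - interval '1 day'" % target
```

The plain (non-ONLY) form would also scan every attached temp table, which is exactly the wasted work the PR description wants to avoid.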
data_ins.execute("CREATE TEMP TABLE IF NOT EXISTS %s (CHECK (srvid = %s)) INHERITS (%s)" % (temp_table_name, srvid, target_tbl_name))
# Get rid of the lock. We didn't do anything apart from creating the temp table in this session
# We're doing a COMMIT and a BEGIN; psycopg2 is useless for savepoints anyway...
data_ins.execute("COMMIT")
I really have a big doubt about this, because I suppose I'm probably interfering with the app's transactions. If that's the case, maybe we need two loops: one that initially creates all the temp tables, and another that populates them.
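The two-loop idea can be sketched as a statement generator (hypothetical helper, not powa-collector code): first create every temp table and issue a single COMMIT so the locks taken on the parent tables are released at once, then populate the tables in a second pass so the DDL never interleaves with the app's data transaction.

```python
# Sketch of the two-loop approach. Loop 1 emits all CREATE TEMP TABLE
# statements, followed by one COMMIT that releases the DDL locks; loop 2
# emits the population statements (placeholder INSERTs here; the real
# code would copy the remote snapshot data).
def two_loop_statements(parent_tables, srvid):
    """Return the ordered SQL statements for one snapshot."""
    stmts = []
    for parent in parent_tables:
        stmts.append(
            "CREATE TEMP TABLE IF NOT EXISTS %s_%d (CHECK (srvid = %d)) "
            "INHERITS (%s)" % (parent, srvid, srvid, parent))
    stmts.append("COMMIT")  # single commit: all DDL locks released together
    for parent in parent_tables:
        # hypothetical placeholder for the real data-copy step
        stmts.append("INSERT INTO %s_%d SELECT * FROM src_%s"
                     % (parent, srvid, parent))
    stmts.append("COMMIT")
    return stmts
```

This keeps exactly one extra transaction per snapshot (the DDL one), instead of one COMMIT per table as in the current patch.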
FWIW this shouldn't cause correctness problems, as the need for a single transaction only exists on the remote servers (to get a consistent timestamp), not on the repository server. It may lead to faster xid consumption and thus amplify any problems related to autovacuum / freezing, but the data itself will be consistent.
Ok, it doesn't work: we need the same ts for all of a snapshot, as it's the way we identify it (for joining between different pieces of data…) => back to the drawing board
That's very strange: the ts is returned by the source function on the remote server (as we want the remote server's timestamp at sampling time, not the repository server's at insertion time), so since you're doing a commit on the repository side it shouldn't have any impact on the snapshot's ts.
Yeah, false alarm. I was chasing a strange display problem and thought it was coming from this. It isn't.
I'm testing this patch right now. Just a few notes.
I don't understand how using pgbouncer can lead to more catalog bloat, do you have any reasoning here? Also, do you have powa enabled on the repository server?
@@ -138,12 +152,18 @@ def copy_remote_data_to_repo(cls, data_name,
     if (cls.is_stopping() or not src_ok):
         return errors

-    # insert the data to the transient unlogged table
+    # insert the data to the local temp table
+    temp_table_name = get_table_basename(target_tbl_name, data_ins) + '_' + str(srvid)
You should modify get_global_tmp_name() to return a tuple with the schema name and the table name separately, so there is no need to try to recover the table name afterwards.
In general I think that the temp table approach could be quite nice, but we will need a lot more work to make it usable for everyone. To summarise, the current powa-collector behavior when it does a snapshot is:

So the big problem here is that you will need new temporary tables for each datasource and each repository server for each snapshot. This causes two problems:

In other words, as-is the powa-repository server must be reserved for powa, and you cannot use pg_stat_statements anymore on that server since you have an infinite number of queryids (it will be useless, and will probably induce a ridiculous overhead). Maybe that's already what everyone is doing, but we can't know for sure. And I don't think that we can force this without giving users a possibility to opt out of the various problems. So I think that this feature will require two additional features:

And a lot of documentation to explain why and how to configure it. We also have to think about what the default configuration should be. I think that using temp tables but with persistent connections could be sensible.
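One way to expose those opt-outs might be a pair of per-repository settings, defaulting to the combination suggested above (temp tables with persistent connections). The option names below are invented for illustration; they are not actual powa-collector configuration keys.

```python
# Hypothetical defaults: temp tables enabled, but connections kept alive
# so the temp tables survive between snapshots and the repository
# catalogs aren't churned on every run.
DEFAULTS = {
    "use_temp_tables": True,         # opt out to write directly to the parent tables
    "persistent_connections": True,  # opt out to disconnect (and drop temp tables) after each snapshot
}

def effective_config(user_config):
    """Merge user-provided settings over the defaults."""
    cfg = dict(DEFAULTS)
    cfg.update(user_config)
    return cfg
```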