"Unsupported data format" when loading certain dump files #3

Open
bitinerant opened this issue Oct 1, 2020 · 7 comments
Comments

@bitinerant

I have discovered that dump files from pg_dump via a pipe cannot be opened by pgdumplib. Here is a demonstration (note the |cat on line 4):

$ pg_dump -d codimd -Fc >/tmp/good
$ python3 -c "import pgdumplib; dump = pgdumplib.load('/tmp/good')"
$ # no error
$ pg_dump -d codimd -Fc |cat >/tmp/bad
$ python3 -c "import pgdumplib; dump = pgdumplib.load('/tmp/bad')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/pgdumplib/__init__.py", line 24, in load
    return dump.Dump(converter=converter).load(filepath)
  File "/usr/local/lib/python3.8/dist-packages/pgdumplib/dump.py", line 254, in load
    raise RuntimeError('Unsupported data format')
RuntimeError: Unsupported data format
$ 

The 2 files give an identical summary from pg_restore -l except for the timestamp. No errors are reported.

Is this an issue with pg_dump or pgdumplib?

@gmr
Owner

gmr commented Oct 1, 2020

Probably pgdumplib - Can you provide the first 10 bytes of /tmp/bad?

with open('/tmp/bad', 'rb') as handle:
    print(repr(handle.read(10)))

My guess is that there's a byte sneaking in at position 0 that's giving the format detection an issue.
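
To go one step beyond eyeballing the raw bytes, the fixed-size start of the archive can be decoded field by field. This is only a sketch, not pgdumplib's code: it assumes the layout usually seen in pg_dump custom-format files (the 5-byte magic `PGDMP`, three version bytes, integer size, offset size, format code), and `parse_header_prefix` and `read_header_prefix` are hypothetical helper names.

```python
from typing import NamedTuple

MAGIC = b'PGDMP'


class HeaderPrefix(NamedTuple):
    version: tuple       # (major, minor, revision) of the archive format
    int_size: int        # bytes per integer field
    offset_size: int     # bytes per file offset field
    archive_format: int  # format code (assumed: 1 means custom format)


def parse_header_prefix(data: bytes) -> HeaderPrefix:
    """Decode the first 11 bytes of a dump, under the assumed layout."""
    if len(data) < 11:
        raise ValueError('need at least 11 bytes')
    if data[:5] != MAGIC:
        # A stray leading byte would show up here
        raise ValueError(f'unexpected magic: {data[:5]!r}')
    major, minor, rev, int_size, offset_size, fmt = data[5:11]
    return HeaderPrefix((major, minor, rev), int_size, offset_size, fmt)


def read_header_prefix(path: str) -> HeaderPrefix:
    """Read and decode the header prefix of the file at path."""
    with open(path, 'rb') as handle:
        return parse_header_prefix(handle.read(11))
```

Calling `read_header_prefix('/tmp/bad')` either raises on a shifted magic string or returns the decoded fields, which would confirm or rule out the stray-byte theory.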

@bitinerant
Author

# hd /tmp/bad |head -10
00000000  50 47 44 4d 50 01 0e 00  04 08 01 01 01 00 00 00  |PGDMP...........|
00000010  00 19 00 00 00 00 1e 00  00 00 00 17 00 00 00 00  |................|
00000020  1e 00 00 00 00 08 00 00  00 00 78 00 00 00 00 00  |..........x.....|
00000030  00 00 00 00 06 00 00 00  63 6f 64 69 6d 64 00 14  |........codimd..|
00000040  00 00 00 31 32 2e 32 20  28 55 62 75 6e 74 75 20  |...12.2 (Ubuntu |
00000050  31 32 2e 32 2d 34 29 00  14 00 00 00 31 32 2e 32  |12.2-4).....12.2|
00000060  20 28 55 62 75 6e 74 75  20 31 32 2e 32 2d 34 29  | (Ubuntu 12.2-4)|
00000070  00 29 00 00 00 00 c2 0b  00 00 00 00 00 00 00 00  |.)..............|
00000080  01 00 00 00 30 00 01 00  00 00 30 00 08 00 00 00  |....0.....0.....|
00000090  45 4e 43 4f 44 49 4e 47  00 08 00 00 00 45 4e 43  |ENCODING.....ENC|

There's a different error (Invalid archive header) when the header is wrong. Unsupported data format means it can't find constants.K_OFFSET_POS_SET (dump.py). For a binary comparison of the good and bad dump files, see "Update 2" in Stack Overflow question #64145203
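
One possible explanation, offered as an assumption rather than something confirmed in this thread: when pg_dump writes the custom format to a pipe it cannot seek back to fill in the data offsets, so table-of-contents entries may carry a "position not set" flag instead of K_OFFSET_POS_SET. The sketch below decodes a single offset field under that assumption. The flag values (1, 2, 3) and the little-endian byte order are taken from PostgreSQL's pg_backup_archiver source as best understood here, and `read_offset` is a hypothetical helper, not part of pgdumplib.

```python
import io

# Flag values assumed from PostgreSQL's pg_backup_archiver.h
K_OFFSET_POS_NOT_SET = 1  # offset unknown (e.g. output was not seekable)
K_OFFSET_POS_SET = 2      # offset of the data block is recorded
K_OFFSET_NO_DATA = 3      # entry has no data block


def read_offset(handle, offset_size=8):
    """Read one offset field: a flag byte, then offset_size bytes.

    The value bytes are assumed to be little-endian.
    Returns a (flag, value) tuple.
    """
    raw = handle.read(1 + offset_size)
    if len(raw) != 1 + offset_size:
        raise EOFError('truncated offset field')
    return raw[0], int.from_bytes(raw[1:], 'little')


# Usage with synthetic bytes: a "set" offset of 4096 and an unset one
flag, value = read_offset(io.BytesIO(bytes([2]) + (4096).to_bytes(8, 'little')))
print(flag == K_OFFSET_POS_SET, value)
flag, value = read_offset(io.BytesIO(bytes([1]) + bytes(8)))
print(flag == K_OFFSET_POS_NOT_SET, value)
```

If this assumption holds, a loader that accepts only K_OFFSET_POS_SET would reject a piped dump even though the file is complete and pg_restore reads it without complaint.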

@gmr
Owner

gmr commented Oct 1, 2020

Ah, yeah, I'm not sure that I can properly detect if the file hasn't fully transferred yet just by reading stdin.

So ultimately the issue here is pgdumplib assumes it has full access to the entire file, but doesn't.

Out of curiosity, how are you using the library? It's such a low-level thing, I didn't think it'd see much use.

@bitinerant
Author

Actually, in the test above, pgdumplib has the complete file from the start. I don't call Python until the pg_dump -d codimd -Fc |cat >/tmp/bad command is completely finished.

This issue came up when I was testing pgdumplib to see if I could use it as the core of a small Python project to compare versions of a database. If that works, I have other use cases too.

@gmr
Owner

gmr commented Oct 2, 2020

Huh, and there's no binary diff between bad and good files?

Re a diff tool, I have another project, probably much bigger in scope than what you're looking to do - pglifecycle, which is what I wrote pgdumplib for. One of the key features I'm planning on is diffing and generating modification DDL based on diffs.

@bitinerant
Author

The good and bad files are different; see "Update 2" in Stack Overflow question #64145203. The problem is that, of the 9 differences, I don't know which represent the timestamp and which represent the reason that pgdumplib cannot open the file.
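
To narrow down which differences matter, the differing bytes can be grouped into contiguous ranges and then lined up against the header and table-of-contents fields. A minimal sketch: `differing_offsets`, `group_runs` and `compare_files` are hypothetical helper names, and both files are read fully into memory, which assumes the dumps are small.

```python
from itertools import zip_longest


def differing_offsets(data_a: bytes, data_b: bytes) -> list:
    """Return every byte offset at which the two buffers differ.

    If one buffer is shorter, the missing bytes count as differences.
    """
    pairs = zip_longest(data_a, data_b)
    return [i for i, (a, b) in enumerate(pairs) if a != b]


def group_runs(offsets) -> list:
    """Collapse consecutive offsets into (start, end) ranges, end inclusive."""
    runs = []
    for off in offsets:
        if runs and off == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], off)
        else:
            runs.append((off, off))
    return runs


def compare_files(path_a: str, path_b: str) -> list:
    """Return the differing byte ranges between two files."""
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        return group_runs(differing_offsets(a.read(), b.read()))
```

Running `compare_files('/tmp/good', '/tmp/bad')` yields a short list of ranges; a range inside the header's timestamp fields can be set aside, leaving the ones worth inspecting.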

The diff tool project sounds interesting. For what I'm doing, I actually need the data formatted in a specific way which may be hard to do in a generic tool.

@themreza

themreza commented Aug 23, 2021

I'm facing the same problem.
My dump file looks like this:

00000000  50 47 44 4d 50 01 0e 00  04 08 01 01 01 00 00 00  |PGDMP...........|
00000010  00 2b 00 00 00 00 13 00  00 00 00 00 00 00 00 00  |.+..............|
00000020  16 00 00 00 00 07 00 00  00 00 79 00 00 00 00 00  |..........y.....|
00000030  00 00 00 00 04 00 00 00  74 65 73 74 00 1f 00 00  |........test....|
00000040  00 31 30 2e 31 38 20 28  44 65 62 69 61 6e 20 31  |.10.18 (Debian 1|
00000050  30 2e 31 38 2d 31 2e 70  67 64 67 39 30 2b 31 29  |0.18-1.pgdg90+1)|
00000060  00 1e 00 00 00 31 33 2e  34 20 28 44 65 62 69 61  |.....13.4 (Debia|
00000070  6e 20 31 33 2e 34 2d 31  2e 70 67 64 67 31 30 30  |n 13.4-1.pgdg100|
00000080  2b 31 29 00 f5 38 00 00  00 9f 35 00 00 00 00 00  |+1)..8....5.....|
00000090  00 00 00 01 00 00 00 30  00 01 00 00 00 30 00 08  |.......0.....0..|

This dump was generated by Odoo: https://github.com/odoo/odoo/blob/14.0/odoo/service/db.py#L228
