Support indexing WACZ files #710
Comments
A WACZ file can be interpreted as a ZIP file with a defined structure. The targets for ipwb (WARCs) are in
In the future, we may want to consider the additional context that WACZ provides.
Sample WACZ: https://play.archipelago.nyc/do/10/iiif/3546d9bd-a25c-4ba1-b96f-29411c0d752a/full/full/0/etd.wacz
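Since a WACZ is a ZIP container, the WARCs that ipwb targets can be located with the standard library alone. The following is a minimal sketch, not ipwb's implementation: `list_warc_members` and `build_demo_wacz` are hypothetical helpers, the entry names in the demo archive are illustrative, and WARC members are assumed to be recognizable by a `.warc` or `.warc.gz` suffix.

```python
import io
import zipfile

# Assumption: WARC members can be recognized by their file suffix.
WARC_SUFFIXES = (".warc", ".warc.gz")


def list_warc_members(wacz):
    """Return the names of ZIP entries that look like WARC files.

    `wacz` may be a filesystem path or a seekable binary file object.
    """
    with zipfile.ZipFile(wacz) as zf:
        return [
            name for name in zf.namelist()
            if name.lower().endswith(WARC_SUFFIXES)
        ]


def build_demo_wacz():
    """Build a tiny in-memory ZIP with an illustrative (hypothetical) layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", "{}")
        zf.writestr("archive/data.warc.gz", b"")
        zf.writestr("pages/pages.jsonl", "")
    buf.seek(0)
    return buf


demo_members = list_warc_members(build_demo_wacz())
print(demo_members)
```

Each returned name could then be opened with `ZipFile.open()` and handed to whatever WARC reader the indexer already uses, so the WACZ would not need to be unpacked to disk first.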
Preliminary support added in 779978a. WACZ detection should be improved, but importing py-wacz incurs other dependencies due to its coupling with pywb.
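One way detection might be tightened without importing py-wacz is to inspect the ZIP directory directly. This is a hedged sketch rather than what 779978a does: `looks_like_wacz` is a hypothetical helper, and the two checks (a top-level `datapackage.json` and at least one `.warc`/`.warc.gz` entry) reflect my reading of the WACZ format and should be verified against the spec before being relied on.

```python
import io
import zipfile


def looks_like_wacz(path_or_file):
    """Heuristically decide whether the input is a WACZ, using only the stdlib.

    Assumptions (to be checked against the WACZ spec):
      * a WACZ is a valid ZIP file,
      * it carries a top-level datapackage.json,
      * it contains at least one entry ending in .warc or .warc.gz.
    """
    if not zipfile.is_zipfile(path_or_file):
        return False
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()
    has_manifest = "datapackage.json" in names
    has_warc = any(
        name.lower().endswith((".warc", ".warc.gz")) for name in names
    )
    return has_manifest and has_warc


def make_zip(entries):
    """Create an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf


wacz_like = make_zip({"datapackage.json": b"{}", "archive/a.warc": b""})
plain_zip = make_zip({"readme.txt": b"hello"})
not_a_zip = io.BytesIO(b"WARC/1.0\r\n")

print(looks_like_wacz(wacz_like), looks_like_wacz(plain_zip), looks_like_wacz(not_a_zip))
```

Checking structure rather than only the file extension means a renamed or mislabeled file would be less likely to be treated as a WACZ, at the cost of opening the archive once.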
Also,
In 9436999, I created a WACZ using:
...which produces a 79 KB file. Attempting to replay this in https://replayweb.page/ shows no URLs in the interface.
The command should include a
@ikreymer Thanks for your proactive feedback here. I ran:
and a 79 KB file
wacz 0.4.6 installed via PyPI, macOS 12.3.1, Python 3.10.4
EDIT: When decompressing the WACZ, the WARCs are present. Perhaps pywb is having an issue replaying them; they were not created with the webrecorder stack.
EDIT2: Uploading the WARCs directly to replayweb.page produces the same result: no URL is shown in the interface. A next step will be to try these WARCs in pywb directly to see if any errors are reported.
EDIT3: warcio seems to work OK with these WARCs, for example:

    from warcio.archiveiterator import ArchiveIterator

    # Print the target URI of every response record in the WARC
    with open('./samples/warcs/5mementos.warc', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))

produces:
Base test added in 25e91ad, but the GH Action is reporting service issues.
Per a discussion with Mark G. at IA, WACZ is supported at web-beta.archive.org/save for those with a "beta" account (which I have).
Via @ikreymer, Web Archive Collection Zipped (WACZ) Format, https://github.com/webrecorder/wacz-format (MIT, potentially reusable)
Example of MDN WACZ at https://twitter.com/webrecorder_io/status/1293730279824089088
https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/mdn.wacz (1.6GB)
Finalizing Issue #604 (resolving #631) would help here, depending on the WACZ's contents. Also, hosting some larger WARCs remotely like this, because they exceed the size restrictions on GitHub, could serve as a means of testing scalability.