Support Datetime other than those specified as 14-digits #283
Comments
The further precision does not appear to be present in the link above. What's the BnF link? Also see iipc/warc-specifications#21. |
First line |
Per the WARC/1.1 spec and iipc/warc-specifications#21, date strings like Traceback (most recent call last):
File "/usr/local/bin/ipwb", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 17, in main
args = checkArgs(sys.argv)
File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 151, in checkArgs
results.func(results)
File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 32, in checkArgs_index
debug=args.debug)
File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 141, in indexFileAt
warcFileFullPath, **encryptionAndCompressionSetting)
File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 179, in getCDXJLinesFromFile
for i in iterForCounting(fhForCounting):
File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 543, in __call__
for entry in entry_iter:
File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 379, in create_record_iter
entry = self.parse_warc_record(record)
File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 465, in parse_warc_record
get_header('WARC-Date'))
File "/usr/local/lib/python2.7/site-packages/pywb/utils/timeutils.py", line 122, in iso_date_to_timestamp
return datetime_to_timestamp(iso_date_to_datetime(string))
File "/usr/local/lib/python2.7/site-packages/pywb/utils/timeutils.py", line 40, in iso_date_to_datetime
the_datetime = datetime.datetime(*map(int, nums))
TypeError: Required argument 'day' (pos 3) not found ...based on 6d219f5. |
Added a sample (variableSizedDates) WARC that I believe conforms to the 1.1 standard with variable length datetime strings. |
Encountered this again in testing, current master (73f136f): % ipwb index samples/warcs/variableSizedDates.warc
Traceback (most recent call last):
File "/usr/local/bin/ipwb", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 19, in main
args = checkArgs(sys.argv)
File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 167, in checkArgs
results.func(results)
File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 34, in checkArgs_index
debug=args.debug)
File "/usr/local/lib/python3.7/site-packages/ipwb/indexer.py", line 174, in indexFileAt
warcFileFullPath, **encryptionAndCompressionSetting)
File "/usr/local/lib/python3.7/site-packages/ipwb/indexer.py", line 291, in getCDXJLinesFromFile
record.rec_headers.get_header('WARC-Date'))
File "/usr/local/lib/python3.7/site-packages/ipwb/util.py", line 165, in iso8601ToDigits14
"%Y-%m-%dT%H:%M:%SZ")
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_strptime.py", line 577, in _strptime_datetime
tt, fraction, gmtoff_fraction = _strptime(data_string, format)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_strptime.py", line 359, in _strptime
(data_string, format)) WARC 1.0 mandates 14-digit date for the
WARC 1.1 allows for other variants, e.g.,
It seems more flexible to simply read and interpret the date instead of checking which version of the spec the WARC should adhere to. As of now, |
Given the rationale for conversion is from ISO8601 to 14-digit datetime, some options:
The former seems more straightforward but may instill unintended assumptions. Fuzziness is inherent in datetimes, as time is continuous, e.g., the millisecond discussion for WARCs. If we read a fuzzy datetime from a WARC and go with option 2, will it be compatible with storing this value in a CDXJ record with no assumptions about the datetime beyond what is specified? @ibnesayeed, can you provide some insight/feedback/commentary on this? |
The key here is ISO8601 with "as much precision as is accurately known." I cannot locate a module to accomplish this, but one approach is a series of tests (e.g., regex) that starts at the highest level of granularity (nine digits following the second) and works all the way down to just the year. This starting point might seem wasteful, given that the more common ISO8601 length only goes up to seconds. For Python: %Y-%m-%dT%H:%M:%SZ With the last version not quite correct (but you, future person, hopefully get the gist). |
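The cascade described above could be sketched roughly as follows. This is only an illustration, not what ipwb does: `FORMATS` and `parse_warc_date` are hypothetical names, and the list of formats is my assumption about the W3CDTF variants WARC/1.1 permits. Note that a nine-digit fraction still falls through every format here, because `%f` stops at six digits.

```python
import datetime

# Hypothetical list of candidate formats, most precise first.
# Assumption: these mirror the W3CDTF variants allowed by WARC/1.1.
FORMATS = [
    '%Y-%m-%dT%H:%M:%S.%fZ',  # fractional seconds (1 to 6 digits only)
    '%Y-%m-%dT%H:%M:%SZ',     # the common second-level form
    '%Y-%m-%dT%H:%MZ',
    '%Y-%m-%d',
    '%Y-%m',
    '%Y',
]


def parse_warc_date(value):
    """Return (datetime, matched_format) for the first format that fits."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(value, fmt), fmt
        except ValueError:
            continue  # try the next, less precise format
    raise ValueError(f'Unrecognized WARC-Date: {value!r}')
```

Returning the matched format alongside the datetime keeps track of how much precision was actually present, since the datetime itself fills missing fields with defaults (month 1, day 1, midnight).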
9cd23ba addresses some of this but I have yet to match the fraction-of-a-second example in that WARC: import datetime
datetime.datetime.strptime('%Y-%m-%dT%H:%M:%S.%fZ','2014-02-10T00:00:01.000000002Z')
ValueError: time data '%Y-%m-%dT%H:%M:%S.%fZ' does not match format '2014-02-10T00:00:01.000000002Z' |
There could be two possible approaches here:
We also need to figure out what URI format we want to support in replay. |
The parameters above are backward; the format string should be second. This works:
Note, however, that %f accepts at most six digits (one to six when parsing, zero-padded on the right). The WARC/1.1 spec says:
This is problematic and conflicting with the sub-second W3CDTF says:
%f might be insufficient, as it accepts at most six digits and WARC-Dates can have 1-9 digits. Is there a format portion (akin to |
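A quick way to see the limit for yourself. This is just a small check against the standard library; `FMT` and `strptime_accepts` are names I made up for the illustration.

```python
import datetime

FMT = '%Y-%m-%dT%H:%M:%S.%fZ'


def strptime_accepts(value, fmt=FMT):
    """Hypothetical helper: True if strptime can parse value with fmt."""
    try:
        datetime.datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# A short fraction parses and is padded on the right: .5 becomes 500000 us.
half_second = datetime.datetime.strptime('2014-02-10T00:00:01.5Z', FMT)

# Six digits is the ceiling; the nine-digit example from the sample WARC fails.
six_ok = strptime_accepts('2014-02-10T00:00:01.000002Z')
nine_ok = strptime_accepts('2014-02-10T00:00:01.000000002Z')
```

So anything between seven and nine fractional digits needs handling outside of `strptime`.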
This level of precision is unlikely but allowable per WARC/1.1, so we need a special case for compliance. One option is to first check compliance with:
dt = '2014-02-10T00:00:01.123456789Z'
dt_f = f'{dt[:26]}{dt[-1:]}'
datetime.datetime.strptime(dt_f, '%Y-%m-%dT%H:%M:%S.%fZ')
...then parse out dt[26:-1], append it to dt[20:26], check it is all digits, and if so, assign it to the final value of the datetime object. |
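An alternative to slicing by fixed offsets is to split the fraction off with a regex, which also copes with fewer than nine digits. A minimal sketch, assuming the full second-level form with a trailing Z; `parse_fractional` is a hypothetical helper, not part of ipwb. Since `datetime` cannot hold anything finer than microseconds, the nanosecond count is returned separately.

```python
import datetime
import re

# Second-level timestamp followed by 1 to 9 fractional digits and a Z.
_FRACTIONAL = re.compile(
    r'([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2})'
    r'\.([0-9]{1,9})Z'
)


def parse_fractional(value):
    """Return (datetime truncated to microseconds, nanoseconds as int)."""
    match = _FRACTIONAL.fullmatch(value)
    if match is None:
        raise ValueError(f'Not a fractional-second WARC-Date: {value!r}')
    base, fraction = match.groups()
    parsed = datetime.datetime.strptime(base, '%Y-%m-%dT%H:%M:%S')
    # Right-pad so '.5' means 500000000 ns, matching how %f pads.
    nanos = int(fraction.ljust(9, '0'))
    return parsed.replace(microsecond=nanos // 1000), nanos
```

For `'2014-02-10T00:00:01.123456789Z'` this keeps 123456 in the datetime's microsecond field and hands back 123456789 for whatever needs the full precision.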
|
b76135a adds support for generating more precise, solely digit-based date strings. These become present in the CDXJs generated, for example: % ipwb index samples/warcs/variableSizedDates.warc
!context ["http://tools.ietf.org/html/rfc7089"]
!meta {"generator": "InterPlanetary Wayback v.0.2020.06.18.1933", "created_at": "2020-06-19T14:37:49.991232"}
us,memento)/ 20140101000000 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmX4gE6SdJK8v67XikqQFJrac4xaqB5kwsgona2nH9hZwm", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmXQB6e2aB7VRaA4CK5H33sTfVC6GxNd1JtSgCaWVuUbfj", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmYWRfaHFcN7ygLUiiKEF6ELApMbdhv7K3zRtrz5rog83U", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001000000002 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/Qmb8q1BFPws4ZNhL9MczY9tb4mWEPdV41LNuXD6oMkvzcw", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
More adjustments may need to be made to ensure replay can handle the potentially longer date strings (see last line above). This issue is complete but I would like to investigate the other end of using the CDXJ files with long date strings. |
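For future readers, here is one way a variable-precision WARC-Date could be reduced to a digit-only string like the ones above. This is a sketch under my own assumptions and not necessarily how b76135a does it: `iso8601_to_digits` is a hypothetical name, and it does no padding, so a reduced-precision date yields fewer than 14 digits.

```python
import re

# Nested optional groups: year, then month, day, hh:mm, :ss, .fraction.
# A trailing Z is required whenever a time component is present.
_W3CDTF = re.compile(
    r'([0-9]{4})'
    r'(?:-([0-9]{2})'
    r'(?:-([0-9]{2})'
    r'(?:T([0-9]{2}):([0-9]{2})'
    r'(?::([0-9]{2})(?:\.([0-9]{1,9}))?)?'
    r'Z)?)?)?'
)


def iso8601_to_digits(value):
    """Concatenate the components that are present, dropping separators."""
    match = _W3CDTF.fullmatch(value)
    if match is None:
        raise ValueError(f'Unrecognized WARC-Date: {value!r}')
    return ''.join(group for group in match.groups() if group is not None)
```

With this, `'2014-02-10T00:00:01.000000002Z'` becomes `'20140210000001000000002'`, the same shape as the last CDXJ line above. Whether shorter results should be padded out to 14 digits is a separate decision about what assumptions the index is allowed to make.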
As suspected, when replaying the CDXJ above and accessing any memento, the |
Add test to show breakage using variable len WARC-Dates for #283
|
The WARC 1.1 spec allows for more precise datetimes. These should be supported in the replay system. Does any tool exist that will generate these yet? If not, some sample data can be fabricated.