Meaning of "<any OCTET except CTLs, but including LWS>" #102

Open
o018BUm8UQEEY2e5 opened this issue Feb 12, 2025 · 3 comments
@o018BUm8UQEEY2e5

I'm confused by this rule in the ABNF provided in The WARC Format 1.1:

TEXT          = <any OCTET except CTLs,
                but including LWS>

Which of these (if any) is the correct interpretation:

TEXT          = %x20-7E | %x80-FF | LWS 
TEXT          = %x20-7E | %x80-FF | SP | HT
TEXT          = %x20-7E | %x80-FF | CR | LF | SP | HT
@ato
Member

ato commented Feb 12, 2025

The first one. CRLF can appear only if immediately followed by SP or HT. This is called line folding. This definition was inherited from the HTTP/1.1 RFC 2616 so you may find the explanatory text in section 2.2 of it helpful.
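As a rough illustration of that first interpretation, here is a minimal Python sketch that checks whether a byte string is a run of TEXT. It assumes LWS = [CRLF] 1*( SP | HT ) as given in RFC 2616 section 2.2; the names `LWS`, `TEXT_RUN` and `is_text` are made up for this example and do not come from any WARC library.

```python
import re

# LWS = [CRLF] 1*( SP | HT ), per RFC 2616 section 2.2 (assumed here).
LWS = rb"(?:\r\n)?[ \t]+"

# TEXT = %x20-7E | %x80-FF | LWS  (the first interpretation).
# DEL (0x7F) is a CTL, so it is excluded; HT is only allowed as part of LWS.
TEXT_RUN = re.compile(rb"(?:[\x20-\x7E\x80-\xFF]|" + LWS + rb")*")


def is_text(value: bytes) -> bool:
    """Return True if value consists only of TEXT octets and LWS."""
    return TEXT_RUN.fullmatch(value) is not None


print(is_text(b"abc\r\n def"))  # folded line: CRLF followed by SP
print(is_text(b"abc\r\ndef"))   # bare CRLF, not followed by SP or HT
```

The first call should be accepted and the second rejected, because a CRLF only counts as part of TEXT when SP or HT follows it immediately.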

Note that while the WARC standard allows them, in practice line folding and non-UTF-8 encodings are not well supported, so I recommend WARC writers avoid using them. Those two features were also deprecated in the newer HTTP RFC 7230.

@o018BUm8UQEEY2e5
Author

o018BUm8UQEEY2e5 commented Feb 12, 2025

But compliant parsers should still support it?

@ato ato added the question label Feb 12, 2025
@ato
Member

ato commented Feb 12, 2025

Yes. I haven't seen it used in real WARC files in the wild, but a fully compliant parser should support it.

From what I've seen, many (but not all) parsers support line folding, but they vary in how they expose a folded value as a string in their header-reading API. Some include the LWS sequence as is; others replace it with a single space or linefeed. I haven't seen any parser that supports the non-UTF-8 'encoded-word' feature though.
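To make those differences concrete, here is a small Python sketch of the three behaviours applied to a raw header value. The function name `unfold` and its `mode` values are hypothetical, chosen only for this example; no particular parser is known to expose this API. It treats a fold as CRLF followed by one or more SP or HT.

```python
import re

# A fold: CRLF immediately followed by one or more SP or HT.
FOLD = re.compile(rb"\r\n[ \t]+")


def unfold(raw: bytes, mode: str = "space") -> bytes:
    """Return a folded header value under one of three illustrative policies."""
    if mode == "preserve":
        return raw                    # keep the LWS sequence as is
    if mode == "space":
        return FOLD.sub(b" ", raw)    # collapse each fold to a single SP
    if mode == "linefeed":
        return FOLD.sub(b"\n", raw)   # collapse each fold to a single LF
    raise ValueError(f"unknown mode: {mode!r}")


folded = b"first part\r\n  second part"
for mode in ("preserve", "space", "linefeed"):
    print(mode, unfold(folded, mode))
```

A reader comparing header values across tools may want to normalise with the `space` policy first, since RFC 2616 section 2.2 permits replacing linear white space with a single SP.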
