Add guidelines (or spec?) for how to represent network/below-HTTP-layer response errors #101

Open
Mr0grog opened this issue Jan 29, 2025 · 2 comments

Contributor

Mr0grog commented Jan 29, 2025

As far as I can tell, there’s no standard or even broadly recommended way to represent network errors or just non-HTTP errors when expecting an HTTP response. I’m thinking about things like connection timeouts, DNS lookup failures, SSL handshake failures, etc. (As a practical, real-world example, some US government websites were shut down last week by deleting their SSL certificates, causing handshake errors.)

I tried Warcio, Wget, and Browsertrix-Crawler on a site with an SSL handshake failure, and none of them recorded either the request or the response. Wget and Browsertrix-Crawler do include their logs (which show the error in a very implementation-specific way) as a resource record in the WARC. I’m not sure if any other crawlers behave differently.

It would be really nice if there were a more standard way of (or at least a recommended pattern for) representing the failed response in its own record, so other systems reading a WARC could affirmatively determine that a given URL or origin was not available.
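To make the idea concrete, here is a minimal sketch (Python, standard library only) of how a crawler might map the failures listed above onto a small, crawler-neutral vocabulary before recording them. The function name `classify_network_error` and every label string are hypothetical; nothing here comes from the WARC spec or from any existing tool.

```python
import socket
import ssl


def classify_network_error(exc: BaseException) -> str:
    """Map a Python exception to a coarse failure label.

    The label strings are hypothetical placeholders, not part of any
    WARC standard or existing crawler.
    """
    # Check the most specific classes first: ssl.SSLError and
    # socket.gaierror are both subclasses of OSError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "tls-certificate-error"
    if isinstance(exc, ssl.SSLError):
        return "tls-handshake-failure"
    if isinstance(exc, socket.gaierror):
        return "dns-lookup-failure"
    # socket.timeout is an alias of TimeoutError on Python 3.10+,
    # and a separate OSError subclass on older versions.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "connection-timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection-refused"
    if isinstance(exc, ConnectionResetError):
        return "connection-reset"
    return "unknown-network-error"
```

A crawler could call this from the `except` branch around its fetch and store the resulting label alongside the target URI, rather than leaving the error only in an implementation-specific log.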

The WARC 1.1 spec seems to suggest that it is OK to record this in a response record, but leaves how to do so entirely open-ended:

> When software bugs, network issues, or implementation limits cause response-like material to be collected that is not perfectly compliant with HTTP specifications, WARC writing software may record the problematic content using its best effort determination of the interesting material boundaries. That is, neither the use of the ‘response’ record with a ‘http’ target-URI nor the ‘application/http’ content-type serves as an absolute guarantee that the contained material is a legal HTTP response. (Section 6.3.2)

Are there any common patterns for doing this that people are using? Would it be possible to include a recommendation in the implementation guidelines, or even in the spec? (Maybe this would benefit from a new WARC header field?)
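As a starting point for discussion, here is a rough sketch of one possible pattern: writing the failure as a `metadata` record with an `application/warc-fields` block, serialized by hand using only the Python standard library. The record framing (version line, mandatory header fields, CRLF separators) follows WARC 1.1, but the choice of a `metadata` record and the block field names `fetch-error` and `fetch-error-detail` are purely hypothetical assumptions, not an established convention.

```python
import uuid
from datetime import datetime, timezone


def build_failure_record(target_uri: str, error_kind: str, detail: str) -> bytes:
    """Serialize one uncompressed WARC/1.1 metadata record describing a failed fetch.

    `fetch-error` and `fetch-error-detail` are hypothetical field names
    invented for this sketch.
    """
    # Record block: warc-fields style "name: value" lines.
    block = (
        f"fetch-error: {error_kind}\r\n"
        f"fetch-error-detail: {detail}\r\n"
    ).encode("utf-8")

    headers = [
        ("WARC-Type", "metadata"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/warc-fields"),
        ("Content-Length", str(len(block))),
    ]

    # Version line, header fields, then a blank line before the block.
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"

    # Every WARC record ends with two CRLFs after the block.
    return head.encode("utf-8") + block + b"\r\n\r\n"
```

For example, `build_failure_record("https://example.gov/", "tls-handshake-failure", "no certificate presented")` would yield a record that a reader could match by `WARC-Target-URI`. Whether this belongs in a `metadata` record, a `response` record, or a new WARC header field is exactly the open question.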

wumpus commented Jan 29, 2025

I'm a fan of this idea. Recently, I've been trying to quantify "bot defenses" being used to stop our crawler. There are many patterns used by bot defenses. In the past we have only generated a WARC response record if a valid HTTP response is received. I'd like to start generating response records for everything short of that.

Contributor Author

Mr0grog commented Jan 29, 2025

> Recently, I've been trying to quantify "bot defenses" being used to stop our crawler. There are many patterns used by bot defenses.

Off topic for this issue, but I would love to chat with you more about this on some other channel! I maintain a lot of the tooling behind the “climate change got removed from government websites” articles in major news publications (through the Environmental Data & Governance Initiative), and classifying bot defense vs. normal response has long been a thorn in my side as well.
