As far as I can tell, there’s no standard or even broadly recommended way to represent network errors or just non-HTTP errors when expecting an HTTP response. I’m thinking about things like connection timeouts, DNS lookup failures, SSL handshake failures, etc. (As a practical, real-world example, some US government websites were shut down last week by deleting their SSL certificates, causing handshake errors.)
I tried Warcio, Wget, and Browsertrix-Crawler on a site with an SSL handshake failure, and none of them record either the request or the response, although Wget and Browsertrix-Crawler do include their logs (which show the error in a very implementation-specific way) as a resource record in the WARC. I’m not sure if any other crawlers behave differently.
It would be really nice if there were a more standard way (or at least a recommended pattern for) representing the failed response in its own record, so other systems reading a WARC could affirmatively determine that a given URL or origin [was] not available.
The WARC 1.1 spec seems to suggest that it is OK to record this in a response record, but leaves how to do so entirely open-ended:
When software bugs, network issues, or implementation limits cause response-like material to be collected that is not perfectly compliant with HTTP specifications, WARC writing software may record the problematic content using its best effort determination of the interesting material boundaries. That is, neither the use of the ‘response’ record with a ‘http’ target-URI nor the ‘application/http’ content-type serves as an absolute guarantee that the contained material is a legal HTTP response. (Section 6.3.2)
Are there any common patterns for doing this that people are using? Would it be possible to include a recommendation in the implementation guidelines, or even in the spec? (Maybe this would benefit from a new WARC header field?)
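To make the question concrete, here is a minimal sketch of one possible pattern, not an established convention: write a `metadata` record with an `application/warc-fields` block that names the failure. The field names `fetch-error` and `fetch-error-detail`, the label strings, and both helper functions are hypothetical and do not come from the WARC spec or from any crawler mentioned here. It uses only the Python standard library.

```python
import socket
import ssl
import uuid
from datetime import datetime, timezone


def classify_fetch_error(exc: BaseException) -> str:
    """Map a Python exception to a coarse failure label (labels are hypothetical)."""
    # ssl.SSLError and socket.gaierror are both OSError subclasses,
    # so the most specific types are checked first.
    if isinstance(exc, ssl.SSLError):
        return "tls-handshake-failure"
    if isinstance(exc, socket.gaierror):
        return "dns-lookup-failure"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "connection-timeout"
    if isinstance(exc, ConnectionError):
        return "connection-failure"
    return "unknown-failure"


def build_failure_record(url: str, exc: BaseException) -> bytes:
    """Serialize one WARC/1.1 'metadata' record describing a failed fetch.

    The 'fetch-error' and 'fetch-error-detail' fields are assumptions made
    for this sketch; they are not defined by the WARC specification.
    """
    # Collapse whitespace so the detail cannot break the one-line field format.
    detail = " ".join(f"{type(exc).__name__}: {exc}".split())
    block = (
        f"fetch-error: {classify_fetch_error(exc)}\r\n"
        f"fetch-error-detail: {detail}\r\n"
    ).encode("utf-8")

    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        "WARC-Type: metadata",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/warc-fields",
        f"Content-Length: {len(block)}",
    ]
    # Header lines, a blank line, the block, then the two CRLFs that end a record.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"


if __name__ == "__main__":
    error = socket.gaierror(-2, "Name or service not known")
    print(build_failure_record("https://example.gov/", error).decode("utf-8"))
```

A reader of the WARC could then look up `metadata` records by `WARC-Target-URI` to determine that a URL was attempted and why it failed. Whether this belongs in a `metadata` record, a `response` record, or a new header field is exactly the open question.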
I'm a fan of this idea. Recently, I've been trying to quantify "bot defenses" being used to stop our crawler. There are many patterns used by bot defenses. In the past we have only generated a WARC response record if a valid HTTP response was received. I'd like to start generating response records for everything short of that.
Recently, I've been trying to quantify "bot defenses" being used to stop our crawler. There are many patterns used by bot defenses.
Off topic for this issue, but I would love to chat with you more about this on some other channel! I maintain a lot of the tooling behind the “climate change got removed from government websites” articles in major news publications (through the Environmental Data & Governance Initiative), and classifying bot defense vs. normal response has long been a thorn in my side as well.