Description
As far as I can tell, there’s no standard or even broadly recommended way to represent network errors, or non-HTTP errors in general, when expecting an HTTP response. I’m thinking about things like connection timeouts, DNS lookup failures, SSL handshake failures, etc. (As a practical, real-world example, some US government websites were shut down last week by deleting their SSL certificates, causing handshake errors.)
I tried Warcio, Wget, and Browsertrix-Crawler on a site with an SSL handshake failure, and none of them records either the request or the response. Wget and Browsertrix-Crawler do include their logs (which show the error in a very implementation-specific way) as a resource record in the WARC. I’m not sure whether any other crawlers behave differently.
It would be really nice if there were a more standard way (or at least a recommended pattern) for representing the failed response in its own record, so other systems reading a WARC could affirmatively determine that a given URL or origin [was] not available.
The WARC 1.1 spec seems to suggest that it is OK to record this in a response record, but leaves how to do so entirely open-ended:
When software bugs, network issues, or implementation limits cause response-like material to be collected that is not perfectly compliant with HTTP specifications, WARC writing software may record the problematic content using its best effort determination of the interesting material boundaries. That is, neither the use of the ‘response’ record with a ‘http’ target-URI nor the ‘application/http’ content-type serves as an absolute guarantee that the contained material is a legal HTTP response. (Section 6.3.2)
Are there any common patterns for doing this that people are using? Would it be possible to include a recommendation in the implementation guidelines, or even in the spec? (Maybe this would benefit from a new WARC header field?)
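To make the question concrete, here is a rough sketch of one shape such a record could take. It is only an illustration, not an established pattern: it assumes a `metadata` record with an `application/warc-fields` block, and the field names `fetch-error-type` and `fetch-error-message`, the category labels, and the helper names `classify_error` and `build_failure_record` are all hypothetical, not defined by the WARC spec or by any crawler I tried.

```python
import socket
import ssl
import uuid
from datetime import datetime, timezone


def classify_error(exc):
    """Map a Python network exception to a coarse, hypothetical category label."""
    if isinstance(exc, ssl.SSLError):
        return "tls-error"
    if isinstance(exc, socket.gaierror):
        return "dns-lookup-failure"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "connection-timeout"
    return "other-network-error"


def build_failure_record(target_uri, error_type, error_message, when=None):
    """Serialize a WARC 'metadata' record describing a fetch that never
    produced an HTTP response. The block field names are hypothetical."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Collapse whitespace so the message cannot break the line-based block.
    message = " ".join(str(error_message).split())
    block = (
        f"fetch-error-type: {error_type}\r\n"
        f"fetch-error-message: {message}\r\n"
    ).encode("utf-8")
    headers = [
        ("WARC-Type", "metadata"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/warc-fields"),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A record is: header lines, a blank line, the block, then two CRLFs.
    return head.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"
```

A crawler could call `build_failure_record(url, classify_error(exc), exc)` from its exception handler and append the returned bytes to the WARC file. Whether a `metadata` record, a `response` record, or a new header field is the better home for this information is exactly the open question.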