Add guidelines (or spec?) for how to represent network/below-HTTP-layer response errors #101

@Mr0grog

As far as I can tell, there’s no standard or even broadly recommended way to represent network errors or just non-HTTP errors when expecting an HTTP response. I’m thinking about things like connection timeouts, DNS lookup failures, SSL handshake failures, etc. (As a practical, real-world example, some US government websites were shut down last week by deleting their SSL certificates, causing handshake errors.)
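To make the kinds of failures I mean a bit more concrete, here is a minimal Python sketch that sorts low-level exceptions into coarse categories. The function name and the category strings are made up for illustration; they are not taken from any spec or existing crawler.

```python
import socket
import ssl


def classify_fetch_error(exc: BaseException) -> str:
    """Map a below-HTTP exception to a coarse category (hypothetical labels)."""
    # ssl.SSLError covers handshake and certificate problems, but also other
    # TLS-level failures, so the label is deliberately generic.
    if isinstance(exc, ssl.SSLError):
        return "tls-error"
    # socket.gaierror is raised when name resolution fails.
    if isinstance(exc, socket.gaierror):
        return "dns-lookup-failure"
    # socket.timeout is an alias of TimeoutError on Python 3.10+; checking
    # both keeps the sketch working on older versions.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "connection-timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection-refused"
    # Any other OSError is treated as a generic network-level failure.
    if isinstance(exc, OSError):
        return "network-error"
    return "unknown"
```

A crawler could call something like this in the `except` branch around its fetch and pass the resulting label on to whatever writes the WARC.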

I tried Warcio, Wget, and Browsertrix-Crawler on a site with an SSL handshake failure, and none of them records either the request or the response. Wget and Browsertrix-Crawler do include their logs (which show the error in a very implementation-specific way) as a resource record in the WARC. I’m not sure whether any other crawlers behave differently.

It would be really nice if there were a more standard way of (or at least a recommended pattern for) representing the failed response in its own record, so other systems reading a WARC could affirmatively determine that a given URL or origin [was] not available.

The WARC 1.1 spec seems to suggest that it is OK to record this in a response record, but leaves how to do so entirely open-ended:

When software bugs, network issues, or implementation limits cause response-like material to be collected that is not perfectly compliant with HTTP specifications, WARC writing software may record the problematic content using its best effort determination of the interesting material boundaries. That is, neither the use of the ‘response’ record with a ‘http’ target-URI nor the ‘application/http’ content-type serves as an absolute guarantee that the contained material is a legal HTTP response. (Section 6.3.2)

Are there any common patterns for doing this that people are using? Would it be possible to include a recommendation in the implementation guidelines, or even in the spec? (Maybe this would benefit from a new WARC header field?)
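As a strawman for discussion, here is a minimal sketch of one possible shape: a `metadata` record with an `application/warc-fields` block, written using only the Python standard library. The record type choice, the function name, and the `error-kind` / `error-detail` field names are all assumptions of mine, not an existing convention; a `response` record or a new WARC header field might turn out to be a better fit.

```python
import uuid
from datetime import datetime, timezone


def build_failure_record(target_uri: str, error_kind: str, detail: str) -> bytes:
    """Serialize one WARC/1.1 metadata record describing a failed fetch."""
    # Block in application/warc-fields syntax. The field names below are
    # hypothetical and not defined by the WARC spec.
    block = (
        f"error-kind: {error_kind}\r\n"
        f"error-detail: {detail}\r\n"
    ).encode("utf-8")

    headers = [
        ("WARC-Type", "metadata"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/warc-fields"),
        ("Content-Length", str(len(block))),
    ]

    # Version line, header fields, blank line, block, then two CRLFs.
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"
```

For example, `build_failure_record("https://example.gov/", "tls-error", "handshake failed")` returns the bytes of a single record that could be appended to an uncompressed WARC file. A reader would still need to know the field names in advance, which is exactly the gap a guideline or spec change would fill.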
