This discussion about a (perfectly valid) use of Zstandard for WARCs made me reflect on some longstanding problems in the web archiving community. For years, many web crawlers have failed to comply with the WARC specifications. If we want to build a stronger, more reliable foundation for web archiving, it is time we address these issues openly.
This post focuses on a particularly troubling practice: web crawlers that modify WARC records to "support" HTTP/2 archiving.
Example 1: Common Crawl’s Nutch
Common Crawl’s Nutch crawler rewrites HTTP/2 responses as if they were HTTP/1.1, thereby falsifying the captured records.
- Commit introducing the falsification mechanism: commoncrawl/nutch@5f43692
- Public discussion of this approach: WARC writer support HTTP/2 commoncrawl/nutch#29
 
The result? Petabytes of falsified WARCs (unless I am mistaken and Common Crawl uses another, spec-compliant crawler).
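To make concrete what this rewriting looks like, here is a minimal sketch (in Python, purely for illustration; it is not the code of Nutch or of any other crawler mentioned here, and the function name is made up). An HTTP/2 exchange has no textual status line and no CRLF-delimited header block on the wire, so a crawler taking this shortcut fabricates one, labels it "HTTP/1.1", and stores that fabrication in the WARC response record instead of what was actually transferred:

```python
from http.client import responses

# Purely hypothetical sketch (not the actual Nutch, Heritrix, or Storm Crawler
# code): the kind of rewrite described above. The status line and the
# CRLF-delimited header block are invented here; they never existed on the wire
# for an HTTP/2 exchange.
def fabricate_http11_block(status: int, headers: dict, body: bytes) -> bytes:
    reason = responses.get(status, "")
    status_line = f"HTTP/1.1 {status} {reason}\r\n"  # protocol version invented by the crawler
    header_lines = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    return (status_line + header_lines + "\r\n").encode("latin-1") + body
```

The resulting record then claims to preserve the HTTP message as captured, while its payload is a reconstruction that was never sent by the server.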
Example 2: Heritrix's FetchHTTP2 module
This module does essentially the same thing as the Nutch code. I know for a fact that it was not used by the Wayback Team while I was there (it was actually created after I left); someone from there could confirm whether it is actually in use today. As far as I know, this module is luckily not enabled by default.
Example 3: Storm Crawler
Storm Crawler is much smaller in scope than Nutch, so I hope it has not been widely adopted by the archiving community. However, it too produces bad WARC records.
As their own README states:
“The WARC file format is derived from the HTTP message format (RFC 2616) and the WARC format as well as WARC readers require that HTTP requests and responses are recorded as HTTP/1.1 or HTTP/1.0. Therefore, the WARC WARCHdfsBolt writes binary HTTP formats (eg. HTTP/2) as if they were HTTP/1.1. There is no need to limit the supported HTTP protocol versions to HTTP/1.0 or HTTP/1.1.”
In other words: instead of preserving the response as-is, the crawler rewrites it into something it never was. Worse, Storm Crawler also deletes certain HTTP headers outright and modifies responses because it cannot handle them faithfully. (That broader problem deserves its own discussion.)
I hope this summary helps spark a serious conversation within the web archiving community. Some may argue that aspects of the specification should evolve—and of course MAYBE they should! All conversations are welcome.
But please, stop creating falsified HTTP records. Stop claiming compliance with the WARC specification when it is clear you are not. The credibility of our archives depends on it, and if we want a reliable future for web archiving, we must start by respecting the specifications we already have. Let's work together to properly support HTTP/2 (and 3) in the next version of the WARC spec.