This discussion about a (perfectly valid) use of Zstandard for WARCs made me reflect on some longstanding problems in the web archiving community. For years, many web crawlers have failed to comply with the WARC specifications. If we want to build a stronger, more reliable foundation for web archiving, it is time we address these issues openly.
This post focuses on a particularly troubling practice: web crawlers that modify WARC records to "support" HTTP/2 archiving.
Example 1: Common Crawl’s Nutch
Common Crawl’s Nutch crawler rewrites HTTP/2 responses as if they were HTTP/1.1, thereby falsifying the captured records.
- Commit introducing the falsification mechanism: commoncrawl/nutch@5f43692
- Public discussion of this approach: WARC writer support HTTP/2 commoncrawl/nutch#29
 
The result? Petabytes of falsified WARCs (unless I am mistaken and Common Crawl uses another, spec-compliant crawler).
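To make concrete what this rewriting looks like, here is a minimal sketch (in Python, purely for illustration; it is not the code of Nutch or of any other crawler mentioned here, and the function name is made up). An HTTP/2 exchange has no textual status line and no CRLF-delimited header block on the wire, so a crawler taking this shortcut fabricates one, labels it "HTTP/1.1", and stores that fabrication in the WARC response record instead of what was actually transferred:

```python
from http.client import responses

# Purely hypothetical sketch (not the actual Nutch, Heritrix, or Storm Crawler
# code): the kind of rewrite described above. The status line and the
# CRLF-delimited header block are invented here; they never existed on the wire
# for an HTTP/2 exchange.
def fabricate_http11_block(status: int, headers: dict, body: bytes) -> bytes:
    reason = responses.get(status, "")
    status_line = f"HTTP/1.1 {status} {reason}\r\n"  # protocol version invented by the crawler
    header_lines = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    return (status_line + header_lines + "\r\n").encode("latin-1") + body
```

The resulting record then claims to preserve the HTTP message as captured, while its payload is a reconstruction that was never sent by the server.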
Example 2: Heritrix's FetchHTTP2 module
This module does essentially the same thing as the Nutch code. I know for a fact that it was not used by the Wayback Team while I was there (it was actually created after I left); someone from there could confirm whether it is actually in use today. As far as I know, this module is luckily not enabled by default.
Example 3: Storm Crawler
Storm Crawler is much smaller in scope than Nutch, so I hope it has not been widely adopted by the archiving community. However, it too produces bad WARC records.
As their own README states:
“The WARC file format is derived from the HTTP message format (RFC 2616) and the WARC format as well as WARC readers require that HTTP requests and responses are recorded as HTTP/1.1 or HTTP/1.0. Therefore, the WARC WARCHdfsBolt writes binary HTTP formats (eg. HTTP/2) as if they were HTTP/1.1. There is no need to limit the supported HTTP protocol versions to HTTP/1.0 or HTTP/1.1.”
In other words: instead of preserving the response as-is, the crawler rewrites it into something it never was. Worse, Storm Crawler also deletes certain HTTP headers outright and modifies responses because it cannot handle them faithfully. (That broader problem deserves its own discussion.)
I hope this summary helps spark a serious conversation within the web archiving community. Some may argue that aspects of the specification should evolve—and of course MAYBE they should! All conversations are welcome.
But please, stop creating falsified HTTP records. Stop claiming compliance with the WARC specification when it is clear you are not. The credibility of our archives depends on it, and if we want a reliable future for web archiving, we must start by respecting the specifications we already have. Let's work together to properly support HTTP/2 (and 3) in the next version of the WARC spec.