warcio recompress adds WARC-Block-Digest fields to records without one #161

@acidus99

It appears that warcio recompress will add WARC-Block-Digest fields to records that do not already have that field.

The attached ZIP contains 2 WARCs:
example-warcs.zip

In orig.warc, the warcinfo record at the start does not have a WARC-Block-Digest field at all. However, if you run:

warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz

and then look at the resulting warcio-recompress.warc, you will see that the warcinfo record now has a WARC-Block-Digest field with a SHA1 hash. (I included a copy of the recompressed file in the ZIP.)
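
One way to confirm which records carry the field is to list the WARC-Type and WARC-Block-Digest headers of every record in an uncompressed WARC. The following is a minimal sketch using only the Python standard library rather than warcio itself. The function name `block_digests` and the inline sample records are hypothetical, and the parser assumes an uncompressed file in which every record has a Content-Length header and no folded header lines.

```python
import io


def block_digests(stream):
    """Yield (WARC-Type, WARC-Block-Digest or None) for each record.

    `stream` is a binary file-like object over an UNCOMPRESSED WARC.
    Assumption: every record has Content-Length and no folded headers.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Read header lines until the empty line that ends the header block.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # Skip the record block itself; only the headers matter here.
        stream.read(int(headers.get("content-length", "0")))
        yield headers.get("warc-type"), headers.get("warc-block-digest")


# Hypothetical two-record sample: a warcinfo record without a digest,
# followed by a response record that has one.
sample = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\nWARC-Block-Digest: sha1:ABC\r\n"
    b"Content-Length: 2\r\n\r\nhi\r\n\r\n"
)
print(list(block_digests(io.BytesIO(sample))))
# prints [('warcinfo', None), ('response', 'sha1:ABC')]
```

To apply it to the files from this report, open each one in binary mode, for example `list(block_digests(open("orig.warc", "rb")))`, and compare the two lists. Given the behavior described above, the warcinfo entry should change from `None` to a `sha1:` value after recompression.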

While I suppose more digests aren't a bad thing:

  • I would not expect a recompression operation to alter the records in the WARC.
  • This behavior isn't documented.
  • It (very slightly) increases the size of the WARC.

My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.