Releases · iipc/jwarc · GitHub

16 Jul 10:24

ato

v0.32.0 Latest

New features

HeaderValidator with WARC/1.1 standard ruleset
ExtractTool: can now extract sequential concurrent records (--concurrent option)
DedupeTool
- In-memory cache for cross-URL digest-based deduplication (--cache-size option)
- Now prints deduplication statistics (--dry-run and --quiet options)
- Multi-threaded deduplication (--threads option)
ValidateTool
- Multi-threaded validation (--threads option)
ParsingException message is now annotated with the source filename and record offset when available

Bugs fixed

RFC5952 canonical form is now used for IPv6 addresses in WARC-IP-Address
HttpParser in lenient mode now:
- accepts responses missing version number
- ignores header lines missing a colon (:)
- ignores folded status lines
WarcParser: treats alexa/dat ARC records as not HTTP type
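
To illustrate the RFC 5952 item above, here is a minimal, self-contained sketch of what canonical form means for an IPv6 literal: lowercase hex, no leading zeros, and the longest run of two or more zero groups collapsed to "::" (the first run wins a tie). The class and method names are hypothetical and this is not jwarc's implementation; it also does not handle IPv4-mapped addresses or scope IDs.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of jwarc.
final class Ipv6Canonical {
    static String canonicalize(String literal) {
        // Host names never contain ':', so this guard also rules out DNS lookups.
        if (literal.indexOf(':') < 0) {
            throw new IllegalArgumentException("not an IPv6 literal: " + literal);
        }
        InetAddress address;
        try {
            address = InetAddress.getByName(literal);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("invalid address: " + literal, e);
        }
        if (!(address instanceof Inet6Address)) {
            // e.g. IPv4-mapped literals, which Java returns as Inet4Address
            throw new IllegalArgumentException("not a plain IPv6 address: " + literal);
        }
        byte[] bytes = address.getAddress();
        int[] groups = new int[8];
        for (int i = 0; i < 8; i++) {
            groups[i] = ((bytes[2 * i] & 0xff) << 8) | (bytes[2 * i + 1] & 0xff);
        }
        // Find the longest run of zero groups; the first run wins on a tie.
        int bestStart = -1;
        int bestLength = 0;
        for (int i = 0; i < 8; ) {
            if (groups[i] != 0) {
                i++;
                continue;
            }
            int end = i;
            while (end < 8 && groups[end] == 0) {
                end++;
            }
            if (end - i > bestLength) {
                bestStart = i;
                bestLength = end - i;
            }
            i = end;
        }
        if (bestLength < 2) {
            bestStart = -1; // a single zero group is written as "0", not "::"
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            if (i == bestStart) {
                out.append("::");
                i += bestLength - 1; // skip the rest of the compressed run
                continue;
            }
            if (out.length() > 0 && out.charAt(out.length() - 1) != ':') {
                out.append(':');
            }
            out.append(Integer.toHexString(groups[i])); // lowercase, no leading zeros
        }
        return out.toString();
    }
}
```

For example, canonicalize("2001:0DB8:0:0:0:0:0:1") yields "2001:db8::1".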

20 Nov 04:11

ato

v0.31.1: Release 0.31.1

Bugs fixed

Fixed URIs.parseLeniently() returning a different value to new URI() if the path was empty or the input contained percent encoded characters #90 #91
Replaced some internal usages of record.targetURI() with record.target() to reduce the chance of runtime exceptions and preserve the exact original value
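
For reference, the java.net.URI behaviour that the first fix above is measured against can be checked with the standard library alone. The sketch below uses a hypothetical helper class and does not call jwarc. It shows that URI keeps the original text for an input with an empty path and preserves percent-encoded characters in the raw path.

```java
import java.net.URI;

// Hypothetical helper, not part of jwarc.
final class UriBaseline {
    // Returns { toString(), getRawPath(), getPath() } for the parsed input.
    static String[] components(String input) {
        URI uri = URI.create(input); // same parsing as new URI(input), unchecked exception
        return new String[] { uri.toString(), uri.getRawPath(), uri.getPath() };
    }
}
```

With "http://example.org" the raw path is the empty string and toString() returns the input unchanged. With "http://example.org/a%20b" the raw path stays "/a%20b" while getPath() decodes it to "/a b".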

14 Nov 01:59

ato

v0.31.0: Release 0.31.0

New features

Added optional support for brotli content encoding #88
Added HttpMessage.bodyDecoded() #88
WarcTool: Added dedupe subcommand
DedupeTool: Added --verbose option and silenced default logging

Bugs fixed

GunzipChannel: Fixed incorrect record length calculation when gzip footer aligns with the end of the buffer
ValidateTool: Fixed digest validation #87
DedupeTool: Used matchType=exact to properly handle CDX queries for URLs ending with *
DedupeTool: Fixed record copying when transferTo copies fewer bytes than requested
DedupeTool: Prevented appending of an empty gzip member when no records were deduplicated
DedupeTool: Fixed exception when input files are in the current working directory
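
The transferTo fix above concerns a general property of FileChannel.transferTo: it may copy fewer bytes than requested, so callers need to loop. A minimal sketch of that pattern follows, assuming a blocking target channel. The class and method names are hypothetical and this is not DedupeTool's actual code.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper, not part of jwarc.
final class ChannelCopy {
    // Copies exactly `length` bytes starting at `position`, retrying short transfers.
    static void copyRange(FileChannel source, long position, long length,
                          WritableByteChannel target) throws IOException {
        long remaining = length;
        while (remaining > 0) {
            long copied = source.transferTo(position, remaining, target);
            if (copied <= 0) {
                // With a blocking target, no progress means the source ended early.
                throw new EOFException("source ended with " + remaining + " bytes left to copy");
            }
            position += copied;
            remaining -= copied;
        }
    }
}
```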

28 Jun 07:36

ato

v0.30.0: Release 0.30.0

New features

WarcReader and WarcParser gained a lenient parsing mode which:
- permits ASCII control characters in header field names and values
- allows lines to end with LF instead of CRLF
- permits multi-digit WARC minor versions like "0.18"
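
As a rough illustration of the last two points, the sketch below contrasts a strict and a lenient check of a WARC version line. Both patterns are assumptions made for this example (in particular the single-digit strict form) and the class is hypothetical; jwarc's real parser is not regex-based in this way and may differ in detail.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not jwarc's parser.
final class VersionLine {
    // Assumed strict form: single-digit major and minor version, CRLF terminator.
    static final Pattern STRICT = Pattern.compile("WARC/[0-9]\\.[0-9]\r\n");
    // Assumed lenient form: multi-digit versions, CRLF or bare LF terminator.
    static final Pattern LENIENT = Pattern.compile("WARC/[0-9]+\\.[0-9]+\r?\n");

    static boolean accepts(String line, boolean lenient) {
        return (lenient ? LENIENT : STRICT).matcher(line).matches();
    }
}
```

Under these assumptions, "WARC/0.18" followed by a bare LF is rejected by the strict pattern and accepted by the lenient one.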

14 Feb 04:43

ato

v0.29.0: Release 0.29.0

New features

Added MediaType.parseLeniently() and .isValid()

Changes

Message.contentType() and other methods that internally call it now use the lenient MediaType parser instead of throwing IllegalArgumentException #83

09 Feb 07:15

ato

v0.28.6: Release 0.28.6

Bugs fixed

Improved compatibility with ARC variants (version-block length off by one, v2 version-block, spurious linefeeds) #82
WarcParser: Context in parse error messages was incorrectly using the parser (file) position instead of buffer position

13 Dec 05:34

ato

v0.28.5: Release 0.28.5

Bugs fixed

Fixed ClosedChannelException when reading a WarcRevisit body after closing a previous one, caused by reuse of an empty MessageBody #80

13 Dec 05:33

ato

v0.28.4: Release 0.28.4

Bugs fixed

CDX formatting now percent encodes spaces, newlines and null characters in all string fields. This is non-standard but at least prevents us outputting invalid CDX lines.
CdxRequestEncoder now handles requests with an invalid content-type header
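
The percent-encoding described in the first item can be sketched as follows. This is a minimal illustration with a hypothetical class name, not jwarc's CDX formatter. It encodes only the three characters the note mentions (space, newline, null), which is an assumption taken from the wording above.

```java
// Hypothetical helper, not part of jwarc.
final class CdxFieldEscaper {
    // Percent-encodes characters that would break a space-delimited, line-based CDX record.
    static String escape(String field) {
        StringBuilder out = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            switch (c) {
                case ' ':
                    out.append("%20");
                    break;
                case '\n':
                    out.append("%0A");
                    break;
                case '\0':
                    out.append("%00");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because "%" itself is left untouched, the output is safe to split on spaces and newlines but is not guaranteed to decode back to the exact input.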

28 Sep 00:09

ato

v0.28.3: Release 0.28.3

Bugs fixed

Fixed a multithreading issue when GzipChannel writes its header #79

15 Sep 07:18

ato

v0.28.2: Release 0.28.2

Changes

HttpRequest and HttpResponse in lenient mode now recover when parsing the Content-Length header throws NumberFormatException
WarcParser now tries to leniently parse ARC records containing corrupt dates
