Releases: iipc/jwarc
v0.32.0
New features
- HeaderValidator with WARC/1.1 standard ruleset
- ExtractTool: can now extract sequential concurrent records (--concurrent option)
- DedupeTool
  - In-memory cache for cross-URL digest-based deduplication (--cache-size option)
  - Now prints deduplication statistics (--dry-run and --quiet options)
  - Multi-threaded deduplication (--threads option)
- ValidateTool
  - Multi-threaded validation (--threads option)
- ParsingException message is now annotated with the source filename and record offset when available
Bugs fixed
- RFC5952 canonical form is now used for IPv6 addresses in WARC-IP-Address
- HttpParser in lenient mode now:
  - accepts responses missing version number
  - ignores header lines missing :
  - ignores folded status lines
- WarcParser: treats alexa/dat ARC records as not HTTP type
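
To illustrate what RFC 5952 canonical form means for a WARC-IP-Address value, here is a minimal sketch using only the JDK. It is not jwarc's implementation: the class `Ipv6Canonicalizer` and its method are hypothetical names chosen for this example, and IPv4-mapped addresses and zone identifiers are deliberately left out.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of the jwarc API.
final class Ipv6Canonicalizer {
    // Returns the RFC 5952 text form: lowercase hex, no leading zeros,
    // and the longest run of two or more zero groups replaced by "::".
    static String canonicalize(String literal) {
        if (literal.indexOf(':') < 0) {
            throw new IllegalArgumentException("not an IPv6 literal: " + literal);
        }
        byte[] bytes;
        try {
            // A well-formed IP literal is parsed directly, without a DNS lookup.
            bytes = InetAddress.getByName(literal).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("invalid IPv6 literal: " + literal, e);
        }
        if (bytes.length != 16) {
            // The JDK turns IPv4-mapped literals into 4-byte addresses; not handled here.
            throw new IllegalArgumentException("not a plain IPv6 address: " + literal);
        }
        int[] groups = new int[8];
        for (int i = 0; i < 8; i++) {
            groups[i] = ((bytes[2 * i] & 0xff) << 8) | (bytes[2 * i + 1] & 0xff);
        }
        // Find the longest run of zero groups; the first run wins on ties.
        int bestStart = -1;
        int bestLen = 0;
        for (int i = 0; i < 8; ) {
            if (groups[i] != 0) {
                i++;
                continue;
            }
            int j = i;
            while (j < 8 && groups[j] == 0) j++;
            if (j - i > bestLen) {
                bestStart = i;
                bestLen = j - i;
            }
            i = j;
        }
        if (bestLen < 2) bestStart = -1; // a single zero group is never shortened
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; ) {
            if (i == bestStart) {
                sb.append("::");
                i += bestLen;
                continue;
            }
            if (sb.length() > 0 && sb.charAt(sb.length() - 1) != ':') sb.append(':');
            sb.append(Integer.toHexString(groups[i]));
            i++;
        }
        return sb.toString();
    }
}
```

With this sketch, `Ipv6Canonicalizer.canonicalize("2001:0DB8:0:0:0:0:0:1")` yields `2001:db8::1`.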
v0.31.1: Release 0.31.1
Bugs fixed
- Fixed URIs.parseLeniently() returning a different value to new URI() if the path was empty or the input contained percent encoded characters #90 #91
- Replaced some internal usages of record.targetURI() with record.target() to reduce the chance of runtime exceptions and preserve the exact original value
v0.31.0: Release 0.31.0
New features
- Added optional support for brotli content encoding #88
- Added HttpMessage.bodyDecoded() #88
- WarcTool: Added dedupe subcommand
- DedupeTool: Added --verbose option and silenced default logging
Bug fixes
- GunzipChannel: Fixed incorrect record length calculation when gzip footer aligns with the end of the buffer
- ValidateTool: Fixed digest validation #87
- DedupeTool: Used matchType=exact to properly handle CDX queries for URLs ending with *
- DedupeTool: Fixed record copying when transferTo copies fewer bytes than requested
- DedupeTool: Prevented appending of an empty gzip member when no records were deduplicated
- DedupeTool: Fixed exception when input files are in the current working directory
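
The transferTo fix above concerns a general property of `FileChannel.transferTo`: it may copy fewer bytes than requested, so callers need to loop. The following is a minimal sketch of such a loop, not the DedupeTool code itself. `ChannelCopy` and `copyFully` are hypothetical names, and a blocking target channel is assumed.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper for illustration only; not part of the jwarc API.
final class ChannelCopy {
    // Copies exactly `count` bytes starting at `position` from src to dst,
    // calling transferTo again whenever it copies less than was asked for.
    static void copyFully(FileChannel src, long position, long count,
                          WritableByteChannel dst) throws IOException {
        long copied = 0;
        while (copied < count) {
            long n = src.transferTo(position + copied, count - copied, dst);
            if (n <= 0) {
                // Assumes a blocking target, so no progress means the source ran out.
                throw new EOFException("source ended after " + copied + " of " + count + " bytes");
            }
            copied += n;
        }
    }
}
```

Treating a zero-byte transfer as an error keeps the loop from spinning forever if the requested range extends past the end of the file.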
v0.30.0: Release 0.30.0
New features
- WarcReader and WarcParser gained a lenient parsing mode which:
  - permits ASCII control characters in header field names and values
  - allows lines to end with LF instead of CRLF
  - permits multi-digit WARC minor versions like "0.18"
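
As a rough illustration of the last point, the sketch below contrasts a strict and a lenient check of a WARC version line. It does not show jwarc's parser: the class name, both patterns, and the assumption that the strict form allows exactly one minor digit are choices made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the jwarc API.
final class WarcVersionLine {
    // Assumed strict form: one major and one minor digit, e.g. "WARC/1.1".
    static final Pattern STRICT = Pattern.compile("WARC/[0-9]\\.[0-9]");
    // Lenient form: also accepts several minor digits, e.g. "WARC/0.18".
    static final Pattern LENIENT = Pattern.compile("WARC/[0-9]\\.[0-9]+");

    static boolean accepts(String line, boolean lenient) {
        return (lenient ? LENIENT : STRICT).matcher(line).matches();
    }
}
```

Under these assumptions, `accepts("WARC/0.18", false)` is false while `accepts("WARC/0.18", true)` is true.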
v0.29.0: Release 0.29.0
New features
- Added MediaType.parseLeniently() and .isValid()
Changes
- Message.contentType() and other methods that internally call it now use the lenient MediaType parser instead of throwing IllegalArgumentException #83
v0.28.6: Release 0.28.6
Bugs fixed
- Improved compatibility with ARC variants (version-block length off by one, v2 version-block, spurious linefeeds) #82
- WarcParser: Context in parse error messages was incorrectly using the parser (file) position instead of buffer position
v0.28.5: Release 0.28.5
Bugs fixed
- Fixed ClosedChannelException when reading a WarcRevisit body after closing a previous one due to reuse of empty MessageBody. #80
v0.28.4: Release 0.28.4
Bugs fixed
- CDX formatting now percent encodes spaces, newlines and null characters in all string fields. This is non-standard but at least prevents us outputting invalid CDX lines.
- CdxRequestEncoder now handles requests with an invalid content-type header
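
The CDX escaping described above can be sketched as follows. CDX lines are space-separated, so a raw space or newline inside a field would corrupt the line. This is an illustrative helper under stated assumptions, not jwarc's code: `CdxFieldEscaper` is a hypothetical name, and treating carriage return as a newline is this example's own reading of "newlines".

```java
// Hypothetical helper for illustration only; not part of the jwarc API.
final class CdxFieldEscaper {
    // Percent-encodes the characters that would break a space-separated CDX line.
    static String escape(String field) {
        StringBuilder sb = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            switch (c) {
                case ' ':
                    sb.append("%20");
                    break;
                case '\n':
                    sb.append("%0A");
                    break;
                case '\r':
                    // Assumption: CR is treated as a newline character too.
                    sb.append("%0D");
                    break;
                case '\0':
                    sb.append("%00");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, `CdxFieldEscaper.escape("a b")` returns `a%20b`, and fields without any of these characters pass through unchanged.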
v0.28.3: Release 0.28.3
v0.28.2: Release 0.28.2
Changes
- HttpRequest and HttpResponse in lenient mode now recover when parsing the Content-Length header throws NumberFormatException
- WarcParser now tries to leniently parse ARC records containing corrupt dates
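
One way to recover from a malformed Content-Length header is to treat the value as absent instead of letting the exception propagate. The sketch below shows that general pattern only. How jwarc actually recovers is not stated in these notes, and `LenientContentLength` is a hypothetical name.

```java
import java.util.OptionalLong;

// Hypothetical helper for illustration only; not part of the jwarc API.
final class LenientContentLength {
    // Returns the parsed length, or empty when the header is missing,
    // non-numeric, or negative, so the caller can fall back to another strategy.
    static OptionalLong parse(String value) {
        if (value == null) {
            return OptionalLong.empty();
        }
        try {
            long n = Long.parseLong(value.trim());
            return n >= 0 ? OptionalLong.of(n) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }
}
```

A caller could then read the body until the end of the record when `parse` returns an empty result, though that fallback is also an assumption of this example.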