Changelog

2.4.1 [2026-04-03]

Fixed Unicode surrogate pair handling bugs in German gender-sensitive forms (fixes issue #139: crash on unmatched characters)
Updated dependencies
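
As background for the surrogate pair fix, here is a minimal Java sketch of how an unpaired UTF-16 surrogate can be detected before a string reaches pattern matching. The class and method names are hypothetical and this is not the tokenizer's own code; the changelog does not describe the actual cause of issue #139.

```java
class SurrogateCheck {
    /** Returns true if s contains a high or low surrogate that is not part of a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low half
                } else {
                    return true; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate without a preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A well-formed emoji pair after a gender-sensitive form
        System.out.println(hasUnpairedSurrogate("Nutzer:in \uD83D\uDE00"));
        // A lone high surrogate inside a form
        System.out.println(hasUnpairedSurrogate("Nutzer\uD83D:in"));
    }
}
```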

2.4.0 [2026-02-25]

Added German gender-sensitive form tokenization:
- Colon forms: Nutzer:in, Nutzer:innen, Kosovo-Albaner:innen
- Slash forms: Nutzer/in, Nutzer/innen, Nutzer/-in, Kosovo-Albaner/innen
- Parenthetical forms: Nutzer(in), Nutzer(innen), Nutzer(-in)
- Kaufmann/frau pattern: Kaufmann/frau, Kaufmann/-frau, Geschäftsmann/frau (only applies when word ends in "mann" with non-empty prefix)
- Short forms for determiners, adjectives, pronouns: eine(n), gute:r, ihm/r, diese(r), ein(e)
Added de_old German tokenizer variant without gender-sensitive rules (use -l de_old to split forms like Nutzer:in into separate tokens)
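
The colon, slash, and parenthetical forms listed above can be approximated with a single regular expression. The following Java sketch is an illustration only: the class name, method name, and pattern are assumptions, it does not cover the Kaufmann/frau pattern or the short forms, and the tokenizer's real rules may differ.

```java
import java.util.regex.Pattern;

class GenderFormSketch {
    // Stem (letters, optionally hyphenated) followed by one suffix shape:
    //   ":in" / ":innen", "/in" / "/innen" / "/-in", "(in)" / "(innen)" / "(-in)"
    // Assumption: this is an approximation and also accepts "/-innen" and "(-innen)".
    static final Pattern FORM = Pattern.compile(
        "\\p{L}+(?:-\\p{L}+)*(?::in(?:nen)?|/-?in(?:nen)?|\\(-?in(?:nen)?\\))");

    /** Returns true if the whole token looks like one of the three suffix-based forms. */
    static boolean isGenderForm(String token) {
        return FORM.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String t : new String[] {"Nutzer:in", "Kosovo-Albaner/innen", "Nutzer(-in)", "Nutzer"}) {
            System.out.println(t + " -> " + isGenderForm(t));
        }
    }
}
```
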
Fixed thousands separators not being handled consistently (issue #135):
- Apostrophe ' (Swiss format: 1'000'000)
- Thin space U+2009 and narrow no-break space U+202F
Updated dependencies
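
To illustrate the thousands separators named in this release (apostrophe, thin space U+2009, narrow no-break space U+202F), here is a minimal Java sketch of a pattern that accepts digit groups joined by any of them. The names are hypothetical and the pattern is an assumption, not the tokenizer's grammar; for simplicity it also accepts mixed separators within one number.

```java
import java.util.regex.Pattern;

class ThousandsSketch {
    // One to three leading digits, then one or more groups of
    // (separator + exactly three digits). Separators: ', U+2009, U+202F.
    static final Pattern GROUPED =
        Pattern.compile("\\d{1,3}(?:['\\u2009\\u202F]\\d{3})+");

    /** Returns true if s is a digit string grouped with one of the separators. */
    static boolean isGroupedNumber(String s) {
        return GROUPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGroupedNumber("1'000'000"));  // Swiss format
        System.out.println(isGroupedNumber("12\u2009345")); // thin space
        System.out.println(isGroupedNumber("1'00"));        // incomplete group
    }
}
```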

2.3.1 [2026-01-28]

Fixed soft hyphens (U+00AD) being incorrectly treated as token boundaries (issue #131)
Updated dependencies
Improved compatibility with Java 25 (fixed deprecation warnings)
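
The soft hyphen issue can be reproduced in isolation: U+00AD is a format character, not a letter, so a letters-only word pattern splits at it. The Java sketch below uses hypothetical names and simplified patterns, not the tokenizer's rules, to show the difference once the soft hyphen is allowed inside a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SoftHyphenSketch {
    // Letters only: a soft hyphen ends the current match.
    static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");
    // Letters plus U+00AD: a soft hyphen stays inside the word.
    static final Pattern WITH_SOFT_HYPHEN = Pattern.compile("[\\p{L}\\u00AD]+");

    /** Collects all matches of wordPattern in text, in order. */
    static List<String> words(String text, Pattern wordPattern) {
        List<String> result = new ArrayList<>();
        Matcher m = wordPattern.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Sil\u00ADben trennen";
        System.out.println(words(text, LETTERS_ONLY).size());
        System.out.println(words(text, WITH_SOFT_HYPHEN).size());
    }
}
```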

2.3.0 [2025-12-23]

Fixed genderstern and omission asterisk breaking after hyphens (issue #115)
Added support for complex emoji (issue #113)
Added Wikipedia emoji template support (issue #114)
Fixed incorrect splitting of the most frequent hyphenated compound abbreviations in German (issue #116)
Updated dependencies

2.2.5

Added more OSSRH sync data to the Maven POM

2.2.4 [unreleased]

Minor code cleanups
Added some API documentation

2.2.3

Updated dependencies
Minimum Java version raised to 17
Fixed group id in pom.xml
Removed compile dependency on Maven Surefire
Build artifacts in src/main/jflex are now ignored by git
java.io's ByteArrayOutputStream used instead of 3rd-party class

2.2.2

Bug fix: a single quotation mark at the beginning of a word is no longer interpreted as the beginning of an omission, but as a quotation mark token.
Updated dependencies

2.2.1

"du." is no longer treated as an abbreviation.

2.2.0.9000

"Dir." and "dir." are no longer treated as abbreviations.

2.2.0

Apostrophe- and hyphen-marked contractions and clitics in English (I've, isn't, Peter's, …) and French (j'ai, d'un, l'art, sont-elles, …) are now separated.

2.1.0

GitHub CI test workflow added
Dependencies updated
-Xss2m added to Maven JVM config

Potentially breaking change

--sentence-boundaries|-s now prints sentence boundaries only if --positions|-p is also present

2.0.0

Dependencies updated
Tokenizer and sentence splitter for English (-l en option) added
Tokenizer and sentence splitter for French (-l fr option) added
Support for adding more languages
UTF-8 input encoding is now expected by default; different encodings can be set with the --encoding <enc> option
By default, tokens are now printed to stdout (use options --no-tokens --positions to print character offsets instead)
Abbreviated German street names like Kunststr. are now recognized as tokens
Added heuristics for distinguishing between I. as abbreviation vs. PPER / CARD
URLs without a URI scheme are now recognized as single tokens if they start with www.
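
As an illustration of the scheme-less URL rule, the following Java sketch finds www.-prefixed host names with an optional path in running text. The class name, method name, and pattern are assumptions made for this example and are much simpler than a real URL grammar; they are not taken from the tokenizer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WwwUrlSketch {
    // "www." + at least two dot-separated host labels + optional "/path".
    // A trailing sentence-final dot is not part of the host match.
    static final Pattern WWW_URL = Pattern.compile(
        "www\\.[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+(?:/\\S*)?");

    /** Returns the first www.-prefixed URL in text, or null if there is none. */
    static String firstUrl(String text) {
        Matcher m = WWW_URL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstUrl("see www.example.org/path for details"));
        System.out.println(firstUrl("Visit www.example.org."));
    }
}
```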

1.3

Standard EOT/EOF character \x04 is now used instead of the magic escape \n\x03\n
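
A minimal Java sketch of what a single-character delimiter makes possible, assuming the EOT character (U+0004) marks the end of each text in an input stream. The changelog does not spell out how the tokenizer consumes the character, so the class, method, and splitting behavior here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EotSketch {
    // ASCII End of Transmission, code point U+0004
    static final char EOT = (char) 4;

    /** Splits a stream into texts at each EOT; a trailing text without EOT is kept. */
    static List<String> splitTexts(String stream) {
        List<String> texts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < stream.length(); i++) {
            if (stream.charAt(i) == EOT) {
                texts.add(stream.substring(start, i));
                start = i + 1; // next text begins after the delimiter
            }
        }
        if (start < stream.length()) {
            texts.add(stream.substring(start));
        }
        return texts;
    }

    public static void main(String[] args) {
        String stream = "Erster Text." + EOT + "Zweiter Text." + EOT;
        System.out.println(splitTexts(stream).size());
    }
}
```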

Quoted email names containing space characters, like "John Doe"@xx.com, are no longer interpreted as single tokens
Sentence splitter functionality added (--sentence-boundaries option)

1.2

First version published on https://korap.ids-mannheim.de/gerrit/plugins/gitiles/KorAP/KorAP-Tokenizer
Extracted from KorAP-internal ingestion pipeline