- Fixed Unicode surrogate pair handling bugs in German gender-sensitive forms (fixes #139, a crash on unmatched characters)
- Updated dependencies
- Added German gender-sensitive form tokenization:
  - Colon forms: `Nutzer:in`, `Nutzer:innen`, `Kosovo-Albaner:innen`
  - Slash forms: `Nutzer/in`, `Nutzer/innen`, `Nutzer/-in`, `Kosovo-Albaner/innen`
  - Parenthetical forms: `Nutzer(in)`, `Nutzer(innen)`, `Nutzer(-in)`
  - Kaufmann/frau pattern: `Kaufmann/frau`, `Kaufmann/-frau`, `Geschäftsmann/frau` (only applies when the word ends in "mann" with a non-empty prefix)
  - Short forms for determiners, adjectives, pronouns: `eine(n)`, `gute:r`, `ihm/r`, `diese(r)`, `ein(e)`
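
To make the shape of the gender-sensitive forms listed above more concrete, here is a minimal Java sketch that recognizes the colon, slash, parenthetical, and mann/frau forms with regular expressions. It is an approximation for illustration only and not the tokenizer's actual rule set; the class `GenderFormSketch` and the method `isGenderForm` are hypothetical names, and the short forms for determiners, adjectives, and pronouns are deliberately left out.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a simplified regex approximation of the listed forms,
// not the tokenizer's actual rules. All names here are hypothetical.
class GenderFormSketch {
    // A word stem, optionally hyphenated (e.g. "Kosovo-Albaner").
    private static final String STEM = "\\p{L}+(?:-\\p{L}+)*";

    // Colon (":in"), slash ("/in", "/-in"), and parenthetical ("(in)", "(-in)") forms.
    private static final Pattern SUFFIX_FORM = Pattern.compile(
        STEM + "(?::(?:innen|in)|/(?:innen|-?in)|\\((?:innen|-?in)\\))");

    // "mann/frau" pattern: requires at least one letter before "mann".
    private static final Pattern MANN_FRAU = Pattern.compile("\\p{L}+mann/-?frau");

    // True if the whole word looks like one of the supported forms.
    static boolean isGenderForm(String word) {
        return SUFFIX_FORM.matcher(word).matches()
            || MANN_FRAU.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : List.of("Nutzer:in", "Nutzer/-in", "Nutzer(innen)",
                                "Kaufmann/frau", "mann/frau", "Nutzer:")) {
            System.out.println(w + " -> " + isGenderForm(w));
        }
    }
}
```

The `\p{L}+mann` part mirrors the "non-empty prefix" condition: a bare `mann/frau` is rejected because nothing precedes "mann".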
- Added `de_old` German tokenizer variant without gender-sensitive rules (use `-l de_old` to split forms like `Nutzer:in` into separate tokens)
- Fixed thousands separators not being handled consistently (issue #135):
  - Apostrophe `'` (Swiss format: `1'000'000`)
  - Thin space U+2009 and narrow no-break space U+202F
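
As a rough illustration of the thousands-separator characters named above, the following Java sketch checks whether a string is a number grouped by an apostrophe, a thin space (U+2009), or a narrow no-break space (U+202F). It is an assumption-laden sketch, not the tokenizer's implementation; `ThousandsSeparatorSketch` and `isGroupedNumber` are hypothetical names.

```java
import java.util.regex.Pattern;

// Illustrative only: not the tokenizer's actual rule. Names are hypothetical.
class ThousandsSeparatorSketch {
    // One to three leading digits, then one or more groups of
    // "separator + exactly three digits". Separators: ' U+2009 U+202F.
    // Note: this simple pattern does not require the same separator throughout.
    private static final Pattern GROUPED = Pattern.compile(
        "\\d{1,3}(?:['\\u2009\\u202F]\\d{3})+");

    // True if the whole string is a digit sequence grouped in thousands.
    static boolean isGroupedNumber(String s) {
        return GROUPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGroupedNumber("1'000'000"));       // Swiss format
        System.out.println(isGroupedNumber("1\u2009000"));      // thin space
        System.out.println(isGroupedNumber("12\u202F345"));     // narrow no-break space
        System.out.println(isGroupedNumber("1'00"));            // incomplete group
    }
}
```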
- Updated dependencies
- Fixed soft hyphens (U+00AD) being incorrectly treated as token boundaries (issue #131)
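
The intent of the soft-hyphen fix can be sketched in Java as follows: a word pattern that allows U+00AD inside a run of letters keeps such a word together as one token. This is a simplified stand-in under that assumption, not the tokenizer's own code; `SoftHyphenSketch` and `words` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not the tokenizer's actual rule. Names are hypothetical.
class SoftHyphenSketch {
    // Letters, optionally joined by soft hyphens (U+00AD), form a single word.
    private static final Pattern WORD =
        Pattern.compile("\\p{L}+(?:\\u00AD\\p{L}+)*");

    // Collects all word matches in the text, in order.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The soft hyphens stay inside the first word instead of splitting it.
        String text = "Donau\u00ADdampf\u00ADschiff legt ab";
        System.out.println(words(text).size()); // prints 3
    }
}
```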
- Updated dependencies
- Improved compatibility with Java 25 (fixed deprecation warnings)
- Fixed genderstern and omission asterisk breaking after hyphens (issue #115)
- Added emoji complex support (issue #113)
- Added Wikipedia emoji template support (issue #114)
- Fixed breaking most frequent hyphenated compound abbreviations for German (issue #116)
- Updated dependencies
- adds more ossrh sync data to maven pom
- minor code cleanups
- some API documentation added
- Updated dependencies
- Minimum Java version raised to 17
- Fixed group id in pom.xml
- Removed compile dependency on Maven Surefire
- Build artifacts in src/main/jflex are now ignored by git
- java.io's ByteArrayOutputStream used instead of 3rd-party class
- Bug fix: a single quotation mark at the beginning of a word is no longer interpreted as the beginning of an omission, but as a quotation mark token.
- dependencies updated
- "du." is no longer treated as an abbreviation.
- "Dir." and "dir." are no longer treated as abbreviations.
- Apostrophe and hyphen marked contractions and clitics in English (I've, isn't, Peter's, …) and French (j'ai, d'un, l'art, sont-elles, …) are now separated.
- GitHub CI test workflow added
- Dependencies updated
- `-Xss2m` added to maven jvm config
- `--sentence-boundaries|-s` now prints sentence boundaries only if `--positions|-p` is also present
- Dependencies updated
- Tokenizer and sentence splitter for English (`-l en` option) added
- Tokenizer and sentence splitter for French (`-l fr` option) added
- Support for adding more languages
- `UTF-8` input encoding is now expected by default, different encodings can be set by the `--encoding <enc>` option
- By default, tokens are now printed to stdout (use options `--no-tokens --positions` to print character offsets instead)
- Abbreviated German street names like Kunststr. are now recognized as tokens
- Added heuristics for distinguishing between I. as abbreviation vs PPER / CARD
- URLs without URI-scheme are now recognized as single tokens if they start with `www.`
- Standard EOT/EOF character `\x04` is used instead of magic escape `\n\x03\n`
- Quoted email names containing space characters, like "John Doe"@xx.com, are no longer interpreted as single tokens
- Sentence splitter functionality added (`--sentence-boundaries` option)
- First version published on https://korap.ids-mannheim.de/gerrit/plugins/gitiles/KorAP/KorAP-Tokenizer
- Extracted from KorAP-internal ingestion pipeline