Releases: tsproisl/SoMaJo
v2.0.4
v2.0.3
v2.0.2
v2.0.1
v2.0.0
New features and improvements
- New API: Use new class SoMaJo instead of Tokenizer and SentenceSplitter. Currently, the old API is still supported but will issue deprecation warnings.
- Speed-up: Due to a new internal representation of the input text during processing (as a doubly linked list of Token objects), tokenization is now two to three times faster.
- Incremental and parallel processing of XML: If a sensible set of eos_tags is specified, the XML input is processed incrementally (allowing for arbitrarily large XML input), and processing can also be parallelized.
- New option --strip-tags to suppress the output of XML tags.
- Support for textual representations of emojis (:smile:, :stuck_out_tongue_winking_eye:, etc.).
- Support for textfaces (༼ʘ̚ل͜ʘ̚༽, ╚(ಠ_ಠ)=┐, etc.).
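As a rough illustration of the new API described above, the following sketch wraps the SoMaJo class in two small helper functions. The helper names (`tokenize_paragraphs`, `tokenize_xml_string`) are hypothetical and not part of SoMaJo. The calls to `tokenize_text`, `tokenize_xml`, the `strip_tags` keyword and the `text` attribute of tokens reflect the project's documentation as best understood here, so check them against the installed version.

```python
def tokenize_paragraphs(paragraphs, language="de_CMC"):
    """Hypothetical helper: return sentences as lists of token strings."""
    # Imported lazily so this sketch can be loaded without SoMaJo installed.
    from somajo import SoMaJo

    tokenizer = SoMaJo(language)
    # tokenize_text takes an iterable of paragraph strings and yields
    # sentences, each a list of Token objects.
    return [[token.text for token in sentence]
            for sentence in tokenizer.tokenize_text(paragraphs)]


def tokenize_xml_string(xml_data, eos_tags, language="de_CMC", strip_tags=False):
    """Hypothetical helper: tokenize a well-formed XML string.

    eos_tags names the tags that always end a sentence (for example
    ["p", "div"]); strip_tags is assumed to mirror the --strip-tags option.
    """
    from somajo import SoMaJo

    tokenizer = SoMaJo(language)
    sentences = tokenizer.tokenize_xml(xml_data, eos_tags, strip_tags=strip_tags)
    return [[token.text for token in sentence] for sentence in sentences]


if __name__ == "__main__":
    for sentence in tokenize_paragraphs(["Hallo Welt! Wie geht's?"]):
        print(" ".join(sentence))
```

Code that still uses Tokenizer and SentenceSplitter should keep working for now, but it will issue deprecation warnings, so moving both steps behind a single SoMaJo object is the safer long-term choice.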
Breaking changes
- Removed the tokenizer script (deprecated since version 1.5.0, released in October 2017). Use somajo-tokenizer instead.
- Language codes contain the tokenization guideline: "de_CMC" instead of "de" and "en_PTB" instead of "en".
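Callers that still store the old language codes in configuration files may want a small translation step when upgrading. The sketch below is one possible approach; the mapping and the function `migrate_language_code` are hypothetical helpers and not part of SoMaJo. Only the two code pairs named in the release notes are covered.

```python
# Hypothetical mapping from pre-2.0.0 language codes to the new codes,
# which include the tokenization guideline.
LEGACY_TO_CURRENT = {
    "de": "de_CMC",
    "en": "en_PTB",
}


def migrate_language_code(code):
    """Return the 2.0.0-style code for a legacy code; pass others through."""
    return LEGACY_TO_CURRENT.get(code, code)


if __name__ == "__main__":
    for old in ("de", "en", "de_CMC"):
        print(old, "->", migrate_language_code(old))
```

Codes that are already in the new form are returned unchanged, so the helper can be applied repeatedly without side effects.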