Releases: tsproisl/SoMaJo

v2.0.4

05 Mar 08:17

Bugfix: Prevent race conditions between tokenizer and sentence splitter in parallel processing (--parallel > 1).

v2.0.3

28 Feb 10:02
  • Skip tests for unimplemented features (some builds will fail if any of the unit tests fail).

v2.0.2

28 Feb 08:38
  • Bugfix: Parallel tokenization (--parallel > 1) works again.
  • Support for musical notes (sharps).

v2.0.1

19 Dec 14:20

Bugfix. As always, there is a minuscule detail that causes things to go wrong… 🙄

v2.0.0

19 Dec 14:13

New features and improvements

  • New API: Use new class SoMaJo instead of Tokenizer and SentenceSplitter. Currently, the old API is still supported but will issue deprecation warnings.
  • Speed-up: Due to a new internal representation of the input text during processing (as a doubly linked list of Token objects), tokenization is now two to three times faster.
  • Incremental and parallel processing of XML: If a sensible set of eos_tags is specified, the XML input will be processed incrementally (allowing for arbitrarily large XML input), and processing can also be parallelized.
  • New option --strip-tags to suppress the output of XML tags.
  • Support for textual representations of emojis (:smile:, :stuck_out_tongue_winking_eye:, etc.).
  • Support for textfaces (༼ʘ̚ل͜ʘ̚༽, ╚(ಠ_ಠ)=┐, etc.).
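
As a rough illustration of the new API, the sketch below wraps the SoMaJo class in a small helper. The helper name tokenize_paragraphs and the sample sentence are made up for this example; it assumes SoMaJo 2.x is installed and relies on the tokenize_text method and the text and token_class attributes of Token objects as documented in the project README.

```python
def tokenize_paragraphs(paragraphs, language="de_CMC"):
    """Return a list of sentences, each a list of (text, token_class) pairs.

    `tokenize_paragraphs` is a hypothetical helper for this example,
    not part of SoMaJo itself.
    """
    # Imported lazily so this module can be loaded without SoMaJo installed.
    from somajo import SoMaJo

    tokenizer = SoMaJo(language)
    # tokenize_text yields one list of Token objects per sentence.
    sentences = tokenizer.tokenize_text(paragraphs)
    return [
        [(token.text, token.token_class) for token in sentence]
        for sentence in sentences
    ]


# Example usage (requires `pip install SoMaJo`):
# for sentence in tokenize_paragraphs(["Heute ist ein schöner Tag :smile:"]):
#     for text, token_class in sentence:
#         print(f"{text}\t{token_class}")
```

Code that still uses Tokenizer and SentenceSplitter should keep working for now, but, per the note above, it will emit deprecation warnings.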

Breaking changes

  • Removed the tokenizer script (deprecated since version 1.5.0, released in October 2017). Use somajo-tokenizer instead.
  • Language codes contain the tokenization guideline: "de_CMC" instead of "de" and "en_PTB" instead of "en".
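
For callers that still pass the old language codes, a small translation step may ease migration. The following is only a sketch: the mapping is taken from the breaking change listed above, while the helper name migrate_language_code is hypothetical and not provided by SoMaJo.

```python
# Old codes and their 2.0.0 replacements, as listed in the release notes.
LANGUAGE_CODE_MIGRATION = {"de": "de_CMC", "en": "en_PTB"}


def migrate_language_code(code):
    """Translate a pre-2.0.0 language code to its 2.0.0 equivalent.

    Hypothetical helper for illustration; not part of SoMaJo.
    """
    if code in LANGUAGE_CODE_MIGRATION:
        return LANGUAGE_CODE_MIGRATION[code]
    if code in LANGUAGE_CODE_MIGRATION.values():
        # Already a new-style code, pass it through unchanged.
        return code
    raise ValueError(f"Unknown language code: {code!r}")
```

Calling migrate_language_code("de") gives "de_CMC", and a code that is already in the new form is returned unchanged.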

v1.11.0

08 Nov 08:28
  • XML sentence splitting: Added hr tag to default sentence breaks
  • Recognize Reddit links in shorthand notation
  • Improved robustness of XML processing

v1.10.7

01 Nov 15:14
  • Make recognition of gender star case insensitive
  • Fix problem with “nasty” character as last character of text unit

v1.10.6

02 Oct 14:11
  • Added support for gender star (Mitarbeiter*innen)
  • Improvements regarding lists of numbers (1,2,3,4,5,6,7), section numbers (1.1.4) and IPv4 addresses (192.0.2.42)
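
To make the token types mentioned above more concrete, here is a minimal sketch of how such strings could be told apart with regular expressions. The patterns and category labels are assumptions made for this example and are not SoMaJo's actual rules, which are likely more elaborate.

```python
import re

# Illustrative patterns only; these are assumptions, not SoMaJo's actual rules.
GENDER_STAR = re.compile(r"\w+\*in(?:nen)?", re.IGNORECASE)
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")
SECTION_NUMBER = re.compile(r"\d+(?:\.\d+)+")


def classify(token):
    """Return a hypothetical category label for a single token string."""
    if GENDER_STAR.fullmatch(token):
        return "gender_star"
    # IPv4 is checked before section numbers, because every dotted quad
    # would also match the more general section-number pattern.
    if IPV4.fullmatch(token):
        return "ipv4"
    if SECTION_NUMBER.fullmatch(token):
        return "section_number"
    return "other"
```

With these patterns, "Mitarbeiter*innen" is labelled as a gender star form, "192.0.2.42" as an IPv4 address, and "1.1.4" as a section number.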

v1.10.5

02 Aug 13:42

A few small improvements regarding the emoji variation selector character.

v1.10.4

01 Aug 11:26

Bugfix related to the --version option.