Releases: tsproisl/SoMaJo

v1.10.3

19 Jul 19:55
  • New option -v/--version to output version information.
  • Explicitly specify input encoding as UTF-8.
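
As a rough illustration of the two changes above, the following sketch shows how a command-line tool might expose a -v/--version flag and read its input explicitly as UTF-8 using only the Python standard library. It is not SoMaJo's actual code: the program name `tokenizer-demo`, the helper names `build_parser` and `read_input`, and the hard-coded version string are assumptions made for this example.

```python
import argparse

# Hypothetical version constant; a real package would normally read this
# from its own metadata instead of hard-coding it.
VERSION = "1.10.3"


def build_parser():
    """Build a parser with a -v/--version flag (illustrative only)."""
    parser = argparse.ArgumentParser(prog="tokenizer-demo")
    # argparse's built-in "version" action prints the string and exits.
    parser.add_argument(
        "-v", "--version", action="version", version="%(prog)s " + VERSION
    )
    parser.add_argument("file", help="path to the input text file")
    return parser


def read_input(path):
    """Read the input with an explicit encoding instead of the locale default."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(len(read_input(args.file)), "characters read")
```

Passing the encoding explicitly keeps the behaviour the same regardless of the system locale, which is presumably the motivation for the change.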

v1.10.2

02 Jul 05:10

Fixed a bug where SoMaJo could not be installed unless the regex package was already installed.

v1.10.0

28 Jun 15:41
  • Treat emoji sequences that render as a single grapheme as a single token. This includes flags and sequences containing modifiers and zero-width joiners.
  • Recognize underscores used for "underlining" and split them off.
  • Added a few Unicode formatting characters to the “nasty” characters.
  • Replaced POSIX character classes with built-ins or Unicode properties.
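
To make the first item concrete, here is a minimal sketch of what treating an emoji sequence as a single token can look like. It is not SoMaJo's implementation: the function name `emoji_aware_tokens` is hypothetical, and the code point ranges are a deliberately rough approximation of "emoji" chosen so that the example runs with the standard library alone.

```python
import re

# Rough approximation of emoji base characters (assumption for this sketch;
# the real set is defined by the Unicode emoji data files).
_BASE = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
# Skin-tone modifiers U+1F3FB..U+1F3FF.
_MODIFIER = "[\U0001F3FB-\U0001F3FF]"
# One emoji element: base, optional variation selector-16, optional modifier.
_ELEMENT = f"{_BASE}\uFE0F?{_MODIFIER}?"

_EMOJI_SEQUENCE = (
    # Flags: a pair of regional indicator symbols.
    f"[\U0001F1E6-\U0001F1FF]{{2}}"
    # Elements chained with zero-width joiners (U+200D).
    f"|{_ELEMENT}(?:\u200D{_ELEMENT})*"
)

# Emoji sequences first, then word characters, then any other single symbol.
_TOKEN = re.compile(f"{_EMOJI_SEQUENCE}|\\w+|\\S")


def emoji_aware_tokens(text):
    """Split text into tokens, keeping each emoji sequence in one piece."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    # Flag (two regional indicators) and thumbs-up with a skin-tone modifier.
    print(emoji_aware_tokens("Hallo \U0001F1E9\U0001F1EA \U0001F44D\U0001F3FD"))
```

Without the sequence pattern, the flag would fall apart into two regional indicator symbols and the modifier would be split from its base character.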

v1.9.0

02 Apr 07:26

Added a new method Tokenizer.tokenize_file for easy tokenization of files from Python.

v1.8.3

02 Nov 08:43

This is a bugfix release (see CHANGES.txt).

v1.8.2

26 Oct 09:27

This release fixes two bugs (cf. CHANGES.txt for details).

v1.8.1

30 Jul 08:21

Bugfix release, see CHANGES.txt.

v1.8.0

04 Jul 09:28
  • SoMaJo can now tokenize English text.
  • Minor improvements to tokenization.

v1.7.0

22 Mar 14:58

SoMaJo now has full XML support; see CHANGES.txt for details.

v1.6.0

05 Mar 09:22

Some small improvements; see CHANGES.txt.