v2.4.0
- New feature: SoMaJo can output character offsets for tokens, allowing for stand-off tokenization. Pass character_offsets=True to the constructor or use the option --character-offsets on the command line to enable the feature. The character offsets are determined by aligning the tokenized output with the input, so activating the feature incurs a noticeable increase in processing time.
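
To illustrate what aligning tokenized output with the input means, here is a minimal sketch in plain Python. It is not SoMaJo's implementation: the helper name align_tokens and the sample tokens are hypothetical, and it assumes every token appears verbatim in the input, which a real tokenizer that normalizes or rewrites text cannot rely on.

```python
def align_tokens(text, tokens):
    """Return a (start, end) character offset for each token.

    Hypothetical helper: scans the input left to right and assumes each
    token occurs verbatim in the text, in order.
    """
    offsets = []
    cursor = 0  # position in the input up to which tokens are already aligned
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


# Made-up example input and tokenization, for illustration only.
text = "Na, geht's gut?"
tokens = ["Na", ",", "geht", "'s", "gut", "?"]
offsets = align_tokens(text, tokens)

for token, (start, end) in zip(tokens, offsets):
    # Stand-off annotation: the input stays untouched and each token
    # is identified by its character span.
    print(f"{start}\t{end}\t{token}")
```

Even this simple version makes an extra pass over the input after tokenization, which gives a rough sense of why the feature costs additional processing time.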