segments

The segments package provides Unicode Standard tokenization routines and orthography profile segmentation. It implements the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now generate a profile from the text:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
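
The profile above is essentially a frequency table of the graphemes in the input. As a rough illustration only, the counting step could be sketched in plain Python as follows. This is not the package's implementation: `simple_graphemes` and `profile_rows` are hypothetical helper names, grapheme clusters are approximated as a base character plus any following combining marks, and the row order may differ from what `segments profile` prints.

```python
import unicodedata
from collections import Counter


def simple_graphemes(text):
    """Approximate grapheme clusters: a base character plus following combining marks.

    Assumption: this is a simplification of full Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            # Attach combining marks to the preceding base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    # Recompose each cluster so that e.g. "a" + U+0308 is reported as one code point.
    return [unicodedata.normalize("NFC", c) for c in clusters]


def profile_rows(text):
    """Return (grapheme, frequency, mapping) rows, most frequent grapheme first."""
    counts = Counter(g for g in simple_graphemes(text) if not g.isspace())
    # By default, each grapheme maps to itself.
    return [(g, n, g) for g, n in counts.most_common()]


rows = profile_rows("aäaaöaaüaa")
print("Grapheme\tfrequency\tmapping")
for grapheme, frequency, mapping in rows:
    print(f"{grapheme}\t{frequency}\t{mapping}")
```

For the example text this yields a count of 7 for a and 1 for each of the umlauted vowels, matching the frequencies shown above.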

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile, for example by adding a row that maps the grapheme aa to x:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö

Now tokenize the text without a profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with the profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

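To output the values of the profile's mapping column instead of the graphemes, also pass --mapping:

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize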
a ä x ö x ü x
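
To make the effect of the profile more concrete, here is a minimal sketch of greedy longest-match segmentation against a tab-separated profile. It is an illustration under stated assumptions, not the package's own code: `read_profile` and `tokenize` are hypothetical helpers, the profile is embedded as a string rather than read from profile.prf, and the placeholder used for characters missing from the profile is an arbitrary choice made for this sketch.

```python
import csv
import io
import unicodedata

# The edited profile from above, as tab-separated text.
PROFILE_TSV = (
    "Grapheme\tfrequency\tmapping\n"
    "aa\t0\tx\n"
    "a\t7\ta\n"
    "ä\t1\tä\n"
    "ü\t1\tü\n"
    "ö\t1\tö\n"
)


def read_profile(tsv_text):
    """Parse a tab-separated profile into {grapheme: row}, with NFC-normalized keys."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {unicodedata.normalize("NFC", row["Grapheme"]): row for row in reader}


def tokenize(text, profile, column="Grapheme", missing="\ufffd"):
    """Segment text by always taking the longest profile grapheme that matches.

    `missing` is emitted for characters not covered by the profile
    (an assumption of this sketch).
    """
    text = unicodedata.normalize("NFC", text)
    longest = max(len(g) for g in profile)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                out.append(profile[chunk][column])
                i += size
                break
        else:
            # No profile entry matches at this position.
            out.append(missing)
            i += 1
    return " ".join(out)


profile = read_profile(PROFILE_TSV)
text = "aäaaöaaüaa"
print(tokenize(text, profile))                    # a ä aa ö aa ü aa
print(tokenize(text, profile, column="mapping"))  # a ä x ö x ü x
```

Trying longer candidates first is what makes aa win over a wherever two a characters are adjacent, which is why the single leading a stays unmerged.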

API

>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'