segments

The segments package provides Unicode Standard tokenization routines and orthography segmentation, implementing the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now look at the profile:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
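
To illustrate what such a profile records, here is a minimal sketch in plain Python that counts graphemes in a string. It is not the package's implementation: the helper names `simple_graphemes` and `frequency_profile` are hypothetical, and grapheme clusters are only approximated by attaching combining marks to the preceding base character rather than following the full Unicode segmentation rules.

```python
import unicodedata
from collections import Counter


def simple_graphemes(text):
    """Approximate grapheme clusters (assumption: base character plus
    any following combining marks form one grapheme)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the combining mark to its base
        else:
            clusters.append(ch)
    return clusters


def frequency_profile(text):
    """Return (grapheme, frequency) pairs, most frequent first."""
    return Counter(simple_graphemes(text)).most_common()


rows = frequency_profile("aäaaöaaüaa")
print("Grapheme\tfrequency\tmapping")
for grapheme, freq in rows:
    # By default each grapheme maps to itself, as in the profile above.
    print(f"{grapheme}\t{freq}\t{grapheme}")
```

The order of graphemes with equal frequency may differ from the output of `segments profile`.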

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
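
A profile file like this can also be inspected outside the package. The sketch below reads profile text with Python's standard `csv` module, assuming the columns are tab-separated (as they appear when a profile is printed in the API section). The function name `read_profile` and the inlined sample text are hypothetical and only serve the illustration.

```python
import csv
import io

# A shortened stand-in for the contents of profile.prf (tab-separated).
PROFILE_TEXT = (
    "Grapheme\tfrequency\tmapping\n"
    "aa\t0\tx\n"
    "a\t7\ta\n"
)


def read_profile(lines):
    """Map each grapheme to its full row (a dict keyed by column name)."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["Grapheme"]: row for row in reader}


profile = read_profile(io.StringIO(PROFILE_TEXT))
print(profile["aa"]["mapping"])
```

To read a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `open("profile.prf", encoding="utf-8", newline="")`.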

Now tokenize the text without a profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with the profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x

API

>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
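
The behaviour shown above, where a profile groups characters into multi-character graphemes and optionally replaces them with a value from another column, can be approximated with a greedy longest-match loop. The following is a sketch under stated assumptions, not the package's own algorithm or data structures: the function `tokenize` and the dict-based profile layout are hypothetical, and marking unmatched characters with U+FFFD is a choice made for this sketch.

```python
def tokenize(text, profile, column=None):
    """Segment `text` by always taking the longest grapheme listed in
    `profile` (a dict: grapheme -> dict of column values).

    If `column` is given, emit that column's value instead of the grapheme.
    """
    longest = max(map(len, profile), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk][column] if column else chunk)
                i += size
                break
        else:
            # Nothing in the profile matches at this position.
            tokens.append("\ufffd")
            i += 1
    return " ".join(tokens)


profile = {
    "ab": {"mapping": "x"},
    "cd": {"mapping": "y"},
}
print(tokenize("abcd", profile))                    # ab cd
print(tokenize("abcd", profile, column="mapping"))  # x y
```

Trying the longest candidate first is what lets a two-character grapheme such as `ab` win over any single-character entries that might also be listed in the profile.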
