Releases: OpenPecha/Botok
pybo/botok split
0.6.9 - 20190901
Added
- only the tokenizer's codebase is kept in the botok repo; everything else is moved to the new pybo repo.
Bugfix
Batch regex
Add RDR rule adjustments
0.6.6 - 20190820
Added
- RDR rules parser that converts the rules into pybo's CQL ReplaceMatcher format
- integrate it in WordTokenizer and Config (same options as for the trie data and profiles)
- add a CLI option using parse_rdr_rules().
Bugfix
Basic CLI
Bugfix
Add sentence and paragraph tokenizers
0.6.2 - 20190814
Added
- implemented sentence and paragraph tokenizers + Text properties
- meaning field in the entries attribute of Token objects
Changed
- reduced the number of times WordTokenizers were loaded in the test suite (for Travis)
- improved names for greater consistency
Fixed
- a few remaining bugs from previous release
Multiple meanings per inflected form/trie entry
0.6.1 - 20190813
Fixed
- affixed particles were inflected
- pos, lemma and frequency are brought together: a single inflected form can be two different words, and thus have a different POS and a different frequency.
- various bugs related to the refactoring
Added
- support for more than one meaning for every trie entry (inflected form)
A meanings attribute is added to the Token objects. It holds as many meanings as are found in the trie data.
A default meaning is chosen, then the pos, lemma and freq fields are copied from the meanings attribute to the attributes bearing these names.
When only one meaning is available, it is chosen; otherwise, the meaning with the highest number of attributes is chosen from the following groups, in this order: meanings that are unaffixed words, meanings that don't have the affixed attribute, meanings that are affixed words.
- adjustments required by the above in the different parts of pybo
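The default-meaning selection described in this release can be sketched as follows. This is only an illustration of the rule as worded above, not the code shipped in the release: the function names choose_default_meaning and copy_default_fields are hypothetical, and both meanings and tokens are assumed to be plain dicts, with "affixed" stored as a boolean when present.

```python
def choose_default_meaning(meanings):
    """Pick a default meaning from a list of dicts (assumed layout).

    Groups are tried in the order given in the release note:
    unaffixed words, meanings without an "affixed" key, affixed words.
    Within the first non-empty group, the meaning with the most
    attributes wins (the first one on a tie).
    """
    if not meanings:
        return None
    if len(meanings) == 1:
        return meanings[0]
    groups = [
        [m for m in meanings if m.get("affixed") is False],
        [m for m in meanings if "affixed" not in m],
        [m for m in meanings if m.get("affixed") is True],
    ]
    for group in groups:
        if group:
            return max(group, key=len)
    # Fallback for "affixed" values that are not booleans.
    return meanings[0]


def copy_default_fields(token, meanings):
    """Copy pos, lemma and freq from the default meaning onto the token."""
    default = choose_default_meaning(meanings)
    if default is not None:
        for field in ("pos", "lemma", "freq"):
            if field in default:
                token[field] = default[field]
    return token
```

With one affixed, one unmarked and one unaffixed meaning, the unaffixed one is selected even if another meaning carries more attributes, because the group order takes precedence over the attribute count.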
Refactoring: making an intuitive interface to pybo
Changed
- refactored the Pipeline class into the Text class. See test_text.py for an overview of what it does.