Releases: OpenPecha/Botok

pybo/botok split

01 Sep 23:04
3343f64

0.6.9 - 20190901

Added

  • only the tokenizer's codebase is kept in the botok repo; everything else is moved to the new pybo repo.

Bugfix

26 Aug 14:38

0.6.7 - 20190826

Fixed

  • freeze on Mac when a Path contained ../

Batch regex

21 Aug 11:46

0.6.6 - 20190821

Added

  • batch regex from file + cli

Add RDR rule adjustments

20 Aug 14:44

0.6.6 - 20190820

Added

  • parser that converts RDR rules into pybo's CQL ReplaceMatcher format
  • integrated it into WordTokenizer and Config (same options as for the trie data and profiles)
  • added a CLI option using parse_rdr_rules().

Bugfix

16 Aug 14:57

0.6.5 - 20190816

Fixed

  • particles missing from the list triggered a bug

Basic CLI

15 Aug 20:21

0.6.4 - 20190815

Added

  • CLI for basic tokenization of strings and files

Bugfix

14 Aug 23:15

0.6.3 - 20190814

Fixed

  • removed a print() that was executed for every added word

Add sentence and paragraph tokenizers

14 Aug 22:55

0.6.2 - 20190814

Added

  • implemented sentence and paragraph tokenizers + Text properties
  • meaning field in the entries attribute of Token objects

Changed

  • reduced the number of times WordTokenizers were loaded in the test suite (for Travis)
  • improved names for greater consistency

Fixed

  • a few remaining bugs from previous release

Multiple meanings per inflected form/trie entry

13 Aug 21:23

0.6.1 - 20190813

Fixed

  • affixed particles were inflected
  • pos, lemma and frequency are now grouped together: a single inflected form can correspond to two different words, each with its own POS and frequency.
  • various bugs related to the refactoring

Added

  • support for more than one meaning for every trie entry (inflected form)

A meanings attribute is added to Token objects. It holds as many meanings as are found in the trie data.
A default meaning is chosen, then its pos, lemma and freq fields are copied from the meanings attribute to the Token attributes bearing these names.
When only one meaning is available, it is chosen. Otherwise, the meaning with the highest number of attributes is chosen from the following groups, in this order:
meanings that are unaffixed words, meanings that don't have the affixed attribute, meanings that are affixed words.

  • adjustments required by the above in the different parts of pybo
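
The default-meaning selection described in this release can be sketched as follows. This is only an illustration, not the library's code: it assumes meanings are plain dicts, that the affixed attribute is a boolean key that may be absent, that the groups are tried in order until one is non-empty, and that ties keep the order of the trie data. The names choose_default_meaning and apply_default_meaning are hypothetical.

```python
from typing import Any, Dict, List, Optional

Meaning = Dict[str, Any]


def choose_default_meaning(meanings: List[Meaning]) -> Optional[Meaning]:
    """Pick a default meaning using the priority order from the release notes."""
    if not meanings:
        return None
    if len(meanings) == 1:
        return meanings[0]

    # Assumed encoding: "affixed" is False, missing, or True.
    groups = [
        [m for m in meanings if m.get("affixed") is False],  # unaffixed words
        [m for m in meanings if "affixed" not in m],         # no affixed attribute
        [m for m in meanings if m.get("affixed") is True],   # affixed words
    ]
    for group in groups:
        if group:
            # The number of keys stands in for the number of attributes;
            # max() returns the first maximal item, so ties keep input order.
            return max(group, key=len)
    # Fallback for unexpected "affixed" values (an assumption of this sketch).
    return meanings[0]


def apply_default_meaning(token: Dict[str, Any]) -> Dict[str, Any]:
    """Copy pos, lemma and freq from the chosen meaning onto the token."""
    chosen = choose_default_meaning(token.get("meanings", []))
    if chosen is not None:
        for field in ("pos", "lemma", "freq"):
            if field in chosen:
                token[field] = chosen[field]
    return token


# Hypothetical token with three meanings for one inflected form.
token = {
    "text": "example",
    "meanings": [
        {"pos": "NOUN", "affixed": True, "lemma": "a", "freq": 10},
        {"pos": "VERB"},
        {"pos": "NOUN", "lemma": "b", "affixed": False},
    ],
}
apply_default_meaning(token)
print(token["pos"], token["lemma"])  # prints "NOUN b"
```

In this example the unaffixed meaning wins even though the affixed one carries more attributes, because group priority is applied before the attribute count.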

Refactoring: making an intuitive interface to pybo

01 Jul 12:26

Changed

  • refactored the Pipeline class into the Text class. See test_text.py for an overview of what it does.