All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Added the CantoMap corpus.
- Added `from_git` and `from_url` to `CHAT` to support fetching datasets from remote sources.
- Bumped rustling to >= 0.8.0.
- `segment()` for word segmentation now takes a boolean keyword argument `offsets` to optionally give the `(start, end)` indices of the segmented words.
- Bumped rustling to >= 0.7.0, for retrieving offsets in word segmentation.
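
As a rough illustration of what `(start, end)` indices for segmented words mean, here is a minimal, self-contained sketch. It does not call PyCantonese or Rustling. The helper `words_with_offsets` is hypothetical, and it assumes that the words join back into the input string and that `end` is exclusive. The actual return shape of `segment()` with `offsets=True` may differ.

```python
def words_with_offsets(words):
    """Pair each word with its (start, end) character indices.

    Hypothetical helper for illustration only: assumes the words, joined
    together, reproduce the original unsegmented string, and that `end`
    is exclusive (as in Python slicing).
    """
    result = []
    position = 0
    for word in words:
        start, end = position, position + len(word)
        result.append((word, (start, end)))
        position = end
    return result


text = "廣東話好難學"
segmented = ["廣東話", "好", "難學"]  # a hand-made segmentation, not model output
pairs = words_with_offsets(segmented)
# Each offset pair slices the original text back to the word itself.
assert all(text[start:end] == word for word, (start, end) in pairs)
```

With exclusive end indices, `text[start:end]` recovers each word directly, which is the main practical use of offsets.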
- `segment()` for word segmentation now splits a mixed Cantonese/English unit into separate words if it is not already a recognized mixed word.
The major version bump from v3 to v4 is due to backward-incompatible yet minor API changes triggered by the switch from the PyLangAcq + wordseg dependencies to Rustling for handling CHAT data, word segmentation, and part-of-speech tagging.
- Added the new function `jyutping_to_ipa` for Jyutping-to-IPA conversion.
- The `characters_to_jyutping` function can now take a list of strings as input with user-provided word segmentation.
- Added support for Python 3.11, 3.12, 3.13, and 3.14.
- Switched to Rustling as the underlying engine for CHAT data parsing, word segmentation, and part-of-speech tagging; dropped PyLangAcq and wordseg as dependencies.
- `CHATReader` has been renamed `CHAT` and switched from PyLangAcq's legacy pure-Python parser to a Rust-based parser from Rustling, with various API changes for method names, arguments, etc.
- The word segmentation model has been updated to a semi-supervised hybrid approach that combines a DAG and hidden Markov model. The `segment` function for word segmentation no longer accepts an argument for a custom segmenter.
- For both word segmentation and part-of-speech tagging, the persisted models shipped with the package are now zstd-compressed FlatBuffers binaries.
- Dropped support for Python 3.7, 3.8, and 3.9.
- Fixed word segmentation so that spaces between English words in the user input are now honored as word boundaries.
- If `parse_text` is given an empty input or `None`, an empty `CHAT` instance is now returned.
- If `parse_text` is given a non-empty list of utterances, then any empty utterance (e.g., `None`, `""`) will now be represented by an empty `Utterance` instance inside the resulting `CHAT` output.
- Fixed the HKCanCor-to-UD mapping so that `G1` is mapped to `VERB`, not `V`.
- Added the `parse_text` function for analyzing Cantonese text data.
- Characters-to-Jyutping conversion: The `characters_to_jyutping` function now has the `segmenter` kwarg for customizing word segmentation.
- Added support for Python 3.10.
- Turned on Windows testing on CircleCI.
- Added `pyproject.toml`. Related to preferring `setup.cfg` for specifying build metadata and options.
- Characters-to-Jyutping conversion: For the `characters_to_jyutping` function, in case rime-cantonese and HKCanCor don't agree, rime-cantonese data (more accurate) is preferred.
- Updated the rime-cantonese data to the latest `2021.05.16` release, improving both characters-to-Jyutping conversion and word segmentation.
- Updated the PyLangAcq dependency to v0.16.0, allowing PyCantonese's `CHATReader` to use the new methods `to_chat`, `to_strs`, `info`, `head`, and `tail`.
- Switched to `setup.cfg` to fully specify build metadata and options, while keeping a minimal `setup.py` for backward compatibility. Related to the new `pyproject.toml`.
- Dropped support for Python 3.6.
- Turned on `safety` and `bandit` checks at CircleCI builds.
- Allowed PyLangAcq v0.14.* for real.
- Allowed PyLangAcq v0.14.*, thereby adding the new features of the `filter` method to `CHATReader` and optional parallelization for CHAT data processing.
- Fixed the `search` method of `CHATReader` when `by_tokens` is `False`.
- Fixed the previously inoperational methods `append`, `append_left`, `extend`, and `extend_left` of the class `CHATReader` through the upstream PyLangAcq package.
- Retrained the part-of-speech tagger, after the minor character fix from v3.2.3.
- Raised `NotImplementedError` for the method `ipsyn` of `CHATReader`, since the upstream method works only for English.
- Fixed character issues in the built-in HKCanCor data: 𥄫
- Fixed a CHAT parsing issue when correction and repetition are combined, by bumping the pylangacq dependency from v0.13.0 to v0.13.1.
- Fixed character issues in the built-in HKCanCor data: 𠮩𠹌, 𠻗
Note: The underlying CHAT parser, the PyLangAcq package, has been bumped to v0.13.0. All of the updates of PyLangAcq's CHAT reader apply to this PyCantonese release as well. The details are in PyLangAcq's changelog for v0.13.0. The changelog entries below only document updates specific to PyCantonese.
- Defined the `Jyutping` class to better represent parsed Jyutping romanization.
- Bumped the PyLangAcq dependency to v0.13.0.
- The function `parse_jyutping` now returns a list of `Jyutping` objects, rather than tuples of strings.
- The following methods in the `CHATReader` class have been deprecated:
  - `character_sents` (use `characters` with `by_utterances=True` instead)
  - `jyutping_sents` (use `jyutping` with `by_utterances=True` instead)
- The following arguments of the `search` method of `CHATReader` have been deprecated:
  - `sent_range` (use `utterance_range` instead)
  - `tagged` (use `by_tokens` instead)
  - `sents` (use `by_utterances` instead)
- Fixed the character issues in the built-in HKCanCor data: 𠺢, 𠺝, 𡁜, 𧕴, 𥊙, 𡃓, 𠴕, 𡀔
- Pinned pylangacq at 0.12.0 (the new 0.13.0 has breaking changes).
- Part-of-speech tagging:
  - Added the function `pos_tag` that takes a segmented sentence or phrase and returns its part-of-speech tags.
  - Added the function `hkcancor_to_ud` that maps a part-of-speech tag from the original HKCanCor annotated data to one of the tags from the Universal Dependencies v2 tagset.
- Word segmentation:
  - Improved segmentation quality by revising the underlying wordlist data.
- The test suite now covers code snippets in both the docstrings and `.rst` doc files.
- Fixed the issue of not opening text files with UTF-8 encoding (a possible issue on Windows).
- `jyutping_to_yale` and `parse_jyutping` now return a null value (rather than raise an error) when the input is null.
- The word segmentation function `segment` now strips all whitespace from the input unsegmented string before segmenting it.
- Word segmentation:
  - Segmentation is customizable for the following:
    - Maximum word length
    - A user-supplied list of words to allow as words
    - A user-supplied list of words to disallow as words
  - The default segmentation model has been improved with the rime-cantonese data (CC BY 4.0 license).
- Characters-to-Jyutping conversion:
  - The conversion returns results in a word-segmented form.
  - The conversion model has been improved with the rime-cantonese data (CC BY 4.0 license).
- Added the following functions; they are equivalent to their (now deprecated) `x2y` counterparts:
  - `characters_to_jyutping`
  - `jyutping_to_tipa`
  - `jyutping_to_yale`
- Added support for Python 3.9.
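
The three segmentation options listed above (maximum word length, words to allow, words to disallow) can be illustrated with a toy greedy longest-match segmenter. This is only a sketch of how such options might interact, not PyCantonese's segmentation model. The function `toy_segment`, its parameter names, and the sample vocabulary are all hypothetical.

```python
def toy_segment(text, vocabulary, max_word_length=4, allow=(), disallow=()):
    """Greedy longest-match segmentation with three customization knobs.

    Hypothetical illustration only; not PyCantonese's algorithm or API.
    """
    if max_word_length < 1:
        raise ValueError("max_word_length must be at least 1")
    # Allowed words are added to the vocabulary; disallowed words are removed.
    words = (set(vocabulary) | set(allow)) - set(disallow)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in words:
                result.append(candidate)
                i += length
                break
    return result


vocabulary = {"廣東話", "廣東", "好難"}
text = "廣東話好難學"

default = toy_segment(text, vocabulary)
short_words = toy_segment(text, vocabulary, max_word_length=2)
customized = toy_segment(text, vocabulary, allow={"難學"}, disallow={"好難"})
```

In this toy setup, lowering `max_word_length` prevents the three-character word from matching, while `allow` and `disallow` add and remove candidate words without touching the base vocabulary.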
- `jyutping_to_yale`: The default value of the keyword argument `as_list` has been changed from `False` to `True`, so that this function is now more in line with the other "jyutping_to_X" functions for returning a list.
- `characters_to_jyutping`: The returned value is now a list of segmented words, where each is a 2-tuple of (Cantonese characters, Jyutping). Previously, it was a list of Jyutping strings for the individual Cantonese characters.
- Switched documentation to the readthedocs theme and numpydoc docstring style.
- Improved CircleCI builds with orbs.
- The following `x2y` functions have been deprecated in favor of their equivalents named in the form of `x_to_y`:
  - `characters2jyutping`
  - `jyutping2tipa`
  - `jyutping2yale`
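
One common way to keep an old function name working while steering users to the new one is a thin wrapper that emits a `DeprecationWarning`. The sketch below shows that general pattern with stand-in functions. It is not a copy of how PyCantonese implements its deprecations, and `deprecated_alias`, `x_to_y`, and `x2y` are hypothetical names.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns, then delegates to `new_func`.

    Hypothetical helper illustrating a generic deprecation pattern.
    """
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def x_to_y(value):
    """Stand-in for a new-style function; it just upper-cases its input."""
    return value.upper()


x2y = deprecated_alias(x_to_y, "x2y")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    result = x2y("abc")
```

The old name returns the same result as the new one, so existing code keeps working while the warning signals the rename.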
- Turned on HTTPS for the pycantonese.org domain.
- Switched the `wordseg` dependency to a PyPI source instead of a GitHub direct link.
- Added the `characters2jyutping()` function for converting Cantonese characters to Jyutping romanization.
- Added the `segment()` function for word segmentation.
- Added support for Python 3.7 and 3.8.
- Dropped support for Python 3.4 and 3.5 (supporting 3.6, 3.7, and 3.8 now).
- 104 stop words.
- Exposed the `exclude` parameter in various reader methods for excluding specific participants. This parameter was implemented at pylangacq v0.10.0.
- Allowed "n" to be a syllabic nasal.
- Fixed corpus reader not picking up the characters.
- PyCantonese now requires Python 3.4 or above.
- Adopted the CHAT corpus format, piggybacking on PyLangAcq
- Converted HKCanCor into the CHAT format
- Switched to transparent function names (cf. issue #10): `parse_jyutping()`, `jyutping2yale()`, `jyutping2tipa()`
- Bug fixes: issues #6, #7, #8, #9
- Fixed the Jyutping-Yale conversion issue with "yu"
- Added `number_of_words()` and `number_of_characters()` for corpus access
- Forced all part-of-speech tags (both in searches and internal to corpus objects) in caps, in line with the NLTK convention
- Overall code restructuring
- Only Python 3.x is supported from this point onwards
- Used generators instead of lists for corpus access methods
- Added the part-of-speech search criterion
- Added Jyutping-to-Yale conversion
- Added Jyutping-to-TIPA conversion
- Disabled the function for reading a custom corpus dataset (it will come back)
- Fixed corpus access path issues
- The Hong Kong Cantonese Corpus is included in the package.
- A general-purpose `search()` function is defined, replacing the element-specific search functions from version 0.1.
- Basic functions available, including...
  - Parsing Jyutping romanization
  - Reading a tagged corpus data folder
  - Searching by a given element (onset/initial, nucleus, coda, final, character)
  - Searching by a character plus a range