Releases: jacksonllee/pycantonese
v3.4.0
[3.4.0] - 2021-12-28
Added
- Added the `parse_text` function for analyzing Cantonese text data.
- Characters-to-Jyutping conversion: The `characters_to_jyutping` function now has the `segmenter` kwarg for customizing word segmentation.
- Added support for Python 3.10.
- Turned on Windows testing on CircleCI.
- Added `pyproject.toml`. Related to preferring `setup.cfg` for specifying build metadata and options.
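To illustrate why a `segmenter` kwarg matters for characters-to-Jyutping conversion, here is a minimal standalone sketch. It does not call PyCantonese: `convert`, `per_character`, `custom_segmenter`, and `TOY_LEXICON` are hypothetical names, and the sketch only assumes that word boundaries decide which lexicon entry supplies the romanization.

```python
# Standalone sketch, not PyCantonese's implementation.
# TOY_LEXICON is a hypothetical, hand-written lookup table.
TOY_LEXICON = {
    "香港": "hoeng1gong2",
    "香": "hoeng1",
    "港": "gong2",
    "人": "jan4",
}


def per_character(text):
    """Hypothetical default segmenter: one word per character."""
    return list(text)


def custom_segmenter(text):
    """Hypothetical custom segmenter that keeps 香港 as a single word."""
    return ["香港", "人"] if text == "香港人" else list(text)


def convert(text, segmenter=per_character):
    """Segment the text, then look each word up in the toy lexicon."""
    return [(word, TOY_LEXICON.get(word)) for word in segmenter(text)]


default_result = convert("香港人")
custom_result = convert("香港人", segmenter=custom_segmenter)
```

Swapping the segmenter changes the word boundaries and therefore the shape of the returned pairs, which is the kind of control the new kwarg is described as giving.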
Changed
- Characters-to-Jyutping conversion: For the `characters_to_jyutping` function, in case rime-cantonese and HKCanCor don't agree, rime-cantonese data (more accurate) is preferred.
- Updated the rime-cantonese data to the latest `2021.05.16` release, improving both characters-to-Jyutping conversion and word segmentation.
- Updated the PyLangAcq dependency to v0.16.0, allowing PyCantonese's `CHATReader` to use the new methods `to_chat`, `to_strs`, `info`, `head`, and `tail`.
- Switched to `setup.cfg` to fully specify build metadata and options, while keeping a minimal `setup.py` for backward compatibility. Related to the new `pyproject.toml`.
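The preference for rime-cantonese over HKCanCor on disagreement can be pictured as a dictionary merge with a fixed precedence. The sketch below is an assumption about the general idea only, not the library's code; `merge_readings` is a hypothetical helper and the readings are toy values, not entries taken from either data source.

```python
def merge_readings(hkcancor, rime_cantonese):
    """Combine two word-to-Jyutping mappings; rime-cantonese wins disagreements."""
    merged = dict(hkcancor)
    # Entries from rime_cantonese overwrite any conflicting HKCanCor entries.
    merged.update(rime_cantonese)
    return merged


# Toy values for illustration only.
toy_hkcancor = {"時間": "si4gaan3", "廣東話": "gwong2dung1waa2"}
toy_rime = {"時間": "si4gaan1", "香港": "hoeng1gong2"}

merged = merge_readings(toy_hkcancor, toy_rime)
```

Words known to only one source are kept as they are; only conflicting entries are resolved in favor of the second mapping.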
Removed
- Dropped support for Python 3.6.
Security
- Turned on `safety` and `bandit` checks at CircleCI builds.
v3.3.1
[3.3.1] - 2021-05-14
Fixed
- Allowed PyLangAcq v0.14.* for real.
v3.3.0
[3.3.0] - 2021-05-14
Changed
- Allowed PyLangAcq v0.14.*, thereby adding the new features of the `filter` method to `CHATReader` and optional parallelization for CHAT data processing.
Fixed
- Fixed the `search` method of `CHATReader` when `by_tokens` is `False`.
v3.2.4
[3.2.4] - 2021-05-07
Fixed
- Fixed the previously inoperative methods `append`, `append_left`, `extend`, and `extend_left` of the class `CHATReader` through the upstream PyLangAcq package.
- Retrained the part-of-speech tagger, after the minor character fix from v3.2.3.
- Raised `NotImplementedError` for the method `ipsyn` of `CHATReader`, since the upstream method works only for English.
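The `ipsyn` change follows a common pattern: a subclass overrides an inherited method that does not apply to it and raises `NotImplementedError`. The sketch below shows that pattern with hypothetical stand-in classes (`UpstreamReader`, `CantoneseReader`); it is not the actual PyLangAcq or PyCantonese code.

```python
class UpstreamReader:
    """Hypothetical stand-in for an upstream reader class."""

    def ipsyn(self):
        return "a metric defined for English data"


class CantoneseReader(UpstreamReader):
    """Hypothetical subclass for a language the metric does not cover."""

    def ipsyn(self):
        # Fail loudly instead of returning a misleading number.
        raise NotImplementedError("ipsyn works only for English data")


try:
    CantoneseReader().ipsyn()
    raised = False
except NotImplementedError:
    raised = True
```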
v3.2.3
[3.2.3] - 2021-04-12
Fixed
- Fixed character issues in the built-in HKCanCor data: 𥄫
v3.2.2
[3.2.2] - 2021-03-23
Fixed
- Fixed a CHAT parsing issue when correction and repetition are combined,
by bumping the pylangacq dependency from v0.13.0 to v0.13.1.
v3.2.1
[3.2.1] - 2021-03-21
Fixed
- Fixed character issues in the built-in HKCanCor data: 𠮩𠹌, 𠻗
v3.2.0
[3.2.0] - 2021-03-20
Note: The underlying CHAT parser, the PyLangAcq package, has been bumped to v0.13.0.
All of the updates of PyLangAcq's CHAT reader apply to this PyCantonese release as well.
The details are in PyLangAcq's changelog for v0.13.0.
The changelog entries below only document updates specific to PyCantonese.
Added
- Defined the `Jyutping` class to better represent parsed Jyutping romanization.
Changed
- Bumped the PyLangAcq dependency to v0.13.0.
- The function `parse_jyutping` now returns a list of `Jyutping` objects, rather than tuples of strings.
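To show what moving from tuples of strings to objects buys a caller, here is a small standalone sketch. `ToyJyutping` is a hypothetical class written for this note, and its field names (`onset`, `nucleus`, `coda`, `tone`) are an assumption made for illustration; consult the library's documentation for the real `Jyutping` class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyJyutping:
    """Hypothetical stand-in for a parsed Jyutping syllable."""

    onset: str
    nucleus: str
    coda: str
    tone: str

    def __str__(self):
        return f"{self.onset}{self.nucleus}{self.coda}{self.tone}"


# Tuple style: parts are reachable only by position.
old_style = ("h", "oe", "ng", "1")

# Object style: parts are reachable by name.
new_style = ToyJyutping(*old_style)
```

Code that previously indexed into a tuple (for example `old_style[1]`) would read a named attribute instead (`new_style.nucleus`).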
Deprecated
- The following methods in the `CHATReader` class have been deprecated:
  - `character_sents` (use `characters` with `by_utterances=True` instead)
  - `jyutping_sents` (use `jyutping` with `by_utterances=True` instead)
- The following arguments of the `search` method of `CHATReader` have been deprecated:
  - `sent_range` (use `utterance_range` instead)
  - `tagged` (use `by_tokens` instead)
  - `sents` (use `by_utterances` instead)
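One way a caller-side shim could handle the renamed `search` arguments is sketched below. `translate_kwargs` is a hypothetical helper, not part of PyCantonese, and it assumes that each value carries over unchanged under the new name, which the notes above do not state.

```python
import warnings

# Old keyword -> new keyword, taken from the deprecation list above.
RENAMED_KWARGS = {
    "sent_range": "utterance_range",
    "tagged": "by_tokens",
    "sents": "by_utterances",
}


def translate_kwargs(kwargs):
    """Return a copy of kwargs with deprecated names replaced, warning per name."""
    translated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_KWARGS.get(name)
        if new_name is None:
            translated[name] = value
            continue
        warnings.warn(
            f"'{name}' is deprecated; use '{new_name}' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the value means the same thing under the new name.
        translated[new_name] = value
    return translated


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "character" is just an arbitrary non-deprecated key for the demo.
    result = translate_kwargs({"tagged": True, "character": "機"})
```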
Fixed
- Fixed the character issues in the built-in HKCanCor data: 𠺢, 𠺝, 𡁜, 𧕴, 𥊙, 𡃓, 𠴕, 𡀔
v3.1.1
[3.1.1] - 2021-03-18
Fixed
- Pinned pylangacq at 0.12.0 (the new 0.13.0 has breaking changes).
v3.1.0
[3.1.0] - 2021-02-21
Added
- Part-of-speech tagging:
  - Added the function `pos_tag` that takes a segmented sentence or phrase and returns its part-of-speech tags.
  - Added the function `hkcancor_to_ud` that maps a part-of-speech tag from the original HKCanCor annotated data to one of the tags from the Universal Dependencies v2 tagset.
- Word segmentation:
  - Improved segmentation quality by revising the underlying wordlist data.
- The test suite now covers code snippets in both the docstrings and `.rst` doc files.
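Mapping a corpus-specific tagset onto Universal Dependencies tags is essentially a table lookup. The sketch below is a standalone illustration: `toy_hkcancor_to_ud` and `TOY_HKCANCOR_TO_UD` are hypothetical, the table is a small assumed subset rather than the library's mapping, and the fallback to `X` for unknown tags is likewise an assumption.

```python
# Small illustrative subset; assumed, not copied from PyCantonese's mapping.
TOY_HKCANCOR_TO_UD = {
    "A": "ADJ",
    "D": "ADV",
    "N": "NOUN",
    "R": "PRON",
    "V": "VERB",
}


def toy_hkcancor_to_ud(tag):
    """Map an HKCanCor-style tag to a UD v2 tag; unknown tags become 'X'."""
    return TOY_HKCANCOR_TO_UD.get(tag.upper(), "X")


# A pre-segmented phrase with hand-assigned toy tags.
tagged = [("我", "R"), ("食", "V"), ("飯", "N")]
converted = [(word, toy_hkcancor_to_ud(tag)) for word, tag in tagged]
```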
Fixed
- Fixed the issue of not opening text files with UTF-8 encoding (a possible issue on Windows).
- `jyutping_to_yale` and `parse_jyutping` now return a null value (rather than raise an error) when the input is null.
- The word segmentation function `segment` now strips all whitespace from the input unsegmented string before segmenting it.
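The last two fixes describe input handling that is easy to sketch in isolation. The helpers below (`strip_all_whitespace`, `null_safe`, `toy_upper`) are hypothetical and only illustrate the behavior described; the sketch assumes "null" means Python's `None`.

```python
import functools


def strip_all_whitespace(text):
    """Remove every whitespace character, including tabs and newlines."""
    return "".join(text.split())


def null_safe(func):
    """Wrap func so that a None input yields None instead of an error."""

    @functools.wraps(func)
    def wrapper(value):
        if value is None:
            return None
        return func(value)

    return wrapper


@null_safe
def toy_upper(text):
    """Hypothetical conversion function used only to exercise the wrapper."""
    return text.upper()


cleaned = strip_all_whitespace("廣東話 \t好\n正")
```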