Cjklib provides language routines related to Han characters (characters based on Chinese characters, named Hanzi, Kanji, Hanja and chu Han respectively) used in writing Chinese, Japanese, infrequently Korean, and formerly Vietnamese. It includes functionality for character pronunciations, radicals, glyph components, stroke decomposition and variant information.
- Python 2.4 or above (currently no support for Python3)
- SQLite 3+
- SQLAlchemy 0.5+
- pysqlite2 (already ships with Python 2.5 and above)
Alternatively for MySQL as backend:
conda create -n py37 python=3.7
conda activate py37
git clone <git-repo-url>
cd cjklib3
pip install 2to3
2to3 -w .
curl http://ftp.unicode.org/Public/UNIDATA/Unihan.zip -o Unihan.zip
python -m cjklib.build.cli build cjklibData --attach= --database=sqlite:///cjklib/cjklib.db
pip install .
installcjkdict --download CEDICT
python -m cjklib.build.cli build fullCEDICT --attach=sqlite:///cjklib/cjklib.db --database=sqlite:///cjklib/cjklib.db
pip install .
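One way to check that the build step produced a populated database is to list the tables in the resulting SQLite file. The following is a minimal sketch using only the standard library; the helper name list_tables is made up for illustration, and the path is the one passed to --database in the commands above.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path, sorted."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Path taken from the --database option used in the build commands.
    path = "cjklib/cjklib.db"
    if os.path.exists(path):
        for table in list_tables(path):
            print(table)
    else:
        print("No database found at", path)
```

An empty list would suggest that the build did not write to the file you expected.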
Install cjklib using the provided .exe installer. Make sure the above dependencies are satisfied. Three scripts, cjknife.exe, buildcjkdb.exe, and installcjkdict.exe, will be added to the Python Scripts sub-directory. Make sure this directory is included in your PATH environment variable to access these programs from the command line.
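To confirm that the scripts are reachable from the command line, you can ask Python to search PATH for them. This is a small sketch using the standard library's shutil.which; the helper name missing_scripts is hypothetical and not part of cjklib.

```python
import shutil

# Script names as listed above. On Windows, shutil.which also tries the
# extensions from PATHEXT, so ".exe" does not need to be spelled out.
CJKLIB_SCRIPTS = ["cjknife", "buildcjkdb", "installcjkdict"]


def missing_scripts(names=CJKLIB_SCRIPTS):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_scripts()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All cjklib scripts are reachable from the command line.")
```

If any script is reported missing, the Scripts sub-directory is most likely not on PATH yet.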
CJK dictionaries are not included by default. To install one, run the following (with an Internet connection) from the root directory of the source package:
$ installcjkdict CEDICT
This will download CEDICT, create a SQLite database file and install it under the directory given by the APPDATA environment variable, e.g. C:\windows\profiles\MY_USER\Application Data\cjklib. To install another supported dictionary, replace CEDICT with its name (EDICT, CEDICT, HanDeDict, CFDICT or CEDICTGR).
If you are installing from the source package you need to deploy the library on your system:
$ sudo python setup.py install
Also make sure the above dependencies are satisfied. CJK dictionaries are not included by default. To install one, run the following (with an Internet connection):
$ sudo installcjkdict CEDICT
This will download CEDICT, create a SQLite database file and install it to /usr/local/share/cjklib. To install another supported dictionary, replace CEDICT with its name (EDICT, CEDICT, HanDeDict, CFDICT or CEDICTGR).
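If you want several dictionaries, the same command can be generated for each name. The sketch below only assumes the installcjkdict NAME form shown above; the helpers install_command and install_all are made up for illustration and default to a dry run that prints the commands without executing them.

```python
import subprocess

# Dictionary names as listed in the text above.
SUPPORTED = ("EDICT", "CEDICT", "HanDeDict", "CFDICT", "CEDICTGR")


def install_command(name):
    """Return the installcjkdict argument list for one supported dictionary."""
    if name not in SUPPORTED:
        raise ValueError("unsupported dictionary: %s" % name)
    return ["installcjkdict", name]


def install_all(names, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for name in names:
        command = install_command(name)
        print(" ".join(command))
        if not dry_run:
            # Needs an Internet connection and write access to the target.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    install_all(["CEDICT", "EDICT"])
```

For a system-wide installation you would still need the privileges that the sudo call above provides.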
Documentation is available online. Also see the project page and its wiki.
There is a small command line tool, cjknife, that offers some of the library's functions. See cjknife --help for an overview.
Get stroke order of characters:
>>> from cjklib import characterlookup
>>> cjk = characterlookup.CharacterLookup('C')
>>> cjk.getStrokeOrder(u'说')
[u'㇔', u'㇊', u'㇔', u'㇒', u'㇑', u'㇕', u'㇐', u'㇓', u'㇟']
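Since getStrokeOrder returns one list entry per stroke, a stroke count is just the length of that list. The helper below is a hypothetical convenience wrapper, not part of cjklib; it assumes only that the object passed in has a getStrokeOrder(char) method that returns a list, as in the example above.

```python
def stroke_counts(lookup, text):
    """Map each character of text to the length of its stroke order list.

    `lookup` is any object with a getStrokeOrder(char) method that returns
    a list of strokes, such as the CharacterLookup instance created above.
    """
    return {char: len(lookup.getStrokeOrder(char)) for char in text}


# Usage with cjklib installed and its database built:
#
#     from cjklib import characterlookup
#     cjk = characterlookup.CharacterLookup('C')
#     stroke_counts(cjk, '说')
```

This sketch does not handle characters for which the database has no stroke data; how cjklib reports that case should be checked in its documentation.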
Access a dictionary (here using Jim Breen's EDICT):
>>> from cjklib.dictionary import EDICT
>>> d = EDICT()
>>> d.getForTranslation('Tokyo')
[EntryTuple(Headword=u'東京', Reading=u'とうきょう', Translation=u'/(n) Tokyo (current capital of Japan)/(P)/')]
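The Translation field in the result above is a single slash-delimited string. If you need the individual glosses, it can be split as sketched here. The helper split_glosses is hypothetical, and the namedtuple is only a stand-in with the same field names as the entry shown above so that the snippet runs without cjklib.

```python
from collections import namedtuple

# Stand-in with the field names seen in the EDICT result above.
EntryTuple = namedtuple("EntryTuple", ["Headword", "Reading", "Translation"])


def split_glosses(translation):
    """Split a slash-delimited translation string into its non-empty parts."""
    return [part for part in translation.split("/") if part]


if __name__ == "__main__":
    entry = EntryTuple(
        Headword="東京",
        Reading="とうきょう",
        Translation="/(n) Tokyo (current capital of Japan)/(P)/",
    )
    for gloss in split_glosses(entry.Translation):
        print(entry.Headword, entry.Reading, gloss)
```

With a real dictionary instance, the same function could be applied to each entry returned by getForTranslation.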
Packaged versions of the library will ship with a pre-built SQLite database file. You can however easily rebuild the database yourself.
First download the newest Unihan file:
$ wget ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip
Then start the build process:
$ sudo buildcjkdb -r build cjklibData
SQLite by default has no Unicode support for string operations. Optionally, the ICU library can be compiled in for handling alphabetic non-ASCII characters. Cjklib can register its own Unicode functions if ICU support is missing. Queries with LIKE will then use the function lower(). This compatibility mode has a negative impact on performance, and as it is not needed for dictionaries like EDICT or CEDICT, it is disabled by default. See cjklib.conf for how to enable it.
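The general technique of registering a Unicode-aware lower() can be illustrated with Python's sqlite3 module. This is a sketch of the idea only, not cjklib's actual implementation; the table words, its column, and the helper names are assumptions made for the example.

```python
import sqlite3


def unicode_lower(value):
    """Lowercase with Python's Unicode-aware str.lower(); pass NULL through."""
    return value.lower() if isinstance(value, str) else value


def connect_with_unicode_lower(path=":memory:"):
    """Open a connection whose SQL lower() is backed by unicode_lower."""
    conn = sqlite3.connect(path)
    # Overrides SQLite's built-in lower() for this connection only.
    conn.create_function("lower", 1, unicode_lower)
    return conn


def search(conn, pattern):
    """LIKE search that routes both the column and the pattern through lower()."""
    rows = conn.execute(
        "SELECT word FROM words WHERE lower(word) LIKE lower(?) ORDER BY word",
        (pattern,),
    )
    return [word for (word,) in rows]


if __name__ == "__main__":
    conn = connect_with_unicode_lower()
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [("ÄPFEL",), ("Birne",)])
    print(search(conn, "äpf%"))  # matches the row 'ÄPFEL'
```

Every comparison now calls back into Python, which is one reason such a compatibility mode costs performance.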
With MySQL 5, the following CREATE command creates a database with utf8 as character set using the general Unicode collation (MySQL from 5.5.3 on supports full Unicode given character set utf8mb4 and collation utf8mb4_bin):
CREATE DATABASE cjklib DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
You might need to set access rights, too (substitute user_name and host_name):
GRANT ALL ON cjklib.* TO 'user_name'@'host_name';
Now update the settings in cjklib.conf.
MySQL < 5.5 doesn't support full UTF-8. Its utf8 character set stores at most 3 bytes per character, so characters outside the Basic Multilingual Plane (BMP) can't be encoded. Building the Unihan database might therefore result in warnings, and characters above U+FFFF can't be built at all. You need to disable building the full character range by setting wideBuild to False in cjklib.conf before building. Alternatively, pass --wideBuild=False to buildcjkdb.
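The 3-byte limit can be checked directly in Python: code points up to U+FFFF need at most three bytes in UTF-8, while anything above needs four. The helper names below are made up for illustration; the two sample characters are U+8AAA (inside the BMP) and U+20000 (outside it).

```python
def utf8_length(char):
    """Number of bytes the single character takes in UTF-8."""
    return len(char.encode("utf-8"))


def fits_three_byte_utf8(text):
    """True if every character lies in the BMP (code point <= U+FFFF)."""
    return all(ord(char) <= 0xFFFF for char in text)


if __name__ == "__main__":
    bmp_char = "\u8aaa"        # inside the BMP
    wide_char = "\U00020000"   # CJK Extension B, outside the BMP
    print(utf8_length(bmp_char), fits_three_byte_utf8(bmp_char))    # 3 True
    print(utf8_length(wide_char), fits_three_byte_utf8(wide_char))  # 4 False
```

Characters for which fits_three_byte_utf8 returns False are the ones that a build with wideBuild set to False leaves out.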
For help or discussions on cjklib, join the project's mailing list.
Please report bugs to the project's bug tracker.