reBabel is a library for converting between various language data files, such as ELAN .eaf files, FieldWorks Language Explorer .flextext files, Universal Dependencies .conllu files, and others.
Rather than create a separate converter for each pair of formats, it goes through an intermediary representation using a SQLite database and thus needs only one importer and one exporter for each file format.
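To make the saving concrete, here is a small arithmetic sketch. The function names are illustrative and not part of reBabel. With n formats, direct conversion needs a converter for every ordered pair of formats, while routing everything through a shared database needs one importer and one exporter per format.

```python
def pairwise_converters(n_formats: int) -> int:
    """Converters needed if every format converts directly to every other."""
    return n_formats * (n_formats - 1)


def hub_converters(n_formats: int) -> int:
    """Importers plus exporters needed when all formats share one hub."""
    return 2 * n_formats


# Compare the two designs for a few format counts.
for n in (3, 5, 10):
    print(n, pairwise_converters(n), hub_converters(n))
```

The two counts are equal at three formats, and the hub design needs fewer components for every format added after that.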
reBabel also provides other functionality, including merging data from different sources (say, from manual annotation and from a machine learning model), and a query and rewrite system.
To install this package locally, run:
$ pip3 install -e .
This package is written in pure Python and has no external dependencies apart from backports of standard library modules to older Python versions.
This package installs a command-line utility named rebabel-format, which can be invoked as follows:
$ rebabel-format ACTION config.toml
Common actions include import, export, query, and transform. Run rebabel-format --help for a complete list of available actions.
Configuration for the various actions is provided in a TOML file.
The parameters to a given action can be top-level keys or they can be under the name of the action, allowing a single config file to be used for multiple steps of a given workflow.
# the same database file will be used for all workflows
db = "demo.db"
# these next parameters only apply to the "import" action
[import]
mode = "conllu"
infiles = ["file1.conllu", "file2.conllu"]
# to write a query, we define some nodes
# in this case, S, N, and V
[query.S]
# S is a sentence
type = "sentence"
[query.N]
# N is a word
type = "word"
# N has a feature named UD:upos with the value NOUN
features = [{feature = "UD:upos", value = "NOUN"}]
# N is part of S
parent = "S"
[query.V]
# V is a word
type = "word"
# V is part of S
parent = "S"
# a different way of listing features
[[query.V.features]]
feature = "UD:upos"
value = "VERB"
[[query.V.features]]
feature = "UD:FEATS:Person"
value = "3"
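As a rough illustration of the lookup rule described before the example (a parameter may be a top-level key or sit under the action's name), the following sketch merges the two levels for one action. The function resolve_parameters, the ACTIONS set, and the rule that the action's own table wins on a conflict are assumptions made for illustration; reBabel's actual loader may behave differently. The dictionary stands in for the parsed TOML file.

```python
# Hypothetical set of action names, used to tell action tables
# apart from ordinary top-level parameters.
ACTIONS = {"import", "export", "query", "transform"}

# Stand-in for the parsed config file (abbreviated).
config = {
    "db": "demo.db",
    "import": {"mode": "conllu", "infiles": ["file1.conllu", "file2.conllu"]},
    "query": {"S": {"type": "sentence"}},
}


def resolve_parameters(config: dict, action: str) -> dict:
    """Merge top-level keys with the keys under the action's own table."""
    # Start from top-level keys that are not themselves action tables.
    params = {k: v for k, v in config.items() if k not in ACTIONS}
    # Assumption: values under the action's name override top-level ones.
    params.update(config.get(action, {}))
    return params


print(resolve_parameters(config, "import"))
```

Under these assumptions, the import action sees db alongside its own mode and infiles, which is what lets one file drive several steps of a workflow.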
Readers, writers, and processes are imported dynamically at startup:
import rebabel_format
rebabel_format.load_processes(True)
rebabel_format.load_readers(True)
rebabel_format.load_writers(True)
Each function takes a single parameter indicating whether or not plugins outside this module should be loaded.
Plugin packages can be created using entry point specifiers in the package metadata. The entry point can indicate either a module or a class (any subclass of Process, Reader, or Writer is automatically registered on import).
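One common way to get registration on import in Python is the __init_subclass__ hook. The sketch below shows that general technique only. The Reader class here is a hypothetical stand-in, and the registry and identifier attributes are invented for illustration; this text does not say which mechanism reBabel itself uses.

```python
class Reader:
    """Hypothetical stand-in for a plugin base class (not reBabel's own)."""

    registry = {}      # maps an identifier string to the subclass
    identifier = None  # subclasses set this to a unique name

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass, as soon as its class body is executed,
        # which happens when the defining module is imported.
        if cls.identifier is not None:
            Reader.registry[cls.identifier] = cls


class MyFormatReader(Reader):
    identifier = "myformat"


print(sorted(Reader.registry))
```

Because the hook runs at class creation time, an entry point that names only a module is enough: importing the module defines the subclasses, and defining them registers them.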
The loading functions check for the following plugin namespaces: rebabel.processes, rebabel.converters, rebabel.readers, and rebabel.writers.
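To see which plugins are installed under these namespaces, the standard library's importlib.metadata can be queried directly. This is a minimal sketch of entry point discovery in general, not reBabel's loading code; the find_plugins helper is a hypothetical name.

```python
from importlib.metadata import entry_points

NAMESPACES = [
    "rebabel.processes",
    "rebabel.converters",
    "rebabel.readers",
    "rebabel.writers",
]


def find_plugins() -> dict:
    """Return the entry point names installed under each namespace."""
    eps = entry_points()
    found = {}
    for group in NAMESPACES:
        if hasattr(eps, "select"):
            # Python 3.10 and later: the result supports select()
            selected = eps.select(group=group)
        else:
            # Python 3.8 and 3.9: the result is a dict keyed by group
            selected = eps.get(group, [])
        found[group] = sorted(ep.name for ep in selected)
    return found


print(find_plugins())
```

Each list is empty unless an installed package declares an entry point in the corresponding group.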
After loading, lists of available names can be retrieved as follows:
rebabel_format.get_process_names()
rebabel_format.get_reader_names()
rebabel_format.get_writer_names()
Different processes and format converters take different arguments. These arguments can be programmatically inspected using the following functions:
rebabel_format.get_process_parameters('import')
rebabel_format.get_reader_parameters('conllu')
rebabel_format.get_writer_parameters('flextext')
Each of these returns a dictionary with parameter names as keys and Parameter objects as values. Parameter objects have the properties required, default, type, and help. Any missing parameter which has required=True will cause ValueError to be raised.
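The required-parameter check can be pictured with a small stand-in. The Parameter dataclass below only mirrors the four properties named above, and fill_parameters and the example spec are hypothetical; none of this is reBabel's implementation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Parameter:
    """Hypothetical stand-in with the four properties named above."""
    required: bool = True
    default: Any = None
    type: Any = None
    help: str = ""


def fill_parameters(spec: dict, supplied: dict) -> dict:
    """Combine supplied values with defaults; reject missing required ones."""
    values = {}
    for name, param in spec.items():
        if name in supplied:
            values[name] = supplied[name]
        elif param.required:
            raise ValueError(f"missing required parameter: {name}")
        else:
            values[name] = param.default
    return values


# Invented spec for illustration only.
spec = {
    "db": Parameter(type=str, help="path to the database"),
    "root": Parameter(required=False, default="phrase", type=str),
}

print(fill_parameters(spec, {"db": "temp.db"}))
```

Calling fill_parameters(spec, {}) raises ValueError because db is required and has no supplied value.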
Processes can be invoked as follows:
# import in.conllu into temp.db
rebabel_format.run_command('import', mode='conllu', db='temp.db',
                           infiles=['in.conllu'])
rebabel_format.run_command(
    # export out.flextext from temp.db
    'export', mode='flextext', db='temp.db', outfile='out.flextext',
    # making some adjustments to account for differences between
    # CoNLL-U and FlexText
    mappings=[
        # use CoNLL-U sentence nodes where FlexText expects phrases
        {'in_type': 'sentence', 'out_type': 'phrase'},
        # use UD:lemma where FlexText wants FlexText:en:txt
        {'in_feature': 'UD:lemma', 'out_feature': 'FlexText:en:txt'},
    ],
    # settings specific to the FlexText writer:
    # the highest non-empty node will be the phrase
    # (the CoNLL-U importer currently doesn't create paragraph and document nodes)
    root='phrase',
    # the morpheme layer will also be empty
    skip=['morph'],
)