cleantmx

A small Python library to build NLP data cleaning pipelines

This library is meant to be a lightweight but powerful tool to help create data cleaning pipelines for NLP. It's designed to be flexible, and you can easily extend it with your own code.

It comes with several built-in filters of different types. Most operate on a single segment at a time, but some modify or remove source/target text using information from both sides, for example to drop pairs whose source and target are identical or whose segment lengths are mismatched.
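To make the two kinds of filter concrete, here is a minimal standalone sketch. It does not use cleantmx's actual API: every name below (`normalize_whitespace`, `drop_identical`, `drop_length_mismatch`, `run_pipeline`, and the `max_ratio` parameter) is hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# NOTE: all names here are hypothetical; this is not the cleantmx API.
Pair = Tuple[str, str]
SegmentFilter = Callable[[str], str]
PairFilter = Callable[[str, str], Optional[Pair]]


def normalize_whitespace(segment: str) -> str:
    """Segment-level filter: looks at one segment only."""
    return " ".join(segment.split())


def drop_identical(source: str, target: str) -> Optional[Pair]:
    """Pair-level filter: returns None when source and target are the same text."""
    if source.strip().lower() == target.strip().lower():
        return None
    return source, target


def drop_length_mismatch(max_ratio: float = 3.0) -> PairFilter:
    """Pair-level filter factory: drops pairs whose character lengths differ too much."""
    def _filter(source: str, target: str) -> Optional[Pair]:
        shorter, longer = sorted((len(source), len(target)))
        if shorter == 0 or longer / shorter > max_ratio:
            return None
        return source, target
    return _filter


def run_pipeline(
    pairs: Iterable[Pair],
    segment_filters: Iterable[SegmentFilter],
    pair_filters: Iterable[PairFilter],
) -> Iterator[Pair]:
    """Apply segment filters to both sides, then pair filters; yield surviving pairs."""
    segment_filters = list(segment_filters)
    pair_filters = list(pair_filters)
    for source, target in pairs:
        for seg_filter in segment_filters:
            source, target = seg_filter(source), seg_filter(target)
        kept: Optional[Pair] = (source, target)
        for pair_filter in pair_filters:
            kept = pair_filter(*kept)
            if kept is None:
                break  # the pair was rejected; skip the remaining filters
        if kept is not None:
            yield kept
```

Returning `None` to signal "drop this pair" keeps each filter a plain function, so adding your own is a matter of writing one more callable and appending it to the list.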

See examples/process_tmx.py for an example of reading in a .tmx file of English-Swedish pairs, cleaning it, then saving two segment-aligned text files.
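For orientation, the read and write steps around the cleaning stage can be sketched with the standard library alone. This is not the contents of examples/process_tmx.py: the function names, the default language codes, and the output paths are assumptions made for illustration, and the cleaning step itself is left out.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_source, src_lang: str = "en", tgt_lang: str = "sv") -> Iterator[Tuple[str, str]]:
    """Yield (source, target) text for each translation unit that has both languages.

    Hypothetical helper: tmx_source may be a file path or a binary file object.
    """
    tree = ET.parse(tmx_source)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "en-US" is treated as "en".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]


def write_aligned(pairs: Iterable[Tuple[str, str]], src_path: str, tgt_path: str) -> int:
    """Write one segment per line to two files so that line N of each file is a pair."""
    count = 0
    with open(src_path, "w", encoding="utf-8") as src_file, \
         open(tgt_path, "w", encoding="utf-8") as tgt_file:
        for source, target in pairs:
            # Embedded newlines would break the line-by-line alignment.
            src_file.write(source.replace("\n", " ") + "\n")
            tgt_file.write(target.replace("\n", " ") + "\n")
            count += 1
    return count
```

Because both files are written in the same loop, they stay aligned as long as any cleaning applied between reading and writing drops or keeps a pair as a whole.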


Numeri/cleantmx
