cleantmx

A small Python library to build NLP data cleaning pipelines

cleantmx is a lightweight tool for building data cleaning pipelines for NLP. It is designed to be flexible, and you can extend it with your own code.

It comes with several built-in filters of different types. Most operate on an individual segment, but some act on source/target pairs using information from both sides (for example, to remove pairs that are identical or whose segment lengths are mismatched).
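To make the two filter types concrete, here is a minimal sketch written with plain Python functions. It does not use the cleantmx API: every name in it (`normalize_whitespace`, `drop_identical`, `drop_length_mismatch`, `run_pipeline`, and the `max_ratio` parameter) is hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def normalize_whitespace(segment: str) -> str:
    """Segment-level filter: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", segment).strip()


def drop_identical(pair: Pair) -> Optional[Pair]:
    """Pair-level filter: reject pairs whose source and target are the same."""
    src, tgt = pair
    return None if src == tgt else pair


def drop_length_mismatch(pair: Pair, max_ratio: float = 3.0) -> Optional[Pair]:
    """Pair-level filter: reject pairs whose character lengths differ too much.

    max_ratio is an assumed threshold, not a value taken from the library.
    """
    shorter, longer = sorted((len(pair[0]), len(pair[1])))
    if shorter == 0 or longer / shorter > max_ratio:
        return None
    return pair


def run_pipeline(
    pairs: Iterable[Pair],
    segment_filters: List[Callable[[str], str]],
    pair_filters: List[Callable[[Pair], Optional[Pair]]],
) -> List[Pair]:
    """Apply segment filters to each side, then pair filters to the result."""
    cleaned: List[Pair] = []
    for src, tgt in pairs:
        for seg_filter in segment_filters:
            src, tgt = seg_filter(src), seg_filter(tgt)
        pair: Optional[Pair] = (src, tgt)
        for pair_filter in pair_filters:
            pair = pair_filter(pair)
            if pair is None:  # a filter rejected the pair, stop early
                break
        if pair is not None:
            cleaned.append(pair)
    return cleaned


raw_pairs = [
    ("Hello   world", "Hej världen"),  # kept after whitespace cleanup
    ("OK", "OK"),  # identical source and target
    ("Yes", "Ja, det stämmer verkligen alldeles utmärkt"),  # length mismatch
]
cleaned = run_pipeline(
    raw_pairs,
    segment_filters=[normalize_whitespace],
    pair_filters=[drop_identical, drop_length_mismatch],
)
```

Running segment-level filters before pair-level ones means the pair checks compare normalized text, so stray whitespace cannot hide an otherwise identical pair.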

See examples/process_tmx.py for an example of reading in a .tmx file of English-Swedish pairs, cleaning it, then saving two segment-aligned text files.
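For readers unfamiliar with the format, the sketch below shows one way to pull English-Swedish pairs out of TMX content and write two segment-aligned files using only the Python standard library. It is not a reproduction of examples/process_tmx.py and does not use the cleantmx API; `read_pairs`, `write_aligned`, and the sample data are hypothetical, and language codes are assumed to match exactly (`en`, `sv`) after lowercasing.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny made-up TMX document used only for this illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="sv"><seg>God morgon</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>No translation here</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(
    tmx_text: str, src_lang: str = "en", tgt_lang: str = "sv"
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for translation units that have both languages."""
    root = ET.fromstring(tmx_text)
    pairs: List[Tuple[str, str]] = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text nested inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def write_aligned(pairs: List[Tuple[str, str]], src_path: str, tgt_path: str) -> None:
    """Write one segment per line so line N of each file belongs to the same pair."""
    with open(src_path, "w", encoding="utf-8") as src_file, \
            open(tgt_path, "w", encoding="utf-8") as tgt_file:
        for src, tgt in pairs:
            src_file.write(src + "\n")
            tgt_file.write(tgt + "\n")


pairs = read_pairs(SAMPLE_TMX)
```

A cleaning step would sit between `read_pairs` and `write_aligned`; calling `write_aligned(pairs, "corpus.en", "corpus.sv")` (file names are placeholders) then produces the two aligned text files.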
