cleantmx

A small Python library to build NLP data cleaning pipelines

cleantmx is a lightweight tool for building data cleaning pipelines for NLP. It is designed to be flexible, and you can extend it with your own code.

It comes with several built-in filters of different types. Most operate on an individual segment, but some act on source/target text using information from both, for example to remove source/target pairs that are identical or that have mismatched segment lengths.
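To illustrate the difference between the two kinds of filter, here is a minimal standalone sketch in plain Python. It does not use the cleantmx API: every name in it (`normalize_whitespace`, `drop_identical`, `drop_length_mismatch`, `run_pipeline`, and the `max_ratio` parameter) is hypothetical and chosen only for this example.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]
SegmentFilter = Callable[[str], str]
PairFilter = Callable[[str, str], Optional[Pair]]


def normalize_whitespace(segment: str) -> str:
    """Segment-level filter: looks at one segment only."""
    return " ".join(segment.split())


def drop_identical(source: str, target: str) -> Optional[Pair]:
    """Pair-level filter: drop pairs whose two sides are the same text."""
    if source.strip().lower() == target.strip().lower():
        return None
    return source, target


def drop_length_mismatch(max_ratio: float = 3.0) -> PairFilter:
    """Pair-level filter factory: drop pairs whose character lengths
    differ by more than max_ratio (an arbitrary illustrative threshold)."""
    def _filter(source: str, target: str) -> Optional[Pair]:
        shorter, longer = sorted((len(source), len(target)))
        if shorter == 0 or longer / shorter > max_ratio:
            return None
        return source, target
    return _filter


def run_pipeline(
    pairs: List[Pair],
    segment_filters: List[SegmentFilter],
    pair_filters: List[PairFilter],
) -> List[Pair]:
    """Apply segment filters to each side, then pair filters to the pair."""
    cleaned = []
    for source, target in pairs:
        for seg_filter in segment_filters:
            source, target = seg_filter(source), seg_filter(target)
        result: Optional[Pair] = (source, target)
        for pair_filter in pair_filters:
            result = pair_filter(*result)
            if result is None:  # pair was rejected, stop filtering it
                break
        if result is not None:
            cleaned.append(result)
    return cleaned
```

A segment-level filter is just a function of one string, while a pair-level filter sees both sides and can reject the pair by returning `None`. Your own filters would follow whatever interface the library actually defines.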

See examples/process_tmx.py for an example of reading in a .tmx file of English-Swedish pairs, cleaning it, then saving two segment-aligned text files.
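For orientation, the read-and-save steps of such a workflow can be sketched with the standard library alone. This is not the contents of examples/process_tmx.py and does not use cleantmx; the function names `read_tmx_pairs` and `write_aligned` are hypothetical, and the sketch assumes a conventional TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child).

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(path: str, src_lang: str = "en", tgt_lang: str = "sv") -> List[Tuple[str, str]]:
    """Collect (source, target) segment pairs from a TMX file."""
    pairs = []
    for tu in ET.parse(path).getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "sv-SE" matches "sv".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def write_aligned(pairs: List[Tuple[str, str]], src_path: str, tgt_path: str) -> None:
    """Write one segment per line so line N of both files is the same pair."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            # Flatten embedded newlines, which would break the alignment.
            src.write(source.replace("\n", " ") + "\n")
            tgt.write(target.replace("\n", " ") + "\n")
```

The cleaning step would sit between the two calls, operating on the list of pairs before it is written out.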