Monocorpus: Tatar Language Monocorpus Development Tools

Overview

The Monocorpus project aims to provide tools for developing a Tatar language monocorpus. The project includes functionality to extract texts from books and save them in files.

Features

Extract text from EPUB and PDF files.
Post-processing of extracted text to remove unwanted characters (e.g. OCR artifacts). More precisely, the following steps are performed:
- Remove sudden ASCII chars in the tatar word (e.g. с[0x0063)]у --> с[0x0441]у)
- Remove sudden non-ASCII chars in the non-tatar word (e.g. а[0x0430]rm --> a[0x0061]rm)
- Unify punctuation marks by replacing look-alikes with a single variant (e.g. '»' | '«' | '“' | '”' | '„' --> '"')
- Remove unwanted characters (e.g. '•')
- Remove sudden digits at the end of the word (e.g. башына2 —> башына)

Getting Started

To get started with the project, follow these steps:

Clone the Repository:

git clone https://github.com/neurotatarlar/monocorpus.git
cd monocorpus

Prepare Python Environment:

Make sure you have Python 3.x installed on your system.
Create and activate a virtual environment (optional but recommended):

python3 -m venv venv
source venv/bin/activate

Install the required dependencies:

pip install -r requirements.txt

Extract Texts from Books:

Place your book(s) into the workdir/000_entry_point folder. Currently we support EPUB and PDF formats.
Run the script to extract texts:

python src/main.py extract

Proces dirty extracted texts further:

Run the script to process extracted texts:

python src/main.py process

Explore the Output:

Processed text files will be saved in the workdir/900_artifacts directory.

Project Structure

src/: Contains the main script for text extraction and processing.
workdir/000_entry_point: Place your books here for text extraction.
workdir/900_artifacts: Processed text files will be saved here.
requirements.txt: List of required Python dependencies.

Contributing

Contributions are welcome! If you'd like to contribute to the project, make your changes and submit a pull request detailing the changes made.

License

This project is licensed under the MIT License - see the LICENSE file for details.

todo does not support encrypted documents

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Monocorpus: Tatar Language Monocorpus Development Tools

Overview

Features

Getting Started

Project Structure

Contributing

License

todo does not support encrypted documents

no embedded files

Files

README.md

Latest commit

History

README.md

File metadata and controls

Monocorpus: Tatar Language Monocorpus Development Tools

Overview

Features

Getting Started

Project Structure

Contributing

License

todo does not support encrypted documents

no embedded files