This Python project generates training data for detecting columns or text blocks in Tibetan texts by embedding Tibetan text into images.
It includes functions to create lorem-ipsum-like Tibetan text, read random Tibetan text files from a directory, and compute and embed text within specified bounding boxes in images. The project handles Tibetan script so that it is displayed and formatted properly within the images.
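As an illustration of how text can be fitted into a bounding box, here is a minimal sketch that wraps text to a given pixel width and trims it to the box height. It is not the project's actual implementation: the function names (split_breakable, wrap_to_width, fit_lines_to_box) and the measure callback are hypothetical. The sketch assumes that lines may break after a tsheg (U+0F0B, the Tibetan syllable delimiter) or after whitespace.

```python
from typing import Callable, List

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark; assumed to be a valid line-break point


def split_breakable(text: str) -> List[str]:
    """Split text into chunks that each end at a tsheg or a whitespace character."""
    chunks, current = [], ""
    for ch in text:
        current += ch
        if ch == TSHEG or ch.isspace():
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks


def wrap_to_width(text: str, max_width: float,
                  measure: Callable[[str], float]) -> List[str]:
    """Greedily pack chunks into lines whose measured width stays within max_width.

    `measure` is a hypothetical callback that returns the rendered width of a string.
    A single chunk wider than max_width is still placed on its own line.
    """
    lines, current = [], ""
    for chunk in split_breakable(text):
        candidate = current + chunk
        if current and measure(candidate.rstrip()) > max_width:
            lines.append(current.rstrip())
            current = chunk
        else:
            current = candidate
    if current.strip():
        lines.append(current.rstrip())
    return lines


def fit_lines_to_box(lines: List[str], box_height: float,
                     line_height: float) -> List[str]:
    """Keep only as many lines as fit vertically into the box."""
    max_lines = int(box_height // line_height)
    return lines[:max_lines]
```

With Pillow, measure could be something like lambda s: draw.textlength(s, font=font), where draw is an ImageDraw.Draw instance and font is loaded from a Tibetan-capable font file. Keeping the measurement behind a callback makes the wrapping logic testable without a font.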
- Automated Data Generation: Simplifies the process of generating training data for Tibetan NLP tasks.
- Customizable Input: Allows users to specify input parameters such as images, labels, and directories for backgrounds and corpora.
- Image Processing: Utilizes the PIL library for image manipulation.
- Bounding Box Preparation: Includes a utility function, prepare_bbox_string, for handling bounding boxes.
- Multiprocessing Support: Leverages multiprocessing for efficient data processing.
- Debugging Mode: Includes a debug mode for troubleshooting and ensuring correct data processing.
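For context on the bounding box handling, YOLO label files store one box per line as a class index followed by the normalized center coordinates, width, and height. The sketch below shows that conversion from pixel coordinates. The function name bbox_to_yolo_line and its signature are hypothetical, so the project's own prepare_bbox_string may take different arguments or format its output differently.

```python
from typing import Tuple


def bbox_to_yolo_line(class_id: int,
                      box: Tuple[float, float, float, float],
                      image_size: Tuple[int, int]) -> str:
    """Convert a pixel box (x_min, y_min, x_max, y_max) into a YOLO label line.

    Hypothetical helper for illustration; image_size is (width, height) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size

    # YOLO expects the box center and size, each normalized to the range [0, 1].
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h

    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Example: a 200x400 pixel text block in a 1000x1000 image, labeled as class 0.
print(bbox_to_yolo_line(0, (100, 200, 300, 600), (1000, 1000)))
```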
- Python 3.x
- PIL (Python Imaging Library)
- YOLO utilities (for bounding box handling)
- Additional Python libraries: numpy, tqdm, yaml
Clone the repository to your local machine:
git clone https://github.com/nih23/Tibetan-NLP.git
cd Tibetan-NLP
The script supports various command-line arguments to customize the data generation process:
- --background_train: Folder with background images for training (default: './ext/TibetanOCR/data/background_images_train/')
- --background_val: Folder with background images for validation (default: './ext/TibetanOCR/data/background_images_val/')
- --dataset_folder: Folder for the generated YOLO dataset (default: './data/yolo_tibetan/')
- --corpora_folder: Folder with Tibetan corpora (default: './data/corpora/UVA Tibetan Spoken Corpus/')
- --train_samples: Number of training samples to generate (default: 2)
- --val_samples: Number of validation samples to generate (default: 1)
- --no_cols: Number of text columns to generate, from 1 to 5 (default: 1)
- --font_path: Path to a font file that supports Tibetan characters (default: 'ext/Microsoft Himalaya.ttf')
- --single_label: Use a single label "tibetan" for all files instead of using filenames as labels (flag, no value required)
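The options above could be declared with argparse roughly as follows. This is a sketch reconstructed from the documented flags and defaults, not a copy of generate_training_data.py. The build_parser name, the help strings, and the choices restriction on --no_cols are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Generate a YOLO dataset with embedded Tibetan text.")
    parser.add_argument("--background_train", type=str,
                        default="./ext/TibetanOCR/data/background_images_train/",
                        help="Folder with background images for training")
    parser.add_argument("--background_val", type=str,
                        default="./ext/TibetanOCR/data/background_images_val/",
                        help="Folder with background images for validation")
    parser.add_argument("--dataset_folder", type=str,
                        default="./data/yolo_tibetan/",
                        help="Folder for the generated YOLO dataset")
    parser.add_argument("--corpora_folder", type=str,
                        default="./data/corpora/UVA Tibetan Spoken Corpus/",
                        help="Folder with Tibetan corpora")
    parser.add_argument("--train_samples", type=int, default=2,
                        help="Number of training samples to generate")
    parser.add_argument("--val_samples", type=int, default=1,
                        help="Number of validation samples to generate")
    # Assumption: the documented range of 1 to 5 is enforced via `choices`.
    parser.add_argument("--no_cols", type=int, default=1, choices=range(1, 6),
                        help="Number of text columns to generate")
    parser.add_argument("--font_path", type=str,
                        default="ext/Microsoft Himalaya.ttf",
                        help="Path to a font file that supports Tibetan characters")
    # A store_true flag needs no value: present means True, absent means False.
    parser.add_argument("--single_label", action="store_true",
                        help="Use the single label 'tibetan' for all files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```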
Training data is generated by running generate_training_data.py. Make sure to update the folders for background images.
python generate_training_data.py --font_path "ext/Microsoft Himalaya.ttf" --single_label
Training of YOLOv8n is done by a CLI call to Ultralytics.
yolo detect train data=data/yolo_tibetan/tibetan_yolo.yml epochs=1000 imgsz=1024
The model is then exported to TorchScript for inference:
yolo detect export model=runs/detect/train9/weights/best.pt
We can now use the trained model to detect and classify Tibetan text blocks as follows:
yolo predict task=detect model=runs/detect/train9/weights/best.torchscript imgsz=1024 source=data/my_inference_data/*.jpg
The results are then saved to the folder runs/detect/predict.
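If the prediction is run with save_txt=True added to the command, Ultralytics also writes one text file per image with the detected boxes in normalized YOLO format. The following sketch converts one such line back into pixel coordinates, for example to crop the detected text blocks. The helper name yolo_line_to_pixels is hypothetical. The optional sixth value is assumed to be the confidence score that is written when save_conf=True is set.

```python
from typing import Optional, Tuple


def yolo_line_to_pixels(line: str, image_size: Tuple[int, int]
                        ) -> Tuple[int, Tuple[int, int, int, int], Optional[float]]:
    """Parse 'class x_center y_center width height [conf]' into a pixel box.

    Hypothetical helper for illustration; image_size is (width, height) in pixels.
    Returns (class_id, (x_min, y_min, x_max, y_max), confidence or None).
    """
    parts = line.split()
    class_id = int(parts[0])
    x_center, y_center, width, height = (float(v) for v in parts[1:5])
    confidence = float(parts[5]) if len(parts) > 5 else None

    img_w, img_h = image_size
    # Undo the normalization and move from center/size to corner coordinates.
    x_min = round((x_center - width / 2) * img_w)
    y_min = round((y_center - height / 2) * img_h)
    x_max = round((x_center + width / 2) * img_w)
    y_max = round((y_center + height / 2) * img_h)

    return class_id, (x_min, y_min, x_max, y_max), confidence
```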
Contributions to this project are welcome! Please fork the repository and submit a pull request with your proposed changes.
This project is licensed under the MIT License - see the LICENSE file for details.