Data Programming for Text Detection in Documents using CAGE. This work uses CAGE from SPEAR to detect text within documents accurately, which can support the creation of large benchmark datasets for text detection and for downstream tasks.
Several recent deep learning (DL) based techniques perform considerably well on image-based multilingual text detection. However, their performance relies heavily on the availability and quality of training data. There are numerous types of page-level document images consisting of information in several modalities, languages, fonts, and layouts. This makes text detection a challenging problem in the field of computer vision (CV), especially for low-resource or handwritten languages. Furthermore, there is a scarcity of word-level labeled data for text detection, especially for multilingual settings and Indian scripts that incorporate both printed and handwritten text. Conventionally, Indian script text detection requires training a DL model on plenty of labeled data, but to the best of our knowledge, no relevant datasets are available. Manual annotation of such data requires a lot of time, effort, and expertise. In order to solve this problem, we propose TEXTRON, a Data Programming-based approach, where users can plug various text detection methods into a weak supervision-based learning framework. One can view this approach to multilingual text detection as an ensemble of different CV-based techniques and DL approaches. TEXTRON can leverage the predictions of DL models pre-trained on a significant amount of language data in conjunction with CV-based methods to improve text detection in other languages. We demonstrate that TEXTRON can improve the detection performance for documents written in Indian languages, despite the absence of corresponding labeled data. Further, through extensive experimentation, we show the improvement brought about by our approach over the current state-of-the-art (SOTA) models, especially for handwritten Devanagari text.
If you use this paper or the accompanying code/data in your research, please cite it as:
[Insert Citation Information Here]
- Run `pip install -r requirements.txt`
- Make the configurations as stated in config.py
- Create a directory named data outside the main project directory, with a sub-directory temp
- Within temp, create 2 sub-directories, img and txt
- Place your input images in the img sub-directory and the corresponding ground truth labels (if available) in the txt sub-directory
- Set the appropriate path for INPUT_DATA_DIR in config.py
- If ground truth isn't available, set GROUND_TRUTH_AVAILABLE in config.py to False
- Choose the appropriate labeling functions from the lab_funcs list in config.py and set the respective quality guide for CAGE
- Finally, run the main.py code to get the predictions in the results folder (outside the main project directory) defined in config.py
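
The directory layout described in the steps above can be prepared with a short script. This is a minimal sketch, not part of the repository: the helper name `make_data_dirs` is hypothetical, and it assumes that "outside the main project directory" means a `data` folder that sits next to the project folder.

```python
from pathlib import Path


def make_data_dirs(project_dir):
    """Create data/temp/img and data/temp/txt beside the project directory.

    Assumption: the data directory is a sibling of the project directory.
    Returns the two created sub-directories as Path objects.
    """
    data_root = Path(project_dir).resolve().parent / "data" / "temp"
    created = []
    for name in ("img", "txt"):
        sub_dir = data_root / name
        # parents=True builds data/ and temp/ as needed;
        # exist_ok=True makes the call safe to repeat.
        sub_dir.mkdir(parents=True, exist_ok=True)
        created.append(sub_dir)
    return created
```

Calling `make_data_dirs(".")` from inside the project directory would create the two folders. The resulting `temp` path is then what INPUT_DATA_DIR in config.py should point to, assuming that setting expects the folder containing img and txt.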
- Images are passed to the CAGE model, whose labeling functions generate weak pixel-level labels marking each pixel as textual or non-textual
- Effective post-processing steps generate bounding boxes for the detected word-level text
- The method can be applied to documents in various settings and provides SOTA or near-SOTA results
- Labeling functions can be used in a plug-and-play fashion to analyze results under different configurations
- Pretrained Models based Labeling Functions
  - DocTR
- Image Processing based Labeling Functions
  - Convex hull Labeling Function
  - Edges based Labeling Function
  - Contour based Labeling Function
  - Segmentation based Labeling Function
  - Mask Region based Labeling Function
- Tesseract Model for Text Detection
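
As an illustration of how labeling functions like those listed above might be selected by name and swapped in and out, here is a minimal sketch. Everything in it is hypothetical: the two toy functions, the `REGISTRY` dictionary, and `run_labeling_functions` are simplified stand-ins and do not reproduce the repository's labeling functions or its config.py format.

```python
def dark_pixel_lf(image, threshold=128):
    """Toy LF: vote text (1) for pixels darker than the threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def edge_lf(image, min_jump=100):
    """Toy LF: vote text (1) where intensity jumps sharply from the left neighbour."""
    return [
        [1 if c > 0 and abs(row[c] - row[c - 1]) >= min_jump else 0
         for c in range(len(row))]
        for row in image
    ]


# Hypothetical registry mapping a configurable name to a labeling function.
REGISTRY = {"dark_pixel": dark_pixel_lf, "edge": edge_lf}


def run_labeling_functions(image, names):
    """Apply the selected labeling functions to a grayscale image.

    image: list of rows of 0-255 intensities.
    names: labeling function names to enable, in the spirit of a lab_funcs list.
    Returns a dict mapping each name to its binary weak-label mask.
    """
    unknown = [name for name in names if name not in REGISTRY]
    if unknown:
        raise KeyError(f"unknown labeling functions: {unknown}")
    return {name: REGISTRY[name](image) for name in names}


# One row of a toy grayscale image: white background with two dark pixels.
toy_image = [[255, 255, 20, 30, 255]]
weak_labels = run_labeling_functions(toy_image, ["dark_pixel", "edge"])
print(weak_labels["dark_pixel"])  # [[0, 0, 1, 1, 0]]
print(weak_labels["edge"])        # [[0, 0, 1, 0, 1]]
```

Keeping each labeling function behind a common image-in, mask-out interface is what would let different combinations be compared by changing only the list of names.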
The datasets can be found at this link
Class | Coverage% | DBNet Model | Textron3LF | Textron4LF |
---|---|---|---|---|
Date | 00.02% | 33.34 | 100.00 | 66.67 |
Author | 00.08% | 76.40 | 75.79 | 77.78 |
Title | 00.13% | 77.94 | 26.29 | 58.54 |
Section | 00.79% | 57.08 | 61.30 | 66.13 |
List | 00.86% | 52.37 | 66.95 | 62.02 |
Abstract | 01.34% | 51.54 | 76.87 | 69.52 |
Footer | 01.57% | 54.67 | 72.10 | 67.64 |
Caption | 02.38% | 42.25 | 67.65 | 57.34 |
Table | 04.83% | 22.99 | 28.48 | 21.03 |
Equation | 07.59% | 2.86 | 18.31 | 11.71 |
Reference | 10.09% | 48.10 | 68.45 | 65.27 |
Paragraph | 70.31% | 49.98 | 68.36 | 63.68 |
Overall | 100.00% | 46.24 | 63.38 | 58.91 |
Class-wise TEXTRON results on DocBank for 100 test images
Threshold | P | R | F1 | P | R | F1 | P | R | F1 |
---|---|---|---|---|---|---|---|---|---|
0.5 | 40.49 | 74.03 | 52.35 | 90.49 | 80.00 | 84.92 | 87.23 | 84.46 | 85.82 |
0.6 | 29.63 | 54.16 | 38.30 | 76.45 | 67.59 | 71.75 | 79.97 | 77.43 | 78.68 |
0.7 | 13.21 | 24.15 | 17.08 | 43.56 | 49.27 | 46.24 | 64.42 | 62.38 | 63.38 |
0.8 | 04.62 | 08.44 | 05.97 | 18.82 | 16.64 | 17.67 | 33.63 | 32.56 | 33.09 |
0.9 | 00.36 | 00.65 | 00.46 | 03.11 | 02.75 | 02.92 | 33.63 | 32.56 | 33.09 |
TEXTRON yields better overall performance than DBNet and shows significant improvement in detecting classes such as equations and footers.
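
For readers who want to see how precision (P), recall (R), and F1 at a given threshold can be computed for detected boxes, here is a minimal sketch. It assumes that the threshold is an intersection-over-union (IoU) cutoff and uses greedy one-to-one matching. Both are assumptions for illustration; the source does not describe the exact evaluation protocol, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max), max exclusive."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def precision_recall_f1(predicted, truth, threshold):
    """Greedily match each predicted box to its best unmatched ground-truth box.

    A match counts as a true positive when its IoU is >= threshold.
    """
    matched = set()
    true_positives = 0
    for pred in predicted:
        best_index, best_score = None, 0.0
        for index, gt in enumerate(truth):
            if index in matched:
                continue
            score = iou(pred, gt)
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth_boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
predicted_boxes = [(0, 0, 10, 8), (21, 0, 31, 10), (50, 50, 60, 60)]
for cutoff in (0.5, 0.9):
    print(cutoff, precision_recall_f1(predicted_boxes, truth_boxes, cutoff))
```

In this toy example two of the three predictions overlap ground truth with IoU of about 0.8, so they count at a cutoff of 0.5 but not at 0.9. A stricter cutoff demands tighter boxes, which is consistent with the scores in the table falling as the threshold rises.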
This work is licensed under the GNU license.
We wish to acknowledge the IITB annotators for annotating the text detection dataset used in our experiments.
- Badri Vishal Kasuba
- Dhruv Kudale
We conclude by opening the door to more innovative contributions toward seamless multilingual text detection. Thank you for your interest in our research paper!