Merge pull request #196 from VikParuchuri/dev

Table recognition, better layout

VikParuchuri authored Oct 8, 2024
2 parents 0d7d8c3 + a63258e commit a87dede

Showing 39 changed files with 2,789 additions and 151 deletions.
9 changes: 5 additions & 4 deletions .github/workflows/tests.yml
@@ -36,10 +36,11 @@ jobs:
        run: |
          poetry run python benchmark/layout.py --max 5
          poetry run python scripts/verify_benchmark_scores.py results/benchmark/layout_bench/results.json --bench_type layout
      - name: Run ordering benchmark text
      - name: Run ordering benchmark
        run: |
          poetry run python benchmark/ordering.py --max 5
          poetry run python scripts/verify_benchmark_scores.py results/benchmark/order_bench/results.json --bench_type ordering
      - name: Run table recognition benchmark
        run: |
          poetry run python benchmark/table_recognition.py --max 5
          poetry run python scripts/verify_benchmark_scores.py results/benchmark/table_rec_bench/results.json --bench_type table_recognition
108 changes: 88 additions & 20 deletions README.md
@@ -6,16 +6,23 @@ Surya is a document OCR toolkit that does:
- Line-level text detection in any language
- Layout analysis (table, image, header, etc detection)
- Reading order detection
- Table recognition (detecting rows/columns)

It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).


| Detection | OCR |
|:----------------------------------------------------------------:|:-----------------------------------------------------------------------:|
| ![New York Times Article Detection](static/images/excerpt.png) | ![New York Times Article Recognition](static/images/excerpt_text.png) |
| <img src="static/images/excerpt.png" width="500px"/> | <img src="static/images/excerpt_text.png" width="500px"/> |

| Layout | Reading Order |
|:------------------------------------------------------------------:|:--------------------------------------------------------------------------:|
| ![New York Times Article Layout](static/images/excerpt_layout.png) | ![New York Times Article Reading Order](static/images/excerpt_reading.jpg) |
| <img src="static/images/excerpt_layout.png" width="500px"/> | <img src="static/images/excerpt_reading.jpg" width="500px"/> |

| Table Recognition | |
|:-----------------------------------------:|:----------------:|
| <img src="static/images/table_rec.png" width="500px"/> | <img width="500px"/> |


Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.

@@ -25,19 +32,19 @@ Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who

## Examples

| Name | Detection | OCR | Layout | Order |
|------------------|:-----------------------------------:|-----------------------------------------:|-------------------------------------------:|--------------------------------------------:|
| Japanese | [Image](static/images/japanese.jpg) | [Image](static/images/japanese_text.jpg) | [Image](static/images/japanese_layout.jpg) | [Image](static/images/japanese_reading.jpg) |
| Chinese | [Image](static/images/chinese.jpg) | [Image](static/images/chinese_text.jpg) | [Image](static/images/chinese_layout.jpg) | [Image](static/images/chinese_reading.jpg) |
| Hindi | [Image](static/images/hindi.jpg) | [Image](static/images/hindi_text.jpg) | [Image](static/images/hindi_layout.jpg) | [Image](static/images/hindi_reading.jpg) |
| Arabic | [Image](static/images/arabic.jpg) | [Image](static/images/arabic_text.jpg) | [Image](static/images/arabic_layout.jpg) | [Image](static/images/arabic_reading.jpg) |
| Chinese + Hindi | [Image](static/images/chi_hind.jpg) | [Image](static/images/chi_hind_text.jpg) | [Image](static/images/chi_hind_layout.jpg) | [Image](static/images/chi_hind_reading.jpg) |
| Presentation | [Image](static/images/pres.png) | [Image](static/images/pres_text.jpg) | [Image](static/images/pres_layout.jpg) | [Image](static/images/pres_reading.jpg) |
| Scientific Paper | [Image](static/images/paper.jpg) | [Image](static/images/paper_text.jpg) | [Image](static/images/paper_layout.jpg) | [Image](static/images/paper_reading.jpg) |
| Scanned Document | [Image](static/images/scanned.png) | [Image](static/images/scanned_text.jpg) | [Image](static/images/scanned_layout.jpg) | [Image](static/images/scanned_reading.jpg) |
| New York Times | [Image](static/images/nyt.jpg) | [Image](static/images/nyt_text.jpg) | [Image](static/images/nyt_layout.jpg) | [Image](static/images/nyt_order.jpg) |
| Scanned Form | [Image](static/images/funsd.png) | [Image](static/images/funsd_text.jpg) | [Image](static/images/funsd_layout.jpg) | [Image](static/images/funsd_reading.jpg) |
| Textbook | [Image](static/images/textbook.jpg) | [Image](static/images/textbook_text.jpg) | [Image](static/images/textbook_layout.jpg) | [Image](static/images/textbook_order.jpg) |
| Name | Detection | OCR | Layout | Order | Table Rec |
|------------------|:-----------------------------------:|-----------------------------------------:|-------------------------------------------:|--------------------------------------------:|---------------------------------------------:|
| Japanese | [Image](static/images/japanese.jpg) | [Image](static/images/japanese_text.jpg) | [Image](static/images/japanese_layout.jpg) | [Image](static/images/japanese_reading.jpg) | [Image](static/images/japanese_tablerec.png) |
| Chinese | [Image](static/images/chinese.jpg) | [Image](static/images/chinese_text.jpg) | [Image](static/images/chinese_layout.jpg) | [Image](static/images/chinese_reading.jpg) | |
| Hindi | [Image](static/images/hindi.jpg) | [Image](static/images/hindi_text.jpg) | [Image](static/images/hindi_layout.jpg) | [Image](static/images/hindi_reading.jpg) | |
| Arabic | [Image](static/images/arabic.jpg) | [Image](static/images/arabic_text.jpg) | [Image](static/images/arabic_layout.jpg) | [Image](static/images/arabic_reading.jpg) | |
| Chinese + Hindi | [Image](static/images/chi_hind.jpg) | [Image](static/images/chi_hind_text.jpg) | [Image](static/images/chi_hind_layout.jpg) | [Image](static/images/chi_hind_reading.jpg) | |
| Presentation | [Image](static/images/pres.png) | [Image](static/images/pres_text.jpg) | [Image](static/images/pres_layout.jpg) | [Image](static/images/pres_reading.jpg) | [Image](static/images/pres_tablerec.png) |
| Scientific Paper | [Image](static/images/paper.jpg) | [Image](static/images/paper_text.jpg) | [Image](static/images/paper_layout.jpg) | [Image](static/images/paper_reading.jpg) | [Image](static/images/paper_tablerec.png) |
| Scanned Document | [Image](static/images/scanned.png) | [Image](static/images/scanned_text.jpg) | [Image](static/images/scanned_layout.jpg) | [Image](static/images/scanned_reading.jpg) | [Image](static/images/scanned_tablerec.png) |
| New York Times | [Image](static/images/nyt.jpg) | [Image](static/images/nyt_text.jpg) | [Image](static/images/nyt_layout.jpg) | [Image](static/images/nyt_order.jpg) | |
| Scanned Form | [Image](static/images/funsd.png) | [Image](static/images/funsd_text.jpg) | [Image](static/images/funsd_layout.jpg) | [Image](static/images/funsd_reading.jpg) | [Image](static/images/scanned_tablerec2.png) |
| Textbook | [Image](static/images/textbook.jpg) | [Image](static/images/textbook_text.jpg) | [Image](static/images/textbook_layout.jpg) | [Image](static/images/textbook_order.jpg) | |

# Hosted API

@@ -272,6 +279,43 @@ processor = load_processor()
order_predictions = batch_ordering([image], [bboxes], model, processor)
```

## Table Recognition

This command will write out a json file with the detected table cells and row/column ids, along with row/column bounding boxes.

```shell
surya_table DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--images` will save images of the pages and detected table cells + rows and columns (optional)
- `--max` specifies the maximum number of pages to process if you don't want to process everything
- `--results_dir` specifies the directory to save results to instead of the default
- `--detect_boxes` specifies if cells should be detected. By default, they're pulled out of the PDF, but this is not always possible.
- `--skip_table_detection` tells table recognition not to detect tables first. Use this if your image is already cropped to a table.

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

- `cells` - detected table cells
  - `bbox` - the axis-aligned rectangle for the cell in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - `row_id` - the id of the row this cell belongs to.
  - `col_id` - the id of the column this cell belongs to.
  - `text` - if text could be pulled out of the pdf, the text of this cell.
- `rows` - detected table rows
  - `bbox` - the bounding box of the table row
  - `row_id` - the id of the row
- `cols` - detected table columns
  - `bbox` - the bounding box of the table column
  - `col_id` - the id of the column
- `page` - the page number in the file
- `table_idx` - the index of the table on the page (sorted in vertical order)
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All cell, row, and column bboxes will be contained within this bbox.
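
Table recognition can also be called from Python. The sketch below mirrors how the benchmark script added in this commit drives the model; `IMAGE_PATH` and `cell_boxes` are placeholders for your own image and (optional) pre-extracted cell boxes, not part of the library.

```python
from PIL import Image

from surya.model.table_rec.model import load_model
from surya.model.table_rec.processor import load_processor
from surya.tables import batch_table_recognition

image = Image.open(IMAGE_PATH)  # placeholder path; ideally already cropped to a table
model = load_model()
processor = load_processor()

# One list of cell dicts per image; "text" may be None if it can't be pulled from a PDF
bboxes = [[{"bbox": box, "text": None} for box in cell_boxes]]  # cell_boxes is a placeholder

# One prediction per input image, with .cells, .rows, and .cols
table_predictions = batch_table_recognition([image], bboxes, model, processor)
```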

**Performance tips**

Setting the `TABLE_REC_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `150MB` of VRAM, so very high batch sizes are possible. The default batch size is `64`, which will use about 10GB of VRAM. A higher batch size can also help on CPU, depending on your core count - the default CPU batch size is `8`.
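
As a rough sizing sketch, you could derive a batch size from free VRAM, assuming the ~`150MB` per item figure above and that surya reads `TABLE_REC_BATCH_SIZE` from the environment when it is imported:

```python
import os

import torch

if torch.cuda.is_available():
    free_bytes, _ = torch.cuda.mem_get_info()
    # ~150MB of VRAM per batch item (see above), with a little headroom
    batch_size = max(1, int(free_bytes * 0.9 // (150 * 1024 ** 2)))
else:
    batch_size = 8  # default CPU batch size

# Set before importing surya so its settings pick it up
os.environ["TABLE_REC_BATCH_SIZE"] = str(batch_size)
```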


# Limitations

- This is specialized for document OCR. It will likely not work on photos or other images.
Expand Down Expand Up @@ -381,10 +425,23 @@ I benchmarked the layout analysis on [Publaynet](https://github.com/ibm-aur-nlp/

**Methodology**

I benchmarked the layout analysis on the layout dataset from [here](https://www.icst.pku.edu.cn/cpdp/sjzy/), which was not in the training data. Unfortunately, this dataset is fairly noisy, and not all the labels are correct. It was very hard to find a dataset annotated with reading order and also layout information. I wanted to avoid using a cloud service for the ground truth.
I benchmarked the reading order on the layout dataset from [here](https://www.icst.pku.edu.cn/cpdp/sjzy/), which was not in the training data. Unfortunately, this dataset is fairly noisy, and not all the labels are correct. It was very hard to find a dataset annotated with reading order and also layout information. I wanted to avoid using a cloud service for the ground truth.

The accuracy is computed by finding if each pair of layout boxes is in the correct order, then taking the % that are correct.
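
A minimal sketch of that pairwise metric (not the exact benchmark code):

```python
from itertools import combinations

def pairwise_order_accuracy(predicted_order, actual_order):
    # Both arguments map a layout box id to its position in the reading order
    pairs = list(combinations(actual_order, 2))
    if not pairs:
        return 1.0
    correct = sum(
        (predicted_order[a] < predicted_order[b]) == (actual_order[a] < actual_order[b])
        for a, b in pairs
    )
    return correct / len(pairs)
```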

## Table Recognition

| Model | Row Intersection | Col Intersection | Time Per Image |
|-------------------|------------------|------------------|------------------|
| Surya | 0.97 | 0.93 | 0.03 |
| Table transformer | 0.72 | 0.84 | 0.02 |

Higher is better for intersection, which is the percentage of the actual row/column overlapped by the predictions.

**Methodology**

The benchmark uses a subset of [Fintabnet](https://developer.ibm.com/exchanges/data/all/fintabnet/) from IBM. It has labeled rows and columns. After table recognition is run, the predicted rows and columns are compared to the ground truth. There is an additional penalty for predicting too many or too few rows/columns.
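
Roughly, each labeled row/column is matched against the predictions, and the average overlap is scaled down when too many or too few rows/columns are predicted. A simplified sketch of such a score (the actual `penalized_iou_score` in `surya.benchmark.metrics` may differ in details):

```python
def coverage(gt_box, pred_box):
    # Fraction of the ground-truth box covered by the prediction (boxes are x1, y1, x2, y2)
    x1, y1 = max(gt_box[0], pred_box[0]), max(gt_box[1], pred_box[1])
    x2, y2 = min(gt_box[2], pred_box[2]), min(gt_box[3], pred_box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / gt_area if gt_area > 0 else 0.0

def penalized_coverage_score(pred_boxes, gt_boxes):
    if not gt_boxes or not pred_boxes:
        return 0.0
    # Best-matching prediction for each labeled row/column, averaged
    score = sum(max(coverage(gt, p) for p in pred_boxes) for gt in gt_boxes) / len(gt_boxes)
    # Penalize predicting too many or too few rows/columns
    penalty = min(len(pred_boxes), len(gt_boxes)) / max(len(pred_boxes), len(gt_boxes))
    return score * penalty
```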

## Running your own benchmarks

You can benchmark the performance of surya on your machine.
@@ -396,7 +453,7 @@ You can benchmark the performance of surya on your machine.

This will evaluate tesseract and surya for text line detection across a randomly sampled set of images from [doclaynet](https://huggingface.co/datasets/vikp/doclaynet_bench).

```
```shell
python benchmark/detection.py --max 256
```

Expand All @@ -409,7 +466,7 @@ python benchmark/detection.py --max 256

This will evaluate surya and optionally tesseract on multilingual pdfs from common crawl (with synthetic data for missing languages).

```
```shell
python benchmark/recognition.py --tesseract
```

@@ -425,7 +482,7 @@ python benchmark/recognition.py --tesseract

This will evaluate surya on the publaynet dataset.

```
```shell
python benchmark/layout.py
```

@@ -435,14 +492,25 @@ python benchmark/layout.py

**Reading Order**

```
```shell
python benchmark/ordering.py
```

- `--max` controls how many images to process for the benchmark
- `--debug` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one

**Table Recognition**

```shell
python benchmark/table_recognition.py --max 1024 --tatr
```

- `--max` controls how many images to process for the benchmark
- `--debug` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one
- `--tatr` specifies whether to also run table transformer

# Training

Text detection was trained on 4x A6000s for 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified efficientvit architecture for semantic segmentation.
143 changes: 143 additions & 0 deletions benchmark/table_recognition.py
@@ -0,0 +1,143 @@
import argparse
import collections
import copy
import json

from tabulate import tabulate

from surya.input.processing import convert_if_not_rgb
from surya.model.table_rec.model import load_model
from surya.model.table_rec.processor import load_processor
from surya.tables import batch_table_recognition, get_batch_size
from surya.settings import settings
from surya.benchmark.metrics import rank_accuracy, penalized_iou_score
from surya.benchmark.tatr import load_tatr, batch_inference_tatr
import os
import time
import datasets


def main():
    parser = argparse.ArgumentParser(description="Benchmark surya table recognition model.")
    parser.add_argument("--results_dir", type=str, help="Path to JSON file with benchmark results.", default=os.path.join(settings.RESULT_DIR, "benchmark"))
    parser.add_argument("--max", type=int, help="Maximum number of images to run benchmark on.", default=None)
    parser.add_argument("--tatr", action="store_true", help="Run table transformer.", default=False)
    args = parser.parse_args()

    model = load_model()
    processor = load_processor()

    pathname = "table_rec_bench"
    # These have already been shuffled randomly, so sampling from the start is fine
    split = "train"
    if args.max is not None:
        split = f"train[:{args.max}]"
    dataset = datasets.load_dataset(settings.TABLE_REC_BENCH_DATASET_NAME, split=split)
    images = list(dataset["image"])
    images = convert_if_not_rgb(images)
    bboxes = list(dataset["bboxes"])

    start = time.time()
    bboxes = [[{"bbox": b, "text": None} for b in bb] for bb in bboxes]
    table_rec_predictions = batch_table_recognition(images, bboxes, model, processor)
    surya_time = time.time() - start

    folder_name = os.path.basename(pathname).split(".")[0]
    result_path = os.path.join(args.results_dir, folder_name)
    os.makedirs(result_path, exist_ok=True)

    page_metrics = collections.OrderedDict()
    mean_col_iou = 0
    mean_row_iou = 0
    for idx, pred in enumerate(table_rec_predictions):
        row = dataset[idx]
        pred_row_boxes = [p.bbox for p in pred.rows]
        pred_col_bboxes = [p.bbox for p in pred.cols]
        actual_row_bboxes = row["rows"]
        actual_col_bboxes = row["cols"]
        row_score = penalized_iou_score(pred_row_boxes, actual_row_bboxes)
        col_score = penalized_iou_score(pred_col_bboxes, actual_col_bboxes)
        page_results = {
            "row_score": row_score,
            "col_score": col_score,
            "row_count": len(actual_row_bboxes),
            "col_count": len(actual_col_bboxes)
        }

        mean_col_iou += col_score
        mean_row_iou += row_score

        page_metrics[idx] = page_results

    mean_col_iou /= len(table_rec_predictions)
    mean_row_iou /= len(table_rec_predictions)

    out_data = {"surya": {
        "time": surya_time,
        "mean_row_iou": mean_row_iou,
        "mean_col_iou": mean_col_iou,
        "page_metrics": page_metrics
    }}

    if args.tatr:
        tatr_model = load_tatr()
        start = time.time()
        tatr_predictions = batch_inference_tatr(tatr_model, images, 1)
        tatr_time = time.time() - start

        page_metrics = collections.OrderedDict()
        mean_col_iou = 0
        mean_row_iou = 0
        for idx, pred in enumerate(tatr_predictions):
            row = dataset[idx]
            pred_row_boxes = [p["bbox"] for p in pred["rows"]]
            pred_col_bboxes = [p["bbox"] for p in pred["cols"]]
            actual_row_bboxes = row["rows"]
            actual_col_bboxes = row["cols"]
            row_score = penalized_iou_score(pred_row_boxes, actual_row_bboxes)
            col_score = penalized_iou_score(pred_col_bboxes, actual_col_bboxes)
            page_results = {
                "row_score": row_score,
                "col_score": col_score,
                "row_count": len(actual_row_bboxes),
                "col_count": len(actual_col_bboxes)
            }

            mean_col_iou += col_score
            mean_row_iou += row_score

            page_metrics[idx] = page_results

        mean_col_iou /= len(tatr_predictions)
        mean_row_iou /= len(tatr_predictions)

        out_data["tatr"] = {
            "time": tatr_time,
            "mean_row_iou": mean_row_iou,
            "mean_col_iou": mean_col_iou,
            "page_metrics": page_metrics
        }

    with open(os.path.join(result_path, "results.json"), "w+") as f:
        json.dump(out_data, f, indent=4)

    table = [
        ["Model", "Row Intersection", "Col Intersection", "Time Per Image"],
        ["Surya", f"{out_data['surya']['mean_row_iou']:.2f}", f"{out_data['surya']['mean_col_iou']:.2f}",
         f"{surya_time / len(images):.2f}"],
    ]

    if args.tatr:
        table.append(["Table transformer", f"{out_data['tatr']['mean_row_iou']:.2f}", f"{out_data['tatr']['mean_col_iou']:.2f}",
                      f"{tatr_time / len(images):.2f}"])

    print(tabulate(table, headers="firstrow", tablefmt="github"))

    print("Intersection is the average of the intersection % between each actual row/column, and the predictions. With penalties for too many/few predictions.")
    print("Note that table transformers is unbatched, since the example code in the repo is unbatched.")
    print(f"Wrote results to {result_path}")


if __name__ == "__main__":
    main()