This repository contains supplementary materials for DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction. We are pleased to announce that DeepCPCFG has been accepted for publication at the 16th International Conference on Document Analysis and Recognition ICDAR 2021. The contents in this repository are newer than the ArXiv version and may not have been presented before.

There are three sets of materials.

RVL-CDIP link:
- OCRed bounding boxes.
- Annotations for the invoices with line-items extracted from relational records.
- The inference output of DeepCPCFG.
- Results for each invoice using the Hierarchical Edit Distance (HED) metric.
CORD receipts link:
- OCRed bounding boxes.
- Annotations converted from hand-annotations and represented as relational records.
- Inference output of DeepCPCFG.
- Results for each receipt based on the following metrics
  - SPADE metric, our interpretation and implementation of the metric used in Spatial Dependency Parsing for Semi-Structured Document Information Extraction.
  - HED metric
Code implementing the Hierarchical Edit Distance (HED) metric.

Instructions for using the hed.jl

hed.jl is a self-contained file that implements Hierarchical Edit Distance (HED) metric for comparing two files representing the output/annotation of a hierarchical document.

Download and install Julia
Install packages ArgParse and JSON in Julia Built-in Package Manager
Try to run hed.jl as follows

$ julia hed.jl --help

You should see the following printout on your terminal prompt.

usage: hed.jl [--str-func STR-FUNC] [-h] prediction groundTruth

positional arguments:
  prediction           the .json file containing the prediction
  groundTruth          the .json file containing the ground truth

optional arguments:
  --str-func STR-FUNC  function on string: choose "split" (word-based)
                       or "identity" (character-based) (default:
                       "identity")
  -h, --help           show this help message and exit

If you see above printout, then continue as follows,

$ julia hed.jl rvl-cdip/predictions/json/0060087309.json rvl-cdip/annotations/0060087309.json

Then you will get the following output, or see hed_sample_output.txt.

(long output detailing the exact calculations for each element in the prediction)

TP = 90, FP = 14, FN = 18, Precision = 0.8654, Recall = 0.8333, F₁ = 0.8491

To get precision and recall for the entire corpus, calculate HED for every pair of prediction/annotation, sum up the respective true positives, false positives and false negatives, then obtain the precision and recall using the aggregated counts.

References

A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015.
S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, H. Lee, "CORD: A Consolidated Receipt Dataset for Post-OCR Parsing," in Document Intelligence Workshop NeurIPS, 2019.
W. Hwang, J. Yim, S. Park, S. Yang, M. Seo, "Spatial Dependency Parsing for Semi-Structured Document Information Extraction," in ArXiv, 2020.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
cord		cord
response		response
rvl-cdip		rvl-cdip
LICENSE		LICENSE
README.md		README.md
hed.jl		hed.jl
hed_sample_output.txt		hed_sample_output.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Contents

Instructions for using the hed.jl

References

About

Releases

Packages

Languages

License

halona719halona/datasets

Folders and files

Latest commit

History

Repository files navigation

Contents

Instructions for using the hed.jl

References

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages