
OCR Benchmark

Florian edited this page Nov 18, 2019 · 1 revision

I quickly benchmarked four OCR systems: Microsoft Azure Cognitive Service, Amazon Textract, Google Vision API and Tesseract. The first three lines of the block "Pressure at station level; monthly means and diurnal inequalities" are shown below for a quick comparison. The full recognition output is available for download next to each OCR system.

In a nutshell, Amazon Textract and Microsoft Azure Cognitive Service do not segment the page correctly, which degrades recognition. Amazon Textract discards the layout structure entirely, while Microsoft Azure gives more satisfactory results in this regard. The recently released Form Recognizer from Microsoft Azure Cognitive Service may improve overall accuracy (not tested). Google Vision API and Tesseract segment the page well and preserve the layout structure. In all cases, some characters are still misread and the text requires post-processing. As always, recognition accuracy depends on the quality of the input image.
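The post-processing mentioned above could start with something like the following Python sketch, which is not part of the benchmark itself. It assumes that letters such as `o`, `I` and `z` inside a numeric column are misread digits, and that the single non-digit character between the integer and fractional digits stands for the decimal point of the printed table. The substitution table and the name `normalize_token` are illustrative assumptions; tokens that do not fit the expected shape are returned as `None` so they can be reviewed by hand rather than guessed.

```python
import re

# Assumed look-alike substitutions (letter -> digit); extend after
# inspecting real output. Only meaningful for purely numeric columns.
LOOKALIKES = str.maketrans({"o": "0", "O": "0", "I": "1", "l": "1",
                            "z": "2", "Z": "2"})

# Optional sign (+, hyphen, em dash or en dash), integer digits,
# exactly one non-digit separator, fractional digits.
TOKEN = re.compile(r"^([+\-\u2014\u2013]?)(\d+)[^\d\s](\d+)$")


def normalize_token(token):
    """Return the token as a float, or None if it is too ambiguous."""
    cleaned = token.translate(LOOKALIKES)
    match = TOKEN.match(cleaned)
    if match is None:
        # Missing or doubled separator, stray characters, etc.
        return None
    sign, whole, frac = match.groups()
    value = float(f"{whole}.{frac}")
    return -value if sign in ("-", "\u2014", "\u2013") else value


if __name__ == "__main__":
    # Tokens taken from the Tesseract legacy sample on this page.
    for raw in ["1003-56", "+o-z7", "\u2014o-22", "+013"]:
        print(repr(raw), "->", normalize_token(raw))
```

A token such as `+013`, where the separator was lost, is deliberately left unresolved: whether it means 0.13 or something else can only be decided with knowledge of the table, for example the expected number of decimals per column.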

  • Amazon Textract (output)

    It was almost impossible to retrieve the internal layout structure (see output)

  • Microsoft Azure Cognitive Service (output)

    IOOI .31 +0.24 +0.21 +0.05 -0.16 -
    -0.22 -0.27
    -0.23 -0.05 +0.06 +0.24 +0.26 +0.12 +0.02 -0.16 -0.27 -0.31 -0.15 -0 02 +0.02 +0.08 +0.07 +0-05 +0.13 + 0.29
    1007.62 +0.23 +0.09 -0.09-0.30-0.38 -0-40-
    -0.33 -0.19 -0.1I -0.01 +0.02 +0-03 -0.05 -
    -0.28 -0.21 +0.05 +0'27 +0.38 +0-43 +0-47 +0.42 +0.38
    
  • Google Vision OCR (output)

    1003'56 +o:38 +o27 tol —о07 —о:22 \~о-37 —о:33 —о18 —о-10 — о-II —о-10 —о-27 -о47 -0-59 —о-55 -о:38.
    10O1 31 +o:24 +o21 to05 —о-16 —о-22 -\~о-27 —о-23 \~о05 +o-об +o-24 +oо-26 +o-12 +о02 —о-16 —о-27 —0-31 —о-15 —о02 +o-02 +o-o8 +oo7 toos to13 +0:29
    I007 62 +o 23 +o.o9-0-09 -o-30 -o-38 -0-40 -o-33 -o-19-o-11 -o oi +o-02 +o*03 -o-05 -o-18 -o-25 -o-28 -o-21 +O 05 +o 27 +o 38 +o 43 +0-47 +o'42 +o-38
    -o 22 +o 02 +o-27 +0'43 +o-57 +o-62 +0-65 +o-62  
    
  • Tesseract with LSTM (--oem 1 --psm 6)(output)

    "AES: H 1003-56 40°38 +027 +0°11 —0-07 —0-22 —0-37 —0-33 —0-I8 —0-I0 —0‘II —0°'I0 —0+27 —0°47 —0-69 —0-55 —0:38 —0-22 +0-02 +027 +043 +0-57 + 0-62 +0-685 +0-62",
    "pCR 100131 40:24 +0°2X 40°05 —0'I6 —0'22 -\~0-27 —0*23 —0'05 +0°06 +0-24 +026 40-12 40:02 —0°16 —0-27 —0-31 —0'15 —0'02 + 0'02 +0-08 40-07 40-05 +0-13 + 0-29",
    "Mar. 100762 +023 +009 —0:09 —0-30 —0-38 —0-40 —0°33 —0'19 —0-11 —0-0I +06+02 40:03 —0-05 —0-18 —0-25 —0-28 —0-21 40-05 + 0°27 40°38 +043 + 0-47 + 0°42 +0-38"
    
  • Tesseract with Legacy (--oem 0 --psm 6)(output)

    'Jan. 1003-56 +0-38 +0-27 +o-11 —-o-o7 —o-22 —0-37 —o\~33 —o-18 —o-10 —0-n —o-10 —0-27 —o-47 —0-59 —0-55 —o-38 —0-22 +o-02 +o-z7 +o-43 +o-57 +o\~62 +005 +o-62',
    "Feb. 1001-31 +o-24 +o-2x +0-05 —0-16 —0-22 »—o-27 —o-23 ——0-o5 +0-o6 +0-24 +0-26 +o-1z +0-oz —0-16 —o-27 —0-31 —0-15 —0-02 +o-oz +o-08 +0'07 +0'05 +013 +029",
    "Mar. 1007-62 +o-23 +o-09 —0-09 —o-3o —0-38 —0-40 —o-33 —o-19 —o-11 —o-01 +0-oz +o-o3 —o-o5 —0-18 —o-z5 ——0-28 —o-21 +0-05 +0-z7 +0-38 +0'43 +047 +o-42 +o\~38"
    
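For reference, the two Tesseract configurations listed above can be reproduced from Python with a thin wrapper around the `tesseract` command-line tool. This is a minimal sketch: the file name `page.png` and the helper names are hypothetical, and the image preparation used for the benchmark is not documented here. `--oem 1` selects the LSTM engine, `--oem 0` the legacy engine, and `--psm 6` treats the page as a single uniform block of text. The legacy engine only works if the installed traineddata file contains the legacy model.

```python
import subprocess
from typing import List


def tesseract_command(image_path: str, oem: int, psm: int = 6) -> List[str]:
    """Build the argv for a Tesseract run that prints text to stdout."""
    return ["tesseract", image_path, "stdout",
            "--oem", str(oem), "--psm", str(psm)]


def run_tesseract(image_path: str, oem: int, psm: int = 6) -> str:
    """Run Tesseract and return the recognized text.

    Requires the tesseract binary on PATH; raises CalledProcessError
    if the run fails.
    """
    result = subprocess.run(tesseract_command(image_path, oem, psm),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a placeholder for a scanned page image.
    print(tesseract_command("page.png", oem=1))  # LSTM engine
    print(tesseract_command("page.png", oem=0))  # legacy engine
```

Calling `run_tesseract("page.png", oem=1)` and `run_tesseract("page.png", oem=0)` on the same image gives two outputs that can be compared line by line, as in the samples above.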