V1.1

Written by I. Zeki Yalniz

ACKNOWLEDGMENT

This software was developed at the Center for Intelligent Information Retrieval (CIIR), University of Massachusetts Amherst.
Basic research to develop the software was funded by the CIIR and the National Science Foundation while its application was supported by a grant from the Mellon Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

CITATION AND CONTACT INFORMATION

We ask that any publications using this software acknowledge the following paper:
Ismet Zeki Yalniz and R Manmatha: A Fast Alignment Scheme for Automatic OCR Evaluation of Books. ICDAR 2011: 754-758

For further information please contact either I. Zeki Yalniz ([email protected]) or R. Manmatha ([email protected]) or [email protected].

HOW TO COMPILE:

Inside the source folder, type the following command to compile the code (tested for Java version 1.6): "javac *.java"

HOW TO USE THE TOOL

1 - COMMAND LINE INTERFACE:

USAGE: RecursiveAlignmentTool <gtFile> <candFile> <alignFile> -opt <configFile>

<gtFile> is the reference (ground truth) text filename
<candFile> is the candidate (OCR output) text filename
<alignFile> is the filename for the alignment output
(optional) <configFile> file must contain the following arguments on each line:
ignoredChars=<list of characters to be ignored>
alignmentFormat=<COLUMN|LINES> (default is lines)
level=<W|C> (level of alignment can be either character or word level. Default is W.)

The screen output format is: <OCR_accuracy>

Example command: java RecursiveAlignmentTool texts/adventuresofhuck_ground_truth.txt texts/adventuresofhuck00clemrich_OCR_output.txt texts/alignmentOutput.txt -opt config.txt

An example configuration file includes the three lines below:

level=CHAR
alignmentFormat=LINES
ignoredChars=,.'";:!?()[]{}<>`-+=/$@%#|&^*_~
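The key=value layout of the configuration file can be illustrated with a short Java sketch. This is not code from RETAS: the class name ConfigSketch and the method parse are hypothetical, and the tool's own parser may behave differently. The sketch only shows why each line should be split at the first '=' character, since the ignoredChars value may itself contain '='. It assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of RETAS: parses "key=value" configuration lines.
final class ConfigSketch {

    // Splits each line at the FIRST '=' only, so that an
    // ignoredChars value may itself contain '=' characters.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (line.trim().isEmpty() || eq < 0) {
                continue; // skip blank or malformed lines
            }
            options.put(line.substring(0, eq).trim(), line.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "level=CHAR",
                "alignmentFormat=LINES",
                "ignoredChars=,.'\";:!?()[]{}<>`-+=/$@%#|&^*_~");
        // Print each parsed option as "key -> value".
        parse(lines).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

The value is kept untrimmed on purpose, on the assumption that a space could be a legitimate entry in ignoredChars.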

2 - RETAS JAVA API

2.a) This method returns the alignment output in an ArrayList. It does not produce any text output.

public static ArrayList<AlignedSequence> processSingleJob_getAlignedSequence(
        String gtFile,  // input text 1: ground truth text
        String candFile,  // input text 2: OCR output text (or the candidate text)
        String ignoredChars, // The list of characters to be ignored
        String level ) // alignment level: "c" or "w" (for character and word level alignment respectively)

2.b)

This function computes the alignment at the word or character level and writes it to a text output file. Two output formats are available. One can also choose the characters to be ignored for the alignment.

Stats st = RecursiveAlignmentTool.processSingleJob(
        gtFile,          // (String) input text 1: ground truth text
        candFile,        // (String) input text 2: OCR output text
        alignmentLevel,  // (String) The level of alignment: 'c' for the character and 'w' for the word level alignment.
        outputFormat,    // (String) The format of the alignment output: 'column' or 'line'
        ignoredChars,    // (String) The list of characters to be ignored
        alignFile        // (String) The filename for the alignment output
);

The "Stats" object contains the total number of matching characters/words and the total number of chars/words in the input texts. OCR accuracy is defined as the total number of matching chars/words divided by the total number of chars/words in the ground truth file. One can obtain the OCR accuracy by calling the getOCRaccuracy() method:

double ocrAccuracy = st.getOCRaccuracy();
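To make the accuracy definition concrete, the following self-contained sketch computes a word-level accuracy as matches divided by the ground truth length. It is an illustration only: the class AccuracySketch and its methods are hypothetical, and a plain longest-common-subsequence (LCS) count is used as a stand-in for the match count; the tool's own alignment procedure is not reproduced here.

```java
// Hypothetical illustration, not RETAS code: word-level OCR accuracy
// computed from a plain longest-common-subsequence (LCS) match count.
final class AccuracySketch {

    // Classic O(n*m) dynamic-programming LCS length over word arrays.
    static int lcsLength(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                dp[i][j] = a[i - 1].equals(b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }

    // Accuracy as defined above: matching words divided by ground-truth words.
    static double ocrAccuracy(String groundTruth, String candidate) {
        String[] gt = groundTruth.trim().split("\\s+");
        String[] cand = candidate.trim().split("\\s+");
        return (double) lcsLength(gt, cand) / gt.length;
    }

    public static void main(String[] args) {
        String gt = "the quick brown fox";
        String ocr = "the qu1ck brown fox";
        System.out.println(ocrAccuracy(gt, ocr)); // 3 of 4 words match: 0.75
    }
}
```

Note that the denominator is the ground truth length, so extra words inserted by the OCR engine do not lower the score by themselves; only ground truth words left unmatched do.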

2.c)

If the number of matching chars/words is the only concern, then this method is faster.

Stats sts[] = RecursiveAlignmentTool.processSingleJob_getAlignmentStatsOnly(
        gtFile,        // (String) input text 1: ground truth text
        candFile,      // (String) input text 2: OCR output text
        ignoredChars   // (String) The list of characters to be ignored
);

sts[0] contains the word level alignment statistics
sts[1] contains the character level alignment statistics
