Workflow Guide line segmentation

In this processing step, text regions are segmented into text lines. A line detection algorithm is run on every text region of every PAGE in the input file group, and a TextLine element with the resulting polygon outline is added to the annotation of the output PAGE.

Note: If you use ocrd-cis-ocropy-segment, you can proceed directly to Step 13.

Note: ocrd-tesserocr-segment-line produces only bounding boxes instead of polygon coordinates, so if you use it, you should post-process its output with the processors described in Step 12. Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-recognize, which can do line segmentation and text recognition in one step by querying Tesseract's internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap). As a further option, run with shrink_polygons=True (accessing that same iterator to calculate convex hull polygons).
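
The shrink_polygons option mentioned above could be passed roughly as in the following sketch. It assumes that shrink_polygons is a parameter of ocrd-tesserocr-segment-line and that it is set with -P in the same way as level-of-operation in the calls listed under "Available processors"; check the processor's --help output before relying on it. The helper name segment_line_cmd is hypothetical, and the file group names are the ones used elsewhere in this guide. The helper only prints the call, so the sketch can be tried without an OCR-D installation.

```shell
#!/bin/sh
# Hypothetical helper: print the line segmentation call for a given
# input file group ($1) and output file group ($2).
# Assumption: shrink_polygons is a boolean parameter of
# ocrd-tesserocr-segment-line and is passed with -P like other parameters.
segment_line_cmd() {
    printf 'ocrd-tesserocr-segment-line -I %s -O %s -P shrink_polygons true\n' "$1" "$2"
}

cmd=$(segment_line_cmd OCR-D-CLIP-REG OCR-D-SEG-LINE)
echo "$cmd"

# To actually run the call, execute the printed command inside an
# OCR-D workspace where the processor is installed.
```

With polygon outlines produced this way, the post-processing from Step 12 may be less necessary, but that depends on the material.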

Note: As described in Step 7, ocrd-eynollah-segment, ocrd-sbb-textline-detector and ocrd-cis-ocropy-segment segment not only the page but also the text lines within the detected text regions, all in one step. With those processors (and only with those!), you do not need a separate line segmentation step.

Available processors

Processor	Parameter	Remarks	Call
ocrd-cis-ocropy-segment	`-P level-of-operation region`		`ocrd-cis-ocropy-segment -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE -P level-of-operation region`
ocrd-tesserocr-segment-line			`ocrd-tesserocr-segment-line -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE`
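
When scripting a workflow, the two calls from the table can be wrapped so that the engine is selected in one place. The following is a minimal sketch: the function name line_segmentation_cmd and the engine keywords ocropy and tesserocr are hypothetical, while the calls themselves are copied from the table above. The function prints the call instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the line segmentation call for the chosen
# engine ($1), input file group ($2) and output file group ($3).
line_segmentation_cmd() {
    case "$1" in
        ocropy)
            echo "ocrd-cis-ocropy-segment -I $2 -O $3 -P level-of-operation region" ;;
        tesserocr)
            echo "ocrd-tesserocr-segment-line -I $2 -O $3" ;;
        *)
            # Unknown keyword: report on stderr and signal failure.
            echo "unknown engine: $1" >&2
            return 1 ;;
    esac
}

line_segmentation_cmd tesserocr OCR-D-CLIP-REG OCR-D-SEG-LINE
```

Keeping the choice in a single function makes it easier to compare both processors on the same material by changing only the first argument.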

Notes on parameter usage

E.g.

Which parameters do you use with what values?
Which parameters are insufficiently documented?
Which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.
