Benchmark workflows "selected_pages_ocr" do not produce text results #22

Open

stweil opened this issue Jan 11, 2024 · 23 comments

stweil (Contributor) commented Jan 11, 2024

The related workflows all end with CER / WER 1.0, so no text is recognized by Calamari.

A manual run for a single GT terminates in less than 1 second without an error message, but also without a usable result:

root@35dc144e05e4:/app/workflows/workspaces/euler_rechenkunst01_1738_selected_pages_ocr/data/euler_rechenkunst01_1738# ocrd-calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR2 -P checkpoint_dir qurator-gt4histocr-1.0
Checkpoint version 2 is up-to-date.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_data (InputLayer)        [(None, None, 48, 1  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_0 (Conv2D)              (None, None, 48, 40  400         ['input_data[0][0]']             
                                )                                                                 
                                                                                                  
 pool2d_1 (MaxPooling2D)        (None, None, 24, 40  0           ['conv2d_0[0][0]']               
                                )                                                                 
                                                                                                  
 conv2d_1 (Conv2D)              (None, None, 24, 60  21660       ['pool2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 pool2d_3 (MaxPooling2D)        (None, None, 12, 60  0           ['conv2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 reshape (Reshape)              (None, None, 720)    0           ['pool2d_3[0][0]']               
                                                                                                  
 bidirectional (Bidirectional)  (None, None, 400)    1473600     ['reshape[0][0]']                
                                                                                                  
 input_sequence_length (InputLa  [(None, 1)]         0           []                               
 yer)                                                                                             
                                                                                                  
 dropout (Dropout)              (None, None, 400)    0           ['bidirectional[0][0]']          
                                                                                                  
 tf.compat.v1.floor_div (TFOpLa  (None, 1)           0           ['input_sequence_length[0][0]']  
 mbda)                                                                                            
                                                                                                  
 logits (Dense)                 (None, None, 255)    102255      ['dropout[0][0]']                
                                                                                                  
 tf.compat.v1.floor_div_1 (TFOp  (None, 1)           0           ['tf.compat.v1.floor_div[0][0]'] 
 Lambda)                                                                                          
                                                                                                  
 softmax (Softmax)              (None, None, 255)    0           ['logits[0][0]']                 
                                                                                                  
 input_data_params (InputLayer)  [(None, 1)]         0           []                               
                                                                                                  
 tf.cast (TFOpLambda)           (None, 1)            0           ['tf.compat.v1.floor_div_1[0][0]'
                                                                 ]                                
                                                                                                  
==================================================================================================
Total params: 1,597,915
Trainable params: 1,597,915
Non-trainable params: 0
__________________________________________________________________________________________________
None
Checkpoint version 2 is up-to-date.
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_data (InputLayer)        [(None, None, 48, 1  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_0 (Conv2D)              (None, None, 48, 40  400         ['input_data[0][0]']             
                                )                                                                 
                                                                                                  
 pool2d_1 (MaxPooling2D)        (None, None, 24, 40  0           ['conv2d_0[0][0]']               
                                )                                                                 
                                                                                                  
 conv2d_1 (Conv2D)              (None, None, 24, 60  21660       ['pool2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 pool2d_3 (MaxPooling2D)        (None, None, 12, 60  0           ['conv2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 reshape_1 (Reshape)            (None, None, 720)    0           ['pool2d_3[0][0]']               
                                                                                                  
 bidirectional_1 (Bidirectional  (None, None, 400)   1473600     ['reshape_1[0][0]']              
 )                                                                                                
                                                                                                  
 input_sequence_length (InputLa  [(None, 1)]         0           []                               
 yer)                                                                                             
                                                                                                  
 dropout_1 (Dropout)            (None, None, 400)    0           ['bidirectional_1[0][0]']        
                                                                                                  
 tf.compat.v1.floor_div_2 (TFOp  (None, 1)           0           ['input_sequence_length[0][0]']  
 Lambda)                                                                                          
                                                                                                  
 logits (Dense)                 (None, None, 255)    102255      ['dropout_1[0][0]']              
                                                                                                  
 tf.compat.v1.floor_div_3 (TFOp  (None, 1)           0           ['tf.compat.v1.floor_div_2[0][0]'
 Lambda)                                                         ]                                
                                                                                                  
 softmax (Softmax)              (None, None, 255)    0           ['logits[0][0]']                 
                                                                                                  
 input_data_params (InputLayer)  [(None, 1)]         0           []                               
                                                                                                  
 tf.cast_1 (TFOpLambda)         (None, 1)            0           ['tf.compat.v1.floor_div_3[0][0]'
                                                                 ]                                
                                                                                                  
==================================================================================================
Total params: 1,597,915
Trainable params: 1,597,915
Non-trainable params: 0
__________________________________________________________________________________________________
None
Checkpoint version 2 is up-to-date.
Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_data (InputLayer)        [(None, None, 48, 1  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_0 (Conv2D)              (None, None, 48, 40  400         ['input_data[0][0]']             
                                )                                                                 
                                                                                                  
 pool2d_1 (MaxPooling2D)        (None, None, 24, 40  0           ['conv2d_0[0][0]']               
                                )                                                                 
                                                                                                  
 conv2d_1 (Conv2D)              (None, None, 24, 60  21660       ['pool2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 pool2d_3 (MaxPooling2D)        (None, None, 12, 60  0           ['conv2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 reshape_2 (Reshape)            (None, None, 720)    0           ['pool2d_3[0][0]']               
                                                                                                  
 bidirectional_2 (Bidirectional  (None, None, 400)   1473600     ['reshape_2[0][0]']              
 )                                                                                                
                                                                                                  
 input_sequence_length (InputLa  [(None, 1)]         0           []                               
 yer)                                                                                             
                                                                                                  
 dropout_2 (Dropout)            (None, None, 400)    0           ['bidirectional_2[0][0]']        
                                                                                                  
 tf.compat.v1.floor_div_4 (TFOp  (None, 1)           0           ['input_sequence_length[0][0]']  
 Lambda)                                                                                          
                                                                                                  
 logits (Dense)                 (None, None, 255)    102255      ['dropout_2[0][0]']              
                                                                                                  
 tf.compat.v1.floor_div_5 (TFOp  (None, 1)           0           ['tf.compat.v1.floor_div_4[0][0]'
 Lambda)                                                         ]                                
                                                                                                  
 softmax (Softmax)              (None, None, 255)    0           ['logits[0][0]']                 
                                                                                                  
 input_data_params (InputLayer)  [(None, 1)]         0           []                               
                                                                                                  
 tf.cast_2 (TFOpLambda)         (None, 1)            0           ['tf.compat.v1.floor_div_5[0][0]'
                                                                 ]                                
                                                                                                  
==================================================================================================
Total params: 1,597,915
Trainable params: 1,597,915
Non-trainable params: 0
__________________________________________________________________________________________________
None
Checkpoint version 2 is up-to-date.
Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_data (InputLayer)        [(None, None, 48, 1  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_0 (Conv2D)              (None, None, 48, 40  400         ['input_data[0][0]']             
                                )                                                                 
                                                                                                  
 pool2d_1 (MaxPooling2D)        (None, None, 24, 40  0           ['conv2d_0[0][0]']               
                                )                                                                 
                                                                                                  
 conv2d_1 (Conv2D)              (None, None, 24, 60  21660       ['pool2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 pool2d_3 (MaxPooling2D)        (None, None, 12, 60  0           ['conv2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 reshape_3 (Reshape)            (None, None, 720)    0           ['pool2d_3[0][0]']               
                                                                                                  
 bidirectional_3 (Bidirectional  (None, None, 400)   1473600     ['reshape_3[0][0]']              
 )                                                                                                
                                                                                                  
 input_sequence_length (InputLa  [(None, 1)]         0           []                               
 yer)                                                                                             
                                                                                                  
 dropout_3 (Dropout)            (None, None, 400)    0           ['bidirectional_3[0][0]']        
                                                                                                  
 tf.compat.v1.floor_div_6 (TFOp  (None, 1)           0           ['input_sequence_length[0][0]']  
 Lambda)                                                                                          
                                                                                                  
 logits (Dense)                 (None, None, 255)    102255      ['dropout_3[0][0]']              
                                                                                                  
 tf.compat.v1.floor_div_7 (TFOp  (None, 1)           0           ['tf.compat.v1.floor_div_6[0][0]'
 Lambda)                                                         ]                                
                                                                                                  
 softmax (Softmax)              (None, None, 255)    0           ['logits[0][0]']                 
                                                                                                  
 input_data_params (InputLayer)  [(None, 1)]         0           []                               
                                                                                                  
 tf.cast_3 (TFOpLambda)         (None, 1)            0           ['tf.compat.v1.floor_div_7[0][0]'
                                                                 ]                                
                                                                                                  
==================================================================================================
Total params: 1,597,915
Trainable params: 1,597,915
Non-trainable params: 0
__________________________________________________________________________________________________
None
Checkpoint version 2 is up-to-date.
Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_data (InputLayer)        [(None, None, 48, 1  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_0 (Conv2D)              (None, None, 48, 40  400         ['input_data[0][0]']             
                                )                                                                 
                                                                                                  
 pool2d_1 (MaxPooling2D)        (None, None, 24, 40  0           ['conv2d_0[0][0]']               
                                )                                                                 
                                                                                                  
 conv2d_1 (Conv2D)              (None, None, 24, 60  21660       ['pool2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 pool2d_3 (MaxPooling2D)        (None, None, 12, 60  0           ['conv2d_1[0][0]']               
                                )                                                                 
                                                                                                  
 reshape_4 (Reshape)            (None, None, 720)    0           ['pool2d_3[0][0]']               
                                                                                                  
 bidirectional_4 (Bidirectional  (None, None, 400)   1473600     ['reshape_4[0][0]']              
 )                                                                                                
                                                                                                  
 input_sequence_length (InputLa  [(None, 1)]         0           []                               
 yer)                                                                                             
                                                                                                  
 dropout_4 (Dropout)            (None, None, 400)    0           ['bidirectional_4[0][0]']        
                                                                                                  
 tf.compat.v1.floor_div_8 (TFOp  (None, 1)           0           ['input_sequence_length[0][0]']  
 Lambda)                                                                                          
                                                                                                  
 logits (Dense)                 (None, None, 255)    102255      ['dropout_4[0][0]']              
                                                                                                  
 tf.compat.v1.floor_div_9 (TFOp  (None, 1)           0           ['tf.compat.v1.floor_div_8[0][0]'
 Lambda)                                                         ]                                
                                                                                                  
 softmax (Softmax)              (None, None, 255)    0           ['logits[0][0]']                 
                                                                                                  
 input_data_params (InputLayer)  [(None, 1)]         0           []                               
                                                                                                  
 tf.cast_4 (TFOpLambda)         (None, 1)            0           ['tf.compat.v1.floor_div_9[0][0]'
                                                                 ]                                
                                                                                                  
==================================================================================================
Total params: 1,597,915
Trainable params: 1,597,915
Non-trainable params: 0
__________________________________________________________________________________________________
None
09:51:19.351 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
09:51:19.400 INFO processor.CalamariRecognize - INPUT FILE 1 / phys_0002
09:51:19.443 INFO processor.CalamariRecognize - INPUT FILE 2 / phys_0003
09:51:19.485 INFO processor.CalamariRecognize - INPUT FILE 3 / phys_0004
09:51:19.530 INFO processor.CalamariRecognize - INPUT FILE 4 / phys_0005
09:51:19.575 INFO processor.CalamariRecognize - INPUT FILE 5 / phys_0006
09:51:19.619 INFO ocrd.process.profile - Executing processor 'ocrd-calamari-recognize' took 0.269729s (wall) 0.160561s (CPU)( [--input-file-grp='OCR-D-SEG-LINE-RESEG-DEWARP' --output-file-grp='OCR-D-OCR2' --parameter='{"checkpoint_dir": "qurator-gt4histocr-1.0", "voter": "confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}' --page-id='']
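
A quick way to confirm that no text was produced (a sketch; the fileGrp name OCR-D-OCR2 is taken from the command above):

# Count TextEquiv occurrences per output PAGE-XML file; 0 everywhere means no recognized text.
grep -c 'TextEquiv' OCR-D-OCR2/*.xml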
mikegerber commented

It would be interesting to see the files of OCR-D-SEG-LINE-RESEG-DEWARP.

stweil (Contributor, Author) commented Jan 11, 2024

Here is OCR-D-SEG-LINE-RESEG-DEWARP/OCR-D-SEG-LINE-RESEG-DEWARP_0001.xml:

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="OCR-D-SEG-LINE-RESEG-DEWARP_0001">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.49.0</pc:Creator>
        <pc:Created>2024-01-11T05:41:57.730360</pc:Created>
        <pc:LastChange>2024-01-11T05:41:57.730360</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-cis-ocropy-binarize">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="ocropy" type="method"/>
                <pc:Label value="0.5" type="threshold"/>
                <pc:Label value="False" type="grayscale"/>
                <pc:Label value="0.0" type="maxskew"/>
                <pc:Label value="0" type="noise_maxsize"/>
                <pc:Label value="0" type="dpi"/>
                <pc:Label value="page" type="level-of-operation"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.1.5" type="ocrd-cis-ocropy-binarize"/>
                <pc:Label value="2.49.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/cropping" value="ocrd-tesserocr-crop">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="0" type="dpi"/>
                <pc:Label value="4" type="padding"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.17.0 (tesseract 5.3.0-46-g1569)" type="ocrd-tesserocr-crop"/>
                <pc:Label value="2.49.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-skimage-binarize">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="li" type="method"/>
                <pc:Label value="page" type="level-of-operation"/>
                <pc:Label value="0" type="dpi"/>
                <pc:Label value="0" type="window_size"/>
                <pc:Label value="0.34" type="k"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.1.7" type="ocrd-skimage-binarize"/>
                <pc:Label value="2.49.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/despeckling" value="ocrd-skimage-denoise">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="page" type="level-of-operation"/>
                <pc:Label value="0" type="dpi"/>
                <pc:Label value="0.0" type="protect"/>
                <pc:Label value="1.0" type="maxsize"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.1.7" type="ocrd-skimage-denoise"/>
                <pc:Label value="2.49.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/deskewing" value="ocrd-tesserocr-deskew">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="page" type="operation_level"/>
                <pc:Label value="0" type="dpi"/>
                <pc:Label value="1.5" type="min_orientation_confidence"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.17.0 (tesseract 5.3.0-46-g1569)" type="ocrd-tesserocr-deskew"/>
                <pc:Label value="2.49.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="layout/segmentation/region" value="ocrd-cis-ocropy-segment">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="0" type="dpi"/>
                <pc:Label value="region" type="level-of-operation"/>
                <pc:Label value="20" type="maxcolseps"/>
                <pc:Label value="20" type="maxseps"/>
                <pc:Label value="10" type="maximages"/>
                <pc:Label value="4" type="csminheight"/>
                <pc:Label value="10" type="hlminwidth"/>
                <pc:Label value="0.01" type="gap_height"/>
                <pc:Label value="1.5" type="gap_width"/>
                <pc:Label value="True" type="overwrite_order"/>
                <pc:Label value="True" type="overwrite_separators"/>
                <pc:Label value="True" type="overwrite_regions"/>
                <pc:Label value="True" type="overwrite_lines"/>
                <pc:Label value="2.4" type="spread"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.1.5" type="ocrd-cis-ocropy-segment"/>
                <pc:Label value="2.49.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/dewarping" value="ocrd-cis-ocropy-dewarp">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="0" type="dpi"/>
                <pc:Label value="4.0" type="range"/>
                <pc:Label value="1.0" type="smoothness"/>
                <pc:Label value="0.05" type="max_neighbour"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.1.5" type="ocrd-cis-ocropy-dewarp"/>
                <pc:Label value="2.49.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
    <pc:Page imageFilename="OCR-D-IMG/OCR-D-IMG_0001.tif" imageWidth="1296" imageHeight="1855" orientation="0.124143090435354" readingDirection="left-to-right" textLineOrder="top-to-bottom">
        <pc:AlternativeImage filename="OCR-D-BIN/OCR-D-BIN_0001.IMG-BIN.png" comments=",binarized"/>
        <pc:AlternativeImage filename="OCR-D-CROP/OCR-D-CROP_0001.IMG-CROP.png" comments=",binarized,cropped"/>
        <pc:AlternativeImage filename="OCR-D-BIN2/OCR-D-BIN2_0001.IMG-BIN.png" comments=",cropped,binarized"/>
        <pc:AlternativeImage filename="OCR-D-BIN-DENOISE/OCR-D-BIN-DENOISE_0001.IMG-DEN.png" comments=",cropped,binarized,despeckled"/>
        <pc:AlternativeImage filename="OCR-D-BIN-DENOISE-DESKEW/OCR-D-BIN-DENOISE-DESKEW_0001.IMG-DESKEW.png" comments=",cropped,binarized,despeckled,deskewed"/>
        <pc:Border>
            <pc:Coords points="17,140 1296,140 1296,1805 17,1805"/>
        </pc:Border>
    </pc:Page>
</pc:PcGts>

stweil (Contributor, Author) commented Jan 11, 2024

The QuiVer benchmark workflow selected_pages_ocr uses a processing sequence which binarizes twice. That gives an image which is too light for good OCR results (some characters are even missing completely). Nevertheless, most of the text is still readable, so there should be some OCR result.

stweil (Contributor, Author) commented Jan 11, 2024

All data is now available online.

It also includes the generated page images, for example page 1 (binarized twice, denoised, deskewed).

mikegerber commented

There are no TextLines to recognize text from, so this is expected.
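
This can be checked directly on the PAGE-XML shown above, for example (a sketch, assuming xmllint is available in the container):

# Count TextLine elements in the input PAGE-XML; the file above yields 0,
# since it only contains a Border and AlternativeImages.
xmllint --xpath 'count(//*[local-name()="TextLine"])' \
    OCR-D-SEG-LINE-RESEG-DEWARP/OCR-D-SEG-LINE-RESEG-DEWARP_0001.xml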

mikegerber commented

(I'm going on vacation in 2 hours so I'm not checking where the segmentation step is missing/going wrong, but I can check when I'm back)

stweil changed the title on Jan 12, 2024: Calamari OCR does not produce text results → Benchmark workflows "selected_pages_ocr" with Calamari OCR do not produce text results
stweil changed the title on Jan 12, 2024: Benchmark workflows "selected_pages_ocr" with Calamari OCR do not produce text results → Benchmark workflows "selected_pages_ocr" do not produce text results
stweil (Contributor, Author) commented Jan 13, 2024

Commit 3b32589 removed a parameter for cis-ocropy-segment, so that processor no longer produces the required text lines. cc @mweidling.

If that parameter is added again, some tests work fine, but others fail with a runtime error in cis-ocropy-segment. See related issue for ocrd_cis.

stweil (Contributor, Author) commented Jan 14, 2024

Meanwhile I restored the line segmentation for the workflow and got OCR results at least for the tests where the segmentation process did not crash (see cisocrgroup/ocrd_cis#94). It looks like the segmentation of a single newspaper page takes several hours (the first one has now been running for 252 minutes, see cisocrgroup/ocrd_cis#98). I am afraid that the whole workflow cannot be used in the benchmark tests because of that.

stweil (Contributor, Author) commented Jan 14, 2024

The workflow selected_pages_ocr uses more than 118 GiB of RAM while running OCR with calamari-recognize for newspaper pages. A server with 128 GiB RAM starts swapping and gets nearly unusable.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                             
3023389 stweil    20   0    6648   2132   1696 S   6.2   0.0   0:00.86 bash                                                                                                                                
3023393 stweil    20   0  207.9g 118.8g  44280 S   6.2  94.5  16:10.67 ocrd-calamari-r                                                                                                                     

bertsky commented Jan 18, 2024

Commit 3b32589 removed a parameter for cis-ocropy-segment, so that processor no longer produces the required text lines. cc @mweidling.

That change is faulty btw: default is level-of-operation=region (for historic reasons), and since no prior segmentation happened, nothing will happen.
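
For illustration, a hedged sketch of what the segmentation step would have to look like to produce lines on an unsegmented page (the fileGrp names are illustrative, and whether level-of-operation=page is the value the workflow originally used would need to be checked against its history):

# Hypothetical invocation, not taken from selected_pages_ocr.txt:
ocrd-cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-LINE \
    -P level-of-operation page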

mikegerber commented

The workflow selected_pages_ocr uses more than 118 GiB of RAM while running OCR with calamari-recognize for newspaper pages. A server with 128 GiB RAM starts swapping and gets nearly unusable.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                             
3023389 stweil    20   0    6648   2132   1696 S   6.2   0.0   0:00.86 bash                                                                                                                                
3023393 stweil    20   0  207.9g 118.8g  44280 S   6.2  94.5  16:10.67 ocrd-calamari-r                                                                                                                     

On which input data is that specifically? Your upload seems not to be up-to-date.

I'd gladly reproduce and debug if I had the workspace including the segmentation used. The "configuration" used is workflows/ocrd_workflows/selected_pages_ocr.txt, I take it?

stweil (Contributor, Author) commented Feb 22, 2024

That's right, selected_pages_ocr.txt is the workflow file.

mikegerber commented

It would help if I had a workspace up to ocrd-calamari-recognize's input fileGrp (OCR-D-SEG-LINE-RESEG-DEWARP?), so I can easily reproduce this.

From the looks of it, @bertsky seems to be right (above) and the workflow still doesn't produce line segmentation (only region segmentation), so this behaviour would be even more curious.

bertsky commented Feb 23, 2024

@stweil didn't we already establish (in the OCR-D Forum) that the version of ocrd_all used by Quiver at the time was hopelessly outdated? But I agree we should get to the bottom of this – with or without line segments, ocrd-calamari-recognize should not be allowed (or motivated) to allocate large amounts of memory.

mikegerber commented

But I agree we should get to the bottom of this – with or without line segments, ocrd-calamari-recognize should not be allowed (or motivated) to allocate large amounts of memory.

Yep. Given the way it works (line-by-line processing), this shouldn't happen, but (a) I didn't test many newspaper pages myself, and did that on a host with a lot of memory, and (b) it wouldn't be the first time I've seen a memory leak with TensorFlow.

mikegerber commented

(Should probably run processors with ulimit or in a cgroup)

bertsky commented Feb 23, 2024

(Should probably run processors with ulimit or in a cgroup)

Agreed! Could also be easily done in ocrd_all Docker images. Docker itself offers options like --memory 2GB and --ulimit rss=2000000:4000000, but we could also set something in the image's /etc/profile.d ...
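
For example (a sketch only; the limit values are arbitrary, and choosing sensible per-processor limits is exactly the open question):

# Cap container memory for a single processor run (illustrative values):
docker run --rm --memory 8g --memory-swap 8g -v "$PWD:/data" -w /data \
    ocrd/all:maximum \
    ocrd-calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR2 \
    -P checkpoint_dir qurator-gt4histocr-1.0
# Or, without Docker, limit the address space in the calling shell:
ulimit -v 8388608   # 8 GiB in KiB; allocations beyond this fail instead of swapping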

mikegerber commented

(Should probably run processors with ulimit or in a cgroup)

Agreed! Could also be easily done in ocrd_all Docker images. Docker itself offers options like --memory 2GB and --ulimit rss=2000000:4000000, but we could also set something in the image's /etc/profile.d ...

I have thoughts about this (for example, I don't think profile.d would work here); should we open an issue in ocrd_all then? I have to look into the "slim image" efforts anyway.

stweil (Contributor, Author) commented Feb 23, 2024

version of ocrd_all used by Quiver at the time was hopelessly outdated

It is still outdated, see issue #23. And I don't know whether there are plans and resources to change that.

bertsky commented Feb 23, 2024

I have thoughts about this (for example, I don't think profile.d would work here); should we open an issue in ocrd_all then?

@mikegerber I added it to OCR-D/ocrd_all#280 – please add your ideas there.

mikegerber commented

Because I didn't have the workspace to debug the memory problem involving ocrd-calamari-recognize, I tried to run the selected_pages_ocr workflow on reichsanzeiger_random_selected_pages_ocr (I removed all fileGrps except OCR-D-IMG and OCR-D-GT-SEG-LINE to start with) and encountered a different problem (using the latest ocrd/all:maximum image):

15:51:21.928 INFO ocrd.task_sequence.run_tasks - Start processing task 'skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -p '{"level-of-operation": "page", "dpi": 0, "protect": 0.0, "maxsize": 1.0}''
15:51:24.932 INFO processor.SkimageDenoise - INPUT FILE 0 / P_1879_45_0344
15:51:31.599 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-skimage-denoise'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 130, in run_processor
    processor.process()
  File "/build/ocrd_wrap/ocrd_wrap/skimage_denoise.py", line 75, in process
    page_image, page_coords, page_image_info = self.workspace.image_from_page(
  File "/usr/local/lib/python3.8/site-packages/ocrd/workspace.py", line 781, in image_from_page
    raise Exception('Found no AlternativeImage that satisfies all requirements ' +
Exception: Found no AlternativeImage that satisfies all requirements selector="binarized" in page "P_1879_45_0344"

Workspace at this point, if someone wants to have a look: https://qurator-data.de/~mike.gerber/2024-02-quiver-benchmarks-issue-22/reichsanzeiger_random_selected_pages_ocr.zip (includes an ocrd.log)

At this point, I am not willing to look into this specific ocrd-calamari-recognize memory issue further, because I can't reproduce anything properly; it already involved guessing which original workspace it could have been and trying to run 7 processors. I am willing to look into it further if I get the workspace in the state before ocrd-calamari-recognize ran, including OCR-D-SEG-LINE-RESEG-DEWARP.

I'll test with some other segmentation in OCR-D/ocrd_calamari#110, just to make sure that there is no general issue.

mikegerber commented

The QuiVer benchmark workflow selected_pages_ocr uses a processing sequence which binarizes twice. That gives an image which is too light for good OCR results (some characters are even missing completely). Nevertheless, most of the text is still readable, so there should be some OCR result.

I am not sure that the images are binarized twice. The workflow runs a binarization step twice, yes, but the second binarization step may just use the original (cropped) image, via AlternativeImage.

@kba @bertsky Is this correct? Is there a way to verify this with the log? (In the ZIP in the comment above this)

bertsky commented Feb 29, 2024

@mikegerber exactly. All binarization processors avoid already binarized images on the input side (via feature_filter='binarized').
It's not a useful step IMHO, but it cannot hurt either.

The log would only detail this if you were to enable debug loggers for ocrd.workspace.
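
A minimal way to get that detail (a sketch; raising the overall log level also enables the ocrd.workspace messages, while a more targeted setup would go through an ocrd_logging.conf):

# Rerun the second binarization step with DEBUG logging; the ocrd.workspace
# messages then show which AlternativeImage was selected or filtered
# (fileGrp names taken from the workspace metadata above).
ocrd-skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 --log-level DEBUG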
