A framework to create a dataset and use it to evaluate the Tesseract OCR engine on webpages. It can also be used to evaluate other OCR software.
Prerequisites:
Docker:
alias ocr='docker run ocr '
alias ocrp='docker run ocr pipenv run python'
If Docker reports 'permission denied while trying to connect to the Docker daemon socket', see:
https://www.digitalocean.com/community/questions/how-to-fix-docker-got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket
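The fix described there is adding your user to the docker group and starting a new login session:
sudo usermod -aG docker ${USER}
(then log out and back in, or run 'newgrp docker')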
pyenv:
``https://github.com/pyenv/pyenv/wiki``
(sudo apt-get update; sudo apt-get install --no-install-recommends make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev)
If in doubt: ``pipenv --rm``
``env PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 3.7.2``
``pyenv local 3.7.2``
``pip install -U setuptools``
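To verify the pinned version is active:
``python --version``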
build docker image:
docker build -t ocr .
start docker container (pipenv shell will be started automatically):
docker run -v AbsolutePathToSomeDir:/data:rw -ti ocr
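example with a concrete host path (the path is illustrative):
docker run -v /home/user/ocr-data:/data:rw -ti ocr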
example call to create dataset:
python dataset/creation/main.py -o /data/results -t 3 -v -b
main.py arguments:
-i:
a file containing one url per line to use for crawling (not needed when crawling is skipped)
-o:
path to an output folder
-t:
provide a number indicating at which entry to stop when parsing the crawl data (e.g. with '-t 3', the 3 most used stylings per attribute will be parsed)
Default:
1
-s:
enter a string indicating which steps you want to skip; if your string contains:
'c' crawling will be skipped and the program assumes crawl data at the path given by '-c'
'g' html generating will be skipped and the program assumes html data at the path given by '-g'
'r' html rendering will be skipped and the program assumes rendered data at the path given by '-r'
-c:
provide a path to where the crawl data file will be saved (in output folder)
Default:
'crawl.json'
-g:
provide a path to where the html data file will be saved (in output folder)
Default:
'html'
-r:
provide a path to where the rendered data file will be saved (in output folder)
Default:
'dataset'
-b:
adds bounding boxes to the rendered data, saved in the output folder under the '-r' path with a '_boxes' suffix
-v:
creates visualisations for the crawled data
-z:
zips the output folder
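example call that skips the crawling step and reuses existing crawl data (file names are illustrative, flags as documented above):
python dataset/creation/main.py -o /data/results -s c -c crawl.json -t 3 -v -b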
generate:
pipenv run python generate_html.py
=> './html/font_family/font_size/font_style/layout.html'
render ( & save ):
pipenv run python render_html.py
=> './dataset/font_family/font_size/font_style/layout.png'
=> './dataset/font_family/font_size/font_style/layout.txt'
(contains words and their boxes, one per line, in the format: word\t(left,top,width,height))
(first line contains path to the corresponding html file)
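A minimal sketch of how such a ground-truth file could be parsed, assuming the format above ('parse_boxes' is a hypothetical helper, not part of this repo):

```python
import re
from pathlib import Path

# lines after the first look like: word\t(left,top,width,height)
BOX_RE = re.compile(r"^(?P<word>.*)\t\((\d+),(\d+),(\d+),(\d+)\)$")

def parse_boxes(path):
    """First line: path to the corresponding html file.
    Remaining lines: one word and its bounding box each."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    html_path, boxes = lines[0], []
    for line in lines[1:]:
        m = BOX_RE.match(line)
        if m:
            left, top, width, height = map(int, m.groups()[1:])
            boxes.append((m.group("word"), (left, top, width, height)))
    return html_path, boxes
```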
zip:
pipenv run python zip_dataset.py
=> zips the 'dataset' directory with the same structure to 'dataset.zip'
evaluation:
pipenv run python evaluation.py ideal recognized
=> evaluates the recognized dataset against the ideal
True Positives
False Positives
False Negatives
Accuracy
Precision
Recall
F1-Score
(each metric is computed separately for the localisation and for the determination)
=> both datasets need to have the same directory structure
=> results in 2 files:
'evaluation_ideal_recognized.csv'
(contains the metrics listed above)
'evaluation_ideal_recognized.txt'
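For reference, a sketch of the standard formulas behind these metrics; since true negatives are not defined for detection tasks, accuracy is assumed here to be tp / (tp + fp + fn), which may differ from what evaluation.py actually computes:

```python
def metrics(tp, fp, fn):
    """Compute accuracy, precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0  # assumption, see above
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```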
reset virtual env:
pipenv --rm