use ocrmypdf
ipitio committed Oct 17, 2024
1 parent dce811f commit 5e797d4
Showing 8 changed files with 94 additions and 104 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/predict.yml
@@ -28,9 +28,9 @@ jobs:

- name: Run inference
run: |
docker run --name ocr2pdf \
-v ./src:/app \
-v ./pdf:/app/pdf \
docker run \
-v ./src:/ocr2pdf \
-v ./pdf:/ocr2pdf/pdf \
ghcr.io/ipitio/ocr-pdf:latest \
bash predict.sh pdf
@@ -39,4 +39,4 @@ jobs:
uses: EndBug/add-and-commit@v9
with:
add: "**/*.pdf"
message: "enhanced pdfs"
message: "processed files"
46 changes: 25 additions & 21 deletions README.md
@@ -4,23 +4,29 @@

# ocr2pdf

**Convert images and scans to searchable PDFs!**
**OCRmyPDF and Merge it**

---

[![downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fipitio.github.io%2Fbackage%2Fipitio%2Focr-pdf%2Focr-pdf.json&query=%24.downloads&logo=github&logoColor=959da5&labelColor=333a41&label=pulls)](https://github.com/arevindh/pihole-speedtest/pkgs/container/pihole-speedtest) [![build](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml/badge.svg)](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml)

</div>

You can run this in your browser, on your computer, or somewhere in between, depending how much you want to automate and virtualize. The core logic resides in a Python script that you could run yourself, if you really wanted to. It extracts all the files from `todo`, transforms their pages with a pretrained LSTM RNN, and loads them into `done`. Files in subfolders will be merged in alphabetical order, but will still be available individually.
Convert images and scans to searchable and selectable (and merged) PDFs! The core logic resides in a Python script that you could run yourself, if you really wanted to. It extracts all the files from `todo`, transforms them with Tesseract via [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF), and loads them into `done`. Files in subfolders will be merged in alphabetical order, but will still be available individually.

I recommend you use either:

- The Bash script, which runs the Python script
- The Docker image, which runs the Bash script
- A Google Colab or GitHub Actions server, both of which run the Docker image

Read on to find out which is best for you!
Read on to find out which is best for you! In any case, the Bash script is, or must be, called like so:

```bash
bash /path/to/predict.sh /folder/containing/todo/ [OCRmyPDF options]
```

For more information, see the [OCRmyPDF documentation](https://ocrmypdf.readthedocs.io/en/latest).

## Fast Start

@@ -34,27 +40,21 @@ Are you on mobile or simply want an easy and seamless experience?
2. Follow the instructions in the notebook
3. Find the OCR'd files in your [Drive](https://drive.google.com/drive/my-drive)`/ocr-pdf`

To add OCRmyPDF options, append them to the `run` command in the code cell.

### Self-hosted: Prebuilt Docker Image

If you want to skip building an image, just use mine:

1. Install Docker and Compose, such as with Docker Desktop
2. Enter a new folder, add the file below, and put your files in `./pdf/todo`
3. Run the following command to OCR the files and move them to `./pdf/done`

```yaml
# compose.yml
services:
predict:
container_name: ocr2pdf
image: ghcr.io/ipitio/ocr-pdf:latest
command: bash predict.sh pdf
volumes:
- ./pdf:/app/pdf
```
1. Install Docker, such as with Docker Desktop
2. Make a new `pdf` folder and put your files in `pdf/todo`
3. Run the following command from `pdf/..` to convert the files and move them into `pdf/done`

```bash
docker compose up
docker run --rm \
-v ./pdf:/ocr2pdf/pdf \
ghcr.io/ipitio/ocr-pdf:latest \
bash predict.sh pdf [OCRmyPDF options]
```

## Quick Start
@@ -70,27 +70,31 @@ It's still easy as 1, 2, 3! You'll find the OCR'd files in `pdf/done`.
If you made a fork and cloned it, Git is your best friend!

```bash
git add pdf/*
git add .
git commit -m "add files"
git push
# wait for the magic to happen
git pull
```

To add OCRmyPDF options, edit the command in the `predict.yml` file before committing.

### Self-hosted

#### Build Docker Image

If you aren't on Linux, or want to avoid polluting your system, use Docker Compose:
If you aren't on Linux, or want to avoid polluting your system, use Docker Compose (which is included with Docker Desktop):

```bash
docker compose up
```

To add OCRmyPDF options, edit the command in the `compose.yml` file.

#### Use Bare Metal

Are you on Linux and want to make the most out of it?

```bash
bash src/predict.sh pdf
bash src/predict.sh pdf [OCRmyPDF options]
```
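
Note on the README hunk above: it describes a `todo` to `done` flow in which every input ends up as a `.pdf` at the same relative location. The following is a minimal sketch of that mapping, mirroring the `relative_to` and `with_suffix` calls visible in `src/main.py`. The function name `output_path_for` and the sample file names are hypothetical and only serve the illustration.

```python
from pathlib import PurePosixPath


def output_path_for(base: PurePosixPath, input_file: PurePosixPath) -> PurePosixPath:
    """Map <base>/todo/<relative path> to <base>/done/<relative path>.pdf."""
    # Path of the input relative to the todo folder, e.g. book/page01.png
    relative = input_file.relative_to(base / "todo")
    # Same relative location under done, with the extension swapped for .pdf
    return base / "done" / relative.with_suffix(".pdf")


base = PurePosixPath("pdf")
print(output_path_for(base, base / "todo" / "book" / "page01.png"))
# prints pdf/done/book/page01.pdf
```

Subfolders are preserved, which is presumably what lets the later merge step group files by folder.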
60 changes: 24 additions & 36 deletions colab.ipynb
@@ -26,18 +26,9 @@
"\n",
"## Steps\n",
"\n",
"1. Make two new folders, one inside the other\n",
" - The outer one can be named anything, say `pdf`\n",
" - The inner one must be named `todo`\n",
"2. Place your files in the `todo` folder\n",
" - Those by themselves will just be converted\n",
" - Those inside subfolders will also be merged in alphabetical order\n",
"3. Share the outer `pdf` folder with this notebook\n",
" - Zip the folder\n",
" - Open this notebook in [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb)\n",
" - Run the cell below to be prompted to connect Drive and upload the zip\n",
"\n",
"You'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected\n"
"To merge files, organize them into folders and zip each one. Ensure the files are named in alphabetical order, as they will be merged in that order. If you'd like to add any options for [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest), append them to the `run` line in the cell below. At the end, you'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected.\n",
"\n",
"1. Run the cell below to get prompted to connect Drive and upload your files and/or zipped folders\n"
]
},
{
@@ -58,34 +49,31 @@
"\n",
"# Extract your PDFs\n",
"files.upload()\n",
"\n",
"# Get the name of the zip file\n",
"pdfs = [pdf for pdf in os.listdir() if pdf.endswith(\".zip\")]\n",
"if len(pdfs) == 0:\n",
" raise Exception(\"No ZIP file found\")\n",
"![ -d pdf ] || mkdir pdf\n",
"![ -d pdf/todo ] || mkdir pdf/todo\n",
"![ -d pdf/done ] || mkdir pdf/done\n",
"!unzip -o \"*.zip\" -d pdf/todo 2>/dev/null\n",
"!rm -f *.zip\n",
"!mv *.* pdf/todo 2>/dev/null\n",
"\n",
"# Transform them\n",
"%pip install udocker\n",
"!udocker --allow-root install\n",
"\n",
"for pdf in pdfs:\n",
" !unzip -o \"$pdf\"\n",
" !rm -f \"$pdf\"\n",
" !udocker --allow-root run -v /content/\"$pdf\":/app/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf\n",
" converted = os.listdir(\"$pdf/done\")\n",
"\n",
" # And load\n",
" if drive and len(converted) > 0:\n",
" ![ -d \"drive/MyDrive/ocr-pdf\" ] || mkdir \"drive/MyDrive/ocr-pdf\"\n",
" !\\cp -r \"$pdf/done/\"* \"drive/MyDrive/ocr-pdf/\"\n",
"\n",
" if len(converted) == 1 and os.path.isfile(\"$pdf/done/\" + converted[0]):\n",
" files.download(\"$pdf/done/\" + converted[0])\n",
" elif len(converted) > 0:\n",
" !zip -r \"$pdf.zip\" \"$pdf/done\"\n",
" files.download(\"$pdf.zip\")\n",
" else:\n",
" print(\"No PDFs found\")"
"!udocker --allow-root run -v /content/pdf:/ocr2pdf/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf\n",
"converted = os.listdir(\"pdf/done\")\n",
"\n",
"# And load\n",
"if drive and len(converted) > 0:\n",
" ![ -d \"drive/MyDrive/ocr-pdf\" ] || mkdir \"drive/MyDrive/ocr-pdf\"\n",
" !\\cp -r \"pdf/done/\"* \"drive/MyDrive/ocr-pdf/\"\n",
"\n",
"if len(converted) == 1 and os.path.isfile(\"pdf/done/\" + converted[0]):\n",
" files.download(\"pdf/done/\" + converted[0])\n",
"elif len(converted) > 0:\n",
" !zip -r \"pdf.zip\" \"pdf/done\"\n",
" files.download(\"pdf.zip\")\n",
"else:\n",
" print(\"No PDFs found\")"
]
}
],
6 changes: 3 additions & 3 deletions compose.yml
@@ -2,7 +2,7 @@ services:
predict:
container_name: ocr2pdf
build: ./src
command: bash predict.sh pdf
command: bash predict.sh pdf -l eng+fra
volumes:
- ./src:/app
- ./pdf:/app/pdf
- ./src:/ocr2pdf
- ./pdf:/ocr2pdf/pdf
5 changes: 3 additions & 2 deletions src/Dockerfile
@@ -1,4 +1,5 @@
FROM python:3.11-slim
WORKDIR /app
FROM jbarlow83/ocrmypdf-ubuntu:v16.5.0
WORKDIR /ocr2pdf
COPY . .
RUN bash predict.sh
ENTRYPOINT []
50 changes: 22 additions & 28 deletions src/main.py
@@ -3,18 +3,17 @@
"""

import os
import subprocess
import sys
from pathlib import Path

import pymupdf
import pytesseract
from joblib import Parallel, delayed
from natsort import natsorted, ns
from pdf2image import convert_from_path
from PIL import Image


def predict(base: Path, input_file: Path) -> None:
def predict(base: Path, input_file: Path, args: list[str]) -> None:
"""
Predicts the text in the input file and saves it to the output file
@@ -23,35 +22,28 @@ def predict(base: Path, input_file: Path) -> None:
input_file (Path): The input file
"""
relative_path = input_file.relative_to(base / "todo")
output_file = base / "done" / relative_path.with_suffix(".pdf")

if str(input_file).lower().endswith(".pdf"):
pages = convert_from_path(input_file, fmt="jpeg")
else:
try:
pages = [Image.open(input_file)]
except Exception:
return

print(f"Processing {relative_path}...")
doc = pymupdf.open()

for page in pages:
doc.insert_pdf(pymupdf.open("pdf", pytesseract.image_to_pdf_or_hocr(page)))
try:
if not str(input_file).lower().endswith(".pdf"):
image = Image.open(input_file)
image.convert("RGB").save(input_file, dpi=image.info.get("dpi", (300, 300)))

if not output_file.parent.exists():
output_file = base / "done" / relative_path.with_suffix(".pdf")
output_file.parent.mkdir(exist_ok=True, parents=True)

doc.save(output_file, garbage=4, deflate=True)
doc.close()

try:
subprocess.run(
[
"bash",
"-c",
f"ocrmypdf --jobs 1 {' '.join(args)} {input_file} {output_file}",
],
check=True,
)
input_file.unlink()
except subprocess.CalledProcessError:
print(f"Failed to process {relative_path}")
except Exception:
pass

print(f"Processed {relative_path}")


if __name__ == "__main__":
pdfs = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
@@ -60,7 +52,11 @@ def predict(base: Path, input_file: Path) -> None:
(pdfs / "done").mkdir(exist_ok=True, parents=True)

Parallel(n_jobs=-1)(
delayed(predict)(pdfs, Path(root) / file)
delayed(predict)(
pdfs,
Path(root) / file,
sys.argv[2:] if len(sys.argv) > 2 else ["--rotate-pages", "--deskew", "--skip-text", "--invalidate-digital-signatures", "--clean"],
)
for root, _, files in os.walk(pdfs / "todo")
for file in files
)
@@ -96,5 +92,3 @@ def predict(base: Path, input_file: Path) -> None:

for pdf in pdf_list:
pdf.close()

print("Done")
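
Note on the `src/main.py` hunks above: the new code interpolates the options and both paths into a single `bash -c` string, so a file name containing spaces would probably be split into several arguments. As a hedged alternative (not what this commit does), the command could be assembled as an argument list and handed to `subprocess.run` without a shell. The helper name `build_command` and the constant `DEFAULT_ARGS` are hypothetical; the flags themselves are the defaults shown in the diff.

```python
import shlex

# Defaults copied from the diff; used when the caller passes no options.
DEFAULT_ARGS = [
    "--rotate-pages",
    "--deskew",
    "--skip-text",
    "--invalidate-digital-signatures",
    "--clean",
]


def build_command(input_file, output_file, args=None):
    """Return the ocrmypdf invocation as a list, one element per argument."""
    options = list(args) if args else DEFAULT_ARGS
    # Each path stays a single list element, so spaces need no quoting.
    return ["ocrmypdf", "--jobs", "1", *options, str(input_file), str(output_file)]


cmd = build_command("pdf/todo/my scan.pdf", "pdf/done/my scan.pdf", ["-l", "eng+fra"])
# shlex.join shows how the list would look if typed into a shell.
print(shlex.join(cmd))
```

The resulting list could then be run with `subprocess.run(cmd, check=True)`, keeping the existing `CalledProcessError` handling unchanged.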
21 changes: 13 additions & 8 deletions src/predict.sh
@@ -2,24 +2,29 @@
# shellcheck disable=SC1091,SC2015

apt_install() {
apt-get install -y python3 python3-pip python3-venv tesseract-ocr poppler-utils git
# shellcheck disable=SC2068
apt-get install -y python3 python3-pip python3-venv tesseract-ocr poppler-utils git ocrmypdf $@
}

if ! apt_install 2>/dev/null; then
main() {
find . -name requirements.txt -exec pip3 install --user --root-user-action ignore --break-system-packages --no-cache-dir -r {} \;
[ -z "$1" ] || find . -name main.py -exec python3 {} "${@:1}" \;
}

langs=$(echo "$*" | grep -oP '(?<=-l )[^ ]+' | tr '+' '\n' | sed 's/^/tesseract-ocr-/' | sort -u | tr '\n' ' ')
if ! apt_install "$langs" 2>/dev/null; then
apt-get update
apt_install
apt_install "$langs"
fi

[ -d venv ] || python3 -m venv venv
export OMP_THREAD_LIMIT=1

if [[ -f venv/bin/pip3 ]]; then
if [[ -e venv/bin/pip3 ]]; then
source venv/bin/activate
find . -name requirements.txt -exec ./venv/bin/pip3 install --no-cache-dir -r {} \;
[ -z "$1" ] || find . -name main.py -exec ./venv/bin/python3 {} "$1" \;
main "${@}"
deactivate
elif [[ -f /.dockerenv ]]; then
[[ ":$PATH:" == *":/root/.local/bin:"* ]] || export PATH=$PATH:/root/.local/bin
pip3 install -r requirements.txt --user --break-system-packages
[ -z "$1" ] || python3 ./main.py "$1"
main "${@}"
fi
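
Note on the `langs=` line in the `src/predict.sh` hunk above: the `grep | tr | sed | sort -u` pipeline turns the value of each `-l` option into a list of Tesseract language packages for `apt-get`. Below is a rough Python equivalent, written only to make the transformation easier to follow. The function name `tesseract_packages` is hypothetical, and the sketch assumes language codes are plain ASCII, so Python's sort order matches `sort -u`.

```python
import re


def tesseract_packages(args):
    """Derive tesseract-ocr-<lang> package names from '-l a+b' style options."""
    joined = " ".join(args)  # same flattening as "$*" in the shell script
    codes = set()
    # Same lookbehind as the grep -oP pattern: the token that follows "-l "
    for value in re.findall(r"(?<=-l )[^ ]+", joined):
        codes.update(value.split("+"))  # tr '+' '\n'
    # sed adds the prefix; the set plus sorted() plays the role of sort -u
    return sorted(f"tesseract-ocr-{code}" for code in codes)


print(tesseract_packages(["pdf", "-l", "eng+fra"]))
# prints ['tesseract-ocr-eng', 'tesseract-ocr-fra']
```

With the `-l eng+fra` option added to `compose.yml` in this commit, the script would therefore ask `apt-get` for the English and French language data before running OCRmyPDF.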
2 changes: 0 additions & 2 deletions src/requirements.txt
@@ -1,5 +1,3 @@
pytesseract==0.3.13
pdf2image==1.17.0
PyMuPDF==1.24.11
pillow==10.4.0
joblib==1.4.2