Convert PDF files to nicely structured Markdown and EPUB format with intelligent layout detection using AI.

overcuriousity/pdf2epub

PDF2EPUB 📚

Convert PDF files to nicely structured Markdown and EPUB format with intelligent layout detection.

✨ Features

  • 📖 Smart layout detection for books and academic papers
  • 🔍 Advanced text extraction and OCR capabilities
  • 📊 Table detection and formatting
  • 🖼️ Image extraction and optimization
  • 📝 Clean markdown output with preserved structure
  • 📱 EPUB generation with customizable styling
  • 🌍 Multi-language support
  • 🚀 GPU acceleration support (NVIDIA & AMD)

🛠️ Dependencies

  • Python 3.9+
  • PyTorch (with CUDA/ROCm support for GPU acceleration)
  • marker-pdf==0.3.10
  • transformers==4.45.2
  • markdown==3.7

💻 Installation

  1. Install Python dependencies:
pip install -r requirements.txt
  2. Install PyTorch:
  • For NVIDIA GPUs, install with CUDA support:
pip install torch torchvision torchaudio
  • For AMD GPUs, replace any existing PyTorch build with the ROCm build:
pip3 uninstall torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
  3. Verify GPU support:
import torch
print(torch.__version__)  # PyTorch version
print(torch.cuda.is_available())  # True when a usable GPU is found (NVIDIA, or AMD on a ROCm build)
print(torch.version.hip)  # ROCm (HIP) version on AMD builds; None otherwise
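
The two values printed in the verification step can be combined into a single backend label. The following is a minimal sketch, not part of pdf2epub: `describe_backend` is a hypothetical helper name, and the sketch assumes that ROCm builds of PyTorch report availability through `torch.cuda.is_available()` while setting `torch.version.hip`.

```python
def describe_backend(cuda_available, hip_version):
    """Classify the PyTorch backend from the two values printed above.

    Hypothetical helper for illustration; not part of pdf2epub.
    """
    if not cuda_available:
        return "cpu"
    # ROCm builds set torch.version.hip to a version string; CUDA builds leave it None.
    return "rocm" if hip_version else "cuda"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        hip = getattr(torch.version, "hip", None)
        print(describe_backend(torch.cuda.is_available(), hip))
```

If this reports `cpu` on a machine with a GPU, the installed PyTorch build most likely does not match the GPU vendor.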

🚀 Usage

Basic Usage

Convert a single PDF file:

python main.py input.pdf

Convert all PDFs in a directory:

python main.py input_directory/

Advanced Options

python main.py [input_path] [output_path] [options]

Options:
  --batch-multiplier INT    Batch size multiplier for memory/speed tradeoff (default: 2)
  --max-pages INT           Maximum number of pages to process
  --start-page INT          Page number to start from
  --langs STRING            Comma-separated list of languages in document
  --skip-epub               Skip EPUB generation, only create markdown
  --skip-md                 Skip markdown generation, use existing markdown files
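
For readers who want to see how an option set like this maps onto parsed values, here is a sketch using Python's standard `argparse`. It is an illustration only and is not taken from the project's `main.py`: the function names `build_parser` and `split_langs` are hypothetical, and the sketch assumes `input_path` is required and `output_path` is optional.

```python
import argparse


def build_parser():
    """Declare the options listed above; an illustrative sketch, not the project's main.py."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input_path", help="PDF file or directory of PDFs")
    # Assumption: the output path may be omitted.
    parser.add_argument("output_path", nargs="?", default=None, help="output directory")
    parser.add_argument("--batch-multiplier", type=int, default=2)
    parser.add_argument("--max-pages", type=int, default=None)
    parser.add_argument("--start-page", type=int, default=None)
    parser.add_argument("--langs", type=str, default=None)
    parser.add_argument("--skip-epub", action="store_true")
    parser.add_argument("--skip-md", action="store_true")
    return parser


def split_langs(value):
    """Split a comma-separated language string into a list; None when the flag is absent."""
    if value is None:
        return None
    return [lang.strip() for lang in value.split(",") if lang.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args(["paper.pdf", "--langs", "English,German", "--skip-epub"])
    print(args.batch_multiplier, split_langs(args.langs), args.skip_epub)
```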

Examples

Process a specific range of pages:

python main.py book.pdf --start-page 10 --max-pages 50

Process a multi-language document:

python main.py paper.pdf --langs "English,German"

Convert to markdown only:

python main.py thesis.pdf --skip-epub
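
When several documents need different settings, the commands above can be assembled from a script. This is a hedged sketch: `build_command` is a hypothetical helper, it uses only the flags documented in this README, and it assumes `main.py` is in the current working directory.

```python
import sys


def build_command(pdf, start_page=None, max_pages=None, langs=None, skip_epub=False):
    """Assemble an argument list for main.py using only the flags documented above."""
    cmd = [sys.executable, "main.py", str(pdf)]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if langs:
        cmd += ["--langs", ",".join(langs)]
    if skip_epub:
        cmd.append("--skip-epub")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it.
    # To execute it, pass the list to subprocess.run(cmd, check=True).
    print(build_command("paper.pdf", langs=["English", "German"], skip_epub=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.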

Output Structure

output_directory/
├── document_name/
│   ├── document_name.md
│   ├── document_name.epub
│   ├── document_name_metadata.json
│   └── images/
│       ├── image1.png
│       ├── image2.jpg
│       └── ...
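
The layout above can be checked from a script after a conversion run. The sketch below is not part of pdf2epub; the helper names are hypothetical, and it assumes that `document_name` is the PDF file name without its extension.

```python
from pathlib import Path


def expected_outputs(output_dir, pdf_path):
    """Map a PDF to the files shown in the tree above.

    Assumption: document_name equals the PDF's file name without extension.
    """
    name = Path(pdf_path).stem
    doc_dir = Path(output_dir) / name
    return {
        "markdown": doc_dir / f"{name}.md",
        "epub": doc_dir / f"{name}.epub",
        "metadata": doc_dir / f"{name}_metadata.json",
        "images": doc_dir / "images",
    }


def missing_outputs(output_dir, pdf_path):
    """Return the keys whose expected path does not exist on disk."""
    outputs = expected_outputs(output_dir, pdf_path)
    return [key for key, path in outputs.items() if not path.exists()]
```

After a run with `--skip-epub`, `missing_outputs` would be expected to include `"epub"`, since that file is never generated.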

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a new branch for your feature
  3. Commit your changes
  4. Push to your branch
  5. Create a Pull Request

Please ensure your code follows the existing style and includes appropriate tests.

Development Setup

  1. Clone the repository:
git clone https://github.com/yourusername/pdf2epub.git
cd pdf2epub
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows
  3. Install development dependencies:
pip install -r requirements.txt

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🐛 Known Issues

  • Some embedded images might need manual adjustment
  • Some complex mathematical equations might not be perfectly converted
  • Certain PDF layouts with multiple columns may require manual adjustment
  • Font detection might be imperfect in some cases

🙏 Acknowledgments

This project builds upon several excellent open-source libraries:
