- Natural Language Toolkit (NLTK)
- Pillow -- Python Imaging Library
- Python Tesseract
- Beautiful Soup -- a Python library for pulling data out of HTML and XML files
- Scrapy -- a fast, high-level web crawling & scraping framework for Python
- pypdf & pdfminer.six
- SentencePiece -- Unsupervised text tokenizer for Neural Network-based text generation
- Stanza: A Python NLP Library for Many Human Languages
- spaCy: Industrial-strength NLP
- MinerU: A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction