NLP (Natural Language Processing)

Natural Language Toolkit (NLTK)
Pillow -- a maintained fork of the Python Imaging Library (PIL)
Python Tesseract (pytesseract) -- a Python wrapper for the Tesseract OCR engine
Beautiful Soup -- a Python library for pulling data out of HTML and XML files
Scrapy, a fast high-level web crawling & scraping framework for Python.
pypdf & pdfminer.six
SentencePiece -- Unsupervised text tokenizer for Neural Network-based text generation
Stanza: A Python NLP Library for Many Human Languages
spaCy: Industrial-strength NLP
MinerU: A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
