Scholar crawler 学术爬虫

This project aims to crawl titles and citations from given journals with little human interaction, then download the papers from their URLs or from Sci-Hub. It achieves a 97%+ success rate on a wide range of papers.

Current version: v2.0, which adds a reCAPTCHA solver.

Google Scholar scraper.py: With good pacing, the script can crawl 1 page every 10 seconds and run for at least 100 pages before it hits a bot check. This project solves the bot check using recaptcha-challenger with an OpenAI Whisper model. You can always fall back to a manual human check with scraper_manual.py.
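
The pacing and bot-check handling described above can be sketched roughly as follows. This is a minimal illustration and not the actual code in scraper.py: `crawl_pages`, `BotCheckError`, `fetch_page`, and `solve_bot_check` are hypothetical names, and the 2-second jitter is an assumed value. The real page fetching and the recaptcha-challenger call are left to the injected callables.

```python
import random
import time
from typing import Callable, List


class BotCheckError(Exception):
    """Hypothetical signal raised by fetch_page when a bot check appears."""


def crawl_pages(
    fetch_page: Callable[[int], str],
    solve_bot_check: Callable[[], bool],
    num_pages: int,
    base_delay: float = 10.0,   # roughly 1 page per 10 seconds, as in the README
    jitter: float = 2.0,        # assumed extra random delay, not from the project
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch pages 0..num_pages-1, pausing between requests."""
    results: List[str] = []
    for page in range(num_pages):
        try:
            results.append(fetch_page(page))
        except BotCheckError:
            # Try the automatic solver; stop crawling if it cannot clear the check.
            if not solve_bot_check():
                break
            # Retry the same page once; a second bot check propagates to the caller.
            results.append(fetch_page(page))
        # Randomized pause so requests are not evenly spaced.
        sleep(base_delay + random.uniform(0, jitter))
    return results
```

When `solve_bot_check` returns `False`, the caller gets the pages collected so far and can switch to the manual flow in scraper_manual.py.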

ScihubDownloader.py : Tries to download the links scraped from Google Scholar. If that fails, it falls back to Sci-Hub. If that fails again, it falls back to the Sci-Hub backbones. There are lots of Sci-Hub mirrors, so be sure to put plenty of accessible mirrors in _get_available_scihub_urls(). It's recommended to add at least 5 mirrors for load balancing.
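
The fallback order described above could look something like the sketch below. It is an illustration under stated assumptions, not the project's implementation: `download_with_fallback` and `fetch` are hypothetical names, the `<mirror>/<doi>` URL layout is assumed, and `fetch` is assumed to return the PDF bytes (or `None`) for a URL, including any page parsing a mirror requires. The mirror list would come from something like `_get_available_scihub_urls()`.

```python
from typing import Callable, Iterable, Optional


def download_with_fallback(
    direct_url: str,
    doi: str,
    mirrors: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Try the direct link first, then each mirror in order."""
    # Assumed URL layout for mirrors: <mirror>/<doi>
    candidates = [direct_url] + [f"{m.rstrip('/')}/{doi}" for m in mirrors]
    for url in candidates:
        try:
            data = fetch(url)
        except Exception:
            # Network error, timeout, blocked mirror: move on to the next one.
            continue
        # PDF files start with the "%PDF" magic bytes; skip HTML error pages.
        if data and data.startswith(b"%PDF"):
            return data
    return None
```

Returning `None` rather than raising lets the caller record the paper as failed and continue with the next one.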

Clash clash.py: Changes the proxy server before Google gets irritated. TBD
