This is a Distributed Web Crawler Project written in C++ for the Linux platform.
- This project introduces a consistent hashing algorithm to decide the URL partitioning strategy, mitigate hot spots, and balance load across crawler nodes, giving the distributed crawler good scalability, balance, and fault tolerance (a hash-ring sketch follows this list).
- To meet the crawler's politeness and priority requirements, this project designs and implements a URL queue based on the Mercator model (see the frontier sketch below).
- Solutions are given for large-scale URL deduplication, DNS resolution, page crawling and parsing, and other key problems (a deduplication sketch appears below).
- This project designs and implements a thread pool model for efficient, multi-threaded page collection (see the thread-pool sketch below).
- A storage scheme for downloaded pages is given, which creates index files and data files to manage and store the downloaded data (see the storage sketch below).
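
The repository's own code is not reproduced here, so the following is a minimal sketch of how consistent hashing can assign URLs to crawler nodes. The node names, virtual-replica count, and use of `std::hash` are illustrative assumptions rather than the project's actual choices.

```cpp
// Minimal consistent-hash ring: maps URL keys to crawler nodes.
// Virtual replicas smooth the distribution and reduce hot spots.
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>

class ConsistentHashRing {
public:
    explicit ConsistentHashRing(size_t replicas = 100) : replicas_(replicas) {}

    void AddNode(const std::string& node) {
        for (size_t i = 0; i < replicas_; ++i)
            ring_[Hash(node + "#" + std::to_string(i))] = node;
    }

    void RemoveNode(const std::string& node) {
        for (size_t i = 0; i < replicas_; ++i)
            ring_.erase(Hash(node + "#" + std::to_string(i)));
    }

    // Walk clockwise to the first virtual node at or after the key's hash.
    std::string GetNode(const std::string& key) const {
        if (ring_.empty()) return "";
        auto it = ring_.lower_bound(Hash(key));
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }

private:
    static size_t Hash(const std::string& s) { return std::hash<std::string>{}(s); }

    size_t replicas_;
    std::map<size_t, std::string> ring_;  // hash position -> node name
};

int main() {
    ConsistentHashRing ring;
    ring.AddNode("crawler-node-1");
    ring.AddNode("crawler-node-2");
    ring.AddNode("crawler-node-3");
    std::cout << "www.example.com -> " << ring.GetNode("www.example.com") << "\n";
    ring.RemoveNode("crawler-node-2");  // only keys on node-2's arc are remapped
    std::cout << "www.example.com -> " << ring.GetNode("www.example.com") << "\n";
}
```

Because only the keys on the departed or newly added node's arc of the ring move, adding or removing a crawler node disturbs little of the existing partition, which is what gives the design its scalability and fault tolerance.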
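
In the Mercator model, URLs first enter priority ("front") queues and are then distributed into per-host ("back") queues so that requests to the same host are not issued in a burst. The sketch below follows that outline only; the class names and the round-robin host scheduling are assumptions, and a full implementation would also track the earliest next-fetch time per host.

```cpp
// Sketch of a Mercator-style URL frontier: front queues order URLs by
// priority; back queues hold one queue per host and are served round-robin.
#include <algorithm>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

struct Url {
    std::string host;
    std::string path;
    size_t priority = 0;  // smaller value = higher priority (assumption)
};

class UrlFrontier {
public:
    explicit UrlFrontier(size_t levels = 4) : front_(levels) {}

    void Push(const Url& url) {
        front_[std::min(url.priority, front_.size() - 1)].push(url);
    }

    // Pop one URL, rotating over hosts so no single host dominates.
    bool Pop(Url* out) {
        Refill();
        if (ready_hosts_.empty()) return false;
        const std::string host = ready_hosts_.front();
        ready_hosts_.pop();
        auto& q = back_[host];
        *out = q.front();
        q.pop();
        if (!q.empty()) ready_hosts_.push(host);  // re-queue host for later
        return true;
    }

private:
    // Drain front (priority) queues into the per-host back queues.
    void Refill() {
        for (auto& fq : front_) {
            while (!fq.empty()) {
                Url u = fq.front();
                fq.pop();
                if (back_[u.host].empty()) ready_hosts_.push(u.host);
                back_[u.host].push(u);
            }
        }
    }

    std::vector<std::queue<Url>> front_;                     // by priority
    std::unordered_map<std::string, std::queue<Url>> back_;  // by host
    std::queue<std::string> ready_hosts_;                    // round-robin order
};
```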
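
The README does not state which data structure the project uses for large-scale URL deduplication; a Bloom filter is one common choice because it tests membership in constant space per URL at the cost of a small false-positive rate. The sketch below is therefore an assumed illustration, not the project's implementation.

```cpp
// Illustrative Bloom-filter sketch for large-scale URL deduplication
// (an assumed technique; the project's actual structure may differ).
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class UrlSeenFilter {
public:
    UrlSeenFilter() : bits_(kBits, false) {}

    // Returns true if the URL may already have been seen (false positives
    // are possible), false if it is definitely new; records it either way.
    bool TestAndSet(const std::string& url) {
        bool maybe_seen = true;
        for (size_t i = 0; i < kHashes; ++i) {
            size_t pos = Hash(url, i) % kBits;
            if (!bits_[pos]) maybe_seen = false;
            bits_[pos] = true;
        }
        return maybe_seen;
    }

private:
    static constexpr size_t kBits = 1u << 24;  // ~16M bits, about 2 MB
    static constexpr size_t kHashes = 4;       // hash probes per URL

    // Derive several hash values from one hash function by salting the key.
    static size_t Hash(const std::string& s, size_t salt) {
        return std::hash<std::string>{}(s + '#' + std::to_string(salt));
    }

    std::vector<bool> bits_;
};
```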
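
A minimal sketch of the thread-pool idea: a fixed set of worker threads blocks on a shared task queue, and each submitted task can be one page-download job. The class and method names are assumptions for illustration.

```cpp
// Minimal fixed-size thread pool: workers block on a shared task queue.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(size_t threads) {
        for (size_t i = 0; i < threads; ++i) {
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mu_);
                        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();  // e.g. download and parse one page
                }
            });
        }
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    void Submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_ = false;
};
```

A caller would hand work to the pool with something like `pool.Submit([url] { /* fetch and parse url */ });`, letting the workers drain download jobs concurrently.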
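
One plausible reading of the index-file/data-file scheme: each downloaded page is appended to a large data file, and a fixed-size record (URL hash, byte offset, length) is appended to a companion index file so the page can later be located with a single seek. The record layout and file names below are illustrative assumptions.

```cpp
// Illustrative append-only page store: raw pages go to a data file,
// fixed-size records (url hash, offset, length) go to an index file.
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>

struct IndexRecord {
    uint64_t url_hash;  // hash of the page URL
    uint64_t offset;    // byte offset of the page in the data file
    uint64_t length;    // page size in bytes
};

class PageStore {
public:
    PageStore(const std::string& data_path, const std::string& index_path)
        : data_(data_path, std::ios::binary | std::ios::app),
          index_(index_path, std::ios::binary | std::ios::app) {}

    // Append one downloaded page and record where it was written.
    void Append(const std::string& url, const std::string& page) {
        data_.write(page.data(), static_cast<std::streamsize>(page.size()));
        IndexRecord rec;
        rec.url_hash = std::hash<std::string>{}(url);
        rec.length = page.size();
        rec.offset = static_cast<uint64_t>(data_.tellp()) - rec.length;
        index_.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
    }

private:
    std::ofstream data_;
    std::ofstream index_;
};

int main() {
    PageStore store("pages.dat", "pages.idx");  // assumed file names
    store.Append("http://www.example.com/", "<html>...</html>");
}
```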