
Managing Hubstorage Crawl Frontiers - The Modern Way


Previous Chapter: Managing Hubstorage Crawl Frontiers


Handling the crawl frontier with the frontera and scrapy-frontera libraries has usually been a pain in complex projects. Maintainability and traceability also suffer when the spider is the component that handles the frontier, whether for reading it or writing to it. scrapy-frontera itself has never been fully able to transparently replace the Scrapy memory and disk queues with the Hubstorage Crawl Frontier (HCF). There have always been limitations, and the effort to overcome them has led to quite complex underlying logic and, in some cases, awkward usage.

In this document we describe a different approach to relying on the HCF, fully leveraging shub-workflow in order to separate the frontier-handling role from the spider. Under this approach, other workflow components, like the crawl managers already described, take care of reading seeds from the frontier and passing them via arguments to short-lived spider jobs. The spider doesn't necessarily need to receive a url: it can build one from the received seeds. The frontier-writing role is also separated from the spider. Instead, this task is taken over by a consumer that scans the items of all finished jobs and extracts new seeds from them.
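As a minimal sketch of the spider side, the example below shows a plain Scrapy spider that accepts a seed through a spider argument and builds its start URL from it, emitting any newly discovered seeds as part of its items. The spider name, the `seed` argument, the URL pattern and the `new_seeds` field are hypothetical, chosen only for illustration.

```python
import scrapy


class SeededSpider(scrapy.Spider):
    """Hypothetical spider: receives a seed (e.g. an item id) as a spider
    argument instead of reading the frontier itself."""

    name = "seeded-example"

    def __init__(self, seed=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if seed is None:
            raise ValueError("A seed is required, e.g. -a seed=12345")
        self.seed = seed

    def start_requests(self):
        # Build the target URL from the seed rather than receiving a url directly.
        yield scrapy.Request(f"https://example.com/items/{self.seed}")

    def parse(self, response):
        # New seeds discovered during parsing are written into the item,
        # so the frontier-writing consumer can extract them later.
        yield {
            "seed": self.seed,
            "url": response.url,
            "new_seeds": response.css("a.item::attr(data-id)").getall(),
        }
```

On the frontier-writing side, a separate consumer can iterate the items of finished jobs and push the discovered seeds to the HCF. The following sketch uses python-scrapinghub's Frontiers API directly; the project id, frontier and slot names, and the `new_seeds` item field are assumptions for illustration, not a specific shub-workflow component.

```python
from scrapinghub import ScrapinghubClient

PROJECT_ID = 12345            # hypothetical project id
FRONTIER_NAME = "example"     # hypothetical frontier name
SLOT_NAME = "default"         # hypothetical slot name


def write_new_seeds_to_frontier(apikey):
    """Scan finished spider jobs, extract new seeds from their items and
    write them as requests into the HCF frontier."""
    client = ScrapinghubClient(apikey)
    project = client.get_project(PROJECT_ID)
    slot = project.frontiers.get(FRONTIER_NAME).get(SLOT_NAME)

    for job_summary in project.jobs.iter(spider="seeded-example", state="finished"):
        job = project.jobs.get(job_summary["key"])
        for item in job.items.iter():
            for seed in item.get("new_seeds", []):
                # Each frontier request is identified by its fingerprint.
                slot.q.add([{"fp": seed}])

    # Ensure buffered requests are actually written to HCF.
    slot.flush()
```

Because HCF deduplicates requests by fingerprint, writing a seed that was already enqueued is effectively a no-op, so the consumer does not need to track which seeds were seen before. In a real deployment this consumer would also need to record which jobs were already scanned, a detail omitted from the sketch.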

[WIP]