
Managing Hubstorage Crawl Frontiers ‐ The Modern way


Previous Chapter: Managing Hubstorage Crawl Frontiers


Handling the crawl frontier with the frontera and scrapy-frontera libraries has usually been a pain in complex projects. Maintainability and traceability also suffer when the spider is the component that handles the frontier, either for reading or writing it. scrapy-frontera itself has never been able to fully and transparently replace the Scrapy memory and disk queues with the Hubstorage Crawl Frontier (HCF) for all use cases. There have always been limitations, and the effort to overcome them has led to quite complex underlying logic and configuration and, in some cases, usability problems.

For legacy purposes we keep the documents related to the frontera + scrapy-frontera approach:

Managing Hubstorage Crawl Frontiers with Frontera

In this document we start to describe a different approach to relying on the HCF frontier. Under this approach the spider is not aware of the HCF, so you don't need to design or redesign the spider logic around HCF handling. Instead, other workflow components, like the crawl managers already described, take care of reading seeds from the frontier and passing them via arguments to short-lived spider jobs. The spider doesn't necessarily need to receive a URL: it can build one from the received seeds, and it knows nothing about the HCF frontier.
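For illustration, here is a minimal sketch of such an HCF-unaware spider. It assumes the crawl manager passes a comma-separated `seeds` argument; the argument name, the seed format, the `new_seeds` item field and the URL template are illustrative, not part of any library API:

```python
import scrapy


class MySpider(scrapy.Spider):
    """A spider that knows nothing about HCF: it just receives seeds as a job argument."""

    name = "myspider"

    def __init__(self, seeds="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Seeds are plain identifiers (e.g. product ids) passed by the crawl manager.
        self.seeds = [s for s in seeds.split(",") if s]

    def start_requests(self):
        for seed in self.seeds:
            # The spider builds the target URL from the seed; no URL needs to be passed.
            yield scrapy.Request(f"https://example.com/product/{seed}", callback=self.parse)

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            # New seeds discovered on the page; a separate consumer will later read
            # them from the finished job's items and write them to the frontier.
            "new_seeds": response.css("a.product::attr(data-id)").getall(),
        }
```

Such a spider can be run and tested exactly like any other, e.g. `scrapy crawl myspider -a seeds=123,456`, with no frontier involved at all.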

The frontier writing role is also separated from the spider. In general, one main problem of frontier writing is that only one process can write to a given slot at a time, and this fact also contributes to a more complex and less transparent design when spiders assume that role. In the approach described here, this task is taken over by a consumer that typically scans the items of all finished jobs and extracts new seeds from them.
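As a sketch of this writer side, the following uses the python-scrapinghub client to scan the items of finished jobs and push the seeds they expose into a single HCF slot. The project id, frontier name, slot name, spider name and the `new_seeds` item field are assumptions for the example:

```python
from scrapinghub import ScrapinghubClient

PROJECT_ID = 12345          # assumed project id
FRONTIER = "myfrontier"     # assumed frontier name
SLOT = "0"                  # the single slot used by this crawl
SPIDER = "myspider"         # spider whose finished jobs are scanned

client = ScrapinghubClient()                 # reads the API key from the environment
project = client.get_project(PROJECT_ID)
slot = project.frontiers.get(FRONTIER).get(SLOT)

# Scan the finished jobs of the spider and extract new seeds from their items.
for job_summary in project.jobs.iter(spider=SPIDER, state="finished"):
    job = project.jobs.get(job_summary["key"])
    for item in job.items.iter():
        for seed in item.get("new_seeds", []):
            # The seed itself is used as the request fingerprint, so HCF
            # deduplicates seeds already written in previous runs.
            slot.q.add([{"fp": seed}])

slot.flush()
```

In practice the consumer also needs to remember which jobs it has already processed (for example by tagging them), a detail omitted in this sketch.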

An advantage of this approach is that scaling to any number of parallel jobs is easy. You don't need to predefine it. A single HCF slot per crawl is used: a single process reads it, a single process writes to it, and the number of parallel spider jobs can be increased or decreased at any time, with no slot re-engineering.
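The reading side can be equally small. The sketch below, again based on the python-scrapinghub client, reads batches of seeds from the single slot and schedules one short-lived spider job per batch. The names used are the same assumptions as above, and the batch layout (an id plus a list of [fingerprint, qdata] pairs) is how the HCF queue is assumed to deliver requests; in a real project this role is typically played by a crawl manager like the ones described in previous chapters.

```python
from scrapinghub import ScrapinghubClient

PROJECT_ID = 12345          # assumed project id
FRONTIER = "myfrontier"     # assumed frontier name
SLOT = "0"                  # the single slot used by this crawl
SPIDER = "myspider"         # the HCF-unaware spider scheduled with seed batches

client = ScrapinghubClient()
project = client.get_project(PROJECT_ID)
slot = project.frontiers.get(FRONTIER).get(SLOT)

# Read batches of requests from the slot. Each batch is assumed to carry an id
# (needed to delete it once consumed) and a list of [fingerprint, qdata] pairs.
for batch in slot.q.iter():
    seeds = [fp for fp, _qdata in batch["requests"]]
    # Schedule a short-lived spider job that receives the seeds as a plain argument.
    project.jobs.run(SPIDER, job_args={"seeds": ",".join(seeds)})
    # Remove the consumed batch so it is not delivered again.
    slot.q.delete(batch["id"])
```

Because the reader and the writer are each a single process, the slot never has concurrent writers, while the number of spider jobs launched from the consumed batches can grow or shrink freely.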

[WIP]


Next Chapter: Monitors