
Inconsistent Crawling Behavior with Specified Depth in Spider Scraper #9

Open
HarshJa1n opened this issue Oct 3, 2024 · 2 comments


@HarshJa1n

I am developing a spider scraper using the spider_py library and encountering issues with the crawling depth functionality. The crawling depth behavior appears inconsistent across different sites.

Issue:

For one site, I set the depth to 4, and the results were as follows:

  • Depth 1 & 2: found 1 URL
  • Depth 3: found 152 URLs
  • Depth 4: found 165 URLs

For another site with the same depth setting, the results were different:

  • Depth 1, 2 & 3: found 1 URL
  • Depth 4: found 36 URLs
  • Depth 5: found 210 URLs

Expected Behavior:

  • Depth 1: Crawl only the current page
  • Depth 2: Crawl the current page and all of the links it contains
  • Depth 3: Crawl the links found on those pages, and so on

However, the actual crawling behavior doesn't align with this definition: the number of link levels actually crawled for a given depth value differs from site to site.
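Under that definition, the expected behavior amounts to a depth-limited breadth-first crawl. A minimal sketch of those semantics (not spider's actual implementation; get_page_links is a hypothetical fetch-and-parse helper):

from collections import deque

def bfs_crawl(start_url, max_depth, get_page_links):
    # Depth-limited BFS over pages. get_page_links(url) -> list of absolute
    # URLs found on that page (hypothetical helper, not part of spider).
    seen = {start_url}
    queue = deque([(start_url, 1)])  # the start page itself counts as depth 1
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the limit are crawled, but their links are not followed
        for link in get_page_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen  # every URL reachable within max_depth levels

With max_depth=1 this returns only the start page; with max_depth=2 it adds every link found on the start page, which is exactly the behavior described above.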

Steps to Reproduce:

  1. Set the crawling depth to 4 on different websites.
  2. Observe the number of URLs found at each depth level.

Request:

Clarification on how depth is being calculated or a potential fix to make the crawling depth behave consistently across different websites.

@j-mendez
Member

j-mendez commented Oct 3, 2024


Hi, can you share example URLs and the settings used? Thanks!

@HarshJa1n
Author

Ok sure. For example:
When crawling the site brev.dev, the spider_rs library only returns 52 URLs with a depth of 4. However, at depths less than 4, the crawl results in just one URL. Similarly, when crawling the site promptfoo.dev, the library only starts returning more than one URL at a depth greater than or equal to 5.

You can probably infer the settings from this code.

Current code:

from spider_rs import Website

def fetch_links(url, config):
    # Build the site, cap the crawl depth, then crawl (headless if configured).
    website = Website(url)
    website.with_depth(config['depth'])
    website.crawl(None, None, config['use_headless'])
    return website.get_links()
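To pin down where the jump happens on a given site, one could sweep the depth setting and print the link count at each level. A quick sketch, reusing the exact calls from the snippet above (the crawl arguments and config keys are taken from that snippet, not from spider_rs documentation):

from spider_rs import Website

def depth_sweep(url, max_depth, use_headless=False):
    # Re-crawl the same site at each depth setting and report the link count.
    for depth in range(1, max_depth + 1):
        website = Website(url)
        website.with_depth(depth)
        website.crawl(None, None, use_headless)
        print(f"depth={depth}: {len(website.get_links())} links")

depth_sweep("https://brev.dev", 5)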
