I am developing a web scraper using the spider_py library and am encountering issues with the crawl-depth functionality. The depth behavior appears inconsistent across different sites.
Issue:
For one site, I set the depth to 4, and the results were as follows:
Found 1 URL, crawled with depth = 1 & 2
Found 152 URLs, crawled with depth = 3
Found 165 URLs, crawled with depth = 4
For another site with the same depth setting, the results were different:
Found 1 URL, crawled with depth = 1, 2, 3
Found 36 URLs, crawled with depth = 4
Found 210 URLs, crawled with depth = 5
Expected Behavior:
Depth 1: Crawl only the current page
Depth 2: Crawl the current page and all of its forwarded links
Depth 3: Crawl the links found on those linked pages, and so on
However, the actual behavior doesn't match this definition; the crawler sometimes goes deeper than the specified depth.
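For reference, here is a minimal sketch of the depth semantics I expect, in plain Python rather than spider_py (`fetch_links` is a hypothetical helper, not part of any library):

```python
from collections import deque

def crawl(start_url: str, max_depth: int, fetch_links) -> set[str]:
    """Breadth-first crawl where depth 1 == the start page itself.

    fetch_links is a hypothetical callable that returns the outgoing
    links of a page; a real version would fetch and parse the HTML.
    """
    seen = {start_url}
    queue = deque([(start_url, 1)])  # (url, depth); the start page is depth 1
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # never follow links past the requested depth
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Under this definition, the set of URLs should grow (or stay the same) as the depth setting increases, and nothing beyond the configured depth should ever be visited.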
Steps to Reproduce:
Set the crawling depth to 4 on different websites.
Observe the number of URLs found at each depth level.
Request:
Clarification on how depth is being calculated or a potential fix to make the crawling depth behave consistently across different websites.
Hi, can you share example urls and settings used? Thanks!
OK, sure. For example: when crawling brev.dev, the spider_rs library returns 52 URLs at a depth of 4, but at any depth below 4 the crawl returns just one URL. Similarly, when crawling promptfoo.dev, the library only starts returning more than one URL at a depth of 5 or greater.
You can infer the settings from the code below.
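Roughly like this, a minimal sketch assuming spider_rs's async Website builder API (the with_depth builder and the asyncio wrapper are assumptions based on how the library is typically used, not verbatim from my script):

```python
import asyncio
from spider_rs import Website

async def main():
    # Assumption: with_depth caps how many levels deep the crawl goes.
    website = Website("https://brev.dev").with_depth(4)
    website.crawl()
    links = website.get_links()
    print(f"depth 4 -> {len(links)} links")

asyncio.run(main())
```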