
Inconsistent Crawling Behavior with Specified Depth in Spider Scraper #9

Open
HarshJa1n opened this issue Oct 3, 2024 · 2 comments


@HarshJa1n

I am developing a spider scraper using the spider_py library and encountering issues with the crawling depth functionality. The crawling depth behavior appears inconsistent across different sites.

Issue:

For one site, I set the depth to 4, and the results were as follows:

  • Depth 1 & 2: found 1 URL
  • Depth 3: found 152 URLs
  • Depth 4: found 165 URLs

For another site with the same depth setting, the results were different:

  • Depth 1, 2 & 3: found 1 URL
  • Depth 4: found 36 URLs
  • Depth 5: found 210 URLs

Expected Behavior:

  • Depth 1: Crawl only the current page
  • Depth 2: Crawl the current page and all of the links it contains
  • Depth 3: Crawl the links found on those pages, and so on

However, the actual crawling behavior doesn't align with this definition: the number of link levels actually crawled for a given depth value differs from site to site.
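Under that definition, the expected behavior amounts to a depth-limited breadth-first crawl. A minimal sketch of those semantics (not spider's actual implementation; get_page_links is a hypothetical fetch-and-parse helper):

from collections import deque

def bfs_crawl(start_url, max_depth, get_page_links):
    # Depth-limited BFS over pages. get_page_links(url) -> list of absolute
    # URLs found on that page (hypothetical helper, not part of spider).
    seen = {start_url}
    queue = deque([(start_url, 1)])  # the start page itself counts as depth 1
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the limit are crawled, but their links are not followed
        for link in get_page_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen  # every URL reachable within max_depth levels

With max_depth=1 this returns only the start page; with max_depth=2 it adds every link found on the start page, which is exactly the behavior described above.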

Steps to Reproduce:

  1. Set the crawling depth to 4 on different websites.
  2. Observe the number of URLs found at each depth level.

Request:

Clarification on how depth is being calculated or a potential fix to make the crawling depth behave consistently across different websites.

@j-mendez
Member

j-mendez commented Oct 3, 2024


Hi, can you share example URLs and the settings used? Thanks!

@HarshJa1n
Author

Ok sure. For example:
When crawling the site brev.dev, the spider_rs library only returns 52 URLs with a depth of 4. However, at depths less than 4, the crawl results in just one URL. Similarly, when crawling the site promptfoo.dev, the library only starts returning more than one URL at a depth greater than or equal to 5.

You can probably infer the settings from this code.

Current code:

from spider_rs import Website

def fetch_links(url, config):
    # Build the site, cap the crawl depth, then crawl (headless if configured).
    website = Website(url)
    website.with_depth(config['depth'])
    website.crawl(None, None, config['use_headless'])
    return website.get_links()
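To pin down where the jump happens on a given site, one could sweep the depth setting and print the link count at each level. A quick sketch, reusing the exact calls from the snippet above (the crawl arguments and config keys are taken from that snippet, not from spider_rs documentation):

from spider_rs import Website

def depth_sweep(url, max_depth, use_headless=False):
    # Re-crawl the same site at each depth setting and report the link count.
    for depth in range(1, max_depth + 1):
        website = Website(url)
        website.with_depth(depth)
        website.crawl(None, None, use_headless)
        print(f"depth={depth}: {len(website.get_links())} links")

depth_sweep("https://brev.dev", 5)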
