
Check and respect robots.txt #434

Open
wjt opened this issue Jul 11, 2024 · 0 comments
wjt commented Jul 11, 2024

I've just encountered a project that responds 403 Forbidden to requests with FEDC's User-Agent. Once I realised this was what was happening, I experimented with a few different user-agents for a few minutes, and then my IP address was blocklisted by the web server, which no longer responds to TCP connection attempts.

Searching the project's forum (via a VPN!) I came across this thread where the author refers to:

> automated third-party software in your network that disregards robots.txt or HTTP 403.

I checked the project's robots.txt, and nothing in it was being disobeyed by the tools I was using (fedc, wget, and Debian's uscan). But it's true that FEDC does not check robots.txt at all. It probably should, with a per-domain cache that would ideally persist between runs of the tool.
