
Check and respect robots.txt #434

Open
wjt opened this issue Jul 11, 2024 · 0 comments
wjt commented Jul 11, 2024

I've just encountered a project that responds 403 Forbidden to requests with FEDC's User-Agent. Once I realised this was what was happening, I experimented with a few different user-agents for a few minutes, and then my IP address was blocklisted by the web server, which no longer responds to TCP connection attempts.

Searching the project's forum (via a VPN!) I came across this thread where the author refers to:

> automated third-party software in your network that disregards robots.txt or HTTP 403.

I checked the project's robots.txt, and nothing in it was being disobeyed by the tools I was using (fedc, wget, and Debian's uscan). But it's true that FEDC does not check robots.txt at all. It probably should, with a per-domain cache that would ideally persist between runs of the tool.
