nightlies and link validation failing because of repository.apache.org blockage #1585

Open
raboof opened this issue Dec 16, 2024 · 6 comments


@raboof
Member

raboof commented Dec 16, 2024

Our nightlies and link validation sometimes fail when they run on a GitHub Actions runner that is blocked from repository.apache.org.

Infra seems open to creating per-project buckets for the abuse thresholds, but we'd have to add a header to the requests to identify ourselves.

Looks like this would depend on coursier/coursier#1203

@pjfanning
Contributor

I wonder if we could try ordering the resolvers in sbt.

I've seen failures where we get issues loading 3rd party jars because our sbt setup seems to check repository.apache.org before checking maven central. Ideally, repository.apache.org should be checked last.

@Humbedooh
Member

https://brettporter.wordpress.com/2009/06/16/configuring-maven-http-connections/ suggests you can set a custom user agent header for the requests. We could make use of this, if we come up with a standard format for denoting ASF projects. This would allow us both to tailor rules to be more lenient in these cases and to debug which projects or builds are causing issues.
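
To make the idea of a self-identifying header concrete, here is a minimal sketch of a request that names the project and CI job in its User-Agent. The `asf-project/<project> (<job>)` layout and both helper names are hypothetical; no standard format has been agreed on in this thread.

```python
from urllib.request import Request


def asf_user_agent(project: str, job: str) -> str:
    """Build a User-Agent value that names the project and CI job.

    The "asf-project/<project> (<job>)" layout is a made-up placeholder,
    not a format that Infra has agreed on.
    """
    return f"asf-project/{project} ({job})"


def identified_request(url: str, project: str, job: str) -> Request:
    """Create a request object (without sending it) that carries the header."""
    return Request(url, headers={"User-Agent": asf_user_agent(project, job)})


req = identified_request("https://repository.apache.org/", "pekko", "nightly")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

With a header like this, server-side rules could be keyed on the project name instead of the runner's IP address, which is what per-project buckets would need.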

@raboof
Member Author

raboof commented Dec 16, 2024

> I wonder if we could try ordering the resolvers in sbt.
>
> I've seen failures where we get issues loading 3rd party jars because our sbt setup seems to check repository.apache.org before checking maven central. Ideally, repository.apache.org should be checked last.

I agree that would be a good thing to keep an eye on. 'Normal' CI builds shouldn't reference repository.apache.org at all, though, right? And even when including repository.apache.org, I think sbt should use Maven Central first regardless of what additional things we put into resolvers (e.g. sbt/sbt#1138).

> https://brettporter.wordpress.com/2009/06/16/configuring-maven-http-connections/ suggests you can set a custom user agent header for the requests

Yes (or arbitrary other headers). Pekko uses sbt instead of mvn to access the Maven repository, though, so that'd need a separate change.

@pjfanning
Contributor

pjfanning commented Dec 31, 2024

@raboof One source of the strain we put on repository.apache.org is https://github.com/pjfanning/sbt-pekko-build

This has logic to find the latest snapshot versions by scraping pages served by repository.apache.org.

@raboof
Member Author

raboof commented Jan 3, 2025

We haven't seen GitHub Actions runners get blocked by the "too many 404's on repository.apache.org" rule since apache/ranger#435 was merged. I have now removed all those bans (ack'ed by infra).

That should help, but GitHub Actions runners are still being banned for Bugzilla scraping (> 800 req/hr to show_bug.cgi). I guess we should also look into whether those are 'real' scrapers or some misconfigured job somewhere.
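
For anyone trying to tell a scraper from a misconfigured job, a rough sketch of checking request timestamps against an hourly threshold might help. It assumes a sliding one-hour window and timestamps in seconds; how Infra actually evaluates the rule is not stated here, and `exceeds_rate` is a hypothetical helper.

```python
from collections import deque


def exceeds_rate(timestamps, limit=800, window_seconds=3600):
    """Return True if more than `limit` requests fall inside any sliding window.

    `timestamps` are request times in seconds. The sliding-window reading of
    the "> 800 req/hr" rule is an assumption, not Infra's documented behaviour.
    """
    window = deque()
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop requests that are a full window (or more) older than this one.
        while ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > limit:
            return True
    return False


# One request every 4 seconds: 801 requests fit inside a single hour.
print(exceeds_rate([i * 4 for i in range(801)]))
```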

@raboof
Member Author

raboof commented Jan 3, 2025

Looks like this might be bingbot; filed https://issues.apache.org/jira/browse/INFRA-26405 to get a robots.txt in place.
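
As an illustration of what such a robots.txt could do, the sketch below parses a hypothetical rule set with Python's standard `urllib.robotparser` and checks which fetches it permits. The rules, the `bugs.example.org` host, and the `allowed` helper are placeholders; the file that comes out of INFRA-26405 may look quite different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the real file requested in INFRA-26405 may differ.
ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /show_bug.cgi

User-agent: *
Allow: /
"""


def allowed(agent: str, url: str) -> bool:
    """Check whether `agent` may fetch `url` under the rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("bingbot", "https://bugs.example.org/show_bug.cgi?id=1"))
print(allowed("some-ci-job", "https://bugs.example.org/show_bug.cgi?id=1"))
```

Note that robots.txt only helps with crawlers that choose to honor it; it does not block anything on its own.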
