Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CI to correctly use OCaml version matrix #296

Merged
merged 19 commits into from
Aug 13, 2021
Merged

Fix CI to correctly use OCaml version matrix #296

merged 19 commits into from
Aug 13, 2021

Conversation

sim642
Copy link
Member

@sim642 sim642 commented Jul 21, 2021

Closes #184.

Changes

This implements (a preliminary version) of the CI setup I proposed:

  1. First (fast) workflow for checking testing the locked state. This matrix just has 2 OS-s, OCaml version and Apron are fixed by the lock file.
  2. Second (slow) workflow for checking the unlocked state. This matrix uses 2 OS-s, multiple OCaml versions and availability of the optional Apron dependency.

TODOs

Some open questions still are:

  • The fast workflow should run on every push, but when should the slow one run? Nightly on schedule? And when invoked manually?
  • Should the fast workflow run domain and marshalling tests or should they be left for the slow workflow?
  • Which OCaml versions to test and thus support? Currently Goblint doesn't compile on 4.07 any more due to some newer Stdlib usage. 4.09 works, so we could make it the new minimum and have CI enforce enforce that.
  • Review caching. setup-ocaml@v2 comes with some support built-in.

@sim642 sim642 added cleanup Refactoring, clean-up testing setup Dependencies, CI, releasing labels Jul 21, 2021
@vogler
Copy link
Collaborator

vogler commented Jul 27, 2021

  • The fast workflow should run on every push, but when should the slow one run? Nightly on schedule? And when invoked manually?

Yes, nightly, manually.

  • Should the fast workflow run domain and marshalling tests or should they be left for the slow workflow?

How long do they take? Depends on how well they're covered by the reg. tests and their likelihood of failing. I assume domain might be worth it, marshalling maybe in the nightly?

  • Which OCaml versions to test and thus support? Currently Goblint doesn't compile on 4.07 any more due to some newer Stdlib usage. 4.09 works, so we could make it the new minimum and have CI enforce enforce that.

Yes, 4.09 min should be fine.

  • Review caching. setup-ocaml@v2 comes with some support built-in.

You used it here - times are the same or is it worse than what we had before?

@vogler
Copy link
Collaborator

vogler commented Jul 27, 2021

By running on push and pull_request we have 2*n runs where n are the number of runs from matrix.

We could also just run on push and opened PRs:
https://github.community/t/how-to-trigger-an-action-on-push-or-pull-request-but-not-both/16662/8

However, pull_request has the advantage that it checks the merge every time. The question is if we need that or only when ready to merge / after review.

@sim642
Copy link
Member Author

sim642 commented Jul 28, 2021

I'm surprised about the caching situation in #308, i.e. going back to v1 and doing it manually being so much faster. Because with regression tests, the situation seems to be the following:

  1. Old (silently locked) workflow on Ubuntu and 4.12 takes ~7m 16s and has ~661MB cache: https://github.com/goblint/analyzer/runs/3180831067?check_suite_focus=true.
  2. New explicitly locked workflow on Ubuntu and 4.12 takes ~5m 14s and has ~307MB cache in total: https://github.com/goblint/analyzer/runs/3123784005?check_suite_focus=true.

And from what I understand, the caching built into v2 doesn't actually cache the entire installed switch but just the local opam repository listing and downloads? How can the equivalent locked workflow be faster than the old one at all then?

Also, I guess that means we don't have to then explicitly put ${{ hashFiles('goblint.opam.locked') }} into the cache key like before, since it's not actually caching any of that? I was wondering about this when I meant to review the caching since that wouldn't work for the unlocked install anyway (newer dependencies might match the constraints, but wouldn't get installed if we have the entire switch cached). Not caching that is consistent with the fact that the setup-ocaml action is currently meant for unlocked installs: ocaml/setup-ocaml#166.

If caching the entire switch based on the lock file for the locked workflow brings additional speedup on top of what v2 currently offers, then I suppose our locked workflow could explicitly also do that, but I haven't looked into how well it would play with the v2 action. I don't like the idea of staying with the outdated v1, especially since somehow it seems to be slower despite caching the entire locked switch.

@sim642
Copy link
Member Author

sim642 commented Jul 28, 2021

By running on push and pull_request we have 2*n runs where n are the number of runs from matrix.

We could also just run on push and opened PRs:
https://github.community/t/how-to-trigger-an-action-on-push-or-pull-request-but-not-both/16662/8

However, pull_request has the advantage that it checks the merge every time. The question is if we need that or only when ready to merge / after review.

Currently I left it that way so I could see the unlocked workflow results each time, but I agree it's excessive. I think it'd be fine to limit the unlocked workflow to pull_request, nightly schedule on master and manual triggering.

@vogler
Copy link
Collaborator

vogler commented Jul 28, 2021

I'm surprised about the caching situation in #308, i.e. going back to v1 and doing it manually being so much faster. Because with regression tests, the situation seems to be the following: ...

There are several things:

  • Old calls bash -x scripts/travis-ci.sh after avsm/setup-ocaml@v1 which tries to install opam via apt ppa again whereas the new one just does opam install . --deps-only --locked and ./make.sh nat. But that's only 18s wasted.
  • 4.12.0 takes 7m16s, but 4.07.1 only takes 4m6s. Both use the same cache, but only 4.07.1 seems to be cached, whereas 4.12.0 is compiled fresh. That's ~3m6s wasted.

vogler added a commit that referenced this pull request Jul 28, 2021
We currently compile 4.12.0 every run since only 4.07.1 is cached.
See #296 (comment).
@vogler
Copy link
Collaborator

vogler commented Jul 28, 2021

Now it's at 2m56s-3m22s: https://github.com/goblint/analyzer/runs/3184132409
Needed to add opam depext for v1 which was done manually in travis-ci.sh.

@vogler
Copy link
Collaborator

vogler commented Jul 28, 2021

If caching the entire switch based on the lock file for the locked workflow brings additional speedup on top of what v2 currently offers, then I suppose our locked workflow could explicitly also do that, but I haven't looked into how well it would play with the v2 action.

With an existing cache of the switch, it complains:
https://github.com/goblint/analyzer/runs/3184417966#step:4:231

 /opt/hostedtoolcache/opam/2.0.9/x86_64/opam switch create . --no-install --packages ocaml-base-compiler.4.12.0
  [ERROR] There already is an installed switch named /home/runner/work/analyzer/analyzer

@vogler
Copy link
Collaborator

vogler commented Jul 29, 2021

Now it's at 2m56s-3m22s: https://github.com/goblint/analyzer/runs/3184132409

With v2 and its cache only it now takes 6m59s for ubuntu (and 12m39s for macos):
https://github.com/goblint/analyzer/runs/3184910273

@sim642 sim642 marked this pull request as ready for review August 11, 2021 14:13
@sim642
Copy link
Member Author

sim642 commented Aug 11, 2021

These unlocked matrix builds really highlight how fragile our CI is. Regression test timeouts are not so infrequent on GitHub despite the fact that I never see them locally. Restarting all the jobs might cause them to go through, which means that it's just some kind of resource sharing thing on GitHub. Like, is Ruby's timeout using wall time instead of cpu time and that causes it to fail if unfair scheduling doesn't give enough time for Goblint to run?

EDIT: I guess we could also increase the regression test timeout to see if it helps.

@sim642 sim642 merged commit d512465 into master Aug 13, 2021
@sim642 sim642 deleted the ocaml-ci branch August 13, 2021 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup Refactoring, clean-up setup Dependencies, CI, releasing testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decide what version of OCaml to support & Fix CI accordingly
2 participants