Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: jobs temporary disappeared while recalculating (analysis) #202

Open
kit-ty-kate opened this issue Jan 25, 2023 · 4 comments
Open

BUG: jobs temporary disappeared while recalculating (analysis) #202

kit-ty-kate opened this issue Jan 25, 2023 · 4 comments
Labels

Comments

@kit-ty-kate
Copy link
Contributor

Happened in ocaml/opam-repository#23047 (comment) (noticed live)

  1. all the jobs for the PR finished and was all green
  2. i merged some PRs (many in a row)
  3. (analysis) is thus recalculated
  4. but all the other jobs disappeared from the web-ui and the github status moved back from green to active
  5. waited for the (analysis) job to finish and all the previous jobs came back directly without any rebuild

4) should not happen as this job is latched

@kit-ty-kate
Copy link
Contributor Author

@art-w so far, to me it seems that the pattern is:

  1. some PR gets merge
  2. the analysis restarts
  3. it fails for whatever reason
  • maybe it's the cancellation that's causing that
  • or maybe just too many jobs and git don't like that many concurrent calls/ocurrent doesn't handle it too well
  • probably a combination of both
  1. then get restarted by another merge

i can see some of these PRs change to a github status with a red cross instead of "in progress"

@art-w
Copy link
Contributor

art-w commented Jan 26, 2023

Yes you are absolutely right!

When a PR gets merged, master is updated: In practice only the analysis is strict about using the latest HEAD, but master is still an ocurrent dependency of all the other jobs (lint / build) (master gets past to those Current_cache as "extra context" meaning that previous runs with an older master are reused for free, but ocurrent still has to traverse the whole graph whenever master changes -- at the scale of opam-repo-ci this might explains the freezes following a merge?)

I think we could cheat a bit here by gating master such that the latest master value would only be propagated if the analysis' result had also changed (.. and this should produce the same outcome as today ^^') Using ocurrent/ocurrent#330 this ugly code might work:

let analysis = Analyse.examine ~master src in (* same as before *)
let master_analysis =
  Current.cutoff
     (* keep the previous master if the analysis hasn't changed *)
     ~eq:(fun (_old_master, old_analysis) (_new_master, new_analysis) ->
            Analyse.Analysis.equal old_analysis new_analysis)
  @@ Current.pair master analysis
in
(* see below: protect from analysis failures here with [latch]? *)
let master = Current.map fst master_analysis in
let analysis = Current.map snd master_analysis in

Now the situation is even worse when the analysis fails, because then the whole PR subgraph gets invalidated.. When the next update to master re-triggers the analysis and it eventually succeeds, ocurrent has to rebuild the subgraph from scratch and ask the Current_cache sqlite database to recover each previous job result. I'm guessing it would be nicer to use Pipeline.latch to hide those analysis failures on re-run:

let master_analysis = latch ~label:"analysis" master_analysis in

But this would hide the "second re-run" analysis failures from the github status, so perhaps it needs to be sent separately like you did for the lint pass? :)

@kit-ty-kate
Copy link
Contributor Author

Bingo! I managed to catch one of the failures in (analysis):

2023-01-27 15:31.55: New job: Analyse
2023-01-27 15:31.55: Waiting for resource in pool analyse
2023-01-27 15:39.14: Got resource from pool analyse
2023-01-27 15:39.14: Checking out commit ae1eaa1e. To reproduce:
                       git clone --recursive "https://github.com/ocaml/opam-repository.git" && cd "opam-repository" && git fetch origin "refs/pull/23140/head" && git reset --hard ae1eaa1e
2023-01-27 15:39.14: Exec: "cp" "-a" "--" "/var/lib/ocurrent/var/git/opam-repository.git-a0bc4b41ddb868605dc1500002c27cfa79faf4389fe53b6754817ea16b48d458/.git" 
                           "/tmp/git-checkout19f801a9"
2023-01-27 15:39.16: Exec: "git" "-C" "/tmp/git-checkout19f801a9" "submodule" 
                           "deinit" "--force" "--all"
2023-01-27 15:39.17: Exec: "git" "-C" "/tmp/git-checkout19f801a9" "reset" 
                           "--hard" "-q" "ae1eaa1ecb788f8ede49c45f775f509d3462bf20"
fatal: Unable to create '/tmp/git-checkout19f801a9/.git/index.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
2023-01-27 15:39.18: Exec: "git" "-C" "/var/lib/ocurrent/var/git/opam-repository.git-a0bc4b41ddb868605dc1500002c27cfa79faf4389fe53b6754817ea16b48d458" 
                           "branch" "-f" "fetch-ae1eaa1ecb788f8ede49c45f775f509d3462bf20" 
                           "ae1eaa1ecb788f8ede49c45f775f509d3462bf20"
2023-01-27 15:39.24: Job failed: Command "git" "-C" "/tmp/git-checkout19f801a9" "reset" "--hard" "-q" 
"ae1eaa1ecb788f8ede49c45f775f509d3462bf20" exited with status 128

@art-w
Copy link
Contributor

art-w commented Jan 30, 2023

Nice! /tmp/git-checkoutXYZ appears to be created specifically for that job with no concurrent git processes accessing it, so my best guess is that the cp -a /var/lib/... triggers during an update to the reference repo and copies along the lock? (but then git submodule deinit ignores the lock? >_>)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants