Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional heaps in Holmake #1283

Open
talsewell opened this issue Aug 12, 2024 · 8 comments
Open

Optional heaps in Holmake #1283

talsewell opened this issue Aug 12, 2024 · 8 comments

Comments

@talsewell
Copy link
Contributor

I'm hoping to start a discussion about a possible feature, motivated by CakeML build times.

In some Holmake runs, a large fraction of the CPU time is spent loading. I can include stats if necessary. Load times (to load up the included theories and libraries) can be tens of seconds in the CakeML repo. This is particularly unsatisfactory when --fast is enabled, of course, as proofs are completed rapidly but the load times are the same.

PolyML heaps are a potential solution, but have serious drawbacks. As currently implemented, they fill in the dependency graph. Suppose the build sequence has directories A, B, C with some dependencies from A -> B and B -> C. Without heaps, a change to one theory in A may permit a minimal rebuild of affected theories in B + C. Suppose that B/Holmakefile declares a heap containing A, however, that heap will require a rebuild, and trigger a rebuild of every theory in B.

I think that the best of both of these cases can be had with some minimal changes. The heap to use is already decided at startup time by running the heapname executable. It seems reasonable to support a case where a heap is used if it is available (to save CPU time in loading it) but not used if it is unavailable or out of date (to save CPU time in building it). The user could opt-in to this mechanism by using a variant of the HOLHEAP variable.

The only tricky bit is to ensure that heapname can quickly make the right decision. It should know when a heap is out of date. It shouldn't have to replay holdep or Holmake steps to examine dependencies or timestamps.

My idea is for Holmake to put a persistent note into the world when a target is out of date. I propose to do this by saving the actions to be taken as soon as Holmake has built a graph. Specifically, for each target to be built (by a built in command or otherwise), create a file in the relevant .HOLMK directory listing some details of the job. If that file remains newer than the target, then the target is out of date. This seems simple and stable enough. It's possibly that the buildheap mechanism also needs to be tweaked to place the final heap object atomically.

Does this makes sense?

@mn200
Copy link
Member

mn200 commented Aug 12, 2024

This sounds plausible. Notes:

  1. The dependency information calculated for graph-purposes is already stored in .HOLMK/*.d files that are checked for out-of-date ness; it would be reasonable to extend the use of these files along the lines you suggest.
  2. heapname is not used by Holmake; it‘s only used so that interactive uses of hol start with the right heap. Holmake’s traversal of the file-system and its Holmakefiles means that the right heap to use is already apparent when Holmake is building things.

@mn200
Copy link
Member

mn200 commented Aug 12, 2024

Do you have to run Holmake in a particular mode to get it to ever generate heap files, or do you do that all the time regardless?

It seems that the solution is to create a dependency graph that never includes the “optional heaps” as dependencies if they are “bad”, but which still keeps them as targets. That way they do get built for next time. The command that Holmake stores in its graph as the thing to run for a given target would or would not include that heap depending. But this decision doesn't depend on anything being cached on disk.

@talsewell
Copy link
Contributor Author

OK, I'll add a bit more of my thinking.

The dependency information is already cached in the .d files, but only for "built in" build commands. There doesn't seem to be a .d file when a heap is built. Also, they don't seem to be updated very often (unless the --rebuild_deps flag is used), so in the critical case here where the system has committed to rebuilding a heap, it's likely there won't be any update in its .d file.

Indeed, heapname is separate from Holmake. That was the plan. When I said "startup time" above, I meant startup time of the actual hol process being used either interactively or in a batch build. If I understand right, it peeks the Holmakefile at various points to decide what heap to use and to pick a load path. This change would be a tweak to that process.

I did consider having multiple kinds of dependency edge but no change to heap selection. That approach has some attractions, however I think it would be less effective. It would work well for partial rebuilds, but not for partial builds. Consider the A->B->C example above. On a rebuild caused by a small change in A, this would rebuild the heap used in B, avoid redundant rebuilds of previously successful builds in B, etc. However, when attempting a partial build of one theory in C when nothing else is built, this will still block it waiting on everything in A+B to be built into a heap.

@mn200
Copy link
Member

mn200 commented Aug 12, 2024

  • When Holmake runs something to get a theory built, it doesn't call hol per se, it calls buildheap on a set of .uo files, which in turn causes the buildheap process to load many .uo files. That process doesn't use heapname. Scanning Holmakefiles is not time-consuming. Of course, if hol has to do a dependency analysis as it starts just to determine if a heap should or shouldn't be used, that's not ideal so I agree that it might be nice to cache a "this is out-of-date" verdict for heaps (or Holmake could just delete the out-of-date heaps entirely!)
  • I don't understand your second scenario. My proposal says that if an optional heap is "bad" (out-of-date or not existing) then you don't use it for anything, so that things in C and B would start building as soon as their direct dependencies in A and B were ready to go. Nonetheless, for the sake of interactive usage, the heap would still get built (just as if the heap were a leaf target that nothing depended on). Of course, you want to maximise the number of times heaps get used as well so if the user knows that there's lots to do in those directories then they should be able to call for B’s heap to not be optional after all.
  • One issue is what happens after your partial build of C/theory1 (build 1). If the optional heap gets built in parallel but perhaps later than C/theory1 (which you want, because having it around will reduce startup times), it will cause everything in C, including theory1 to get built again. But at least those builds will be quicker...

@talsewell
Copy link
Contributor Author

Sorry, I think I'd misunderstood some of your points. I didn't realise Holmake used an alternative startup process.

To clarify, the problem I see is on the Holmake side. Interactive users might appreciate a speed boost in startup, but it's only one of many annoyances. However, when an interactive wants to step through one theory (e.g. because it is of interest, or failed in a regression test) and wants to build to that point (e.g. with --fast on), it is especially annoying that directory build times are dominated by repeated loading.

@talsewell
Copy link
Contributor Author

To follow up, I'm happy to leave this for the moment and spend some cycles thinking about what we're aiming for here.

I guess my claim was that loading of theory objects is "pure" in some sense, so that it's OK to use a heap that contains additional out-of-date theory objects as long as they won't be used, or to use a prior heap which doesn't contain some theories as long as they won't be used. But, if that's our point, is it really heaps we need to improve, or would it be better to try to be faster loading the *Theory.dat files?

@mn200
Copy link
Member

mn200 commented Aug 13, 2024

I thought recent profiling suggested it was loading code that was causing the slow-downs, but I agree that it would be nice to know just where the slowness is coming from.

Here's my attempt to make at least one scenario concrete. Imagine that directory A contains A1, A2, .. An and that all of these are sucked in to build heapB in directory B. This heap is then a dependency for all of the theories in that directory though those theories may "really" only depend on a subset of {Ai | 1 <= i <= n}. We can then imagine directory C with more files that mention some mixture of things in B and A.

If I then touch A6, I would prefer not to rebuild things in B and C that don't "really" depend on A6. So, what do I do to achieve the best possible outcome?

  • I have to decide whether to not to rebuild heapB. Certainly, it is out of date wrt its dependencies.
  • Say, I touched A6 so that C6 could benefit. Before playing around with C6 I have to build the intermediate dependencies in B. This will be slow if I don't rebuild heapB, so this seems to argue for doing that build.
  • Then I'm in a situation where parts of B that don't "really" depend on A6 now feel that they should be rebuilt. But perhaps here's our opportunity to be smart: we ignore the new version of heapB and leave the uninvolved theories unchanged.

So, if this scenario is the one we're really interested in, the issue is not stale heaps, it's being able to ignore fresh heaps that we're confident won't affect us. If so, Holmake might change its dependency analysis so that heaps don't force rebuilds unless there is a "real" dependency also calling for the rebuild.

Of course, in this scenario, with the current implementation Holmake will ignore the uninvolved theories unless you explicitly call for them to be built. Unfortunately, that's what a bare call to Holmake will do, but if you explicitly went Holmake C6Theory, unrelated things would be ignored.

Given that we have forgotten how heaps behave in practice, I'd recommend starting by building more heaps (or just using the one in misc “everywhere”), and then seeing how load times improve or not.

@mn200
Copy link
Member

mn200 commented Aug 13, 2024

Loading CakeML’s preamble on top of the standard core HOL heap takes just over 15s on my machine. If you have preloaded this (as happens in misc/cakeml-heap), loading it is instantaneous (because it's already there). So every theory in cakeml not using a heap is paying that 15s tax.

I have often argued that putting everything and the kitchen sink into preamble is a bad idea (logically and pragmatically), but if you're committed to doing that, you really should use a heap that has it preloaded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants