Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve change detection #457

Merged
merged 3 commits into from
Oct 20, 2024

Conversation

brianmcgee
Copy link
Member

@brianmcgee brianmcgee commented Oct 17, 2024

Previously, we were storing the last modtime and size for a given path in boltdb.
This was used to determine if a file had changed and whether we should format it.

In addition, if a formatter's executable changed (modtime or size), we would delete
all path entries in the database before processing, thereby forcing each file to be
formatted.

This commit introduces a new approach to change detection, one which takes into account
changes to the underlying file, the formatter's executable, and the formatter's
configuration.

Now, when deciding if we should format a file, we do the following:

  • Hash each matching formatter, in sequence, using its name, options, priority
    as well as the modtime and size of its executable.
    This is pre-computed on a per-pipeline basis.
  • We then add the file's modtime and size to generate a format signature.
    You can think of this signature as a unique representation of what we are about to do
    with the file.
  • The format signature is then compared with a cache entry (if available).
  • If the signatures match, we have already applied this sequence of formatters, with these
    particular options etc. to this file when it had this modtime and size, so there is
    no more processing to be done.
  • If the signatures do no match, we should format the file, recording the new format
    signature in the cache when we are finished.

This approach is simpler in terms of storage, and has the added benefit of finer grained
change detection versus the brute force cache busting we were doing before.

In terms of performance impact, with the pre-computing of hashes per-pipeline and the
simpler storage schema, there appears to have been no significant impact.
Manual testing with nixpkgs shows comparable run times
for both hot and cold caches.

Note

Since this changes the database schema, rather than implementing some form of migration
logic to remove the old buckets and so on, I decided to upgrade the hash algorithm we
use when determining the filename for the db file.

Previously, we were using sha1, matching the behaviour from 1.0.
Now we use sha256, which results in a slightly longer db name, but has the benefit of
ensuring a new db instance will be created on first invocation, as well as making
gosec happy.

Closes #455

@brianmcgee brianmcgee requested review from zimbatm and jfly October 17, 2024 10:54
@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch 2 times, most recently from b0aa949 to 5f4c98b Compare October 17, 2024 10:59
Copy link
Collaborator

@jfly jfly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

I don't think I totally understand the hashing mechanism, which I've left some ineline comments on. I'd like to understand it before I give you a final shipit.

I think it's possible that I'm confused because I'm not thinking about the scenario where we have multiple formatters interested in formatting the same file.

format/composite.go Outdated Show resolved Hide resolved
format/composite.go Outdated Show resolved Hide resolved
format/composite.go Outdated Show resolved Hide resolved
format/formatter.go Outdated Show resolved Hide resolved
format/formatter_test.go Outdated Show resolved Hide resolved
test/test.go Outdated Show resolved Hide resolved
@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch 2 times, most recently from 12a0d50 to 1bebe54 Compare October 17, 2024 15:45
@brianmcgee brianmcgee requested a review from jfly October 17, 2024 15:46
@brianmcgee
Copy link
Member Author

I appreciate the feedback. Made some changes, left some comments.

@brianmcgee brianmcgee marked this pull request as draft October 17, 2024 17:07
format/composite.go Show resolved Hide resolved
format/composite.go Outdated Show resolved Hide resolved
format/composite.go Show resolved Hide resolved
format/composite.go Outdated Show resolved Hide resolved
@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch from ac13194 to 4c38fbd Compare October 18, 2024 09:13
@brianmcgee brianmcgee changed the title feat: improve formatter/config change detection feat: improve change detection Oct 18, 2024
@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 4c38fbd to 88f99bc Compare October 18, 2024 09:23
@brianmcgee
Copy link
Member Author

Still chasing down a caching bug before I mark this ready for review again.

@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 88f99bc to 185e6e3 Compare October 18, 2024 10:03
@brianmcgee brianmcgee marked this pull request as ready for review October 18, 2024 10:04
@brianmcgee brianmcgee requested a review from zimbatm October 18, 2024 10:04
Previously, we were storing the last `modtime` and `size` for a given path in `boltdb`.
This was used to determine if a file had changed and whether we should format it.

In addition, if a formatter's executable changed (`modtime` or `size`), we would delete
all path entries in the database before processing, thereby forcing each file to be
formatted.

This commit introduces a new approach to change detection, one which takes into account
changes to the underlying file, the formatter's executable, _and_ the formatter's
configuration.

Now, when deciding if we should format a file, we do the following:

- Hash each matching formatter, in sequence, using its `name`, `options`, `priority`
  as well as the `modtime` and `size` of its executable.
  This is pre-computed on a per-pipeline basis.
- We then add the file's `modtime` and `size` to generate a format signature.
  You can think of this signature as a unique representation of what we are about to do
  with the file.
- The format signature is then compared with a cache entry (if available).
- If the signatures match, we have already applied this sequence of formatters, with these
  particular options etc. to this file when it had this `modtime` and `size`, so there is
  no more processing to be done.
- If the signatures do no match, we should format the file, recording the new format
  signature in the cache when we are finished.

This approach is simpler in terms of storage, and has the added benefit of finer grained
change detection versus the brute force cache busting we were doing before.

In terms of performance impact, with the pre-computing of hashes per-pipeline and the
simpler storage schema, there appears to have been no significant impact.
Manual testing with [nixpkgs](https://github.com/nixos/nixpkgs) shows comparable run times
for both hot and cold caches.

> [!NOTE]
> Since this changes the database schema, rather than implementing some form of migration
> logic to remove the old buckets and so on, I decided to upgrade the hash algorithm we
> use when determining the filename for the db file.
>
> Previously, we were using `sha1`, matching the behaviour from `1.0`.
> Now we use `sha256`, which results in a slightly longer db name, but has the benefit of
> ensuring a new db instance will be created on first invocation, as well as making
> [gosec](https://golangci-lint.run/usage/linters/#gosec) happy.

Closes #455

Signed-off-by: Brian McGee <[email protected]>
@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 185e6e3 to d7c39b6 Compare October 18, 2024 10:14
Copy link
Member

@zimbatm zimbatm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 887a321 to 3f4351c Compare October 18, 2024 13:07
Copy link
Collaborator

@jfly jfly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I'm a bit confused about how we're handling the db schema change. What happens when we run this new code with an old cache? Does this work because we've change the name of the db? (I saw a change from Sha to Sha256 when computing the filename).

cmd/root_test.go Outdated Show resolved Hide resolved
format/composite.go Outdated Show resolved Hide resolved
format/formatter.go Outdated Show resolved Hide resolved
format/scheduler.go Show resolved Hide resolved
format/scheduler.go Show resolved Hide resolved
test/test.go Outdated Show resolved Hide resolved
walk/walk.go Show resolved Hide resolved
@brianmcgee
Copy link
Member Author

Looks great! I'm a bit confused about how we're handling the db schema change. What happens when we run this new code with an old cache? Does this work because we've change the name of the db? (I saw a change from Sha to Sha256 when computing the filename).

The first time someone runs this, it will create a new db (fresh cache) rather than use an existing one.
There is no need to migrate the state in there. Yes, they lose their cache state, but it's not a big deal. They'll only really notice if they're running it on a huge repo.

This is a short-term fix to improve the time it takes to run tests.

Signed-off-by: Brian McGee <[email protected]>
@brianmcgee brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 3f4351c to 9c96554 Compare October 18, 2024 14:30
@jfly
Copy link
Collaborator

jfly commented Oct 18, 2024

The first time someone runs this, it will create a new db (fresh cache) rather than use an existing one

What's the mechanism that causes us to create a new db rather than try to use the existing one?

No longer relies on an arbitrary sleep.

Signed-off-by: Brian McGee <[email protected]>
@brianmcgee
Copy link
Member Author

What's the mechanism that causes us to create a new db rather than try to use the existing one?

Without this change, we will need to add code to remove the old formatters bucket, and clear the paths bucket as it's no longer binary compatible. We could use a different bucket name other than paths, but we would still need to delete that bucket if it's present.

Rather than complicating things, we can circumvent the problem by changing the scheme for determining the database name and keeping the DB code simple.

I want to avoid a complicated migration approach for a throwaway database that will likely be discarded at some point in the near-future due to a config change or a user running treefmt with -c.

@brianmcgee
Copy link
Member Author

Having thought on this a bit more, we could re-use the same db if we:

  • just left the old formatters bucket alone, it's not doing any harm and isn't that big
  • deleted the old paths bucket
  • used a new paths_v2 bucket

It's not that onerous. It's still less code to change the naming convention for the db. It sits in ~/.cache/treefmt/eval-cache, and it's not like anyone uses it directly.

@brianmcgee
Copy link
Member Author

I've created #459 to follow up and make the tests clearer.

@brianmcgee brianmcgee merged commit ba30619 into main Oct 20, 2024
27 checks passed
@brianmcgee brianmcgee deleted the feat/improve-formatter-change-detection branch October 20, 2024 16:22
@jfly
Copy link
Collaborator

jfly commented Oct 20, 2024

we can circumvent the problem by changing the scheme for determining the database name and keeping the DB code simple.

That makes sense! I just wasn't sure how we were ensuring we create a new db. What happens with the old db? Does it just get orphaned until someone clears their cache directory?

@brianmcgee
Copy link
Member Author

What happens with the old db? Does it just get orphaned until someone clears their cache directory?

Yeah, that's right. Currently, we don't manage the cache directory e.g. clean up old dbs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify formatter change detection
3 participants