feat: improve change detection #457

brianmcgee · 2024-10-17T10:54:32Z

Previously, we were storing the last modtime and size for a given path in boltdb.
This was used to determine if a file had changed and whether we should format it.

In addition, if a formatter's executable changed (modtime or size), we would delete
all path entries in the database before processing, thereby forcing each file to be
formatted.

This commit introduces a new approach to change detection, one which takes into account
changes to the underlying file, the formatter's executable, and the formatter's
configuration.

Now, when deciding if we should format a file, we do the following:

Hash each matching formatter, in sequence, using its name, options, priority
as well as the modtime and size of its executable.
This is pre-computed on a per-pipeline basis.
We then add the file's modtime and size to generate a format signature.
You can think of this signature as a unique representation of what we are about to do
with the file.
The format signature is then compared with a cache entry (if available).
If the signatures match, we have already applied this sequence of formatters, with these
particular options etc. to this file when it had this modtime and size, so there is
no more processing to be done.
If the signatures do no match, we should format the file, recording the new format
signature in the cache when we are finished.

This approach is simpler in terms of storage, and has the added benefit of finer grained
change detection versus the brute force cache busting we were doing before.

In terms of performance impact, with the pre-computing of hashes per-pipeline and the
simpler storage schema, there appears to have been no significant impact.
Manual testing with nixpkgs shows comparable run times
for both hot and cold caches.

Note

Since this changes the database schema, rather than implementing some form of migration
logic to remove the old buckets and so on, I decided to upgrade the hash algorithm we
use when determining the filename for the db file.

Previously, we were using sha1, matching the behaviour from 1.0.
Now we use sha256, which results in a slightly longer db name, but has the benefit of
ensuring a new db instance will be created on first invocation, as well as making
gosec happy.

Closes #455

jfly

Thanks for working on this!

I don't think I totally understand the hashing mechanism, which I've left some ineline comments on. I'd like to understand it before I give you a final shipit.

I think it's possible that I'm confused because I'm not thinking about the scenario where we have multiple formatters interested in formatting the same file.

format/composite.go

format/formatter.go

format/formatter_test.go

test/test.go

brianmcgee · 2024-10-17T15:47:30Z

I appreciate the feedback. Made some changes, left some comments.

format/composite.go

brianmcgee · 2024-10-18T09:32:46Z

Still chasing down a caching bug before I mark this ready for review again.

Previously, we were storing the last `modtime` and `size` for a given path in `boltdb`. This was used to determine if a file had changed and whether we should format it. In addition, if a formatter's executable changed (`modtime` or `size`), we would delete all path entries in the database before processing, thereby forcing each file to be formatted. This commit introduces a new approach to change detection, one which takes into account changes to the underlying file, the formatter's executable, _and_ the formatter's configuration. Now, when deciding if we should format a file, we do the following: - Hash each matching formatter, in sequence, using its `name`, `options`, `priority` as well as the `modtime` and `size` of its executable. This is pre-computed on a per-pipeline basis. - We then add the file's `modtime` and `size` to generate a format signature. You can think of this signature as a unique representation of what we are about to do with the file. - The format signature is then compared with a cache entry (if available). - If the signatures match, we have already applied this sequence of formatters, with these particular options etc. to this file when it had this `modtime` and `size`, so there is no more processing to be done. - If the signatures do no match, we should format the file, recording the new format signature in the cache when we are finished. This approach is simpler in terms of storage, and has the added benefit of finer grained change detection versus the brute force cache busting we were doing before. In terms of performance impact, with the pre-computing of hashes per-pipeline and the simpler storage schema, there appears to have been no significant impact. Manual testing with [nixpkgs](https://github.com/nixos/nixpkgs) shows comparable run times for both hot and cold caches. > [!NOTE] > Since this changes the database schema, rather than implementing some form of migration > logic to remove the old buckets and so on, I decided to upgrade the hash algorithm we > use when determining the filename for the db file. > > Previously, we were using `sha1`, matching the behaviour from `1.0`. > Now we use `sha256`, which results in a slightly longer db name, but has the benefit of > ensuring a new db instance will be created on first invocation, as well as making > [gosec](https://golangci-lint.run/usage/linters/#gosec) happy. Closes #455 Signed-off-by: Brian McGee <[email protected]>

zimbatm

LGTM

jfly

Looks great! I'm a bit confused about how we're handling the db schema change. What happens when we run this new code with an old cache? Does this work because we've change the name of the db? (I saw a change from Sha to Sha256 when computing the filename).

cmd/root_test.go

format/composite.go

format/formatter.go

format/scheduler.go

test/test.go

walk/walk.go

brianmcgee · 2024-10-18T14:19:27Z

Looks great! I'm a bit confused about how we're handling the db schema change. What happens when we run this new code with an old cache? Does this work because we've change the name of the db? (I saw a change from Sha to Sha256 when computing the filename).

The first time someone runs this, it will create a new db (fresh cache) rather than use an existing one.
There is no need to migrate the state in there. Yes, they lose their cache state, but it's not a big deal. They'll only really notice if they're running it on a huge repo.

This is a short-term fix to improve the time it takes to run tests. Signed-off-by: Brian McGee <[email protected]>

jfly · 2024-10-18T14:56:26Z

The first time someone runs this, it will create a new db (fresh cache) rather than use an existing one

What's the mechanism that causes us to create a new db rather than try to use the existing one?

No longer relies on an arbitrary sleep. Signed-off-by: Brian McGee <[email protected]>

brianmcgee · 2024-10-18T15:48:22Z

What's the mechanism that causes us to create a new db rather than try to use the existing one?

Without this change, we will need to add code to remove the old formatters bucket, and clear the paths bucket as it's no longer binary compatible. We could use a different bucket name other than paths, but we would still need to delete that bucket if it's present.

Rather than complicating things, we can circumvent the problem by changing the scheme for determining the database name and keeping the DB code simple.

I want to avoid a complicated migration approach for a throwaway database that will likely be discarded at some point in the near-future due to a config change or a user running treefmt with -c.

brianmcgee · 2024-10-18T15:53:56Z

Having thought on this a bit more, we could re-use the same db if we:

just left the old formatters bucket alone, it's not doing any harm and isn't that big
deleted the old paths bucket
used a new paths_v2 bucket

It's not that onerous. It's still less code to change the naming convention for the db. It sits in ~/.cache/treefmt/eval-cache, and it's not like anyone uses it directly.

brianmcgee · 2024-10-18T16:25:22Z

I've created #459 to follow up and make the tests clearer.

jfly · 2024-10-20T22:35:12Z

we can circumvent the problem by changing the scheme for determining the database name and keeping the DB code simple.

That makes sense! I just wasn't sure how we were ensuring we create a new db. What happens with the old db? Does it just get orphaned until someone clears their cache directory?

brianmcgee · 2024-10-21T10:27:20Z

What happens with the old db? Does it just get orphaned until someone clears their cache directory?

Yeah, that's right. Currently, we don't manage the cache directory e.g. clean up old dbs.

brianmcgee requested review from zimbatm and jfly October 17, 2024 10:54

brianmcgee force-pushed the feat/improve-formatter-change-detection branch 2 times, most recently from b0aa949 to 5f4c98b Compare October 17, 2024 10:59

jfly reviewed Oct 17, 2024

View reviewed changes

brianmcgee force-pushed the feat/improve-formatter-change-detection branch 2 times, most recently from 12a0d50 to 1bebe54 Compare October 17, 2024 15:45

brianmcgee requested a review from jfly October 17, 2024 15:46

brianmcgee marked this pull request as draft October 17, 2024 17:07

zimbatm approved these changes Oct 17, 2024

View reviewed changes

format/composite.go Show resolved Hide resolved

format/composite.go Outdated Show resolved Hide resolved

format/composite.go Show resolved Hide resolved

format/composite.go Outdated Show resolved Hide resolved

brianmcgee force-pushed the feat/improve-formatter-change-detection branch from ac13194 to 4c38fbd Compare October 18, 2024 09:13

brianmcgee changed the title ~~feat: improve formatter/config change detection~~ feat: improve change detection Oct 18, 2024

brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 4c38fbd to 88f99bc Compare October 18, 2024 09:23

brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 88f99bc to 185e6e3 Compare October 18, 2024 10:03

brianmcgee marked this pull request as ready for review October 18, 2024 10:04

brianmcgee requested a review from zimbatm October 18, 2024 10:04

brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 185e6e3 to d7c39b6 Compare October 18, 2024 10:14

zimbatm approved these changes Oct 18, 2024

View reviewed changes

brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 887a321 to 3f4351c Compare October 18, 2024 13:07

jfly approved these changes Oct 18, 2024

View reviewed changes

feat: only delay treefmt in tests if fail on change is enabled

9c96554

This is a short-term fix to improve the time it takes to run tests. Signed-off-by: Brian McGee <[email protected]>

brianmcgee force-pushed the feat/improve-formatter-change-detection branch from 3f4351c to 9c96554 Compare October 18, 2024 14:30

feat: improve fail on change test

e32b5a2

No longer relies on an arbitrary sleep. Signed-off-by: Brian McGee <[email protected]>

brianmcgee merged commit ba30619 into main Oct 20, 2024
27 checks passed

brianmcgee deleted the feat/improve-formatter-change-detection branch October 20, 2024 16:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: improve change detection #457

feat: improve change detection #457

brianmcgee commented Oct 17, 2024 •

edited

Loading

jfly left a comment

brianmcgee commented Oct 17, 2024

brianmcgee commented Oct 18, 2024

zimbatm left a comment

jfly left a comment

brianmcgee commented Oct 18, 2024

jfly commented Oct 18, 2024

brianmcgee commented Oct 18, 2024

brianmcgee commented Oct 18, 2024

brianmcgee commented Oct 18, 2024

jfly commented Oct 20, 2024

brianmcgee commented Oct 21, 2024

feat: improve change detection #457

feat: improve change detection #457

Conversation

brianmcgee commented Oct 17, 2024 • edited Loading

jfly left a comment

Choose a reason for hiding this comment

brianmcgee commented Oct 17, 2024

brianmcgee commented Oct 18, 2024

zimbatm left a comment

Choose a reason for hiding this comment

jfly left a comment

Choose a reason for hiding this comment

brianmcgee commented Oct 18, 2024

jfly commented Oct 18, 2024

brianmcgee commented Oct 18, 2024

brianmcgee commented Oct 18, 2024

brianmcgee commented Oct 18, 2024

jfly commented Oct 20, 2024

brianmcgee commented Oct 21, 2024

brianmcgee commented Oct 17, 2024 •

edited

Loading