feat: improve change detection #457
Thanks for working on this!
I don't think I fully understand the hashing mechanism, and I've left some inline comments on it. I'd like to understand it before I give you a final shipit.
I think it's possible that I'm confused because I'm not thinking about the scenario where we have multiple formatters interested in formatting the same file.
I appreciate the feedback. Made some changes, left some comments.
Still chasing down a caching bug before I mark this ready for review again.
Previously, we were storing the last `modtime` and `size` for a given path in `boltdb`. This was used to determine if a file had changed and whether we should format it.

In addition, if a formatter's executable changed (`modtime` or `size`), we would delete all path entries in the database before processing, thereby forcing each file to be formatted.

This commit introduces a new approach to change detection, one which takes into account changes to the underlying file, the formatter's executable, _and_ the formatter's configuration.

Now, when deciding if we should format a file, we do the following:

- Hash each matching formatter, in sequence, using its `name`, `options`, `priority` as well as the `modtime` and `size` of its executable. This is pre-computed on a per-pipeline basis.
- We then add the file's `modtime` and `size` to generate a format signature. You can think of this signature as a unique representation of what we are about to do with the file.
- The format signature is then compared with a cache entry (if available).
- If the signatures match, we have already applied this sequence of formatters, with these particular options etc. to this file when it had this `modtime` and `size`, so there is no more processing to be done.
- If the signatures do not match, we should format the file, recording the new format signature in the cache when we are finished.

This approach is simpler in terms of storage, and has the added benefit of finer grained change detection versus the brute force cache busting we were doing before.

In terms of performance impact, with the pre-computing of hashes per-pipeline and the simpler storage schema, there appears to have been no significant impact. Manual testing with [nixpkgs](https://github.com/nixos/nixpkgs) shows comparable run times for both hot and cold caches.
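As a rough illustration of the steps listed above, here is a minimal sketch of how a format signature could be built and compared. It is not the code from this PR: the `Formatter` and `FileInfo` types, the function names, the use of `sha256`, the length-prefixed encoding, and the in-memory map standing in for the `boltdb` cache are all assumptions made for the example.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"hash"
)

// Formatter is a hypothetical stand-in for a configured formatter.
// ExecModTime is the executable's modtime in Unix nanoseconds.
type Formatter struct {
	Name        string
	Options     []string
	Priority    int
	ExecModTime int64
	ExecSize    int64
}

// FileInfo is a hypothetical stand-in for the file's modtime and size.
type FileInfo struct {
	ModTime int64 // Unix nanoseconds
	Size    int64
}

func writeInt64(h hash.Hash, v int64) {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(v))
	h.Write(buf[:])
}

// writeString length-prefixes the string so adjacent fields cannot blur together.
func writeString(h hash.Hash, s string) {
	writeInt64(h, int64(len(s)))
	h.Write([]byte(s))
}

// pipelineHash hashes each formatter in sequence; it only needs to run once per pipeline.
func pipelineHash(formatters []Formatter) []byte {
	h := sha256.New()
	for _, f := range formatters {
		writeString(h, f.Name)
		writeInt64(h, int64(len(f.Options)))
		for _, o := range f.Options {
			writeString(h, o)
		}
		writeInt64(h, int64(f.Priority))
		writeInt64(h, f.ExecModTime)
		writeInt64(h, f.ExecSize)
	}
	return h.Sum(nil)
}

// formatSignature adds the file's modtime and size to the pre-computed pipeline hash.
func formatSignature(pipeline []byte, file FileInfo) []byte {
	h := sha256.New()
	h.Write(pipeline)
	writeInt64(h, file.ModTime)
	writeInt64(h, file.Size)
	return h.Sum(nil)
}

// needsFormat reports whether the cache lacks an entry or holds a different signature.
func needsFormat(cache map[string][]byte, path string, sig []byte) bool {
	prev, ok := cache[path]
	return !ok || !bytes.Equal(prev, sig)
}

func main() {
	gofmt := Formatter{Name: "gofmt", Options: []string{"-w"}, ExecModTime: 1700000000, ExecSize: 4096}
	file := FileInfo{ModTime: 1700000100, Size: 120}
	cache := map[string][]byte{} // stands in for the on-disk cache

	sig := formatSignature(pipelineHash([]Formatter{gofmt}), file)
	fmt.Println(needsFormat(cache, "main.go", sig)) // true: no cache entry yet
	cache["main.go"] = sig                          // recorded after formatting
	fmt.Println(needsFormat(cache, "main.go", sig)) // false: nothing changed

	gofmt.Options = []string{"-w", "-s"} // a config change alters the signature
	sig = formatSignature(pipelineHash([]Formatter{gofmt}), file)
	fmt.Println(needsFormat(cache, "main.go", sig)) // true
}
```

In this sketch only one value per path is stored, and a change to any input (file, executable, or options) produces a different signature for the affected files without touching other cache entries.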
> [!NOTE]
> Since this changes the database schema, rather than implementing some form of migration
> logic to remove the old buckets and so on, I decided to upgrade the hash algorithm we
> use when determining the filename for the db file.
>
> Previously, we were using `sha1`, matching the behaviour from `1.0`.
> Now we use `sha256`, which results in a slightly longer db name, but has the benefit of
> ensuring a new db instance will be created on first invocation, as well as making
> [gosec](https://golangci-lint.run/usage/linters/#gosec) happy.

Closes #455

Signed-off-by: Brian McGee <[email protected]>
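The effect of the hash change described in the note can be sketched as follows. This is only an illustration: the input being hashed (a project root path here), the `.db` suffix, and the function names are assumptions, not the project's actual code.

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// oldDBName mimics the previous scheme: a sha1 digest, hex-encoded.
func oldDBName(root string) string {
	sum := sha1.Sum([]byte(root))
	return hex.EncodeToString(sum[:]) + ".db"
}

// newDBName mimics the new scheme: a sha256 digest, hex-encoded.
func newDBName(root string) string {
	sum := sha256.Sum256([]byte(root))
	return hex.EncodeToString(sum[:]) + ".db"
}

func main() {
	root := "/home/user/project" // hypothetical input
	fmt.Println(len(oldDBName(root)), len(newDBName(root))) // 43 67
	fmt.Println(oldDBName(root) != newDBName(root))         // true
}
```

Because the two names never coincide, the first run with the new scheme opens a file that does not exist yet and so starts from an empty cache, while the old file is simply left in place.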
LGTM
Looks great! I'm a bit confused about how we're handling the db schema change. What happens when we run this new code with an old cache? Does this work because we've changed the name of the db? (I saw a change from Sha to Sha256 when computing the filename).
The first time someone runs this, it will create a new db (fresh cache) rather than use an existing one.
This is a short-term fix to improve the time it takes to run tests. Signed-off-by: Brian McGee <[email protected]>
What's the mechanism that causes us to create a new db rather than try to use the existing one?
No longer relies on an arbitrary sleep. Signed-off-by: Brian McGee <[email protected]>
Without this change, we will need to add code to remove the old Rather than complicating things, we can circumvent the problem by changing the scheme for determining the database name and keeping the DB code simple. I want to avoid a complicated migration approach for a throwaway database that will likely be discarded at some point in the near future due to a config change or a user running
Having thought on this a bit more, we could re-use the same db if we:
It's not that onerous. It's still less code to change the naming convention for the db. It sits in
I've created #459 to follow up and make the tests clearer.
That makes sense! I just wasn't sure how we were ensuring we create a new db. What happens with the old db? Does it just get orphaned until someone clears their cache directory?
Yeah, that's right. Currently, we don't manage the cache directory, e.g. clean up old dbs.