Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

irmin-pack: introduce chunked suffix abstraction #2115

Merged
merged 3 commits into from
Oct 21, 2022

Conversation

metanivek
Copy link
Member

This PR is part of #2096 and introduces a chunked suffix abstraction that works with our current single suffix world. The goal with this change is to introduce the abstraction as a step towards multiple chunks and modifying our GC to work with chunks.

@metanivek metanivek added the no-changelog-needed No changelog is needed here label Oct 12, 2022
@metanivek metanivek modified the milestone: 3.3.2 Oct 12, 2022
@metanivek metanivek force-pushed the intro_chunked_suffix branch from c931439 to d5d13a0 Compare October 12, 2022 18:25
@codecov-commenter
Copy link

codecov-commenter commented Oct 12, 2022

Codecov Report

Merging #2115 (7976c57) into main (6854d5b) will increase coverage by 0.08%.
The diff coverage is 82.75%.

❗ Current head 7976c57 differs from pull request most recent head a602ce2. Consider uploading reports for the commit a602ce2 to get more accurate results

@@            Coverage Diff             @@
##             main    #2115      +/-   ##
==========================================
+ Coverage   64.77%   64.86%   +0.08%     
==========================================
  Files         133      134       +1     
  Lines       15890    15946      +56     
==========================================
+ Hits        10293    10343      +50     
- Misses       5597     5603       +6     
Impacted Files Coverage Δ
src/irmin-pack/unix/ext.ml 62.15% <ø> (ø)
src/irmin-pack/unix/import.ml 73.33% <60.00%> (-6.67%) ⬇️
src/irmin-pack/unix/chunked_suffix.ml 84.00% <84.00%> (ø)
src/irmin-pack/unix/file_manager.ml 84.01% <100.00%> (+0.05%) ⬆️
src/irmin-test/store.ml 94.95% <0.00%> (+0.05%) ⬆️
src/irmin/watch.ml 80.98% <0.00%> (+2.11%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@icristescu icristescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, LGTM!

@icristescu icristescu mentioned this pull request Oct 13, 2022
34 tasks
@metanivek
Copy link
Member Author

Ah, CI is failing because of Array.find_opt not being available on older OCaml versions -- will fix this.

{ inventory }

let appendable_ao t = (Inventory.appendable t.inventory).ao
let end_poff t = appendable_ao t |> Ao.end_poff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong - the end_poff is used upstreamed as the end poff of the whole suffix abstraction, but here it returns only the end poff of the last chunk. I should be the sum over all chunks.

Copy link
Member Author

@metanivek metanivek Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up. I think this is worth some discussion.

The way I understand how the suffix's poff is currently used in the control file is as a control mechanism to know/verify what data we have written to disk (as is seen in the append only file's consistency check). In this sense, the current code is correct. It tracks the physical offset of the appendable chunk for consistency checks. No other chunk can be changed so for other chunks their end_poff is necessarily equivalent to their Io.size_of_path (the awkward code in Ao_chunk.open_ro that does exactly this).

Summing the values would give the chunked suffix's length and final offset but doesn't really correlate with a physical offset

While working on this I intentionally wanted to keep the existing code working with minimal extra changes but I do think we need to change this. Here is my proposal for a future PR:

  • rename control file field to suffix_consistency_poff (or something else -- open to ideas!)
  • rename end_poff in chunked file to consistency_poff
  • move consistency checking up from append only to chunked suffix and remove end_poff from append only's api open api (it would still be an available property)

Copy link
Contributor

@icristescu icristescu Oct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dispatcher needs to know the end offset for the suffix file in order to verify that a read is within bounds. This does not correspond to a physical offset, indeed. It's "a virtual offset corresponding to the physical end offset, if the suffix was a single file". I'll call it suffix_end_off.

We can either (I am reiterating what you say):

  • keep end_poff as is, keep the control file as is (tracking the end_poff of the last chunk), but then we need a function to compute the suffix_end_off - this solution works for me, I do find the concept of suffix_end_off a bit weird.
  • replace the end_poff in the chunked file and the control file with the suffix_end_off - I'm not sure if this is what you are proposing for a future PR?

Copy link
Member Author

@metanivek metanivek Oct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I just looked in the dispatcher and see its usage of Suffix.end_poff -- this does need to be updated to be the suffix end offset. Before chunks, this makes sense as an end_poff, but there is no "end poff" for a chunked suffix. I don't necessarily love adding another virutalized set of offsets, but the suffix's offset range will necessarily be virtualized for chunks since it doesn't map directly to a single range of physical offsets any longer. And it is neatly contained within Chunked_suffix which helps reduce the scope of offsets to consider.

Thanks for pointing out dispatcher's use of the end offset for the suffix. I think the following proposal addresses everything:

  • Add end_off to Chunked_suffix to represent the last offset of the suffix. This can be a sum of the chunks poffs but I don't think tracking it in the control file is needed.
  • Rename Chunked_suffix.end_poff to consistency_poff (or something else -- but I think "end poff" doesn't make sense in the context of a chunked suffix) and also the control file field to reflect the name change. This is the offset we need to track in the control file for consistency checks since we just need to check the appendable chunk's poff upon open. As a part of this, also move the consistency check from append only to chunked suffix (like mentioned previously).

I will make the first change in this PR since you rightly pointed out a correctness issue.

The second one can be a future PR since it is mostly a cosmetic name change, but it will also get rid of some of the awkward end_poff code (like in Ao_chunk.open_ro) which will be nice. Edit: on second glance, I see that we set let persisted_end_poff = end_poff directly in append only so perhaps this needs more consideration (immediate thought is to use Io for the non-appendable chunks).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make the first change in this PR since you rightly pointed out a correctness issue.

it does not need to be in this PR necessarily, as there is nothing that breaks here. I included a proposal for this in #2118, you can either review it there or add your own here, as you wish.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I missed that. I'll just review in your PR since you have already made the change. 👍

@icristescu
Copy link
Contributor

@metanivek can you add a changelog entry to mention that the file disk format changed to chunk files?

@metanivek metanivek removed the no-changelog-needed No changelog is needed here label Oct 18, 2022
src/irmin-pack/unix/chunked_suffix.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/chunked_suffix.ml Show resolved Hide resolved
src/irmin-pack/unix/chunked_suffix.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/chunked_suffix.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/chunked_suffix.ml Show resolved Hide resolved
src/irmin-pack/unix/chunked_suffix.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/chunked_suffix.ml Outdated Show resolved Hide resolved
@metanivek metanivek force-pushed the intro_chunked_suffix branch from 43929d1 to 8f2229f Compare October 19, 2022 18:55
@metanivek
Copy link
Member Author

This is ready for a final review before merging. I have resolved all of the comments that I have addressed in the PR. I will add the list of things to revisit to #2096 once this merges.

@metanivek metanivek force-pushed the intro_chunked_suffix branch 2 times, most recently from a1204da to 88b6888 Compare October 20, 2022 13:42
@metanivek metanivek requested a review from icristescu October 20, 2022 14:03
@metanivek metanivek requested review from Ngoguey42 and art-w October 20, 2022 14:03
Copy link
Contributor

@Ngoguey42 Ngoguey42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Ship it!

@metanivek metanivek force-pushed the intro_chunked_suffix branch from 9026a9a to 47f95af Compare October 21, 2022 13:12
@metanivek metanivek merged commit 8b90346 into mirage:main Oct 21, 2022
@metanivek metanivek deleted the intro_chunked_suffix branch October 21, 2022 13:28
metanivek added a commit to metanivek/opam-repository that referenced this pull request Dec 14, 2022
…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.5.0)

CHANGES:

### Added

- **irmin-pack**
  - Add `Irmin_pack_unix.Stats.Latest_gc` which is now the parameter of GC's
    `finished` callback (mirage/irmin#2089, @Ngoguey42)
  - Add `Gc.oldest_live_commit` which returns the key of the commit on which the
    latest gc was called on. (mirage/irmin#2110, @icristescu)
  - Add `split` to create a new suffix chunk. Subsequent writes will append to
    this chunk until `split` is called again. (mirage/irmin#2118, @icristescu)
  - Add `create_one_commit_store` to create a new store from the existing one,
    containing only one commit. (mirage/irmin#2125, @icristescu)

### Changed

- **irmin-pack**
  - Upgraded on-disk format to version 4. (mirage/irmin#2110, @icristescu)
  - Detecting control file corruption with a checksum (mirage/irmin#2119, @art-w)
  - Change on-disk layout of the suffix from a single file to a multiple,
    chunked file design (mirage/irmin#2115, @metanivek)
  - Modify GC to work with new chunked suffix. See `examples/gc.ml` for a
    demonstration of how it works with the new `split` function. (mirage/irmin#2126,
    @metanivek)

### Fixed
metanivek added a commit to metanivek/opam-repository that referenced this pull request Dec 14, 2022
…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.5.0)

CHANGES:

### Added

- **irmin-pack**
  - Add `Irmin_pack_unix.Stats.Latest_gc` which is now the parameter of GC's
    `finished` callback (mirage/irmin#2089, @Ngoguey42)
  - Add `Gc.oldest_live_commit` which returns the key of the commit on which the
    latest gc was called on. (mirage/irmin#2110, @icristescu)
  - Add `split` to create a new suffix chunk. Subsequent writes will append to
    this chunk until `split` is called again. (mirage/irmin#2118, @icristescu)
  - Add `create_one_commit_store` to create a new store from the existing one,
    containing only one commit. (mirage/irmin#2125, @icristescu)

### Changed

- **irmin-pack**
  - Upgraded on-disk format to version 4. (mirage/irmin#2110, @icristescu)
  - Detecting control file corruption with a checksum (mirage/irmin#2119, @art-w)
  - Change on-disk layout of the suffix from a single file to a multiple,
    chunked file design (mirage/irmin#2115, @metanivek)
  - Modify GC to work with new chunked suffix. See `examples/gc.ml` for a
    demonstration of how it works with the new `split` function. (mirage/irmin#2126,
    @metanivek)

### Fixed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants