This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

file, testutil: Add reference file hasher #2099

Merged
nolash merged 7 commits into ethersphere:master from reference-filehasher on Feb 24, 2020

Conversation

@nolash nolash commented Feb 11, 2020

This PR is part of a series of PRs that introduces an interface that allows chaining of components that receive a data stream and generate hashes and intermediate Merkle-Tree chunks. The individual PR steps will be partitioned from #2022 (branch https://github.com/nolash/swarm/tree/filehasher-avenged ) as follows:

  1. Introduce SectionWriter, implement this interface in bmt, make AsyncHasher standalone
  2. Move AsyncHasher to file/hasher
  3. Add reference implementation of the Filehasher algorithm (this PR)
  4. Add implementation of SectionWriter sub-component for hashing intermediate Merkle Tree levels.
  5. Add implementation of SectionWriter component executing the FileHasher algorithm
  6. Add a "splitter" that bridges io.Reader and SectionWriter, and an implementation of SectionWriter component that provides Chunk output.
  7. Add implementation of SectionWriter that provides encryption, along with a test utility SectionWriter implementation of a data cache.
  8. Evaluate and prune bmt.Hasher exports with regard to AsyncHasher

The ReferenceHasher object is introduced, which is a non-performant implementation of the file hashing algorithm. It writes to a single work buffer, in which the data is stacked sequentially left to right, from the top level to the bottom (data) level of the file tree.

The object treeParams describes the dimensions of the filehasher tree, and also holds a pool of hashers. The latter is not particularly relevant for the reference hasher (since it hashes synchronously), but the same component is used for the performant, threaded filehasher that will be PR'ed subsequently.
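For orientation, here is a rough sketch of how these two objects could fit together. The type names treeParams and ReferenceHasher come from the PR; the field names and the generic hash.Hash factory standing in for the hasher pool are illustrative assumptions, not the actual code:

```go
package hasher

import "hash"

// treeParams: dimensions of the filehasher tree plus a source of hashers.
// Field names are illustrative, not the PR's.
type treeParams struct {
	SectionSize int              // bytes per section, e.g. 32 (the BMT digest size)
	Branches    int              // branching factor, e.g. 128
	ChunkSize   int              // SectionSize * Branches
	hashFunc    func() hash.Hash // stands in for the pool of BMT hashers
}

// ReferenceHasher: non-performant, synchronous file hasher writing all levels
// into one shared work buffer, stacked left to right from top level to data level.
type ReferenceHasher struct {
	params  *treeParams
	cursors []int  // per-level write cursor (in sections) into the buffer
	buffer  []byte // the single work buffer
}
```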

Terms

  • level: A given height in the tree of hashes. Level number 0 is the actual data.
  • section: A data section, which in its basic form equals the digest size of the BMT hash (32 bytes).
  • branches: The branching factor of the tree, which in the current implementation is 128.
  • chunksize: Equals branches x section size, i.e. 128 x 32 = 4096 bytes (see the sketch below).
  • span: the data a particular swarm chunk represents:
    • the span of a data chunk (level 0) will always be the same as the number of bytes in the chunk
    • the span of any other chunk (level 1+) equals the actual byte count of the data it represents.

For more details see sections 7.1 and 7.1.1 of "The Book of Swarm."
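To make the terms concrete, here is a small, self-contained calculation of chunk size and the maximum span per level. It is only an illustration of the definitions above; the constant and helper names are made up, not from the PR:

```go
package main

import "fmt"

const (
	sectionSize = 32                     // BMT digest size in bytes
	branches    = 128                    // branching factor
	chunkSize   = sectionSize * branches // 4096 bytes
)

// maxSpan returns the largest number of data bytes a full chunk at the
// given level can represent.
func maxSpan(level int) int {
	span := chunkSize
	for i := 0; i < level; i++ {
		span *= branches
	}
	return span
}

func main() {
	fmt.Println(maxSpan(0)) // 4096: a data chunk spans exactly its own bytes
	fmt.Println(maxSpan(1)) // 524288: 128 data chunks
	fmt.Println(maxSpan(2)) // 67108864: 128 * 128 data chunks
}
```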

Example

The following diagrams show how the data bytes are laid out in the buffer for each level of the tree when hashing 128 * 128 sections (128 chunks) of data, at the level-boundary stages of progression. The numbers above the bar indicate the number of sections that point of the bar represents, and the numbers below indicate the right offset of the bytes of each level.

When starting out, all the levels have the same offset at the start of the buffer:

[diagram: ref-000001]

After we write 128 sections to the data level:

[diagram: ref-000002]

After we hash the 128 sections of the data level, we write the resulting hash at the level 1 offset. Level 0 is truncated and its offset is aligned with level 1.

[diagram: ref-000003]

Here, 128 * 128 sections have been written at level 0, which just before hashing the last chunk corresponds to 127 sections of reference hashes at level 1.

[diagram: ref-000004]

When level 0 is hashed and written to level 1, level 1 will have 128 sections. It is in turn hashed, and the first level 2 hash is written at the level 2 offset. Both level 1 and level 0 are truncated.

[diagram: ref-000005]
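The same walkthrough can be reproduced numerically. The helper below is my own illustration (not part of the PR): it counts how many sections each level ends up holding for a given amount of data. For the 128-chunk example above it gives 16384 sections at level 0, 128 sections at level 1, and the single root reference at level 2:

```go
package main

import "fmt"

const (
	sectionSize = 32
	branches    = 128
	chunkSize   = sectionSize * branches
)

// sectionCounts returns the number of sections per level for dataLength
// bytes of input, stopping at the level that holds the single root reference.
func sectionCounts(dataLength int) []int {
	counts := []int{(dataLength + sectionSize - 1) / sectionSize}
	for counts[len(counts)-1] > 1 {
		next := (counts[len(counts)-1] + branches - 1) / branches
		counts = append(counts, next)
	}
	return counts
}

func main() {
	fmt.Println(sectionCounts(128 * chunkSize)) // [16384 128 1]
}
```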


Incidentally, the ReferenceHasher has been ported to JavaScript. I tried to keep the implementation as similar as possible: https://www.npmjs.com/package/swarm-lowlevel

acud commented Feb 12, 2020

@nolash can you please explain the diagram? i can't see much more apart from a bunch of random arrows and numbers thrown on it. thanks

@acud acud left a comment

started reviewing but i'm missing some formal definitions

// calls sum if chunk boundary is reached and recursively calls this function for the next level with the acquired bmt hash
// adjusts cursors accordingly
func (r *ReferenceHasher) update(lvl int, data []byte) {
	if lvl == 0 {

it would be nice to write that level 0 is the data layer. especially when this reference hasher is another representation of a tree or trie, in which tree height is measured as the inverse (0 is the root)


agreed.

acud commented Feb 14, 2020

I still don't understand what the arrows pointing upwards in the diagrams mean. are these section numbers? section indexes? byte counts? byte array indexes?

nolash commented Feb 14, 2020

I still don't understand what the arrows pointing upwards in the diagrams mean. are these section numbers? section indexes? byte counts? byte array indexes?

In the PR description:

... and the number below indicates the right offset of the bytes of each level.

@acud The general idea is that you never need to keep a full chunk of data from a level below. Let me try to explain in a different way:

Let chunk C be a fully written chunk in level n, and let s be the 32-byte section offset in the work buffer where you started to write C. When you hash C, write the result at s. Then increase s by one to get s'. When you start to write the next chunk in level n, you start at s'.

You truncate the chunk data in the same way for every level. Notice that at the point where you hit a balanced tree, all of the chunks below will be truncated, and only the first 32 bytes in the buffer will remain. The maximum data you would have to store in the work buffer, if you were to fill all 9 levels, is just short of 8*128*32 bytes, because you only ever need to keep the last chunk in each level.
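To make the cursor rule concrete, here is a runnable toy that follows the description above. It is my own sketch, not the code in this PR: sha256 stands in for the BMT hash, spans are ignored, the input is all zeros, and all names are hypothetical. What it demonstrates is the truncation: after 128 chunks of data, everything collapses back to a single reference at the start of the buffer.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	sectionSize = 32
	branches    = 128
	chunkSize   = sectionSize * branches
	levels      = 4 // enough for this demo
)

// toyHasher keeps a single shared work buffer and one write cursor per level.
type toyHasher struct {
	buffer  []byte
	cursors []int // per-level write cursor, in sections
	counts  []int // sections written to the current chunk of each level
}

func newToyHasher() *toyHasher {
	return &toyHasher{
		buffer:  make([]byte, levels*chunkSize),
		cursors: make([]int, levels+1),
		counts:  make([]int, levels+1),
	}
}

// write appends one section to level lvl. When a chunk boundary is reached,
// the chunk (which started at s = cursors[lvl+1]) is hashed, the digest is
// written at s (it now belongs to level lvl+1), and the next chunk on this
// level starts at s' = s+1.
func (t *toyHasher) write(lvl int, section []byte) {
	copy(t.buffer[t.cursors[lvl]*sectionSize:], section)
	t.cursors[lvl]++
	t.counts[lvl]++
	if t.counts[lvl] == branches {
		t.counts[lvl] = 0
		s := t.cursors[lvl+1]
		sum := sha256.Sum256(t.buffer[s*sectionSize : t.cursors[lvl]*sectionSize])
		t.write(lvl+1, sum[:])            // digest lands at s on the level above
		t.cursors[lvl] = t.cursors[lvl+1] // truncate: next chunk starts at s' = s+1
	}
}

func main() {
	t := newToyHasher()
	for i := 0; i < branches*branches; i++ { // 128 chunks of zero data, as in the example
		t.write(0, make([]byte, sectionSize))
	}
	// Prints [1 1 1 0 0]: levels 0 and 1 have been truncated away, and their
	// next write position sits right after the single level 2 reference held
	// in the first section of the buffer.
	fmt.Println(t.cursors)
}
```

Running it ends in the same state as the last diagram in the PR description: only the level 2 reference remains at the start of the buffer, with all lower levels truncated.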

@acud acud left a comment

i think i understand this a bit better now. a few last clarifications about certain parts would be nice

@acud acud left a comment

get on with it

@nolash nolash merged commit 15fbb9d into ethersphere:master Feb 24, 2020
@nolash nolash deleted the reference-filehasher branch February 24, 2020 05:41