This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

file, testutil: Add reference file hasher #2099

Merged
nolash merged 7 commits into ethersphere:master from reference-filehasher on Feb 24, 2020

Conversation

@nolash nolash commented Feb 11, 2020

This PR is part of a series of PRs that introduces an interface that allows chaining of components that receive a data stream and generate hashes and intermediate Merkle-Tree chunks. The individual PR steps will be partitioned from #2022 (branch https://github.com/nolash/swarm/tree/filehasher-avenged ) as follows:

  1. Introduce SectionWriter, implement this interface in bmt, make AsyncHasher standalone
  2. Move AsyncHasher to file/hasher
  3. Add reference implementation of the Filehasher algorithm (this PR)
  4. Add implementation of SectionWriter sub-component for hashing intermediate Merkle Tree levels.
  5. Add implementation of SectionWriter component executing the FileHasher algorithm
  6. Add a "splitter" that bridges io.Reader and SectionWriter, and an implementation of SectionWriter component that provides Chunk output.
  7. Add implementation of SectionWriter that provides encryption, along with a test utility SectionWriter implementation of a data cache.
  8. Evaluate and prune bmt.Hasher exports with regard to AsyncHasher

The ReferenceHasher object is introduced, which is a non-performant implementation of the file hashing algorithm. It writes to a single work buffer, in which the data is stacked sequentially left to right, from the top level to the bottom (data) level of the file tree.

The object treeParams describes the dimensions of the filehasher tree, and also holds a pool of hashers. The latter is not particularly relevant for the reference hasher (since it hashes synchronously), but the same component is used for the performant, threaded filehasher that will be PR'ed subsequently.
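For orientation, here is a rough sketch of how these two objects could fit together. The type names treeParams and ReferenceHasher come from the PR; the field names and the generic hash.Hash factory standing in for the hasher pool are illustrative assumptions, not the actual code:

```go
package hasher

import "hash"

// treeParams: dimensions of the filehasher tree plus a source of hashers.
// Field names are illustrative, not the PR's.
type treeParams struct {
	SectionSize int              // bytes per section, e.g. 32 (the BMT digest size)
	Branches    int              // branching factor, e.g. 128
	ChunkSize   int              // SectionSize * Branches
	hashFunc    func() hash.Hash // stands in for the pool of BMT hashers
}

// ReferenceHasher: non-performant, synchronous file hasher writing all levels
// into one shared work buffer, stacked left to right from top level to data level.
type ReferenceHasher struct {
	params  *treeParams
	cursors []int  // per-level write cursor (in sections) into the buffer
	buffer  []byte // the single work buffer
}
```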

Terms

  • level: A given height in the tree of hashes. Level number 0 is the actual data.
  • section: A data section, which in its basic form equals the digest size of the BMT hash (32 bytes).
  • branches: The branching factor of the tree, which in the current implementation is 128.
  • chunksize: Equals branches x section size, i.e. 128 x 32 = 4096 bytes (see the sketch below).
  • span: the data a particular swarm chunk represents:
    • the span of a data chunk (level 0) will always be the same as the number of bytes in the chunk
    • the span of any other chunk (level 1+) equals the actual byte count of the data it represents.

For more details see sections 7.1 and 7.1.1 of "The Book of Swarm."
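To make the terms concrete, here is a small, self-contained calculation of chunk size and the maximum span per level. It is only an illustration of the definitions above; the constant and helper names are made up, not from the PR:

```go
package main

import "fmt"

const (
	sectionSize = 32                     // BMT digest size in bytes
	branches    = 128                    // branching factor
	chunkSize   = sectionSize * branches // 4096 bytes
)

// maxSpan returns the largest number of data bytes a full chunk at the
// given level can represent.
func maxSpan(level int) int {
	span := chunkSize
	for i := 0; i < level; i++ {
		span *= branches
	}
	return span
}

func main() {
	fmt.Println(maxSpan(0)) // 4096: a data chunk spans exactly its own bytes
	fmt.Println(maxSpan(1)) // 524288: 128 data chunks
	fmt.Println(maxSpan(2)) // 67108864: 128 * 128 data chunks
}
```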

Example

The following diagrams show how the data bytes are laid out in the buffer for each level of the tree when hashing 128 * 128 sections (128 chunks) of data, at the level-boundary stages of progression. The numbers above the bar indicate the number of sections that point of the bar represents, and the numbers below indicate the right offset of the bytes of each level.

When starting out, all the levels have the same offset at the start of the buffer:

[diagram: ref-000001]

After we write 128 sections to the data level:

[diagram: ref-000002]

After we hash the 128 sections of the data level, we write the resulting hash at the level 1 offset. Level 0 is truncated and its offset is aligned with level 1.

[diagram: ref-000003]

Here, 128 * 128 sections have been written at level 0, which just before hashing the last chunk corresponds to 127 sections of reference hashes at level 1.

[diagram: ref-000004]

When level 0 is hashed and written to level 1, level 1 will have 128 sections. It is in turn hashed, and the first level 2 hash is written at the level 2 offset. Both level 1 and level 0 are truncated.

[diagram: ref-000005]
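The same walkthrough can be reproduced numerically. The helper below is my own illustration (not part of the PR): it counts how many sections each level ends up holding for a given amount of data. For the 128-chunk example above it gives 16384 sections at level 0, 128 sections at level 1, and the single root reference at level 2:

```go
package main

import "fmt"

const (
	sectionSize = 32
	branches    = 128
	chunkSize   = sectionSize * branches
)

// sectionCounts returns the number of sections per level for dataLength
// bytes of input, stopping at the level that holds the single root reference.
func sectionCounts(dataLength int) []int {
	counts := []int{(dataLength + sectionSize - 1) / sectionSize}
	for counts[len(counts)-1] > 1 {
		next := (counts[len(counts)-1] + branches - 1) / branches
		counts = append(counts, next)
	}
	return counts
}

func main() {
	fmt.Println(sectionCounts(128 * chunkSize)) // [16384 128 1]
}
```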


Incidentally, the ReferenceHasher has been ported to JavaScript. I tried to keep the implementation as similar as possible: https://www.npmjs.com/package/swarm-lowlevel

acud commented Feb 12, 2020

@nolash can you please explain the diagram? i can't see much more apart from a bunch of random arrows and numbers thrown on it. thanks

@acud acud left a comment

started reviewing but i'm missing some formal definitions

// calls sum if chunk boundary is reached and recursively calls this function for the next level with the acquired bmt hash
// adjusts cursors accordingly
func (r *ReferenceHasher) update(lvl int, data []byte) {
	if lvl == 0 {

it would be nice to write that level 0 is the data layer. especially when this reference hasher is another representation of a tree or trie, in which tree height is measured as the inverse (0 is the root)


agreed.

acud commented Feb 14, 2020

I still don't understand what the arrows pointing upwards in the diagrams mean. are these section numbers? section indexes? byte counts? byte array indexes?

nolash commented Feb 14, 2020

I still don't understand what the arrows pointing upwards in the diagrams mean. are these section numbers? section indexes? byte counts? byte array indexes?

In the PR description:

... and the number below indicates the right offset of the bytes of each level.

@acud The general idea is that you never need to keep a full chunk of data from a level below. Let me try to explain in a different way:

Let chunk C be a fully written chunk in level n, and let s be the 32-byte section offset in the work buffer where you started to write C. When you hash C, write the result at s. Then increase s by one to get s'. When you start to write the next chunk in level n, you start at s'.

You truncate the chunk data in the same way for every level. Notice that at the point where you hit a balanced tree, all of the chunks below will be truncated, and only the first 32 bytes in the buffer will remain. The maximum data you would have to store in the work buffer, if you were to fill all 9 levels, is just short of 8*128*32 bytes, because you only ever need to keep the last chunk in each level.
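To make the cursor rule concrete, here is a runnable toy that follows the description above. It is my own sketch, not the code in this PR: sha256 stands in for the BMT hash, spans are ignored, the input is all zeros, and all names are hypothetical. What it demonstrates is the truncation: after 128 chunks of data, everything collapses back to a single reference at the start of the buffer.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	sectionSize = 32
	branches    = 128
	chunkSize   = sectionSize * branches
	levels      = 4 // enough for this demo
)

// toyHasher keeps a single shared work buffer and one write cursor per level.
type toyHasher struct {
	buffer  []byte
	cursors []int // per-level write cursor, in sections
	counts  []int // sections written to the current chunk of each level
}

func newToyHasher() *toyHasher {
	return &toyHasher{
		buffer:  make([]byte, levels*chunkSize),
		cursors: make([]int, levels+1),
		counts:  make([]int, levels+1),
	}
}

// write appends one section to level lvl. When a chunk boundary is reached,
// the chunk (which started at s = cursors[lvl+1]) is hashed, the digest is
// written at s (it now belongs to level lvl+1), and the next chunk on this
// level starts at s' = s+1.
func (t *toyHasher) write(lvl int, section []byte) {
	copy(t.buffer[t.cursors[lvl]*sectionSize:], section)
	t.cursors[lvl]++
	t.counts[lvl]++
	if t.counts[lvl] == branches {
		t.counts[lvl] = 0
		s := t.cursors[lvl+1]
		sum := sha256.Sum256(t.buffer[s*sectionSize : t.cursors[lvl]*sectionSize])
		t.write(lvl+1, sum[:])            // digest lands at s on the level above
		t.cursors[lvl] = t.cursors[lvl+1] // truncate: next chunk starts at s' = s+1
	}
}

func main() {
	t := newToyHasher()
	for i := 0; i < branches*branches; i++ { // 128 chunks of zero data, as in the example
		t.write(0, make([]byte, sectionSize))
	}
	// Prints [1 1 1 0 0]: levels 0 and 1 have been truncated away, and their
	// next write position sits right after the single level 2 reference held
	// in the first section of the buffer.
	fmt.Println(t.cursors)
}
```

Running it ends in the same state as the last diagram in the PR description: only the level 2 reference remains at the start of the buffer, with all lower levels truncated.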

@acud acud left a comment

i think i understand this a bit better now. a few last clarifications about certain parts would be nice

@acud acud left a comment

get on with it

@nolash nolash merged commit 15fbb9d into ethersphere:master Feb 24, 2020
@nolash nolash deleted the reference-filehasher branch February 24, 2020 05:41