file, testutil: Add reference file hasher #2099
Conversation
@nolash can you please explain the diagram? i can't see much more apart from a bunch of random arrows and numbers thrown on it. thanks
started reviewing but i'm missing some formal definitions
// calls sum if chunk boundary is reached and recursively calls this function for the next level with the acquired bmt hash
// adjusts cursors accordingly
func (r *ReferenceHasher) update(lvl int, data []byte) {
	if lvl == 0 {
it would be nice to write that level 0 is the data layer, especially since this reference hasher is another representation of a tree or trie, in which tree height is usually measured as the inverse (0 is the root)
agreed.
I still don't understand what the arrows pointing upwards in the diagrams mean. are these section numbers? section indexes? byte counts? byte array indexes?
In the PR description:
@acud The general idea is that you never need to keep a full chunk of data from a level below. Let me try to explain in a different way: you truncate the chunk data in the same way for every level. Notice that at the point where you hit a balanced tree, all of the chunks below will be truncated, and only the first 32 bytes in the buffer will remain. The maximum data you have to store in the work buffer if you'd fill all 9 levels would be just short of …
i think i understand this a bit better now. a few last clarifications about certain parts would be nice
get on with it
This PR is part of a series of PRs that introduces an interface that allows chaining of components that receive a data stream and generate hashes and intermediate Merkle-Tree chunks. The individual PR steps will be partitioned from #2022 (branch https://github.com/nolash/swarm/tree/filehasher-avenged ) as follows:
1. Introduce `SectionWriter`, implement this interface in `bmt`, make `AsyncHasher` standalone
2. Move `AsyncHasher` to `file/hasher`
3. `SectionWriter` sub-component for hashing intermediate Merkle Tree levels.
4. `SectionWriter` component executing the FileHasher algorithm
5. `io.Reader` and `SectionWriter`, and an implementation of `SectionWriter` component that provides `Chunk` output.
6. `SectionWriter` that provides encryption, along with a test utility `SectionWriter` implementation of a data cache.

`bmt.Hasher` exports wtr `AsyncHasher`
The `ReferenceHasher` object is introduced, which is a non-performant implementation of the file hashing algorithm. It uses a single buffer to write data, where the data is stacked sequentially left to right, from top to bottom level in the file tree.

The object `treeParams` describes the dimensions of the filehasher tree, and also holds a pool of hashers. The latter is not particularly relevant for the reference hasher (since it hashes synchronously), but the same component is used for the performant, threaded filehasher that will be PR'ed subsequently.

Terms
- `level`: A given height in the tree of hashes. Level number `0` is the actual data.
- `section`: A data section, which in its basic form equals the digest size of the BMT hash (32 bytes).
- `branches`: The branching factor of the tree, which in the current implementation is 128.
- `chunksize`: Equals `branches x section`.
- `span`: The data a particular swarm chunk represents.

For more details see sections 7.1 and 7.1.1 of "The Book of Swarm."
Example
The following diagram shows how the data bytes are laid out in the buffer for each level of the tree when hashing `128^2` bytes of data, at the level boundary stages of progression. The numbers above the bar indicate the number of sections at that point of the bar, and the numbers below indicate the right offset of the bytes of each level.

When starting out, all the levels have the same offset at the start of the buffer:
After we write 128 sections to the data level:
After we hash the 128 sections of the data level, we write the resulting hash at the level 1 offset. Level 0 is truncated and its offset aligned with level 1.
Here, 128 * 128 chunks have been written at level 0, which before hashing corresponds to 127 chunks of reference hashes in level 1.
When level 0 is hashed and written to level 1, level 1 will have 128 sections. It in turn is hashed, and the first level 2 hash is written at level 2 offset. Both level 1 and level 0 are truncated.
Incidentally, the `ReferenceHasher` is ported to javascript. I tried to keep the implementation as similar as possible: https://www.npmjs.com/package/swarm-lowlevel