hash all downloads #266

Open
rabernat opened this issue Jan 25, 2022 · 2 comments

Comments

@rabernat
Contributor

Just noting how easy it is to hash all downloads.

In our current copy function

with output_opener as target:
    start = time.time()
    interval = 5  # seconds
    bytes_read = log_count = 0
    while True:
        data = source.read(BLOCK_SIZE)
        if not data:
            break
        target.write(data)

we could easily insert this

md5hash = hashlib.md5()
with fsspec.open(fname, mode='rb') as fp:
    while True:
        data = fp.read(block_size)
        if not data:
            break
        md5hash.update(data)

I have confirmed that a hash built this way matches the md5 command line results.
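
For illustration, here is a minimal sketch of how the two snippets above might be combined so that the hash is computed in the same pass as the copy. The function name `copy_and_hash` and the default `block_size` are hypothetical and not part of the existing code; `source` and `target` are assumed to be any binary file-like objects.

```python
import hashlib
import io


def copy_and_hash(source, target, block_size=10_000_000):
    """Copy source to target in blocks; return the MD5 hex digest of the bytes copied."""
    md5hash = hashlib.md5()
    while True:
        data = source.read(block_size)
        if not data:
            break
        target.write(data)
        # Update the hash with the same block that was just written
        md5hash.update(data)
    return md5hash.hexdigest()


# Usage: in-memory streams stand in for the real source and target (an assumption)
payload = b"example bytes" * 1000
src, dst = io.BytesIO(payload), io.BytesIO()
digest = copy_and_hash(src, dst, block_size=4096)
```

Hashing inside the copy loop avoids reading the file a second time, at the cost of tying the hash to the copy step.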

@cisaacstern
Member

Where would we store this hash and what would it be used for?

@rabernat
Contributor Author

There are three basic uses:

  • Validation against data corruption (most effective if the original hash is available somewhere)
  • Determining if we need to re-download a cached file
  • Indexing for content-addressable storage such as IPFS

We could store it in the metadata store for now.

We might also eventually want to save the list of ingested files in the database (e.g. to enable appending; #37). Storing the hash along with each file would be good practice here.
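
As a rough sketch of the validation and re-download uses listed above, a stored hash could be compared against the cached copy. Everything here is an assumption for illustration: a plain dict stands in for the metadata store, and `file_md5` and `needs_redownload` are hypothetical helper names. Note that this only detects a cached file that has changed or gone missing; detecting a change at the source would require a hash published by the data provider.

```python
import hashlib
import os
import tempfile


def file_md5(path, block_size=1_000_000):
    """Return the MD5 hex digest of a local file, read in blocks."""
    md5hash = hashlib.md5()
    with open(path, "rb") as fp:
        for data in iter(lambda: fp.read(block_size), b""):
            md5hash.update(data)
    return md5hash.hexdigest()


def needs_redownload(hashes, fname, cached_path):
    """True if the cached file is missing, has no stored hash, or no longer matches it."""
    stored = hashes.get(fname)
    if stored is None or not os.path.exists(cached_path):
        return True
    return file_md5(cached_path) != stored


# Usage: a plain dict stands in for the metadata store (hypothetical)
hashes = {}
with tempfile.TemporaryDirectory() as tmpdir:
    cached_path = os.path.join(tmpdir, "input.nc")
    with open(cached_path, "wb") as fp:
        fp.write(b"original contents")
    hashes["input.nc"] = file_md5(cached_path)
    unchanged = needs_redownload(hashes, "input.nc", cached_path)

    # Simulate corruption of the cached copy
    with open(cached_path, "wb") as fp:
        fp.write(b"corrupted contents")
    corrupted = needs_redownload(hashes, "input.nc", cached_path)

    # A file with no stored hash is always treated as needing a download
    unknown = needs_redownload(hashes, "other.nc", cached_path)
```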
