Add MD5 checksum utility #125

Open
erikyao opened this issue Jun 25, 2021 · 2 comments

@erikyao
Contributor

erikyao commented Jun 25, 2021

Data sources in myvariant.io may include large files to download (e.g. dbsnp release 155 is 380GB). For various reasons (such as FTP connection issues), a download may be incomplete, leading to errors during the upload process.

Some of the data sources have MD5 checksum files available. It would be nice to download those *.md5 files as well and validate the data files in the post_dump phase.

The validation in bash is quite straightforward. Each .md5 file is essentially a tuple of (checksum, filename). md5sum -c reads a .md5 file, recalculates the checksum for that filename and compares it with the original checksum. E.g.

(venv) myvariant@su09:/data/hub/myvariant_hub/dbsnp/155$ cat refsnp-chr16.json.bz2.md5 
aa8f0ec9c4752ea34dff2ae309d2a239  refsnp-chr16.json.bz2
(venv) myvariant@su09:/data/hub/myvariant_hub/dbsnp/155$ md5sum -c refsnp-chr16.json.bz2.md5
refsnp-chr16.json.bz2: OK

It's also feasible in Python with the built-in hashlib.md5(). See Generating an MD5 checksum of a file. The performance of feeding file content to hashlib should be taken into account before developing an MD5 helper class/function.
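
As a rough sketch of the hashlib approach, the snippet below reads the file in fixed-size chunks so that a multi-gigabyte download is never held in memory at once, then compares the result against the entries of a .md5 file. The function names `file_md5` and `check_md5_file` and the 1 MiB default chunk size are assumptions made for illustration, not existing biothings helpers. Filenames are resolved relative to the directory of the .md5 file, which differs slightly from `md5sum -c` (that resolves them against the current working directory).

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns b"" at end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_file(md5_path):
    """Roughly mimic `md5sum -c`: map each listed filename to True/False."""
    md5_path = Path(md5_path)
    results = {}
    for line in md5_path.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, filename = line.split(maxsplit=1)
        filename = filename.strip()
        # md5sum prefixes the filename with '*' when it hashed in binary mode
        if filename.startswith("*"):
            filename = filename[1:]
        actual = file_md5(md5_path.parent / filename)
        results[filename] = actual == expected.lower()
    return results
```

With the dbsnp example above, `check_md5_file("refsnp-chr16.json.bz2.md5")` would be expected to return a dict with a single `True` entry when the download is intact. The chunk size is the main knob for the performance concern and would need measuring on the real files.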

@erikyao erikyao self-assigned this Jun 25, 2021
@newgene
Member

newgene commented Jul 29, 2021

@erikyao Looks like post_download or post_dump might be a good place to add the md5 check for a data source like dbsnp:

https://github.com/biothings/biothings.api/blob/181a36fc2d5f782bb3608ec032891b0eaa9e7e1d/biothings/hub/dataload/dumper.py#L162-L172
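
One possible shape for such a hook is sketched below. It is a hypothetical mixin, not code from biothings or mychem: the class name `MD5CheckMixin`, the `ChecksumError` exception, and the assumption that the dumper exposes its download directory as `new_data_folder` and calls `post_dump(*args, **kwargs)` should all be checked against the linked dumper.py before use.

```python
import hashlib
from pathlib import Path


class ChecksumError(Exception):
    """Raised when a data file is missing or does not match its .md5 entry."""


class MD5CheckMixin:
    """Hypothetical mixin; assumes the host class sets `new_data_folder`."""

    def _md5(self, path, chunk_size=1024 * 1024):
        # Hash in chunks so large dumps are never loaded into memory whole
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def post_dump(self, *args, **kwargs):
        folder = Path(self.new_data_folder)
        # Validate every "<checksum>  <filename>" entry of every .md5 file
        for md5_file in sorted(folder.glob("*.md5")):
            for line in md5_file.read_text().splitlines():
                if not line.strip():
                    continue
                expected, name = line.split(maxsplit=1)
                name = name.strip()
                if name.startswith("*"):  # binary-mode marker from md5sum
                    name = name[1:]
                target = folder / name
                if not target.is_file():
                    raise ChecksumError(f"{name}: listed in {md5_file.name} but missing")
                if self._md5(target) != expected.lower():
                    raise ChecksumError(f"{name}: MD5 mismatch")
```

Raising from the hook is one design choice among several; logging the failure and scheduling a re-download of only the affected file would be another.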

@newgene
Member

newgene commented Jul 30, 2021

See an example from mychem: biothings/mychem.info@79f704b
and a small follow-up change: biothings/mychem.info@6fa8937
