Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashlib paradigm inconsistent with Bagit spec, but also causing runtime errors #158

Open
ross-spencer opened this issue May 13, 2022 · 2 comments

Comments

@ross-spencer
Copy link

Bagit-python is currently using a generic pattern to import the list of checksums available to the tool to be used:

bagit-python/bagit.py

Lines 125 to 131 in bff20d2

try:
CHECKSUM_ALGOS = hashlib.algorithms_guaranteed
except AttributeError:
# FIXME: remove when we drop Python 2 (https://github.com/LibraryOfCongress/bagit-python/issues/102)
# Python 2.7.0-2.7.8
CHECKSUM_ALGOS = set(hashlib.algorithms)
DEFAULT_CHECKSUMS = ["sha256", "sha512"]

This creates a help-command with the following in Python 3.9.6.

--sha3_224 
--sha3_256 
--blake2s 
--sha3_384 
--sha384 
--md5 
--sha224 
--shake_256 
--sha256 
--shake_128 
--sha3_512 
--blake2b 
--sha512 
--sha1 

@pwinckles brings up something similar here: #156 I am not clear what is invalid about those names - that being said:

The spec states:

To avoid future ambiguity, the checksum algorithm SHOULD be registered in IANA's "Named Information Hash Algorithm Registry"

The most obvious example of algorithm's not in the registry at time of writing are those from the shake family. Further, there is ambiguity in the Blake algorithms presented by Python's hashlib.

Errors caused by shake

Additionally, the shake algorithms require a length argument not handled by the code: https://docs.python.org/3/library/hashlib.html#shake-variable-length-digests

Causing the following when selected.

Traceback (most recent call last):
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 241, in make_bag
    total_bytes, total_files = make_manifests(
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in make_manifests
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in <listcomp>
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1412, in generate_manifest_lines
    results = [
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1413, in <listcomp>
    (alg, hasher.hexdigest(), decoded_filename, total_bytes)
TypeError: hexdigest() missing required argument 'length' (pos 1)
2022-05-13 10:38:36,341 - ERROR - Failed to create bag in govdoc-gifs: hexdigest() missing required argument 'length' (pos 1)
Traceback (most recent call last):
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1599, in main
    make_bag(
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 241, in make_bag
    total_bytes, total_files = make_manifests(
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in make_manifests
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in <listcomp>
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1412, in generate_manifest_lines
    results = [
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1413, in <listcomp>
    (alg, hasher.hexdigest(), decoded_filename, total_bytes)
TypeError: hexdigest() missing required argument 'length' (pos 1)

While it's nice there's some extensibility for free, perhaps, as this issue can come up again in future, the algorithms can be provided from a known list compatible with IANA today, and Python, and perhaps cross-referenced against, checksums_guaranteed instead?

@acdha
Copy link
Member

acdha commented May 13, 2022

If memory serves, that bit of code predates the work on the RFC which strengthened the guidance around things like this. If you'd like to send a pull-request I think it would definitely make sense to trim that down to be only the list of non-draft algorithms from https://www.iana.org/assignments/named-information/named-information.xhtml.

@ross-spencer
Copy link
Author

That makes sense @acdha. I'll have a look and see if I can submit something. Folks reading this list might want to avail themselves on whether they're going to be affected by any of the changes, i.e. checking the IANA link above for any checksums removed they may be using.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants