encoding integrity of source documents #150

Open
keighrim opened this issue Dec 15, 2020 · 4 comments
Labels
✨N New feature or request

Comments

@keighrim
Member

At some point, we advertised that MMIF would encode file checksums in Document objects for checking data integrity. I want to bring this up for discussion, specifically around these points:

  1. I think data integrity is important, especially because MMIF, unlike LIF, doesn't carry the contents of the raw source data. What I'm not sure about is whether encoding a checksum hash string is the best way to ensure it.
  2. If we encode it, we need a standardized way of doing it (e.g. CRC32), and it must be specified in the documentation.
  3. Also, I think the implementation for generating the checksum string should go in the add_document method of the MMIF SDK (maybe as an optional parameter). We could also consider implementing helpers, in either the MMIF SDK or the CLAMS SDK, to check file integrity using the checksum string.
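
To make point 3 concrete, here is a minimal sketch of what such helpers might look like, assuming CRC32 is the chosen standard and the checksum is stored as an 8-digit hex string. The names `file_crc32` and `verify_file` are hypothetical and are not part of the MMIF or CLAMS SDKs.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of a file as an 8-digit lowercase hex string (hypothetical helper)."""
    crc = 0
    with open(path, "rb") as f:
        # Read in chunks so large media files are never loaded into memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def verify_file(path, expected_hex):
    """Check a file against a previously recorded checksum string (hypothetical helper)."""
    return file_crc32(path) == expected_hex.lower()
```

An optional parameter on add_document could call something like `file_crc32` when the document is added, and a consumer could call `verify_file` before processing the source file.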
@angus-lherrou
Contributor

angus-lherrou commented Dec 15, 2020

I think this is a good idea. It does bring up some questions about the clams source command, since the filepaths we provide there are not host paths but in-container paths, so the CLI tool as is would not be able to generate those checksums itself, but that'd be an issue for the clams and mmif-python repos, not this one.

I think CRC32 makes sense for this.

@keighrim
Member Author

keighrim commented Dec 15, 2020

Good point. A simple solution I can imagine is to add a parameter to the clams source command to mend the file path on the fly (--prefix sounds like a proper name). We can also add a flag to make clams source generate checksum strings while generating source MMIF JSONs.
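
One possible reading of the path-mending idea, as a rough sketch: swap the in-container root of a path for a host-side prefix, so the file can be opened on the host to compute its checksum. The function `mend_path` and its parameters are hypothetical and say nothing about how clams source is actually implemented.

```python
from pathlib import PurePosixPath


def mend_path(container_path, container_root, host_prefix):
    """Map an in-container path to the matching host path (hypothetical helper).

    Raises ValueError if container_path is not located under container_root.
    """
    relative = PurePosixPath(container_path).relative_to(container_root)
    return str(PurePosixPath(host_prefix) / relative)
```

For example, `mend_path("/data/video/a.mp4", "/data", "/mnt/archive")` gives `/mnt/archive/video/a.mp4`, which the CLI could then read to generate the checksum while still writing the in-container path into the MMIF.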

@angus-lherrou
Contributor

As discussed in the meeting today, Python's zlib module has a CRC32 implementation. However, it also has zlib.adler32, for which the docs state, "An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly."

I don't know what "much" means here but it might be worth considering choosing Adler-32 as our standard instead.
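
A quick way to find out what "much" means on a given machine is to time both functions on the same buffer. This is only a sketch: the 4 MiB payload size and the repeat counts are arbitrary choices, and the result depends heavily on the zlib build and the hardware, so no expected output is shown.

```python
import os
import timeit
import zlib


def seconds_per_call(func, payload, number=5, repeat=3):
    """Return the best observed time in seconds for one call of func(payload)."""
    best = min(timeit.repeat(lambda: func(payload), number=number, repeat=repeat))
    return best / number


payload = os.urandom(4 * 1024 * 1024)  # 4 MiB of random bytes, an arbitrary size

for name, func in (("crc32", zlib.crc32), ("adler32", zlib.adler32)):
    ms = seconds_per_call(func, payload) * 1000
    print(f"{name}: {ms:.3f} ms per 4 MiB")
```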

@keighrim keighrim added this to CLAMS Mar 18, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in CLAMS Mar 18, 2023
@keighrim keighrim added ▶️F Migrate to next phase ✨N New feature or request labels Apr 19, 2023
@clams-bot clams-bot added this to infra Apr 23, 2023
@github-project-automation github-project-automation bot moved this to Todo in infra Apr 23, 2023
@keighrim
Member Author

Recent developments:

  1. We might want to use a hash function that matches near-identical assets based on their contents, in addition to a strict hash over byte streams (e.g., https://pypi.org/project/videohash/).
  2. That said, we might want to allow multiple hashes, with their "specs" stored in the MMIF serialization.
  3. The primary purpose of these hash records is not security, so cryptographic strength isn't our first consideration (e.g., https://xxhash.com/).
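
As a rough illustration of point 2, multiple hashes could be stored as a list of records that each name their algorithm. The record layout (`algorithm` / `value` keys) and the function names below are assumptions made for this sketch, not part of the MMIF specification. Only standard-library hashes are used here; a content-based or xxHash entry would need a third-party package.

```python
import hashlib
import zlib

# Registry of supported algorithms; each maps bytes to a lowercase hex string.
HASHERS = {
    "crc32": lambda data: f"{zlib.crc32(data) & 0xFFFFFFFF:08x}",
    "adler32": lambda data: f"{zlib.adler32(data) & 0xFFFFFFFF:08x}",
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
}


def hash_records(data, algorithms):
    """Build one record per requested algorithm (hypothetical record layout)."""
    return [{"algorithm": name, "value": HASHERS[name](data)} for name in algorithms]


def matches(data, records):
    """Return True only if every stored record agrees with the given bytes."""
    return all(HASHERS[r["algorithm"]](data) == r["value"] for r in records)
```

Keeping the algorithm name next to each value lets a reader of the MMIF skip records it does not know how to verify, instead of assuming a single fixed checksum type.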

@keighrim keighrim removed the ▶️F Migrate to next phase label Sep 26, 2024