Add Datalake Files endpoints #3
base: master
Conversation
missioncontrol_client/__init__.py
Outdated
```python
        l.append(self._compressor.flush())
        return b''.join(l)

    def _calculate_hash(self):
```
Unused?
This is used in super().__init__()
I've added the comment below to explain why we need to override it (to hash the raw file, not the gzip stream).
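For context, here is a minimal sketch of the pattern being discussed: compressing a file into a gzip stream while making sure the content hash covers the raw bytes, not the compressed output. The class name, chunk size, and method layout are illustrative assumptions, not the actual PR code.

```python
import zlib
from hashlib import blake2b

class GzipStreamSketch:
    """Illustrative sketch: gzip a file but hash the *raw* contents."""

    def __init__(self, path):
        self.path = path
        # wbits = MAX_WBITS | 16 selects the gzip container format
        self._compressor = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)

    def compressed(self):
        chunks = []
        with open(self.path, 'rb') as f:
            for block in iter(lambda: f.read(64 * 1024), b''):
                chunks.append(self._compressor.compress(block))
        chunks.append(self._compressor.flush())
        return b''.join(chunks)

    def _calculate_hash(self):
        # hash the raw file, not the gzip stream
        b2 = blake2b(digest_size=16)
        with open(self.path, 'rb') as f:
            for block in iter(lambda: f.read(64 * 1024), b''):
                b2.update(block)
        return b2.hexdigest()
```

Hashing the raw bytes keeps the content id stable even if compression settings change later.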
missioncontrol_client/__init__.py
Outdated
```python
    def _calculate_hash(self):
        '''ensure the hash is over the raw file, not the gzip stream'''
        b2 = blake2b(digest_size=16)
```
Is digest_size required here? Or just to ensure we're consistent?
yeah, we need both a hashing algorithm and a standard way to call it, otherwise we might end up with the same file stored twice under different hash lengths.
Actually, I'll probably encode the hash type and length with something like pymultihash.
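To illustrate the idea behind a self-describing hash: the digest is prefixed with a code for the algorithm and its length, so readers can verify it without out-of-band configuration. The `0xb2` algorithm byte below is a made-up placeholder, not the official multicodec code that pymultihash would use.

```python
from hashlib import blake2b

# Hypothetical multihash-style encoding: <algo byte><length byte><digest>.
# 0xb2 is an illustrative placeholder, not a registered multicodec code.
def encode_multihash(data: bytes, digest_size: int = 16) -> bytes:
    digest = blake2b(data, digest_size=digest_size).digest()
    return bytes([0xb2, len(digest)]) + digest

def decode_multihash(mh: bytes):
    algo, length = mh[0], mh[1]
    digest = mh[2:]
    assert len(digest) == length, "truncated multihash"
    return algo, digest
```

With this framing, changing the digest size later would produce distinguishable identifiers instead of silently colliding conventions.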
missioncontrol_client/__init__.py
Outdated
```python
import zlib
import socket
import datetime
from pyblake2 import blake2b
```
Why not builtin? https://docs.python.org/3/library/hashlib.html#hashlib.blake2b
I'm relying on planetlabs/datalake, which uses pyblake2
I guess I could actually vendor this functionality and make the modifications directly.
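A common compromise, sketched here as an assumption rather than what the PR does: prefer the builtin `hashlib.blake2b` (available since Python 3.6) and fall back to `pyblake2` only on older interpreters, since the two expose the same API.

```python
try:
    # builtin since Python 3.6
    from hashlib import blake2b
except ImportError:  # pragma: no cover
    # pyblake2 was the reference backport with the same interface
    from pyblake2 import blake2b

h = blake2b(digest_size=16)
h.update(b"some file contents")
print(h.hexdigest())
```

This keeps `pyblake2` out of the hard dependency list while staying compatible with planetlabs/datalake's usage.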
missioncontrol_client/__init__.py
Outdated
```python
        start = UTC("now").iso

        if where is None:
            where = socket.gethostname()
```
Does this do FQDN if it can?
good point - I'll change to getfqdn
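For reference, the difference between the two stdlib calls: `gethostname()` may return only the short host name, while `getfqdn()` attempts DNS resolution to produce a fully-qualified name (falling back to the plain name if resolution fails).

```python
import socket

# Short name, as configured on the machine (may lack the domain part).
short = socket.gethostname()

# Fully-qualified name, resolved via DNS where possible.
fqdn = socket.getfqdn()

print(short, fqdn)
```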
missioncontrol_client/__init__.py
Outdated
```python
        cid = f.metadata["hash"]

        fleetmeta = {
```
Fleet specific?
I'm inheriting from planetlabs/datalake File class, but we've changed our metadata structure, so this converts to the "fleet" version of metadata.
Ideally we'd fork/enhance this library with metadata 2.0
and distribute it, but using it as is provides a lot of value without much effort in the near term.
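A rough sketch of what such a conversion shim could look like, assuming a datalake-style `File` with a `metadata` dict. The field names below are illustrative guesses, not the actual missioncontrol "fleet" schema.

```python
# Hypothetical adapter from datalake-style metadata to a "fleet" metadata
# dict; keys other than "hash" are assumptions for illustration only.
def to_fleetmeta(f):
    cid = f.metadata["hash"]
    return {
        "cid": cid,
        "what": f.metadata.get("what"),
        "where": f.metadata.get("where"),
        "start": f.metadata.get("start"),
    }
```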
Ah right.
At this point I don't actually see what this File class adds for us. Can you explain a bit?
Yeah - I'm trying to leverage as much of the datalake infrastructure as possible, as it includes lessons learned and features developed over several years.
For example, datalake files have a tar bundle format and an inotify-based auto-upload service. If we can re-use a lot of this work, we won't have to rewrite it from scratch.
Except currently we're just doing a POST directly anyway?
I'm wondering what specifically this PR uses from the File class of datalake.
Maybe the answer is 'nothing yet'?
I think the mismatch is that I was overriding the methods used in this diff to get streaming gzip working.
I've since moved that into a datalake fork. I think it's worth factoring out the metadata and tooling as it's a bit more complex than just normal REST api stuff.
Ideally this library would be very lean and not doing too much magic.
36689b5 to c19b4db
This adds client-side functionality for interfacing with the missioncontrol files endpoints.
Note this requires us to land the changes into missioncontrol first.