Add Datalake Files endpoints #3

Open · wants to merge 1 commit into master from files_api
Conversation

dtkav (Contributor) commented Mar 18, 2019:

This adds client-side functionality for interfacing with the missioncontrol files endpoints.
Note that this requires landing the corresponding changes in missioncontrol first.

dtkav requested a review from Psykar on March 18, 2019 06:00
    l.append(self._compressor.flush())
    return b''.join(l)

def _calculate_hash(self):
Member commented:
Unused?

dtkav (Contributor, Author) commented Mar 19, 2019:

This is used in super().__init__().
I've added the comment below to explain why we need to override it (to hash the raw file, not the gzip stream).


def _calculate_hash(self):
    '''ensure the hash is over the raw file, not the gzip stream'''
    b2 = blake2b(digest_size=16)
Member commented:
Is digest_size required here? Or just to ensure we're consistent?

dtkav (Contributor, Author) commented:

Yeah, we need both a hashing algorithm and a standard way to call it; otherwise we might end up with the same file stored twice under different hash lengths.
Actually, I'll probably encode the hash type and length with something like pymultihash.
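
For context, a minimal sketch of what this override could look like; the self._path attribute and the chunk size are assumptions, not part of this PR:

from pyblake2 import blake2b

def _calculate_hash(self):
    '''ensure the hash is over the raw file, not the gzip stream'''
    b2 = blake2b(digest_size=16)
    # Hash the raw (uncompressed) file in chunks so large files never
    # have to fit in memory; self._path is an assumed attribute here.
    with open(self._path, 'rb') as f:
        for chunk in iter(lambda: f.read(64 * 1024), b''):
            b2.update(chunk)
    return b2.hexdigest()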

import zlib
import socket
import datetime
from pyblake2 import blake2b
dtkav (Contributor, Author) commented:

I'm relying on planetlabs/datalake, which uses pyblake2

dtkav (Contributor, Author) commented:

Actually, I could vendor this functionality and make the modifications directly.

missioncontrol_client/__init__.py (conversation resolved)
start = UTC("now").iso

if where is None:
    where = socket.gethostname()
Member commented:
Does this do FQDN if it can?

dtkav (Contributor, Author) commented:

Good point; I'll change it to getfqdn.
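
For reference, a minimal sketch of the suggested change; socket.getfqdn() falls back to the plain gethostname() output when no fully qualified name can be resolved:

import socket

if where is None:
    # Prefer the fully qualified domain name; getfqdn() falls back
    # to the plain hostname when the resolver can't provide one.
    where = socket.getfqdn()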


cid = f.metadata["hash"]

fleetmeta = {
Member commented:
Fleet specific?

dtkav (Contributor, Author) commented:

I'm inheriting from the planetlabs/datalake File class, but we've changed our metadata structure, so this converts to the "fleet" version of the metadata.

Ideally we'd fork/enhance this library with metadata 2.0 and distribute it, but using it as-is provides a lot of value without much effort in the near term.
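
Illustratively, the conversion might look something like the sketch below; only f.metadata["hash"] appears in the diff above, and every fleet-side key is hypothetical:

cid = f.metadata["hash"]

# Hypothetical mapping from datalake metadata to the "fleet" shape;
# all keys other than "hash" are illustrative, not from this PR.
fleetmeta = {
    "cid": cid,
    "start": f.metadata.get("start"),
    "where": f.metadata.get("where"),
    "what": f.metadata.get("what"),
}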

Member commented:
Ah right.
At this point I don't actually see what this File class adds for us. Can you explain a bit?

dtkav (Contributor, Author) commented:

Yeah, I'm trying to leverage as much of the datalake infrastructure as possible, since it includes lessons learned and features developed over several years.
For example, datalake files have a tar bundle format and an inotify-based auto-upload service. If we can re-use a lot of this work, we won't have to rewrite it from scratch.

Member commented:

Except currently we're just doing a POST directly anyway?
I'm wondering what specifically this PR uses from the datalake File class.

Maybe the answer is 'nothing yet'?

dtkav (Contributor, Author) commented:

I think the mismatch is that I was overriding the methods used in this diff to get streaming gzip working.
I've since moved that into a datalake fork. I think it's worth factoring out the metadata and tooling, as it's a bit more complex than normal REST API plumbing.
Ideally this library would be very lean and not do too much magic.
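
For reference, a minimal sketch of the kind of streaming-gzip override described above; the method name and chunk source are assumptions, and wbits=31 tells zlib to emit the gzip container format:

import zlib

def _compress(self, chunks):
    # Compress incrementally so the whole file never sits in memory;
    # wbits=31 selects gzip framing (header plus trailing CRC).
    compressor = zlib.compressobj(wbits=31)
    l = [compressor.compress(chunk) for chunk in chunks]
    l.append(compressor.flush())
    return b''.join(l)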

dtkav force-pushed the files_api branch 4 times, most recently from 36689b5 to c19b4db on March 19, 2019 06:42
dtkav changed the title from "wip: files endpoints" to "Add Datalake Files endpoints" on Mar 20, 2019
dtkav requested a review from Psykar on March 20, 2019 07:13