
Generic storage #1059

Open · davidfarkas wants to merge 12 commits into master from generic-storage
Conversation

davidfarkas (Contributor)

#969 Switch to PyFilesystem

Review Checklist

  • Tests were added to cover all code changes
  • Documentation was added / updated
  • Code and tests follow standards in CONTRIBUTING.md

api/resolver.py Outdated
@@ -1,4 +1,4 @@
"""
"""1
Contributor:

Looks like an accidental addition.

('acquisitions', 'files'),
('analyses', 'files'),
('sessions', 'files'),
('sessions', 'subject.files'),
nagem (Contributor) commented Jan 26, 2018:

To be clear, I don't think we have any actual users using subject files, but it is good to include them just in case. Subject files are an undocumented feature of the folder uploader.

file_id = f['fileinfo'].get('_id', '')
if file_id:
    file_path = util.path_from_uuid(file_id)
    if not config.fs.isfile(file_path):
Contributor:
I assume this is where files that were created after the conversion started get ignored? Also, does this mean the conversion script can run many times and will just be a no-op when no files need to be migrated?

davidfarkas (Contributor, Author):

Yes, exactly
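For reference, a minimal sketch of that re-runnable loop, assuming PyFilesystem's fs.copy helpers and a hypothetical legacy_path_for() that maps a file document to its path on the old storage:

from fs.copy import copy_file
from fs.path import dirname

from api import config, util


def migrate_files(files, legacy_path_for):
    """Idempotent migration pass (sketch): a file already present at its
    UUID-derived path on the new storage is skipped, so re-running the
    script is a no-op once everything has been moved."""
    for f in files:
        file_id = f['fileinfo'].get('_id', '')
        if not file_id:
            continue
        file_path = util.path_from_uuid(file_id)
        if config.fs.isfile(file_path):
            continue  # already migrated (or created post-cutover)
        config.fs.makedirs(dirname(file_path), recreate=True)
        copy_file(config.legacy_fs, legacy_path_for(f), config.fs, file_path)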

api/util.py Outdated


def path_from_uuid(uuid_):
    """
Contributor:
Comment below should be updated.
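For context, a sketch of the UUID-sharded layout this helper and its docstring describe; the exact two-level fan-out below is an assumption, not necessarily this PR's scheme:

def path_from_uuid(uuid_):
    """Map a file UUID to a sharded storage path, e.g.
    'ab/cd/abcd1234-...' (sketch: fanning out on the first four hex
    characters keeps any one directory from growing unboundedly)."""
    uuid_ = str(uuid_)
    return '/'.join((uuid_[0:2], uuid_[2:4], uuid_))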

nagem (Contributor) commented Jan 26, 2018:

I started adding comments to the PR as I went through it, even though work is still being done on the branch. Feel free to address anything I mentioned now, or wait until we move into a more "official" review state.

codecov-io commented Feb 2, 2018:

Codecov Report

Merging #1059 into master will increase coverage by 0.06%.
The diff coverage is 94.79%.

@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
+ Coverage   90.98%   91.05%   +0.06%     
==========================================
  Files          49       49              
  Lines        7036     7067      +31     
==========================================
+ Hits         6402     6435      +33     
+ Misses        634      632       -2

ambrussimon (Contributor) left a comment:

I only did a really quick first pass without many findings.
Please exclude the separated range-read diff so I can give it a more thorough second look.

api/config.py Outdated

# Storage configuration
fs = open_fs(__config['persistent']['fs_url'])
signed_url_available = ['GCSFS']
Contributor:

During the design phase we discussed that storages should be able to say about themselves whether they support signed URLs. It's not a hard requirement, especially if it isn't easy to solve, but I would prefer that to this config variable.
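A minimal sketch of the self-describing alternative; the mixin and attribute names are illustrative, not this PR's actual API:

from fs.osfs import OSFS


class SignedUrlMixin(object):
    """Storages that can mint signed URLs opt in via this flag and
    implement get_signed_url()."""
    supports_signed_urls = True


def signed_url_available(file_system):
    # Ask the storage itself instead of consulting a config whitelist.
    return getattr(file_system, 'supports_signed_urls', False)


assert signed_url_available(OSFS('/tmp')) is False  # plain OSFS: no signed URLs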

api/files.py Outdated

def get_fs_by_file_path(file_path):
    if config.support_legacy_fs:
        if config.legacy_fs.isfile(file_path):
Contributor:

Both ifs test config.legacy_fs.isfile(file_path)...?

davidfarkas (Contributor, Author):

Thanks, I forgot to remove that.
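Presumably the deduplicated helper ends up looking something like this (a sketch; the fall-through to the new storage is assumed):

def get_fs_by_file_path(file_path):
    """Return the filesystem a path lives on, preferring legacy storage
    while the migration is still in flight."""
    if config.support_legacy_fs and config.legacy_fs.isfile(file_path):
        return config.legacy_fs
    return config.fs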

else:
    self.response.headers['Content-Type'] = 'application/octet-stream'
    self.response.headers['Content-Disposition'] = 'attachment; filename="' + filename + '"'
range_header = self.request.headers.get('Range', '')
Contributor:

If it's not too much effort, could you please hide these changes from the PR by tracking range-reads as its target branch (at least until that's merged)?

# Update the file with the newly generated UUID
config.db[f['collection']].find_one_and_update(
    {'_id': f['collection_id'], f['prefix'] + '.name': f['fileinfo']['name']},
    {'$set': update_set}
Contributor:

We'll need to adjust this a bit to handle a file being replaced while the migration is underway. An example:

While iterating over all acquisition files, any outputs from gears added to new or existing acquisitions will be stored in the new storage format and location. The migration script in its current state handles that without issue. A "feature" of our current file placers is that they replace existing files in a container if the uploaded file has the same name. If that happens after the script has loaded all files from the acquisition collection, the migration script will move the old file from the old path to a new (and different) path and overwrite the information in the file object. This leaves the DB in a state where it looks like a dicom -> nifti gear has been rerun, generating a new file, but the nifti referenced in the DB will be the old nifti. The new nifti will still exist, but as an unreferenced object.

To solve this, I think we'll need to update the query portion of the find_one_and_update so it does not match if the object the file document points to has changed since the initial load. Maybe match on hash? We wouldn't want to use modified, because we don't care if someone updated info in the meantime. We could also make adjustments so that created is updated when files are replaced.

Note: if nothing was modified, the file was replaced or removed. We'll want to clean up the file object in the new path, which would otherwise go unreferenced.
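A sketch of that guarded update; matching on hash is the suggestion from this thread, and cleanup_file() is a hypothetical helper:

result = config.db[f['collection']].find_one_and_update(
    {
        '_id': f['collection_id'],
        f['prefix'] + '.name': f['fileinfo']['name'],
        # Guard: only match if the blob is still the one loaded initially.
        f['prefix'] + '.hash': f['fileinfo']['hash'],
    },
    {'$set': update_set},
)
if result is None:
    # The file was replaced or removed mid-migration; remove the now
    # unreferenced copy written to the new storage path.
    cleanup_file(util.path_from_uuid(f['fileinfo']['_id']))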

Contributor:

#878 is relevant when discussing a reset of created on replace.


def get_signed_url(file_path, file_system, filename=None):
    try:
        if hasattr(file_system, 'get_signed_url'):
nagem (Contributor) commented Feb 15, 2018:

I made a change here from checking whether config.fs has the method get_signed_url to checking whether file_system does. I believe that's what we'll want while supporting the legacy file system during the transition; let me know if that change is incorrect. @ryansanford was seeing this issue before the fix:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "./api/handlers/listhandler.py", line 482, in get
    signed_url = files.get_signed_url(file_path, file_system, filename=filename)
  File "./api/files.py", line 190, in get_signed_url
    return file_system.get_signed_url(file_path, filename=filename)
AttributeError: 'OSFS' object has no attribute 'get_signed_url'
 request_id=e5c40c00-1518721379

Contributor:

That was when attempting to access an unmigrated file while config.fs was set to a GCS file system.

davidfarkas (Contributor, Author):

Yep, that's correct. Thanks!
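Completed for illustration, the checked helper might look like this (a sketch; the None fallback for callers that stream the file directly is an assumption):

def get_signed_url(file_path, file_system, filename=None):
    """Return a signed URL if the given filesystem can mint one."""
    # Check the filesystem that actually holds the file, not config.fs:
    # during the transition an unmigrated file lives on the legacy OSFS,
    # which has no get_signed_url.
    if hasattr(file_system, 'get_signed_url'):
        return file_system.get_signed_url(file_path, filename=filename)
    return None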

@davidfarkas force-pushed the generic-storage branch 2 times, most recently from 1ea3629 to 389c02f (February 21, 2018 16:33)
api/files.py Outdated
self.file.write(line)
if hasattr(self, 'hasher'):
Contributor:

Were you able to track down how this can get into a state where self.hasher is not set at this point?

davidfarkas (Contributor, Author):

In the case of the metadata field we don't have a file name, so we don't create the file or the hasher. I'll add a comment to clarify this.
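For illustration, a self-contained sketch of why the guard is needed; the class shape and hash algorithm are assumptions, not this PR's code:

import hashlib


class HashingFileWriter(object):
    """File parts get a file and a hasher; the metadata field arrives
    without a file name, so neither is created for it."""

    def __init__(self, fs, filename=None):
        if filename is not None:
            self.file = fs.open(filename, 'wb')
            self.hasher = hashlib.sha384()

    def write(self, line):
        if hasattr(self, 'hasher'):
            self.file.write(line)
            self.hasher.update(line)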

api/config.py Outdated
# Storage configuration
fs = open_fs(__config['persistent']['fs_url'])
legacy_fs = open_fs('osfs://' + __config['persistent']['data_path'])
support_legacy_fs = True
Contributor:

Can this be toggled without a code change/redeploy?
I assume we want that; it's definitely desirable if not too hard.
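One way to make the flag config-driven rather than hard-coded (a sketch; the persistent-config key name is an assumption):

# api/config.py (sketch) -- read the flag from persistent config so it
# can be flipped per deployment without touching code
support_legacy_fs = __config['persistent'].get('support_legacy_fs', True)
if support_legacy_fs:
    legacy_fs = open_fs('osfs://' + __config['persistent']['data_path'])
else:
    legacy_fs = None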

info['comment'] = zf.comment
info['members'] = []
for zi in zf.infolist():
    m = {}
Contributor:

Why not

info['members'].append({
    'path': zi.filename,
    ...
})

instead?
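That is, building each member as a literal; a sketch with fields beyond path chosen from the standard zipfile.ZipInfo attributes:

import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    info = {
        'comment': zf.comment,
        'members': [
            {
                'path': zi.filename,
                'size': zi.file_size,
                'timestamp': zi.date_time,
            }
            for zi in zf.infolist()
        ],
    }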

api/placer.py Outdated
if field is not None and self.file_processor is not None:
    self.file_processor.store_temp_file(field.path, util.path_from_uuid(field.uuid))

# if field is not None:
Contributor:

Code comment leftovers.

total = len(paths)

# Write all files to zip
complete = 0
for path in paths:
    p = os.path.join(self.folder, path)
Contributor:

I suggest using relpath/abspath, or maybe full_path, but definitely something other than just p.

ambrussimon (Contributor):

Did a more thorough review. I left a couple of minor comments, and there's also the merge conflict, but LGTM!

nagem (Contributor) commented Mar 22, 2018:

My final testing of the migration script's edge cases (analysis inputs and files changing while the upgrade is taking place) passed as expected. Awesome work!!

After a rebase I'd consider this PR approved and ready to merge, but we should hold off on merging until file-browser goes in first. That will also be after the switch over to the new repo location.
