Generic storage #1059
base: master

Conversation
api/resolver.py (outdated)

```diff
@@ -1,4 +1,4 @@
-"""
+"""1
```
Looks like an accidental addition.
bin/oneoffs/migrate_storage.py (outdated)

```python
('acquisitions', 'files'),
('analyses', 'files'),
('sessions', 'files'),
('sessions', 'subject.files'),
```
I don't think we have any actual users using subject files to be clear, but it is good to include them just in case. It is an undocumented feature of the folder uploader.
bin/oneoffs/migrate_storage.py (outdated)

```python
file_id = f['fileinfo'].get('_id', '')
if file_id:
    file_path = util.path_from_uuid(file_id)
    if not config.fs.isfile(file_path):
```
I assume this is where files that have been created since starting the conversion would be ignored? Also, does this mean the conversion script can run many times and will just be a no-op when no files need to be migrated?
Yes, exactly
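The idempotency discussed above can be sketched as a small helper. This is a hypothetical illustration (function and parameter names are assumptions, not the PR's code): a file that already exists at its UUID-derived path was written after the migration started, so a re-run of the script simply skips it.

```python
def needs_migration(fileinfo, path_from_uuid, isfile):
    """Return True if a file document still needs to be moved.

    Hypothetical sketch of the check above. `path_from_uuid` maps a
    file UUID to its path in the new storage; `isfile` probes the new
    storage backend. Files already present at their UUID path are
    skipped, which makes repeated runs of the script a no-op.
    """
    file_id = fileinfo.get('_id', '')
    if not file_id:
        return True  # legacy file: no UUID assigned yet, must migrate
    return not isfile(path_from_uuid(file_id))
```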
api/util.py (outdated)

```python
def path_from_uuid(uuid_):
    """
```
Comment below should be updated.

I started adding comments to the PR as I went through it even though work is still being done on the branch. Address anything I mentioned now, or wait until later when we move into a more "official" review state.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
+ Coverage   90.98%   91.05%    +0.06%
==========================================
  Files          49       49
  Lines        7036     7067       +31
==========================================
+ Hits         6402     6435       +33
+ Misses        634      632        -2
```
Force-pushed from cf1f1c5 to 3ee124e.
I only did a really quick first pass without many findings.
Please exclude the separated range-read diff, so I can give it a more thorough second look.
api/config.py (outdated)

```python
# Storage configuration
fs = open_fs(__config['persistent']['fs_url'])
signed_url_available = ['GCSFS']
```
During the design phase we discussed that storages should be able to tell about themselves, whether they support signed URLs or not. It's not a hard requirement, especially if it's not easy to solve, but I would prefer that to having this config variable.
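The capability-discovery idea described above could look something like this. A minimal sketch with hypothetical class names (the PR does not define these): each storage backend declares for itself whether it supports signed URLs, replacing the config-level whitelist `signed_url_available = ['GCSFS']`.

```python
class BaseStorage(object):
    # Each backend declares its own capability instead of the caller
    # consulting a config-level list of backend names.
    supports_signed_urls = False


class OSFSStorage(BaseStorage):
    # Plain local filesystem: cannot produce signed URLs.
    pass


class GCSFSStorage(BaseStorage):
    # Google Cloud Storage can sign URLs for direct client download.
    supports_signed_urls = True
```

Callers then branch on `storage.supports_signed_urls` without needing to know which backend they hold.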
api/files.py (outdated)

```python
def get_fs_by_file_path(file_path):
    if config.support_legacy_fs:
        if config.legacy_fs.isfile(file_path):
```
Both `if`s test `config.legacy_fs.isfile(file_path)` ...?
thanks, I forgot to remove that
api/handlers/listhandler.py (outdated)

```python
else:
    self.response.headers['Content-Type'] = 'application/octet-stream'
    self.response.headers['Content-Disposition'] = 'attachment; filename="' + filename + '"'
range_header = self.request.headers.get('Range', '')
```
If it's not too much effort, could you please hide these changes from the PR by tracking range-reads as its target branch (at least until that's merged)?
Force-pushed from 3ee124e to dca93f9.
bin/oneoffs/migrate_storage.py (outdated)

```python
# Update the file with the newly generated UUID
config.db[f['collection']].find_one_and_update(
    {'_id': f['collection_id'], f['prefix'] + '.name': f['fileinfo']['name']},
    {'$set': update_set}
```
We'll need to adjust this a bit to handle when a file is replaced while the migration is underway. An example:

While iterating over all acquisition files, any outputs from gears added to new or existing acquisitions will be stored in the new storage format and location. The migration script in its current state handles that without issue. A "feature" of our current file placers is that they replace existing files in a container if the uploaded file has the same name. If that happens after the load of all files from the acquisition collection, the migration script will move the old file from the old path to a new (and different) path and overwrite the information in the file object. This will leave the DB in a state where it looks like a dicom -> nifti gear had been rerun, generating a new file, but the nifti referenced in the DB will be the old nifti. The new nifti will still exist, but will be an unreferenced object.

To solve this, I think we'll need to update the query portion of the `find_one_and_update` to not match if the object the file document points to has changed since the initial load. Maybe matching on `hash`? We wouldn't want to use `modified`, because we don't care if someone updated `info` in the meantime. We could also make adjustments to update `created` when files are replaced.

Note: if the `modified_count` is 0, the file was replaced or removed. We'll want to clean up the file object in the new path that will go unreferenced.
#878 is relevant when discussing a reset of `created` on replace.
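The guarded update proposed above could be sketched as a filter builder. This is an illustration under assumptions (the `f` dict shape is taken from the snippet in this thread; the extra `hash` key is the suggestion, not the PR's code): including the hash loaded at scan time means the update matches nothing if the file was replaced in the meantime, and `modified_count == 0` signals that the freshly written object in the new path should be cleaned up.

```python
def build_migration_filter(f):
    """Build a find_one_and_update filter guarded against replacement.

    Sketch of the suggestion above: match on the file's hash captured
    at the initial load, not just its name, so a file replaced mid-
    migration (same name, new content) is not silently clobbered.
    """
    prefix = f['prefix']
    return {
        '_id': f['collection_id'],
        prefix + '.name': f['fileinfo']['name'],
        # Guard: no match if the file content changed since the scan.
        prefix + '.hash': f['fileinfo']['hash'],
    }
```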
api/files.py (outdated)

```python
def get_signed_url(file_path, file_system, filename=None):
    try:
        if hasattr(file_system, 'get_signed_url'):
```
I made a change here from checking if `config.fs` has the method `get_signed_url` to checking if `file_system` does. I believe that's what we'll want to do while supporting the legacy file system in the transition state; let me know if that change is incorrect. @ryansanford was seeing this issue before the fix:
```
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "./api/handlers/listhandler.py", line 482, in get
    signed_url = files.get_signed_url(file_path, file_system, filename=filename)
  File "./api/files.py", line 190, in get_signed_url
    return file_system.get_signed_url(file_path, filename=filename)
AttributeError: 'OSFS' object has no attribute 'get_signed_url'
request_id=e5c40c00-1518721379
```
That was when attempting to access an unmigrated file when config.fs was set to a gcs file system.
Yep, that's correct. Thanks!
Force-pushed from 1ea3629 to 389c02f.
Force-pushed from 3522c95 to d55d706.
api/files.py (outdated)

```python
self.file.write(line)
if hasattr(self, 'hasher'):
```
Were you able to track down how this can get into a state where `self.hasher` is not set at this point?
In the case of metadata we don't have a file name, so we don't create the file or the hasher. I'll add a comment to clarify this.
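The situation described above can be sketched like this. A hypothetical illustration (class and parameter names are assumptions): a multipart metadata field arrives without a filename, so neither the output file nor the hasher is created, and `write()` must tolerate the missing attribute.

```python
import hashlib


class FileChunkWriter(object):
    """Sketch of the guard discussed above.

    Metadata fields have no filename, so no output file or hasher is
    created for them; write() therefore checks for the attribute
    before hashing.
    """

    def __init__(self, filename=None, out=None):
        self.file = out
        if filename:
            # Only real file uploads get a hasher.
            self.hasher = hashlib.sha384()

    def write(self, line):
        if self.file is not None:
            self.file.write(line)
        if hasattr(self, 'hasher'):
            self.hasher.update(line)
```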
api/config.py (outdated)

```python
# Storage configuration
fs = open_fs(__config['persistent']['fs_url'])
legacy_fs = open_fs('osfs://' + __config['persistent']['data_path'])
support_legacy_fs = True
```
Can this be toggled without a code change/redeploy? I assumed we'd want that, and it's definitely desirable if not too hard.
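One way to make the flag runtime-configurable, as requested above, is a minimal sketch like the following. The config key and environment variable name are assumptions for illustration, not the project's actual schema.

```python
import os


def support_legacy_fs_enabled(persistent_cfg):
    """Read the legacy-fs toggle from config with an env override.

    Sketch only: the key 'support_legacy_fs' and the env var name are
    hypothetical. Reading it at startup from config/environment means
    the flag can be flipped without a code change or redeploy.
    """
    raw = os.environ.get(
        'SCITRAN_SUPPORT_LEGACY_FS',
        str(persistent_cfg.get('support_legacy_fs', True)),
    )
    return str(raw).lower() in ('1', 'true', 'yes')
```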
api/handlers/listhandler.py (outdated)

```python
info['comment'] = zf.comment
info['members'] = []
for zi in zf.infolist():
    m = {}
```
Why not

```python
info['members'].append({
    'path': zi.filename,
    ...
})
```

instead?
api/placer.py (outdated)

```python
if field is not None and self.file_processor is not None:
    self.file_processor.store_temp_file(field.path, util.path_from_uuid(field.uuid))

# if field is not None:
```
Code comment leftovers.
```python
total = len(paths)

# Write all files to zip
complete = 0
for path in paths:
    p = os.path.join(self.folder, path)
```
I suggest using `relpath`/`abspath`, or maybe `full_path`, but definitely something different to just `p`.
Did a more thorough review. I left a couple of minor comments, and there's also the merge conflict, but LGTM!
Commits (messages truncated):
- …work well without CAS
- …pdate the remove-cas script to show progress of the rollback
- … files duplication
Force-pushed from f1e1e0f to ddc0232.
My final testing of the migration script edge cases (analysis inputs, and files changing while the upgrade is taking place) passed as expected. Awesome work!! After a rebase I'd consider this PR approved and ready to merge, but we should hold off on merging until #969 (Switch to PyFilesystem).