
Generic storage #1059

Open · davidfarkas wants to merge 12 commits into master from generic-storage
Conversation

davidfarkas (Contributor)

#969 Switch to PyFilesystem

Review Checklist

  • Tests were added to cover all code changes
  • Documentation was added / updated
  • Code and tests follow standards in CONTRIBUTING.md

api/resolver.py Outdated
@@ -1,4 +1,4 @@
"""
"""1
Contributor:

Looks like an accidental addition.

('acquisitions', 'files'),
('analyses', 'files'),
('sessions', 'files'),
('sessions', 'subject.files'),
nagem (Contributor) commented Jan 26, 2018:

To be clear, I don't think we have any actual users using subject files, but it is good to include them just in case. Subject files are an undocumented feature of the folder uploader.

file_id = f['fileinfo'].get('_id', '')
if file_id:
    file_path = util.path_from_uuid(file_id)
    if not config.fs.isfile(file_path):
Contributor:
I assume this is where files that were created after the conversion started get ignored? Also, does this mean the conversion script can run many times and will just be a no-op when no files need to be migrated?

davidfarkas (Contributor, Author):

Yes, exactly
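For reference, a minimal sketch of that re-runnable loop, assuming PyFilesystem's fs.copy helpers and a hypothetical legacy_path_for() that maps a file document to its path on the old storage:

from fs.copy import copy_file
from fs.path import dirname

from api import config, util


def migrate_files(files, legacy_path_for):
    """Idempotent migration pass (sketch): a file already present at its
    UUID-derived path on the new storage is skipped, so re-running the
    script is a no-op once everything has been moved."""
    for f in files:
        file_id = f['fileinfo'].get('_id', '')
        if not file_id:
            continue
        file_path = util.path_from_uuid(file_id)
        if config.fs.isfile(file_path):
            continue  # already migrated (or created post-cutover)
        config.fs.makedirs(dirname(file_path), recreate=True)
        copy_file(config.legacy_fs, legacy_path_for(f), config.fs, file_path)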

api/util.py Outdated


def path_from_uuid(uuid_):
    """
Contributor:
Comment below should be updated.
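For context, a sketch of the UUID-sharded layout this helper and its docstring describe; the exact two-level fan-out below is an assumption, not necessarily this PR's scheme:

def path_from_uuid(uuid_):
    """Map a file UUID to a sharded storage path, e.g.
    'ab/cd/abcd1234-...' (sketch: fanning out on the first four hex
    characters keeps any one directory from growing unboundedly)."""
    uuid_ = str(uuid_)
    return '/'.join((uuid_[0:2], uuid_[2:4], uuid_))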

nagem (Contributor) commented Jan 26, 2018:

I started adding comments to the PR as I went through it, even though work is still being done on the branch. Feel free to address anything I mentioned now, or wait until we move into a more "official" review state.

codecov-io commented Feb 2, 2018:

Codecov Report

Merging #1059 into master will increase coverage by 0.06%.
The diff coverage is 94.79%.

@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
+ Coverage   90.98%   91.05%   +0.06%     
==========================================
  Files          49       49              
  Lines        7036     7067      +31     
==========================================
+ Hits         6402     6435      +33     
+ Misses        634      632       -2

ambrussimon (Contributor) left a comment:

I only did a really quick first pass without many findings.
Please exclude the separated range-read diff so I can give it a more thorough second look.

api/config.py Outdated

# Storage configuration
fs = open_fs(__config['persistent']['fs_url'])
signed_url_available = ['GCSFS']
Contributor:

During the design phase we discussed that storages should be able to say about themselves whether they support signed URLs. It's not a hard requirement, especially if it isn't easy to solve, but I would prefer that to this config variable.
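A minimal sketch of the self-describing alternative; the mixin and attribute names are illustrative, not this PR's actual API:

from fs.osfs import OSFS


class SignedUrlMixin(object):
    """Storages that can mint signed URLs opt in via this flag and
    implement get_signed_url()."""
    supports_signed_urls = True


def signed_url_available(file_system):
    # Ask the storage itself instead of consulting a config whitelist.
    return getattr(file_system, 'supports_signed_urls', False)


assert signed_url_available(OSFS('/tmp')) is False  # plain OSFS: no signed URLs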

api/files.py Outdated

def get_fs_by_file_path(file_path):
    if config.support_legacy_fs:
        if config.legacy_fs.isfile(file_path):
Contributor:

Both ifs test config.legacy_fs.isfile(file_path)...?

davidfarkas (Contributor, Author):

Thanks, I forgot to remove that.
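Presumably the deduplicated helper ends up looking something like this (a sketch; the fall-through to the new storage is assumed):

def get_fs_by_file_path(file_path):
    """Return the filesystem a path lives on, preferring legacy storage
    while the migration is still in flight."""
    if config.support_legacy_fs and config.legacy_fs.isfile(file_path):
        return config.legacy_fs
    return config.fs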

else:
    self.response.headers['Content-Type'] = 'application/octet-stream'
    self.response.headers['Content-Disposition'] = 'attachment; filename="' + filename + '"'
range_header = self.request.headers.get('Range', '')
Contributor:

If it's not too much effort, could you please hide these changes from the PR by tracking range-reads as its target branch (at least until that's merged)?

# Update the file with the newly generated UUID
config.db[f['collection']].find_one_and_update(
    {'_id': f['collection_id'], f['prefix'] + '.name': f['fileinfo']['name']},
    {'$set': update_set}
Contributor:

We'll need to adjust this a bit to handle a file being replaced while the migration is underway. An example:

While iterating over all acquisition files, any outputs from gears added to new or existing acquisitions will be stored in the new storage format and location. The migration script in its current state handles that without issue. A "feature" of our current file placers is that they replace existing files in a container if the uploaded file has the same name. If that happens after the script has loaded all files from the acquisition collection, the migration script will move the old file from the old path to a new (and different) path and overwrite the information in the file object. This leaves the DB in a state where it looks like a dicom -> nifti gear has been rerun, generating a new file, but the nifti referenced in the DB will be the old nifti. The new nifti will still exist, but as an unreferenced object.

To solve this, I think we'll need to update the query portion of the find_one_and_update so it does not match if the object the file document points to has changed since the initial load. Maybe match on hash? We wouldn't want to use modified, because we don't care if someone updated info in the meantime. We could also make adjustments so that created is updated when files are replaced.

Note: if nothing was modified, the file was replaced or removed. We'll want to clean up the file object in the new path, which would otherwise go unreferenced.
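A sketch of that guarded update; matching on hash is the suggestion from this thread, and cleanup_file() is a hypothetical helper:

result = config.db[f['collection']].find_one_and_update(
    {
        '_id': f['collection_id'],
        f['prefix'] + '.name': f['fileinfo']['name'],
        # Guard: only match if the blob is still the one loaded initially.
        f['prefix'] + '.hash': f['fileinfo']['hash'],
    },
    {'$set': update_set},
)
if result is None:
    # The file was replaced or removed mid-migration; remove the now
    # unreferenced copy written to the new storage path.
    cleanup_file(util.path_from_uuid(f['fileinfo']['_id']))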

Contributor:

#878 is relevant when discussing a reset of created on replace.


def get_signed_url(file_path, file_system, filename=None):
    try:
        if hasattr(file_system, 'get_signed_url'):
nagem (Contributor) commented Feb 15, 2018:

I made a change here from checking whether config.fs has the method get_signed_url to checking whether file_system does. I believe that's what we'll want while supporting the legacy file system during the transition; let me know if that change is incorrect. @ryansanford was seeing this issue before the fix:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "./api/handlers/listhandler.py", line 482, in get
    signed_url = files.get_signed_url(file_path, file_system, filename=filename)
  File "./api/files.py", line 190, in get_signed_url
    return file_system.get_signed_url(file_path, filename=filename)
AttributeError: 'OSFS' object has no attribute 'get_signed_url'
 request_id=e5c40c00-1518721379

Contributor:

That was when attempting to access an unmigrated file while config.fs was set to a GCS file system.

davidfarkas (Contributor, Author):

Yep, that's correct. Thanks!
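Completed for illustration, the checked helper might look like this (a sketch; the None fallback for callers that stream the file directly is an assumption):

def get_signed_url(file_path, file_system, filename=None):
    """Return a signed URL if the given filesystem can mint one."""
    # Check the filesystem that actually holds the file, not config.fs:
    # during the transition an unmigrated file lives on the legacy OSFS,
    # which has no get_signed_url.
    if hasattr(file_system, 'get_signed_url'):
        return file_system.get_signed_url(file_path, filename=filename)
    return None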

@davidfarkas force-pushed the generic-storage branch 2 times, most recently from 1ea3629 to 389c02f (February 21, 2018 16:33)
api/files.py Outdated
self.file.write(line)
if hasattr(self, 'hasher'):
Contributor:

Were you able to track down how this can get into a state where self.hasher is not set at this point?

davidfarkas (Contributor, Author):

In the case of the metadata field we don't have a file name, so we don't create the file or the hasher. I'll add a comment to clarify this.
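For illustration, a self-contained sketch of why the guard is needed; the class shape and hash algorithm are assumptions, not this PR's code:

import hashlib


class HashingFileWriter(object):
    """File parts get a file and a hasher; the metadata field arrives
    without a file name, so neither is created for it."""

    def __init__(self, fs, filename=None):
        if filename is not None:
            self.file = fs.open(filename, 'wb')
            self.hasher = hashlib.sha384()

    def write(self, line):
        if hasattr(self, 'hasher'):
            self.file.write(line)
            self.hasher.update(line)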

api/config.py Outdated
# Storage configuration
fs = open_fs(__config['persistent']['fs_url'])
legacy_fs = open_fs('osfs://' + __config['persistent']['data_path'])
support_legacy_fs = True
Contributor:

Can this be toggled without a code change/redeploy?
I assume we want that; it's definitely desirable if not too hard.
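One way to make the flag config-driven rather than hard-coded (a sketch; the persistent-config key name is an assumption):

# api/config.py (sketch) -- read the flag from persistent config so it
# can be flipped per deployment without touching code
support_legacy_fs = __config['persistent'].get('support_legacy_fs', True)
if support_legacy_fs:
    legacy_fs = open_fs('osfs://' + __config['persistent']['data_path'])
else:
    legacy_fs = None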

info['comment'] = zf.comment
info['members'] = []
for zi in zf.infolist():
    m = {}
Contributor:

Why not

info['members'].append({
    'path': zi.filename,
    ...
})

instead?
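That is, building each member as a literal; a sketch with fields beyond path chosen from the standard zipfile.ZipInfo attributes:

import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    info = {
        'comment': zf.comment,
        'members': [
            {
                'path': zi.filename,
                'size': zi.file_size,
                'timestamp': zi.date_time,
            }
            for zi in zf.infolist()
        ],
    }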

api/placer.py Outdated
if field is not None and self.file_processor is not None:
    self.file_processor.store_temp_file(field.path, util.path_from_uuid(field.uuid))

# if field is not None:
Contributor:

Code comment leftovers.

total = len(paths)

# Write all files to zip
complete = 0
for path in paths:
    p = os.path.join(self.folder, path)
Contributor:

I suggest using relpath/abspath, or maybe full_path, but definitely something other than just p.

ambrussimon (Contributor):

Did a more thorough review. I left a couple of minor comments, and there's also the merge conflict, but LGTM!

nagem (Contributor) commented Mar 22, 2018:

My final testing of the migration script's edge cases (analysis inputs and files changing while the upgrade is taking place) passed as expected. Awesome work!!

After a rebase I'd consider this PR approved and ready to merge, but we should hold off on merging until file-browser goes in first. That will also be after the switch over to the new repo location.
