Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand relative mask_path and make paths valid for used OS (#3257) #3704

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

denkrau
Copy link

@denkrau denkrau commented Oct 20, 2023

What changes are proposed in this pull request?

When importing a dataset relative mask_path keys get expanded as well.
This even works with nested structures.
Additionally, filepath and mask_path path separators get converted based on the OS.

How is this patch tested? If it is not, please explain why.

Importing a dataset in Ubuntu (which was exported in Windows) with relative mask_path and check if path is absolute and valid now.

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Dennis Kraus added 2 commits October 19, 2023 15:11
Expands relative mask_path values when importing a dataset. These
keys are searched recursively to ensure compatibility with nested
structures. Previously only filepath keys got expanded.
When importing a dataset all path separators of keys `filepath` and
`mask_path` get replaced with the ones used by the OS.

Resolves: voxel51#3257
Copy link
Contributor

@brimoor brimoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @denkrau thanks for the PR!

Can you explain a bit more about your use case? The existing implementation of the FiftyOneDataset format was designed to handle relative -> absolute path conversions for you, but it sounds like it's not working in your case?

The example below demonstrates this working as expected:

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.labels as foul

dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    label_types="segmentations",
    max_samples=10,
)

# Create a `Segmentation` field with masks stored on disk 
foul.objects_to_segmentations(
    dataset,
    "ground_truth",
    "segmentations",
    output_dir="/tmp/segmentations",
    overwrite=True,
)

# Export in FiftyOneDataset format
dataset.export(
    export_dir="/tmp/fod",
    dataset_type=fo.types.FiftyOneDataset,
)

The directory structure of /tmp/fod is like this:

/tmp/fod/
    data/
        000000001000.jpg
        ...
    fields/
        segmentations/
            000000001000.png
            ...
    metadata.json
    samples.json

And here's how the samples.json stores the mask paths (as relative to this JSON file):

$ python -m json.tool /tmp/fod/samples.json

            ...
            "segmentations": {
                "_id": {
                    "$oid": "6532854e5f4d8589d27625e8"
                },
                "_cls": "Segmentation",
                "tags": [],
                "mask_path": "fields/segmentations/000000001000.png"
            }
        }
    ]
}

When I then import this dataset:

dataset2 = fo.Dataset.from_dir(
    dataset_dir="/tmp/fod",
    dataset_type=fo.types.FiftyOneDataset,
)

print(dataset2.first())

the absolute mask paths are correctly constructed:

    ...
    'segmentations': <Segmentation: {
        'id': '6532854e5f4d8589d27625df',
        'tags': [],
        'mask': None,
        'mask_path': '/tmp/fod/fields/segmentations/000000000139.png',
    }>,
}>

@@ -1848,6 +1848,7 @@ def _import_samples(self, dataset, dataset_dict, tags=None):
def _parse_sample(sd):
if not os.path.isabs(sd["filepath"]):
sd["filepath"] = os.path.join(rel_dir, sd["filepath"])
sd["filepath"] = fou.make_path_os_safe(sd["filepath"])
Copy link
Contributor

@brimoor brimoor Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually have a fiftyone.core.storage.normpath() utility that does this. Can you use that instead?

@@ -1856,6 +1857,10 @@ def _parse_sample(sd):
_parse_media_fields(sd, media_fields, rel_dir)

sd["_dataset_id"] = dataset_id

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _parse_media_fields() method call on line 1857 is intended to handle this for you. I'd like to understand a bit more about your use case to see why it's not working for you 🤔

@codecov
Copy link

codecov bot commented Oct 20, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (19f7c62) 16.24% compared to head (593a2b5) 16.24%.
Report is 6 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3704   +/-   ##
========================================
  Coverage    16.24%   16.24%           
========================================
  Files          640      640           
  Lines        74031    74031           
  Branches       988      988           
========================================
  Hits         12029    12029           
  Misses       62002    62002           
Flag Coverage Δ
app 16.24% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@denkrau
Copy link
Author

denkrau commented Oct 30, 2023

Hey @brimoor, thanks for the quick reply.

Indeed your sample is working, so this got me wondering.

tl;dr: It's about export_media parameter during export. When it is None or True as in your example everything works fine, otherwise it does not.

In my case I parse samples from (a network) drive and create a dataset. Data copying must be avoided, this means export_media=False if I understand correctly. I was able to reproduce both the error and success with a dummy dataset by setting that parameter accordingly. Somehow, when data is copied the new path of the segmentation mask gets correctly expanded.

But you can test it with your code example as well. Just set fo.config.dataset_zoo_dir = "/tmp/fiftyone_zoo" to have a common prefix, add rel_dir="/tmp" to the export and import call, and set export_media=False. This will give an unexpanded mask_path. See the following summary.

Keys With export_media=None or True With export_media=False and rel_dir="/tmp"
samples.json: filepath data/000000000139.jpg fiftyone_zoo/coco-2017/validation/data/000000000139.jpg
samples.json: mask_path fields/segmentations/000000000139.png segmentations/000000000139.png
File structure /tmp/fod/
  data/
    000000001000.jpg
    ...
  fields/
    segmentations/
        000000001000.png
       ...
  metadata.json
  samples.json
/tmp/fod/
  metadata.json
  samples.json
Imported dataset: filepath /tmp/fod/data/000000000139.jpg /tmp/fiftyone_zoo/coco-2017/validation/data/000000000139.jpg
Imported dataset: mask_path /tmp/fod/fields/segmentations/000000000139.png segmentations/000000000139.png

@brimoor
Copy link
Contributor

brimoor commented Oct 30, 2023

@denkrau your example with export_media=False and rel_dir="/tmp" is working as intended.

The rel_dir parameter is used to strip a common prefix from each filepath. When export_media=False, this results in exporting relative filepaths rather than absolute ones. When export_media=True, it causes the media to be exported in a tree structure that matches the relative paths (rather than just exporting all media in a flat directory based on the filenames of all media in the collection you're exporting).

Generally you wouldn't combine export_media=False with the rel_dir option.

@denkrau
Copy link
Author

denkrau commented Oct 30, 2023

Generally you wouldn't combine export_media=False with the rel_dir option.

Thanks for the clarification. So there is no way to avoid copying/restructuring and to have all relative paths expanded? Is such a feature even desired from the project's point of view? If not, could it be possible to use file links with export_media=True, toggled by an additional parameter?

I'm just trying to figure out how to proceed with this or another PR.

@brimoor
Copy link
Contributor

brimoor commented Oct 30, 2023

Can you clarify what you're trying to achieve that's not possible? I'm not sure 😅

@denkrau
Copy link
Author

denkrau commented Oct 31, 2023

Sure, thanks for your help, here we go.

There are several raw datasets available on a shared network drive. Multiple users access them. The mount point for each user might vary. Now for some raw datasets FiftyOne dataset versions shall be created (i.e. exported) additionally, so users can import them on demand. The raw datasets cannot be moved/restructured, because some users work with this structure. Also I would like to avoid copying raw datasets for disk space and maintaining reasons.
In conclusion I need no data restructuring, no data copying and correctly expanded relative paths when importing FiftyOne datasets.
Does this make it more clear?

In my understanding of FiftyOne so far with the insights you gave me, such a scenario is not possible to achieve. Because it would require to expand all relative paths from samples.json (including the mask paths), while keeping the original file structure. Right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants