ENH: Crawl dataset's metadata only once and before Nipype's workflow #1317
Conversation
Force-pushed 82597a5 to ef40015.
@effigies @mgxd it'd be nice to get feedback on this one - maybe not so much on the code itself (if time is tight), but on the idea of pushing all metadata crawling to just once at the beginning. This way:
WDYT? This PR is still marked as a draft and I'm testing locally -- but the base implementation should be solid enough to take a look at.
Yeah, no objections to the overall strategy.
So essentially this shifts to keeping all metadata in memory? What if the fields get pared down to only the ones relevant to the pipeline?
Force-pushed 2b6049f to 233d7bd.
Yes, that's correct. If memory becomes a concern (although the metadata's size should be negligible), you can set the fields to None and only load them when needed from the pickle file.
Not sure I understand -- do you mean further filtering the metadata within the new loop so only the relevant fields are kept?
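The set-to-None-and-reload idea above can be sketched as follows. This is a minimal illustration with a hypothetical `MetadataCache` helper -- it is not MRIQC's actual API, just one way to persist the crawled metadata to a pickle file and release the in-memory copy until it is next accessed:

```python
import pickle
from pathlib import Path


class MetadataCache:
    """Hypothetical helper: keep crawled metadata on disk, reload lazily."""

    def __init__(self, cache_path):
        self.cache_path = Path(cache_path)
        self._metadata = None

    def dump(self, metadata):
        # Persist the metadata and drop the in-memory reference
        with self.cache_path.open("wb") as f:
            pickle.dump(metadata, f)
        self._metadata = None

    @property
    def metadata(self):
        # Reload from the pickle file only on first access after a dump
        if self._metadata is None:
            with self.cache_path.open("rb") as f:
                self._metadata = pickle.load(f)
        return self._metadata
```

Whether the saving is worth the extra I/O depends on how large the aggregated metadata actually gets; as noted above, it is likely negligible.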
Force-pushed f5c2037 to 991eb50.
I realize I'm not familiar with what MRIQC actually needs the metadata for - is it using the information to calculate something, or just aggregating it into the report?
Force-pushed fe47aa2 to de9629d.
Aggregating it into the report. However, fMRIPrep does something similar, as it attaches all the metadata to the output. We could have some dictionary of relevant metadata, or leave it to the user to find non-critical, unmodified metadata (in fMRIPrep). MRIQC does filter some metadata when submitting to the web API, but I'm a bit sceptical that doing so would actually shave off a lot of memory.
Force-pushed da3b412 to cf1ea8f.
Aggregates dataset-wise operations that typically traverse the list of input files (datalad get, biggest-file-size extraction, and metadata extraction) into a single step.
Resolves #1316.
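The single-pass traversal the PR describes could look roughly like this. The function name and the sidecar-lookup convention are assumptions for illustration (real BIDS metadata resolution also honours inheritance, and the `datalad get` step is omitted); the point is that one loop covers what previously required several traversals:

```python
import json
from pathlib import Path


def crawl_dataset(bids_files):
    """Hypothetical single pass over the inputs: gather sidecar metadata
    and track the biggest file size in the same loop."""
    metadata = {}
    biggest_size = 0
    for path in bids_files:
        path = Path(path)
        # A datalad get call would go here for sparse checkouts (omitted)
        biggest_size = max(biggest_size, path.stat().st_size)
        # Simplified sidecar lookup: same stem, .json extension
        sidecar = path.with_suffix("").with_suffix(".json")
        if sidecar.exists():
            metadata[path.name] = json.loads(sidecar.read_text())
    return metadata, biggest_size
```

With this shape, the Nipype workflow receives the crawled metadata and the biggest file size as plain inputs, instead of re-traversing the dataset inside individual nodes.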