Add high-level metadata to promote data re-use #30

Open
adamltyson opened this issue Jul 10, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@adamltyson
Member

We are agnostic about what form experimental metadata takes. This helps adoption, as labs can save metadata however they like. However, would it be useful to come up with a standard way of recording high-level metadata (e.g. species, age, behavioural paradigm) in such a way that it can be searched and help promote data re-use?

My idea is something along these lines:

  • Researcher acquires a dataset and tags the dataset with specific metadata

    • Recorded from brain region A
    • Behavioural task B
    • N mice
    • Conditions X, Y, Z
    • etc etc
  • Other researchers do the same

  • In the future these can be catalogued and searched. Researchers could, in theory, search for datasets that answer a specific question, leading to:

    • Saving time
    • Cost savings
    • Reduced number of animals
    • Increased collaboration

I think it's a nice idea in theory, but harder to implement. Something would be better than nothing though, and it would really need to be part of the data acquisition (very few people will go back and tag historical datasets). It should be part of DataShuttle and any other tools we develop in this space (i.e. could the compression/analysis tools add tags to the metadata?).

@adamltyson adamltyson added the enhancement label Jul 10, 2023
@adamltyson
Member Author

The BIDS metadata is along the same lines, but I think we need something more flexible and broader in scope.

@adamltyson
Member Author

One approach that I've been thinking of is:

  • Adopt the very detailed AIND metadata schema for specific rig and experiment details
  • Add some simple NeuroBlueprint high-level metadata, including (sketched below):
    • Tags to aid re-use, e.g. optophysiology, PV, decision-making
    • A high-level narrative about the experiment, e.g. "This experiment sought to investigate X in Y using Z. It was published in journal ABC."
    • Lab notebook style info, e.g. "lights flickered at timepoint X"

In the future some tool could scrape NeuroBlueprint directories for this metadata and put it into a database of some sort. Researchers could then:

  • Use the tags to find potentially useful datasets
  • Use the narrative to confirm that it's relevant
  • Use the detailed metadata to enable re-use
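
For concreteness, a minimal sketch of what such a high-level metadata file could contain, written as Python producing plain JSON. All field names here are hypothetical illustrations, not part of NeuroBlueprint or any existing spec:

```python
import json

# Hypothetical high-level metadata for one dataset; every field name
# below is illustrative, not taken from an existing specification.
high_level_metadata = {
    "tags": ["optophysiology", "PV", "decision-making"],
    "narrative": (
        "This experiment sought to investigate X in Y using Z. "
        "It was published in journal ABC."
    ),
    "lab_notebook": [
        {"timepoint": "X", "note": "lights flickered"},
    ],
}

# A future tool could scrape these files from NeuroBlueprint directories
# into a searchable database.
with open("neuroblueprint_metadata.json", "w") as f:
    json.dump(high_level_metadata, f, indent=2)
```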

@adamltyson
Member Author

The more I think about this, the more I think we should promote (but not require) a single low-level metadata spec, like the AIND one. Another option is openMINDS.

I've been asked about metadata a lot, and it will be useful for us when building analysis pipelines.

@adamltyson
Member Author

openMINDS also has a brain atlas standard which could be used when linking with BrainGlobe tools.

@JoeZiminski
Member

JoeZiminski commented Jul 4, 2024

Below is a write-up of the three options for metadata format (BIDS, Allen, openMINDS) with some thoughts on how to proceed. I had a look for other standards but couldn't find many, though of course we should consider anything available, so let me know if you know of others. An alternative approach would be to use the NWB schema's .yaml files; they are really strong on the time-syncing side, but I think we'd have to extend them to our use case ourselves, so I will ignore this option for now.

Format

BIDS has two metadata formats, .tsv and .json. The .tsv format is used for creating tables with multiple entries; examples include lists of participants, samples, and trial information. Otherwise, .json is used for all other metadata that are key-value pairs, for example the dataset description and a lot of modality-specific metadata (e.g. behaviour or microscopy). In general the metadata key-value pairs are very well documented, and split by modality.
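
As a concrete illustration, here is a minimal Python sketch writing both BIDS formats. participant_id, Name and BIDSVersion are required by the BIDS spec; the other columns are just examples:

```python
import csv
import json

# Tabular metadata goes in .tsv, one row per entry (participants,
# samples, trials). participant_id is the one required column; the
# other columns here are illustrative.
with open("participants.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["participant_id", "species", "sex", "age"])
    writer.writerow(["sub-001", "Mus musculus", "F", "12"])

# Key-value metadata goes in .json, e.g. the dataset description,
# where Name and BIDSVersion are the required fields.
with open("dataset_description.json", "w") as f:
    json.dump({"Name": "My dataset", "BIDSVersion": "1.8.0"}, f, indent=2)
```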

The Allen uses .json for all metadata, and has a really extensive set of metadata entries specified. They provide a very nice dashboard for manual creation of metadata files. The key-value pairs in the schema itself are not particularly well documented; I am not sure what many of them refer to, for example in the 'session' metadata file. Reading their docs, it seems many of these fields are filled automatically during data acquisition / transfer by Allen infrastructure.

openMINDS uses a JSON-LD format (.jsonld). This allows linking across the metadata, which seems powerful but may introduce a lot of dependencies that could be hard to manage in our use case. The introduction guide has three possible syntaxes you can use; this allows flexibility but I found it difficult to follow. The metadata documentation is split by modality, and the individual required key-value pairs are pretty well documented.
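
For comparison, a rough sketch of the JSON-LD idea in Python. The `@type` IRIs below are indicative of the openMINDS style rather than copied from the spec, so treat the type names as assumptions:

```python
import json

# JSON-LD extends plain JSON with reserved keys: @id gives each record
# a unique identifier, @type says what kind of entity it is, and other
# fields can reference another record's @id, enabling cross-file linking.
subject = {
    "@id": "local-id-subject-001",  # identifier other records can point to
    "@type": "https://openminds.ebrains.eu/core/Subject",  # indicative, not verified
    "lookupLabel": "sub-001",
}
session = {
    "@id": "local-id-session-001",
    "@type": "https://openminds.ebrains.eu/core/ExperimentalSession",  # hypothetical type name
    "subject": {"@id": "local-id-subject-001"},  # link by reference, not by copy
}
with open("metadata.jsonld", "w") as f:
    json.dump([subject, session], f, indent=2)
```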

High-level organisation

BIDS

There is a high-level participants.tsv and dataset_description.json. Then, each data file can have a sidecar .json with associated metadata. For different modalities, the metadata to include in these sidecar .json files can be found here. I'm not sure if the information in these sidecar .json files has to be stored one-per-datafile, or if you can put it at a higher level (e.g. subject, session or experiment) when it is the same for all subjects, sessions, etc. Possibly, as they have a nice 'inheritance' principle: if, for example, some metadata is the same for 90/100 animals, you can put the common metadata at a high level and then overwrite specific fields lower down (in that example, 10 subjects would have subject-level metadata overwriting some fields of the project-wide metadata; see the sketch after the tree below). In general, BIDS metadata in a project would look like:

├─ raw/
│  ├─ CHANGES 
│  ├─ README 
│  ├─ channels.tsv 
│  ├─ dataset_description.json 
│  ├─ participants.tsv 
│  └─ sub-001/
│     └─ eeg/
│        ├─ sub-001_task-listening_events.tsv 
│        ├─ sub-001_task-listening_events.json 
│        ├─ sub-001_task-listening_eeg.edf 
│        └─ sub-001_task-listening_eeg.json 
└─ derivatives/
   ├─ descriptions.tsv 
   └─ sub-001/
      └─ eeg/
         ├─ sub-001_task-listening_desc-Filt_eeg.edf 
         ├─ sub-001_task-listening_desc-Filt_eeg.json 
         ├─ sub-001_task-listening_desc-FiltDs_eeg.edf 
         ├─ sub-001_task-listening_desc-FiltDs_eeg.json 
         ├─ sub-001_task-listening_desc-preproc_eeg.edf 
         └─ sub-001_task-listening_desc-preproc_eeg.json 
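
The inheritance principle could be emulated when reading: load the most general sidecar first, then overwrite with anything defined lower down. A minimal sketch, where the top-level sidecar path is hypothetical:

```python
import json
from pathlib import Path

def load_with_inheritance(paths):
    """Merge sidecar .json files from most general to most specific;
    later (more specific) files overwrite earlier (more general) keys."""
    merged = {}
    for path in paths:
        if Path(path).exists():
            merged.update(json.loads(Path(path).read_text()))
    return merged

# A project-wide eeg.json would apply to all subjects; the subject-level
# sidecar only needs to restate the fields that differ for sub-001.
metadata = load_with_inheritance([
    "raw/eeg.json",  # hypothetical top-level sidecar
    "raw/sub-001/eeg/sub-001_task-listening_eeg.json",  # from the tree above
])
```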

Allen

The Allen has 6 different metadata files:

  • Data description
  • Subject
  • Procedures
  • Instrument or Rig
  • Acquisition or Session
  • Processing

The way the metadata is organised was, for me, not very intuitive; it is not split by modality. The 'Acquisition' file only covers the imaging modality, and I think the other modalities go in 'Session', but I'm not entirely sure. A lot is stored in the 'Rig / Instrument' file, but if, say, I am acquiring behavioural or ephys data and I want to look up what fields I should include, it is harder to find. Most likely it is in the session metadata file and implicit based on your acquisition system? For me this is a drawback, and I think it reflects how the Allen collects data (on a large scale, with fairly standardised rigs).

They do not have an 'inheritance' principle, but the way they get around this is to put as much static information about the data collection as possible in the 'Instrument' file (see "What's the difference between Rig and Session?"). I think this will become a problem for labs with less standardised setups, leading to a lot of duplicated metadata.

For the Allen, I'm not sure exactly where these files are supposed to go in terms of folder structure; maybe this is something we would have to mandate. Most are fairly self-explanatory (e.g. subject in subjects, session in sessions).

openMINDS

I think, but am not 100% sure, that metadata is completely separated from the data. For example, see the folder structure here and the note on linking to data here. I think there are a lot of dependencies between the metadata files through linking; I can see the advantages of this approach, but I'm not sure how well it would translate to NeuroBlueprint.

Supported modalities

All have some form of 'subject' and 'dataset description'. There are divergences in what modalities they support.

BIDS currently supports MRI, PET, NIRS, EEG, behaviour, microscopy and motion; there is also an ephys BEP. Their behavioural metadata for trials is pretty straightforward, as is 'task events' (though this is more for fMRI). I'm not sure it has anything for timing and sync pulses; this may be awaiting the animal ephys BEP.

The Allen currently supports at least some microscopy modalities, plus events and timings. I think their events and timings data are stored in NWB and they just mandate the associated metadata; more on this in the section below. They must have support for ephys and behaviour, but it is folded into other sections and harder to find.

openMINDS has ephys and brain atlases, as well as experimental stimulus (e.g. for ephys) plus some others. I'm not sure about behavioural data (e.g. trial information); I couldn't see anything obvious. For the stimulus they have the metadata fields, but it's not clear to me how the timing and sync data are to be stored.

Tooling

I had a good look and couldn't find a comprehensive BIDS management tool for reading and writing metadata across all modalities. There is pybids, for example, but I believe it is more for neuroimaging. They have a lot of BIDS validators, but as far as I could tell that was it; nothing on the writing side, though I'm not 100% sure.
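
A small example of what pybids offers on the reading side, assuming pybids is installed and the dataset at the placeholder path is valid BIDS (and bearing in mind the neuroimaging focus noted above):

```python
from bids import BIDSLayout  # pip install pybids

# Index an existing BIDS dataset (path is a placeholder).
layout = BIDSLayout("/path/to/bids_dataset")

# Query files by their BIDS entities...
eeg_files = layout.get(subject="001", suffix="eeg", extension=".edf")

# ...and read the merged sidecar metadata for each file; pybids applies
# the BIDS inheritance principle when assembling this dictionary.
for f in eeg_files:
    print(layout.get_metadata(f.path))
```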

The Allen has a lot of really nice tooling: the dashboard for manually creating metadata, as well as a very promising-looking Python writer / reader. I don't think the entire spec (e.g. acquisition, session) is supported yet.

openMINDS does have some tooling (MATLAB and Python), but it is quite young; the example on the repo looks nice. I think this is promising, though at the moment their example seems a little verbose. I think a lot of that is just getting the data from a random format into openMINDS, and once you are in the openMINDS format it looks much simpler. I'm not sure how much of the spec is supported in the tooling; it is under active development.

Pros and Cons of each


BIDS

Pros

  • I think this is the easiest to get started with, and the most well documented and intuitive to use (although I am biased, as I'm already familiar with it). Everything is organised by modality.
  • It integrates nicely with the existing folder structure
  • It handles metadata 'inheritance' well

Cons

  • It is missing required modalities (e.g. animal ephys)
  • I do not think it currently has support for timings and sync pulses
  • I do not think it has general tooling past BIDS validators

Allen

Pros

  • It has a lot of nice tooling for manual and automatic creation of metadata files
  • As far as I can tell the timings and sync pulses are handled in a nice way
  • It has a LOT of fields; it is quite 'deep' (though not especially broad). In particular, for rig / equipment there are a lot of options.

Cons

  • The organisation of the metadata is not intuitive and not split by modality. I found it the most impenetrable of the three. For example, I am not sure where ephys- and behaviour-specific metadata are stored.
  • The key-value pairs are not well documented.
  • I think the above problems arise because the schema is strongly tied to the Allen's setup, and many of the fields are auto-filled by their systems. I'm not sure how user-friendly it will be outside of the Allen.

openMINDS

Pros

  • The spec itself is quite well documented and fairly intuitive, split by modality.
  • It has some promising looking tooling
  • It is fairly young; I think it will get better and better over the next few years.

Cons

  • I found the actual formatting (through the JSON-LD files) quite confusing. I could not easily follow the 'getting started' guide, and there are multiple ways of formatting the metadata; though this may be powerful, it might be difficult for us to support.
  • As far as I can tell it does not cover behaviour; it has 'stimulus' metadata for ephys.
  • I believe the metadata is stored completely separately from the data. I can see why this would be advantageous, but it doesn't work so well for NeuroBlueprint.

What is our use case? Can we support multiple formats?

For me, there is no clear 'winner' among the three metadata formats; each has its own area where it excels, but none provides exactly what we want across the board. I'm not sure any of the above are currently in a place where I'd feel happy strongly recommending that a researcher go and start using them, i.e. such that they could go off and use one without too much confusion and it would cover 90% of their use case.

Therefore I think it is worth looking at how we would need to interact with metadata, and whether we can avoid making a strong recommendation on a standard to use. In that case, researchers can use the one best suited to their needs. We could recommend the most appropriate for use in the SWC (though I'm not sure what this would be) and write a blog post covering the pros and cons of these existing metadata initiatives.

The downside of this is that we would probably need to interoperate with all recommended specs in our analysis packages / datashuttle. The below explores what our 'points of contact' with metadata would be.

Reading metadata

This is something we may have to do in various modality-specific analysis tools, in particular for timings and sync pulses; there is a section on that key case below. Otherwise, for ephys there is not a lot of acquisition-related metadata to read; anything pertinent (e.g. sampling rate) is already handled by spikeinterface or can be easily passed by the user. I'm not sure for behaviour? For microscopy, Adam suggested (orientation, voxel size, species, organ, imaging type). In general, I think the parameter sets we will need to read will be fairly minimal and not too painful to handle across three separate metadata schemas (it will basically be loading some .json files and mapping the names which refer to the same metadata across schemas).
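
A sketch of the kind of name-mapping this would involve. The key names below are illustrative placeholders, not the real fields from any of the three specs:

```python
import json

# Map each schema's key names onto one internal vocabulary.
# All external key names here are made up for illustration.
KEY_MAP = {
    "bids": {"PixelSize": "voxel_size", "SampleOrigin": "species"},
    "allen": {"pixel_size_um": "voxel_size", "species": "species"},
    "openminds": {"voxelSize": "voxel_size", "studiedSpecies": "species"},
}

def read_metadata(path, schema):
    """Load one metadata .json and normalise its keys to internal names."""
    with open(path) as f:
        raw = json.load(f)
    mapping = KEY_MAP[schema]
    return {mapping[k]: v for k, v in raw.items() if k in mapping}
```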

Writing metadata

This is a bit trickier, though I'm not sure how much we will need to do it. For datashuttle, we may want to write subject metadata and dataset descriptions. This would be a bit of a pain to support across three schemas, but again not awful: we would basically collate all the information we want to write, then map it to the keys of the metadata standards and write it out as JSON / JSON-LD. This would only have to be done once. It would be more of a problem if we decided we wanted to manage metadata more extensively, e.g. writing metadata from raw data files (microscopy, ephys, behaviour). I'm not sure we want to go down the route of full metadata management though; if we did, it would be a big initiative.
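
Writing would be the inverse of the reading sketch above: collate internal values once, then render them per schema. Again, all external key names are made up for illustration:

```python
import json

# Illustrative only: internal name -> each schema's key (made-up names).
WRITE_MAP = {
    "bids": {"voxel_size": "PixelSize", "species": "SampleOrigin"},
    "allen": {"voxel_size": "pixel_size_um", "species": "species"},
    "openminds": {"voxel_size": "voxelSize", "species": "studiedSpecies"},
}

def write_metadata(values, schema, path):
    """Render internally-held values in one schema's key vocabulary."""
    mapping = WRITE_MAP[schema]
    rendered = {mapping[k]: v for k, v in values.items() if k in mapping}
    with open(path, "w") as f:
        json.dump(rendered, f, indent=2)

# Collate once, then write in any supported schema.
values = {"voxel_size": [1.0, 1.0, 5.0], "species": "Mus musculus"}
write_metadata(values, "openminds", "subject_metadata.json")
```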

Events, timings, and syncing data

This is one area where we will have to interact in detail with whatever metadata standard we pick. I've tried to summarise as best as possible the requirements of each schema as I understand them. All three are of course metadata standards and, as far as I can tell, don't specify how to write the data itself. I think in general we can be quite flexible here (e.g. binary, numpy, csv); these are all 1D timeseries, so it's not going to be too complex.

BIDS

As far as I can tell, BIDS has no schema for systems-neuroscience-specific time data (e.g. sync pulses); this is due for inclusion in the animal ephys BEP. Closely related concepts are the task events and behaviour. I think this covers behavioural stimuli well but doesn't really address the time-syncing issue for systems neuroscience (?).

Allen

The Allen has what seems to be a well-developed method for handling events data, described here and in the spec here. It looks good, but honestly I would not really know how to go about using it. I guess you would write the various time data to disk and have each represented in the session.json file under one of the pydantic classes. This highlights an issue with the Allen schema: if your acquisition system does not fall under one of their pre-defined classes, you are kind of stuck.

openMINDS

openMINDS has a stimulation metadata section. I'm not sure how the data should actually be stored. It looks nice for ephys stimulus timings, but I couldn't find where behavioural trial data are supported.

Our own schema for this?

I think we need to survey what people are doing in the building, and possibly introduce our own schema for this, based on what people are already doing and as lightweight as possible. We should support the other metadata schemas where relevant, but I think all we really need are event timestamps mapped to each modality, plus a table of stimulus or behavioural trial information.
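
As a strawman, the lightweight version could be as small as per-modality timestamps plus a trial table. All field names here are entirely hypothetical:

```python
import json

# Entirely hypothetical strawman schema: per-modality event timestamps
# plus a trial table, stored as plain JSON next to the data it describes.
events = {
    "sync_pulses": {
        "ephys": {"unit": "s", "timestamps": [0.000, 1.001, 2.002]},
        "behaviour": {"unit": "s", "timestamps": [0.010, 1.012, 2.011]},
    },
    "trials": [
        {"trial": 1, "onset": 0.5, "stimulus": "A", "outcome": "correct"},
        {"trial": 2, "onset": 1.5, "stimulus": "B", "outcome": "incorrect"},
    ],
}
with open("sub-001_ses-001_events.json", "w") as f:
    json.dump(events, f, indent=2)
```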

My thoughts

Unfortunately I don't think any of the schemas fully meet all our (extensive, unrealistic?) requirements: a flexible, intuitive and very well documented metadata standard for systems neuroscience that incorporates all our modalities of interest.

So I guess my mild preference is to write a blog post on these metadata standards, covering their pros and cons and how you could get started with them in NeuroBlueprint. Then we say any of them is allowed, as long as you use it consistently.

This means we would also need to support all three in our tools. Because we are only ever planning on actually mandating a small subset of the most relevant metadata, this might be OK, as discussed above. Maybe in future, as these specifications progress and there is a clear 'winner', we can mandate a particular approach. But at present, I'm not sure any are 100% sufficient for all the use cases we will come across.

Before making any decision, we should definitely survey the building for: 1) what metadata people are currently collecting; 2) if they are, in what format; 3) how people are collecting timings and sync data; 4) what kind of metadata people would like to see standardised.
