# Add high-level metadata to promote data re-use #30
The BIDS metadata is along the same lines, but I think we need something more flexible and broader in scope.
One approach that I've been thinking of is:
In the future some tool could scrape NeuroBlueprint directories for this metadata and put it into a database of some sort. Researchers could then:
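As a rough sketch of what such a scraping tool might look like — everything here (the flat JSON layout, file names and search API) is a hypothetical illustration, not an existing datashuttle or NeuroBlueprint feature:

```python
import json
from pathlib import Path

def scrape_metadata(root):
    """Walk a directory tree and collect every valid JSON metadata file
    into a flat index keyed by the dataset folder that contains it."""
    index = {}
    for meta_file in Path(root).rglob("*.json"):
        try:
            index[str(meta_file.parent)] = json.loads(meta_file.read_text())
        except json.JSONDecodeError:
            continue  # not valid JSON; skip rather than fail the whole scrape
    return index

def search(index, **criteria):
    """Return dataset paths whose metadata matches every key=value given."""
    return [
        path
        for path, meta in index.items()
        if all(meta.get(key) == value for key, value in criteria.items())
    ]
```

A query would then be as simple as `search(index, species="mouse")`, and the same index could later be loaded into a proper database.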
The more I think about this, the more I think we should promote (but not require) a single low-level metadata spec, like the AIND one. Another option is openMINDS. I've been asked about metadata a lot, and it will be useful for us when building analysis pipelines.
openMINDS also has a brain atlas standard, which could be used when linking with BrainGlobe tools.
Below is a write-up of the three options for metadata format (BIDS, Allen, openMINDS) with some thoughts on how to proceed. I had a look for other standards but couldn't find many others, though of course we should consider anything available, so let me know if you know of others. An alternative approach would be to use the NWB schema's …

### Format

BIDS has two metadata formats, either JSON or TSV. The Allen uses JSON. openMINDS uses a JSON-LD format.

### High-level organisation

**BIDS**

There is a high-level …
**Allen**

The Allen has 6 different metadata files:
The way the metadata is organised was not very intuitive to me; it is not split by modality. The 'Acquisition' file only covers the imaging modality, I think; the other modalities go in 'Session', but I'm not entirely sure. A lot is stored in the 'Rig / Instrument' file, but if you ask, say, 'I am collecting behavioural or ephys data and I want to look up what fields I should include', it is harder to find. Most likely it is in the session metadata file and implicit based on your acquisition system? For me this is a drawback, and I think it reflects how the Allen collects data (on a large scale, with fairly standardised rigs). They do not have an 'inheritance' principle, but they get around this by putting as much static information about the data collection as possible in the 'Instrument' file (see "What's the difference between Rig and Session?"). I think this will become a problem for many labs with less standardised setups, leading to a lot of duplicated metadata. For the Allen, I'm not sure exactly where these files are supposed to go in terms of a folder structure; maybe this is something we will have to mandate. Most are fairly self-explanatory (e.g. subject in subjects, session in sessions).

**openMINDS**

I think, but am not 100% sure, that metadata is completely separated from data; for example, see the folder structure here and the note on linking to data here. There are a lot of dependencies between the metadata files through linking. I can see the advantages of this approach, but I'm not sure how well it will translate to NeuroBlueprint.

### Supported modalities

All have some form of 'subject' and 'dataset description'. They diverge in which modalities they support.

BIDS currently supports: MRI, PET, NIRS, EEG, behaviour, microscopy and motion, and there is an ephys BEP. Its behavioural metadata for trials is pretty straightforward, as is 'task events' (though this is more for fMRI). I'm not sure it has anything for timing and sync pulses; this may be awaiting the animal ephys BEP.
Allen currently supports: for sure some microscopy modalities, plus events and timings. I think their events and timings data are stored in NWB and they just mandate the associated metadata; more on this in the section below. They must have support for ephys and behaviour, but it is folded into other sections and harder to find.

openMINDS has ephys and brain atlases, as well as experimental stimulus (e.g. for ephys), plus some others. I'm not sure about behaviour data (e.g. trial information); I couldn't see anything obvious. For the stimulus, they have the metadata fields, but it's not clear to me how the timing and sync data are to be stored.

### Tooling

I had a good look and couldn't find a comprehensive BIDS management tool for reading and writing metadata across all modalities. There is pybids, for example, but I believe it is more for neuroimaging. They have a lot of BIDS validators, but as far as I could tell that was it; nothing on the writing side, though I'm not 100% sure. Allen has a lot of really nice tooling: the dashboard for manually creating metadata, as well as a very promising-looking Python writer / reader. I don't think the entire spec (e.g. acquisition, session) is supported yet. openMINDS has some tooling (MATLAB and Python), but it is quite young; the example on the repo looks nice. I think this is promising, though at the moment their example seems a little verbose. A lot of that is just getting the data from a random format into openMINDS, and once you are in openMINDS format it looks much simpler. I'm not sure how much of the spec is supported in the tooling; it is under active development.

### Pros and Cons of each

**BIDS**

Pros:
Cons:

**Allen**

Pros:

Cons:

**openMINDS**

Pros:

Cons:
### What is our use case? Can we support multiple formats?

For me, there is no clear 'winner' among the three metadata formats. Each has an area where it excels, but none provides exactly what we want across the board. I'm not sure any of the above are currently in a place where I'd feel happy to strongly recommend that a researcher start using them, i.e. such that they could go off and use one without too much confusion and it would cover 90% of their use case. Therefore I think it is worth looking at how we would need to interact with metadata, and whether we can avoid making a strong recommendation on a standard to use. In that case, researchers can use the one best suited to their needs. We could recommend the most appropriate for use in the SWC (though I'm not sure what this would be) and write a blog covering the pros and cons of these existing metadata initiatives. The downside is that we would probably need to interoperate with all recommended specs in our analysis packages / datashuttle. The below explores what our 'points of contact' with metadata will be.

**Reading metadata**

This is something we may have to do in various modality-specific analysis tools, in particular for timings and sync pulses; I will cover that key case in a section below. Otherwise, for ephys, there is not a lot of acquisition-related metadata to read; anything pertinent (e.g. sampling rate) is already handled by spikeinterface or can be easily passed by the user. I'm not sure for behaviour? For microscopy, Adam suggested orientation, voxel size, species, organ and imaging type. In general, I think the parameter sets we will need to read will be fairly minimal and not too painful to handle across three separate metadata schemas (it will basically be loading some JSONs and mapping the names that refer to the same metadata across schemas).

**Writing metadata**

This is a bit trickier, though I'm not sure how much we will need to do it.
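To illustrate the 'load some JSONs and map the names' idea for reading across schemas, here is a minimal sketch. Note that every field name below is an invented placeholder, not the real BIDS / AIND / openMINDS key, which would need to be looked up in each spec:

```python
# All field names below are invented placeholders for illustration only --
# the real BIDS / AIND / openMINDS keys differ and live in each spec.
FIELD_MAP = {
    "bids": {"species": "Species", "sampling_rate": "SamplingFrequency"},
    "aind": {"species": "species", "sampling_rate": "sampling_rate_hz"},
    "openminds": {"species": "species", "sampling_rate": "samplingFrequency"},
}

def read_common_fields(metadata, schema):
    """Map one schema's metadata dict onto our internal field names,
    so downstream analysis code never sees schema-specific keys."""
    mapping = FIELD_MAP[schema]
    return {ours: metadata.get(theirs) for ours, theirs in mapping.items()}
```

The same mapping table could be inverted for writing, which is why supporting three schemas for a small field set may not be too painful.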
For datashuttle, we may want to write subject metadata and dataset descriptions. This would be a bit of a pain supporting three schemas, but again not awful: we would basically collate all the information we want to write, then map it to the keys of the metadata standards and write it in JSON / JSON-LD. This would only have to be done once. It would be more of a problem if we decided we wanted to manage metadata more extensively, e.g. writing metadata from raw data files (microscopy, ephys, behaviour). I'm not sure we want to go down the route of full metadata management though; if we did, it would be a big initiative.

### Events, time, syncing data

This is one area where we will have to interact in detail with whatever metadata standard we pick. I've tried to summarise the requirements for each schema as best I understand them. All three are of course metadata standards and, as far as I can tell, don't specify how to write the data itself. I think in general we can be quite flexible here (e.g. binary, numpy, csv, etc.); they are all 1D timeseries, so this is not going to be too complex.

**BIDS**

**Allen**

Allen has what seems to be a well-developed method for handling events data, described here and in the spec here. It looks good, but honestly I would not really know how to go about using it. I guess you write the various time data to disk and have each represented in the …

**openMINDS**

openMINDS has a stimulation metadata section. I'm not sure how the data should actually be stored. It looks nice for ephys stimulus timings, but I couldn't find where behavioural trial data is supported.

**Our own schema for this?**

I think we need to survey what people are doing in the building and possibly introduce our own schema for this, based on what people are already doing and as lightweight as possible. We should support the other metadata schemas where relevant, but I think all we really need are event timestamps mapped to each modality / a table of stimulus or behavioural event trial information.
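As a sketch of how lightweight such an events schema could be — the column names and CSV layout here are purely hypothetical, pending the survey of what people actually do:

```python
import csv

def write_events(path, events):
    """Write a minimal events table: one row per event, with an onset time
    in seconds, an event label and a trial number (columns hypothetical)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["onset_s", "event", "trial"])
        writer.writeheader()
        writer.writerows(events)

def read_events(path):
    """Read the events table back as a list of dicts (all values as strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

The point is that a plain table of timestamped, labelled events covers both stimulus timings and behavioural trials, and any of the three standards' metadata could sit alongside it.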
### My thoughts

Unfortunately, I don't think any of the schemas fully meet all our (extensive, unrealistic?) requirements: a flexible, intuitive and very well documented metadata standard for systems neuroscience that incorporates all our modalities of interest. So I guess my mild preference is to write a blog on these metadata standards, their pros and cons, and how you could get started with them in NeuroBlueprint. Then we say any of them are allowed, as long as you use them consistently. This means we would also need to support all three in our tools. Because we are only ever planning on actually mandating a small subset of the most relevant metadata, this might be OK, as discussed above. Maybe in future, as these specifications progress and there is a clear 'winner', we can actually mandate a particular approach. But at present, I'm not sure any are 100% sufficient for all the use cases we will come across. Before making any decision, we should definitely survey the building for: 1) what metadata people are currently collecting; 2) if they are, in what format; 3) how people are collecting timings and sync data; 4) what kind of metadata people would like to see standardised.
We are agnostic about what form experimental metadata takes. This is to help adoption as labs can save metadata however they like. However, would it be useful to come up with a standard way of recording high level metadata (e.g. species, age, behavioural paradigm etc) in such a way that it can be searched and help promote data re-use?
My idea is something along these lines:
Researcher acquires a dataset and tags the dataset with specific metadata
Other researchers do the same
In the future these can be catalogued and searched. Researchers could, in theory, search for datasets that could answer a specific question, leading to:
I think it's a nice idea in theory, but harder to implement. Something would be better than nothing though, and it would really need to be part of the data acquisition (very few people will go back and tag historical datasets). It should be part of DataShuttle and any other tools we develop in this space (i.e. could the compression/analysis tools add tags to the metadata?).
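A minimal sketch of what tag-at-acquisition support could look like in a datashuttle-style tool — the tag vocabulary and the sidecar file name are both hypothetical:

```python
import json
from pathlib import Path

# Hypothetical controlled vocabulary -- the real tag set would come out of
# surveying what researchers actually want to standardise.
ALLOWED_TAGS = {"species", "age", "sex", "behavioural_paradigm", "modalities"}

def tag_dataset(dataset_path, **tags):
    """Attach high-level tags to a dataset by writing (or updating) a small
    JSON sidecar at the dataset root, so later tools can catalogue it."""
    unknown = set(tags) - ALLOWED_TAGS
    if unknown:
        raise ValueError(f"Unrecognised tags: {sorted(unknown)}")
    tag_file = Path(dataset_path) / "dataset_tags.json"
    existing = json.loads(tag_file.read_text()) if tag_file.exists() else {}
    existing.update(tags)
    tag_file.write_text(json.dumps(existing, indent=2))
    return existing
```

Because the sidecar is updated rather than overwritten, other tools (e.g. compression or analysis pipelines) could append their own tags later without clobbering what was recorded at acquisition.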