
Releases: SciCrunch/sparc-curation

dataset-template-3.0.1

16 Sep 22:19
Pre-release

Changes

General view regularization for all files.

code_description.xlsx

  • regularize formatting
  • regularize casing for all header fields
  • correct TSR Column Type description
  • correct example column

dataset_description.xlsx

  • regularize formatting
  • regularize casing for the field now named Metadata version to sentence case, matching the other fields
  • add a suggested example to related identifiers for the DOI of this dataset
  • fix assorted spelling errors in the description column

dataset-template-3.0.0

12 Sep 20:19
Pre-release

You can see the full version of this changelog at https://github.com/SciCrunch/sparc-curation/blob/master/docs/sds-3-changelog.org

changelog 3.0.0

Overview

This changelog provides a summary of the changes in SDS 3.0.0. See commit message x for the full changelog, which includes a complete explication.

Validation changes added for SDS 3.0.0 datasets

These are not literal changes to the template itself, but validation changes that will be enforced for any dataset using template version 3.0.0 or greater.

  • samples and subjects cannot share the same pool-id (this remains confusing); a given pool-id can appear in only one of the samples or subjects files
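
    As a rough illustration, a validator could enforce this rule with a simple set intersection. A minimal sketch; the function name and inputs are hypothetical, not part of the standard:

    ```python
    def shared_pool_ids(subjects_pool_ids: set[str], samples_pool_ids: set[str]) -> set[str]:
        """Pool-ids listed in both the subjects and samples files violate the rule."""
        return subjects_pool_ids & samples_pool_ids

    # a non-empty result, e.g. {'pool-1'}, would be reported as an error
    print(shared_pool_ids({"pool-1", "pool-2"}, {"pool-1", "pool-3"}))
    ```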

  • Add naming restrictions for all paths: [A-Za-z0-9.,_ -].

    The following characters are no longer allowed in file names and are no longer allowed in folder names (they were already technically banned from folders mapping to SDS entities): @#$%^&*()+=/\|"'~;:<>{}[]? The only whitespace character allowed is space; however, the use of spaces in file and folder names is discouraged. Further, all forms of whitespace (including space) are banned from appearing at the start or end of file and folder names. All non-printing characters are banned. This leaves us with the following regex for allowed file and folder names: [A-Za-z0-9.,_ -]. Folders mapped to SDS entity ids must still follow the more restrictive rule [A-Za-z0-9-].

    By default SDS 3.0.0 explicitly excludes the larger Unicode general categories for letter and number. See https://en.wikipedia.org/wiki/Unicode_character_property#General_Category. See also Common Operators of Ordinary Math in https://lamport.azurewebsites.net/tla/future.pdf for an account of why ASCII remains a sound default for scientific use cases.

    An optional extension of the standard to support those categories could be implemented, in which case an explicit field must be provided in the dataset description file indicating that the dataset makes use of extended file naming rules. Such an extension is intended for internal use in organizations where non-ASCII file names are unavoidable due to the presence of existing processes. Platforms that use and exchange SDS formatted datasets publicly may always reject such datasets as not conforming to the requirements for public sharing and publication. If such an extension is implemented, NO OTHER Unicode general categories shall be allowed, that is, no mark, punctuation, symbol, separator, or other. The only extra characters allowed outside the letter and number categories shall be [.,_ -]. A metadata field with the name Extension unicode paths can be provided in the dataset description; the presence of any non-empty cell value for that field will be treated as enabling the extension.
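
    A minimal sketch of how the default naming rules might be checked per path component. The function and pattern names here are hypothetical, not part of the standard; the hyphen is placed last in each character class so that it is treated literally:

    ```python
    import re

    ALLOWED_NAME = re.compile(r"[A-Za-z0-9.,_ -]+")    # default file/folder names
    ENTITY_FOLDER_NAME = re.compile(r"[A-Za-z0-9-]+")  # folders mapped to SDS entity ids

    def name_conforms(name: str, is_entity_folder: bool = False) -> bool:
        """Check a single file or folder name against the rules above."""
        if name != name.strip():    # no whitespace at start or end
            return False
        if not name.isprintable():  # no non-printing characters
            return False
        pattern = ENTITY_FOLDER_NAME if is_entity_folder else ALLOWED_NAME
        return pattern.fullmatch(name) is not None

    assert name_conforms("sub-01", is_entity_folder=True)
    assert not name_conforms("fig@2x.png")   # @ is banned
    assert not name_conforms(" padded.txt")  # leading space
    ```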

  • Add advisory naming restrictions on the use of . for all paths.

    SDS validators shall warn in the following cases.

    • . appears in a directory name.
    • . appears at the start of a file name outside the top level and code.
    • . appears more than once in a file name.

    It is not easy to create a general rule that limits the usage of period (.) in paths without accidentally banning legitimate usage patterns. In general we discourage the use of warnings, since they are either ignored entirely or treated as errors; however, in this case we do not see an easy solution at this time.

    A slightly less restrictive approach could be implemented by maintaining a list of known valid multi-suffix formats, such as .tar.gz; however, the standard does not require this.
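
    For illustration, the advisory checks might look like the following sketch. The helper name is hypothetical, and "outside the top level and code" is approximated by a simple test on the path parts:

    ```python
    from pathlib import PurePosixPath

    def period_warnings(dataset_relative_path: str) -> list[str]:
        """Emit the advisory warnings described above for one path."""
        p = PurePosixPath(dataset_relative_path)
        warnings = []
        for part in p.parts[:-1]:
            if "." in part:
                warnings.append(f"'.' in directory name: {part}")
        top_level_or_code = len(p.parts) == 1 or p.parts[0] == "code"
        if p.name.startswith(".") and not top_level_or_code:
            warnings.append(f"leading '.' outside top level and code: {p.name}")
        if p.name.count(".") > 1:
            warnings.append(f"'.' appears more than once: {p.name}")
        return warnings

    print(period_warnings("derivative/v1.2/archive.tar.gz"))
    ```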

  • Additional restrictions related to banned whole file names follow the answers given in https://stackoverflow.com/q/1976007.

    CON|PRN|AUX|NUL|((COM|LPT)[0-9]). An additional (incomplete) list of banned file names includes Thumbs.db and .DS_Store.
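
    A sketch of a check against these names; the case-insensitive match and the handling of extensions follow the Windows behavior described in the linked answer, and the junk file list is deliberately incomplete:

    ```python
    import re

    # reserved device names are banned with or without an extension
    RESERVED = re.compile(r"(CON|PRN|AUX|NUL|(COM|LPT)[0-9])(\..*)?", re.IGNORECASE)
    JUNK_FILES = {"Thumbs.db", ".DS_Store"}

    def whole_name_banned(name: str) -> bool:
        return RESERVED.fullmatch(name) is not None or name in JUNK_FILES

    assert whole_name_banned("COM1")
    assert whole_name_banned("con.txt")
    assert whole_name_banned("Thumbs.db")
    ```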

  • File type restrictions are now enforced by modality when modality is provided.

    • modality to file type mapping

File system structure changes

  • Add .dss file. Data Structure Standard.

    Contents are (standard-abbrev standard-version), e.g. (SDS 3.0.0). The file should appear at the top level of the dataset and may also appear in other folders if they conform to a different data structure standard such as BIDS. If standard-abbrev does not match the current parent then validation will not be run using the parent validator. At this time standard-version is purely an informative field and carries no semantics of any kind for validation. The default contents of the file result in standard-version matching dataset-template-version, HOWEVER IT SHOULD NOT BE ASSUMED THAT THEY WILL ALWAYS BE THE SAME.

    In general the contents of this file will always be an s-expression that contains only atoms or nested s-expressions which themselves contain only atoms (i.e., no strings). The format is chosen to avoid the creation of a custom surface syntax for the .dss file. The index of known values for standard-abbrev is case-insensitive, so e.g. both SDS and sds refer to the same expanded data structure standard, to avoid collisions and confusion. At this time the only semantics for standard-abbrev are that a mismatch between case-insensitive standard-abbrev fields means that the subdirectory will not be validated using the parent validator; no central registry mapping abbrevs to specific data structure standard validators is required. This leaves room for cooperative development between standards in the future. I used (sds 3) as the default value to reinforce the note above.
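
    Because the contents are a restricted s-expression, reading the file requires no custom parser machinery. A minimal sketch; the helper is hypothetical and not part of any validator:

    ```python
    def read_dss(text: str):
        """Parse the restricted s-expression form described above:
        atoms and nested lists of atoms only, no strings."""
        tokens = text.replace("(", " ( ").replace(")", " ) ").split()

        def read(it):
            out = []
            for tok in it:
                if tok == "(":
                    out.append(read(it))
                elif tok == ")":
                    return out
                else:
                    out.append(tok)
            return out

        (expr,) = read(iter(tokens))
        abbrev, version = expr
        # standard-abbrev is indexed case-insensitively: SDS and sds match
        return abbrev.lower(), version

    print(read_dss("(sds 3)"))      # -> ('sds', '3')
    print(read_dss("(SDS 3.0.0)"))  # -> ('sds', '3.0.0')
    ```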

  • Add LICENSE file. This file is not required. If a data platform does not include functionality for specifying a license then this file can be used to provide the full text of a license. See also license-identifier added to dataset_description. Closes #109 license file.

  • Add file sites.{csv,tsv,json,xlsx} for metadata about sites. Examples of sites are electrode locations, or physical locations on subjects or samples that were not further derived, such as the left eye and right eye. Closes #86 sites file.

  • Add file curation.{csv,tsv,json,xlsx} for metadata from curation. This file is not required, and if provided by a data submitter it may be completely overwritten as part of curation, since it is designed to hold information from a controlled curation process that happens after submission. Closes #106 curation notes. Closes #103 ensure that submission metadata and organs sheet are in combo of dataset description and curation notes.

  • Delete file code_parameters.{csv,tsv,json,xlsx}. The functionality is now implemented in code_description.

  • Add folder aux to top level. This folder is not required and can be used to store auxiliary files that may be needed as part of a publication process to support the needs of a particular publication platform. The manifest in this folder can reference out to other folders, but no manifest from outside this folder may reference anything in the aux folder. This is because the aux folder may be removed for external publication and be visible only to internal systems. An example use case would be storing pre-computed thumbnails for video files. Closes #108 aux folder.

Changes from 2.1.0 to 3.0.0 for all

  • The first row and first column of all sheets are now frozen by default where relevant. closes #105 freeze first row and column
  • All entity metadata files now include a metadata-only column. closes #90 metadata-only column for all sds-entity metadata files

Changes from 2.1.0 to 3.0.0 for manifest

  • Add entity column. More granular variants of this column (specimen, subject, sample, site, and performance) may also be used, but are not included in the default template.

    These columns can be used to map individual files to an SDS entity, either instead of, or to enhance the granularity of, the mapping of files to SDS entities by their containing folders. Only the most granular mapping should be provided, since all entities should be upwardly contextualized (i.e., perf references sam, etc.); for example, mapping a file to a performance already implies the sample and subject that performance references.

  • Add data dictionary path column. Reference the relative path to the data dictionary used to validate this file.

  • Add also in dataset. Provide a dataset id where a copy of this file is also present.

  • Add also in dataset path. Provide the dataset relative path to the copy of this file in also in dataset. closes #97

  • Add data modality column. Allowed values are TBD. closes #99 manifest modality column

  • Add entity is transitive column. Mark an SDS entity id folder that has subfolders to indicate that those folders are about that entity and not any more granular entity, which exempts the subfolders from further entity checks. Default behavior is to warn for nested folders with no entity metadata and no modality. The validator will not warn if another entity folde...


dataset-template-2.1.0

28 Jan 02:52

Changes from 2.0.0 to 2.1.0 for dataset_description

  • Change Metadata Version from 2.0.0 -> 2.1.0

Changes from 2.0.0 to 2.1.0 for submission

  • Rename SPARC Award number -> Award number
  • Add Consortium data standard
  • Add Funding consortium

General notes
These changes are made so that we can decouple the funding consortium
(which might not exist for external datasets) from the consortium data standard,
which must always be present.

Bumping to 2.1.0: this is not a major release, but new rows are added and
one row is renamed, so the minor version is bumped.

Ideally the consortium data standard would be part of the dataset
description file, but for now we put it in submission to avoid churn.

0.0.1.dev5

14 Jan 01:17
move test_data to internal to avoid long tests

they do not test actual code and should be
moved to queries.org anyway

0.0.1.dev4

23 Dec 07:02

0.0.1.dev3

24 Jun 02:17
setup.py tweak version deps

dataset-template-2.0.0

25 Jun 05:36

Overview

The structure of the dataset_description file has been updated so that it is broken into five sections: basic, study, contributor, related identifiers, and participants. Fields that only accept a single value have had additional cells grayed out to indicate that additional values should not be provided.

Files for subjects and samples have been updated to disambiguate the referents of columns; metadata about subjects should NOT be included in the samples file. In addition, optional columns in subjects and samples have been aligned with openMinds, Dandi, NEMO, and BCDC. Other additional columns were selected based on the most common columns appearing in existing subjects and samples sheets.

Four new files have been added to hold metadata on resources, protocol performances, high level metadata for code submissions, and parameters needed to run the code.

Documentation for the new rules governing folder naming and valid folder nesting structures is forthcoming.

File system structure changes

  • Add file code_description.{csv,tsv,json,xlsx}.
  • Add file code_parameters.{csv,tsv,json,xlsx}.
  • Add file resources.{csv,tsv,json,xlsx}.
  • Add file performances.{csv,tsv,json,xlsx} to hold metadata about performances for perf- folders.
  • Remove the variant of manifest.xlsx containing pattern instead of filename.
  • Remove file plog.xlsx with note that it may reappear as performance_log.xlsx in a future version.
  • Remove all example folders and files. Examples of properly formatted datasets will be provided separately in the future.

Changes from 1.2.3 to 2.0.0 for dataset_description

  • Add gray coloring to mark cases where only a single value is allowed.
  • Remove the non-default width that was set on all columns, which caused errors when opening in LibreOffice Calc because 1024 columns were specified and Calc doesn't support that many.
  • Remove incorrect embedded authoring metadata.
  • Rename Metadata Version DO NOT REMOVE -> Metadata Version
  • Move Metadata Version to second row and mark it blue.
  • Add Type row which currently only accepts experimental or computational as a value. Set the default to experimental since most computational datasets are likely to generate their metadata.
  • Rename Name -> Title to make it consistent with internal naming.
  • Move Funding to follow Keywords.
  • Change Funding to be optional.
  • Rename Acknowledgements -> Acknowledgments.
  • Move Acknowledgments to follow Funding.
  • Change Acknowledgments to be optional.
  • Add Study purpose.
  • Add Study data collection.
  • Add Study primary conclusion.
  • Add Study organ system.
  • Add Study approach.
  • Add Study technique.
  • Rename Title for complete data set -> Study collection title.
  • Rename Contributors -> Contributor Name.
  • Rename Contributor ORCID ID -> Contributor ORCiD.
  • Change Contributor role value ContactPerson -> CorrespondingAuthor.
  • Change Contributor role cells for Value 1 and Value 2 to be PrincipalInvestigator and CorrespondingAuthor. Need to consider whether to make DataManager required as well. Will need to note to wranglers that the ordering of contributors is the order they will appear in on the dataset publication, so it is ok to move the PI and CA records to their appropriate place.
  • Delete Is contact person. Redundant with CorrespondingAuthor role.
  • Delete Originating Article DOI.
  • Delete Protocol URL or DOI.
  • Delete Additional Links.
  • Delete Link Description.
  • Add Identifier replacing Originating Article DOI, Protocol URL or DOI, and Additional Links.
  • Add Identifier type.
  • Change Identifier type cells for Value 1 and Value 2 to be HasProtocol and IsDescribedBy, replacing Protocol URL or DOI and Originating Article DOI.
  • Add Relation type matching the DataCite Relation type from this dataset to the related identifier.
  • Add Identifier description, replacing Link Description and ensuring that there is no ambiguity about which link the description applies to; in the previous template the description could refer to any of the originating article doi, the protocol doi, or the first additional link.
  • Delete Completeness of data set.
  • Delete Parent dataset ID. This is replaced by querying for datasets that use the same protocol. Other relations can be added via related identifiers.

Changes from 1.2.3 to 2.0.0 for subjects and samples

  • Remove description and example rows.
  • Rename subject_id to subject id.
  • Rename pool_id to pool id. The id columns are all normalized to the previous form internally, and the underscore is hidden by the cell outline when viewing the spreadsheet, leading to confusion.
  • Add member of for cases where we need to include a specimen in a population.
  • Add also in dataset for including the Pennsieve id(s) for other datasets that have data about the same specimen.
  • Delete Additional fields.
  • Add laboratory internal id to provide a mapping for groups that have incompatible internal identifier conventions.
  • Rename experimental log file name to experimental log file path.
  • Rename protocol.io location to protocol url or doi.

Changes from 1.2.3 to 2.0.0 for subjects

  • Rename experimental group to subject experimental group for clarity.
  • Add date of birth.
  • Add body mass.
  • Add phenotype.
  • Add disease or disorder.
  • Add disease model.

Changes from 1.2.3 to 2.0.0 for samples

  • Rename sample_id to sample id.
  • Move sample id to the first column and subject id to the second column to make it clear that sample ids are the primary key.
  • Rename wasDerivedFromSpecimen to was derived from.
  • Rename experimental group to sample experimental group for clarity.
  • Rename specimen type to sample type for consistency.
  • Rename specimen anatomical location to sample anatomical location for consistency.
  • Add date of derivation.
  • Add pathology.
  • Add laterality.
  • Add cell type.
  • Add plane of section.

Changes from 1.2.3 to 2.0.0 for submission

  • Rename SPARC Award number -> SPARC award number.
  • Change SPARC award number to accept EXTERNAL as a value. If EXTERNAL is provided then milestone data is ignored.

0.0.1.dev2

30 Jun 01:13
setup.py bump idlib dep

0.0.1.dev1

22 May 07:17
massive improvements in spc clone time, setup.py ver and dep bumps

A fresh pull of all 200 remote datasets now takes about 3 minutes.

NOTE: `spc pull` should NOT BE USED unless you know exactly what
you are doing. In the future this functionality will be restored
with better performance, but for now it is almost always faster
to delete the contents of the dataset folder and express ds.rchildren.

It only took me about 9 months to finally figure out that I had
actually fixed many of the pulling performance bottlenecks and that we
can almost entirely get rid of the current implementation of pull.

As it turns out, I got almost everything sorted out so that it is
possible to just call `list(dataset_cache.rchildren)` and the entire
tree will populate itself. When we fix the cache constructor
this becomes `[rc.materialize() for rc in d.rchildren]` or similar,
depending on exactly what we name that method. Better yet, if we do
it using a bare for loop then the memory overhead will be zero.

The other piece that makes this faster is the completed sparse pull
implementation. We now use the remote package count with a default
cutoff of 10k packages to cause a dataset to be sparse, namely that
only its metadata files and their parent directories are pulled. The
implementation of that is a bit slow, but still about 2 orders of
magnitude faster than the alternative. The approach for implementing
is_sparse also points the way toward being able to mark folders with
additional operational information, e.g. that they should not be
exported or that they should not be pulled at all.

Some tweaks to how spc rmeta works were also made so that existing
metadata will not be repulled in a bulk clone. This work also makes
the BlackfynnCache aware of the dataset metadata pulled from rmeta,
so we should be able to start comparing ttl file and bf:internal
metadata in the near future.

0.0.1.dev0

21 May 11:46
skip test_datase test versions if not in git