Releases: SciCrunch/sparc-curation
dataset-template-3.0.1
Changes.
General view regularization for all files.
code_description.xlsx
- regularize formatting
- regularize casing for all header fields
- correct `TSR Column Type` description
- correct example column
dataset_description.xlsx
- regularize formatting
- regularize casing for the now named `Metadata version` to match other fields as sentence case
- add a suggested example to related identifiers for the DOI for this dataset
- a spelling fix here and there in the description column
dataset-template-3.0.0
You can see full version of this changelog at https://github.com/SciCrunch/sparc-curation/blob/master/docs/sds-3-changelog.org
- changelog 3.0.0
- Overview
- Validation changes added for SDS 3.0.0 datasets
- File system structure changes
- Changes from 2.1.0 to 3.0.0 for all
- Changes from 2.1.0 to 3.0.0 for manifest
- Changes from 2.1.0 to 3.0.0 for submission
- Changes from 2.1.0 to 3.0.0 for dataset_description
- Changes from 2.1.0 to 3.0.0 for subjects
- Changes from 2.1.0 to 3.0.0 for samples
- Changes from 2.1.0 to 3.0.0 for code_description
- Changes from 2.1.0 to 3.0.0 for resources
- Changes from 2.1.0 to 3.0.0 for performances
- Changes from 2.1.0 to 3.0.0 for sites
- Changes from 2.1.0 to 3.0.0 for curation
changelog 3.0.0
Overview
This change log provides a summary of the changes in SDS 3.0.0. See commit message x for the full change log that includes a full explication.
Validation changes added for SDS 3.0.0 datasets
These are not literal changes to the template itself, but validation changes that will be enforced for any datasets using template versions greater than 3.0.0.
- samples and subjects cannot share the same pool-id (this remains confusing); a given pool-id can only appear in one of the samples or subjects files
- Add naming restrictions for all paths: `[0-9A-Za-z,.-_ ]`. The following characters are no longer allowed in file names and no longer allowed in folder names (they were already technically banned from folders mapping to SDS entities): `@#$%^&*()+=/\|"'~;:<>{}[]?`. The only whitespace character allowed is space, however the use of spaces in file and folder names is discouraged. Further, all forms of whitespace (including space) are banned from appearing at the start or end of file and folder names. All non-printing characters are banned. This leaves us with the following regex for allowed file and folder names: `[A-Za-z0-9.,-_ ]`. Folders mapped to SDS entity ids must still follow the more restrictive rule `[A-Za-z0-9-]`.
  By default SDS 3.0.0 explicitly excludes the larger unicode categories for letter and number. See https://en.wikipedia.org/wiki/Unicode_character_property#General_Category. See also https://lamport.azurewebsites.net/tla/future.pdf (Common Operators of Ordinary Math) for an account of why ascii remains a sound default for scientific use cases. An optional extension of the standard to support those categories could be implemented, in which case an explicit field must be provided in the dataset description file indicating that the dataset makes use of extended file naming rules. Such an extension is intended for internal use in organizations where non-ascii file names are unavoidable due to the presence of existing processes. Platforms that use and exchange SDS formatted datasets publicly may always reject such datasets as not conforming to the requirements for public sharing and publication. If such an extension is implemented NO OTHER unicode general categories shall be allowed, that is no mark, punctuation, symbol, separator, or other. The only extra characters allowed outside the letter and number categories shall be `[.,-_ ]`. A metadata field with the name `Extension unicode paths` can be provided in the dataset description, and the presence of any non-empty cell value for that field will be considered to enable it.
- Add advisory naming restrictions on the use of `.` for all paths. SDS validators shall warn in the following cases:
  - `.` appears in a directory name.
  - `.` appears at the start of a file name outside the top level and `code`.
  - `.` appears more than once in a file name.
  It is not easy to create a general rule that limits the usage of period (`.`) in paths without accidentally banning legitimate usage patterns. In general we discourage the use of warnings since they are either ignored entirely or treated as errors, however in this case we do not see an easy solution at this time. A slightly less restrictive approach could be implemented by maintaining a list of known valid multi-suffix formats, such as `.tar.gz`, however the standard does not require this.
- Additional restrictions related to banned whole file names follow the answers given in https://stackoverflow.com/q/1976007: `CON|PRN|AUX|NUL|((COM|LPT)[0-9])`. An additional (incomplete) list of banned file names includes `Thumbs.db|.DS_Store`.
- File type restrictions are now enforced by modality when modality is provided.
  - modality to file type mapping
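The path rules above can be sketched as a small validator. This is a hedged illustration, not the project's implementation: the function names are hypothetical, the character classes are copied verbatim from the text above, and treating the text before the first `.` as the reserved-name candidate is an assumption of this sketch.

```python
import re

# Character classes copied verbatim from the changelog text above.
ALLOWED = re.compile(r"^[A-Za-z0-9.,\-_ ]+$")     # files and folders
ENTITY_ALLOWED = re.compile(r"^[A-Za-z0-9\-]+$")  # folders mapped to SDS entity ids

# Whole-name bans per https://stackoverflow.com/q/1976007 plus the
# (incomplete) extra list from the changelog.
RESERVED = re.compile(r"^(CON|PRN|AUX|NUL|(COM|LPT)[0-9])$", re.IGNORECASE)
EXTRA_BANNED = {"Thumbs.db", ".DS_Store"}

def name_problems(name: str, is_entity_folder: bool = False) -> list:
    """Return hard rule violations (errors) for a single file/folder name."""
    problems = []
    if name != name.strip():
        problems.append("leading/trailing whitespace")
    if not ALLOWED.match(name):
        problems.append("character outside [A-Za-z0-9.,-_ ]")
    # Assumption: reserved device names are banned regardless of extension.
    if RESERVED.match(name.split(".")[0]) or name in EXTRA_BANNED:
        problems.append("banned whole file name")
    if is_entity_folder and not ENTITY_ALLOWED.match(name):
        problems.append("entity folder must match [A-Za-z0-9-]")
    return problems

def name_warnings(name: str, is_dir: bool, at_top_or_code: bool) -> list:
    """Return the advisory period warnings listed above."""
    warns = []
    if is_dir and "." in name:
        warns.append("period in directory name")
    if not is_dir:
        if name.startswith(".") and not at_top_or_code:
            warns.append("leading period outside top level and code")
        if name.count(".") > 1:
            warns.append("more than one period in file name")
    return warns
```

Note that the advisory checks are kept separate from the hard errors, mirroring the distinction the text draws between banned characters and discouraged-but-legal uses of `.`.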
File system structure changes
- Add `.dss` file (Data Structure Standard). Contents are `(standard-abbrev standard-version)`, e.g. `(SDS 3.0.0)`. Should appear at the top level of the dataset and may also appear in other folders if they conform to a different data structure standard such as BIDS. If `standard-abbrev` does not match the current parent then validation will not be run using the parent validator. At this time `standard-version` is purely an informative field and carries no semantics of any kind for validation. The default contents of the file result in `standard-version` matching `dataset-template-version`, HOWEVER IT SHOULD NOT BE ASSUMED THAT THEY WILL ALWAYS BE THE SAME. In general the contents of this file will always be an s-expression that contains only atoms or nested s-expressions which themselves contain only atoms (i.e., no strings). The format is chosen to avoid the creation of a custom surface syntax for the `.dss` file. The index of known values for `standard-abbrev` is case-insensitive, so e.g. both `SDS` and `sds` refer to the same expanded data structure standard, to avoid collisions and confusion. At this time the only semantics for `standard-abbrev` are that a mismatch between case-insensitive `standard-abbrev` fields means that the subdirectory will not be validated using the parent validator; no central registry mapping abbrevs to specific data structure standard validators is required. This leaves room for cooperative development between standards in the future. I used `(sds 3)` as the default value to reinforce the note above.
- Add `LICENSE` file. This file is not required. If a data platform does not include functionality for specifying a license then this file can be used to provide the full text of a license. See also `license-identifier` added to `dataset_description`. closes #109 license file
- Add file `sites.{csv,tsv,json,xlsx}` for metadata about sites. Examples of sites are electrode locations, or physical locations on subjects or samples that were not further derived, such as left eye and right eye. closes #86 sites file
- Add file `curation.{csv,tsv,json,xlsx}` for metadata from curation. This file is not required, and if provided by a data submitter it may be completely overwritten as part of curation, since it is designed to hold information from a controlled curation process that happens after submission. closes #106 curation notes closes #103 ensure that submission metadata and organs sheet are in combo of dataset description and curation notes
- Delete file `code_parameters.{csv,tsv,json,xlsx}`. The functionality is now implemented in `code_description`.
- Add folder `aux` to the top level. This folder is not required and can be used to store auxiliary files that may be needed as part of a publication process to support the needs of a particular publication platform. The manifest in this folder can reference out to other folders, but no manifest from outside this folder may reference anything in the `aux` folder. This is because the `aux` folder may be removed for external publication and only be visible to internal systems. An example use case would be storing pre-computed thumbnails for video files. closes #108 aux folder
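Because the `.dss` file is an atoms-only s-expression, a consumer needs only a few lines to read it. The following is a minimal sketch under the format described above; `parse_dss` and `same_standard` are hypothetical names, not part of any published SDS tooling.

```python
import re

def parse_dss(text: str):
    """Parse an atoms-only s-expression, e.g. "(SDS 3.0.0)" or "(sds 3)".
    Returns nested Python lists of string atoms (no strings in the syntax,
    per the spec text above)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)

    def read(pos):
        if tokens[pos] == "(":
            out, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                out.append(item)
            return out, pos + 1
        return tokens[pos], pos + 1

    expr, _ = read(0)
    return expr

def same_standard(dss_expr, parent_abbrev: str) -> bool:
    """standard-abbrev comparison is case-insensitive; on a mismatch the
    subdirectory is not validated with the parent validator."""
    return dss_expr[0].lower() == parent_abbrev.lower()
```

Since `standard-version` carries no validation semantics, the sketch only compares the abbreviation, which is all the spec currently requires.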
Changes from 2.1.0 to 3.0.0 for all
- The first row and first column of all sheets are now frozen by default where relevant. closes #105 freeze first row and column
- All entity metadata files now include a `metadata-only` column. closes #90 metadata-only column for all sds-entity metadata files
Changes from 2.1.0 to 3.0.0 for manifest
- Add `entity` column. More granular variants of this column may also be used, but are not included in the default template: `specimen`, `subject`, `sample`, `site`, and `performance`. These columns can be used to map individual files to an SDS entity; this can be used instead of, or as a way to enhance the granularity of, the mapping of files to SDS entities by their containing folders. Only the most granular mapping should be provided since all entities should be upwardly contextualized (i.e., perf references sam, etc.).
- Add `data dictionary path` column. Reference the relative path to the data dictionary used to validate this file.
- Add `also in dataset` column. Provide a dataset id where a copy of this file is also present.
- Add `also in dataset path` column. Provide the dataset relative path to the copy of this file in `also in dataset`. closes #97
- Add `data modality` column. Allowed values are TBD. closes #99 manifest modality column
- Add `entity is transitive` column. Mark that an SDS entity id folder has subfolders to indicate that those folders are about that entity and not any more granular entity, which prevents a check on subfolders. Default behavior is to warn for nested folders with no entity metadata and no modality. The validator will not warn if another entity folde...
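The "only the most granular mapping" rule for the entity columns lends itself to a simple per-row check. This sketch assumes a manifest row represented as a dict of column name to cell value; the column names are the ones listed above, while the function and representation are illustrative only.

```python
# Entity-mapping columns from the manifest changes above, from least to
# most granular variants; only one should carry a value per row.
ENTITY_COLUMNS = ["entity", "specimen", "subject", "sample", "site", "performance"]

def entity_mapping_ok(row: dict) -> bool:
    """True when at most one entity-mapping column is populated in this row."""
    filled = [c for c in ENTITY_COLUMNS if row.get(c, "").strip()]
    return len(filled) <= 1
```

A fuller validator would also cross-check the row against the entity implied by the containing folder, but that context is not available at the single-row level.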
dataset-template-2.1.0
Changes from 2.0.0 to 2.1.0 for dataset_description
- Change `Metadata Version` from `2.0.0` -> `2.1.0`
Changes from 2.0.0 to 2.1.0 for submission
- Rename `SPARC Award number` -> `Award number`
- Add `Consortium data standard`
- Add `Funding consortium`
General notes
These changes are made so that we can decouple the funding consortium
(which might not exist for external datasets) from the consortium data standard,
which must always be present.
Bumping to 2.1.0: not a major release, but new rows are added and one row is
renamed, so the minor version is bumped.
Ideally the consortium data standard would be part of the dataset
description file, but for now we put it in submission to avoid churn.
0.0.1.dev5
move test_data to internal to avoid long tests; they do not test actual code and should be moved to queries.org anyway
0.0.1.dev4
Full Changelog: 0.0.1.dev3...0.0.1.dev4
0.0.1.dev3
setup.py tweak version deps
dataset-template-2.0.0
Overview
The structure of the dataset_description file has been updated so that it is broken into five sections: basic, study, contributor, related identifiers, and participants. Fields that only accept a single value have had additional cells grayed out to indicate that additional values should not be provided.
Files for subjects and samples have been updated to disambiguate the referents of columns; metadata about subjects should NOT be included in the samples file. In addition, optional columns for subjects and samples were aligned with openMinds, Dandi, NEMO, and BCDC. Other additional columns were selected based on the most common columns appearing in existing subjects and samples sheets.
Four new files have been added to hold metadata on resources, protocol performances, high level metadata for code submissions, and parameters needed to run the code.
Documentation for the new rules governing folder naming and valid folder nesting structures is forthcoming.
File system structure changes
- Add file `code_description.{csv,tsv,json,xlsx}`.
- Add file `code_parameters.{csv,tsv,json,xlsx}`.
- Add file `resources.{csv,tsv,json,xlsx}`.
- Add file `performances.{csv,tsv,json,xlsx}` to hold metadata about performances for `perf-` folders.
- Remove file variant `manifest.xlsx` containing a pattern instead of a filename.
- Remove file `plog.xlsx` with note that it may reappear as `performance_log.xlsx` in a future version.
- Remove all example folders and files. Examples of properly formatted datasets will be provided separately in the future.
Changes from 1.2.3 to 2.0.0 for dataset_description
- Add gray coloring to mark cases where only a single value is allowed.
- Remove the non-default width that was set on all columns, which caused errors when opening in LibreOffice Calc because 1024 columns were mentioned and Calc doesn't support that many.
- Remove incorrect embedded authoring metadata.
- Rename `Metadata Version DO NOT REMOVE` -> `Metadata Version`
- Move `Metadata Version` to the second row and mark it blue.
- Add `Type` row which currently only accepts `experimental` or `computational` as a value. Set the default to `experimental` since most computational datasets are likely to generate their metadata.
- Rename `Name` -> `Title` to make it consistent with internal naming.
- Move `Funding` to follow `Keywords`.
- Change `Funding` to be optional.
- Rename `Acknowledgements` -> `Acknowledgments`.
- Move `Acknowledgments` to follow `Funding`.
- Change `Acknowledgments` to be optional.
- Add `Study purpose`.
- Add `Study data collection`.
- Add `Study primary conclusion`.
- Add `Study organ system`.
- Add `Study approach`.
- Add `Study technique`.
- Rename `Title for complete data set` -> `Study collection title`.
- Rename `Contributors` -> `Contributor Name`.
- Rename `Contributor ORCID ID` -> `Contributor ORCiD`.
- Change `Contributor role` value `ContactPerson` -> `CorrespondingAuthor`.
- Change `Contributor role` cells for `Value 1` and `Value 2` to be `PrincipalInvestigator` and `CorrespondingAuthor`. Need to consider whether to make `DataManager` required as well. Will need to note to wranglers that the ordering of contributors is the order they will appear in on the dataset publication, so it is ok to move the PI and CA records to their appropriate place.
- Delete `Is contact person`. Redundant with the `CorrespondingAuthor` role.
- Delete `Originating Article DOI`.
- Delete `Protocol URL or DOI`.
- Delete `Additional Links`.
- Delete `Link Description`.
- Add `Identifier` replacing `Originating Article DOI`, `Protocol URL or DOI`, and `Additional Links`.
- Add `Identifier type`.
- Change `Identifier type` cells for `Value 1` and `Value 2` to be `HasProtocol` and `IsDescribedBy`, replacing `Protocol URL or DOI` and `Originating Article DOI`.
- Add `Relation type` matching the DataCite `Relation type` from this dataset to the related identifier.
- Add `Identifier description` replacing `Link Description` and ensuring that there is no ambiguity about which link the description applies to; in the previous template the description could refer to any of the originating article doi, the protocol doi, or the first additional link.
- Delete `Completeness of data set`.
- Delete `Parent dataset ID`. This is replaced by querying for datasets that use the same protocol. Other relations can be added via related identifiers.
Changes from 1.2.3 to 2.0.0 for subjects and samples
- Remove description and example rows.
- Rename `subject_id` to `subject id`.
- Rename `pool_id` to `pool id`. The id columns are all normalized to the previous form internally, and the underscore is hidden by the cell outline when viewing the spreadsheet, leading to confusion.
- Add `member of` for cases where we need to include a specimen in a population.
- Add `also in dataset` for including the Pennsieve id(s) for other datasets that have data about the same specimen.
- Delete `Additional fields`.
- Add `laboratory internal id` to provide a mapping for groups that have incompatible internal identifier conventions.
- Rename `experimental log file name` to `experimental log file path`.
- Rename `protocol.io location` to `protocol url or doi`.
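The note above says the space-form id columns are normalized back to the previous underscore form internally. A minimal sketch of that mapping might look like the following; `normalize_header` is a hypothetical name and the exact normalization used by the curation tooling is an assumption here.

```python
def normalize_header(header: str) -> str:
    # Map the new space-separated header form (e.g. "subject id") back to
    # the previous underscore form (e.g. "subject_id") used internally.
    return header.strip().lower().replace(" ", "_")
```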
Changes from 1.2.3 to 2.0.0 for subjects
- Rename `experimental group` to `subject experimental group` for clarity.
- Add `date of birth`.
- Add `body mass`.
- Add `phenotype`.
- Add `disease or disorder`.
- Add `disease model`.
Changes from 1.2.3 to 2.0.0 for samples
- Rename `sample_id` to `sample id`.
- Move `sample id` to the first column and `subject id` to the second column to make it clear that sample ids are the primary key.
- Rename `wasDerivedFromSpecimen` to `was derived from`.
- Rename `experimental group` to `sample experimental group` for clarity.
- Rename `specimen type` to `sample type` for consistency.
- Rename `specimen anatomical location` to `sample anatomical location` for consistency.
- Add `date of derivation`.
- Add `pathology`.
- Add `laterality`.
- Add `cell type`.
- Add `plane of section`.
Changes from 1.2.3 to 2.0.0 for submission
- Rename `SPARC Award number` -> `SPARC award number`.
- Change `SPARC award number` to accept `EXTERNAL` as a value. If `EXTERNAL` is provided then milestone data is ignored.
0.0.1.dev2
setup.py bump idlib dep
0.0.1.dev1
massive improvements in spc clone time, setup.py ver and dep bumps

A fresh pull of all 200 remote datasets now takes about 3 minutes. NOTE: `spc pull` should NOT BE USED unless you know exactly what you are doing. In the future this functionality will be restored with better performance, but for now it is almost always faster to delete the contents of the dataset folder and express ds.rchildren.

It only took me about 9 months to finally figure out that I had actually fixed many of the pulling performance bottlenecks and that we can almost entirely get rid of the current implementation of pull. As it turns out I got almost everything sorted out so that it is possible to just call `list(dataset_cache.rchildren)` and the entire tree will populate itself. When we fix the cache constructor this becomes `[rc.materialize() for rc in d.rchildren]` or similar, depending on exactly what we name that method. Better yet, if we do it using a bare for loop then the memory overhead will be zero.

The other piece that makes this faster is the completed sparse pull implementation. We now use the remote package count, with a default cutoff of 10k packages, to mark a dataset as sparse, meaning that only its metadata files and their parent directories are pulled. The implementation of that is a bit slow, but still about 2 orders of magnitude faster than the alternative. The approach for implementing is_sparse also points the way toward being able to mark folders with additional operational information, e.g. that they should not be exported or that they should not be pulled at all.

Some tweaks to how spc rmeta works were also made so that existing metadata will not be repulled in a bulk clone. This work also makes the BlackfynnCache aware of the dataset metadata pulled from rmeta, so we should be able to start comparing ttl file and bf:internal metadata in the near future.
0.0.1.dev0
skip test_datase test versions if not in git