diff --git a/mkdocs.yaml b/mkdocs.yaml index 2e7b7f5..108e95c 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -21,7 +21,7 @@ site_name: GHGA User Documentation site_description: Public documentation of the German Human Genome-Phenome Archive. site_author: The GHGA Team site_url: https://docs.ghga-dev.de - +docs_dir: user_docs # Repository repo_name: GHGA User Documentation repo_url: https://github.com/ghga-de/docs/ @@ -93,9 +93,19 @@ nav: - "Overview": index.md - "GHGA Metadata Model": - "Overview": metadata/overview.md + - "Standards": metadata/standards.md - metadata/concepts.md - metadata/modules.md - "Entities & Attributes": metadata/entities.md + - "Data Dictionary": + - "Analysis Module": metadata/data_dictionary/analysis-module.md + - "Basic Module": metadata/data_dictionary/basic-module.md + - "Data Use Conditions Module": metadata/data_dictionary/data-use-conditions-module.md + - "Dataset Module": metadata/data_dictionary/dataset-module.md + - "Phenotype Module": metadata/data_dictionary/phenotype-module.md + - "Sample Module": metadata/data_dictionary/sample-module.md + - "Sequencing Module": metadata/data_dictionary/sequencing-module.md + - "File Submission": metadata/data_dictionary/file-submission.md - "Tools": - "GHGA Validator": validator/validator.md - "GHGA Transpiler": transpiler/transpiler.md diff --git a/readme_generation.md b/readme_generation.md new file mode 100644 index 0000000..432153c --- /dev/null +++ b/readme_generation.md @@ -0,0 +1,47 @@ + + +# Readme Generation + +The README file is generated by collecting information from different sources as +outlined below. + +- name: The full name of the package is derived from the remote origin Git repository. +- title: A title case representation of the name. +- shortname: An abbreviation of the full name. This is derived from the name mentioned + in the [`./setup.cfg`](./setup.cfg). 
+- summary: A short 1-2 sentence summary derived from the description in the + [`./setup.cfg`](./setup.cfg). +- version: The package version derived from the version specified in the + [`./setup.cfg`](./setup.cfg). +- description: A markdown-formatted description of the features and use cases of this + service or package. Obtained from the [`./.description.md`](./.description.md). +- design_description: A markdown-formatted description of the overall architecture and + design of the package. Obtained from the [`./.design.md`](./.design.md). +- config_description: A markdown-formatted description of all config parameters. + This is autogenerated from the [`./config_schema.json`](./config_schema.json). +- openapi_doc: A markdown-formatted description of the HTTP API. This is autogenerated + and links to the [`./openapi.yaml`](./openapi.yaml). If the openapi.yaml is not + present, this documentation is empty. + +The [`./.readme_template.md`](./.readme_template.md) serves as a template where the +above variables can be filled in using Python's `string.Template` utility from the +standard library. + +The [`./scripts/update_readme.py`](./scripts/update_readme.py) script can be used to collect all information and +fill it into the template to generate the README file. diff --git a/scripts/__init__.py b/scripts/__init__.py new file mode 100644 index 0000000..6222ab0 --- /dev/null +++ b/scripts/__init__.py @@ -0,0 +1,17 @@ +# Copyright 2021 - 2023 Universität Tübingen, DKFZ, EMBL, and Universität zu Köln +# for the German Human Genome-Phenome Archive (GHGA) +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Scripts and utils used during development or in CI pipelines.""" diff --git a/scripts/update_all.py b/scripts/update_all.py new file mode 100755 index 0000000..78854df --- /dev/null +++ b/scripts/update_all.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 + +# Copyright 2021 - 2023 Universität Tübingen, DKFZ, EMBL, and Universität zu Köln +# for the German Human Genome-Phenome Archive (GHGA) +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +"""Run all update scripts that are present in the repository in the correct order""" + +try: + from scripts.update_template_files import main as update_template +except ImportError: + pass +else: + print("Pulling in updates from template repository") + update_template() + +try: + from scripts.update_config_docs import main as update_config +except ImportError: + pass +else: + print("Updating config docs") + update_config() + +try: + from scripts.update_openapi_docs import main as update_openapi +except ImportError: + pass +else: + print("Updating OpenAPI docs") + update_openapi() + +try: + from scripts.update_readme import main as update_readme +except ImportError: + pass +else: + print("Updating README") + update_readme() diff --git a/user_docs/metadata/data_dictionary/analysis-module.md b/user_docs/metadata/data_dictionary/analysis-module.md new file mode 100644 index 0000000..76bfd2c --- /dev/null +++ b/user_docs/metadata/data_dictionary/analysis-module.md @@ -0,0 +1,53 @@ +# **Analysis Module** + +The **Analysis Module** captures the following entities and properties: + +- Analysis + - title + - description + - [type](#type) + - [reference genome](#reference-genome) + - reference chromosome +- Analysis Process + +## **Analysis** + +### **type** +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :---------- | +| BAM | | | +| VCF | | | +| SAM/CRAM | | | + +### **reference genome** +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :---------- | +| GRCh37 | | | +| GRCh37.p1 | | | +| GRCh37.p2 | | | +| GRCh37.p3 | | | +| GRCh37.p4 | | | +| GRCh37.p5 | | | +| GRCh37.p6 | | | +| GRCh37.p7 | | | +| GRCh37.p8 | | | +| GRCh37.p9 | | | +| GRCh37.p10 | | | +| GRCh37.p11 | | | +| GRCh37.p12 | | | +| GRCh37.p13 | | | +| GRCh38 | | | +| GRCh38.p1 | | | +| GRCh38.p2 | | | +| GRCh38.p3 | | | +| GRCh38.p4 | | | +| GRCh38.p5 | | | +| GRCh38.p6 | | | +| GRCh38.p7 | | | +| 
GRCh38.p8 | | | +| GRCh38.p9 | | | +| GRCh38.p10 | | | +| GRCh38.p11 | | | +| GRCh38.p12 | | | +| GRCh38.p13 | | | +| GRCh38.p14 | | | diff --git a/user_docs/metadata/data_dictionary/basic-module.md b/user_docs/metadata/data_dictionary/basic-module.md new file mode 100644 index 0000000..d31729b --- /dev/null +++ b/user_docs/metadata/data_dictionary/basic-module.md @@ -0,0 +1,40 @@ +# **Basic Module** + +The **Basic Module** captures the following entities and properties: + +- Study + - title + - [type](#type) + - description + - affiliations +- Publication + - abstract + - author + - doi + - journal + - title + - xref + - year + + +## **Study** +### **type** + +| Controlled Vocabulary | Ontology Term | Description | +| :---------------------- | :--------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Cancer Genomics | [topic:0622] | Study of cancer genomics | +| Epigenetics | [topic:3295] | Study of heritable changes, for example in gene expression or phenotype, caused by mechanisms other than changes in the DNA sequence | +| Exome Sequencing | [EFO:0005396] | Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome (known as the exome). 
Exons (the subset of DNA that encodes proteins) are selected, and the exonic DNA is then sequenced using any high-throughput DNA sequencing technology | +| Forensic Genomics | [OMIT:0025593] | Genetic samples to help identify crime victims, perpetrators, or family relationships | +| Paleo-genomics | [topic:3943] | The reconstruction and analysis of genomic information in extinct species | +| Gene Regulation Study | [topic:0204] | The regulation of gene expression | +| Metagenomics | [topic:3174] | The study of genetic material recovered from environmental samples, and associated environmental data | +| Other | | | +| Pooled Clone Sequencing | [EFO:0003741] | An assay in which DNA is the input molecule derived from pooled clones (for example BACs and Fosmids) is sequenced using high throughput technology using shotgun methodology | +| Population Genomics | [topic:3796] | Large-scale study (typically comparison) of DNA sequences of populations | +| RNASeq | [EFO:0008896] | A method that involves purifying RNA and making cDNA, followed by high-throughput sequencing | +| Resequencing | [operation:3923] | Laboratory experiment to identify the differences between a specific genome (of an individual) and a reference genome (developed typically from many thousands of individuals). WGS re-sequencing is used as golden standard to detect variations compared to a given reference genome, including small variants (SNP and InDels) as well as larger genome re-organisations (CNVs, translocations, etc.). 
Allows re-sequencing of complete genomes of any given organism with high resolution and high accuracy | +| Synthetic Genomics | [topic:0622] | Sequencing of modified, synthetic, or transplanted genomes | +| Transcriptome Analysis | [EFO:0009865] | Sequencing and characterization of transcription elements | +| Whole Genome Sequencing | [topic:3673] | Laboratory technique to sequence the complete DNA sequence of an organism's genome at a single time | +| GWAS | [topic:3517] | Genome-wide association study experiments | diff --git a/user_docs/metadata/data_dictionary/data-use-conditions-module.md b/user_docs/metadata/data_dictionary/data-use-conditions-module.md new file mode 100644 index 0000000..9151a2e --- /dev/null +++ b/user_docs/metadata/data_dictionary/data-use-conditions-module.md @@ -0,0 +1,47 @@ +# **Data Use Conditions Module** + +The **Data Use Conditions Module** captures the following entities and properties: + +- Data Access Policy + - [data use permission](#data-use-permission) + - [data use modifier](#data-use-modifier) + +- Data Access Committee + - institute + - email + +## **Data Access Policy** + +### **data use permission** + +| Controlled Vocabulary | Ontology Term | Description | +| :------------------------------------------- | :-----------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| no restriction | [DUO:0000004] | This data use permission indicates there is no restriction on use. | +| general research use | [DUO:0000042] | This data use permission indicates that use is allowed for general research use for any research purpose. | +| health or medical or biomedical research | [DUO:0000006] | This data use permission indicates that use is allowed for health/medical/biomedical purposes; does not include the study of population origins or ancestry. 
| +| disease specific research | [DUO:0000007] | This data use permission indicates that use is allowed provided it is related to the specified disease. | +| population origins or ancestry research only | [DUO:0000011] | This data use permission indicates that use of the data is limited to the study of population origins or ancestry. | + + +### **data use modifier** + +| Controlled Vocabulary | Ontology Term | Description | +| :------------------------------------------------- | :-----------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| ethics approval required | [DUO:0000021] | This data use modifier indicates that the requestor must provide documentation of local IRB/ERB approval. | +| publication required | [DUO:0000019] | This data use modifier indicates that requestor agrees to make results of studies using the data available to the larger scientific community. | +| user specific restriction | [DUO:0000026] | This data use modifier indicates that use is limited to use by approved users. | +| population origins or ancestry research prohibited | [DUO:0000044] | This data use modifier indicates use for purposes of population, origin, or ancestry research is prohibited. | +| collaboration required | [DUO:0000020] | This data use modifier indicates that the requestor must agree to collaboration with the primary study investigator(s). This could be coupled with a string describing the primary study investigator(s). | +| non-commercial use only | [DUO:0000046] | This data use modifier indicates that use of the data is limited to not-for-profit use. This indicates that data can be used by commercial organisations for research purposes, but not commercial purposes. 
| +| not for profit, non commercial use only | [DUO:0000018] | This data use modifier indicates that use of the data is limited to not-for-profit organizations and not-for-profit use, non-commercial use. | +| research specific restrictions | [DUO:0000012] | This data use modifier indicates that use is limited to studies of a certain research type. | +| time limit on use | [DUO:0000025] | This data use modifier indicates that use is approved for a specific number of months. This should be coupled with an integer value indicating the number of months. | +| not for profit organisation use only | [DUO:0000045] | This data use modifier indicates that use of the data is limited to not-for-profit organizations. | +| publication moratorium | [DUO:0000024] | This data use modifier indicates that requestor agrees not to publish results of studies until a specific date. This should be coupled with a date specified as ISO8601 | +| genetic studies only | [DUO:0000016] | This data use modifier indicates that use is limited to genetic studies only (i.e., studies that include genotype research alone or both genotype and phenotype research, but not phenotype research exclusively) | +| return to database or resource | [DUO:0000029] | This data use modifier indicates that the requestor must return derived/enriched data to the database/resource. | +| clinical care use | [DUO:0000043] | This data use modifier indicates that use is allowed for clinical use and care. Clinical Care is defined as Health care or services provided at home, in a healthcare facility or hospital. Data may be used for clinical decision making. | +| no general methods research | [DUO:0000015] | This data use modifier indicates that use does not allow methods development research (e.g., development of software or algorithms). | +| institution specific restriction | [DUO:0000028] | This data use modifier indicates that use is limited to use within an approved institution. 
| +| geographical restriction | [DUO:0000022] | This data use modifier indicates that use is limited to within a specific geographic region. This should be coupled with an ontology term describing the geographical location the restriction applies to. | +| project specific restriction | [DUO:0000027] | This data use modifier indicates that use is limited to use within an approved project. | diff --git a/user_docs/metadata/data_dictionary/dataset-module.md b/user_docs/metadata/data_dictionary/dataset-module.md new file mode 100644 index 0000000..c11ae75 --- /dev/null +++ b/user_docs/metadata/data_dictionary/dataset-module.md @@ -0,0 +1,29 @@ +# **Dataset Module** + +The **Dataset Module** captures the following entities and properties: + +- Dataset + - title + - description + - [type](#type) + +## **Dataset** + +### **type** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------------------------------------------------- | :--------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Whole genome sequencing | [NCIT:C101294] | A procedure that can determine the DNA sequence for nearly the entire genome of an individual. [ NCI ] | +| Exome sequencing | [EFO:0005396] | Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome (known as the exome). Exons (the subset of DNA that encodes proteins) are selected, and the exonic DNA is then sequenced using any high-throughput DNA sequencing technology. 
[ https://orcid.org/0000-0002-2825-0621 https://en.wikipedia.org/w/index.php?title=Exome_sequencing&oldid=1000635953 ] | +| Genotyping by array | [EFO:0002767] | An assay in which an array is used to detect polymorphisms in DNA samples | +| Transcriptome profiling by high-throughput sequencing | [EFO:0002770] | A method used to assess the transcriptome of a biological sample using a high-throughput sequencing platform. | +| Transcriptome profiling by array | [EFO:0002768] | An assay in which the transcriptome of a biological sample is analysed using array technology. | +| Amplicon sequencing | [EFO:0003747] | An assay in which a DNA or RNA input molecule amplified by PCR is sequenced. | +| Methylation binding domain sequencing | [EFO:0003750] | An assay in which DNA is the input molecule derived from a selection process using methyl binding domain protein to enrich for methylated fractions of DNA, then sequenced using high throughput sequencing. | +| Methylation profiling by high-throughput sequencing | [EFO:0002761] | An assay in which the methylation state of DNA is determined and is compared between samples using sequencing based technology. | +| Phenotype information | [EFO:0000651] | The observable form taken by some character (or group of characters) in an individual or an organism, excluding pathology and disease. The detectable outward manifestations of a specific genotype | +| Study summary information | | Object containing complementary summaries of other objects | +| Genomic variant calling | [operation:3227] | Detect, identify and map mutations, such as single nucleotide polymorphisms, short indels and structural variants, in multiple DNA sequences. 
Typically the alignment and comparison of the fluorescent traces produced by DNA sequencing hardware, to study genomic alterations | +| Chromatin accessibility profiling by high-throughput sequencing | [EFO:0007045] | Assay for transposase-accessible chromatin using sequencing (ATAC-seq), is a method based on direct in vitro transposition of sequencing adaptors into native chromatin, and is a rapid and sensitive method for integrative epigenomic analysis. ATAC-seq captures open chromatin sites using a simple two-step protocol. | +| Histone modification profiling by high-throughput sequencing | | Sequencing assay revolving around post-translational processing of amino acids within histone proteins | +| Chip-Seq | [EFO:0002692] | ChIP-seq is an assay in which chromatin immunoprecipitation with high throughput sequencing is used to identify the cistrome of DNA-associated proteins | diff --git a/user_docs/metadata/data_dictionary/file-submission.md b/user_docs/metadata/data_dictionary/file-submission.md new file mode 100644 index 0000000..ff132a3 --- /dev/null +++ b/user_docs/metadata/data_dictionary/file-submission.md @@ -0,0 +1,50 @@ +# **File Submission** + +The **File Submission** captures the following entities and properties: + +- File + - name + - [format](#format) + - size + - checksum + - checksum type + - [forward or reverse](#forward-or-reverse) + +## **format** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| agp | [topic:3693] | AGP is a tabular format for a sequence assembly (a contig, a scaffold/supercontig, or a chromosome). 
| +| bai | [topic:3327] | BAM indexing format | +| bam | [topic:2572] | BAM format, the binary, BGZF-formatted compressed version of SAM format for alignment of nucleotide sequences (e.g. sequencing reads) to (a) reference sequence(s). May contain base-call and alignment qualities and other data. | +| bcf | [topic:3020] | BCF, the binary version of Variant Call Format (VCF) for sequence variation (indels, polymorphisms, structural variation). | +| bed | [topic:3003] | Browser Extensible Data (BED) format of sequence annotation track, typically to be displayed in a genome browser. | +| crai | | An index file corresponding to a CRAM file | +| cram | [topic:3462] | Reference-based compression of alignment format | +| csv | [topic:3752] | Tabular data represented as comma-separated values in a text file. | +| fasta | [topic:1929] | FASTA format including NCBI-style IDs. | +| fastq | [topic:1930] | FASTQ short read format ignoring quality scores. | +| gff | [topic:2305] | GFF feature format (of indeterminate version). | +| hdf5 | [topic:3590] | HDF5 is a data model, library, and file format for storing and managing data, based on Hierarchical Data Format (HDF). | +| info | | Info files contain unformatted plain text and are often used to document version information, authorship, and copyright information | +| json | [topic:3464] | JavaScript Object Notation format; a lightweight, text-based format to represent tree-structured data using key-value pairs. | +| md | | Markdown file | +| other | | | +| ped | [topic:3286] | The PED file describes individuals and genetic data and is used by the Plink package. | +| sam | [topic:2573] | Sequence Alignment/Map (SAM) format for alignment of nucleotide sequences (e.g. sequencing reads) to (a) reference sequence(s). May contain base-call and alignment qualities and other data. 
| +| sff | [topic:3284] | Standard flowgram format (SFF) is a binary file format used to encode results of pyrosequencing from the 454 Life Sciences platform for high-throughput sequencing. | +| srf | [topic:3017] | Sequence Read Format (SRF) of sequence trace data. Supports submission to the NCBI Short Read Archive. | +| tab | [MS:1000914] | A file format that has two or more columns of tabular data where each column is separated by a TAB character. [ http://www.w3.org/2002/07/owl#Axiom PSI : MS ] | +| tabix | [topic:3616] | TAB-delimited genome position file index format. | +| tsv | [topic:3475] | Tabular data represented as tab-separated values in a text file. | +| txt | [topic:1964] | Plain text sequence format (essentially unformatted). | +| vcf | [topic:3016] | Variant Call Format (VCF) for sequence variation (indels, polymorphisms, structural variation). | +| wig | [topic:3005] | Wiggle format (WIG) of a sequence annotation track that consists of a value for each sequence position. Typically to be displayed in a genome browser. 
| + + +## **forward or reverse** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :------------------------------- | +| forward | | The reads are forward (R1) reads | +| reverse | | The reads are reverse (R2) reads | diff --git a/user_docs/metadata/data_dictionary/phenotype-module.md b/user_docs/metadata/data_dictionary/phenotype-module.md new file mode 100644 index 0000000..3d229f9 --- /dev/null +++ b/user_docs/metadata/data_dictionary/phenotype-module.md @@ -0,0 +1,117 @@ +# **Phenotype Module** + +The **Phenotype Module** captures the following entities and properties: + +- Biospecimen + - type + - name + - description + - tissue + - [isolation](#isolation) + - [storage](#storage) + - [vital status at sampling](#vital-status-at-sampling) + - [age at sampling](#age-at-sampling) + +- Individual + - [sex](#sex) + - [phenotypic feature](#phenotypic-feature) + - [karyotype](#karyotype) + - [geographical region](#geographical-region) + - [ancestry](#ancestry) + +- Trio + - mother + - father + - child + + +## **Biospecimen** + +### **isolation** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------: | :---------- | +| terms from SNOMED CT classification "Removal" terms | https://www.ebi.ac.uk/ols/ontologies/snomed/terms?iri=http%3A%2F%2Fsnomed.info%2Fid%2F118292001&lang=en&viewMode=All&siblings=false# | | + +### **storage** + +| Controlled Vocabulary | Ontology Term | Description | +| :------------------------------------------- | :-----------: | :---------- | +| Refrigerated storage (2°C to 5°C) | | | +| Freezer storage (-20°C) | | | +| Ultra-low freezer storage (-80°C) | | | +| Cryogenic freezer storage (-150°C to -190°C) | | | + +### **vital status at sampling** + +| Controlled Vocabulary | Ontology Term | Description | +| 
:-------------------- | :-----------: | :----------------------------------------------------------------- | +| alive | [NCIT:C37987] | Showing characteristics of life; displaying signs of life. [ NCI ] | +| deceased | [NCIT:C28554] | The cessation of life. [ NCI ] | +| unknown | | | + +### **age at sampling** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :---------- | +| 0 - 5 | | | +| 6 - 10 | | | +| 11 - 15 | | | +| 16 - 20 | | | +| 21 - 25 | | | +| 26 - 30 | | | +| 31 - 35 | | | +| 36 - 40 | | | +| 41 - 45 | | | +| 46 - 50 | | | +| 51 - 55 | | | +| 56 - 60 | | | +| 61 - 65 | | | +| 66 - 70 | | | +| 71 - 75 | | | +| 76 - 80 | | | +| 80+ | | | + + + +## **Individual** + +### **sex** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| female | [GSSO:011317] | A sex for clinical use value in which stereotypically or statistically "female" values apply to an individual in a given medical context, such as for a procedure, process, algorithm, hormone level, genetic composition, organ inventory, etc. | +| male | [GSSO:011318] | A sex for clinical use value in which stereotypically or statistically "male" values apply to an individual in a given medical context, such as for a procedure, process, algorithm, hormone level, genetic composition, organ inventory, etc. | +| unknown | [GSSO:011320] | A sex for clinical use value in which the stereotypical or statistical known values do not apply, cannot be determined, or are not sufficient for determination of another value. 
| +| other | | | + +### **phenotypic feature** + +| Controlled Vocabulary | Ontology Term | Description | +| :----------------------------------- | :-----------------------------------------------------------------------------------------------------------------------------------------: | :---------- | +| HPO Terms | https://hpo.jax.org/app/ | | +| EFO Terms for "Material Entity" | https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000040&lang=en&viewMode=All&siblings=false | | +| ORDO Codes (includes Orpha and OMIM) | https://www.ebi.ac.uk/ols/ontologies/ordo | | +| SNOMED CT Terms for "Finding" | https://www.ebi.ac.uk/ols/ontologies/snomed/terms?iri=http%3A%2F%2Fsnomed.info%2Fid%2F404684003&lang=en&viewMode=All&siblings=false | | +| MONDO Terms | https://www.ebi.ac.uk/ols/ontologies/mondo | | +| ICD10 Terms | https://www.icd-code.de/ | | + +### **karyotype** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :------------------------------------------------------------------------- | +| 46,XY | [GSSO:009560] | A karyotype in which every cell has one X chromosome and one Y chromosome. | +| 46,XX | [GSSO:009558] | A karyotype in which every cell has two X chromosomes. 
| +| other | | | + +### **geographical region** + +| Controlled Vocabulary | Ontology Term | Description | +| :--------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| terms from HANCESTRO terms for "Country" | https://www.ebi.ac.uk/ols/ontologies/hancestro/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHANCESTRO_0003&lang=en&viewMode=All&siblings=false | A collective generic term that refers here to a wide variety of dependencies, areas of special sovereignty, uninhabited islands, and other entities in addition to the traditional countries or independent states. | + +### **ancestry** + +| Controlled Vocabulary | Ontology Term | Description | +| :------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------- | +| terms from HANCESTRO terms for "Ancestry category" | https://www.ebi.ac.uk/ols/ontologies/hancestro/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHANCESTRO_0004&lang=en&viewMode=All&siblings=false | Population category defined using ancestry informative markers (AIMs) based on genetic/genomic data | diff --git a/user_docs/metadata/data_dictionary/sample-module.md b/user_docs/metadata/data_dictionary/sample-module.md new file mode 100644 index 0000000..d9296e1 --- /dev/null +++ b/user_docs/metadata/data_dictionary/sample-module.md @@ -0,0 +1,66 @@ +# **Sample Module** + +The **Sample Module** captures the following entities and properties: + +- Sample + - name + - 
description + - type + - [isolation](#isolation) + - [storage](#storage) + +- Condition + - title + - name + - description + - [mutant or wildtype](#mutant-or-wildtype) + - [disease or healthy](#disease-or-healthy) + - [case control status](#case-control-status) + +## **Sample** + +### **isolation** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------: | :---------- | +| terms from SNOMED CT classification "Removal" terms | https://www.ebi.ac.uk/ols/ontologies/snomed/terms?iri=http%3A%2F%2Fsnomed.info%2Fid%2F118292001&lang=en&viewMode=All&siblings=false# | | + +### **storage** + +| Controlled Vocabulary | Ontology Term | Description | +| :------------------------------------------- | :-----------: | :---------- | +| Refrigerated storage (2°C to 5°C) | | | +| Freezer storage (-20°C) | | | +| Ultra-low freezer storage (-80°C) | | | +| Cryogenic freezer storage (-150°C to -190°C) | | | + + +## **Condition** + +### **mutant or wildtype** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :--------------------------------- | +| mutant | | Mutant state | +| wildtype | | Wildtype state | +| not applicable | | The distinction is not applicable | + +### **disease or healthy** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :--------------------------------- | +| disease | | Disease state | +| healthy | | Healthy state | +| not applicable | | The distinction is not applicable | + +### **case control status** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------------------------- | :-----------: | :---------------------------------------------------------------------------------------------------- | +| Neither Case or Control Status | [NCIT:C99273] | 
The participant is neither a true case or true control for the phenotype under consideration. [ NCI ] | +| Probable Case Status | [NCIT:C99271] | The participant is a probable case for the phenotype under consideration. [ NCI ] | +| Probable Control Status | [NCIT:C99272] | The participant is a probable control for the phenotype under consideration. [ NCI ] | +| True Case Status | [NCIT:C99269] | The participant is a true case for the phenotype under consideration. [ NCI ] | +| True Control Status | [NCIT:C99270] | The participant is a true control for the phenotype under consideration. [ NCI ] | +| Unable to Assess Case or Control Status | [NCIT:C99274] | The participant's case/control status could not be assessed. [ NCI ] | +| Both | | | diff --git a/user_docs/metadata/data_dictionary/sequencing-module.md b/user_docs/metadata/data_dictionary/sequencing-module.md new file mode 100644 index 0000000..61202e1 --- /dev/null +++ b/user_docs/metadata/data_dictionary/sequencing-module.md @@ -0,0 +1,292 @@ +# **Sequencing Module** + +The **Sequencing Module** captures the following entities and properties: + +- Sequencing Experiment + - title + - type + - description + +- Sequencing Process + - title + - name + - description + - index sequence + - lane number + - sequencing run id + - sequencing machine id + - sequencing lane id + +- Library Preparation Protocol + - description + - library name + - library preparation kit retail name + - library preparation kit manufacturer + - [library layout](#library-layout) + - [library type](#library-type) + - [library selection](#library-selection) + - [library preparation](#library-preparation) + - [rna strandedness](#library-rna-strandedness) + - [primer](#library-primer) + - [end bias](#library-end-bias) + - rnaseq strandedness + - target regions + +- Sequencing Protocol + - type + - description + - [instrument model](#instrument-model) + - [flow cell type](#flow-cell-type) + - flow cell id + - sequencing center + - sequencing 
read length + - target coverage + - [umi barcode read](#umi-barcode-read) + - umi barcode offset + - umi barcode size + - [cell barcode read](#cell-barcode-read) + - cell barcode offset + - cell barcode size + - [sample barcode read](#sample-barcode-read) + +## **Library Preparation Protocol** + +### **library layout** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :--------------------------------------------------------------------------------------------------------------------------------------------------- | +| SE | [OBI:0002481] | A cDNA library that is a collection of short tags from only one end of DNA fragments. | +| PE | [OBI:0000722] | A paired-end library is a collection of short paired tags from the two ends of DNA fragments are extracted and covalently linked as ditag constructs | + +### **library type** + +| Controlled Vocabulary | Ontology Term | Description | +| :------------------------------ | :------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| WGS | [NCIT:C101294] | A procedure that can determine the DNA sequence for nearly the entire genome of an individual. [ NCI ] | +| WXS | [NCIT:C101295] | A procedure that can determine the DNA sequence for all of the exons in an individual. 
[ NCI ] | +| WCS | [NCIT:C19653] | A DNA sequencing method, which involves random sequencing of tiny cloned pieces of the genome, with no foreknowledge of where on a chromosome the piece originally came from. The sequences obtained have a considerable overlap and by using appropriate computer software it is possible to compare sequences and align them to build larger units of genetic information. This sequencing strategy can be automated and leads to rapid sequencing information. [ NCI ] | +| Total RNA | [NCIT:C124261] | A procedure that can determine the nucleotide sequence for all of the RNA transcripts in an individual. [ NCI ] | +| mRNA | [NCIT:C129432] | A procedure that can determine the RNA sequences for all or part of the poly-A tail-containing messenger RNA transcripts in an individual. [ NCI ] | +| miRNA | [NCIT:C156057] | A next-generation or massively parallel high-throughput DNA sequencing-based procedure that can identify and quantify the microRNA sequences present in a biological sample. [ NCI ] | +| ncRNA | [NCIT:C172858] | A molecular genetic technique that can determine the RNA sequences for all or part of the population of small and large non-protein coding RNA transcripts in a sample. [ NCI ] | +| ATAC | [NCIT:C156056] | A molecular genetic technique that isolates and sequences chromosomal regions that are rich in open chromatin. First, nuclei are harvested from a cellular sample. Then a hyperactive Tn5 transposase is added to the nuclei where it excises non-nucleosomal DNA strands and ligates co-administered high-throughput sequencing adapters (tagmentation). The tagged DNA fragments are isolated, amplified by PCR and sequenced. The number of reads for specific region of DNA correlate with increased chromatin accessibility and this method can identify regions of transcription factor and nucleosome binding. [ NCI ] | +| Methylation | [topic:3674] | Laboratory technique to sequence the methylated regions in DNA. 
| +| Chromosome conformation capture | [topic:3940] | Molecular biology methods used to analyze the spatial organization of chromatin in a cell. | + +### **library selection** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------------------------------- | :---------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------ | +| 5-methylcytidine antibody method | [GENEPIO:0001941] | Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C). | +| CAGE method | [GENEPIO:0001942] | Cap-analysis gene expression. | +| cDNA method | [GENEPIO:0001962] | PolyA selection or enrichment for messenger RNA (mRNA) complementary DNA. | +| CF-H method | [GENEPIO:0001943] | Cot-filtered highly repetitive genomic DNA | +| CF-M method | [GENEPIO:0001944] | Cot-filtered moderately repetitive genomic DNA | +| CF-S method | [GENEPIO:0001945] | Cot-filtered single/low-copy genomic DNA | +| CF-T method | [GENEPIO:0001946] | Cot-filtered theoretical single-copy genomic DNA | +| ChIP-Seq method | [GENEPIO:0001947] | Chromatin immunoprecipitation | +| DNase method | [GENEPIO:0001948] | Deoxyribonuclease (MNase) digestion | +| HMPR method | [GENEPIO:0001949] | Hypo-methylated partial restriction digest | +| Hybrid Selection method | [GENEPIO:0001950] | Selection by hybridization in array or solution. | +| MBD2 protein methyl-CpG binding domain method | [GENEPIO:0001951] | Enrichment by methyl-CpG binding domain. 
| +| MDA | [EFO:0008800] | Multiple displacement amplification (MDA) | +| MF method | [GENEPIO:0001952] | Methyl Filtrated | +| MNase method | [GENEPIO:0001953] | Micrococcal Nuclease (MNase) digestion | +| MSLL method | [GENEPIO:0001954] | Methylation Spanning Linking Library | +| oligo-dT | [FBcv:0003203] | Material that has been selected using short sequences of deoxy-thymine nucleotides to affinity purify RNA containing long polyA stretches. [ FBC : CT ] | +| PCR method | [GENEPIO:0001955] | Source material was selected by designed primers. | +| RACE method | [GENEPIO:0001956] | Rapid Amplification of cDNA Ends. | +| RANDOM PCR method | [GENEPIO:0001957] | Source material was selected by randomly generated primers. | +| RANDOM method | [GENEPIO:0001958] | Random selection by shearing or other method. | +| RT-PCR method | [GENEPIO:0001959] | Source material was selected by reverse transcription PCR | +| Reduced Representation method | [GENEPIO:0001960] | Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling. | +| Restriction Digest method | [GENEPIO:0001961] | DNA fractionation using restriction enzymes. | +| Size Fractionation method | [GENEPIO:0001963] | Physical selection of size appropriate targets. | +| unspecified | | | +| other | [GENEPIO:0001964] | Other library enrichment, screening, or selection process. 
| +| inverse rRNA | | | +| padlock probes capture method | | | +| PolyA | | | +| Repeat Fractionation | | | + +### **library preparation** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------------------------------------------------------- | :-----------: | :---------- | +| 10XGenomics | | | +| 10xGenomics Chromium Single Cell 3 v2 | | | +| Accel-NGS 2S Plus DNA Library Kit | | | +| Accel-NGS Methyl-Seq DNA | | | +| Agilent Strand-Specific RNA | | | +| Agilent SureSelect Custom enrichment Kit | | | +| Agilent SureSelect V3 | | | +| Agilent SureSelect V4 | | | +| Agilent SureSelect V4+UTRs | | | +| Agilent SureSelect V5 | | | +| Agilent SureSelect V5+UTRs | | | +| Agilent SureSelect V6 | | | +| Agilent SureSelect V6+ONE | | | +| Agilent SureSelect V6+UTRs | | | +| Agilent SureSelect V7 | | | +| Agilent SureSelect WGS | | | +| Agilent SureSelect XT HS + Human All Exon V7 | | | +| Agilent SureSelect XT Mouse AllExon | | | +| Agilent XT-HS SureSelect Clinical Research Exome V2 | | | +| Avenio ctDNA Kit | | | +| IDT_xGen_Exome_Research_Panel | | | +| Illumina Nextera DNA Flex | | | +| Illumina Nextera Exome Enrichment Kit | | | +| Illumina TruSeq Custom Amplicon | | | +| Illumina TruSeq DNA | | | +| Illumina TruSeq Nano DNA | | | +| Illumina TruSeq Nano DNA HT | | | +| Illumina TruSeq Nano DNA LT | | | +| Illumina TruSeq Nano FFPE DNA | | | +| Illumina TruSeq PCR-free | | | +| Illumina Truseq PCR-free Methyl | | | +| Illumina TruSeq PCRFree DNA | | | +| Illumina TruSeq RNA | | | +| Illumina TruSeq Small RNA Kit | | | +| Illumina TruSeq Stranded Total RNA Kit | | | +| Illumina VAHTS total RNA | | | +| Inform_OncoPanel_hg19 | | | +| Ion AmpliSeq Exome Kit | | | +| KAPA Hyper Prep Kit | | | +| KAPA HyperPlus Kit | | | +| Kapa mRNA HyperPrep kit | | | +| Magnetic Methylated DNA Immunoprecipitation(Diagnode) | | | +| NEB NEXT Ultra directional RNA | | | +| NEB Next Ultra II Directional RNA | | | +| NEBNext ChIP-seq library prep kit for Illumina 
| | | +| NEBNext RNA Ultra II stranded | | | +| NEBNext Ultra DNA | | | +| NEBNext Ultra DNA Library Prep Kit for Illumina | | | +| NEBNext Ultra II DNA Library Prep Kit for Illumina | | | +| Nextera XT DNA | | | +| Pico Methyl Seq | | | +| Smart Seq v4 Ultra low Input RNA Kit | | | +| SMARTer Stranded Total RNA-Seq Kit | | | +| SmarTer Ultra Low Input RNA and NEBNext ChIP-Seq | | | +| SmarTer Ultra Low Input RNA v4 and NEBNext ChIP-Seq | | | +| SMARTseq2_tag | | | +| SureSelect eurofins enrichment custom 01 | | | +| Takara Clontech SMARTer Stranded Total RNA | | | +| Takara SMARTer PrepX DNA Library Kit - Active Motif custom indices 01 | | | +| TruSeq ChIP Sample Preparation Kit | | | +| TruSeq stranded total RNA- Ribo Minus Gold | | | +| Twist Human Core Exome Plus Kit | | | +| Ultralow_Methyl-Seq_with_TrueMethyl_oxBS_Module | | | + +### **library RNA strandedness** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :----------------------------------------------------------------------------------------------------------------------------- | +| sense | [NCIT:C63550] | Having a DNA sequence identical to that of a messenger RNA molecule; the coding strand in double-stranded DNA. [ NCI ] | +| antisense | [NCIT:C63551] | Having a DNA sequence complementary to that of a messenger RNA molecule; the non-coding strand in double-stranded DNA. [ NCI ] | +| both | | | + +### **library primer** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :------------------------------------------------------------------------------------------------------------------------------------- | +| oligo-dT | [EFO:0010215] | An oligonucleotide primer consisting of thymidine bases only. It is used to target messenger RNA molecules with poly-adenosine 3' end. | +| random | [EFO:0010216] | An oligonucleotide primer with random sequence. 
| +| gene-specific | | | +| other | | | + +### **library end bias** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :------------------------------------------------------------------------------------------------------------------- | +| 3'-end | [EFO:0010189] | When a sequencing method preferentially captures the nucleic acids towards the 3 prime end of the targeted molecule. | +| 5'-end | [EFO:0010191] | When a sequencing method preferentially captures the nucleic acids towards the 5 prime end of the targeted molecule. | +| full length | | | +| other | | | + +## **Sequencing Protocol** + +### **instrument model** + +| Controlled Vocabulary | Ontology Term | Description | +| :------------------------------------- | :---------------: | :---------- | +| Illumina HiScan | | | +| Illumina HiSeq 1000 | [EFO:0004204] | | +| Illumina HiSeq 1500 | [EFO:0011027] | | +| Illumina HiSeq 2000 | [EFO:0004203] | | +| Illumina HiSeq 2500 | [EFO:0008565] | | +| Illumina HiSeq 3000 | [EFO:0008564] | | +| Illumina HiSeq 4000 | [EFO:0008563] | | +| Illumina HiSeq X Five | [GENEPIO:0100112] | | +| Illumina HiSeq X Ten | [GENEPIO:0100113] | | +| Illumina HiSeq X | [EFO:0008567] | | +| Illumina iScan | | | +| Illumina iSeq 100 | [EFO:0008635] | | +| Illumina MiniSeq | [EFO:0008636] | | +| Illumina MiSeq | [EFO:0004205] | | +| Illumina MiSeqDx | | | +| Illumina MiSeqDx (Research Mode) | | | +| Illumina NextSeq 500 | [OBI:0002021] | | +| Illumina NextSeq 550 | [EFO:0008566] | | +| Illumina NextSeq 550Dx | | | +| Illumina NextSeq 550Dx (Research Mode) | | | +| Illumina NextSeq 1000 | [EFO:0010962] | | +| Illumina NextSeq 2000 | [EFO:0010963] | | +| Illumina NovaSeq 6000 | [EFO:0008637] | | +| Illumina Genome Analyzer | [EFO:0004200] | | +| Illumina Genome Analyzer II | [EFO:0004201] | | +| Illumina Genome Analyzer IIx | [EFO:0004202] | | +| Illumina HiScanSQ | [GENEPIO:0100109] | | +| PacBio Revio | | | +| PacBio Onso | | | +| PacBio 
Sequel IIe | | | +| PacBio Sequel II | [OBI:0002633] | | +| PacBio Sequel | [EFO:0008631] | | +| PacBio RS | [GENEPIO:0100131] | | +| PacBio RS II | [EFO:0008631] | | +| ONT MinION | [EFO:0008632] | | +| ONT GridION | [OBI:0002751] | | +| ONT PromethION | [EFO:0008634] | | +| DNBSEQ-G50 | [GENEPIO:0100150] | | +| DNBSEQ-T7 | [GENEPIO:0100147] | | +| DNBSEQ-G400 | [GENEPIO:0100148] | | +| DNBSEQ-G400 FAST | [GENEPIO:0100149] | | +| Ultima UG 100 | | | +| other | | | + +### **flow cell type** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :---------- | +| Illumina NovaSeq S2 | | | +| Illumina NovaSeq S4 | | | +| PromethION | | | +| Flongle | | | +| MinION | | | +| GridION | | | +| other | | | + +### **umi barcode read** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :---------- | +| index1 | | | +| index2 | | | +| read1 | | | +| read2 | | | + +### **cell barcode read** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :---------- | +| index1 | | | +| index2 | | | +| read1 | | | +| read2 | | | + +### **sample barcode read** + +| Controlled Vocabulary | Ontology Term | Description | +| :-------------------- | :-----------: | :---------- | +| index1 | | | +| index1 and index2 | | | +| other | | | diff --git a/user_docs/metadata/entities.md b/user_docs/metadata/entities.md index fce40ab..5180b07 100644 --- a/user_docs/metadata/entities.md +++ b/user_docs/metadata/entities.md @@ -1,29 +1,104 @@ -# Entities & Attributes +# **Captured Metadata** -## Dataset +This section provides an overview on what metadata elements are captured with the GHGA Metadata Schema. 
-## Study +A breakdown of each metadata element described in the different entities shows which elements are required for the functionality of GHGA, which properties are mandatory, and which recommended or optional information data submitters can provide. -## File +## **Study** -## Publication +All data deposited at GHGA belongs to a specific study, under which relevant data has been aggregated. A study is an experimental investigation of a particular phenomenon and involves a detailed examination and analysis of a subject to learn more about the phenomenon being studied. A detailed description of a study can guide data requesters to identify the most relevant datasets for their own research. -## Condition +### **Study metadata properties** -## Sample +In order to describe a *Study*, data submitters are required to provide information about the affiliation(s), title, description and type (e.g. Cancer Genomics, Epigenomics) of the study. -## Biospecimen +A study can also be linked to a *Publication*. The *Publication* entity holds the title, id (e.g. DOI of a publication), external reference, abstract, author, year, and journal for a unique publication. -## Individual +*Publication* is an optional metadata entity. If it is submitted, each of its properties is either mandatory or optional. -## SequencingProcess +## **Sample** -## LibraryPreparationProtocol +GHGA's *Sample* metadata can be separated into three distinct entities: *Sample*, *Biospecimen* and *Condition*. Both the *Sample* and *Biospecimen* entities provide the data submitter with options to deposit metadata that allows for deeper insight into the characteristics of samples and biospecimens. The *Condition* entity allows submitters to further define the state of the samples and to group samples within a study accordingly. The following paragraphs define what a sample, biospecimen or condition is in the context of GHGA's metadata schema.
-## SequencingProtocol +A *Sample* is defined as a limited quantity of something to be used for testing, analysis, inspection, investigation, demonstration, or trial use. A sample is prepared from a biospecimen (isolate or tissue). -## SequencingExperiment +A *Biospecimen* is defined in GHGA's metadata as any natural material taken from a biological entity for testing, diagnostics, treatment or research purposes. The *Biospecimen* is linked to the *Individual* entity from which the biospecimen itself has been derived. -## Analysis +A *Condition* describes the state and origin of a sample. It captures actions applied to a sample that were necessary for the specific study in which the sample is used. The *Condition* links the *Sample* to a *Study*. -## AnalysisProcess +### **Sample metadata properties** + +The *Sample* entity requires data submitters to provide the name and description of a sample, as well as the link to the condition. On top of those mandatory properties, a submitter can provide more information on a sample through the type of isolation of the sample, or how it is stored. Further properties held by the sample entity are type (e.g. genomic DNA, single cell RNA or total RNA) and an external reference. + +The *Biospecimen* entity captures only optional information, which reflects the information of the sample entity. These include an alias, a description, isolation, name, type, external ID and storage. Additionally, this entity captures the age at sampling and vital status of the *Biospecimen* donor. + +The *Condition* entity captures information about whether and how a sample was treated and its status (case or control, mutant or wildtype). This information is important to uniquely link samples with the same *Condition* within one *Study*. All properties within the *Condition* are required. + +## **Individual** + +The *Individual* entity within GHGA's Metadata Schema is aimed at capturing relevant information about a sample’s donor.
The content of the individual entity is crucial to identify cohorts of interest and gives valuable insight into the target group of an experiment. Data submitters are asked to provide information such as sex and other phenotypic features that help data requesters to identify a cohort of interest. + +Individuals can be part of a *Trio*. The *Trio* class represents a study design often used in studying genetic conditions in a family. It involves the genetic analysis of three individuals within a family unit, usually a child and their biological parents. + +### **Individual metadata properties** +The data submitter is required to provide information about an individual’s sex. Additional information such as phenotypic features, karyotype, geographical region and ethnicity can be submitted. + +## **Sequencing Experiment** + +Omics data is gathered while carrying out an experiment under certain conditions and procedures. Thus, GHGA aims at collecting as much information as possible about the protocols used, which gives data requesters insight into how an experiment was conducted. Data submitters are asked to provide a core set of properties to help make the data deposited at GHGA richer, but are welcome to provide all the information that has been generated while carrying out the experiment for the dataset. The insights provided by this collection of information help to make data reusable, one of the main incentives of the FAIR data principles. + +### **Sequencing Experiment metadata properties** +*Sequencing Experiment* metadata elements within GHGA’s Metadata Schema span not only information about the experiment itself but also the protocols under which an experiment has been conducted. These include entities for *Sequencing Process*, *Sequencing Protocol* and *Library Preparation Protocol*. + +The *Sequencing Experiment* entity is used to group *Sequencing Processes*, *Library Preparation Protocols* and *Sequencing Protocols* within one *Sequencing Experiment*.
Via the *Sequencing Process* and *Condition* entities, it is then linked to *Study*, *Sample* and *File*. + +The *Sequencing Process* captures the technical parameters that were used to generate output from a *Sample* during a *Sequencing Experiment*. + +The *Sequencing Protocol* entity offers a variety of properties that a data submitter can submit. Mandatory properties are the sequencing alias, the description of the sequencing protocol and the instrument model with which the sequencing was done. Optional properties include the offset, read and size for cell and UMI barcodes, flow cell id and type, read length, target coverage, and the type of the sequencing protocol used. + +The *Library Preparation Protocol* entity requires a data submitter to provide the following mandatory properties in order to allow reproducible research: library layout, type, selection and preparation, as well as a name for the protocol, an alias and a thorough description. Optional properties include information about the kit retail name and manufacturer, the library primer, the RNAseq strandedness, target regions, primer end bias and the type of the protocol. If there is a publicly referable URL for the protocol, this can also be submitted. + + +## **File** + +At the core of GHGA is the deposition of raw files that have been generated while carrying out an experiment. These files also have to be annotated with metadata, in order to give data requesters more information on what files have been deposited at GHGA by the data submitter. Therefore, this metadata will also be used by the user interface of GHGA's data portal to provide not only information on how many files are contained within a dataset, but also information on file size, file formats, checksums and file types. + +The files deposited at GHGA, and their metadata, will always link to an experiment entity.
+ +### **File metadata properties** + +The *File* entity requires submitters to provide the name, the alias and the format for a file. During the GHGA Catalog phase, the submitter also has to provide information about the file size, checksum and checksum type. + +## **Analysis** + +GHGA will provide data analysis of the raw files deposited at GHGA by a data submitter. The *Analysis* entity will aim at storing metadata related to the computational analysis of the files that potentially will be run using containers and nf-core pipelines. The information that will be stored in the *Analysis* entity will help to make the analysis data reproducible and reusable with respect to the FAIR data principles. + +### **Analysis metadata properties** + +The data submitter is required to provide an analysis alias, the aliases for the input and output files, as well as the link to a study and the reference genome or chromosome(s) used for the analysis. Optional properties include a description of the analysis steps and the analysis type. + +## **Dataset** + +GHGA presents its content to potential data requesters and submitters with the *Dataset* entity, which focuses on sharing functionality by describing the contents at a high level. Each dataset is linked to a *Data Access Policy*, which builds the legal basis for the sharing of data. One dataset has links to *Experiment* and / or *Analysis* entities to bundle all relevant data that makes a dataset by the definition of the GHGA Metadata Schema. + +### **Dataset metadata properties** + +The *Dataset* entity is aimed at capturing relevant information about a dataset itself. The data submitter can provide a description and a title for the dataset. The main purpose of this entity is to link a dataset to the related study, experiments, samples, analysis, files and data access policies. 
These links must be provided on the submission of data, either through automatic linking with respect to the *Data Access Committee*, or by the data submitter. + +All properties captured in the *Dataset* entity are required for the functionality of GHGA and are therefore mandatory. The only exception is the analysis alias, which only needs to be provided if an analysis is to be submitted. A title and description can be indexed by the database in order to make the GHGA Data Portal searchable for a specific dataset. In addition, the links to study, experiment, samples, analysis (if available) and files are necessary to provide a data requester with all relevant data and metadata associated with a dataset. This also ensures reusability in the light of the FAIR Data Principles. + +## **Data Access Policy and Committee** + +Depositing data at GHGA requires a data submitter to provide a *Data Access Committee (DAC)* and *Data Access Policy (DAP)*. This ensures controlled access to their deposited data and a clear guideline for data requesters to access the data. This includes a defined contact person and a consent-based legal basis for getting access to a dataset. + +### **Data Access Policy and Committee metadata properties** + +The *DAC* entity bundles the information that is required to identify the Data Controller of the deposited data. Therefore, a name and description for the *DAC*, and the main contact, have to be provided upon submission. The information about a contact includes the email address and the associated affiliation. + +A *DAP* is directly linked to the *DAC* and *Dataset* entities, thus providing the conditions under which the data deposited at GHGA can be re-used by a data requester. The submitter must provide an alias, name, description and either the policy text for the *DAP* or the URL where the *DAP* is stored. The *DAP* needs to be linked to the *DAC* and *Dataset*.
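
The mandatory-property rules for a *DAP* described above can be sketched as a small completeness check. This is only an illustration under stated assumptions: the class `DataAccessPolicy`, the function `check_policy` and all field names are hypothetical and are not taken from the GHGA Metadata Schema or the GHGA Validator.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataAccessPolicy:
    """Hypothetical in-memory DAP record; field names are assumptions."""

    alias: str
    name: str
    description: str
    data_access_committee: str  # alias of the linked DAC
    datasets: List[str] = field(default_factory=list)  # aliases of linked datasets
    policy_text: Optional[str] = None
    policy_url: Optional[str] = None


def check_policy(dap: DataAccessPolicy) -> List[str]:
    """Return a list of human-readable problems; an empty list means complete."""
    problems = []
    # Alias, name, description and the DAC link are described as mandatory.
    for prop in ("alias", "name", "description", "data_access_committee"):
        if not getattr(dap, prop).strip():
            problems.append(f"missing mandatory property: {prop}")
    # Either the policy text or a URL pointing to the policy is needed.
    if not (dap.policy_text or dap.policy_url):
        problems.append("either policy_text or policy_url must be provided")
    # The DAP also has to be linked to a dataset.
    if not dap.datasets:
        problems.append("the DAP must be linked to at least one dataset")
    return problems
```

A record that supplies all mandatory properties, a policy URL and a dataset link yields an empty list, while a record with an empty description and no policy text, URL or dataset link yields three problems.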
+ +To systematically and semantically identify the conditions under which deposited data can be reused, data submitters can optionally provide DUO terms that are used to identify the research purpose under which the data can be requested, e.g. General Research Use (DUO:0000042) or research specific restrictions (DUO:0000012). + +## **Submission Spreadsheet** + +The Submission Spreadsheet for GHGA Archive captures the above-mentioned metadata in an ordered fashion. Data submitters are given a predefined set of properties to describe the data that they aim to deposit. Furthermore, in the initial phase, data submitters are asked to provide additional information using key-value pairs. That means that metadata properties which are not yet covered within the metadata catalog for GHGA Archive can be provided with a descriptive property title and the corresponding value. This leaves submitters free to provide as much information as is captured in their study. Since information submitted through the attributes property cannot be controlled by GHGA, attributes are considered restricted metadata and will not be visible in the GHGA Catalog. diff --git a/user_docs/metadata/modules.md b/user_docs/metadata/modules.md index a55ecc0..e8fdfc0 100644 --- a/user_docs/metadata/modules.md +++ b/user_docs/metadata/modules.md @@ -1 +1,15 @@ -# Modules +# **Modules in the GHGA Metadata Model** + +- **Basic Module**: The Basic Module is the fundamental module in the GHGA Metadata Schema. It covers the minimal amount of information that must be included in a successful submission. + +- **Sample Module**: Every Basic Module can be linked to one or more Sample Modules. This module contains information relating to the sample that was later sequenced in a sequencing experiment. + +- **Phenotype Module**: One Sample Module can have one or more Phenotype Modules.
This module can be used when a sample originated from a ‘Biospecimen’ or an ‘Individual’ and thus allows several Sample Modules to be grouped based on the sample origin. In addition, the Phenotype Module captures detailed information about phenotypes or individual demographics. + +- **Sequencing Module**: One Sample Module can also be linked to one or more Sequencing Modules. The Sequencing Module captures information about the ‘Sequencing Process’, such as the sequencing and library preparation protocols. + +- **Data Use Conditions Module**: The Data Use Conditions Module captures in granular detail what restrictions and use conditions are associated with a Data Access Policy. This module also captures the Data Access Committee that enforces the Data Access Policy requirements. + +- **Dataset Module**: The Dataset Module contains the ‘Dataset’ entity, which is a collection of one or more Files from one or more Modules. All Files within the Dataset Module are subject to the Data Access Policy that is captured in the Data Use Conditions Module. One Dataset Module can only be linked to one Data Use Conditions Module. + +- **Analysis Module**: A dataset can have one or more Analysis Modules where each Analysis Module links to one or more files as input to the Analysis, one or more files as output of the Analysis, and the ‘Analysis Process’ that captures how the analysis was performed. diff --git a/user_docs/metadata/overview.md b/user_docs/metadata/overview.md index b50d7c6..4a3d148 100644 --- a/user_docs/metadata/overview.md +++ b/user_docs/metadata/overview.md @@ -1 +1,21 @@ # The GHGA Metadata Model +## **Glossary** +- **Entity**: An Entity holds characteristics of a real-world object. Example: The Individual entity is described by the information (properties) for sex, year of birth and height.
+ + - Synonyms: class, table, object + +- **Property**: A Property is a single characteristic that can be used in combination with other characteristics to describe a real-world object. Example: The combination of the properties sex, year of birth and height describes the entity Individual (a real-world object). + + - Synonyms: attribute, element, field, slot + +- **FAIR**: Findable, Accessible, Interoperable, Reusable + +## **Introduction** +The German Human Genome-Phenome Archive (GHGA) provides a nation-wide resource for archiving, accessing and sharing of multi-omics data produced and processed in research and health care initiatives in Germany. GHGA aims to bring these data together and make it easier to find data for secondary use, by adopting and adhering to [FAIR data principles](https://doi.org/10.1038/sdata.2016.18). In order to meet the domain-specific requirements, we developed the GHGA Metadata Schema - a schema for representing information pertaining to various aspects of our data. + +This documentation serves as the description and reasoning behind the Metadata Model of GHGA, which encapsulates the metadata schema, its technical implementation, and resources to support submission of metadata. The Archive function of GHGA is envisioned to handle a wide variety of omics and research data. The Metadata Model is architecturally flexible and can be expanded with specific fields using domain- and technology-specific modules. + + +The core of the schema is built such that it can be expanded to accommodate genomic, epigenetic, transcriptomic, clinical, and other forms of medical data. Our initial focus is on research data from the Cancer and Rare Diseases communities. These communities can benefit greatly from improved exchange of data and associated metadata. + +Furthermore, we provide data submitters with a Submission Spreadsheet so that they can easily deposit their data within GHGA.
\ No newline at end of file diff --git a/user_docs/metadata/standards.md b/user_docs/metadata/standards.md new file mode 100644 index 0000000..90bc45b --- /dev/null +++ b/user_docs/metadata/standards.md @@ -0,0 +1,88 @@ +# **Concepts and Standards** +GHGA's metadata model follows several internationally renowned concepts, standards, and resources to provide a metadata schema to share data in a standardized and harmonized fashion. + +## **Resources and Standards** + +### **FAIR Data Principles** +As digitization becomes more important and technologies advance, NGS experiments and measurements produce large quantities of data. Every single dataset in this huge amount of data should be findable and usable for humans and computers equally. In 2016, a group of representatives of different disciplines - such as academia, industry, funding agencies and scholarly publishers - published the ["FAIR Guiding principles for scientific data management and stewardship"](https://doi.org/10.1038/sdata.2016.18). These principles provide guidance on what to consider when data is published so that automated and individual exploration, sharing, and reuse of the data is possible. FAIR data should be: Findable, Accessible, Interoperable and Reusable. + +### **FAIRsharing** +Thousands of standards, ontologies and vocabularies have been developed for a variety of communities in order to guide reproducible research. A central database for FAIR standards, repositories and policies is [FAIRsharing](https://fairsharing.org). The mission of the FAIRsharing community is to evaluate standards, databases, policies, and collections. These can be queried by the user’s specific field of interest and are categorized as Maintained / Not Maintained, Recommended / Not Recommended and Ready / Deprecated / Uncertain / In Dev.
+ +### **Global Alliance for Genomics & Health (GA4GH)** +The [Global Alliance for Genomics and Health (GA4GH)](https://www.ga4gh.org/) is a globally recognized standards body established to promote responsible sharing of genomic and health-related data. The main objective of this initiative is to bring together researchers, data scientists, healthcare providers and practitioners and other authorized users while protecting competing interests. GA4GH enables federated data sharing models while preserving data security, the ethical and regulatory framework, and authorization and access control for sensitive data. Data sharing standards give data providers confidence that their data is accessed in accordance with their data policies, without losing control over multiple downloads of the data. + +### **Research Data Alliance (RDA)** +The [Research Data Alliance (RDA)](https://www.rd-alliance.org/) is a community-based initiative formed between the European Commission (EU), the United States Government's National Science Foundation and National Institute of Standards and Technology, and the Australian Government’s Department of Innovation. The main purpose of this initiative is to allow researchers and data providers to practice open data sharing across technologies and geographical locations. Although RDA cannot be considered a standards development organization (SDO), it fosters synergies with SDOs and similar organizations so that RDA Recommendations can be fast-tracked to becoming standards, be incorporated in or contribute to common goals and activities, and overall foster adoption and build trust and engagement in new user communities. + +### **Genomic Standards Consortium (GSC)** +The [Genomic Standards Consortium (GSC)]() was one of the earliest initiatives launched to improve the description of genomes and metagenomes across communities.
The main objective of GSC was to implement new genomic standards, establish methods for capturing and exchanging metadata, and harmonize information across the genomics community. With the support of working groups, GSC has created the [Minimum Information about any Sequence (MIxS)]() standards, which cover genomes (MIGS), metagenomes (MIMS) and marker gene sequences (MIMARKS). + +### **Genomic Data Commons** +The [Genomic Data Commons (GDC)](https://datacommons.cancer.gov/repository/genomic-data-commons) was established by the National Cancer Institute (NCI) to boost the understanding of "large-scale, multidimensional data". To this end, the GDC generates datasets to systematize human tumor variations, especially encouraging the unification and sharing of data. GDC provides the cancer research community with a unified repository and cancer knowledge base that enables data sharing across cancer genomic studies in support of precision medicine. The GDC Data Dictionary is a resource that describes the GDC data model, which includes clinical, biospecimen, administrative, and genomic metadata that can be used in parallel with the genomic data generated by the GDC. +The properties and the values in the GDC data dictionary contain references to external standards which are defined and maintained by [NCI Thesaurus (NCIt)](https://ncithesaurus.nci.nih.gov/ncitbrowser/) and the [Cancer Data Standards Registry and Repository (caDSR)](https://datascience.cancer.gov/resources/metadata). + +## **Metadata standards** +Metadata provides context and provenance to raw data and methods and is essential to both discovery and validation. A metadata standard can be described as a high-level document that establishes a common way of structuring and understanding data, including principles and implementation issues for using the standard. Metadata standards offer conventions for the generation and description of research data. They specify and define the structure of metadata.
+ +### **DublinCore (DC) metadata element set** +The [Dublin Core™ Metadata Element Set](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) is a vocabulary of fifteen properties for use in resource description. The fifteen element "Dublin Core™" described in this standard is part of a larger set of metadata vocabularies and technical specifications maintained by the [Dublin Core™ Metadata Initiative (DCMI)](https://www.dublincore.org/). The full set of vocabularies, DCMI Metadata Terms \[DCMI-TERMS\], also includes sets of resource classes (including the DCMI Type Vocabulary \[DCMI-TYPE\]), vocabulary encoding schemes, and syntax encoding schemes. The terms in DCMI vocabularies are intended to be used in combination with terms from other, compatible vocabularies in the context of application profiles and on the basis of the DCMI Abstract Model \[DCAM\]. + +### **Minimum Information (MI) Standards** +[Minimum information (MI) standards](https://www.ebi.ac.uk/training/online/courses/bioinformatics-terrified/what-makes-a-good-bioinformatics-database/minimum-information-standards/) are sets of guidelines and formats for reporting data derived by specific high-throughput methods. The purpose of MI standards is to ensure the data generated by high-throughput methods can be easily verified, analyzed and interpreted by the wider scientific community. Minimal information standards are available for a vast variety of experiment types including microarray ([MIAME](https://fairsharing.org/FAIRsharing.32b10v)), RNAseq ([MINSEQE](https://fairsharing.org/FAIRsharing.a55z32)), metabolomics ([MSI](https://github.com/MSI-Metabolomics-Standards-Initiative/CIMR)) and proteomics ([MIAPE](https://www.psidev.info/miape)). 
+ +### **Minimum information about a Microarray Experiment (MIAME)** +The [MIAME](https://fairsharing.org/FAIRsharing.32b10v) guidelines were created by the [Functional Genomics Data Society (FGED)](https://www.fged.org/) to describe the standards for recording and reporting microarray-based gene expression data to facilitate data access and interpretation from public repositories such as [Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/). + +### **Minimum Information about a Proteomic Experiment (MIAPE)** +The [MIAPE](https://www.psidev.info/miape) guidelines were developed by the [Human Proteome Organization's Proteomics Standards Initiative](https://www.hupo.org/Proteomics-Standards-Initiative-(PSI)) for reporting the use of techniques such as gel electrophoresis and mass spectrometry. One of the main objectives of MIAPE is to increase the value derived by the scientific community from ongoing experimentation in proteomics. This is established through community processes that support sharing, dissemination and reanalysis of datasets, and that assist in establishing and promoting best practice in specific technical areas. The most important MI standards for genomics are listed below. + +### **Minimum Information about a high-throughput Nucleotide Sequencing Experiment (MINSEQE)** +[MINSEQE](https://fairsharing.org/FAIRsharing.a55z32) describes the minimum information about a high-throughput nucleotide sequencing experiment that is needed to enable the unambiguous interpretation and facilitate reproduction of the results of the experiment. By analogy to the [MIAME](https://fairsharing.org/FAIRsharing.32b10v) guidelines for microarray experiments, adherence to the [MINSEQE](https://fairsharing.org/FAIRsharing.a55z32) guidelines will improve integration of multiple experiments across different modalities, thereby maximizing the value of high-throughput research.
The five main elements of an experimental description required for [MINSEQE](https://fairsharing.org/FAIRsharing.a55z32) compliance are: a description of the experiment and the sample under study, sequence read data for each assay, final processed data for the study, information about the experiment-sample relationships, and the experiment and sample processing protocols. + + +## **Ontologies** + +To ensure that the metadata that is collected in GHGA is of high quality, we support a selection of ontologies for certain properties where their values can be one or more concept terms from these ontologies. The ontologies were chosen based on their suitability to represent the knowledge specific to genomic medicine. They have a wide adoption and community support, which increases their interoperability and reusability. + +### **Bioscientific data analysis ontology** +[Bioscientific data analysis ontology (EDAM)](https://bioportal.bioontology.org/ontologies/EDAM) is a straightforward ontology that consists of commonly known and widely used concepts in the field of bioinformatics, such as data types and identifiers, data formats, operations, and topics. It offers a collection of terms that come with definitions and synonyms, all organized into an easily understandable hierarchy for convenient usage. We recommend the use of concepts from EDAM to represent file formats. For example, instead of using free text ‘FASTQ’ to represent a file in FASTQ-format, we recommend using the appropriate concept **Format:1930 FASTQ**. + +### **BRENDA Tissue Ontology** +The [BRENDA Tissue Ontology (BTO)](https://obofoundry.org/ontology/bto.html) provides a structured controlled vocabulary to describe the source of an enzyme. The ontology contains terms to represent tissues, cell lines, cell types and cell cultures. These terms span uni- and multicellular organisms. We recommend the use of concepts from BTO to represent anatomical location/site associated with a Biospecimen and/or a Sample.
For example, instead of using free text ‘heart tissue’ to represent the site from which a Biospecimen was derived, we would recommend using the appropriate concept **BTO:0004293 heart endothelium**. + +### **Data Use Ontology** +Endorsed by GA4GH, the [Data Use Ontology (DUO)](https://obofoundry.org/ontology/duo.html) allows users to tag datasets with usage restrictions, allowing them to become automatically discoverable based on a health, clinical, or biomedical researcher’s authorization level or intended use. We recommend the use of concepts from DUO to represent the use restrictions associated with a Dataset. For example, instead of having use restrictions as free text in a Data Access Policy, we would recommend using the appropriate concepts from DUO to better represent the granularity of use conditions and restrictions. + +### **Experimental Factor Ontology** +The [Experimental Factor Ontology (EFO)](https://www.ebi.ac.uk/efo) provides a systematic description of many experimental variables available in databases like those from the EBI. EFO combines parts of several biological ontologies, such as UBERON anatomy, ChEBI chemical compounds, and Cell Ontology. EFO is endorsed by [EMBL-EBI](https://www.ebi.ac.uk/services), [EGA](https://ega-archive.org), and [ENA](https://www.ebi.ac.uk/ena/browser). We recommend the use of concepts from EFO to represent experimental factors that are typically associated with studies. For example, instead of using free text ‘Exome sequencing’ to signify the type of an Experiment, we would recommend using the appropriate concept **EFO:0005396 Exome sequencing**. + +### **The Gender, Sex, and Sexual Orientation Ontology** +The [Gender, Sex, and Sexual Orientation Ontology (GSSO)](https://obofoundry.org/ontology/gsso.html) offers terms to describe gender, sex, and sexual orientation. It is aimed at interdisciplinary research in the biomedical and related sciences.
We recommend the use of concepts from GSSO to represent the biological sex of an individual. For example, instead of using free text ‘female’ to represent the sex of an individual, we would recommend using the appropriate concept **GSSO:011317 female sex for clinical use**. + +### **Human Ancestry Ontology** +The [Human Ancestry Ontology (HANCESTRO)](https://obofoundry.org/ontology/hancestro) provides a systematic description of the ancestry concepts. HANCESTRO was originally built for NHGRI-GWAS Catalog and has since then been used by other consortia like the GA4GH, and the [Human Cell Atlas](https://www.humancellatlas.org). We recommend the use of concepts from HANCESTRO to represent the ancestry of an Individual. For example, instead of using ‘European ancestry’ to represent the ancestry of an Individual, we would recommend using the appropriate concept **HANCESTRO:0005 European**. + +### **Human Phenotype Ontology** +The [Human Phenotype Ontology (HPO)](https://obofoundry.org/ontology/hp) provides a standardized vocabulary of phenotypic abnormalities encountered in human disease. HPO is used by various consortia like the [GA4GH](https://www.ga4gh.org), [Solve-RD](https://solve-rd.eu), and [IRDiRC](https://irdirc.org). We recommend the use of concepts from HPO to represent phenotypic abnormalities that characterize a Biospecimen and/or an Individual. For example, instead of using free text ‘Heart attack’ to represent an Individual who has suffered from a heart attack, we would recommend using the appropriate concept **HP:0001658 Myocardial infarction**. + +### **International Classification of Diseases** +The [International Classification of Diseases (ICD)](https://www.who.int/classifications/classification-of-diseases) is widely used across the world and is a crucial source of information on the prevalence, causes, and outcomes of human disease and mortality. 
Through the use of standardized coding, clinical information can be collected and recorded using ICD in primary, secondary, and tertiary care settings, as well as on death certificates. These data form the foundation for disease surveillance and statistical analysis, which inform healthcare planning, payment systems, quality control, and research. In addition, ICD's diagnostic categories facilitate consistent data collection and enable large-scale research studies. We recommend the use of classifications from ICD to represent a diagnosis associated with an Individual. For example, instead of using free text ‘Malignant neoplasm of thymus’ to represent that an Individual suffers from thymic carcinoma, we would recommend using the appropriate concept **C37 Malignant neoplasm of thymus**. + +### **Mondo Disease Ontology** +The [Mondo Disease Ontology (Mondo)](https://obofoundry.org/ontology/mondo) provides a unified disease terminology that yields precise equivalences between disease concepts across various terminologies like OMIM, Orphanet, EFO, and DOID. Mondo is used by several consortia like GA4GH, [ClinGen](https://clinicalgenome.org), and [Gabriella Miller Kids First](https://kidsfirstdrc.org). We recommend the use of concepts from Mondo to represent diseases associated with a Biospecimen and/or an Individual. For example, instead of using free text ‘Myocardial infarction’ to represent an Individual who has suffered from a heart attack, we would recommend using the appropriate concept **MONDO:0005068 Myocardial infarction**. + +### **National Cancer Institute Thesaurus** +The [National Cancer Institute Thesaurus (NCIt)](https://obofoundry.org/ontology/ncit.html ) is a reference terminology covering the cancer domain, including diseases, abnormalities, anatomy, drugs, genes, and more. It provides granular and consistent terminology in certain areas like cancer diseases and combination chemotherapies. 
The terminology is a combination from numerous cancer research domains and enables integration of information through semantic relationships. We recommend the use of concepts from NCIt to represent the case or control status associated with a Sample. For example, instead of using free text ‘True Case Status’ to represent the case status of a sample, we would recommend using the appropriate concept **NCIT:C99269 True Case Status**. + +### **Orphanet Rare Disease Ontology** +Orphanet and the EBI have collaborated to develop the [Orphanet Rare Disease Ontology (ORDO)](https://www.ebi.ac.uk/ols/ontologies/ordo), which provides a well-organized and structured vocabulary for rare diseases. This ontology captures the relationships between diseases, genes, and other relevant features, and it serves as a valuable resource for the computational analysis of rare diseases. The Orphanet database, which is a multilingual database dedicated to rare diseases that is populated from literature and validated by international experts, serves as the basis for the ORDO. The ORDO incorporates a nosology, which is a classification system for rare diseases, as well as relationships such as gene-disease connections and epidemiological data. Additionally, the ORDO is connected to other terminologies like [Medical Subject Headings (MeSH)](https://www.nlm.nih.gov/mesh/meshhome.html), [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/index.html), and [Medical Dictionary for Regulatory Activities (MedDRA)](https://www.ich.org/page/meddra), databases like [OMIM](https://www.omim.org/), [UniProtKB](https://www.uniprot.org/), [HGNC](https://www.genenames.org/), [Ensembl](https://www.ensembl.org/index.html), [Reactome](https://reactome.org/), The [International Union of Basic and Clinical Pharmacology (IUPHAR)](https://www.guidetopharmacology.org/), [GenAtlas](https://bio.tools/genatlas), and classifications like ICD10. 
We recommend the use of concepts from ORDO to represent phenotypic features that characterize a Biospecimen and/or Individual. For example, instead of using free text ‘Duchenne muscular dystrophy’ to represent an Individual with Duchenne, we recommend using the appropriate concept **ORPHA:98896 Duchenne muscular dystrophy**. + +### **Systematic Nomenclature of Medicine Clinical Terms** +[SNOMED Clinical Terms (SNOMED CT)](https://www.ebi.ac.uk/ols/ontologies/snomed) is a computerized repository of medical terms that are systematically organized for easy processing. This collection includes codes, terms, synonyms, and definitions that are commonly used in clinical documentation and reporting. We recommend the use of concepts from SNOMED CT to represent a sampling method associated with a Biospecimen and/or a Sample. For example, instead of using free text ‘Bone marrow sampling’ to represent the method used to isolate a sample, we would recommend using the appropriate concept **SNOMEDCT:234326005 Bone marrow sampling**. + +### **Semanticscience Integrated Ontology** +The [Semanticscience Integrated Ontology (SIO)](https://doi.org/10.1186/2041-1480-5-14) is a biomedical ontology for knowledge discovery. It describes diverse objects, processes, and attributes (real or hypothetical) using simple design patterns. SIO has extensions for chemistry, biology, biochemistry, and bioinformatics. It underpins the Bio2RDF linked data project and aids semantic integration for SADI-based web services. To unambiguously indicate the meaning of properties in the GHGA Metadata Schema they are linked to SIO terms, e.g. **SIO:000089 dataset**. + +### **Uber-Anatomy Ontology** +[Uber-Anatomy Ontology (Uberon)](http://obophenotype.github.io/uberon/) is an integrated cross-species anatomy ontology representing a variety of entities classified according to traditional anatomical criteria such as structure, function and developmental lineage. 
Uberon is used in various databases like ENA, EGA, and EBI [BioSamples](https://www.ebi.ac.uk/biosamples). We recommend the use of concepts from Uberon to represent anatomical location/site associated with a Biospecimen and/or a Sample. For example, instead of using free text ‘heart tissue’ to represent the site from which a Biospecimen was derived, we would recommend using the appropriate concept **UBERON:0008307 heart endothelium**.