forked from microbiomedata/nmdc-schema
external_identifiers.yaml
id: https://w3id.org/nmdc/external_identifiers
name: external_identifiers
title: NMDC External Identifiers
description: >-
  External identifiers
license: https://creativecommons.org/publicdomain/zero/1.0/
imports:
  - attribute_values # in cycle with basic_classes
prefixes:
  gold: "https://bioregistry.io/gold:"
  linkml: https://w3id.org/linkml/
  nmdc: https://w3id.org/nmdc/
  skos: http://www.w3.org/2004/02/skos/core#
default_prefix: nmdc
default_range: string
slots:
  img_identifiers:
    title: IMG Identifiers
    is_a: external_database_identifiers
    description: >-
      A list of identifiers that relate the biosample to records in the IMG database.
    todos:
      - add is_a or mixin modeling, like other external_database_identifiers
      - what class would IMG records belong to?! Are they Studies, Biosamples, or something else?
    pattern: '^img\.taxon:[a-zA-Z0-9_][a-zA-Z0-9_\/\.]*$' # DOI-like pattern. Could be more conservative if we leave this out of the external_database_identifiers hierarchy
  ## mixins
  igsn_identifiers:
    mixin: true
  gold_identifiers:
    mixin: true
    see_also:
      - https://gold.jgi.doe.gov/
  emsl_identifiers:
    mixin: true
  mgnify_identifiers:
    mixin: true
    see_also:
      - https://www.ebi.ac.uk/metagenomics/
  insdc_identifiers:
    mixin: true
    aliases:
      - EBI identifiers
      - NCBI identifiers
      - DDBJ identifiers
    description: >-
      Any identifier covered by the International Nucleotide Sequence Database Collaboration
    comments:
      - note that we deliberately abstract over which of the partner databases accepted the initial submission
      - "the first letter of the accession indicates which partner accepted the initial submission: E for ENA, D for DDBJ, or S or N for NCBI."
    see_also:
      - https://www.insdc.org/
      - https://ena-docs.readthedocs.io/en/latest/submit/general-guide/accessions.html
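    # Illustration of the first-letter rule in the comments above, using SRA-style
    # accessions that appear elsewhere in this file (a sketch, not an exhaustive mapping):
    #   SRP121659 -> S -> NCBI
    #   DRS166340 -> D -> DDBJ
    #   ERR436051 -> E -> ENA
    # BioSample and BioProject accessions in this file (e.g. SAMEA5989477, PRJNA366857)
    # appear to carry the partner letter after the SAM / PRJ prefix rather than first.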
  neon_identifiers:
    mixin: true
    description: identifiers for entities according to NEON
  jgi_portal_identifiers:
    mixin: true
    description: identifiers for entities according to JGI Portal
    see_also:
      - https://data.jgi.doe.gov/
  gnps_identifiers:
    mixin: true
  ## studies
  study_identifiers:
    abstract: true
    is_a: external_database_identifiers
  jgi_portal_study_identifiers:
    is_a: study_identifiers
    mixins:
      - jgi_portal_identifiers
    id_prefixes:
      - jgi.proposal
    pattern: '^jgi\.proposal:\d+$'
    examples:
      - value: jgi.proposal:507130
    comments:
      - Could this be considered a related identifier?
      - Curie suffix is the Site Award Number from an OSTI award page
      - Site Award Number 507130 == award doi doi:10.46936/10.25585/60000017 -- GOLD study identifier gold:Gs0154044
      - bioregistry.io/jgi.proposal:507130 == https://genome.jgi.doe.gov/portal/BioDefcarcycling/BioDefcarcycling.info.html
    title: JGI Portal Study identifiers
    description: >-
      Identifiers that link an NMDC study to a website hosting raw and analyzed data for a JGI proposal.
      The suffix of the curie can be used to query the GOLD API and is interoperable with an award DOI from OSTI and a GOLD study identifier.
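    # A sketch of how one such value breaks down, using the example above:
    #   jgi.proposal:507130
    #     prefix: jgi.proposal
    #     suffix: 507130   (the Site Award Number, per the comments above)
    # How the suffix is passed to the GOLD API is not specified in this file.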
  neon_study_identifiers:
    is_a: study_identifiers
    mixins:
      - neon_identifiers
  insdc_sra_ena_study_identifiers:
    is_a: study_identifiers
    mixins:
      - insdc_identifiers
    aliases:
      - EBI ENA study identifiers
      - NCBI SRA identifiers
      - DDBJ SRA identifiers
    pattern: "^insdc\\.sra:(E|D|S)RP[0-9]{6,}$"
    description: identifiers for corresponding project in INSDC SRA / ENA
    examples:
      - value: https://bioregistry.io/insdc.sra:SRP121659
        description: Avena fatua rhizosphere microbial communities - H1_Rhizo_Litter_2 metatranscriptome
    see_also:
      - https://github.com/bioregistry/bioregistry/issues/109
      - https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies
  insdc_bioproject_identifiers:
    is_a: study_identifiers
    mixins:
      - insdc_identifiers
    aliases:
      - NCBI bioproject identifiers
      - DDBJ bioproject identifiers
    pattern: "^bioproject:PRJ[DEN][A-Z][0-9]+$"
    description: identifiers for corresponding project in INSDC Bioproject
    comments:
      - these are distinct IDs from INSDC SRA/ENA project identifiers, but are usually(?) one to one
    examples:
      - value: https://bioregistry.io/bioproject:PRJNA366857
        description: Avena fatua rhizosphere microbial communities - H1_Rhizo_Litter_2 metatranscriptome
    see_also:
      - https://www.ncbi.nlm.nih.gov/bioproject/
      - https://www.ddbj.nig.ac.jp/bioproject/index-e.html
  gold_study_identifiers:
    is_a: study_identifiers
    mixins:
      - gold_identifiers
    pattern: "^gold:Gs[0-9]+$"
    description: identifiers for corresponding project(s) in GOLD
    comments:
      - uses the prefix GS (but possibly in a different case)
    examples:
      - value: https://bioregistry.io/gold:Gs0110115
    see_also:
      - https://gold.jgi.doe.gov/studies
    title: GOLD Study Identifiers
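    # Reading the pattern above as a plain regular expression (an assumption about how
    # a validator applies it), the two spellings of the same identifier behave differently:
    #   gold:Gs0110115                          matches
    #   https://bioregistry.io/gold:Gs0110115   does not match, because of the leading ^ anchor
    # The gold prefix declared at the top of this file expands the first form into the second.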
  mgnify_project_identifiers:
    is_a: study_identifiers
    mixins:
      - mgnify_identifiers
    pattern: "^mgnify\\.proj:[A-Z]+[0-9]+$"
    description: identifiers for corresponding project in MGnify
    examples:
      - value: https://bioregistry.io/mgnify.proj:MGYS00005757
  gnps_task_identifiers:
    is_a: study_identifiers
    mixins:
      - gnps_identifiers
    title: GNPS task identifiers
    description: identifiers that link an NMDC study to a web-based report about metabolomics analysis progress and results
    comments:
      - this could be considered a related identifier, as the metabolomics progress and results aren't a study per se
      - this identifier was registered with bioregistry but not identifiers.org
    see_also:
      - https://microbiomedata.github.io/nmdc-schema/MetabolomicsAnalysis/
    examples:
      - value: gnps.task:4b848c342a4f4abc871bdf8a09a60807
    pattern: "^gnps\\.task:[a-f0-9]+$"
  emsl_project_identifiers:
    title: EMSL Project Identifiers
    is_a: study_identifiers
    mixins:
      - emsl_identifiers
    description: Identifiers that link an NMDC study to the EMSL user facility website hosting the project description of an EMSL user project
    see_also:
      - https://github.com/microbiomedata/nmdc-schema/issues/927#issuecomment-1802136437
    pattern: "^emsl\\.project:[0-9]{5}$"
    examples:
      - value: emsl.project:60141
    notes:
      - these identifiers are all currently 5 digits long but that could change in the future
    todos:
      - elaborate on description
  ## samples
  biosample_identifiers:
    abstract: true
    is_a: external_database_identifiers
  neon_biosample_identifiers:
    is_a: biosample_identifiers
    mixins:
      - neon_identifiers
  gold_biosample_identifiers:
    is_a: biosample_identifiers
    mixins:
      - gold_identifiers
    pattern: "^gold:Gb[0-9]+$"
    description: identifiers for corresponding sample in GOLD
    examples:
      - value: https://bioregistry.io/gold:Gb0312930
    range: uriorcurie
  insdc_biosample_identifiers:
    is_a: biosample_identifiers
    mixins:
      - insdc_identifiers
    aliases:
      - EBI biosample identifiers
      - NCBI biosample identifiers
      - DDBJ biosample identifiers
    pattern: "^biosample:SAM[NED]([A-Z])?[0-9]+$"
    description: identifiers for corresponding sample in INSDC
    examples:
      - value: https://bioregistry.io/biosample:SAMEA5989477
      - value: https://bioregistry.io/biosample:SAMD00212331
        description: I13_N_5-10 sample from Soil fungal diversity along elevational gradients
    see_also:
      - https://github.com/bioregistry/bioregistry/issues/108
      - https://www.ebi.ac.uk/biosamples/
      - https://www.ncbi.nlm.nih.gov/biosample
      - https://www.ddbj.nig.ac.jp/biosample/index-e.html
  insdc_secondary_sample_identifiers:
    is_a: biosample_identifiers
    mixins:
      - insdc_identifiers
    pattern: "^biosample:(E|D|S)RS[0-9]{6,}$"
    description: secondary identifiers for corresponding sample in INSDC
    comments:
      - "ENA redirects these to primary IDs, e.g. https://www.ebi.ac.uk/ena/browser/view/DRS166340 -> SAMD00212331"
      - MGnify uses these as their primary sample IDs
    examples:
      - value: https://bioregistry.io/insdc.sra:DRS166340
        description: I13_N_5-10 sample from Soil fungal diversity along elevational gradients
  emsl_biosample_identifiers:
    title: EMSL Biosample Identifiers
    description: >-
      A list of identifiers for the biosample from the EMSL database. This is
      used to link the biosample, as modeled by NMDC, to the biosample in the planned EMSL NEXUS database.
    is_a: biosample_identifiers
    mixins:
      - emsl_identifiers
    todos:
      - remove "planned" once NEXUS is online
      - determine real expansion for emsl prefix
  igsn_biosample_identifiers:
    title: IGSN Biosample Identifiers
    description: >-
      A list of identifiers for the biosample from the IGSN database.
    is_a: biosample_identifiers
    mixins:
      - igsn_identifiers
  ## DataGeneration
  omics_processing_identifiers:
    abstract: true
    is_a: external_database_identifiers
  gold_sequencing_project_identifiers:
    is_a: omics_processing_identifiers
    mixins:
      - gold_identifiers
    pattern: "^gold:Gp[0-9]+$"
    description: identifiers for corresponding sequencing project in GOLD
    examples:
      - value: https://bioregistry.io/gold:Gp0108335
  insdc_experiment_identifiers:
    is_a: external_database_identifiers
    pattern: "^insdc\\.sra:(E|D|S)RX[0-9]{6,}$"
    mixins:
      - insdc_identifiers
  ## analysis run
  analysis_identifiers:
    abstract: true
    is_a: external_database_identifiers
  gold_analysis_project_identifiers:
    is_a: analysis_identifiers
    mixins:
      - gold_identifiers
    pattern: "^gold:Ga[0-9]+$"
    description: identifiers for corresponding analysis projects in GOLD
    examples:
      - value: https://bioregistry.io/gold:Ga0526289
  jgi_portal_analysis_project_identifiers:
    is_a: analysis_identifiers
    mixins:
      - jgi_portal_identifiers
    id_prefixes:
      - jgi.analysis
    pattern: '^jgi\.analysis:[0-9]+$'
    description: identifiers for corresponding analysis projects in JGI Portal
    examples:
      - value: https://data.jgi.doe.gov/search?q=1414320
        description: Metagenome - Draft Assembly YELL_051-M-20210705-comp-DNA1
  insdc_analysis_identifiers:
    is_a: analysis_identifiers
    comments:
      - in INSDC this is a run but it corresponds to a GOLD analysis
    pattern: "^insdc\\.sra:(E|D|S)RR[0-9]{6,}$"
    mixins:
      - insdc_identifiers
    examples:
      - value: https://www.ebi.ac.uk/metagenomics/runs/DRR218479
        description: Illumina MiSeq paired end sequencing of SAMD00212331
      - value: https://www.ebi.ac.uk/ena/browser/view/ERR436051
  mgnify_analysis_identifiers:
    is_a: analysis_identifiers
    notes:
      - 'removed pattern: "^mgnify:MGYA[0-9]+$" ## TODO https://github.com/bioregistry/bioregistry/issues/109'
    mixins:
      - mgnify_identifiers
    examples:
      - value: https://www.ebi.ac.uk/metagenomics/analyses/MGYA00002270
        description: combined analyses (taxonomic, functional) of sample ERS438107
  ## assemblies
  assembly_identifiers:
    abstract: true
  insdc_assembly_identifiers:
    is_a: assembly_identifiers
    pattern: "^insdc\\.sra:[A-Z]+[0-9]+(\\.[0-9]+)?$"
    mixins:
      - insdc_identifiers