examine and refine how we handle external link-prefixes in GigaDB #824

only1chunts · 2021-10-21T13:31:33Z

User story

As a curator
I want to be able to add accession numbers of externally hosted linked data
So that we can link directly to relevant data hosted in external repositories such as INSDC

Acceptance criteria

Given I have a BioProject accession number (e.g. PRJNA144099) that I wish to add to a GigaDB dataset
When I add the accession number with the prefix "BioProject:" to the dataset_link table, e.g. "BioProject:PRJNA144099"
Then a link to the relevant URL is included on the GigaDB dataset page, chosen by the logged-in user's preference, or by the default for users who are not logged in:
default = https://www.ebi.ac.uk/ena/browser/view/PRJNA144099
NCBI = https://www.ncbi.nlm.nih.gov/bioproject/PRJNA144099
EBI = https://www.ebi.ac.uk/ena/browser/view/PRJNA144099
DDBJ = DDBJ does not display BioProjects submitted to NCBI or EBI, so a DDBJ link is not valid for all BioProject accessions
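
The lookup described in these acceptance criteria could be sketched roughly as follows, in Python for illustration only. The `LINK_TEMPLATES` table, the `{acc}` placeholder and the `resolve_link` function are assumptions made for this sketch, not the existing GigaDB schema or code; the URLs are the ones listed above.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the link-prefix table:
# prefix -> {source -> URL template}. "{acc}" marks where the accession goes.
LINK_TEMPLATES = {
    "BioProject": {
        "EBI": "https://www.ebi.ac.uk/ena/browser/view/{acc}",
        "NCBI": "https://www.ncbi.nlm.nih.gov/bioproject/{acc}",
        # No DDBJ entry: DDBJ cannot resolve every BioProject accession.
    },
}

# Assumption: the default listed above is the EBI (ENA) URL.
DEFAULT_SOURCE = "EBI"


def resolve_link(prefix: str, accession: str,
                 preferred: Optional[str] = None) -> Optional[str]:
    """Return the URL for an accession, falling back to the default source."""
    templates = LINK_TEMPLATES.get(prefix)
    if templates is None:
        return None  # unknown prefix: nothing to link to
    # Use the user's preferred source only if this prefix has a template for it.
    template = templates.get(preferred) if preferred else None
    if template is None:
        template = templates.get(DEFAULT_SOURCE)
    if template is None:
        return None
    return template.format(acc=accession)
```

A user who is not logged in would call `resolve_link("BioProject", "PRJNA144099")`; a user who prefers NCBI would pass `"NCBI"` as the third argument. A preference with no template for that prefix (DDBJ in this sketch) falls back to the default rather than producing a broken link.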

Additional Info

The user story for the website user perspective is #17
INSDC archives such as SRA, BioProject, BioSample and GenBank are mirrored in three repositories around the world: NCBI in the USA, EBI in Europe and DDBJ in Asia. People have their own preferences for which repository to use, and we currently attempt to let registered users choose which one they are sent to. This complicates the link-prefix table, which is why the entire method currently in use probably needs an overhaul.

NB: for BioProjects, the accession format (and so the regex that matches it) depends on where the accession originated; see list here

We will need the ability to add new link prefixes quickly and easily, hence the current admin page for link prefixes.

The current list of link prefixes needs tidying up! Frankly, it's so bad I don't even know how it's still working!
Things to correct:
EBI/NCBI/DDBJ have been added to various entries that are not even mirrored in those three institutes!
At least two prefixes are present for ontologies (DOID, MEDDRA). I have no idea why, or whether they are used for anything; I can't see any reason why they should be included here.
yahoo? They don't provide accessions!
http? Why is that there?
An old entry for EGA with an outdated URL remains, even though there is also a new one!
PXD = ProteomeXchange
ERA has changed name to ENA
PROJECT should be BioProject

We should include RRIDs

There is also a ticket #279 suggesting we add a column for regular expression value of accessions, which is a good idea
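
As a rough illustration of how such a regex column could be used, here is a minimal Python sketch. The `ACCESSION_PATTERNS` mapping and the `is_valid_accession` function are hypothetical, and the BioProject pattern is an illustrative guess that would need checking against the providers' published rules before use.

```python
import re
from typing import Optional

# Hypothetical contents of the proposed regex column, keyed by link prefix.
# The pattern below is an illustrative assumption, not an authoritative rule.
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJ[A-Z]{2}\d+"),
}


def is_valid_accession(prefix: str, accession: str) -> Optional[bool]:
    """Check an accession against the pattern stored for its prefix.

    Returns None when no pattern is recorded, so callers can tell
    "not checked" apart from "checked and rejected".
    """
    pattern = ACCESSION_PATTERNS.get(prefix)
    if pattern is None:
        return None
    # fullmatch: the whole accession must match, not just a leading part.
    return pattern.fullmatch(accession) is not None
```

Returning `None` for prefixes without a pattern is one possible design choice; it lets the column be filled in gradually without rejecting links whose prefix has no regex yet.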

In addition, it would perhaps be useful to include a short description of each row to enable help icons on website to assist users in choosing the correct prefix (in the future).

Also we need to consider the implications of any changes made to the link prefix table on the display of datasets.

We may want to add mandatory checks in the admin interface before changes are actioned, e.g. two curators sign off on changes, or URLs are tested and confirmed, or something else.
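
One cheap check that could run before a change is saved is a purely offline sanity test of the URL template. The sketch below is in Python and assumes a hypothetical `{acc}` placeholder convention and a hypothetical `template_problems` helper; it does not contact the remote site, so it complements rather than replaces testing that the URL actually resolves.

```python
from typing import List
from urllib.parse import urlparse

# Assumption: templates mark the accession position with this placeholder.
PLACEHOLDER = "{acc}"


def template_problems(template: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    parsed = urlparse(template)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
    if not parsed.netloc:
        problems.append("URL has no host name")
    if template.count(PLACEHOLDER) != 1:
        problems.append("URL must contain the accession placeholder exactly once")
    return problems
```

For example, `template_problems("https://www.ncbi.nlm.nih.gov/bioproject/{acc}")` returns an empty list, whereas an entry such as `"yahoo"` fails all three checks.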

Here I list the accession number providers that we either already have or know we should be ready to accept:

INSDC - this includes: BioProject, BioStudies, BioSample, SRA_Study, SRA_Sample, SRA_Experiment, SRA_Run, SRA_Analysis, SRA_Submission
GenBank
RefSeq
GEO
AE
dbGaP
dbSNP
dbVar
TRACE
ProteomeXchange
PeptideAtlas
ProteomeCentral
MTBLS
MCZ
PRIDE
MG-RAST
CNGBdb - this includes: CNSA_BioProject, CNSA_BioSample, CNSA_... (need to check with CNGB what accessions they have!)
RRID

More info

Link prefixes admin

http://gigadb.gigasciencejournal.com:9170/adminLinkPrefix/update/id/23

A link prefix with no source should be allowed

dataset links admin

http://gigadb.gigasciencejournal.com:9170/adminLink/admin

At the moment, the prefix and the accession number are stored in the same column; they should be separated.
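
Separating the two parts of an existing combined value such as "BioProject:PRJNA144099" could look roughly like this Python sketch. The `ParsedLink` type and `split_link` function are hypothetical names used for illustration, not part of the GigaDB code base.

```python
from typing import NamedTuple


class ParsedLink(NamedTuple):
    prefix: str
    accession: str


def split_link(value: str) -> ParsedLink:
    """Split 'prefix:accession' on the first colon only."""
    prefix, sep, accession = value.partition(":")
    prefix, accession = prefix.strip(), accession.strip()
    if not sep or not prefix or not accession:
        raise ValueError(f"expected 'prefix:accession', got {value!r}")
    return ParsedLink(prefix, accession)
```

Splitting on the first colon only means an accession that itself contains a colon stays intact, and values that lack a prefix or an accession are rejected rather than silently stored.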

copied from #824

This could be considered part of Epic #597, but be aware that the accession links are NOT stored in the "external_links" table; they are in the "links" table. This MIGHT be a good time to revisit the logic of the schema design and merge the external_links and links tables into one table, but that will need some thought and possibly has further-reaching consequences.

Certain "Attributes" should be linked to external resources. NB - those links should all open new windows (i.e. anything linking away from GigaDB.org should open a new window leaving the gigadb.org page in background).
Specifically all accessions;
e.g.
Attribute_name should_link_to{EBI} OR {NCBI} depending on personal preferences.

alternative accession-BioSample {http://www.ebi.ac.uk/ena/data/view/_accession_ }{http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_ }

alternative accession-BioProject {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-SRA_project {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-SRA_sample {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-SRA_experiment {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-SRA_file {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-GEO {need to look up GEO URLs}

links to additional analysis {value will be URL or DOI and should be hyperlinked}
relevant electronics resources {value will be URL or DOI and should be hyperlinked}
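
The requirement that links leaving GigaDB open in a new window could be met with markup along these lines. This is a minimal Python sketch; the `external_link` helper is hypothetical, and `rel="noopener noreferrer"` is an addition that is commonly paired with `target="_blank"` rather than something requested above.

```python
from html import escape


def external_link(url: str, label: str) -> str:
    """Build an anchor tag that opens the target in a new window or tab."""
    return (
        f'<a href="{escape(url, quote=True)}" target="_blank" '
        f'rel="noopener noreferrer">{escape(label)}</a>'
    )
```

Escaping both the URL and the label keeps values entered through the admin pages from breaking the page markup.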

Product Backlog Item Ready Checklist

Business value is clearly articulated
Item is understood enough by the IT team so it can make an informed decision as to whether it can complete this item
Dependencies are identified and no external dependencies would block this item from being completed
At the time of the scheduled sprint, the IT team has the appropriate composition to complete this item
This item is estimated and small enough to comfortably be completed in one sprint
Acceptance criteria are clear and testable
Performance criteria, if any, are defined and testable
The Scrum team understands how to demonstrate this item at the sprint review

Product Backlog Item Done Checklist

Code is complete
Automated tests related to the changes are implemented and passing
All automated test suites are passing locally
Code is refactored to best practices and coding standards
Documentation is updated as needed
A Pull Request has been created and review requested
Pull Request is reviewed and approved
The item has been merged to the develop branch
All automated test suites are passing on continuous Integration pipeline and item is ready to release

only1chunts · 2022-11-25T14:58:59Z

It might be worth checking how the CURIEs idea in #424 might be implemented, as it is a synonymous system: the Names to Things application might be used here, or the Bioregistry might be used there.

only1chunts · 2023-04-28T08:35:39Z

@cthoyt is keen to encourage us to use the Bioregistry for this task, and they have various tools that may make it easier for us to implement, including valid regexes for various things that we use. It is worth having a discussion with them before starting work on it.

cthoyt · 2023-04-28T08:56:33Z

Yes, I'm also happy to make any improvements to the existing software/data to support your use case. We're also thinking about reimplementations in other languages if a combination of Python packages and web API endpoints isn't sufficient.

only1chunts added backlog:Story asa:Curator labels Oct 21, 2021

rija added backlog:Epic and removed backlog:Story asa:Curator labels Dec 29, 2021

This was referenced Dec 29, 2021

Update admin interface for link prefix to allow no source and with better database schema modeling E824 #903

Open

link prefix reg-ex E824 #279

Closed

Enable validation of dataset links in submission wizard E824 #905

Open

only1chunts mentioned this issue Nov 25, 2022

sample attribute accession links to relevant external DBs #17

Open

cthoyt mentioned this issue Apr 27, 2023

Bioregistry in GigaDB biopragmatics/bioregistry#801

Open

only1chunts added this to the Increase FAIRness milestone Aug 2, 2024

rija added this to Backlog: GigaDB Database Nov 4, 2024

rija added the asa:Curator label Nov 4, 2024

rija moved this to To Estimate in Backlog: GigaDB Database Nov 4, 2024
