examine and refine how we handle external link-prefixes in GigaDB #824

Open
17 tasks
only1chunts opened this issue Oct 21, 2021 · 3 comments

Comments

@only1chunts
Member

only1chunts commented Oct 21, 2021

User story

As a curator
I want to be able to add accession numbers of externally hosted linked data
So that we can link directly to relevant data hosted in external repositories such as INSDC

Acceptance criteria

Given I have a BioProject accession number (e.g. PRJNA144099) that I wish to add to a GigaDB dataset
When I add the accession number with the prefix "BioProject:" to the dataset_link table, e.g. "BioProject:PRJNA144099"
Then a link to the relevant URL is included on the GigaDB dataset page, chosen by the logged-in user's preference, or by the default for users who are not logged in:
default = https://www.ebi.ac.uk/ena/browser/view/PRJNA144099
NCBI = https://www.ncbi.nlm.nih.gov/bioproject/PRJNA144099
EBI = https://www.ebi.ac.uk/ena/browser/view/PRJNA144099
DDBJ = DDBJ does not display BioProjects submitted to NCBI or EBI, so a DDBJ link is not valid for all BioProject accessions
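
A minimal sketch of the lookup these criteria describe, written in Python purely for illustration; it is not GigaDB's actual implementation. The URL templates are taken from the criteria above. The names `BIOPROJECT_URLS` and `resolve_bioproject_link`, the `{acc}` placeholder, and the choice to fall back to the default when a preference has no template (as for DDBJ here) are all assumptions.

```python
from typing import Optional

# URL templates taken from the acceptance criteria; "{acc}" marks the accession.
BIOPROJECT_URLS = {
    "default": "https://www.ebi.ac.uk/ena/browser/view/{acc}",
    "NCBI": "https://www.ncbi.nlm.nih.gov/bioproject/{acc}",
    "EBI": "https://www.ebi.ac.uk/ena/browser/view/{acc}",
    # No DDBJ template: DDBJ does not display every BioProject accession.
}


def resolve_bioproject_link(link: str, preference: Optional[str] = None) -> str:
    """Turn 'BioProject:PRJNA144099' into a URL for the given preference."""
    prefix, sep, accession = link.partition(":")
    if not sep or prefix != "BioProject" or not accession:
        raise ValueError(f"not a BioProject link: {link!r}")
    # Assumption: a missing or unknown preference falls back to the default.
    template = BIOPROJECT_URLS.get(preference or "default", BIOPROJECT_URLS["default"])
    return template.format(acc=accession)
```

Calling `resolve_bioproject_link("BioProject:PRJNA144099", "NCBI")` would give the NCBI URL listed above, and calling it with no preference would give the default (EBI) URL.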

Additional Info

The user story for the website user perspective is #17
INSDC archives such as SRA, BioProject, BioSample and GenBank are mirrored in 3 different repositories around the world: NCBI in the USA, EBI in Europe and DDBJ in Asia. People have their own preferences about which of these repositories they use, and we currently attempt to let registered users choose which one they are sent to. This complicates the link-prefix table, which is why the entire method currently in use probably needs an overhaul!

NB: for BioProjects, the accession format (and so the regex) differs depending on where the accession originated; see list here
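
As a rough sketch of what such a regex could look like: the snippet below assumes the common INSDC convention that PRJN* accessions are issued by NCBI, PRJE* by EBI and PRJD* by DDBJ. The pattern and the function name `bioproject_origin` are illustrative assumptions; the linked list remains the authoritative source.

```python
import re
from typing import Optional

# Assumed pattern: "PRJ", an origin letter (D, E or N), one more letter, digits.
BIOPROJECT_RE = re.compile(r"^PRJ([DEN])[A-Z]\d+$")

# Assumed mapping from the origin letter to the issuing archive.
ORIGINS = {"N": "NCBI", "E": "EBI", "D": "DDBJ"}


def bioproject_origin(accession: str) -> Optional[str]:
    """Return the assumed issuing archive, or None if the format is not recognised."""
    match = BIOPROJECT_RE.match(accession)
    return ORIGINS[match.group(1)] if match else None
```

A check like this could decide whether a DDBJ link is offered at all, since DDBJ does not display BioProjects submitted to NCBI or EBI.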

We will need the ability to add new link prefixes quickly and easily, hence the current admin page for link prefixes.

The current list of link prefixes needs tidying up! Frankly, it's so bad I don't even know how it's still working!
Things to correct:
EBI/NCBI/DDBJ have been added to various entries that are not even mirrored in those 3 institutes!
At least 2 prefixes are present for ontologies (DOID, MEDDRA); no idea why, or whether they are used for anything. I can't see any reason why they should be included here.
yahoo? They don't provide accessions?!
http? Why is that there?
An old entry for EGA with an outdated URL remains, even though there is also a new one!
PXD = ProteomeXchange
ERA has changed name to ENA
PROJECT should be BioProject
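
One way the renames and questionable entries in the list above could be handled during a tidy-up is sketched below. The mapping only encodes what the list states; the exact spellings stored in the table, the names `PREFIX_RENAMES`, `SUSPECT_PREFIXES` and `tidy_prefix`, and the idea of flagging entries for review rather than deleting them are assumptions.

```python
from typing import Tuple

# Renames taken from the list above; other prefixes pass through unchanged.
PREFIX_RENAMES = {
    "PROJECT": "BioProject",
    "ERA": "ENA",
}

# Prefixes the list above questions; flagged for curator review, not deleted.
SUSPECT_PREFIXES = {"DOID", "MEDDRA", "yahoo", "http"}


def tidy_prefix(prefix: str) -> Tuple[str, bool]:
    """Return (cleaned prefix, needs_review) for one stored prefix."""
    stripped = prefix.strip()
    cleaned = PREFIX_RENAMES.get(stripped, stripped)
    return cleaned, cleaned in SUSPECT_PREFIXES
```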

We should include RRIDs

There is also a ticket #279 suggesting we add a column for regular expression value of accessions, which is a good idea

In addition, it would perhaps be useful to include a short description of each row to enable help icons on website to assist users in choosing the correct prefix (in the future).
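
To make the two suggested columns concrete, here is a hypothetical shape for a link-prefix row carrying both a regular expression (as proposed in #279) and a short description. The class name `LinkPrefix`, its field names, the `{acc}` placeholder and the example pattern are assumptions for illustration, not the current schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkPrefix:
    prefix: str        # e.g. "BioProject"
    url_template: str  # "{acc}" is replaced by the accession
    pattern: str       # regular expression an accession must fully match
    description: str   # short help text for choosing the correct prefix

    def accepts(self, accession: str) -> bool:
        """True if the accession matches this prefix's pattern."""
        return re.fullmatch(self.pattern, accession) is not None

    def url_for(self, accession: str) -> str:
        """Build the outgoing URL, refusing accessions that fail the pattern."""
        if not self.accepts(accession):
            raise ValueError(f"{accession!r} is not a valid {self.prefix} accession")
        return self.url_template.format(acc=accession)


bioproject = LinkPrefix(
    prefix="BioProject",
    url_template="https://www.ebi.ac.uk/ena/browser/view/{acc}",
    pattern=r"PRJ[DEN][A-Z]\d+",  # illustrative pattern, an assumption
    description="INSDC BioProject accession, e.g. PRJNA144099",
)
```

Keeping the pattern next to the URL template means an accession can be rejected when it is entered, rather than surfacing later as a broken link on the dataset page.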

Also we need to consider the implications of any changes made to the link prefix table on the display of datasets.

We may want to add mandatory checks in the admin interface before changes are actioned, i.e. two curators sign off on changes, or URLs are tested and confirmed or something else?!
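
As one possible form of the "URLs are tested" check, the sketch below validates a URL template offline before it is saved. The `{acc}` placeholder convention and the function name `check_url_template` are assumptions; a live HTTP request against a known-good accession could be added on top, but is left out here because it depends on network access.

```python
from typing import List
from urllib.parse import urlsplit

PLACEHOLDER = "{acc}"  # hypothetical placeholder convention


def check_url_template(template: str, sample_accession: str) -> List[str]:
    """Return a list of problems; an empty list means the template passed."""
    problems: List[str] = []
    if template.count(PLACEHOLDER) != 1:
        problems.append("template must contain the placeholder exactly once")
        return problems
    # Substitute a sample accession and inspect the resulting URL.
    parts = urlsplit(template.replace(PLACEHOLDER, sample_accession))
    if parts.scheme not in ("http", "https"):
        problems.append("URL scheme must be http or https")
    if not parts.netloc:
        problems.append("URL has no host")
    return problems
```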

Here I list the accession number providers that we either already have or know we should be ready to accept:

  • INSDC - this includes: BioProject, BioStudies, BioSample, SRA_Study, SRA_Sample, SRA_Experiment, SRA_Run, SRA_Analysis, SRA_Submission
  • GenBank
  • RefSeq
  • GEO
  • AE
  • dbGaP
  • dbSNP
  • dbVar
  • TRACE
  • ProteomeXchange
  • PeptideAtlas
  • ProteomeCentral
  • MTBLS
  • MCZ
  • PRIDE
  • MG-RAST
  • CNGBdb - this includes: CNSA_BioProject, CNSA_BioSample, CNSA_... (need to check with CNGB what accessions they have!)
  • RRID

More info

  • Link prefixes admin

http://gigadb.gigasciencejournal.com:9170/adminLinkPrefix/update/id/23

Having no source should be legal (i.e. an empty source field should be allowed)

  • dataset links admin

http://gigadb.gigasciencejournal.com:9170/adminLink/admin

At the moment, the prefix and the accession number are in the same column; they should be separated.
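
A small sketch of how existing combined values might be split when migrating to separate columns. The function name `split_link` and the decision to return `None` as the prefix for values without a colon (so they can be reviewed by hand) are assumptions.

```python
from typing import Optional, Tuple


def split_link(value: str) -> Tuple[Optional[str], str]:
    """Split 'BioProject:PRJNA144099' into ('BioProject', 'PRJNA144099').

    Values without a colon come back with a prefix of None so that they
    can be reviewed by hand rather than guessed at.
    """
    prefix, sep, accession = value.strip().partition(":")
    if not sep:
        return None, prefix
    return prefix.strip(), accession.strip()
```

Only the first colon is treated as the separator, so an accession that itself contains a colon is kept intact.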

copied from #824

This could be considered part of Epic #597, but be aware that the accession links are NOT stored in the "external_links" table; they are in the "links" table. This MIGHT be a good time to revisit the logic of the schema design and to merge the external_links and links tables into 1 table, but that will need some thought and possibly has further-reaching consequences.

Certain "Attributes" should be linked to external resources. NB: those links should all open in new windows (i.e. anything linking away from GigaDB.org should open a new window, leaving the gigadb.org page in the background).
Specifically, all accessions, e.g.
Attribute_name should_link_to {EBI} OR {NCBI}, depending on personal preferences.

alternative accession-BioSample {http://www.ebi.ac.uk/ena/data/view/_accession_ }{http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_ }

alternative accession-BioProject {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/bioproject/_accession_}

alternative accession-SRA_project {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-SRA_sample {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-SRA_experiment {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-SRA_file {http://www.ebi.ac.uk/ena/data/view/_accession_ } {http://www.ncbi.nlm.nih.gov/biosample/?term=_accession_}

alternative accession-GEO {need to look up GEO URLs}

links to additional analysis {value will be URL or DOI and should be hyperlinked}
relevant electronics resources {value will be URL or DOI and should be hyperlinked}
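
The attribute rules above could be sketched roughly as follows: fill the `_accession_` placeholder in a template, turn URL-or-DOI values into hyperlinks, and have every outgoing link open in a new window. The function names, the DOI pattern and the use of `https://doi.org/` to resolve bare DOIs are assumptions made for illustration.

```python
import re
from html import escape

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")  # common DOI shape, an assumption


def external_anchor(url: str, label: str) -> str:
    """Render a link that opens in a new window, leaving GigaDB in the background."""
    return (
        f'<a href="{escape(url, quote=True)}" target="_blank" '
        f'rel="noopener noreferrer">{escape(label)}</a>'
    )


def accession_link(template: str, accession: str) -> str:
    """Fill the _accession_ placeholder used in the templates above."""
    return external_anchor(template.replace("_accession_", accession), accession)


def url_or_doi_link(value: str) -> str:
    """Hyperlink a value that is either a URL or a bare DOI."""
    value = value.strip()
    url = f"https://doi.org/{value}" if DOI_RE.match(value) else value
    return external_anchor(url, value)
```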

Product Backlog Item Ready Checklist

  • Business value is clearly articulated
  • Item is understood enough by the IT team so it can make an informed decision as to whether it can complete this item
  • Dependencies are identified and no external dependencies would block this item from being completed
  • At the time of the scheduled sprint, the IT team has the appropriate composition to complete this item
  • This item is estimated and small enough to comfortably be completed in one sprint
  • Acceptance criteria are clear and testable
  • Performance criteria, if any, are defined and testable
  • The Scrum team understands how to demonstrate this item at the sprint review

Product Backlog Item Done Checklist

  • Code is complete
  • Automated tests related to the changes are implemented and passing
  • All automated test suites are passing locally
  • Code is refactored to best practices and coding standards
  • Documentation is updated as needed
  • A Pull Request has been created and review requested
  • Pull Request is reviewed and approved
  • The item has been merged to the develop branch
  • All automated test suites are passing on continuous Integration pipeline and item is ready to release
@only1chunts
Member Author

It might be worth checking how the CURIEs idea in #424 would be implemented, as it's a synonymous system: the Names to Things application might be used here, or the Bioregistry might be used there.

@only1chunts
Member Author

@cthoyt is keen to encourage us to use the Bioregistry for this task, and they have various tools that may make it easier for us to implement it, including valid regex for various things that we use. It is worth having a discussion with them before starting work on it.

@cthoyt

cthoyt commented Apr 28, 2023

Yes, I'm also happy to make any improvements to the existing software/data to support your use case. We're also thinking about reimplementations in other languages, too, if a combination of Python packages and web API endpoints isn't sufficient.
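
For context, a very small sketch of what delegating link resolution to the Bioregistry might look like, without depending on its Python package. It assumes the public resolver at bioregistry.io accepts paths of the form `prefix:identifier` and that prefixes are matched in lowercase; the function name `bioregistry_url` is hypothetical, and the actual prefixes and endpoints would need confirming with the Bioregistry maintainers.

```python
def bioregistry_url(curie: str) -> str:
    """Build a resolver URL for a CURIE such as 'BioProject:PRJNA144099'."""
    prefix, sep, identifier = curie.strip().partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    # Assumption: the resolver matches prefixes in lowercase.
    return f"https://bioregistry.io/{prefix.lower()}:{identifier}"
```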

@only1chunts only1chunts added this to the Increase FAIRness milestone Aug 2, 2024
@rija rija moved this to To Estimate in Backlog: GigaDB Database Nov 4, 2024