NCBI taxonomy as a taxonomic authority #5

Open
SSuominen1 opened this issue Mar 18, 2021 · 16 comments

Comments

@SSuominen1
Contributor

Could NCBI taxonomy IDs be useful as taxonomic links?
Where will these be added?

@dianalg

dianalg commented Mar 31, 2021

To expand on this: To the best of my knowledge, right now, the scientificName field can only contain a Linnaean scientific name that matches on WoRMS, or an OTU identifier from BOLD or UNITE. However, many people working with DNA-derived occurrences obtain scientific names from NCBI taxonomy. Can we discuss why these names (or their associated taxonomy IDs) are not acceptable as scientificName values? Since the NCBI taxonomy seems to be a standard in this field of study, wouldn't we want to accommodate that? Right now, I'm handling this by going to progressively higher taxonomic ranks (genus --> family --> order, etc.) until I find a term that matches on WoRMS. But I'm not sure to what degree this maintains the integrity of the original data.

Secondly, records may have associated NCBI taxonomy IDs and/or GenBank IDs. If these are not acceptable in the scientificName and associated columns, where could they be included?

@claudenozeres

I am interested in hearing more about this. The NCBI Taxonomy Browser shows Linnaean scientific names, so these should be available for use. Then again, these names should also match in WoRMS, so they will be available even if the 'source' is not NCBI? Is the issue then that, under scientificNameID, one would like to use the NCBI Taxonomy ID instead of the AphiaID? Does GBIF allow other sources of scientificNameID, and only OBIS requires AphiaID? Example (hyperlinks on OBIS site):
Macoma calcarea on OBIS: https://obis.org/taxon/141580
Aphia ID, urn:lsid:marinespecies.org:taxname:141580
BOLD ID, 70992
NCBI ID, 1421134

For scientificNameID then, would use urn:lsid:marinespecies.org:taxname:141580, or it could be NCBI:txid1421134
But OBIS will only permit the WoRMS AphiaID--is this correct?
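
For reference, the two identifier styles in this example can be generated from the bare numeric IDs. This is only a minimal Python sketch: the function names are hypothetical, and the `NCBI:txid` prefix simply follows the form written above rather than any required standard.

```python
def aphia_lsid(aphia_id: int) -> str:
    """Format a WoRMS AphiaID as the LSID style shown above."""
    return f"urn:lsid:marinespecies.org:taxname:{aphia_id}"


def ncbi_txid(tax_id: int) -> str:
    """Format an NCBI taxonomy ID in the 'NCBI:txid' style used above."""
    return f"NCBI:txid{tax_id}"


# Macoma calcarea, using the IDs listed above
print(aphia_lsid(141580))
print(ncbi_txid(1421134))
```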

I imagine the challenge is if NCBI has names that are NOT available (or correct?) on WoRMS. In that case, it is a matter of updates between the two?
Example with the related species, Limecola petalum. https://obis.org/taxon/880026
Not linked to NCBI because WoRMS does not show NCBI for this name, but for older name Macoma petalum:
http://www.marinespecies.org/aphia.php?p=taxdetails&id=397131#links
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=425103&lvl=3&lin=f&keep=1&srchmode=1&unlock

Regarding the first example by @dianalg - there is discussion here on a verbatimScientificName tdwg/dwc#181
Could use an available field to record NCBI term if not matching specifically in WoRMS. Have to review notes--there have been other discussions on this.

Secondly, seems similar--need to identify what are the fields to use for other ID codes.

@claudenozeres

Suggestions that were recently made to me:
If a name is not in WoRMS, you can use the free-text field 'identificationRemarks': https://dwc.tdwg.org/terms/#identification
If a name is not in WoRMS, you can report it and it will be added. If it is not a published/recognized name, it can go on the 'Annotated List' of OBIS with an explanation of why it is not available.
This paper was for image-based identifications, but may be of relevance for genetic data, too: https://www.frontiersin.org/articles/10.3389/fmars.2021.620702/full

@albenson-usgs

albenson-usgs commented Apr 6, 2021

> But OBIS will only permit the WoRMS AphiaID--is this correct?

Yes, OBIS only accepts a WoRMS LSID in scientificNameID. GBIF does not have this requirement. You can use any name but it is matched to the GBIF Backbone Taxonomy.

@claudenozeres

claudenozeres commented Apr 6, 2021

Thanks @albenson-usgs. I am curious about usage, so I looked at 2 random marine examples on GBIF: 1) Somniosus microcephalus and 2) Leptasterias polaris. I note that (apart from 1 record with a dynatax.se LSID) only records from OBIS use scientificNameID, with a WoRMS LSID. Most records on GBIF (those not coming from OBIS) do not use scientificNameID, but rather taxonID (which is not an LSID).

Would it be acceptable to fill in taxonID with NCBI taxon code (or BOLD BIN--also used often), in addition to scientificNameID? Thus, have the WoRMS taxon name, but also information on the potential genetic identifier (BINs and NCBI do not always match 1:1 with WoRMS LSID :)

@claudenozeres

Note, see my test queries of 2 marine species on GBIF here:
https://doi.org/10.15468/dl.ffm32b
https://doi.org/10.15468/dl.4p6qph

@albenson-usgs

> Would it be acceptable to fill in taxonID with NCBI taxon code (or BOLD BIN--also used often), in addition to scientificNameID?

Good question, and actually I would extend it to ask: can we use an NCBI taxon code instead of a WoRMS LSID when there is no match in WoRMS? (I think that's Diana's question.) This would be a question for the OBIS Steering Group. Further, it has recently come to my attention that OBIS may not be using scientificNameID correctly; the term is defined as "An identifier for the nomenclatural (not taxonomic) details of a scientific name." This is only tangentially related to this topic but may factor into how we form our recommendations.

@kpitz

kpitz commented Apr 7, 2021

If WoRMS IDs are preferred over NCBI IDs, it would be very useful to have a look-up table linking these two standards. I worry that searching by a name, like a genus, in isolation might accidentally give you the WoRMS ID of an organism with the same name but from a totally different lineage than the NCBI sequence you matched. There are so many records that we can't manually check them all, so we rely on more automated searches and tools. If NCBI IDs are acceptable instead of WoRMS IDs, that would be much easier for us to use across our datasets.

@dianalg

dianalg commented Apr 7, 2021

Yes @claudenozeres, I have used vernacularName in the way you're proposing using verbatimScientificName in the past. But I don't know that that's really acceptable/best practice...

As @claudenozeres said and @albenson-usgs clarified, my real issue is whether an NCBI name/code could be used if there is no matching name on WoRMS. So, for example, I have the name "phototrophic eukaryote" as the assigned taxon in many rows of an eDNA dataset. This has a matching name and taxonomy ID on NCBI, but is a non-Linnaean term that does not match on WoRMS. But OBIS requires a WoRMS-approved name in the scientificName column.

Right now, my options are 1) work my way up the taxonomic tree until I get a rank for "phototrophic eukaryote" that matches on WoRMS or 2) remove the record before submitting the data to OBIS. Following strategy 1, I'm putting "Biota" in the scientificName column, which from what I can tell is WoRMS's accepted name for anything that's alive. I'm also putting the associated Aphia ID in the taxonID column, and "phototrophic eukaryote" in the vernacularName column.

That said, it seems like this kind of issue will be really common for genetically-derived data. And my work around (strategy 1 above) does run the risk that @kpitz is describing.
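
Strategy 1 above could be sketched roughly as follows. This is only an illustration: `first_worms_match` is a hypothetical helper, the lookup is a stand-in dictionary rather than a real WoRMS query, and the lineage is made up for the example (it is not the actual NCBI lineage). Because it matches on names alone, it inherits the homonym risk @kpitz describes.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_worms_match(
    lineage: Sequence[str],
    lookup: Callable[[str], Optional[int]],
) -> Optional[Tuple[str, int]]:
    """Return (name, AphiaID) for the most specific name that `lookup` resolves.

    `lineage` is ordered from most specific to least specific.
    `lookup` returns an AphiaID, or None when the name has no match.
    """
    for name in lineage:
        aphia_id = lookup(name)
        if aphia_id is not None:
            return name, aphia_id
    return None


# Stand-in for a real WoRMS name lookup: only "Biota" is known here.
known = {"Biota": 1}
# Illustrative lineage, most specific first (an assumption for this example).
lineage = ["phototrophic eukaryote", "unclassified eukaryotes", "Biota"]
print(first_worms_match(lineage, known.get))
```

Passing the lookup in as a function keeps the walk itself independent of whichever matching service is used.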

@claudenozeres

Regarding the look-up table mentioned by @kpitz: that would be an important tool and would help with adoption of WoRMS, so I would push for further work between WoRMS and NCBI, because I can see conflicts and a lack of attention. Going forward, if this becomes increasingly common and is not easy/satisfactory, an alternative would be not to use OBIS+WoRMS, but to publish on GBIF with the taxonomy of choice. I think the first option is more valuable if it leads to stronger connections and updates between resources, namely WoRMS, NCBI, and BOLD (links exist but are not very solid at the moment). This is similar to how OBIS names vastly improved once it adopted WoRMS as its taxonomic backbone, instead of continuing on its own.

@claudenozeres

I recommend Dhugal Lindsay et al. 2017 for an interesting summary of issues with occurrence datasets and genetically identified taxa. https://www.tandfonline.com/doi/full/10.1080/17451000.2016.1268261. They highlight several issues to be improved for sequence data on biodiversity portals.

@pieterprovoost
Member

Thanks everyone for your valuable feedback. While we intend to keep WoRMS as our taxonomic backbone (as the NCBI disclaimer states: "the NCBI taxonomy database is not an authoritative source for nomenclature or classification"), I'm going to discuss with WoRMS and the taxonomy task team to see if we can come up with some recommendations for using NCBI and BOLD identifiers and correct use of scientificNameID, taxonID, and taxonConceptID. We should be able to come up with a technical solution to match records to our backbone using alternate identifiers in case a WoRMS LSID is not available.

Note that the WoRMS API has an endpoint to get an Aphia record by NCBI ID, for example: https://www.marinespecies.org/rest/AphiaRecordByExternalID/94237?type=ncbi. I'm not sure how complete this is.
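
A minimal Python sketch of calling that endpoint, using only the standard library. The function names are hypothetical, and treating an empty response body as "no match" is an assumption about how the service behaves, not something confirmed in this thread.

```python
import json
import urllib.request
from typing import Optional

WORMS_REST = "https://www.marinespecies.org/rest"


def external_id_url(ncbi_id: int) -> str:
    """Build the AphiaRecordByExternalID URL for an NCBI taxonomy ID."""
    return f"{WORMS_REST}/AphiaRecordByExternalID/{ncbi_id}?type=ncbi"


def aphia_record_by_ncbi(ncbi_id: int) -> Optional[dict]:
    """Fetch the Aphia record for an NCBI ID.

    Returns None when the response body is empty (assumed to mean no match).
    """
    with urllib.request.urlopen(external_id_url(ncbi_id), timeout=30) as resp:
        body = resp.read()
    return json.loads(body) if body else None


# Only builds the URL; call aphia_record_by_ncbi(94237) to make the request.
print(external_id_url(94237))
```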

Somewhat related: gbif/doc-publishing-dna-derived-data#35

@dianalg

dianalg commented Apr 27, 2021

Thanks, @pieterprovoost, some recommendations around this issue would be great to start. Do you have any sense of when we might expect those?

@pieterprovoost
Member

After discussing with WoRMS, we propose the following:

  • WoRMS is our taxonomic backbone, and the recommendation remains to match records to WoRMS and include a WoRMS LSID in scientificNameID. There's certainly value in providing an NCBI taxonomy identifier, but NCBI explicitly states that they are not a taxonomic authority. There's also something to be said about the quality of identifications in/based on NCBI (see Lindsay et al. as mentioned by @claudenozeres) but that's probably another discussion.
  • We think scientificNameID is the appropriate field for WoRMS LSIDs as these refer to names, not taxon concepts.
  • We believe taxonConceptID is the appropriate field for OTUs, BOLD BINs, NCBI taxonomy identifiers, etc. We are aware that there are recommendations to add OTUs in scientificName, but that is not what the term was intended for (also see here and here).
  • OBIS will add support for search on taxonConceptID.
  • WoRMS is going to look into developing an online tool to match NCBI taxonomy identifiers to WoRMS LSIDs, similar to their existing taxon matching services. However, this is only going to work for cases where there's an exact match between the NCBI and WoRMS concepts. NCBI concepts which are not scientific names for example will require "manual" matching to the closest parent in WoRMS (if that helps I can probably provide example code to somewhat automate this process using the WoRMS and NCBI APIs). If you come across names you think should be in WoRMS but are not, please get in touch with the WoRMS team, they will be happy to look into it. As mentioned in earlier comments, having better connections between the different systems benefits everyone.

So for @dianalg's example I would propose this:

| term | value |
| --- | --- |
| scientificName | Biota |
| scientificNameID | urn:lsid:marinespecies.org:taxname:1 |
| taxonConceptID | NCBI:txid1899546 |
| identificationRemarks | phototrophic eukaryote |

Hopefully this is a workable solution.
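
As a rough illustration, the proposed term/value pairs for this example could be written out as one row of an occurrence table. This is a minimal Python sketch using the standard `csv` module; a real occurrence record would carry many more Darwin Core terms than the four shown here.

```python
import csv
import io

# Term/value pairs from the proposal for the "phototrophic eukaryote" example
record = {
    "scientificName": "Biota",
    "scientificNameID": "urn:lsid:marinespecies.org:taxname:1",
    "taxonConceptID": "NCBI:txid1899546",
    "identificationRemarks": "phototrophic eukaryote",
}

# Write a header row and one data row to an in-memory CSV
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()
print(csv_text)
```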

Finally, please note that we are not outright rejecting records without WoRMS LSID, but they may get flagged as not being linked to the taxonomic backbone. It would be a shame if people decide not to publish to OBIS at all due to this requirement. Making data findable and accessible should be the priority, even if interoperability is not perfect.

@bart-v @leenvandepitte

@albenson-usgs

albenson-usgs commented May 3, 2021

> We believe taxonConceptID is the appropriate field for OTUs, BOLD BINs, NCBI taxonomy identifiers, etc. We are aware that there are recommendations to add OTUs in scientificName, but that is not what the term was intended for (also see here and here).

Is anyone going to bring this to TDWG for discussion with the broader community?

I note that both of the issues Pieter links to are closed and are in GBIF only discussion areas. I think we would all benefit from wider community input on how to move forward with this.

@claudenozeres

I agree with @albenson-usgs: we need to inform, alert, and hear from TDWG or the broader community. We raised these matters for OBIS, but they are applicable to others (who may not be aware). @pieterprovoost's summary with example is very useful.
