In-journal periodical publications merged together #35

Open
eliarizzetto opened this issue Nov 11, 2024 · 0 comments
Labels
bug Something isn't working


We have detected Bibliographic Resources in OpenCitations Meta that are each linked to multiple external IDs which, in the real world, identify distinct publications appearing periodically in the same venue (journal), e.g. editorial comments or recurrent news columns in a specific research field.

The periodicity of these publications might not be relevant to the cause of the problem (i.e. IDs of separate resources all being linked to a single resource in Meta). It still seems worth pointing out, since these cases do not seem to be generated, or at least not exclusively, by software bugs in Meta (contrary to the cases where different real-world entities with no perceivable common features have been erroneously merged). Rather, they seem to result from errors in the data provided by OpenCitations' primary sources (e.g. Crossref, DataCite, PubMed, OpenAIRE, etc.).

For example, let's consider the case of br/061903839782. By querying the Meta SPARQL endpoint, we can see that this journal article has 76 DOIs and 17 PMIDs:

PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
PREFIX datacite: <http://purl.org/spar/datacite/>

# List every external identifier (value and scheme) attached to the resource
SELECT ?value ?scheme WHERE {
    <https://w3id.org/oc/meta/br/061903839782> datacite:hasIdentifier ?id .
    ?id datacite:usesIdentifierScheme ?scheme ;
        literal:hasLiteralValue ?value .
}
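The query above can also be run from a script, and its results tallied per identifier scheme. The following is only a minimal sketch: the endpoint URL, the function names `run_query` and `count_ids_by_scheme`, and the assumption that each scheme IRI ends with the scheme name (e.g. `.../datacite/doi`) are ours and should be checked against the OpenCitations documentation.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

# Assumed endpoint URL: verify it against the current OpenCitations documentation.
META_SPARQL_ENDPOINT = "https://opencitations.net/meta/sparql"

QUERY = """
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
PREFIX datacite: <http://purl.org/spar/datacite/>

SELECT ?value ?scheme WHERE {
    <https://w3id.org/oc/meta/br/061903839782> datacite:hasIdentifier ?id .
    ?id datacite:usesIdentifierScheme ?scheme ;
        literal:hasLiteralValue ?value .
}
"""


def run_query(query, endpoint=META_SPARQL_ENDPOINT):
    """Send a SPARQL SELECT query over HTTP GET and return the parsed JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def count_ids_by_scheme(results):
    """Count identifiers per scheme, using the last IRI segment as the scheme name."""
    counts = Counter()
    for binding in results["results"]["bindings"]:
        scheme_iri = binding["scheme"]["value"]
        counts[scheme_iri.rstrip("/").rsplit("/", 1)[-1]] += 1
    return counts
```

Calling `count_ids_by_scheme(run_query(QUERY))` should then give one count per scheme. The counts returned today may differ from the ones reported in this issue if the data in Meta has changed in the meantime.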

We then looked up these external IDs in the primary sources' databases by querying the relevant APIs, to obtain information on their current representation (although, of course, the data currently exposed via API might differ from the data available at the time of ingestion in Meta). For brevity, we only post the script used to obtain current data on these external IDs from PubMed, since PubMed is the source that generates the error in this case.

import time

import requests
from tqdm import tqdm

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fetch_pmids(doi):
    """Return the PMIDs that PubMed currently associates with a DOI."""
    params = {
        "db": "pubmed",
        "term": doi,
        "field": "DOI",
        "retmode": "json"
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    # Extract the PMIDs from the response
    return response.json()['esearchresult']['idlist']


def get_pmids_for_doi(doi_list):
    """Map each DOI to its PMIDs in PubMed; DOIs with no match are omitted."""
    pmid_dict_out = dict()

    for doi in tqdm(doi_list):
        try:
            pmids = fetch_pmids(doi)
        except Exception as e:
            # Retry once after a longer pause; a second failure propagates
            print(e)
            time.sleep(5)
            pmids = fetch_pmids(doi)
        if pmids:
            pmid_dict_out.setdefault(doi, []).extend(pmids)
        # Pause between requests to respect the NCBI rate limits
        time.sleep(0.60)

    return pmid_dict_out

By running get_pmids_for_doi() on the list of DOIs associated with br/061903839782 in Meta we obtain, as a dictionary, the DOI-to-PMID mapping available in the current PubMed data. These results show that the great majority of the DOIs point to multiple PMIDs in PubMed, which explains why so many IDs point to the same resource in Meta. Nonetheless, 21 of the queried DOIs are each associated with a single PMID in the current state of PubMed: they, too, point to br/061903839782 most likely because the data in PubMed was updated after this entity (and its external IDs) was ingested in Meta.
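To make the distinction between DOIs with several PMIDs and DOIs with a single PMID concrete, here is a small sketch of how a DOI-to-PMID dictionary of this shape could be inspected. The helper names `split_by_pmid_count` and `invert_mapping` are hypothetical, and the identifiers in the example are made up for illustration; they are not the data from this issue.

```python
def split_by_pmid_count(doi_to_pmids):
    """Split DOIs into those mapped to several PMIDs and those mapped to exactly one."""
    ambiguous = {
        doi: pmids for doi, pmids in doi_to_pmids.items() if len(set(pmids)) > 1
    }
    unique = {
        doi: pmids for doi, pmids in doi_to_pmids.items() if len(set(pmids)) == 1
    }
    return ambiguous, unique


def invert_mapping(doi_to_pmids):
    """Build the reverse PMID-to-DOIs mapping to see which PMIDs are shared."""
    pmid_to_dois = {}
    for doi, pmids in doi_to_pmids.items():
        for pmid in pmids:
            pmid_to_dois.setdefault(pmid, set()).add(doi)
    return pmid_to_dois


# Made-up identifiers, for illustration only (not taken from the issue data)
example = {
    "10.1000/a": ["111", "222"],
    "10.1000/b": ["222", "333"],
    "10.1000/c": ["444"],
}

ambiguous, unique = split_by_pmid_count(example)
# PMIDs reachable from more than one DOI are the ones that chain records together
shared = {
    pmid: dois for pmid, dois in invert_mapping(example).items() if len(dois) > 1
}
print(len(ambiguous), len(unique), shared)
```

In the made-up example, the two DOIs that share a PMID would end up connected to each other, which is the kind of chaining that could plausibly lead to many IDs being attached to one resource.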

For more information on this issue and on the steps taken to examine it, see the following gist: https://gist.github.com/eliarizzetto/c984bb85642aee7ae9eeb0761a9f0d40.

eliarizzetto added the bug label (Something isn't working) Nov 11, 2024