Consider flexibility in rules for matching an article by title #1297

Open
3 tasks done
jvwong opened this issue Sep 3, 2024 · 3 comments

jvwong commented Sep 3, 2024

Background

Article matching has been iterated on many times for different edge cases (#1074, #1124, #848), and there are services aimed at resolving this information, e.g. #1295. From my observations this works pretty well, but there are still cases where no article is matched because of ambiguity.

Currently, an author's input title must be an exact subset of the record retrieved from either PubMed or CrossRef after 'sanitization':
- trimming: const trimmed = _.trim( raw , ' .')
- lower casing: const lower = _.toLower( trimmed )
- removal of non-words: const clean = lower.replace(/[\W_]+/g, ' ')
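
The three steps above can be sketched in plain JavaScript as follows. This is a minimal illustration, not the project's code: the regexes stand in for the lodash calls, `sanitize` and `isTitleMatch` are hypothetical names, and "exact subset" is assumed here to mean a substring test.

```javascript
// Plain-JS approximation of the three sanitization steps listed above.
// The regexes stand in for _.trim( raw, ' .' ) and _.toLower( trimmed ).
function sanitize(raw) {
  const trimmed = raw.replace(/^[ .]+|[ .]+$/g, ''); // strip leading/trailing spaces and periods
  const lower = trimmed.toLowerCase();
  return lower.replace(/[\W_]+/g, ' '); // each run of non-word characters becomes one space
}

// Hypothetical helper: assumes "exact subset" means "is a substring of".
function isTitleMatch(inputTitle, recordTitle) {
  return sanitize(recordTitle).includes(sanitize(inputTitle));
}

const record = 'Syntaxin-6 delays prion protein fibril formation and prolongs presence of toxic aggregation intermediates';
const input = 'Syntaxin-6 delays prion protein fibril formation and prolongs the presence of toxic aggregation intermediates';

console.log(isTitleMatch(record + '.', record)); // true: the trailing period is trimmed away
console.log(isTitleMatch(input, record));        // false: the extra "the" breaks the substring test
```

Under these assumptions a single extra word in the input is enough to reject an otherwise identical title.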

Problems observed

There remain cases where we might want to reasonably relax conditions. For example:

  • Stop words
    • Input: "Syntaxin-6 delays prion protein fibril formation and prolongs the presence of toxic aggregation intermediates"
    • Actual: "Syntaxin-6 delays prion protein fibril formation and prolongs presence of toxic aggregation intermediates"
    • Source: https://doi.org/10.7554/eLife.83320
  • Formatting
    • whitespace 1
      • Input: "Senescent cells inhibit mouse myoblast differentiation via the Senescence Associated Secretory Phenotype ( SASP)-lipid 15d-PGJ2 -mediated modification and control of HRas"
      • Actual (PubMed): "Senescent cells inhibit mouse myoblast differentiation via the SASP-lipid 15d-PGJ2 mediated modification and control of HRas."
      • Source: https://pubmed.ncbi.nlm.nih.gov/39196610/
    • whitespace 2
      • Input: "Circular RNA HMGCS1 sponges MIR4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction"
      • Actual (eLife): "Circular RNA HMGCS1 sponges miR-4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction"
      • Source: https://doi.org/10.7554/eLife.97267.1
    • whitespace 3
      • Input: "Circular RNA HMGCS1 sponges miR-4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction"
      • Actual (update): "Circular RNA HMGCS1 sponges MIR4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction."
      • Source: https://doi.org/10.7554/eLife.97267
    • Author-specified info
      • journal, year
        • Input: "eLife 2024: Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics"
        • Actual: "Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics"
        • Source: https://doi.org/10.7554/eLife.94698.3
  • Markup
    • Input: "Trans regulation of an odorant binding protein by a proto-Y chromosome affects male courtship in house fly"
    • Actual: "Transregulation of an odorant binding protein by a proto-Y chromosome affects male courtship in house fly"
    • Source: https://www.biorxiv.org/content/10.1101/2021.06.22.447776v2
  • Partial-match
    • title
      • Input: "Root-specific theanine metabolism and regulation at the single-cell level in tea plants (Camellia sinensis)"
      • Actual: "Root-specific secondary metabolism at the single-cell level: a case study of theanine metabolism and regulation in the roots of tea plants (Camellia sinensis)"
      • Source: https://doi.org/10.7554/eLife.95891.2
  • Characters
    • Dashes
      • Input: "Neurons enhance blood-–brain barrier function via upregulating claudin-5 and VE-cadherin expression due to glial cell line-derived neurotrophic factor secretion"
      • Actual: "Neurons enhance blood-brain barrier function via upregulating claudin-5 and VE-cadherin expression due to GDNF secretion"
      • Source: https://elifesciences.org/reviewed-preprints/96161
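
As an illustration of what "relaxing conditions" could look like, here is one possible approach that is not proposed anywhere in this thread: score the fraction of input tokens that also appear in the record, instead of requiring a substring. The names `tokens` and `inputCoverage` are hypothetical, and any acceptance threshold would be a tunable assumption.

```javascript
// Hypothetical relaxation: compare titles as sets of tokens rather than as substrings.
function tokens(title) {
  return new Set(title.toLowerCase().split(/[\W_]+/).filter(Boolean));
}

// Fraction of the input's distinct tokens that also occur in the record (0 to 1).
function inputCoverage(inputTitle, recordTitle) {
  const inputTokens = tokens(inputTitle);
  const recordTokens = tokens(recordTitle);
  if (inputTokens.size === 0) return 0;
  let shared = 0;
  for (const t of inputTokens) {
    if (recordTokens.has(t)) shared += 1;
  }
  return shared / inputTokens.size;
}

const stopWordInput = 'Syntaxin-6 delays prion protein fibril formation and prolongs the presence of toxic aggregation intermediates';
const stopWordRecord = 'Syntaxin-6 delays prion protein fibril formation and prolongs presence of toxic aggregation intermediates';

// 14 of the 15 distinct input tokens are found in the record, so roughly 0.93.
console.log(inputCoverage(stopWordInput, stopWordRecord));
```

A score like this ignores word order and repeated words, so a threshold set too low would start matching distinct articles with similar vocabulary.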

Details

There are potential pitfalls to increasing flexibility. Notably, the title of a manuscript can change between preprints, later versions and the final version of record.

Tasks

  • Collect additional cases of real/potential mismatches
  • Create test harness
  • Pull out common code for matching
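
A test harness for the second task could be table-driven, along the lines of the sketch below. The cases are copied from this issue; the `expected` values, the names `normalizeTitle`, `currentRule` and `runHarness`, and the reading of the current rule as "sanitize, then substring test" are all assumptions made for illustration.

```javascript
// Cases copied from the issue. `expected` is the outcome an author would presumably want.
const cases = [
  {
    label: 'stop words',
    input: 'Syntaxin-6 delays prion protein fibril formation and prolongs the presence of toxic aggregation intermediates',
    actual: 'Syntaxin-6 delays prion protein fibril formation and prolongs presence of toxic aggregation intermediates',
    expected: true
  },
  {
    label: 'markup',
    input: 'Trans regulation of an odorant binding protein by a proto-Y chromosome affects male courtship in house fly',
    actual: 'Transregulation of an odorant binding protein by a proto-Y chromosome affects male courtship in house fly',
    expected: true
  },
  {
    label: 'journal, year',
    input: 'eLife 2024: Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics',
    actual: 'Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics',
    expected: true
  },
  {
    label: 'control: trailing period',
    input: 'Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics.',
    actual: 'Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics',
    expected: true
  }
];

// Stand-in for the current rule (an assumption, not the project's code).
function normalizeTitle(raw) {
  return raw.replace(/^[ .]+|[ .]+$/g, '').toLowerCase().replace(/[\W_]+/g, ' ');
}
const currentRule = (input, actual) => normalizeTitle(actual).includes(normalizeTitle(input));

// Runs a candidate matcher over every case and returns the labels that disagree with `expected`.
function runHarness(matcher, caseList) {
  return caseList
    .filter(c => matcher(c.input, c.actual) !== c.expected)
    .map(c => c.label);
}

const failures = runHarness(currentRule, cases);
console.log(failures); // labels of the cases the stand-in rule gets wrong
```

Any candidate matcher with the same `(input, actual) => boolean` shape can be passed to `runHarness`, which makes it easy to check that a relaxed rule fixes the known cases without breaking the controls.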

jvwong commented Oct 16, 2024

Not too flexible: #1299 (comment)

jvwong reopened this Dec 6, 2024

jvwong commented Dec 6, 2024

New case:

  • Punctuation
    • Apostrophe
      • Input: "Saccharomyces cerevisiae Rev7 promotes non-homologous end-joining by blocking Mre11 nuclease and Rad50’s ATPase activities and homologous recombination"
      • Actual (PubMed): "Saccharomyces cerevisiae Rev7 promotes non-homologous end-joining by blocking Mre11 nuclease and Rad50's ATPase activities and homologous recombination"
      • Sources

There are probably some npm packages for normalizing this kind of thing, but which ones to use, and how far-reaching they should be, is another question. So far I've avoided all of them.
