Improve configurability and documentation of Pandas utilities #71

Draft
wants to merge 11 commits into
base: main

cthoyt
Member

@cthoyt cthoyt commented Sep 9, 2023

References #63

This pull request does the following:

  • Make strictness configurable for pandas expansion, compression, and standardization functions
  • When not strict, report a summary of what couldn't be expanded, compressed, or standardized
  • Add real-world examples for each of these functions (expansion, compression, prefix standardization, CURIE standardization, URI standardization) in both strict and non-strict modes
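To make the strict/non-strict distinction above concrete, here is a minimal pure-Python sketch. It does not use the curies or pandas APIs: the `expand_column` helper, its `strict` flag, and the prefix map are hypothetical stand-ins chosen for illustration, and the actual signatures in this pull request may differ.

```python
from collections import Counter

# Hypothetical prefix map, for illustration only.
PREFIX_MAP = {
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "GO": "http://purl.obolibrary.org/obo/GO_",
}


def expand_column(curies, prefix_map, strict=False):
    """Expand a column of CURIEs into URIs.

    In strict mode, raise ValueError on the first value that can't be expanded.
    Otherwise, put None in its place and count the failure by prefix
    (values without a colon are counted under the key None).
    """
    expanded = []
    unknown = Counter()
    for curie in curies:
        prefix, sep, identifier = curie.partition(":")
        uri_prefix = prefix_map.get(prefix) if sep else None
        if uri_prefix is None:
            if strict:
                raise ValueError(f"could not expand CURIE: {curie!r}")
            unknown[prefix if sep else None] += 1
            expanded.append(None)
        else:
            expanded.append(uri_prefix + identifier)
    return expanded, unknown


column = ["CHEBI:1", "GO:0006915", "NOPE:123", "NOPE:456", "not-a-curie"]
uris, summary = expand_column(column, PREFIX_MAP, strict=False)
for key, count in summary.most_common():
    print(f"unexpanded prefix {key!r}: {count} value(s)")
```

In non-strict mode the two `NOPE` values and the malformed value come back as `None` and are tallied in `summary`. Calling the same function with `strict=True` raises on `NOPE:123` instead.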

This pull request deliberately does not add any banana processing to curies.Converter, since it's not the job of this package to define what the semantics are. In the end, content containing bananas is just plain wrong. The goal of the curies package is to correctly implement whatever semantics it is given. The Bioregistry goes a bit further than curies and uses domain insight to deal with bananas, but we're not going to upstream that functionality, since curies really should stay 100% generic.
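As an illustration of the point above, the sketch below shows what a purely generic expansion does with a banana, taken here to mean a redundant copy of the prefix embedded in the local identifier. The `expand` function and the prefix map are hypothetical and stand in for any converter that only knows a prefix map. It is not the curies implementation.

```python
# Hypothetical prefix map, for illustration only.
PREFIX_MAP = {"GO": "http://purl.obolibrary.org/obo/GO_"}


def expand(curie, prefix_map):
    """Split on the first colon and concatenate, with no domain knowledge."""
    prefix, _, identifier = curie.partition(":")
    return prefix_map[prefix] + identifier


clean = expand("GO:0006915", PREFIX_MAP)
banana = expand("GO:GO:0006915", PREFIX_MAP)

print(clean)   # http://purl.obolibrary.org/obo/GO_0006915
print(banana)  # http://purl.obolibrary.org/obo/GO_GO:0006915
```

The second result follows the given semantics faithfully but is not a useful URI. Fixing it requires knowing that `GO:` is a banana for this prefix, which is domain knowledge rather than generic CURIE handling.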

That being said, we can write more documentation to help people in these scenarios. It should:

  1. Suggest using a comprehensive registry like the Bioregistry as a default. When that's not sufficient, give examples of how to either (a) add e.g. synonyms to the Bioregistry, which addresses the issue community-wide, or (b) extend the Converter locally
  2. Suggest chaining the Bioregistry after community-specific registries, such as the OBO Foundry's
  3. Give an example of how the Converter class can be extended to include additional functionality, such as banana processing. This is planned to replace the Bioregistry's compression and expansion functionality.
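The chaining and extension ideas in the list above could look roughly like the following. This is a standalone sketch under stated assumptions, not the curies API: `SimpleConverter`, `BananaConverter`, and `chain_prefix_maps` are hypothetical names, the prefix maps are illustrative, and the sketch assumes that earlier maps in a chain take priority over later ones.

```python
class SimpleConverter:
    """Hypothetical stand-in for a generic, prefix-map-based converter."""

    def __init__(self, prefix_map):
        self.prefix_map = dict(prefix_map)

    def expand(self, curie):
        """Return the URI for a CURIE, or None if it can't be expanded."""
        prefix, sep, identifier = curie.partition(":")
        if not sep or prefix not in self.prefix_map:
            return None
        return self.prefix_map[prefix] + identifier


class BananaConverter(SimpleConverter):
    """Extension that strips a known banana before expanding."""

    def __init__(self, prefix_map, bananas):
        super().__init__(prefix_map)
        self.bananas = dict(bananas)  # prefix -> redundant leading string

    def expand(self, curie):
        prefix, sep, identifier = curie.partition(":")
        banana = self.bananas.get(prefix)
        if sep and banana and identifier.startswith(banana):
            identifier = identifier[len(banana):]
        return super().expand(f"{prefix}{sep}{identifier}")


def chain_prefix_maps(prefix_maps):
    """Merge prefix maps; maps listed earlier win on conflicting prefixes."""
    merged = {}
    for prefix_map in prefix_maps:
        for prefix, uri_prefix in prefix_map.items():
            merged.setdefault(prefix, uri_prefix)
    return merged


# Community-specific map first, then a broader fallback (both illustrative).
community = {"GO": "http://purl.obolibrary.org/obo/GO_"}
general = {
    "GO": "http://amigo.geneontology.org/amigo/term/GO:",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}

converter = BananaConverter(
    chain_prefix_maps([community, general]),
    bananas={"GO": "GO:"},
)
go_uri = converter.expand("GO:GO:0006915")
chebi_uri = converter.expand("CHEBI:1")
print(go_uri)     # http://purl.obolibrary.org/obo/GO_0006915
print(chebi_uri)  # http://purl.obolibrary.org/obo/CHEBI_1
```

Keeping the banana handling in a subclass leaves the base converter generic, which matches the division of responsibilities described in this pull request.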

Notebook demo: https://github.com/cthoyt/curies/blob/improve-pandas-utils/notebooks/Data%20Science%20Demo.ipynb

cc @hrshdhgd

@codecov-commenter

codecov-commenter commented Sep 9, 2023

Codecov Report

Merging #71 (bef7b4c) into main (135988c) will decrease coverage by 0.66%.
The diff coverage is 90.00%.

❗ Current head bef7b4c differs from pull request most recent head f18ddd8. Consider uploading reports for the commit f18ddd8 to get more accurate results

@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
- Coverage   99.32%   98.66%   -0.66%     
==========================================
  Files           9        9              
  Lines         593      601       +8     
  Branches      127      128       +1     
==========================================
+ Hits          589      593       +4     
- Misses          3        7       +4     
  Partials        1        1              
Files Changed        Coverage Δ
src/curies/api.py    97.82% <90.00%> (-1.07%) ⬇️

2 participants