Releases: gyorilab/adeft
Adeft v0.12.3
What's Changed
Full Changelog: 0.11.2...0.12.3
0.11.2
0.11.1
0.11.0
This release fixes a bug that caused the grounding GUI to not work when adeft is pip installed. The adeft folder for the pretrained models is now placed in a platform-specific user data folder by default rather than in a hidden folder in the user's home directory. Users are still able to override this default by setting the environment variable ADEFT_HOME. Tests have been updated to use pytest instead of nose.
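The override behaviour can be pictured as a lookup that prefers the environment variable and otherwise falls back to a default folder. The following is a minimal sketch, not adeft's own code: the function name resolve_adeft_home and the default path passed in are hypothetical, and how adeft computes its platform-specific default is not shown here.

```python
import os
from pathlib import Path


def resolve_adeft_home(default_dir, environ=None):
    """Return the folder to use for pretrained models.

    Hypothetical helper for illustration only. ADEFT_HOME is the
    environment variable named in the release notes; the rest is an
    assumption about how such a lookup could be written.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("ADEFT_HOME")
    if override:
        # An explicit ADEFT_HOME always wins over the default.
        return Path(override).expanduser()
    return Path(default_dir)


# Explicit mappings are passed in so the result does not depend on the
# real environment of the machine running this snippet.
with_override = resolve_adeft_home("/tmp/adeft-default", {"ADEFT_HOME": "/data/adeft"})
without_override = resolve_adeft_home("/tmp/adeft-default", {})
print(with_override)
print(without_override)
```

Passing the environment as an argument keeps the sketch easy to test; real code would normally read os.environ directly.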
0.10.0
This release makes several changes concerning model statistics.
- The global precision, recall, and F1 scores for a classifier now use micro-averaging to aggregate across the scores for different positive class labels rather than taking an average weighted by the frequencies for each positive label. Micro-averaging looks at global counts of true positives, false positives, and false negatives across all positive labels. A true positive is any datapoint with a positive label that is classified correctly. A false negative is any datapoint with a positive label that is classified incorrectly. A false positive is any datapoint that is incorrectly assigned a positive label. Note that false positives and false negatives can overlap: a datapoint with one positive label that is misclassified as a different positive label counts as both. Micro-averaging is easier to reason about and interpret, and using it allows for some simplification of implementation in other places. The original decision to use the weighted average was made with little thought at a time when we were making less use of model statistics.
- A method has been added to adeft.disambiguate.AdeftDisambiguator that allows the set of positive labels to be updated while recomputing global model statistics. Previously, retraining the model was required. This is facilitated by storing the entire label vs label confusion matrix for each CV fold upon training a model and serializing it when saving the model.
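As a rough illustration of why a stored confusion matrix is enough to recompute the global statistics for a new set of positive labels, here is a minimal sketch in plain Python. It is not adeft's implementation: the function name, the nested-dict layout, and the counts are all made up for the example, and it works on a single matrix rather than one per CV fold.

```python
def micro_averaged_scores(confusion, pos_labels):
    """Compute micro-averaged precision, recall, and F1.

    confusion[true_label][predicted_label] holds a count of datapoints.
    """
    tp = fp = fn = 0
    for true_label, row in confusion.items():
        for pred_label, count in row.items():
            if true_label == pred_label:
                # Correct prediction: counts only if the label is positive.
                if true_label in pos_labels:
                    tp += count
                continue
            # Wrong prediction: it can be a false positive, a false
            # negative, both, or neither.
            if pred_label in pos_labels:
                fp += count
            if true_label in pos_labels:
                fn += count
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Made-up counts for three labels; "ungrounded" stands in for a
# negative label.
confusion = {
    "A": {"A": 8, "B": 1, "ungrounded": 1},
    "B": {"A": 2, "B": 6, "ungrounded": 0},
    "ungrounded": {"A": 2, "B": 0, "ungrounded": 9},
}

# The same matrix gives statistics for any choice of positive labels,
# with no retraining involved.
both_positive = micro_averaged_scores(confusion, {"A", "B"})
only_a_positive = micro_averaged_scores(confusion, {"A"})
print(both_positive)
print(only_a_positive)
```

Because only counts are needed, changing the positive set amounts to re-reading the matrix with a different membership test.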
Bug fixes and smaller changes were also made:
- A bug was fixed that was causing the labels in model statistics to fail to update when adeft.disambiguate.AdeftDisambiguator.modify_groundings was used to update groundings in a model.
- A bug was fixed that caused the labels attribute of an adeft.disambiguate.AdeftDisambiguator to not contain labels for which no defining pattern exists. (These labels are typically for texts manually curated in Entrez as mentioning a particular gene with the shortform of interest as a synonym but which are not abbreviations.)
- A new attribute was added to classifiers called other_metadata. Anything jsonable stored within this attribute will be preserved upon model serialization. We are using this to store any relevant information needed to retrain a model that does not fit into the existing attributes. This allows for simplification of the retraining process.
- Some small updates have been made to the introductory Jupyter notebook.
0.9.0
This release makes a number of improvements to the grounding GUI.
- Previously, actions such as deleting an entry or toggling a label as positive/negative would cause the scroll position and text entered into the input boxes to be lost. This made using the app tedious since the page would refresh to the top after each action, making it burdensome for example to delete many groundings or toggle many labels in sequence. This has been remedied.
- The input boxes at the top are now fixed in a sticky position making it unnecessary to scroll back and forth in order to select rows and then enter groundings. They now follow along as the user scrolls.
- Columns of the table are now sortable. The headers for each column are now buttons masquerading as links. Clicking each header will cause the rows to be sorted by that column. This is useful for example to aid in scanning for similar longforms or to group every row together that has the same grounding.
- The user may now pass in a csv file of known groundings with rows of the form namespace, identifier, standard name (e.g. HGNC,6091,INSR). It is then only necessary to enter the namespace and either the identifier or the standard name into the input boxes for any grounding that has a row in the supplied table.
- Entered groundings are now color coded, with one color for groundings where the standard name and identifier match in a row in the supplied groundings csv file, another color for groundings where the standard name and identifier do not match based on the table, and black if there are no rows in the table for the entered standard name and identifier. The colors have been chosen so that the contrast can hopefully be detected by most color blind users; instead of the standard green for match and red for mismatch, approximations of these colors have been chosen based on the Wong color palette.
- Any rows assigned the grounding "ignore" will have their longforms dropped from the generated grounding map. These are displayed with a special color to highlight their special semantic role.
- Labels without a namespace will not appear in the column of labels which can be toggled as positive/negative.
These changes should make the GUI much more user friendly and less tedious to use.
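The three-way color coding described above comes down to a lookup against the supplied table. The sketch below shows one way such a check could be written. It is an assumption-laden illustration, not the GUI's code: the function names and the status strings are hypothetical, and the only table row used is the example from these notes.

```python
import csv
import io


def load_known_groundings(csv_text):
    """Parse rows of the form namespace,identifier,standard name."""
    rows = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == 3:
            rows.add(tuple(field.strip() for field in row))
    return rows


def grounding_status(known, namespace, identifier, name):
    """Classify an entered grounding against the known table."""
    if (namespace, identifier, name) in known:
        # Identifier and standard name agree with a row in the table.
        return "match"
    for ns, ident, std_name in known:
        # One of the two fields is in the table, but not paired with
        # the other one the user entered.
        if ns == namespace and (ident == identifier or std_name == name):
            return "mismatch"
    # Nothing in the table mentions this identifier or name.
    return "unknown"


known = load_known_groundings("HGNC,6091,INSR\n")
status_ok = grounding_status(known, "HGNC", "6091", "INSR")
# Deliberately wrong pairing of identifier and name.
status_bad = grounding_status(known, "HGNC", "6091", "INS")
# Placeholder values that do not appear in the table at all.
status_new = grounding_status(known, "HGNC", "0", "NOT_IN_TABLE")
print(status_ok, status_bad, status_new)
```

Each status would then be mapped to a display color, with "unknown" shown in black.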
0.8.0
This release fixes several bugs and makes some small updates.
Fixes have been made for
- A bug in AdeftMiner.prune that broke this method but was undiscovered due to lack of testing. The bug has been fixed and a test has been added.
- Training adeft models throwing an error for the edge case where there are more than two labels with only one positive label.
- The longform scorer throwing an error when there are punctuation characters in the shortform.
- The GUI not working when the multiprocessing start method is set to spawn. This caused the GUI to fail on Windows, where fork is unavailable. This should resolve issue #49.
- A deprecation warning from Scikit-learn's GridSearchCV. The deprecated parameter iid is no longer used internally.
The following other changes have been made
- AdeftLabeler now requires unique identifiers along with the texts passed into process_texts. Instead of passing in a list of texts, the process_texts method now takes a list of tuples of the form (text, identifier). The output list now contains tuples of the form (text, label, identifier). This is useful for mapping back from texts in the generated corpus to texts in the input. Texts without defining patterns are filtered out completely and those with defining patterns have the defining patterns replaced with only the shortform, making mapping backwards nontrivial without the identifiers.
- Adeft's home folder can now be specified by setting the environment variable ADEFT_HOME in the user's profile. The default is now the hidden folder ".adeft" in the user's home directory with subfolders for different adeft versions.
- The parameter class_weight from Scikit-learn's implementation of logistic regression is now exposed as a parameter of AdeftClassifier. This allows different weights to be provided in the loss function for different class labels.
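To see why the identifiers matter for process_texts, consider the sketch below. It does not call adeft. The corpus list is a hand-written stand-in for the kind of output described above, and the texts, labels, and identifiers are invented for illustration.

```python
# Input in the form process_texts now expects: (text, identifier) pairs.
inputs = [
    ("The insulin receptor (IR) binds insulin.", "doc1"),
    ("IR spectra were recorded.", "doc2"),
    ("Ionizing radiation (IR) damages DNA.", "doc3"),
]

# Hand-written stand-in for the output: (text, label, identifier).
# doc2 is absent because it contains no defining pattern, and the
# remaining texts no longer equal the inputs because the defining
# pattern has been replaced with the shortform.
corpus = [
    ("The IR binds insulin.", "HGNC:6091", "doc1"),
    ("IR damages DNA.", "ungrounded", "doc3"),
]

# Matching on text would fail, but the identifiers map straight back.
original_by_id = {identifier: text for text, identifier in inputs}
labels_by_id = {identifier: label for _, label, identifier in corpus}
dropped = [ident for ident in original_by_id if ident not in labels_by_id]

for ident, label in labels_by_id.items():
    print(ident, label, "<-", original_by_id[ident])
print("filtered out:", dropped)
```

Any hashable identifier would work in the same way; short strings are used here only for readability.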
0.7.0
This release updates the longform expansion discovery algorithm in AdeftMiner to combine the Acromine based approach with an alignment based scorer that we have developed. Alignment based scoring algorithms look for common subsequences between the shortform and longform candidates, with different approaches scoring matches in a variety of ways. We have combined the two approaches by taking weighted averages of normalized Acromine scores and alignment based scores for each longform candidate, with the weight assigned to the alignment based score increasing for rare expansions.
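The combination can be pictured as a convex blend of the two scores. The sketch below is only an illustration of that idea. The weighting function scale / (scale + count), the scale parameter, and the example scores are assumptions made for this example and are not the formula used by AdeftMiner.

```python
def combined_score(acromine_score, alignment_score, count, scale=5.0):
    """Blend two scores that both lie in [0, 1].

    The weight on the alignment score shrinks as the candidate is seen
    more often, so rare expansions lean more on alignment. The
    weighting function here is a hypothetical choice for illustration.
    """
    weight = scale / (scale + count)
    return weight * alignment_score + (1.0 - weight) * acromine_score


# Same pair of scores, different candidate frequencies.
rare = combined_score(0.2, 0.9, count=1)
common = combined_score(0.2, 0.9, count=95)
print(rare, common)
```

With these inputs the rare candidate ends up close to its alignment score, while the common candidate stays close to its Acromine score.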
The AdeftRecognizer has also been updated to allow the raw longform expansion to be recovered as it actually appears in a text. Previously only a normalized expansion was recovered.
0.6.0
In this release
- Users may specify a seed that will be used in random number generators involved in adeft model creation, allowing for repeatable model training results.
- Additional statistics are captured at the time of model training. F1, precision, and recall are now captured for each class label separately, allowing users to see how performance compares across labels. These statistics have been propagated to the info string of AdeftDisambiguator.
- Timestamps and additional metadata are collected at model training time, making changes in models more transparent. AdeftDisambiguator now has an additional method version based on some of this metadata.
- The ".adeft" folder containing models now has the version appended to it. E.g. in this release the folder will be named ".adeft_0.6.0". This will allow different versions of adeft with incompatible models to coexist on the same machine.
- The command python -m adeft.download no longer takes the argument --update. The new behavior is for all existing models to be replaced and a fresh copy of all existing models on S3 to be downloaded when the command is run.
- The method feature_importances of AdeftClassifier no longer raises an exception if called for a classifier trained before the information necessary to calculate feature importances was included. Now a warning is logged and None is returned.
0.5.5 - JOSS Paper
Version of Adeft software corresponding to accepted manuscript at the Journal of Open Source Software (JOSS).
Change log from 0.5.3:
- add shortforms to model stopwords to prevent use of abbreviations as model features
- capture information about feature importance for each model