V1.13.post #48

Open
wants to merge 24 commits into
base: master

Conversation

mart-r
Owner

@mart-r mart-r commented Oct 14, 2024

Just testing GHA

tomolopolis and others added 24 commits October 5, 2022 18:20
* CU-8677ge6j8 Version identification and updating (CogStack#313)

* Expose example model card version in metadata test

* Add version detection along with tests

* Move to a more comprehensive version string parser (regex)

* Add more comprehensive versioning tests

* Move MedCAT unzip to a separate method

* Separate getting semantic version from string

* Add new CDB with version information and use that with versioning tests

* Add methods to get version info from CDB dump and model pack zip/folder

* Exposing CDB file name and adding custom dev patch version support

* Fix config.linking.filters.cuis - from empty dict to empty set

* Add logging to versioning

* Fix f-strings instead of (intended) r-strings

* Add creating model pack archive to versioning CDB fix

* Fix logger initialising

* Making versioning a runnable module that allows fixing the config

* Add docstrings to CLI methods

* CU-8677ge6j8 Make explicit check regards to empty dict when fixing config

* CU-8677ge6j8 Add tests regarding versioning changes

* CU-8677ge6j8 Add missing return type hint

* CU-8677ge6j8 Simplify action handling for CLI input

* CU-8677ge6j8 Simplifying archive making method

* Pin down transformers for the de-identification model (CogStack#314)

* NO-TICKET pin down transformers for the de-id model

* Added function to remove CUI from cdb (CogStack#316)

* Added function to remove CUI from cdb

* Unit test for remove_cui

* CU-862jjprjw Fix github actions failures (CogStack#317)

* Added function to remove CUI from cdb

---------

Co-authored-by: antsh3k <[email protected]>

* CU-862jr8wkk Pin pydantic dependency to avoid conflicts with v2.0 (CogStack#318)

* Bump django from 3.2.18 to 3.2.19 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.18 to 3.2.19.
- [Commits](django/django@3.2.18...3.2.19)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* CU-863gntc58 Umlspt2ch (CogStack#322)

* CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing

* CU-863gntc58 Only use ISA relationships

* Make sure parents do not have themselves as children

* CU-863gntc58 Only keep preferred names

* CU-863gntc58 Fix typing issues

* CU-863gntc58 Fix child-parent relationships being saved instea

* Better system for avoiding parent-child being the same

* Fix for Issue 325 (CogStack#326)

* Issue-325 Add check for old/new spacy; fix code for nested entities

* Issue-325 Fix a typing issue

* Issue-325 Improve nested entity extraction in _doc_to_out; add type hint for individual entities

* Issue-325 Remove unnecessary whitespace

* Issue-325 Move spacy version detection from cat to utils.helpers

* CU-86783u6d9 Add wrapper to simplify De-ID model usage (CogStack#324)

* CU-2wgnqg5 Add javadoc to a method

* CU-2wgnqg5 Fix issues with typing

* CU-2wgnqg5 Add (potential) progress bar to regression testing

* CU-2wgnqg5 Add runnable regression checker with command line arguments

* CU-2wgnqg5 Add better help message for a CLI argument

* CU-2wgnqg5 Fix import to use proper namespace

* CU-2wgnqg5 Add parent-child functionality for filters

* CU-2wgnqg5 Add cui and children option to the config example

* Revert "CU-2wgnqg5 Fix import to use proper namespace"

This reverts commit 882be44.

* CU-2wgnqg5 Add default / empty children to translation layer

* CU-2wgnqg5 Remove use of deprecated warning method

* CU-2wgnqg5 Add new default test case that checks for 'heart rate' and its children 4 deep

* CU-2wgnqg5 Remove unnecessary TODO comment

* CU-2wgnqg5 Add possibility of using result reporting for regression checks

* CU-2wgnqg5 Fix issue with delegations not shown for reports

* CU-2wgnqg5 Add possibility of using reports for CLI regression testing

* CU-2wgnqg5 Fix minor typing issues

* CU-2wgnqg5 Fix typo in default regression config

* CU-2wgnqg5 Make sure imports work both when running directly as well as when using as part of the project

* CU-2wgnqg5 Add a new test case with the ANY strategy

* CU-2wgnqg5 Fixing imports so that absolute imports are used

* CU-2wgnqg5 Add new package to setup.py

* CU-2wgnqg5 Fix typing issues

* CU-2wgnqg5 Fix report output formatting

* CU-2vzhd93 Remove logging tutorials (move to MedCATtutorials)

* CU-2wgnqg5 Move to a simpler filter design

* CU-2wgnqg5 Add (optional) per-phrase results to results/reporting

* CU-2wgnqg5 Add per-phrase information toggle to CLI

* CU-2wgnqg5 Fix method signature changes between inherited classes

* CU-2q50k3c: add contact email address.

* added latest release news / accepted paper

* Update README.md

* CU-2zj4czk Move to a class based linking filter approach

* CU-2zj4czk Move to identifier based linking filter access

* CU-2zj4czk Use MCT filters when training supervised

* New UMLS Full Model

* CU-2zj4czk Make sure excluded CUIs are always specified (even if by an empty set)

* CU-2zj4czk Add possibility of creating a copy of linking filters

* CU-2zj4czk Use copies of linking.filters in train_supervised and _print_stats

* CU-2zj4czk Add linking.filters merging functionality

* CU-2zj4czk Add parameter to retain MCT filters within train_supervised

* CU-2zj4czk Rename filters variable within print_stats method for better consistency and readability

* CU-2zj4czk Consolidate some duplicate code between train_supervised and _print_stats

* CU-2zj4czk Fix multi-project detection

* CU-2zj4czk Fix linking filter merging

* CU-2zj4czk Add tests for retaining filters from MCT along with a test-trainer export

* CU-2zj4czk Remove debug print outputs from some tests

* CU-2wgnqg5 Separate some of the regression code into different modules

* Add URL of paper for Dutch model (CogStack#275)

* CU-2wgnqg5 Add serialisation code along with tests

* CU-2wgnqg5 Fix regression checker and case serialisation and add tests

* CU-2wgnqg5 Add conversion code from MCT export to regression YAML along with tests

* CU-2wgnqg5 Fix minor import and typing issues

* CU-2wgnqg5 Add runnable to convert from MedCATtrainer to regression YAML

* CU-2wgnqg5 Add for number of cases read from MCT export

* CU-2wgnqg5 Add context selectors for conversion from MCT

* CU-2wgnqg5 Add use of context selector to converter

* CU-2wgnqg5 Add use of context selector to runnable

* CU-2wgnqg5 Fix issue with typing

* CU-2wgnqg5 Add regression case based progress bar in case the total of sub-cases is unknown

* CU-2wgnqg5 Make sure (and test) that only 1 replacement '%s' is in each phrase for regression tests

* CU-2wgnqg5 Add test cases for '%' replacement in context and some minor optimisation

* CU-2wgnqg5 Add option to not show empty cases in report

* CU-2wgnqg5 Fix verbose output mode/logging

* CU-2wgnqg5 Fix name clashes in test cases

* CU-2wgnqg5 Make conversion filter for both CUI and NAME

* CU-2wgnqg5 Use different approach for generating targets for regression cases

* CU-2wgnqg5 Add warning when no parent-child information is present (but continue to run)

* Fix issue with typing

* Add TODO comment regarding more comprehensive reporting

* Fix whitespace issue

* CU-2wgnqg5 Translation layer now able to confirm if a set of CUIs has a parent or child of a specified one

* CU-2wgnqg5 Add reasons for failure of a regression case

* CU-2wgnqg5 Make hiding failures a possibility from the CLI

* CU-2wgnqg5 Use better report output for failures with summary

* CU-2wgnqg5 Fix typing issues

* CU-2wgnqg5 Add description to failed cases where applicable

* CU-2wgnqg5 Fix successes not being reported on

* CU-2wgnqg5 Rename some fail reasons for better readability

* CU-2wgnqg5 Add test cases for specified CUI and name if/when none are found from the CDB

* CU-2wgnqg5 Add extra information (names) in case of failure because name not in CDB

* CU-2wgnqg5 Make converter consolidate different test cases with identical filters (CUI and name) into one with multiple phrases

* CU-2wgnqg5 Remove use of TargetInfo and using a tuple instead

* CU-2wgnqg5 Fix remnant targetinfo

* CU-2wgnqg5 Fix remnant targetinfo stuff

* CU-2wgnqg5 Fix remnant targetinfo in docstrings

* CU-2wgnqg5 Fix missing argument in docstrings

* CU-2wgnqg5 Allow only reports in regression checker

* CU-2wgnqg5 Add medcat.utils.regression level parent logger

* CU-2wgnqg5 Use medcat.utils.regression parent logger for verbose output in regression checker

* CU-2wgnqg5 Move from logger.warn to logger.warning

* CU-2wgnqg5 Fix issue with wrong targets being generated

* CU-2wgnqg5 Fix checking tests

* CU-2wgnqg5 Add dunder init to test (utils) packages to make the tests within discoverable

* CU-2wgnqg5 Fix serialisation tests (add missing argument)

* CU-2wgnqg5 Fix regression results tests (change method owner)

* CU-2wgnqg5 Fix regression results tests (make names ordered)

* CU-2wgnqg5 Remove unnecessary print output in test

* CU-2wgnqg5 Update conversion code to not use target info

* CU-2wgnqg5 Attempt to fix automated build on github actions (pin sklearn version)

* CU-2wgnqg5 Move from sklearn to scikit-learn dependency

* CU-2wgnqg5 Separate some code in converting, add docs

* CU-2wgnqg5 Make yaml dumping safe for yaml representation of regression checker

* CU-2wgnqg5 Add initial editing code with some simple tests

* CU-2wgnqg5 Add possibility for combinations to ignore identicals

* CU-2wgnqg5 Add docs to the editing/combining methods

* CU-2wgnqg5 Add runnable python file for combining different regression YAMLs

* CU-2wgnqg5 Minor codebase improvements

* CU-2wgnqg5 Make FailReasons serializable

* CU-2wgnqg5 Add json output to regression checking

* Make stats reporting not have np.nan values on empty train count  (CogStack#277)

* CU-327vb66 make stats reporting not have np.nan values on empty train count
* CU-327vb66 start using scikit-learn instead of deprecated sklearn

* Bump django from 3.2.15 to 3.2.16 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.15 to 3.2.16.
- [Release notes](https://github.com/django/django/releases)
- [Commits](django/django@3.2.15...3.2.16)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update ReadMe.md to show Licence change

Updated News Section

* CU-2wgnqg5 Add docstring to fail descriptor getter method

* CU-2wgnqg5 Removed handled TODO

* CU-33g09h4 Make strides towards PEP 257. Make all docstrings use triple double quotes; remove preceding whitespace from docstrings; remove raw-string docstrings where applicable; remove empty docstrings

* CU-2zj4czk Add documentation regarding config.linking.filters

* CU-2zj4czk Add test for leakage of extra_cui_filters

* CU-33g09h4 Remove leftover whitespace from start of docstring

* include joblib dep

* CU-2zj4czk Add parameter to retain extra_cui_filters (instead of MCT filters). Make sure tests pass.

* CU-33g09h4 Some docstring unification for config(s)

* CU-33g09h4 Some docstring unification for pipe, meta_cat and vocab

* CU-33g09h4 Some docstring unification for cdb

* CU-33g09h4 Some docstring unification for cdb maker

* CU-33g09h4 Some docstring unification for cdb and maker (Return: to Returns:)

* CU-33g09h4 Some docstring unification for cat

* CU-33g09h4 Fix typo in docstring

* CU-33g09h4 Some docstring unification for utils

* CU-33g09h4 Some docstring unification for tokenizers

* CU-33g09h4 Some docstring unification for preprocessors

* CU-33g09h4 Some docstring unification for NER parts

* CU-33g09h4 Some docstring unification for NEO parts

* CU-33g09h4 Some docstring unification for linking parts

* CU-33g09h4 Some docstring unification for cogstack connection part

* CU-33g09h4 Remove some leftover backticks from docstring types

* CU-33g09h4 Remove some leftover 'Return:' -> 'Returns:' changes

* CU-33g09h4 Fix typo in a return type name

* CU-384mewq match post release branches in the production workflow (CogStack#283)

* CU-346mpxm Add new JSON based (faster) serialization for CDB along with tests

* CU-346mpxm Add new package to setup.py; add logger and docstrings to serializer; remove dead code and comments

* CU-346mpxm Remove leftover code; Fix type safety regarding optional json path

* CU-346mpxm Add logging on writing to serializer

* CU-346mpxm Add logging on reading to serializer

* CU-346mpxm Make deserializing consistent with previous CDB deserialising

* CU-346mpxm Add JSON serialisation to CDB

* CU-346mpxm Remove issue with circular imports

* CU-346mpxm Make sure json files end with .json

* CU-346mpxm Add json type format to modelpack creation

* CU-346mpxm Add tests for json format modelpack creation

* CU-346mpxm Add logging output to model pack creation and loading

* CU-346mpxm Add model pack converter / runnable

* Update README.md

* CU-862hyd5wx Unify rosalind/vocab downloading in tests, identify and fail meaningfully in case of 503

* CU-862hyd5wx Remove unused imports in tests due to last commit

* CU-862hyd5wx Add possibility of generating and using a simple vocab when Rosalind is down

* CU-862hyd5wx Fix small typo in tests

* Loosen dependency restrictions (CogStack#289)

Signed-off-by: zethson <[email protected]>

Signed-off-by: zethson <[email protected]>

* bug found in snomed2OPCS func

* markdown improvements

* Mapping icd10 and opcs complete

* get all children func added

* pep8 fixes

* Update README.md

* Add confusion matrix to meta model evaluation

* CU-862j0jcdu / CU-862j0jd2n Cdb json (CogStack#295)

* CU-862j0jcdu Rename format parameter in model creation to specify it only applies to the CDB

* CU-862j0jd2n Add addl_info to be JSON serialised when required

* CU-862j0jd2n Add addl_info to docstring of CDB serializer

* CU-38g55wn / CU-39cmv82 Support for python3.11 (and 3.10) (CogStack#285)

* CU-38g55wn Move dependencies to (hopefully) support python 3.11 on Ubuntu

* CU-38g55wn Attempt to fix dependencies for github dependency (gensim)

* CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x2

* CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x3

* CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x4

* CU-38g55wn Attempt to fix dependencies for github dependency (gensim) x5 - fix missing comma

* CU-38g55wn Remove erroneous package from setup.py

* CU-38g55wn Bump spacy version so as to (hopefully) fix pydantic issues

* CU-38g55wn Bump spacy en_core_web_md version so as to (hopefully) fix requirements issues

* CU-38g55wn Fix test typo that was fixed on newer en_core_web_md

* CU-38g55wn Fix small issue in NER test

* CU-38g55wn Fix small issue with NER test (int conversion)

* CU-38g55wn Mark some places as ignore where newer mypy complains

* CU-38g55wn Bump mypy dev requirement version

* CU-38g55wn Add python 3.11 and 3.10 to workflow

* CU-38g55wn Trying to install gensim over https rather than ssh

* CU-38g55wn Make python versions strings in GH workflow so 3.10 doesn't get 'rounded' to 3.1 when read

* CU-38g55wn Remove python 3.7 from workflow since it's not compatible with required versions of numpy and scipy

* CU-38g55wn Universally fixing NER test regarding the 'movar~viruse' -> 'movar~virus' thing

* CU-38g55wn Bump gensim version to 4.3.0 - the first to support 3.11

* CU-862hyd5wx Unify rosalind/vocab downloading in tests, identify and fail meaningfully in case of 503

* CU-862hyd5wx Remove unused imports in tests due to last commit

* CU-862hyd5wx Add possibility of generating and using a simple vocab when Rosalind is down

* CU-862hyd5wx Remove python 3.7 and add 3.10/3.11 to classifiers

* CU-862hyd5wx Reorder python versions in GitHub workflow

* CU-862hyd5wx Attempt to fix GHA by importing unittest.mock explicitly

* CU-39cmvru Faster hashing (CogStack#286)

* CU-39cmvru Add marking of CDB dirty if/when concepts change. Avoid calculating its hash separately if it hasn't been dirtied. Add tests to verify behaviour.

* CU-39cmvru Add possibility to force recalculation of hash for CDB (including when getting hash for CAT)

* CU-39cmvru Add possibility to force recalculation of hash for CDB through modelcat creation (new parameter, propagating through _versioning)

* CU-39cmvru Remove previous hash from influencing hashing of CDB to produce consistent hash on every recalculation. Add tests to make sure that is the case on the CDB level as well as the CAT/modelpack level.

* CU-39cmvru Add logging around the (re)calculation of the CDB hash

* CU-39cmvru Fix typo in log message

* CU-39cmvru Add test to make sure the CDB hash is saved to disk and loaded from disk

* CU-39cmvru Add possibility to calculate hash upon saving of CDB if/when the hash is unknown (i.e when saving outside a model pack)

* CU-39cmvru Add CDB dirty flag to all other methods that modify the CDB

* Change confusion matrix to DF and add labels

* Fix model config

* CU-86777ey74 No elastic dependency (CogStack#298)

* Removed elastic dependency

* CU-86777ey74 Remove module that depends on elastic (cogstack/cogstack_conn)

* CU-86777ey74 Remove medcat.cogstack package from setup.py packages

* Docstring updated to google-style docstring

* CU-2e77a2k Remove unused utility modules

* CU-2e77a2k Remove deprecated utils

* Bump django from 3.2.16 to 3.2.17 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.16 to 3.2.17.
- [Release notes](https://github.com/django/django/releases)
- [Commits](django/django@3.2.16...3.2.17)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* CU-33g0f3w Read the docs build failures (CogStack#306)

* CU-33g0f3w Pin aiohttp dependency version for docs

* CU-33g0f3w Pin aiohttp dependency version for docs (CogStack#303)

* CU-33g0f3w Pin aiohttp dependency version for docs in setup.py

* Read the docs build failures (CogStack#304)

* CU-33g0f3w Pin aiohttp dependency version for docs

* CU-33g0f3w Pin aiohttp dependency version for docs in setup.py

* CU-33g0f3w Pin blis dependency version for docs in setup.py

* Add options for loading meta models and additional NERs (CogStack#300)

* CU-8677aud63 add options for loading meta models and addl NERs
* CU-8677aud63 reduce memory usage during test

* Style fix

* NO-TICKET reduce the false positives on pushing to test pypi (CogStack#307)

* CU-862j5by9q Regression touchup - metadata and ability to split suites into categories (CogStack#301)

* CU-862j5by9q Add metadata to regression suite, loaded from model card if/when specified. A model can be specified upon creation to get the model card from.

* CU-862j5by9q Remove f-string from string with no placeholders

* CU-862j5by9q Make regression case hashable

* CU-862j5by9q Add category separation to regression test suite along with automated tests and test example

* CU-862j5by9q Add missing docstrings to category separation

* CU-862j5by9q Add saving to category separator and a convenience method for separation based on regression test YAML file and categories YAML file

* CU-862j5by9q Add missing docstrings to new methods

* CU-862j5by9q Fix typo in class name

* CU-862j5by9q Fix saving issue for separation results

* CU-862j5by9q Add runnable category separator

* CU-862j5by9q Separate some file location constants in separation tests

* CU-862j5by9q Add test for separation that checks that no information gets lost (in the specific situation)

* CU-862j5by9q Add an anything-goes category description

* CU-862j5by9q Fix anything-goes option

* CU-862j5by9q Add tests for anything-goes category description

* CU-862j5by9q Add possibility of using an overflow category when separating regression suite

* CU-862j5by9q Add use of the overflow category to the runnable

* CU-862j5by9q Fix linting and typing issues

* CU-862j5by9q Add test for each individual separated suite

* CU-862j5by9q Fix minor abstract class issues

* CU-862j5by9q Rename categoryseparation module as category_separation

* CU-862j5by9q Add docstrings to category_separator

* CU-8677craqe make transformer_ner continue processing other entities after the first non-matching

* Bump django from 3.2.17 to 3.2.18 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.17 to 3.2.18.
- [Release notes](https://github.com/django/django/releases)
- [Commits](django/django@3.2.17...3.2.18)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* CU-862j7b9jc Mypy full release - 1.0.0 (CogStack#308)

* CU-862j7b9jc Add abstract base class to regression converting strategy where necessary

* CU-862j7b9jc Bump mypy to version 1.0.0

* CU-862j7b9jc Mypy abc hotfix (CogStack#311)

* CU-862j7b9jc Fix issue with duplicate imports

* CU-862j7b9jc Fix issue with no whitespace after keyword (E275)

* CU-862j7b9jc Remove unnecessary brackets from if statement

* CU-8677ge6j8 Version identification and updating (CogStack#313)

* Expose example model card version in metadata test

* Add version detection along with tests

* Move to a more comprehensive version string parser (regex)

* Add more comprehensive versioning tests

* Move MedCAT unzip to a separate method

* Separate getting semantic version from string

* Add new CDB with version information and use that with versioning tests

* Add methods to get version info from CDB dump and model pack zip/folder

* Exposing CDB file name and adding custom dev patch version support

* Fix config.linking.filters.cuis - from empty dict to empty set

* Add logging to versioning

* Fix f-strings instead of (intended) r-strings

* Add creating model pack archive to versioning CDB fix

* Fix logger initialising

* Making versioning a runnable module that allows fixing the config

* Add docstrings to CLI methods

* CU-8677ge6j8 Make explicit check regards to empty dict when fixing config

* CU-8677ge6j8 Add tests regarding versioning changes

* CU-8677ge6j8 Add missing return type hint

* CU-8677ge6j8 Simplify action handling for CLI input

* CU-8677ge6j8 Simplifying archive making method

* Pin down transformers for the de-identification model (CogStack#314)

* NO-TICKET pin down transformers for the de-id model

* Added function to remove CUI from cdb (CogStack#316)

* Added function to remove CUI from cdb

* Unit test for remove_cui

* CU-862jjprjw Fix github actions failures (CogStack#317)

* Added function to remove CUI from cdb

---------

Co-authored-by: antsh3k <[email protected]>

* CU-862jr8wkk Pin pydantic dependency to avoid conflicts with v2.0 (CogStack#318)

* Bump django from 3.2.18 to 3.2.19 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.18 to 3.2.19.
- [Commits](django/django@3.2.18...3.2.19)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* CU-863gntc58 Umlspt2ch (CogStack#322)

* CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing

* CU-863gntc58 Only use ISA relationships

* Make sure parents do not have themselves as children

* CU-863gntc58 Only keep preferred names

* CU-863gntc58 Fix typing issues

* CU-863gntc58 Fix child-parent relationships being saved instea

* Better system for avoiding parent-child being the same

* CU-86783u6d9 Add wrapper to simplify De-ID model usage

* CU-86783u6d9 Add wrapper to simplify De-ID model usage

* CU-86783u6d9 Fix typo (nod vs not)

* CU-86783u6d9 Fix typo in docstring

* CU-86783u6d9 Change loading method name to match CAT

* CU-86783u6d9 Separate NER model from DeID model

* Better separation of NER models from DeID models

* CU-86783u6d9 Move deid method from helpers module to deid model and deprecated the use of the wrappers in the helpers module

* Fix imports in deid model

* Fix deid training method return value

* CU-86783u6d9 Fix dunder call defaults for redaction

* CU-86783u6d9 Add a few simple tests for the DeID model

* CU-86783u6d9 Add redaction test for the DeID model

* CU-86783u6d9 Add remove sensitive data

* CU-86783u6d9 Fix deid model validation

* CU-86783u6d9 Add ChatGPT generated DeId train data

* CU-86783u6d9 Add Warning regarding deid training data

* CU-86783u6d9 Fix model issue with multiple NER models

* CU-86783u6d9 Fix merge conflict in docstring

* CU-86783u6d9 Try and fix keyword argument duplication

* CU-86783u6d9 Ignore mypy where needed

* CU-86783u6d9 Fix issue with NER model being returned when loading a DeID model

* CU-86783u6d9 Remove unused import

* CU-86783u6d9 Update training data with some more examples

* CU-86783u6d9 Add type hints and doc string to deid method

* CU-86783u6d9 Add comment regarding deid_text method being outside the model class

* CU-86783u6d9 Add missing return type

* CU-86783u6d9 Expose get_entities in NER model

* CU-86783u6d9 Expose dunder call in NER model

* CU-86783u6d9 Remove dunder call in override in deid model

* CU-86783u6d9 Fix deid model tests

* CU-86783u6d9 Fix a few typos in docstrings

* CU-86783u6d9 Fix a method name in docstrings

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: zethson <[email protected]>
Co-authored-by: tomolopolis <[email protected]>
Co-authored-by: Zeljko <[email protected]>
Co-authored-by: Sander Tan <[email protected]>
Co-authored-by: Xi Bai <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Anthony Shek <[email protected]>
Co-authored-by: Lukas Heumos <[email protected]>
Co-authored-by: antsh3k <[email protected]>
Co-authored-by: James Brandreth <[email protected]>
Co-authored-by: Xi Bai <[email protected]>

* CU-862k1tt90 Fix circular imports by moving raw deid method back to helpers module (CogStack#328)

* CU-862k1tt90 Fix circular imports by moving raw deid method back to helpers module

* CU-862k1tt90 Fix missing import regarding deid

* CU-862k1tt90 Remove unnecessary newline

* Cu 863h30jyb separate train from data load (CogStack#329)

* CU-863h30jyb Deprecated train_supervised method in favour of train_supervised_from_json method

* CU-863h30jyb Shuffle around docstrings for supervised training methods

* CU-863h30jyb Create new train_supervised_raw method for raw data based training

* CU-863h30jyb In MetaCat deprecate train method and replace with train_from_json method

* CU-863h30jyb In MetaCat add train_raw method and move most of the training logic into that one

* CU-863h30jyb Fix type hint

* CU-86785yhfk Add method to populate cui2snames with data from cui2names (CogStack#327)

* CU-86785yhfk Add method to populate cui2snames with data from cui2names

* CU-86785yhfk Add test for cui2sname population method

* Bump django from 3.2.19 to 3.2.20 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.19 to 3.2.20.
- [Commits](django/django@3.2.19...3.2.20)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* CU-346mpwz Improving memory usage of MedCAT models (CogStack#323)

* CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing

* CU-863gntc58 Only use ISA relationships

* Make sure parents do not have themselves as children

* CU-863gntc58 Only keep preferred names

* CU-346mpwz Add memory optimiser for CDB

* CU-346mpwz Add name2<stuff> to memory optimiser for CDB

* CU-346mpwz Add keys/items/values views to memory optimiser fake dicts

* CU-346mpwz Fix keys/items/values views in memory optimiser fake dicts

* CU-346mpwz Add option to optimise or not cui and/or name based dicts in memory optimiser

* CU-346mpwz Make default memory optimiser omit name2... optimising; add comment regarding this in docstring

* CU-346mpwz Remove unused/legacy code from memory optimiser

* CU-346mpwz Add tests for memory optimiser

* CU-346mpwz Add tests memory optimised CDB

* CU-346mpwz Make dict names available within memory optimiser

* CU-346mpwz Add separate tests for memory optimised CDB

* CU-346mpwz Remove unused imports in memory optimiser

* CU-346mpwz Move some encoding and decoding stuff within serialisation to their own module

* CU-346mpwz Add tests for encoding/decoding stuff

* CU-346mpwz Add encoding/decoding for delegating dict as well as postprocessing for delegation linking with json serialisation

* CU-346mpwz Fix decision upon JSON deserialisation of CDB when loading model pack

* CU-346mpwz Adapt serialisation tests to the potential one2many mappings

* CU-346mpwz Add tests for memory optimisation, including JSON serialisation ones

* CU-346mpwz Remove debug print statements

* CU-346mpwz Remove debug methods from tests

* CU-346mpwz Fix method signatures in encoding/decoding methods

* CU-346mpwz Fix typing issue in serialiser when passing encoder

* CU-346mpwz Relax typing restrictions for umls preprocessing / parent2child mapping

* CU-346mpwz Remove some debug variables

* CU-346mpwz Fix remnant merge conflict

* CU-346mpwz Add item removal and popping to delegating dict

* CU-346mpwz Add item removal and popping tests to delegating dict

* CU-346mpwz Add item adding/setting tests to delegating dict

* CU-346mpwz Fix typing issue (List vs list)

* CU-346mpwz Add possibility of memory-optimising for snames as well

* CU-346mpwz Add comment regarding memory-optimising for filtering by CUI to CDB

* CU-346mpwz Add sname based memory optimisation tests

* CU-346mpwz Add json serialisation capabilities to snames delegation

* CU-346mpwz Make sname optimisation default for memory optimisation

* CU-346mpwz Fix typo in serialisation tests

* CU-346mpwz Add variable to keep track of current memory optimisation info to CDB

* CU-346mpwz Add default cui2snames to sname optimisations; make sure sname optimisation dirties the CDB

* CU-346mpwz Add method to undo CDB memory optimisation

* CU-346mpwz Add tests for undoing CDB memory optimisation

* CU-346mpwz Clear memory optimised parts if/when undoing optimisations

* CU-346mpwz Remove accidentally added file/module

* CU-346mpwz Add more straight forward optimisation part names; Fix memory optimisation part clearing

* CU-346mpwz Add further tests for memory optimisation (dirty state, checking optimised parts)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: zethson <[email protected]>
Co-authored-by: Xi Bai <[email protected]>
Co-authored-by: Anthony Shek <[email protected]>
Co-authored-by: antsh3k <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tomolopolis <[email protected]>
Co-authored-by: Zeljko <[email protected]>
Co-authored-by: Sander Tan <[email protected]>
Co-authored-by: Lukas Heumos <[email protected]>
Co-authored-by: James Brandreth <[email protected]>
Co-authored-by: Xi Bai <[email protected]>
* remove bad merge <p> element

* CU-8692kpchc Fix for Rosalind link not working (CogStack#342)

* CU-8692kpchc Add the 403 exception to vocab downloader

* CU-8692kpchc Add the new vocab download link

* Add missing self argument (CogStack#343)

To `_refset_df2dict` method in Snomed preprocessing

* CU-8692kn0yv Fix issue with fake dict in identifier based config

More specifically, the `get` method, which was not able to return default values for non-existent keys (CogStack#341)
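
A minimal sketch of the kind of behaviour this fix restores, assuming a hypothetical attribute-backed `FakeDict` base class and a hypothetical `LinkingSketch` section (neither is taken from the actual MedCAT code):

```python
from typing import Any


class FakeDict:
    """Hypothetical attribute-backed object that mimics part of the dict API."""

    def __getitem__(self, key: str) -> Any:
        try:
            return getattr(self, key)
        except AttributeError as err:
            # Translate to KeyError so callers see dict-like behaviour
            raise KeyError(key) from err

    def get(self, key: str, default: Any = None) -> Any:
        # Return the default instead of raising for non-existent keys
        try:
            return self[key]
        except KeyError:
            return default


class LinkingSketch(FakeDict):
    """Hypothetical config section used only for illustration."""

    def __init__(self) -> None:
        self.similarity_threshold = 0.25
```

With this shape, `LinkingSketch().get("missing", 3)` falls back to the default rather than raising.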

* CU-8692mevx8 Fix issue with filters not taking effect in train_supervised method (CogStack#345)

* CU-8692mevx8 Fix issue with filters not taking effect in train_supervised method

* CU-8692mevx8 Fix filter retention in train_supervised method

---------

Co-authored-by: tomolopolis <[email protected]>
* Bump urllib3 from 1.26.5 to 1.26.17 in /webapp/webapp

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.5 to 1.26.17.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.5...1.26.17)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Cu 8692wbcq5 docs builds (CogStack#359)

* CU-8692wbcq5: Pin max version of numpy

* CU-8692wbcq5: Pin max version of numpy in setup.py

* CU-8692wbcq5: Bump python version for readthedocs workflow

* CU-8692wbcq5: Pin all requirement versions in docs requirements

* CU-8692wbcq5: Move docs requirements before setuptools

* CU-8692wbcq5: Fix typo in docs requirements

* CU-8692wbcq5: Remove some less relevant stuff from docs requirements

* CU-8692wbcq5: Add back sphinx-based requirements to docs requirements

* CU-8692wbcq5: Move back to python 3.9 on docs build workflow

* CU-8692wbcq5: Bump sphinx-autoapi version

* CU-8692wbcq5: Bump sphinx version

* CU-8692wbcq5: Bump python version back to 3.10 for future-proofing

* CU-8692wbcq5: Undo pinning numpy to max version in setup.py

* CU-8692wbcq5: Remove docs-build specific dependencies in setup.py

* Bump urllib3 from 1.26.17 to 1.26.18 in /webapp/webapp

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.17 to 1.26.18.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.17...1.26.18)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* CU-8692uznvd: Allow empty-dict config.linking.filters.cuis and convert to set in memory (CogStack#352)

* CU-8692uznvd: Allow empty-dict config.linking.filters.cuis and convert to set in memory

* CU-8692uznvd: Move the empty-set detection and conversion from validator to init

* CU-8692uznvd: Remove unused import
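
The conversion described in these commits could look roughly like the sketch below. The helper name `normalise_cuis_filter` is hypothetical, and rejecting a non-empty dict is an assumption, not something the commit messages state:

```python
from typing import Any, Set


def normalise_cuis_filter(value: Any) -> Set[str]:
    """Hypothetical helper: accept a legacy empty dict and return a set."""
    if isinstance(value, dict):
        if value:
            # Assumption: only the legacy *empty* dict is tolerated
            raise ValueError("Only an empty dict can be converted to a set")
        return set()
    # Any other iterable of CUIs is simply materialised as a set
    return set(value)
```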

* CU-8692t3fdf separate config on save (CogStack#350)

* CU-8692t3fdf Move saving config outside of the cdb.dat; Add test to make sure the config does not get saved with the CDB; patch a few existing tests

* CU-8692t3fdf Use class methods on class instead of instance in a few tests

* CU-8692t3fdf Fix typing issue

* CU-8692t3fdf Add additional tests for 2 configs and zero configs when loading model pack

* CU-8692t3fdf: Make sure CDB is linked to the correct config; Treat incorrect configs as dirty CDBs and force a recalc of the hash

* CU-2cdpd4t: Unify default addl_info in different methods. (CogStack#363)

* Bump django from 3.2.20 to 3.2.23 in /webapp/webapp

Bumps [django](https://github.com/django/django) from 3.2.20 to 3.2.23.
- [Commits](django/django@3.2.20...3.2.23)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Changing cdb.add_concept to a protected method

* Re-added deprecated method with deprecated flag and additional comments

* Initial commit for merge_cdb method

* Added indentation to make merge_cdb a class method

* fixed syntax issues

* more lint fixes

* more lint fixes

* bug fixes of merge_cdb

* removed print statements

* CU-86931prq4: Update GHA versions (checkout and setup-python) to v4 (CogStack#368)

* Cu 1yn0v9e duplicate multiprocessing methods (CogStack#364)

* CU-1yn0v9e: Rename and deprecate one of the multiprocessing methods;

Add docstring. Trying to be more explicit regarding usage and differences between different methods

* CU-1yn0v9e: Rename and deprecate the multiprocessing_pipe method;

Add docstring. Trying to be more explicit regarding usage and differences between different methods

* CU-1yn0v9e: Fix typo in docstring; more consistent naming

* 869377m3u: Add comment regarding demo link load times to README (CogStack#376)

* intermediate changes of merge_cdb and testing function

* Added README.md documentation for CPU only installations (CogStack#365)

* changed README.md to reflect installation options.

* added setup script to demonstrate how wrapper could look for CPU installations

* removed setup.sh as unnecessary for cpu only builds

* Initial commit for merge_cdb method

* Added indentation to make merge_cdb a class method

* fixed syntax issues

* more lint fixes

* more lint fixes

* bug fixes of merge_cdb

* removed print statements

* Added commentary on disk space usage of pytorch-gpu

* removed merge_cdb from branch

---------

Co-authored-by: adam-sutton-1992 <[email protected]>

* Cu 8692zguyq no preferred name (CogStack#367)

* CU-8692zguyq: Slight simplification of minimum-name-length logic

* CU-8692zguyq: Add some tests for prepare_name preprocessor

* CU-8692zguyq: Add warning if no preferred name was added along with a new CUI

* CU-8692zguyq: Add additional warning messages when adding/training a new CUI with no preferred name

* CU-8692zguyq: Make no preferred name warnings only run if name status is preferred

* CU-8692zguyq: Add tests for no-preferred name warnings

* CU-8692zguyq: Add Vocab.make_unigram_table to CAT tests

* CU-8692zguyq: Move to built in asserting for logging instead of patching the method

* CU-8692zguyq: Add workaround for assertNoLogs on python 3.8 and 3.9

* Add trainer callbacks for Transformer NER (CogStack#377)

CU-86938vf30 add trainer callbacks for Transformer NER

* changes to merge_cdb and adding unit tests for method

* fixing lint issues

* fixing flake8 linting

* bug fixes, additional tests, and more documentation

* moved set up of cdbs to be merged to tests.helper

* moved merge_cdb to utils and created test_cdb_utils

* removed class wrapper in cdb utils and fixed class set up in tests

* changed test object setup to class setup

* removed erroneous new line

* CU-2e77a31 improve print stats (CogStack#366)

* Add base class for CAT

* Add CDB base class

* Some whitespace fixes for base modules

* CU-2e77a31: Move print stats to their own module and class

* CU-2e77a31: Fix issues introduced by moving print stats

* CU-2e77a31: Rename print_stats to get_stats and add option to avoid printed output

* CU-2e77a31: Add test for print_stats

* CU-2e77a31: Remove unused import

* CU-2e77a31: Add new package to setup.py

* CU-2e77a31: Fix a bunch of typing issues

* CU-2e77a31: Revert CAT and CDB abstraction

* Load stopwords in Defaults before spacy model

* CU-8693az82g Remove cdb tests side effects (CogStack#380)

* 8693az82g: Add method to CDBMaker to reset the CDB

* 8693az82g: Add test in CDB tests to ensure a new CDB is used for each test

* 8693az82g: Reset CDB in CDB tests before each test to avoid side effects

* Added tests

* CU-8693bpq82 fallback spacy model (CogStack#384)

* CU-8693bpq82: Add fallback spacy model along with test

* CU-8693bpq82: Remove debug output

* CU-8693bpq82: Add exception info to warning upon spacy model load failure and fallback

* Remove tests of internals where possible

* Add test for skipping of stopwords

* Avoid supporting only English for stopwords

* Remove debug output

* Make sure stopwords language getter works for file-path spacy models

* CU-8693cv3w0 Fix fallback spacy model existence on pip installs (CogStack#386)

* CU-8693cv3w0: Add method to ensure spacy model and use it when falling back to default model

* CU-8693cv3w0: Add logged output when installing/downloading spacy model

* CU-8693b0a61 Add method to get spacy model version (CogStack#381)

* CU-8693b0a61: Add method to find spacy folder in model pack along with some tests

* CU-8693b0a61: Add test for spacy folder finding (full path)

* CU-8693b0a61: Add method for finding spacy model in model pack along with tests

* CU-8693b0a61: Add method for finding current spacy version

* CU-8693b0a61: Add method for getting spacy model version installed

* CU-8693b0a61: Fix getting spacy model folder return path

* CU-8693b0a61: Add method to get name and meta of spacy model based on model pack

* CU-8693b0a61: Add missing fake spacy model meta

* CU-8693b0a61: Add missing docstrings

* CU-8693b0a61: Change name of method for clarity

* CU-8693b0a61: Add method to get spacy model name and version from model pack path

* CU-8693b0a61: Fix a few typing issues

* CU-8693b0a61: Add a missing docstring

* CU-8693b0a61: Match folder name of fake spacy model to its name

* CU-8693b0a61: Make the final method return true name of spacy model instead of folder name

* Add additional output to method for getting spacy model version - the compatible spacy versions

* CU-8693b0a61: Add method for querying whether the spacy version is compatible with a range

* CU-8693b0a61: Add better abstraction for spacy version mocking in tests

* CU-8693b0a61: Add some more abstraction for fake model pack in tests

* CU-8693b0a61: Add method for checking whether a model pack has a spacy model compatible with installed spacy version

* CU-8693b0a61: Improve abstraction within tests

* CU-8693b0a61: Add method to check which of two versions is older

* CU-8693b0a61: Fix fake spacy model versioning

* CU-8693b0a61: Add method for determining whether a model pack has semi-compatible spacy model

* CU-8693b0a61: Add missing word in docstring.

* CU-8693b0a61: Change some method to protected ones

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tomolopolis <[email protected]>
Co-authored-by: adam-sutton-1992 <[email protected]>
Co-authored-by: adam-sutton-1992 <[email protected]>
Co-authored-by: Xi Bai <[email protected]>
Co-authored-by: jenniferajiang <[email protected]>
Co-authored-by: Jennifer Jiang <[email protected]>
* CU-8693u6b4u: Make sure failed/errored tests fail the main workflow

* CU-8693u6b4u: Attempt to fix deid multiprocessing, at least for GHA

* CU-8693u6b4u: Fix small docstring issue
* Cu 8693u6b4u tests continue on fail (CogStack#400)

* CU-8693u6b4u: Make sure failed/errored tests fail the main workflow

* CU-8693u6b4u: Attempt to fix deid multiprocessing, at least for GHA

* CU-8693u6b4u: Fix small docstring issue

* CU-8693v3tt6 SNOMED OPCS refset selection (CogStack#402)

* CU-8693v3tt6: Update refset ID for OPCS4 mappings in newer SNOMED releases

* CU-8693v3tt6: Add method to get direct refset mappings

* CU-8693v3tt6: Add tests to direct refset mappings method

* CU-8693v3tt6: Fix OPCS4 refset ID selection logic

* CU-8693v3tt6: Add test for OPCS4 refset ID selection

* CU-8693v6epd: Move typing imports away from pydantic (CogStack#403)

* CU-8693qx9yp Deid chunking - hugging face pipeline approach (CogStack#405)

* Pushing chunking update

* Update transformers_ner.py

* Pushing update to config

Added NER config in cat load function

* Update cat.py

* Updating chunking overlap

* CU-8693qx9yp: Add warning for deid multiprocessing with (potentially) non-functioning chunking window

* CU-8693qx9yp: Fix linting issue

---------

Co-authored-by: mart-r <[email protected]>

---------

Co-authored-by: Shubham Agarwal <[email protected]>
* Pushing changes for bert-style models for MetaCAT

* Pushing fix for LSTM

* Pushing changes for flake8 and type fixes

* Pushing type fixes

* Fixing type issue

* Pushing changes

1) Added model.zero_grad to clear accumulated gradients
2) Fixed config save issue
3) Re-structured data preparation for oversampled data

* Pushing change and type fixes

Pushing ml_utils file which was missed in the last commit

* Fixing flake8 issues

* Pushing flake8 fixes

* Pushing fixes for flake8

* Pushing flake8 fix

* Adding peft to list of libraries

* Pushing changes with load and train workflow and type fixes

The workflow for inference is: load() and inference
For training: init() and train()
Training never loads the model dict, except when phase_number is set to 2, i.e. for the second phase of 2-phase learning

* Pushing changes with type hints and new documentation

* Pushing type fix

* Fixing type issue

* Adding test case for BERT and reverting config changes

BERT test cases: Testing for BERT model along with 2 phase learning

* Merging changes from master to metacat_bert branch (CogStack#431)

* Small addition to contribution guidelines (CogStack#420)

* CU-8694cbcpu: Allow specifying an AU Snomed when preprocessing (CogStack#421)

* CU-8694dpy1c: Return empty generator upon empty stream (CogStack#423)

* CU-8694dpy1c: Return empty generator upon empty stream

* CU-8694dpy1c: Fix empty generator returns

* CU-8694dpy1c: Simplify empty generator returns
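
For illustration, one common way to return an empty generator is a bare `return` inside a generator function. The function `process_stream` below is hypothetical and only demonstrates the Python mechanism, not the actual MedCAT change:

```python
from typing import Iterable, Iterator, Optional


def process_stream(stream: Optional[Iterable[str]]) -> Iterator[str]:
    """Hypothetical generator that yields nothing for a missing/empty stream."""
    if not stream:
        # A bare return in a generator function ends iteration immediately,
        # so the caller still receives a (empty) generator rather than None
        return
    for text in stream:
        yield text.upper()
```

Callers can then iterate over the result unconditionally, without a separate `None` check.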

* Relation extraction (CogStack#173)

* Added files.

* More additions to rel extraction.

* Rel base.

* Update.

* Updates.

* Dependency parsing.

* Updates.

* Added pre-training steps.

* Added training & model utils.

* Cleanup & fixes.

* Update.

* Evaluation updates for pretraining.

* Removed duplicate relation storage.

* Moved RE model file location.

* Structure revisions.

* Added custom config for RE.

* Implemented custom dataset loader for RE.

* More changes.

* Small fix.

* Latest additions to RelCAT (pipe + predictions)

* Setup.py fix.

* RE utils update.

* rel model update.

* rel dataset + tokenizer improvements.

* RelCAT updates.

* RelCAT saving/loading improvements.

* RelCAT saving/loading improvements.

* RelCAT model fixes.

* Attempted gpu learning fix. Dataset label generation fixes.

* Minor train dataset gen fix.

* Minor train dataset gen fix No.2.

* Config updates.

* Gpu support fixes. Added label stats.

* Evaluation stat fixes.

* Cleaned stat output mode during training.

* Build fix.

* removed unused dependencies and fixed code formatting

* Mypy compliance.

* Fixed linting.

* More Gpu mode train fixes.

* Fixed model saving/loading issues when using other base models.

* More fixes to stat evaluation. Added proper CAT integration of RelCAT.

* Setup.py typo fix.

* RelCAT loading fix.

* RelCAT Config changes.

* Type fix. Minor additions to RelCAT model.

* Type fixes.

* Type corrections.

* RelCAT update.

* Type fixes.

* Fixed type issue.

* RelCATConfig: added seed param.

* Adaptations to the new codebase + type fixes.

* Doc/type fixes.

* Fixed input size issue for model.

* Fixed issue(s) with model size and config.

* RelCAT: updated configs to new style.

* RelCAT: removed old refs to logging.

* Fixed GPU training + added extra stat print for train set.

* Type fixes.

* Updated dev requirements.

* Linting.

* Fixed pin_memory issue when training on CPU.

* Updated RelCAT dataset get + default config.

* Updated RelDS generator + default config

* Linting.

* Updated RelDataset + config.

* Pushing updates to model

Made changes to:
1) Extracting given number of context tokens left and right of the entities
2) Extracting hidden state from bert for all the tokens of the entities and performing max pooling on them

* Fixing formatting

* Update rel_dataset.py

* Update rel_dataset.py

* Update rel_dataset.py

* RelCAT: added test resource files.

* RelCAT: Fixed model load/checkpointing.

* RelCAT: updated to pipe spacy doc call.

* RelCAT: added tests.

* Fixed lint/type issues & added rel tag to test DS.

* Fixed ann id to token issue.

* RelCAT: updated test dataset + tests.

* RelCAT: updates to requested changes + dataset improvements.

* RelCAT: updated docs/logs according to comments.

* RelCAT: type fix.

* RelCAT: mct export dataset updates.

* RelCAT: test updates + requested changes p2.

* RelCAT: log for MCT export train.

* Updated docs + split train_test & dataset for benchmarks.

* type fixes.

---------

Co-authored-by: Shubham Agarwal <[email protected]>
Co-authored-by: mart-r <[email protected]>

* CU-8694fae3r: Avoid publishing PyPI release when doing GH pre-releases (CogStack#424)

* CU-8694fae3r: Avoid publishing PyPI release when doing GH pre-releases

* CU-8694fae3r: Fix pre-releases tagging

* CU-8694fae3r: Allow actions to run on release edit

---------

Co-authored-by: Mart Ratas <[email protected]>
Co-authored-by: Vlad Dinu <[email protected]>

* Pushing changed tests and removing empty change

* Pushing change for logging

* Revert "Pushing change for logging"

This reverts commit fbcdb70.

* CU-8694hukwm: Document the materialising of generator when multiproce… (CogStack#433)

* CU-8694hukwm: Document the materialising of generator when multiprocessing and batching for docs

* CU-8694hukwm: Add TODO note for where the generator is materialised

* CU-8694hukwm: Add warning when a large amount of generator data (10k items) is materialised by the docs size mp method

* CU-8694fk90t (almost) only primitive config (CogStack#425)

* CU-8694fk90r: Move backwards compatibility method from CDB to config utils

* CU-8694fk90r: Move weighted_average_function from config to CDB; create necessary backwards compatibility workarounds

* CU-8694fk90r: Move usage of weighted_average_function in tests

* CU-8694fk90r: Add JSON encode and decoder for re.Pattern

* CU-8694fk90r: Rebuild custom decoder if needed

* CU-8694fk90r: Add method to detect old style config

* CU-8694fk90r: Use regular json serialisation for config; Retain option to read old jsonpickled config

* CU-8694fk90r: Add test for config serialisation

* CU-8694fk90r: Make sure to fix weighted_average_function upon setting it

* CU-8694fk90t: Add missing tests for config utils

* CU-8694fk90t: Add tests for better raised exception upon old way of using weighted_average_function

* CU-8694fk90t: Fix exception type in an added test

* CU-8694fk90t: Add further tests for exception payload

* CU-8694fk90t: Add improved exceptions when using old/unsupported value of weighted_average_function in config

* CU-8694fk90t: Add typing fix exceptions

* CU-8694fk90t: Make custom exception derive from AttributeError to correctly handle hasattr calls
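
The last point relies on the fact that Python's `hasattr` only swallows `AttributeError` (and its subclasses); any other exception raised during attribute lookup propagates. A minimal sketch, with hypothetical class names that are not taken from MedCAT:

```python
class LegacyConfigValueError(AttributeError):
    """Hypothetical exception for a config value that has moved elsewhere."""


class ConfigSketch:
    def __getattr__(self, name: str):
        # __getattr__ is only called when normal attribute lookup fails
        if name == "weighted_average_function":
            raise LegacyConfigValueError(
                "weighted_average_function is no longer part of the config")
        raise AttributeError(name)


class BrokenConfigSketch:
    def __getattr__(self, name: str):
        # Not an AttributeError, so hasattr() lets it propagate
        raise ValueError(name)
```

Here `hasattr(ConfigSketch(), "weighted_average_function")` evaluates to `False`, whereas `hasattr(BrokenConfigSketch(), "anything")` raises `ValueError`.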

* CU-8694gza88: Create codeql.yml (CogStack#434)

Run CodeQL to identify vulnerabilities.
This will run on any push or pull request to `master`, but also runs once every day in case some new vulnerabilities are discovered (or something else changes).

* CU-8694mbn03: Remove the web app (CogStack#441)

* CU-8694n48uw better deprecation (CogStack#443)

* CU-8694n493m: Add deprecation and removal versions to deprecation decorator

* CU-8694n493m: Deprecation version to existing deprecated methods.

Made the removal version 2 minor versions from the minor version
in which the method was deprecated, or the next minor version if
the method had been deprecated for longer.

* CU-8694n4ff0: Raise exception upon deprecated method call at test time

* CU-8694n4ff0: Fix usage of deprecated methods call during test time
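
A rough sketch of a deprecation decorator that carries both versions, together with one reading of the removal-version rule described above. The names `deprecated`, `pick_removal_version` and `old_method` are hypothetical, and interpreting "deprecated for longer" as "the two-minor-version target has already been reached" is an assumption:

```python
import functools
import warnings
from typing import Any, Callable, Tuple

Version = Tuple[int, int, int]


def pick_removal_version(deprecated_in: Version, current: Version) -> Version:
    """Two minor versions after deprecation, else the next minor version."""
    candidate = (deprecated_in[0], deprecated_in[1] + 2, 0)
    if candidate <= current:
        # Assumption: deprecated "for longer" means the target has passed
        candidate = (current[0], current[1] + 1, 0)
    return candidate


def deprecated(message: str, depr_version: Version,
               removal_version: Version) -> Callable:
    """Hypothetical decorator that warns with both version numbers."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            depr = ".".join(map(str, depr_version))
            rem = ".".join(map(str, removal_version))
            warnings.warn(
                f"{func.__name__} is deprecated since v{depr} and is "
                f"scheduled for removal in v{rem}: {message}",
                DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("use new_method instead", depr_version=(1, 10, 0),
            removal_version=pick_removal_version((1, 10, 0), (1, 10, 0)))
def old_method(x: int) -> int:
    return x * 2
```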

* CU-8694pey4u: extract cdb load to cls method, to be used in trainer for model pack loading

* CU-8694pey4u: extract meta cat loading also to a cls method

* CU-8694pey4u: docstrings

* CU-8694pey4u: typehints and mypy issues

* CU-8694pey4u: fix flake8

* CU-8694pey4u: fix flake8

* CU-8694pey4u: missing extra config if passed in

* CU-8694py1jr: Fix issue with reuse of opened file when loading old configs

* CU-8694py1jr: Make old config identifier more robust

* CU-8694py1jr: Add doc string to old config identifier

* CU-8694py1jr: Add test for old style MetaCAT config load

* CU-8694py1jr: Add test for old style main config load (functional)

* CU-8694py1jr: Refactor config utils load tests for more flexibility

* CU-8694py1jr: Add config utils load tests for NER and Rel CAT configs

* CU-8694vcvz7: Trust remote code when loading transformers NER dataset (CogStack#453)

* CU-8694vcvz7: Trust remote code when loading transformers NER dataset

* CU-8694vcvz7: Add support for older datasets without the remote code trusting kwarg

* CU-8694gzbn3 k fold metrics (CogStack#432)

* CU-8694gzbud: Add context manager that is able to snapshot CDB state

* CU-8694gzbud: Add tests to snapshotting CDB state

* CU-8694gzbud: Refactor tests for CDB state snapshotting

* CU-8694gzbud: Remove use of deprecated method in CDB utils and use non-deprecated one instead

* CU-8694gzbud: Add tests for training and CDB state capturing

* CU-8694gzbud: Small refactor in tests

* CU-8694gzbud: Add option to save state on disk

* CU-8694gzbud: Add debug logging output when saving state on disk

* CU-8694gzbud: Remove unused import

* CU-8694gzbud: Add tests for disk-based state save

* CU-8694gzbud: Move CDB state code to its own module

* CU-8694gzbud: Remove unused import

* CU-8694gzbud: Add doc strings to methods
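
The idea of a state-snapshotting context manager can be sketched as follows. `TinyCDB` and `captured_state` are hypothetical stand-ins, and the in-memory deep copy is only one possible strategy (the commits above also mention an on-disk option, which is not shown here):

```python
import copy
from contextlib import contextmanager
from typing import Dict, Iterator, Set


class TinyCDB:
    """Hypothetical stand-in for a concept database."""

    def __init__(self) -> None:
        self.cui2names: Dict[str, Set[str]] = {}


@contextmanager
def captured_state(cdb: TinyCDB) -> Iterator[TinyCDB]:
    # Take a deep copy of the instance state before the block runs
    saved = copy.deepcopy(cdb.__dict__)
    try:
        yield cdb
    finally:
        # Restore the original state even if the block raised
        cdb.__dict__.clear()
        cdb.__dict__.update(saved)
```

Anything done to the object inside the `with captured_state(cdb):` block (for example, training) is rolled back on exit.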

* CU-8694gzbx4: Small optimisation for stats

* CU-8694gzbx4: Add MCTExport related module

* CU-8694gzbx4: Add MCTExport related tests

* CU-8694gzbx4: Add code for k-fold statistics

* CU-8694gzbx4: Add tests for k-fold statistics

* CU-8694gzbx4: Add test-MCT export with fake concepts

* CU-8694gzbx4: Fix a doc string

* CU-8694gzbx4: Fix types in MCT export module

* CU-8694gzbx4: Fix types in k-fold module

* CU-8694gzbx4: Remove accidentally committed test class

* CU-8694gzbn3: Add missing test helper file

* CU-8694gzbn3: Remove whitespace change from otherwise unchanged file

* CU-8694gzbn3: Allow 5 minutes longer for tests

* CU-8694gzbn3: Move to python 3.8-compatible typed dict

* CU-8694gzbn3: Add more time for tests in workflow (now 30 minutes)

* CU-8694gzbn3: Add more time for tests in workflow (now 45 minutes)

* CU-8694gzbn3: Update test-pypi timeout to 45 minutes

* CU-8694gzbn3: Remove timeout from unit tests in main workflow

* CU-8694gzbn3: Make tests stop upon first failure

* CU-8694gzbn3: Fix test stop upon first failure (arg/option order)

* CU-8694gzbn3: Remove debug code and old comments

* CU-8694gzbn3: Remove all timeouts from main workflow

* CU-8694gzbn3: Remove more old / useless comments in tests

* CU-8694gzbn3: Add debug output when running k-fold tests to see where it may be stalling

* CU-8694gzbn3: Add debug output when ANY tests to see where it may be stalling

* CU-8694gzbn3: Remove explicit debug output from k-fold test cases

* CU-8694gzbn3: Remove timeouts from DEID tests in case they're the ones creating issues

* GHA/test fixes (CogStack#437)

* Revert "CU-8694gzbn3: Remove timeouts from DEID tests in case they're the ones creating issues"

This reverts commit faaf7fb.

* Revert "CU-8694gzbn3: Remove explicit debug output from k-fold test cases"

This reverts commit 9b02925.

* Revert "CU-8694gzbn3: Add debug output when ANY tests to see where it may be stalling"

This reverts commit 12c519a.

* Revert "CU-8694gzbn3: Add debug output when running k-fold tests to see where it may be stalling"

This reverts commit 03531da.

* Revert "CU-8694gzbn3: Remove all timeouts from main workflow"

This reverts commit e6debce.

* Revert "CU-8694gzbn3: Fix test stop upon first failure (arg/option order)"

This reverts commit 666c013.

* Revert "CU-8694gzbn3: Make tests stop upon first failure"

This reverts commit 94bce56.

* Revert "CU-8694gzbn3: Remove timeout from unit tests in main workflow"

This reverts commit 3618b9c.

* CU-8694gzbn3: Improve state copy code in CDB state tests

* CU-8694gzbn3: Fix a CDB state test issue

* CU-8694gzbn3: Split all tests into 2 halves

* CU-8694gzbn3: Remove legacy / archived / unused tests

* CU-8694gzbn3: Add doc strings for FoldCreator init

* CU-8694gzbn3: Move to a split-type enum

* CU-8694gzbn3: Add documentation to split-type enum

* CU-8694gzbn3: Create separate fold creators for different types of splitting strategies

* CU-8694gzbn3: Resort document order in test time nullification process

* CU-8694gzbn3: Add option to count number of annotations in doc for MCT export

* CU-8694gzbn3: Add weighted documents based split option along with relevant tests

* CU-8694gzbn3: Update default fold creation split type to weighted documents

* CU-8694gzbn3: Add test to ensure weighted documents split creates a reasonable number of annotations per split
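
One plausible way to split documents so that each fold ends up with a similar number of annotations is a greedy assignment, sketched below. This is an illustrative strategy under that assumption, not necessarily the one implemented in this PR, and `weighted_doc_folds` is a hypothetical name:

```python
from typing import Dict, List


def weighted_doc_folds(doc_ann_counts: Dict[str, int],
                       num_folds: int) -> List[List[str]]:
    """Greedily assign documents to the fold with the fewest annotations."""
    folds: List[List[str]] = [[] for _ in range(num_folds)]
    totals = [0] * num_folds
    # Place the heaviest documents first so the lighter ones can even things out
    ordered = sorted(doc_ann_counts.items(), key=lambda kv: kv[1], reverse=True)
    for doc_id, count in ordered:
        target = totals.index(min(totals))
        folds[target].append(doc_id)
        totals[target] += count
    return folds
```

Compared with splitting on document count alone, this keeps a single annotation-heavy document from dominating one fold.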

* CU-8693n892x environment/dependency snapshots (CogStack#438)

* CU-8693n892x: Save environment/dependency snapshot upon model pack creation

* CU-8693n892x: Fix typing for env snapshot module

* CU-8693n892x: Add test for env file existence in .zip

* CU-8693n892x: Add doc strings

* CU-8693n892x: Centralise env snapshot file name

* CU-8693n892x: Add env snapshot file to exceptions in serialisation tests

* CU-8693n892x: Only list direct dependencies

* CU-8693n892x: Add test that verifies all direct dependencies are listed in environment

* CU-8693n892x: Move requirements to separate file and use that for environment snapshot

* CU-8693n892x: Remove unused constants

* CU-8693n892x: Allow URL based dependencies when using direct dependencies

* CU-8693n892x: Distribute install_requires.txt alongside the package; use correct path in distributed version

* CU-8694p8y0k deprecation GHA check (CogStack#445)

* CU-8694p8y0k: Add check for deprecations (code)

* CU-8694p8y0k: Add workflow check for deprecations

* CU-8694p8y0k: Fix (hopefully) workflow check for deprecations

* CU-8694p8y0k: Add option to remove version prefix when checking deprecation

* CU-8694p8y0k: Update deprecation checks with more detail (i.e. current/next version).

* CU-8694p8y0k: Only run deprecation checking step when merging master into production

* CU-8694u3yd2 cleanup name removal (CogStack#450)

* CU-8694u3yd2: Add logged warning for when using full-unlink

* CU-8694u3yd2: Make CDB.remove_names simply expect an iterable of names

* CU-8694u3yd2: Improve CDB.remove_names doc string

* CU-8694u3yd2: Explicitly pass the keys to CDB.remove_names in CAT.unlink_concept_name

* CU-8694u3yd2: Add note regarding state (and order) dependent tests to some CDB maker tests

* CU-8694u3yd2: Rename/make protected CDB.remove_names method

* CU-8694u3yd2: Create deprecated CDB.remove_names method

* CU-8694vte2g 1.12 depr removal (CogStack#454)

* CU-8694vte2g: Remove CDB.add_concept method

* CU-8694vte2g: Remove unused import (deprecated decorator)

* CU-8694vte2g: Remove CAT.get_spacy_nlp method

* CU-8694vte2g: Remove CAT.train_supervised method

* CU-8694vte2g: Remove CAT multiprocessing methods

* CU-8694vte2g: Remove MetaCAT.train method

* CU-8694vte2g: Remove medcat.utils.ner.helper.deid_text method

* CU-8694vte2g: Remove use of deprecated method

* CU-8694vte2g: Add back removed deprecation import

---------

Co-authored-by: Shubham Agarwal <[email protected]>
Co-authored-by: Vlad Dinu <[email protected]>
Co-authored-by: Tom Searle <[email protected]>
* Pushing bug fix for metacat

2-phase learning for MetaCAT utilises data_undersampled. Fixed a bug in the eval function, which was incorrectly using the data_undersampled instead of the full_data

* Pushing change for lazy logging

* Pushing update for lazy logging

* Pushing lint fix
* CU-8695pvhfe: Rename a test class

* CU-8695pvhfe: Add tests for multiprocessing usage monitoring

* CU-8695pvhfe: Fix usage monitor for multiprocessing.

When using CAT.multiprocessing_batch_char_size (CAT._multiprocessing_batch and CAT._mp_cons internally), flush the usage monitor at the end of the multiprocessing method.
When using CAT.get_entities_multi_texts or CAT.multiprocessing_batch_docs_size (uses the former internally), add logging of usage to output

* CU-8695pvhfe: Fix remaining issues with usage monitor for multiprocessing.

Avoid checking length of (potentially) non-existent strings. Avoid early iteration of generator.
* CU-8695ucw9b: Fix older DeID models due to changes in transformers.

Since transformers 4.42.0, the tokenizer is expected to have the 'split_special_tokens' attribute. But the version we've saved does not. So when it's loaded, this causes an exception to be raised (which is currently caught and logged by medcat).

* CU-8695ucw9b: Add functionality for transformers NER to spectacularly fail upon consistent consecutive exceptions.

The idea is that this way, if something in the underlying models is consistently failing, the exception is raised rather than simply logged

* CU-8695ucw9b: Add tests for exception raising after a pre-defined number of failed document processes

* CU-8695ucw9b: Change conditions for raising exception on consecutive failure.

Now only raise the exception if the consecutive failure is identical (or similar). We determine that from the type and string-representation of the exception being raised.

* CU-8695ucw9b: Small additional cleanup on successful TNER processing

* CU-8695ucw9b: Use custom exception when failing due to consecutive exceptions

* CU-8695ucw9b: Remove try-except when processing transformers NER to force immediate raising of exception
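
The "identical consecutive failure" condition described above (same exception type and same string representation) could be tracked roughly as in this sketch. `FailureTracker` and `ConsecutiveFailureError` are hypothetical names, and the sketch does not reflect the final commit, which removes the try-except altogether:

```python
from typing import Optional, Tuple


class ConsecutiveFailureError(Exception):
    """Hypothetical custom exception for repeated identical failures."""


class FailureTracker:
    def __init__(self, max_consecutive: int) -> None:
        self.max_consecutive = max_consecutive
        self._last_key: Optional[Tuple[type, str]] = None
        self._count = 0

    def record_success(self) -> None:
        # Any successfully processed document resets the streak
        self._last_key = None
        self._count = 0

    def record_failure(self, exc: Exception) -> None:
        # Two failures count as "identical" if type and message match
        key = (type(exc), str(exc))
        if key == self._last_key:
            self._count += 1
        else:
            self._last_key = key
            self._count = 1
        if self._count >= self.max_consecutive:
            raise ConsecutiveFailureError(
                f"{self._count} identical consecutive failures") from exc
```
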
* CU-8695uhe5n: Update docs dependency pins

* CU-8695uhe5n: Fix typo in fsspec version pin
* MetaCAT fixes and upgrades

Pushing for 3 updates:
1) Removed the check and update for labels with zero data, as this was causing issues during evaluation
2) Resolved an issue where the confusion matrix couldn't be calculated when testing on a single class with an F1 score of 1, as it expected the original number of training classes (3)
3) Updated the attention mask creation to dynamically use the actual pad_idx value instead of assuming it to be 0
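
To illustrate point 3, here is a framework-free sketch of building an attention mask from the actual padding index instead of assuming it is 0. The function name and the use of plain lists (rather than tensors) are simplifications for illustration only:

```python
from typing import List


def attention_mask(batch: List[List[int]], pad_idx: int) -> List[List[int]]:
    """Return 1 for real tokens and 0 for padding, based on pad_idx."""
    return [[0 if token == pad_idx else 1 for token in seq] for seq in batch]
```

With `pad_idx=3`, a token with id 0 is correctly treated as a real token, which a mask hard-coded to compare against 0 would get wrong.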

* Pushing type fix

* Pushing for type fix

* Fixing type issues

* Pushing change

* Pushing update w/o try except block

For the issue where the confusion matrix couldn't be calculated when testing on a single class with an F1 score of 1, as it expected the original number of training classes (3), pushing an optimized version w/o the try except block