V1.0.0dev2 ep (#48)
* feat(annotator, datasets, init): fixed log in annotator, use_exact_path, example print flag (#34)

Fixed logger in Annotator class. Fixed use exact path. Added flag to print examples in dataset.
Fixed dotted dict to allow for saving best scoring model.

* Lorr1 master pr (#35)

* Fixed error in bootleg annotator with batched model size determination

* Lorr1 master pr (#36)

* fix(dumping predictions): fixed memory issues and allow for eval accumulated steps

Reduced memory in prediction dumping by accumulating a set number of eval steps before dumping. This produces chunks of output files that are merged at the end, letting the user keep memory usage constant.
Also allowed `dataset_threads == 1` to turn off multiprocessing Pools.
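The chunked dump-and-merge described above can be sketched as follows. This is an illustrative sketch only; the function and chunk-file names are hypothetical, not Bootleg's actual API:

```python
import json
import os


def dump_predictions(batches, accumulation_steps, out_path):
    """Accumulate `accumulation_steps` batches of prediction rows, flush each
    accumulation to its own chunk file, then merge the chunks into `out_path`.

    Peak memory stays roughly proportional to one accumulation
    (~ accumulation_steps * batch_size rows), not the full eval set.
    """
    chunk_paths, buffer = [], []
    for i, batch in enumerate(batches, start=1):
        buffer.extend(batch)
        if i % accumulation_steps == 0:
            chunk_paths.append(_flush(buffer, len(chunk_paths), out_path))
            buffer = []
    if buffer:  # flush any trailing partial accumulation
        chunk_paths.append(_flush(buffer, len(chunk_paths), out_path))
    # Merge chunk files in order and clean them up
    with open(out_path, "w") as merged:
        for path in chunk_paths:
            with open(path) as f:
                merged.write(f.read())
            os.remove(path)


def _flush(rows, chunk_idx, out_path):
    """Write one accumulation of rows to a numbered chunk file (JSON lines)."""
    path = f"{out_path}.chunk{chunk_idx}"
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path
```

With this shape, `accumulation_steps` is the knob the user turns to trade dump frequency against peak memory.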

* Lorr1 master pr (#37)

* Added support for data_parallel eval. Some issues remain around PyTorch/CUDA versions.
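A minimal sketch of what multi-GPU eval with DataParallel looks like (an assumption about the approach; the wrapper name is hypothetical and this is not Bootleg's actual code):

```python
import torch
from torch import nn


def wrap_for_eval(model: nn.Module) -> nn.Module:
    """Wrap a model in DataParallel when multiple GPUs are visible.

    Falls back to the bare model on CPU or single-GPU machines, which is
    where PyTorch/CUDA version mismatches tend to surface.
    """
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # replicate the module across GPUs
        model = model.to("cuda")
    return model.eval()  # disable dropout / batch-norm updates for eval
```

Note the later diff in ``bootleg/data.py`` logs a warning steering users toward DataParallel rather than DDP for eval, consistent with this pattern.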

* Lorr1 master pr (#38)

* Bumped Travis Python version

* Removing psutil

* Modified dependencies

* Merge in entity profile (#40)

* Lorr1 master pr2 (#41)

* Merge in entity profile

* Fixed tests

* Lorr1 master pr2 (#43)

* Fixed empty aliases

* Lorr1 master pr (#39)

* Updated requirements and setup to work with genie

* Lorr1 master pr (#42)

* Finalized entity profile API for release

* Lorr1 master pr2 (#44)

* Fixed the mention selection criteria to not overcount mentions when doing batch eval.

* Lorr1 master pr2 (#45)

* feat(flake8 compatibility): made code flake8 compatible

* Removed files before merge

* Flake8 compatibility and fixed unit tests for release

* Lorr1 master pr2 (#46)

* Updated requirements

* Lorr1 master pr2 (#47)

* Fixed tests by removing checkpointing to save disk space for CI
lorr1 authored Mar 20, 2021
1 parent c62fbd4 commit 4a33774
Showing 147 changed files with 20,308 additions and 13,341 deletions.
6 changes: 6 additions & 0 deletions .flake8
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
[flake8]
exclude = .git
max-line-length = 120
ignore = E203, W503, W605, F541, E731, E722, E231
per-file-ignores =
*/__init__.py: F401
7 changes: 4 additions & 3 deletions .github/workflows/ci.yaml
@@ -2,7 +2,8 @@ name: CI

on:
push:
branches: [ emmental_master, master ]
branches:
- '**'
pull_request:
branches: [ emmental_master, master ]

@@ -34,7 +35,7 @@ jobs:
python -m pip install --upgrade pip
make dev
- name: Lint with isort, black, docformatter, flake8
- name: Lint with isort, black, docformatter
run: |
make format
make check
@@ -88,7 +89,7 @@ jobs:
with:
# This path is specific to Ubuntu
path: ~/.cache/pip
# Look to see if there is a cache hit for the corresponding requirements file
# Look to see if there is a cache hit for the corresponding reqs file
key: ${{ runner.os }}-pip-${{ hashFiles('requirements-dev.txt') }}
restore-keys: |
${{ runner.os }}-pip-
2 changes: 1 addition & 1 deletion .travis.yml
@@ -5,7 +5,7 @@ dist: xenial
cache: pip

python:
- "3.6"
- "3.8"

# command to install dependencies
install:
33 changes: 29 additions & 4 deletions CHANGELOG.rst
@@ -1,7 +1,32 @@
Unreleased_
Unreleased 1.0.2dev0
---------------------

1.0.1 - 2021-03-22
-------------------

1.0.0 - 2020-02-15

.. note::

If upgrading to 1.0.1 from 1.0.0, you will need to re-download our models given the links in the README.md. We altered what keys were saved in the state dict, but the model weights are unchanged.

Added
^^^^^^^
* ``data_config.print_examples_prep`` flag to toggle data example printing during data prep.
* ``data_config.eval_accumulation_steps`` to support subbatched dumping of predictions. We save outputs to separate files of size approximately ``data_config.eval_accumulation_steps*data_config.eval_batch_size`` and merge into a final file at the end.
* Entity Profile API. See the `docs <https://bootleg.readthedocs.io/en/latest/gettingstarted/entity_profile.html>`_. This allows for modifying entity metadata as well as adding and removing entities. We provide methods for refitting a model with a new profile for immediate inference, no finetuning needed.

Changed
^^^^^^^^
* Support for not using multiprocessing if the user sets ``data_config.dataset_threads`` to 1.
* Added better argument parsing to check for arguments that were misspelled or otherwise would not trigger anything.
* Code is now Flake8 compatible.

Fixed
^^^^^^^
* Fixed readthedocs so the BootlegAnnotator was loaded correctly.
* Fixed logging in BootlegAnnotator.
* Fixed ``use_exact_path`` argument in Emmental.

1.0.0 - 2021-02-15
-------------------
We did a major rewrite of our entire codebase and moved to using `Emmental <https://github.com/SenWu/Emmental>`_ for training. Emmental allows for easy multi-task training, FP16, and support for both DataParallel and DistributedDataParallel.

@@ -22,7 +47,7 @@ Added
Changed
^^^^^^^^
* Mention extraction code and alias map has been updated
* Models trained on October 2020 dump of Wikipedia
* Models trained on October 2020 save of Wikipedia
* Have uncased and cased models

Removed
10 changes: 7 additions & 3 deletions Makefile
@@ -13,9 +13,9 @@ format:
docformatter --in-place --recursive bootleg test

check:
isort -c -rc bootleg/ test/
isort -c bootleg/ test/
black bootleg/ test/ --check
# flake8 bootleg/ test/
flake8 bootleg/ test/

docs:
sphinx-build -b html docs/source/ docs/build/html/
@@ -32,4 +32,8 @@ clean:
rm -rf src/bootleg.egg-info
rm -rf _build/

.PHONY: dev test clean check docs
prune:
@bash -c "git fetch -p";
@bash -c "for branch in $(git branch -vv | grep ': gone]' | awk '{print $1}'); do git branch -d $branch; done";

.PHONY: dev test clean check docs prune
2 changes: 1 addition & 1 deletion bootleg/_version.py
@@ -1,2 +1,2 @@
"""Bootleg version."""
__version__ = "1.0.0"
__version__ = "1.0.1"
12 changes: 8 additions & 4 deletions bootleg/data.py
Expand Up @@ -107,7 +107,8 @@ def get_dataloaders(
if Meta.config["learner_config"]["local_rank"] != -1:
log_rank_0_info(
logger,
f"You are using distributed computing for eval. We are not using a distributed sampler. Please use DataParallel and not DDP.",
f"You are using distributed computing for eval. We are not using a distributed sampler. "
f"Please use DataParallel and not DDP.",
)
dataloaders.append(
EmmentalDataLoader(
@@ -180,7 +181,8 @@ def get_dataloader_embeddings(main_args, entity_symbols):
)
# Extract its kg adj, we'll use this later
# Extract the kg_adj_process_func (how to process the embeddings in __get_item__ or dataset prep)
# Extract the prep_file. We use this to load the kg_adj back after saving/loading state using scipy.sparse.load_npz(prep_file)
# Extract the prep_file. We use this to load the kg_adj back after
# saving/loading state using scipy.sparse.load_npz(prep_file)
assert hasattr(
kg_class, "kg_adj"
), f"The embedding class {emb.key} does not have a kg_adj attribute and it needs to."
@@ -236,7 +238,8 @@ def bootleg_collate_fn(
if isinstance(value, list):
X_batch[field_name] += value
elif isinstance(value, dict):
# We reinstantiate the field_name here in case there is not kg adj data - this keeps the field_name key intact
# We reinstantiate the field_name here in case there is not kg adj data
# This keeps the field_name key intact
if field_name not in X_sub_batch:
X_sub_batch[field_name] = defaultdict(list)
for sub_field_name, sub_value in value.items():
@@ -259,7 +262,8 @@ def bootleg_collate_fn(
if isinstance(value, list):
X_batch[field_name] += value
elif isinstance(value, dict):
# We reinstantiate the field_name here in case there is not kg adj data - this keeps the field_name key intact
# We reinstantiate the field_name here in case there is not kg adj data
# This keeps the field_name key intact
if field_name not in X_sub_batch:
X_sub_batch[field_name] = defaultdict(list)
for sub_field_name, sub_value in value.items():
