V1.0.0dev2 ep (#48)
* feat(annotator, datasets, init): fixed log in annotator, use_exact_path, example print flag (#34)

Fixed logger in Annotator class. Fixed use exact path. Added flag to print examples in dataset.
Fixed dotted dict to allow for saving best scoring model.

* Lorr1 master pr (#35)

* Fixed error in bootleg annotator with batched model size determination

* Lorr1 master pr (#36)

* fix(dumping predictions): fixed memory issues and allow for eval accumulated steps

Reduced memory in prediction dumping by accumulating a set number of eval steps before dumping. This produces chunks of output files that are merged at the end, letting the user keep memory usage constant.
Also allowed `dataset_threads == 1` to turn off multiprocessing Pools.
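The chunked dump-and-merge described above can be sketched as follows. This is an illustrative sketch only; the function and chunk-file names are hypothetical, not Bootleg's actual API:

```python
import json
import os


def dump_predictions(batches, accumulation_steps, out_path):
    """Accumulate `accumulation_steps` batches of prediction rows, flush each
    accumulation to its own chunk file, then merge the chunks into `out_path`.

    Peak memory stays roughly proportional to one accumulation
    (~ accumulation_steps * batch_size rows), not the full eval set.
    """
    chunk_paths, buffer = [], []
    for i, batch in enumerate(batches, start=1):
        buffer.extend(batch)
        if i % accumulation_steps == 0:
            chunk_paths.append(_flush(buffer, len(chunk_paths), out_path))
            buffer = []
    if buffer:  # flush any trailing partial accumulation
        chunk_paths.append(_flush(buffer, len(chunk_paths), out_path))
    # Merge chunk files in order and clean them up
    with open(out_path, "w") as merged:
        for path in chunk_paths:
            with open(path) as f:
                merged.write(f.read())
            os.remove(path)


def _flush(rows, chunk_idx, out_path):
    """Write one accumulation of rows to a numbered chunk file (JSON lines)."""
    path = f"{out_path}.chunk{chunk_idx}"
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path
```

With this shape, `accumulation_steps` is the knob the user turns to trade dump frequency against peak memory.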

* Lorr1 master pr (#37)

* Added support for data_parallel eval. Some issues remain around PyTorch/CUDA versions.
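A minimal sketch of what multi-GPU eval with DataParallel looks like (an assumption about the approach; the wrapper name is hypothetical and this is not Bootleg's actual code):

```python
import torch
from torch import nn


def wrap_for_eval(model: nn.Module) -> nn.Module:
    """Wrap a model in DataParallel when multiple GPUs are visible.

    Falls back to the bare model on CPU or single-GPU machines, which is
    where PyTorch/CUDA version mismatches tend to surface.
    """
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # replicate the module across GPUs
        model = model.to("cuda")
    return model.eval()  # disable dropout / batch-norm updates for eval
```

Note the later diff in ``bootleg/data.py`` logs a warning steering users toward DataParallel rather than DDP for eval, consistent with this pattern.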

* Lorr1 master pr (#38)

* Bumped Travis Python version

* Removing psutil

* Modified dependencies

* Merge in entity profile (#40)

* Lorr1 master pr2 (#41)

* Merge in entity profile

* Fixed tests

* Lorr1 master pr2 (#43)

* Fixed empty aliases

* Lorr1 master pr (#39)

* Updated requirements and setup to work with genie

* Lorr1 master pr (#42)

* Finalized entity profile API for release

* Lorr1 master pr2 (#44)

* Fixed the mention selection criteria to not overcount mentions when doing batch eval.

* Lorr1 master pr2 (#45)

* feat(flake8 compatibility): made code flake8 compatible

* Removed files before merge

* Flake8 compatibility and fixed unit tests for release

* Lorr1 master pr2 (#46)

* Updated requirements

* Lorr1 master pr2 (#47)

* Fixed tests by removing checkpointing to save disk space for CI
lorr1 authored Mar 20, 2021
1 parent c62fbd4 commit 4a33774
Showing 147 changed files with 20,308 additions and 13,341 deletions.
6 changes: 6 additions & 0 deletions .flake8
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
[flake8]
exclude = .git
max-line-length = 120
ignore = E203, W503, W605, F541, E731, E722, E231
per-file-ignores =
*/__init__.py: F401
7 changes: 4 additions & 3 deletions .github/workflows/ci.yaml
@@ -2,7 +2,8 @@ name: CI

on:
push:
branches: [ emmental_master, master ]
branches:
- '**'
pull_request:
branches: [ emmental_master, master ]

@@ -34,7 +35,7 @@ jobs:
python -m pip install --upgrade pip
make dev
- name: Lint with isort, black, docformatter, flake8
- name: Lint with isort, black, docformatter
run: |
make format
make check
@@ -88,7 +89,7 @@ jobs:
with:
# This path is specific to Ubuntu
path: ~/.cache/pip
# Look to see if there is a cache hit for the corresponding requirements file
# Look to see if there is a cache hit for the corresponding reqs file
key: ${{ runner.os }}-pip-${{ hashFiles('requirements-dev.txt') }}
restore-keys: |
${{ runner.os }}-pip-
2 changes: 1 addition & 1 deletion .travis.yml
@@ -5,7 +5,7 @@ dist: xenial
cache: pip

python:
- "3.6"
- "3.8"

# command to install dependencies
install:
33 changes: 29 additions & 4 deletions CHANGELOG.rst
@@ -1,7 +1,32 @@
Unreleased_
Unreleased 1.0.2dev0
---------------------

1.0.1 - 2021-03-22
-------------------

1.0.0 - 2020-02-15

.. note::

If upgrading to 1.0.1 from 1.0.0, you will need to re-download our models given the links in the README.md. We altered what keys were saved in the state dict, but the model weights are unchanged.

Added
^^^^^^^
* ``data_config.print_examples_prep`` flag to toggle data example printing during data prep.
* ``data_config.eval_accumulation_steps`` to support subbatched dumping of predictions. We save outputs to separate files of size approximately ``data_config.eval_accumulation_steps*data_config.eval_batch_size`` and merge into a final file at the end.
* Entity Profile API. See the `docs <https://bootleg.readthedocs.io/en/latest/gettingstarted/entity_profile.html>`_. This allows for modifying entity metadata as well as adding and removing entities. We provide methods for refitting a model with a new profile for immediate inference, no finetuning needed.

Changed
^^^^^^^^
* Support for not using multiprocessing if the user sets ``data_config.dataset_threads`` to 1.
* Added better argument parsing to check for arguments that were misspelled or otherwise would not trigger anything.
* Code is now Flake8 compatible.

Fixed
^^^^^^^
* Fixed readthedocs so the BootlegAnnotator was loaded correctly.
* Fixed logging in BootlegAnnotator.
* Fixed ``use_exact_path`` argument in Emmental.

1.0.0 - 2021-02-15
-------------------
We did a major rewrite of our entire codebase and moved to using `Emmental <https://github.com/SenWu/Emmental>`_ for training. Emmental allows for easy multi-task training, FP16, and support for both DataParallel and DistributedDataParallel.

@@ -22,7 +47,7 @@ Added
Changed
^^^^^^^^
* Mention extraction code and alias map has been updated
* Models trained on October 2020 dump of Wikipedia
* Models trained on October 2020 save of Wikipedia
* Have uncased and cased models

Removed
10 changes: 7 additions & 3 deletions Makefile
@@ -13,9 +13,9 @@ format:
docformatter --in-place --recursive bootleg test

check:
isort -c -rc bootleg/ test/
isort -c bootleg/ test/
black bootleg/ test/ --check
# flake8 bootleg/ test/
flake8 bootleg/ test/

docs:
sphinx-build -b html docs/source/ docs/build/html/
@@ -32,4 +32,8 @@ clean:
rm -rf src/bootleg.egg-info
rm -rf _build/

.PHONY: dev test clean check docs
prune:
@bash -c "git fetch -p";
@bash -c "for branch in $(git branch -vv | grep ': gone]' | awk '{print $1}'); do git branch -d $branch; done";

.PHONY: dev test clean check docs prune
2 changes: 1 addition & 1 deletion bootleg/_version.py
@@ -1,2 +1,2 @@
"""Bootleg version."""
__version__ = "1.0.0"
__version__ = "1.0.1"
12 changes: 8 additions & 4 deletions bootleg/data.py
Expand Up @@ -107,7 +107,8 @@ def get_dataloaders(
if Meta.config["learner_config"]["local_rank"] != -1:
log_rank_0_info(
logger,
f"You are using distributed computing for eval. We are not using a distributed sampler. Please use DataParallel and not DDP.",
f"You are using distributed computing for eval. We are not using a distributed sampler. "
f"Please use DataParallel and not DDP.",
)
dataloaders.append(
EmmentalDataLoader(
@@ -180,7 +181,8 @@ def get_dataloader_embeddings(main_args, entity_symbols):
)
# Extract its kg adj, we'll use this later
# Extract the kg_adj_process_func (how to process the embeddings in __get_item__ or dataset prep)
# Extract the prep_file. We use this to load the kg_adj back after saving/loading state using scipy.sparse.load_npz(prep_file)
# Extract the prep_file. We use this to load the kg_adj back after
# saving/loading state using scipy.sparse.load_npz(prep_file)
assert hasattr(
kg_class, "kg_adj"
), f"The embedding class {emb.key} does not have a kg_adj attribute and it needs to."
@@ -236,7 +238,8 @@ def bootleg_collate_fn(
if isinstance(value, list):
X_batch[field_name] += value
elif isinstance(value, dict):
# We reinstantiate the field_name here in case there is not kg adj data - this keeps the field_name key intact
# We reinstantiate the field_name here in case there is not kg adj data
# This keeps the field_name key intact
if field_name not in X_sub_batch:
X_sub_batch[field_name] = defaultdict(list)
for sub_field_name, sub_value in value.items():
@@ -259,7 +262,8 @@ def bootleg_collate_fn(
if isinstance(value, list):
X_batch[field_name] += value
elif isinstance(value, dict):
# We reinstantiate the field_name here in case there is not kg adj data - this keeps the field_name key intact
# We reinstantiate the field_name here in case there is not kg adj data
# This keeps the field_name key intact
if field_name not in X_sub_batch:
X_sub_batch[field_name] = defaultdict(list)
for sub_field_name, sub_value in value.items():
