feature/comet-logger-update #2
base: master
⛈️ Required checks status: Has failure 🔴
Groups summary:
🔴 pytorch_lightning: Tests workflow
🟡 pytorch_lightning: Azure GPU
🟡 pytorch_lightning: Benchmarks
🔴 pytorch_lightning: Docs
🟢 mypy
🟡 install
Thank you for your contribution! 💜
The following example fails on this branch but passes with the latest released version of Lightning.
Lightning 2.4.0, experiment: https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
Output:
CometLogger will be initialized in online mode
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Couldn't find a Git repository in '/tmp' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
| Name | Type | Params | Mode
----------------------------------------
0 | l1 | Linear | 7.9 K | train
----------------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode
Sanity Checking: | | 0/? [00:00<?, ?it/s]/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
Sanity Checking DataLoader 0: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 38.05it/s]/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
Epoch 2: 100%|███████████████████████████████████████████████████████████████████| 469/469 [00:31<00:00, 14.73it/s, v_num=8958]`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|███████████████████████████████████████████████████████████████████| 469/469 [00:31<00:00, 14.73it/s, v_num=8958]
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml ExistingExperiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : upset_soil_1490
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Metrics [count] (min, max):
COMET INFO: train_loss [28] : (0.4863688051700592, 1.2028049230575562)
COMET INFO: val_loss [3] : (0.9357529878616333, 0.9526914358139038)
COMET INFO: Others:
COMET INFO: Created from : pytorch-lightning
COMET INFO: Parameters:
COMET INFO: layer_size : 784
COMET INFO: Uploads:
COMET INFO: model graph : 1
COMET INFO:
COMET INFO: Please wait for metadata to finish uploading (timeout is 3600 seconds)
COMET INFO: Uploading 1651 metrics, params and output messages
True
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : upset_soil_1490
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Others:
COMET INFO: Created from : pytorch-lightning
COMET INFO: Parameters:
COMET INFO: batch_size : 64
COMET INFO: Uploads:
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: source_code : 2 (17.51 KB)
COMET INFO:
This branch, experiment: https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
Output:
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
COMET INFO: Couldn't find a Git repository in '/tmp' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
| Name | Type | Params | Mode
----------------------------------------
0 | l1 | Linear | 7.9 K | train
----------------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode
W0906 18:27:21.134000 140399680829248 torch/multiprocessing/spawn.py:146] Terminating process 4052339 via signal SIGTERM
Traceback (most recent call last):
File "/tmp/Comet_and_Pytorch_Lightning.py", line 86, in <module>
main()
File "/tmp/Comet_and_Pytorch_Lightning.py", line 76, in main
trainer.fit(model, train_loader, eval_loader)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/strategies/launchers/multiprocessing.py", line 144, in launch
while not process_context.join():
^^^^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 189, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
fn(i, *args)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/strategies/launchers/multiprocessing.py", line 173, in _wrapping_function
results = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 964, in _run
_log_hyperparams(self)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/loggers/utilities.py", line 93, in _log_hyperparams
logger.log_hyperparams(hparams_initial)
File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/loggers/comet.py", line 282, in log_hyperparams
self.experiment.__internal_api__log_parameters__(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute '__internal_api__log_parameters__'
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : sleepy_monastery_3541
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
COMET INFO: Parameters:
COMET INFO: batch_size : 64
COMET INFO: Uploads:
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: source_code : 2 (14.93 KB)
COMET INFO:
Please investigate what is happening
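One possible reading of the traceback above, offered as a hypothesis rather than a confirmed diagnosis: the failure occurs inside a process started by the multiprocessing (spawn) launcher, and `self.experiment` is `None` when `log_hyperparams` runs. A logger object is pickled when it is handed to a spawned process, so if its experiment handle is dropped during pickling and never re-created, this kind of `AttributeError` can result. The sketch below reproduces that pattern with toy classes. `ToyExperiment` and `ToyLogger` are hypothetical stand-ins and are not the Comet SDK or the `CometLogger` from this branch.

```python
import pickle


class ToyExperiment:
    """Stand-in for a remote experiment handle (hypothetical, not the Comet SDK)."""

    def __init__(self):
        self.params = {}

    def log_parameters(self, params):
        self.params.update(params)


class ToyLogger:
    """Hypothetical logger that drops its experiment handle when pickled."""

    def __init__(self):
        self._experiment = ToyExperiment()

    def __getstate__(self):
        # Assumption: a live handle cannot cross a process boundary, so it is dropped.
        state = self.__dict__.copy()
        state["_experiment"] = None
        return state

    @property
    def experiment(self):
        # Guard: re-create the handle lazily instead of returning None.
        if self._experiment is None:
            self._experiment = ToyExperiment()
        return self._experiment

    def log_hyperparams(self, params):
        self.experiment.log_parameters(params)


# Simulate a spawn-based launcher: pickle in the parent, unpickle in the child.
child_logger = pickle.loads(pickle.dumps(ToyLogger()))
assert child_logger._experiment is None  # the handle did not survive the round trip
child_logger.log_hyperparams({"batch_size": 64})
print(child_logger.experiment.params)
```

Without the `None` check in the `experiment` property, the call to `log_hyperparams` in the child would fail in the same way as the traceback. Whether this is what actually happens on this branch still needs to be verified against `comet.py`.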
update tutorials to `3f8a254d` Co-authored-by: Borda <[email protected]>
I tested with the following Trainer() params:
CPU
GPU
MULTI-NODE (two VM nodes, each with one CUDA device)
With or without the current PR, everything works the same.
@japdubengsub very nice job on testing, Sasha!
update tutorials to `d5273534` Co-authored-by: Borda <[email protected]>
…ning-AI#20267) * build(deps): bump Lightning-AI/utilities from 0.11.6 to 0.11.7 Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.11.6 to 0.11.7. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.11.6...v0.11.7) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * Apply suggestions from code review --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <[email protected]>
…ing-AI#20266) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](peter-evans/create-pull-request@v6...v7) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* bump: Torch `2.5.0` * push docker * docker * 2.5.1 and mypy * update USE_DISTRIBUTED=0 test * also for pytorch lightning no distributed * set USE_LIBUV=0 on windows * try drop pickle warning * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * disable compiling update_metrics * bump 2.2.x to bugfix * disable also log in logger connector (also calls metric) * more point release bumps * remove unloved type ignore and print some more on exit * update checkgroup * minor versions * shortened version in build-pl * pytorch 2.4 is with python 3.11 * 2.1 and 2.3 without patch release * for 2.4.1: docker with 3.11 test with 3.12 --------- Co-authored-by: Thomas Viehmann <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…ers (Lightning-AI#20379) * Fix checkpoint progress for fit loop and batch loop * Check loss parity * Rename test * Fix validation loop handling on restart * Fix loop reset test * Avoid skipping to val end if saved mid validation * Fix type checks in compare state dicts * Fix edge cases and start from last with and without val * Clean up * Formatting * Avoid running validation when restarting from last * Fix type annotations * Fix formatting * Ensure int max_batch * Fix condition on batches that stepped * Remove expected on_train_epoch_start when restarting mid epoch
* fix batchsampler does not work correctly * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add batch sampler shuffle state test
…ightning-AI#20372) Co-authored-by: Luca Antiga <[email protected]>
update CI Co-authored-by: Thomas Viehmann <[email protected]>
…Lightning-AI#20399) Co-authored-by: Alan Chu <[email protected]> Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Alan Chu <[email protected]>
…#20403) * Allow callbacks to be restored not just during training * add test case * test test case failure * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix test case --------- Co-authored-by: Alan Chu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Luca Antiga <[email protected]>
…ning-AI#20444) Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.11.8 to 0.11.9. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.11.8...v0.11.9) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* bump python 3.9+ * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * --unsafe-fixes * contextlib.AbstractContextManager * type: ignore[misc] * update CI * apply fixes * apply fixes --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Luca Antiga <[email protected]>
…py for `torch-xla>=2.5` (Lightning-AI#20442) * Replace `using_pjrt()` xla runtime `device_type()` check with in xla.py Fixes Lightning-AI#20419 `torch_xla.runtime.using_pjrt()` is removed in pytorch/xla#7787 This PR replaces references to that function with a check to [`device_type()`](https://github.com/pytorch/xla/blob/master/torch_xla/runtime.py#L83) to recreate the behavior of that function, minus the manual initialization * Added tests/refactored for version compat * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * precommit --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
feat(cli): after_instantiate_classes hook Co-authored-by: Luca Antiga <[email protected]>
# Conflicts: # src/lightning/pytorch/loggers/comet.py
This pull request updates the CometML logger to support the recent Comet SDK.
Logger initialization is now unified with the comet_ml.start() method for ease of use, and the unit tests have been updated accordingly.
📚 Documentation preview 📚: https://pytorch-lightning--2.org.readthedocs.build/en/2/