The problem with scalars #233
Comments
Hi @kzelias, what is your code doing, exactly? |
It's just a task over hydra.

import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager
from clearml import Task

CONFIG_NAME = "fastconformer_287_start_tune_b128_lr2e-5"


@hydra_runner(config_path="../../cfg_train/conformers/cvm", config_name=CONFIG_NAME)
def main(cfg):
    task = Task.init(project_name="ap-models", task_name=CONFIG_NAME)
    logger = task.get_logger()

    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecHybridRNNTCTCBPEModel(cfg=cfg.model, trainer=trainer)

    # Initialize the weights of the model from another model, if provided via config
    print("------INITING FROM PRETRAIN------")
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)
    print("------INITED------")

    logging.info(f'MODEL train_ds config: {asr_model.cfg.train_ds}')
    logging.info(f'MODEL optim config: {asr_model.cfg.optim}')

    trainer.fit(asr_model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
|
UPD: At the beginning of training the scalars work; after 5-10 thousand steps, this error appears. |
This might be an issue with Elastic. Can you check the Elastic docker container logs? |
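Alongside the container logs, a quick look at the cluster health can help narrow things down. Below is a minimal sketch, assuming the Elasticsearch HTTP endpoint is reachable from where the script runs. The function names and the `http://localhost:9200` address are hypothetical, chosen only for illustration; the `_cluster/health` endpoint and the `status`, `number_of_nodes` and `unassigned_shards` fields are part of the standard Elasticsearch API.

```python
import json
import urllib.request


def fetch_cluster_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/_cluster/health and return the decoded JSON body."""
    url = f"{base_url.rstrip('/')}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health: dict) -> str:
    """Condense a cluster health response into a single line for a bug report."""
    status = health.get("status", "unknown")
    nodes = health.get("number_of_nodes", "?")
    unassigned = health.get("unassigned_shards", "?")
    return f"status={status} nodes={nodes} unassigned_shards={unassigned}"


# Usage (hypothetical address, adjust to your deployment):
#   print(summarize_health(fetch_cluster_health("http://localhost:9200")))
```

A `red` status or a growing number of unassigned shards would point at Elastic itself rather than at the training code, though this is only a first check and not a full diagnosis.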
The error persisted for a week. It disappeared today. |
It's using Elastic |
The situation repeated itself. This time, the API server rebooted quickly. apiserver:
|
Can you share your code? Something seems to be causing an illegal query, but I can't figure out what it is |
My code is here. The server is deployed by helm:
- name: elasticsearch
  repository: https://charts.bitnami.com/bitnami
  version: 7.17.3 |
Some more logs from apiserver:
|
@kzelias the latest server version has some fixes related to this issue. Can you try with v1.15.0? |
Hello! I have two identical experiments.
For the first one, the scalars are displayed correctly; for the second one, I get an error. The rest of the parameters are logged correctly, so the problem is only in the scalars.
What could be the reason?
Work:
Error: