
[BUG] ClipPPOLoss encounters a bug when calculating the loss in a composite action space scenario, failing to produce the correct result. #2487

Open
Sui-Xing opened this issue Oct 10, 2024 · 3 comments

@Sui-Xing

Describe the bug

Following up on issue #2402, I made the modifications suggested there, but my code still runs into problems when calculating the PPO loss. After debugging, I found that the line gain1 = log_weight.exp() * advantage always produces a tensor in which every value is zero. I also found that this is likely because the result of return self.log_prob_composite(sample, include_sum=True) is extremely small (e.g., -300, -200). I can't figure out why self.log_prob_composite computes such values, and I hope someone can help me with this issue.
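
For reference, here is a minimal standalone sketch (made-up numbers, not the reporter's code) of why log-probabilities in the -200/-300 range zero out the PPO gain: log_weight is the difference between the new and the stored log-probs, and exp() of such a large negative number underflows to zero in float32, so gain1 no longer depends on the advantage at all.

    import torch

    # Hypothetical values matching the reported symptom: the new composite
    # log-prob is around -300 while the log-prob stored at collection time
    # is in a normal range.
    new_log_prob = torch.tensor([-300.0, -250.0])
    old_log_prob = torch.tensor([-1.2, -0.8])
    advantage = torch.tensor([0.5, -0.3])

    log_weight = new_log_prob - old_log_prob  # roughly -299 and -249
    gain1 = log_weight.exp() * advantage      # exp(-299) underflows to 0.0 in float32
    print(gain1)                              # tensor([0., -0.]) -> the gradient signal is lost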

System info

win11 24h2
python 3.10.14

torch 2.4.0+cu118
torchaudio 2.4.0+cu118
torchrl 0.5.0+ca3a595
torchvision 0.19.0+cu118
tensordict 0.5.0+eba0769

@Sui-Xing added the bug label on Oct 10, 2024
@Sui-Xing
Author

It seems that I have identified the issue: when a ClipPPOLoss instance runs its forward pass over a batch, there appears to be a problem in the log_prob function of CompositeDistribution. The tensor passed to the dist.log_prob call seems to have the wrong dimensions, which produces abnormal values for lp. When I changed lp = dist.log_prob(sample.get(name)) to lp = dist.log_prob(sample.get(name).squeeze(-1)), the training loss curve began to respond to changes in the learning rate. Without this change, altering the learning rate has no effect on the loss curve during training, even with the random seed unchanged; in other words, ten training runs with different learning rates produce identical loss curves.
Here is the complete log_prob function code after my modifications.

    def log_prob(
        self, sample: TensorDictBase, *, aggregate_probabilities: bool | None = None
    ) -> torch.Tensor | TensorDictBase:  # noqa: D417
        if aggregate_probabilities is None:
            aggregate_probabilities = self.aggregate_probabilities
        if not aggregate_probabilities:
            return self.log_prob_composite(sample, include_sum=True)
        slp = 0.0
        for name, dist in self.dists.items():
            # Modification: drop the trailing singleton dimension before calling
            # dist.log_prob, so the sample matches the distribution's batch shape
            # and no unintended broadcasting takes place.
            if sample.get(name).ndim > 1:
                lp = dist.log_prob(sample.get(name).squeeze(-1))
            else:
                lp = dist.log_prob(sample.get(name))
            if lp.ndim > sample.ndim:
                lp = lp.flatten(sample.ndim, -1).sum(-1)
            slp = slp + lp
        return slp

@albertbou92
Contributor

Interesting, so with your change it learns fine?

I think what could be happening is that your sample tensors have more dimensions than the distribution parameters, and then some broadcasting is done by PyTorch.

Are you able to change the action_spec in your code so that your actions do not have the dimension you are removing with squeeze(-1)? It would be interesting to see if that solves the problem. I remember that in the code I provided in #2402, I had to modify the structure of the action space a bit.
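
To make the broadcasting point concrete, here is a small standalone sketch (shapes are illustrative, unrelated to the reporter's model): if the distribution's batch shape is [B] but the sampled action carries an extra trailing dimension [B, 1], log_prob broadcasts to [B, B], mixing log-probs of unrelated samples and producing much more negative totals.

    import torch
    from torch.distributions import Normal

    B = 4
    dist = Normal(loc=torch.zeros(B), scale=torch.ones(B))  # batch_shape = [4]

    action = torch.randn(B)              # shape [4], matches the batch shape
    action_extra = action.unsqueeze(-1)  # shape [4, 1], one dimension too many

    print(dist.log_prob(action).shape)        # torch.Size([4])    one log-prob per sample
    print(dist.log_prob(action_extra).shape)  # torch.Size([4, 4]) broadcast across the batch

    # Summing the broadcast result accumulates B times as many negative terms,
    # which is consistent with the unusually small values reported above.
    print(dist.log_prob(action).sum(), dist.log_prob(action_extra).sum())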

@Sui-Xing
Author

I am not sure whether the modifications will allow the model to train and converge properly, because I am a beginner, and also because of the warning mentioned in issue #2458. However, I am confident that it performs better, or at least more normally, than before I modified the log_prob function. Additionally, my code was indeed modified according to the suggestions you provided in issue #2402 to redefine self.action_spec in the _make_spec function.
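
For readers following along, a hedged sketch of what such a redefined composite action_spec might look like in torchrl 0.5.x (the key names and shapes are purely illustrative, not taken from the reporter's environment, and the spec class names differ in newer torchrl releases). The point is that each leaf spec's shape matches the action actually sampled, with no trailing singleton dimension that would later need squeeze(-1).

    import torch
    from torchrl.data import BoundedTensorSpec, CompositeSpec, DiscreteTensorSpec

    # Illustrative composite action space with one continuous and one discrete head.
    # The continuous leaf uses shape (2,) rather than (2, 1), so sampled actions
    # carry no extra trailing dimension.
    action_spec = CompositeSpec(
        move=BoundedTensorSpec(low=-1.0, high=1.0, shape=(2,), dtype=torch.float32),
        mode=DiscreteTensorSpec(n=3, dtype=torch.int64),
    )
    print(action_spec.rand())  # TensorDict with "move" of shape [2] and a scalar "mode"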
