
[BUG] ClipPPOLoss encounters a bug when calculating the loss in a composite action space scenario, failing to produce the correct result. #2487

Open
Sui-Xing opened this issue Oct 10, 2024 · 3 comments

@Sui-Xing

Describe the bug

Following up on issue #2402, I made the modifications suggested there, but my code still runs into problems when calculating the PPO loss. After debugging, I found that the line gain1 = log_weight.exp() * advantage always produces a tensor in which every value is zero. I also found that this is likely because the result of return self.log_prob_composite(sample, include_sum=True) is extremely small (e.g., -300, -200). I can't figure out why self.log_prob_composite computes such values, and I hope someone can help me with this issue.
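
For reference, here is a minimal standalone sketch (made-up numbers, not the reporter's code) of why log-probabilities in the -200/-300 range zero out the PPO gain: log_weight is the difference between the new and the stored log-probs, and exp() of such a large negative number underflows to zero in float32, so gain1 no longer depends on the advantage at all.

    import torch

    # Hypothetical values matching the reported symptom: the new composite
    # log-prob is around -300 while the log-prob stored at collection time
    # is in a normal range.
    new_log_prob = torch.tensor([-300.0, -250.0])
    old_log_prob = torch.tensor([-1.2, -0.8])
    advantage = torch.tensor([0.5, -0.3])

    log_weight = new_log_prob - old_log_prob  # roughly -299 and -249
    gain1 = log_weight.exp() * advantage      # exp(-299) underflows to 0.0 in float32
    print(gain1)                              # tensor([0., -0.]) -> the gradient signal is lost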

System info

win11 24h2
python 3.10.14

torch 2.4.0+cu118
torchaudio 2.4.0+cu118
torchrl 0.5.0+ca3a595
torchvision 0.19.0+cu118
tensordict 0.5.0+eba0769

@Sui-Xing added the bug label on Oct 10, 2024
@Sui-Xing
Author

It seems that I have identified the issue: when a ClipPPOLoss instance runs its forward pass over a batch, there appears to be a problem in the log_prob function of CompositeDistribution. The tensor passed to the dist.log_prob call seems to have the wrong dimensions, which produces abnormal values for lp. When I changed lp = dist.log_prob(sample.get(name)) to lp = dist.log_prob(sample.get(name).squeeze(-1)), the training loss curve began to respond to changes in the learning rate. Without this change, altering the learning rate has no effect on the loss curve during training, even with the random seed unchanged; in other words, ten training runs with different learning rates produce identical loss curves.
Here is the complete log_prob function code after my modifications.

    def log_prob(
        self, sample: TensorDictBase, *, aggregate_probabilities: bool | None = None
    ) -> torch.Tensor | TensorDictBase:  # noqa: D417
        if aggregate_probabilities is None:
            aggregate_probabilities = self.aggregate_probabilities
        if not aggregate_probabilities:
            return self.log_prob_composite(sample, include_sum=True)
        slp = 0.0
        for name, dist in self.dists.items():
            # Modification: drop the trailing singleton dimension before calling
            # dist.log_prob, so the sample matches the distribution's batch shape
            # and no unintended broadcasting takes place.
            if sample.get(name).ndim > 1:
                lp = dist.log_prob(sample.get(name).squeeze(-1))
            else:
                lp = dist.log_prob(sample.get(name))
            if lp.ndim > sample.ndim:
                lp = lp.flatten(sample.ndim, -1).sum(-1)
            slp = slp + lp
        return slp

@albertbou92
Contributor

Interesting, so with your change it learns fine?

I think what could be happening is that your sample tensors have more dimensions than the distribution parameters, and then some broadcasting is done by PyTorch.

Are you able to change the action_spec in your code so that your actions do not have the dimension you are removing with squeeze(-1)? It would be interesting to see if that solves the problem. I remember that in the code I provided in #2402, I had to modify the structure of the action space a bit.
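
To make the broadcasting point concrete, here is a small standalone sketch (shapes are illustrative, unrelated to the reporter's model): if the distribution's batch shape is [B] but the sampled action carries an extra trailing dimension [B, 1], log_prob broadcasts to [B, B], mixing log-probs of unrelated samples and producing much more negative totals.

    import torch
    from torch.distributions import Normal

    B = 4
    dist = Normal(loc=torch.zeros(B), scale=torch.ones(B))  # batch_shape = [4]

    action = torch.randn(B)              # shape [4], matches the batch shape
    action_extra = action.unsqueeze(-1)  # shape [4, 1], one dimension too many

    print(dist.log_prob(action).shape)        # torch.Size([4])    one log-prob per sample
    print(dist.log_prob(action_extra).shape)  # torch.Size([4, 4]) broadcast across the batch

    # Summing the broadcast result accumulates B times as many negative terms,
    # which is consistent with the unusually small values reported above.
    print(dist.log_prob(action).sum(), dist.log_prob(action_extra).sum())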

@Sui-Xing
Author

I am not sure whether the modifications will allow the model to train and converge properly, because I am a beginner, and also because of the warning mentioned in issue #2458. However, I am confident that it performs better, or at least more normally, than before I modified the log_prob function. Additionally, my code was indeed modified according to the suggestions you provided in issue #2402 to redefine self.action_spec in the _make_spec function.
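
For readers following along, a hedged sketch of what such a redefined composite action_spec might look like in torchrl 0.5.x (the key names and shapes are purely illustrative, not taken from the reporter's environment, and the spec class names differ in newer torchrl releases). The point is that each leaf spec's shape matches the action actually sampled, with no trailing singleton dimension that would later need squeeze(-1).

    import torch
    from torchrl.data import BoundedTensorSpec, CompositeSpec, DiscreteTensorSpec

    # Illustrative composite action space with one continuous and one discrete head.
    # The continuous leaf uses shape (2,) rather than (2, 1), so sampled actions
    # carry no extra trailing dimension.
    action_spec = CompositeSpec(
        move=BoundedTensorSpec(low=-1.0, high=1.0, shape=(2,), dtype=torch.float32),
        mode=DiscreteTensorSpec(n=3, dtype=torch.int64),
    )
    print(action_spec.rand())  # TensorDict with "move" of shape [2] and a scalar "mode"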
