Issues with Critic Network Loss in Policy Gradient Algorithms #1300

Open
Pagl1acci opened this issue Dec 16, 2024 · 4 comments

Comments

@Pagl1acci

I ran policy gradient with the 'qpg', 'rpg', and 'neurd' losses on the 'brps' game, but all of them failed and the policy converged to the corner of the simplex, [0, 1, 0].

It seems that the critic network cannot approximate the true values properly. I suspect this is because the training samples are not reweighted: they are drawn from the same distribution as the policy π, so when most samples come from the majority action, the loss on the rarely played actions effectively gets ignored by the network.
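For concreteness, a minimal sketch of the kind of reweighting I have in mind, assuming a simple MSE critic over per-action Q-values; the 1/π(a) importance weight and the names here are my own illustration, not what policy_gradient.py actually does:

import torch

def weighted_critic_loss(q_pred, q_target, actions, policy_probs, eps=1e-3):
    """MSE critic loss where each sample is reweighted by 1 / pi(a).

    q_pred:       (batch, num_actions) critic outputs
    q_target:     (batch,) sampled returns for the taken actions
    actions:      (batch,) indices of the taken actions
    policy_probs: (batch, num_actions) behaviour policy at sampling time
    """
    q_taken = q_pred.gather(1, actions.unsqueeze(1)).squeeze(1)
    pi_taken = policy_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    weights = 1.0 / pi_taken.clamp(min=eps)   # rarely played actions get larger weight
    weights = weights / weights.mean()        # keep the overall loss scale unchanged
    return (weights * (q_taken - q_target) ** 2).mean()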

I couldn’t find anyone reporting similar issues, and other implementations appear to work fine. Am I overlooking something? Any suggestions would be appreciated!

@lanctot
Collaborator

lanctot commented Dec 16, 2024

Hi @Pagl1acci,

What game?

My guess is that the learning rate(s) is/are too high.

@Pagl1acci
Author

Hi @lanctot.

It's 'brps', biased RPS. According to the paper, with the NeuRD loss the learning trajectories should resemble replicator dynamics, so the policy should cycle around the NE rather than converge to a corner.
[Plot of the learning trajectory with critic_lr = 0.001 and pi_lr = 0.001]

[Plot of the learning trajectory with critic_lr = 0.001 and pi_lr = 0.0001]

My original description of convergence to a single point is not entirely accurate. For the second run, the q_values are now [-20.2396, 0.1002, -7.2484] and π is [0.00443135, 0.98928986, 0.00627879]. Against this policy, action 2 should actually have the highest q_value, about 4.72, but the critic network seems unable to capture that even after many more steps.
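For reference, a quick sanity check of those numbers, treating the reported π as the opponent's policy as well (both agents collapse similarly) and assuming the biased RPS payoff matrix below; the matrix values are my reading of matrix_brps, so treat them as an assumption:

import numpy as np

# Assumed row-player payoff matrix for biased RPS (rock, paper, scissors).
R = np.array([[0, -25, 50],
              [25, 0, -5],
              [-50, 5, 0]], dtype=float)

opponent_pi = np.array([0.00443135, 0.98928986, 0.00627879])

# True expected value of each row action against the reported policy.
true_q = R @ opponent_pi
print(true_q)  # action 2 comes out around 4.7, far from the critic's -7.25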

Any thoughts on what might be going wrong? Thanks in advance for your insights!

@lanctot
Collaborator

lanctot commented Dec 16, 2024

We need more information to be able to reproduce.

Which policy gradient implementation? Did you base it on the kuhn_example.py?

Also:

According to the paper, with the NeuRD loss the learning trajectories should resemble replicator dynamics

Which paper? Are you referring to, e.g., the RPG paper, the NeuRD paper, or the R-NaD paper? It'd be helpful if you could point to the specific claim to give some context.

Once we have more details we'll be better suited to help.

@Pagl1acci
Author

Thank you for your patient guidance.

I tried to use NeuRD from https://arxiv.org/abs/1906.00190 on the biased RPS game. In that paper, NeuRD is introduced as a one-line fix to the policy gradient algorithm, which I believe is the one presented in https://arxiv.org/abs/1810.09026, and I think policy_gradient.py follows that framework.
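To make the "one-line fix" concrete, this is my understanding of the difference, written as a small sketch over policy logits and per-action advantages (my own notation, not the exact code in policy_gradient.py):

import torch

def pg_logit_update(logits, advantages, lr):
    # Softmax policy gradient (all-actions form): logit i moves by lr * pi_i * adv_i.
    pi = torch.softmax(logits, dim=-1)
    return logits + lr * pi * advantages

def neurd_logit_update(logits, advantages, lr):
    # NeuRD "one-line fix": drop the pi_i factor, so rarely played
    # actions still receive a full-size update from their advantage.
    return logits + lr * advantages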

I used the PyTorch version from open_spiel/python/pytorch/policy_gradient.py, as there is no NeuRD loss implemented in the TF version. What I expect to see is something similar to the plot below: the trajectories of replicator dynamics in the same game setting.
[Plot of replicator-dynamics trajectories on biased RPS]

Changing the loss type doesn't change much; I tried 'rpg', 'qpg', and 'neurd'. The same issue with the critic network shows up in all of them.

Here is the code I used:

from open_spiel.python import rl_environment
from open_spiel.python.pytorch import policy_gradient

game_name = 'matrix_brps'
env = rl_environment.Environment(game_name)
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]

agents = [
    policy_gradient.PolicyGradient(
        player_id=player_id,
        info_state_size=info_state_size,
        num_actions=num_actions,
        loss_str="neurd",
        hidden_layers_sizes=[32],
        batch_size=64,
        entropy_cost=None,
        critic_learning_rate=0.001,
        pi_learning_rate=0.0001,
        num_critic_before_pi=8,
        optimizer_str="sgd") for player_id in [0, 1]
]

policy_history_ = []
for episode_counter in range(1000000):
  time_step = env.reset()
  # Both players act on the same (simultaneous-move) time step.
  agent_output1 = agents[0].step(time_step)
  agent_output2 = agents[1].step(time_step)
  time_step = env.step([agent_output1.action, agent_output2.action])
  # Final step so the agents see the terminal rewards.
  for agent in agents:
    agent.step(time_step)
  if episode_counter % 512 == 0:
    policy_history_.append(agent_output1.probs)
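To produce plots comparable to the replicator-dynamics figure, I project policy_history_ onto the 2-simplex roughly like this (a matplotlib-based sketch, names are mine):

import numpy as np
import matplotlib.pyplot as plt

probs = np.array(policy_history_)              # shape (T, 3)
# Barycentric projection of (p_rock, p_paper, p_scissors) onto the 2D simplex.
x = probs[:, 1] + 0.5 * probs[:, 2]
y = (np.sqrt(3) / 2) * probs[:, 2]

plt.plot(x, y, lw=0.5)
plt.plot([0, 1, 0.5, 0], [0, 0, np.sqrt(3) / 2, 0], 'k-')  # simplex boundary
plt.axis('equal')
plt.show()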
